Non-convex nested Benders decomposition

We propose a new decomposition method to solve multistage non-convex mixed-integer (stochastic) nonlinear programming problems (MINLPs). We call this algorithm non-convex nested Benders decomposition (NC-NBD). NC-NBD is based on solving dynamically improved mixed-integer linear outer approximations of the MINLP, obtained by piecewise linear relaxations of nonlinear functions. Those MILPs are solved to global optimality using an enhancement of nested Benders decomposition, in which regularization, dynamically refined binary approximations of the state variables and Lagrangian cut techniques are combined to generate Lipschitz continuous non-convex approximations of the value functions. Those approximations are then used to decide whether the approximating MILP has to be dynamically refined and in order to compute feasible solutions for the original MINLP. We prove that NC-NBD converges to an ε\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varepsilon $$\end{document}-optimal solution in a finite number of steps. We provide promising computational results for some unit commitment problems of moderate size.


Introduction
We propose a new decomposition method to solve multistage non-convex mixedinteger (stochastic) nonlinear programming problems (MINLPs), i.e., optimization problems modeling a sequential decision making process. Continuous and integer If multistage (stochastic) problems are too large to be solved by off-the-shelf solvers, then tailored solution techniques are required. One example are decomposition algorithms making use of the specific sequential and block-diagonal structure of the constraints. The problems are decomposed into a large number of smaller but coupled subproblems which are solved iteratively. One of the most common decomposition methods is Benders decomposition, introduced by Benders [6] for linear programs. Since then, it has been enhanced to several more general cases, such as convex problems (generalized Benders decomposition (GBD) [19]), two-stage stochastic linear problems (L-shaped method [49]) and multistage (stochastic) linear problems (nested Benders decomposition (NBD) [8]). To mitigate the curse-of-dimensionality related to NBD in the stochastic case, Pereira and Pinto introduced its sampling-based variant stochastic dual dynamic programming (SDDP) [35], which was followed by various extensions [23,37].
The basic principle of NBD is to use the dynamic programming formulation of a given multistage problem. For each stage t ∈ {1, . . . , T }, a parametric subproblem is considered. This subproblem contains only those constraints, variables and parts of the objective function related to this specific stage, plus a value function determining the optimal value of all following stages for a given stage t solution. Since the value functions are not known in advance, they are iteratively approximated with linear cutting-planes. However, this approach requires the value functions to be convex. Therefore, most decomposition methods for multistage problems cover linear programs, as their value functions are guaranteed to be piecewise linear and convex.
However, in many applications, also integer variables or non-linearities occur naturally. In such case, the value functions are no longer convex and may also no longer be continuous. Therefore, the classical Benders approach fails, as it is impossible to construct a tight convex polyhedral approximation [47].
Thus, more sophisticated approaches have been developed to use Benders-type decomposition methods for non-convex MINLPs, mostly for the two-stage case. Li et al. propose an extension of GBD to the non-convex case for two-stage stochastic MINLPs with functions separable in integer and continuous variables [29,30]. In [28], a branch-and-cut framework is presented, where in each node Lagrangian and generalized Benders cuts are constructed. Related methods are proposed in [26,33]. All these methods have not been generalized to the multistage case yet.
To handle non-convexities in multistage problems, a common idea is to use convex relaxations of the value function, e.g., by relaxing the integrality constraints for MILPs or by convexifying nonlinear terms in a static manner. Dynamically convexifying the non-convex value functions using Lagrangian relaxation techniques allows for a polyhedral approximation by Lagrangian cuts [10,45,46]. None of these discussed approaches can guarantee to compute an optimal solution for non-convex multistage problems, though.
Only recently, some substantial progress has been made in generalizing the Benders decomposition idea to multistage problems with non-convex value functions directly. In [36], step functions are used, instead of cutting-planes, to approximate the value functions, presuming their monotonicity.
For the mixed-integer linear case, the stochastic dual dynamic integer programming (SDDiP) approach is proposed [55]. SDDiP is an enhancement of NBD and SDDP which allows the solution of multistage (stochastic) MILPs in case of binary state variables. The method is based on generating special Lagrangian cuts, which reproduce the lower convex envelope of the value function. As the latter is piecewise linear and exact at binary state variables, strong duality is ensured and the problem is solved to global optimality in a finite number of iterations. SDDiP is applied to multistage unit commitment in [54]. It is also applied to a problem containing non-convex functions in context of hydro power scheduling by using a static binary expansion of the state variables and a Big-M reformulation [22].
As long as the value functions are assured to be Lipschitz continuous and some recourse property is satisfied, the requirement of binary state variables can be dropped, as is shown by the Stochastic Lipschitz Dynamic programming (SLDP) method in [1]. Here, two types of non-convex Lipschitz continuous cuts are introduced: reverse-norm cuts and augmented Lagrangian cuts.
In [52], Zhang and Sun present a new framework to solve multistage non-convex stochastic MINLPs, generalizing both SDDiP and SLDP. Similarly to [1], nonlinear generalized conjugacy cuts are constructed by solving augmented dual problems. Moreover, as Lipschitz continuity is not assured for the value functions, a Lipschitz continuous regularized value function is considered within the decomposition method.
In this article, we propose a new method to solve multistage non-convex MINLPs to proven global optimality, which we refer to as non-convex nested Benders decomposition (NC-NBD). The method combines piecewise linear relaxations, regularization, binary approximation and the SDDiP Lagrangian cuts in a unique and dynamic fashion. Its basic idea is to solve a MINLP by iteratively improved MILP outer approximations, which in turn are solved using a NBD-based decomposition scheme similar to that in [52]. The binary and piecewise linear approximations are dynamically refined.
In particular, the original MINLP is outer approximated by MILPs, which are iteratively improved in an outer loop. Those MILPs are obtained by piecewise linear approximations of all occuring nonlinear functions, which is an established method in global optimization [50]. In general, using MILP relaxations is a common approach to global optimization solvers [27,32,53].
In an inner loop, the multistage MILPs are solved to approximate optimality in finitely many steps. This is achieved using a NBD-based decomposition method. In a forward pass through the stages, trial solutions for the dynamic programming equations are determined. As Lipschitz continuity of the value functions is not guaranteed, this is done solving a regularized forward pass problem, as proposed in [52]. For a sufficiently large, but finite parameter, the regularization is exact [14,52], so that still the desired MILP is solved.
In a backward pass through the stages, nonlinear non-convex cuts are constructed to approximate the non-convex value functions of the MILP. To this end, we make use of a binary approximation of the state variables in the subproblems. As proven in [55], for MILPs with binary state variables we obtain (sufficiently) tight cuts by solving Lagrangian dual problems. The constructed linear cuts are then projected back to the original state space, yielding a nonlinear, non-convex, but Lipschitz continuous approximation of the value functions. The binary approximation is refined dynamically within the inner loop if required. By careful construction, all existing cuts remain valid even with such refinements.
Once the MILP approximation is solved to approximate optimality, the cut approximation of the value functions is used in the outer loop to determine bounds for the optimal value of the original MINLP. If the bounds are sufficiently close, the algorithm terminates with an ε-optimal solution. Otherwise, the piecewise linear approximations are refined, and thus the approximating MILP is tightened. Again, by careful construction it is ensured that all previously generated cuts remain valid.
To our best knowledge, the above concepts have not been combined in this dynamic way to solve multistage non-convex MINLPs yet. In that regard, our work also differs significantly from the aforementioned solution techniques.
Our proposed decomposition scheme uses the same regularization technique and similar convergence ideas as in [52]. However, a fundamental difference is that we only apply this technique to solve MILP outer approximations of the original MINLP. This has the advantage that in our framework MINLPs have to be solved only occasionally. In contrast, in [52], MINLPs are assumed to be solved by some oracle in each iteration and cuts are generated directly for the MINLP, which is computationally challenging. Moreover, contrary to our approach, the method in [52] does not require recourse assumptions, but in return it only allows for state variables in the objective function.
In contrast to SDDiP [55] and SLDP [1], we solve MINLPs, and thus consider a larger solution framework with an inner and an outer loop. However, even the inner loop, in which MILPs are solved, differs from both approaches.
To solve MILPs with non-binary state variables using SDDiP, it is proposed to apply a static binary approximation [22,55]. This way, the original MILP is replaced by an approximating problem with only binary state variables. It can be shown that for a sufficiently small approximation precision, i.e., an sufficiently large number of binary variables, an ε-optimal solution of an MILP can be determined with this approach under some recourse assumption [55]. However, for a given problem at hand, it is not necessarily clear in advance how this precision has to be chosen, as knowledge on a problem-specific Lipschitz constant is required. This becomes even more challenging in our framework, where an MINLP is iteratively approximated by MILPs, for which the required precision may change. On the contrary, within NC-NBD the binary approximation is refined dynamically if required.
More crucially, in NC-NBD the binary approximation is applied temporarily only to derive cuts in the backward pass. These cuts are then projected back to the original state space. This construction has a few key advantages: Firstly, it is ensured that cuts remain valid even if the binary precision is refined later on. Secondly, the original state variables remain continuous and are not limited to values which can be exactly represented by the binary approximation. This, in turn, ensures that the true MILPs are solved in the inner loop. Consequently, the generated cuts are valid for the value functions of these MILPs and, due to their relaxation property, also the original MINLP. Analogously, the obtained lower bounds are valid for the corresponding optimal values. Importantly, this is not true for SDDiP with static binary approximation, where the state space is permanently modified and only approximations of the true MILPs are solved in the inner loop. In our approach to solve MINLPs, it is crucial to determine guaranteed valid cuts for the value functions in both loops. Therefore, SDDiP cannot be used effectively in this setting.
Our cut generation approach also differs from that in SLDP [1] (and also [52]), where augmented Lagrangian problems are solved to determine nonlinear cuts. While our method comes at the cost of introducing additional (binary) variables and constraints compared to those approaches, e.g., for the cut projection, we avoid solving dual problems containing nonlinear penalization in the objective. Such penalization may be disadvantageous as it prevents decomposition of the primal problems which are solved in the solution process of the dual problem. Additionally, in contrast to SLDP [1], we do not assume continuously complete recourse, but only the weaker complete recourse, as we circumvent the requirement of Lipschitz continuity of the true value functions by regularization.
The main contributions of this paper are as follows: (1) We present the non-convex nested Benders decomposition (NC-NBD) method to globally solve general multistage non-convex MINLPs. The method combines piecewise linear relaxations, regularization, binary approximation and cutting-planes techniques in a unique way. In contrast to existing approaches, all approximations are improved dynamically where and when it is reasonable. To our knowledge, this is the first decomposition method for general multistage non-convex MINLPs. (2) A crucial requirement using dynamic refinements is to ensure that all previously determined cuts remain valid within the refinement process and have not to be generated from scratch. We ensure this by a special cut projection and careful choice of the MILP relaxations. (3) We prove that the proposed NC-NBD method converges to an ε-optimal solution of P in a finite number of steps under some mild assumptions. (4) We provide first computational results of applying NC-NBD to moderate-sized instances of a unit commitment problem to illustrate its efficacy.
To enhance readability, we focus our discussions solely on deterministic MINLPs. However, the presented NC-NBD idea can also be applied to stochastic programs with stagewise independent and finite random variables.
The remainder of the paper is organized as follows. We present the considered problem formulation and assumptions in Sect. 2. Then, we introduce the NC-NBD with its different steps in Sect. 3, before presenting convergence results in Sect. 4. Afterwards, we provide computational results for instances of a simple unit commitment problem in Sect. 5. We conclude with Sect. 6.

Problem formulation
We consider the following multistage non-convex MINLP problems Here t = 1, . . . , T denotes the different stages with the final stage T ∈ N. For each stage t, the decision variables can be separated into mixed-integer state variables x t ∈ R n 1 t + × Z n 2 t + and local variables y t ∈ R n 3 t × Z n 4 t , with x 0 = 0. We define n t := n 1 t + n 2 t as the number of state variables. The sets M t (x t−1 ) appearing in the constraints for each stage t are defined by X t and Y t denote box constraints; X 0 := {0}. As such, X t and Y t are compact sets for all stage-t variables. All functions f t : t are well-defined on their domains. To exploit its multistage structure, we solve ( P) by some extension of NBD. NBD makes use of the dynamic programming formulation of ( P), where each stage-t subproblem, t = 1, . . . , T , can be denoted by with the value function Q t (·) of stage t and Q T +1 (·) ≡ 0. Note that x t links different stages, i.e., x t is a decision variable for ( P t (x t−1 )) and a parameter for ( P t+1 (x t )). For the first stage, we obtain that Q 1 (x 0 ) = v with x 0 ≡ 0. Importantly, subproblem ( P t (x t−1 )) is enhanced by introducing local copies z t of the state variables x t−1 and the copy constraints z t = x t−1 . Those copy constraints will prove crucial for the cut generation later on. Taking into account the local copies, we define As the subproblems ( P t (x t−1 )) are non-convex MINLPs, the value functions Q t (·) may be non-continuous and non-convex, two detrimental properties for Benders decomposition approaches. To ensure that the value functions Q t (·) are at least lower semicontinuous (l.sc.), we make the following technical assumptions: (A2) (Complete recourse). For any stage t and anyx t−1 ∈ X t−1 , there exists some As all variables are box-constrained, the feasible set M t (x t−1 ) of ( P t (x t−1 )) is bounded. With assumption (A1) and the recourse assumption (A2), all subproblems ( P t (x t−1 )) are feasible and bounded. Analogously, ( P) is feasible with finite optimal value v. Note that under assumption (A2) we can restrict to generating optimality cuts in NC-NBD without the need to introduce Benders feasibility cuts.
We obtain our required l.sc. property of the value functions Q t (·). Proof Fixing all integer variables, the l.sc. follows from Exercise 1.19 in [41]. As X t and Y t are bounded, only finitely many different values can be attained by the integer variables. The minimum of finitely many l.sc. functions is l.sc.
In the next section, we introduce the NC-NBD method, which combines regularization, piecewise linear approximations, binary expansion and special cutting-plane techniques in a unique way to solve ( P).

The NC-NBD principle
The basic idea of the NC-NBD algorithm is to employ that MILP problems can be solved exactly by enhancements of NBD under certain assumptions and that MINLPs can be outer approximated by MILPs iteratively. Thus, the method consists of two main components. The first component is an inner loop which is used to determine an approximately optimal solution of some MILP outer approximation ( P ) of problem ( P). This approximation is determined by piecewise linear relaxations of nonlinear functions in ( P). The second component is an outer loop which refines this outer approximation iteratively (indexed by ) to improve the approximation of the optimal value v of ( P). The NC-NBD is summarized in Algorithm 1 and illustrated in Fig. 1.
The inner loop follows the general principle of NBD to solve ( P ). It consists of a forward and a backward pass through the stages t = 1, . . . , T in each iteration i. In the forward pass, the stage-t subproblem ( P t (x t−1 )) is approximated in two different ways: The value function Q t+1 (·) of the following stage is replaced by some outer approximation Q i t+1 (·). Moreover, a regularization is added to ensure Lipschitz continuity of the corresponding value functions. Thus, regularized subproblems ( P R, i t (x t−1 )) are solved, as proposed in [52], yielding trial solutions x i t−1 and an upper bound v i for ( P ).
In the backward pass, the approximations Q i t+1 (·) of Q t+1 (·) are improved iteratively by constructing additional cuts. As the value functions are possibly non-convex, those cuts are nonlinear. Importantly, cuts for Q t+1 (·) are also valid for Q t+1 (·), as the first is an outer approximation of the latter.
In the literature, different ways are proposed to obtain nonlinear optimality cuts and to ensure that the inner loop converges to the optimal value v of ( P ). One method is to generate reverse-norm cuts [1]. However, this only works if the value functions themselves are Lipschitz continuous which is not guaranteed in our setting. Another, more general method is to solve some augmented Lagrangian dual problem, as proposed in [1,52].
We propose a third and new method, based on the SDDiP technique [55]. We utilize that we can generate sufficiently tight cuts by solving a Lagrangian dual in a lifted space, where all state variables are binary. Thus, we (temporarily) approximate the Algorithm 1 NC-NBD Input: Problem ( P) satisfying (A1), (A2), tolerances ε > ε > 0, scalar K t for all t, initial bin. approx.
Refine the piecewise linear approx. of all γ ∈ Γ to obtain T γ by longest-edge bisection of the simplex corresponding to ( z t , x t , y t ) t=1,...,T .

4:
Determine an outer approximation ( P ) of ( P) by using a MILP model and appropriate shifts for the piecewise linear approximations.

10: end for
BINARY APPROXIMATION REFINEMENT 11: if Forward Pass solution in i equals that in i − 1 then 12:
t+1 )). Store the optimal multiplier π i t and the corresponding optimal value c i t of the Lagrangian dual function. 17: Model the optimal value function φ i t of projecting φ i Bt to the original space by MILP constraints using the KKT conditions. 19:

Inner Loop (solving ( P ) iteratively, generating cuts for Q t )
Outer Loop (solving (P ) iteratively using cuts from inner loop, refining MILP ( P )) ε-optimal solution of (P ) state variables with binary ones, construct cuts in the binary space and then project those cuts back to the original space. As we show, these projections can be modeled by mixed-integer linear constraints in the original space. By careful construction, these cuts remain valid even if the binary approximation is refined in later iterations.
In this way, we circumvent solving an augmented Lagrangian dual, which may be even more expensive than solving the classical Lagrangian dual, as with the additional nonlinear term in the objective, the primal problems lose their decomposability. In return, we require more (binary) variables and constraints in the Lagrangian duals and for an MILP representation of our cuts than the approach in [1].
In principle, the MILPs as they occur in the inner loop could also be solved by using SDDiP with a static binary approximation of the state variables [55]. As discussed in Sect. 1, this approach has some properties which prevent an efficient integration into our algorithmic framework, though.
As we show in the next section, for a sufficiently fine binary approximation, the obtained cuts in the NC-NBD provide a sufficiently good approximation at the trial solutions x i t−1 . Additionally, the cut approximations Q t (·) are generated in such a way that they are Lipschitz continuous. This is sufficient to ensure convergence to a globally optimal solution of ( P ).
At the end of the backward pass, a lower bound v i is determined. If v i and v i are sufficiently close to each other, an approximate globally minimal point ( z t , x t , y t ) t=1,...,T of ( P ) has been identified and the inner loop is left. Otherwise, further cuts have to be constructed or the binary approximation has to be refined. We discuss this decision in more detail in Sect. 3.3.6.
Once the inner loop is left, subproblems ( P t (x t−1 , Q t+1 )) are solved to determine trial points x t−1 and an upper bound v to v for the original problem ( P). If this upper bound is sufficiently close to v , the solution (z t , x t , y t ) t=1,...,T is approximately optimal for problem ( P). If not, the MILP relaxation ( P +1 ) is created by refining ( P ) in the neighborhood of ( z t , x t , y t ) t=1,...,T and a new inner loop is started.
As for the inner loop, it is crucial that with these refinements in the outer loop all previously generated cuts remain valid. Otherwise, the cut approximation Q t+1 (·) would have to be built from scratch, counteracting the idea of a dynamic solution framework. In the following subsections, we show how such persistent validity can be achieved by careful design. Note that, even though we make use of the same regularization idea, our framework with nested loops and dynamic refinements also differs from the method presented in [52].
We explain the different steps of NC-NBD in more detail in the following subsections, before we discuss convergence results in Sect. 4. As long as the index is not needed for the discussions of the inner loop, we omit it for notational convenience. Moreover, we note that several of the considered subproblems require the introduction of additional decision variables, e.g., for piecewise linear approximation or cut projection. For reasons of clarity and comprehensibility, by the terms optimal point or optimal solution we refer to the projection of their actual optimal points to the space X t−1 × X t × Y t , which we are interested in.

Piecewise linear relaxations
In the outer loop of NC-NBD, all nonlinear functions γ ∈ Γ in problem ( P) are approximated by some piecewise linear functions. This is achieved by determining a triangulation of their domain, which in our box-constrained setting is always possible. Then, the piecewise linear functions can be defined on the simplices of this triangulation using the function values of γ at their vertices. For a thorough discussion and state-of-the-art approaches to construct piecewise linear approximations and triangulations, see [18,39,40].
The piecewise linear approximations can then be reformulated as mixed-integer linear constraints using auxiliary continuous and binary variables. In the literature, several modeling techniques have been proposed, such as the convex combination model, the incremental model and some logarithmic variants [4,18,38,51]. Later on, we draw on refinement and convergence ideas from [9], which work for several of these models, such as the generalized incremental model [9] or the disaggregated logarithmic convex combination model [51].
By shifting the approximations appropriately, it can be ensured that the obtained MILP ( P j ) is indeed a relaxation of the original problem ( P) [18]. Alternatively, one can construct piecewise linear underestimators and overestimators, yielding tubes for nonlinear equations [25].
Applying the piecewise linear approximations to problem ( P), we obtain the MILP outer approximation with copy constraints For reasons of clarity, we denote the piecewise linear relaxations of f t (·), g t (·) and h t (·) by f t (·), g t (·) and h t (·), although they are modeled using auxiliary constraints and variables. The set M t is defined by replacing the functions g t (·) and The dynamic programming equations for t = 1, . . . , T are given by For the MILP subproblems ( P t (·)), we obtain the following properties.
The complete recourse follows from the complete recourse of ( P t (·)) by construction. The l.sc. then follows from Theorem 3.1 in [31].

The inner loop
In the inner loop of NC-NBD, the MILP subproblems ( P t (x t−1 )) are considered. As stated before, we omit the index for its discussion.
The copy constraints are crucial for all problems solved in the inner loop. In the forward pass, to ensure Lipschitz continuity, we consider regularized subproblems. The regularization is based on relaxing and penalizing the copy constraints. In the backward pass, to generate cuts, a special Lagrangian dual subproblem is solved based on dualizing the copy constraints. This is effective, since combined with a binary expansion of the state variables, the copy constraints yield a local convexification [55].

Regularization
Lipschitz continuity of the value functions is difficult to ensure in the general nonconvex case. However, as shown recently in [52], for l.sc. value functions, it is possible to determine some underestimating Lipschitz continuous function by enhancing the original subproblem with an appropriate penalty function ψ t . In contrast to the more general regularization approach in [52], we require only so-called sharp penalty func- Definition 3.2 (Regularized subproblem and value function) Let σ t > 0 for t = 2, . . . T , σ 1 = 0 and define ( P R t ) is called regularized subproblem and Q R t (·) regularized value function. By recursion, this approach yields the regularized optimal value v R := Q R 1 (x 0 ) for the first stage. Lemma 3.1 implies that under assumption (A2), the function Q t (·) is l.sc. Then, the regularized value function Q R t (·) has the following properties.
As also stated in [52], using sharp penalty functions as in Definition 3.2, the penalization is exact for sufficiently large (but finite) σ t > 0. For such σ t , the problems ( P) and ( P R ) have the same optimal points and v R = v. This result goes back to [14], in which augmented Lagrangian problems are analyzed for MILPs. It is shown that using sharp penalty functions and a sufficiently large augmenting parameter, strong duality holds. As this result holds for any value of the dual multipliers, it is also valid for the regularized subproblems. [14]) Using sharp penalty functions ψ t , there exist somē σ t > 0 such that the penalty reformulation in ( P R t (x t−1 )) is exact for all σ t >σ t . Lemma 3.4 indicates that using the regularized subproblems within our decomposition method NC-NBD, we obtain convergence to v in the inner loop. To exploit this, we take the following assumption:
If (A3) is not satisfied, σ t has to be increased gradually in the course of the NC-NBD method to ensure convergence.

Forward pass
In the forward pass of the inner loop we solve approximations of the regularized subproblems ( P R t (x t−1 )). For iteration i, the stage-t forward pass problem is defined as follows for the trial state variable . This approximation is constructed in the backward pass, see Sect. 3.3.4. As those value functions are nonconvex, the cut approximation Q i t+1 (·) is required to be nonlinear and non-convex. However, as we show later, it can be expressed with mixed-integer linear constraints by lifting the problems to a higher dimension. Therefore, in addition to x t , y t and z t , the forward pass problem contains further decision variables, which are hidden in Q i t+1 (·) and the piecewise linear relaxations f t , g t and h t . Note that expressing Q i t+1 (·) by mixed-integer linear constraints with bounded integer variables, the same reasoning as in Lemma 3.1 can be applied to Even with a mixed-integer linear representation of Q i t+1 (·), subproblem ( P R,i t ( x i t−1 , Q i t+1 )) is a MINLP due to the regularization. For · 1 or · ∞ , it can be modeled by MILP constraints using standard reformulation techniques for absolute values, though.
The optimal point ( z i t , x i t , y i t ) of each subproblem ( P R,i t ( x i t−1 )) is stored and x i t is passed to the following stage. Since ( z i t , x i t , y i t ) t=1,...,T satisfies all constraints of ( P R ), after all stages have been considered, an upper bound v on the optimal value v R of the regularized problem can be determined by With assumption (A3) and Lemma 3.4, this is also an upper bound to v.

Backward pass-Part 1: binary approximation
The aim of the backward pass of an inner loop iteration i is twofold: Firstly, a lower bound v i on v is determined. Secondly, cuts for Q t (·) are derived to improve and update the current approximation Q i t (·).
As mentioned before, we use a dynamically refined binary approximation of the state variables and then apply cutting-plane techniques from the SDDiP algorithm [55]. This approximation is based on static binary expansion [21].
Binary expansion can be applied component-wise to some vector x t . Some integer component x t j ∈ 0, ..., U j can be exactly and uniquely expressed as with variables λ tk j ∈ {0, 1} and K t j = log 2 U j + 1. Some continuous component x t j ∈ [0, U j ] can be expressed by discretizing the interval with precision β t j ∈ (0, 1). We then have For vector x t , this yields K t = n t j=1 K t j number of binary variables. Defining an (n t × K t )-matrix B t containing all the coefficients of the binary expansion and collecting all binary variables in one large vector λ t ∈ B K t , the binary expansion then can be written compactly as x t = B t λ t + r t .
Based on this definition, to generate cuts, for each stage t and iteration i, a binary approximation of x i t−1 is used, i.e., it is replaced by B t−1 λ i t−1 . Note that the approximation is not necessarily exact for continuous components of x i t−1 . Therefore, the cuts are not necessarily constructed at the trial point x i t−1 but at the deviating anchor point x i B,t−1 := B t−1 λ i t−1 . In the backward pass, we start from the following subproblem, where due to the binary approximation of the state variables, we also adapt the copy constraint to λ i t−1 = z t with variables z t ∈ [0, 1] K t−1 .
Asymptotically, i.e., for an infinitely fine binary approximation, the anchor point converges to the actual trial point.
With Lemma 3.6, asymptotically, the cuts are constructed at x i t−1 . While this is not directly useful in practice, since it requires an infinite number of binary variables, it also implies that for componentwise sufficiently small β t−1 ∈ (0, 1), the cuts are constructed very close to x i t−1 . As NC-NBD constructs Lipschitz continuous cuts, this guarantees a sufficiently good approximation of the value function at x i t−1 , as we show in Sect. 4.
Importantly, in our framework the binary approximation is only applied temporarily to derive cuts, while the state variables x t−1 in the forward pass remain continuous. In other words, the anchor points determine where cuts can be constructed, but do not limit where they can be evaluated. This is a crucial difference to applying a static binary expansion, as suggested in the original SDDiP work to solve MILPs with continuous state variables [55].
Moreover, let us emphasize again that applying such static approximation is not appropriate in our inner loop, as the obtained lower bounds are not guaranteed to be valid for v or v. Similarly, the obtained cuts are not guaranteed to be valid for Q t (·) or Q t (·), and therefore cannot be re-used within the outer loop. Our proposed inner loop method does not share these issues. We follow a dynamic approach where the binary precision is dynamically refined if required and, as we show later, all cuts remain valid with later refinements.

Backward pass-Part 2: cut generation
As proposed in [55], the copy constraint is dualized to generate cuts. Applied to our context, the following Lagrangian dual subproblem has to be solved where L i Bt (·) denotes the Lagrangian function for π t defined by and · * denotes the dual norm to the norm used in the regularized forward pass problems ( P R,i t ( x i t−1 , Q i t+1 )). A linear (optimality) cut in binary space {0, 1} K t−1 is then given by where π i t is an optimal solution of the Lagrangain dual subproblem ( D i Bt (λ i t−1 , Q i+1 t+1 )). Those Lagrangian cuts are introduced in [55] and identified to be finite, valid and tight in the SDDiP setting. In our setting, we obtain the following validity result. Lemma 3.7 Let Q Bt (·) denote the MILP value function of stage t with additional binary approximations. Then, Lemma 3.7 a) follows directly from the validity proof for the SDDiP cuts, which does also hold for Theorem 3 in [55]). Part b) then follows using similar arguments as in Remark 3.5. Hence, φ Bt is, in fact, a valid cut in [0, 1] K t−1 . This enables us to obtain valid under-approximations also for those points, which are not exactly approximated by the current binary expansion. As it refers to an outer approximation, Q t (·) underestimates the original MINLP value function Q t (·). Thus, the obtained cuts are valid for Q t (·) as well.
Contrary to [55], but following [52], we bound the dual variable π t in the Lagrangian dual subproblem. Therefore, tightness for Q i Bt (·, Q i+1 t+1 ) is not guaranteed. However, the cuts are at least guaranteed to overestimate the value function Q R,i Bt (·, Q i+1 t+1 ) at λ i t−1 . This value function is obtained by regularizing Q i Bt (·, Q i+1 t+1 ) in the binary space using the same norm as in the forward pass problem. By careful choice of the regularization factor, then, also the regularized value function Q R,i t (·, Q i+1 t+1 ) in the original space is overestimated at x i B,t−1 . This result is formalized in the following lemma.

Lemma 3.8 Assume that we use · 1 for regularization and its dual norm · ∞ for bounding the dual multipliers. Then, as long as l t ≥ σ t B t−1 , where the latter denotes the induced matrix norm of B t−1 , we have
Proof See Appendix A.

Remark 3.9
The induced matrix norm B t−1 depends on the chosen precision of the binary approximation. It can be bounded from above independent of the precision, e.g., B t−1 1 ≤ U t−1,max with U t−1,max the largest component of the upper bounds in X t−1 .

Backward pass-Part 3: cut projection
Solving the forward pass problems ( P

R,i t ( x i t−1 , Q i t+1 )) and the backward pass dual problems ( D i
Bt (λ i t−1 , Q i+1 t+1 )) requires expressing the cut approximation Q i t+1 (·) in the original state variables x t . Recall that the computed cut φ B,t+1 (·) is a function of According to Lemma 3.7 a), the obtained cuts φ B,t+1 (·) are not only valid for all binary points, but for all values in [0, 1] K t . Allowing for λ t ∈ [0, 1] K t in the binary approximation, there exist infinitely many combinations of λ t to exactly describe some point x t ∈ X t , though. Therefore, following from Lemma 3.7 b), one cut in binary space entails infinitely many underestimators of Q t+1 (·) at x t in the original space X t . Including infinitely many inequalities in Q t+1 (·) is computationally infeasible. Instead, we consider the pointwise maximum of the projection of the cuts to X t . That way, only the best underestimation for each point x t is taken into account. In doing so, we obtain a nonlinear, i.e., piecewise linear, cut in the original state space. For simplicity, in the following, by cut projection we always mean the pointwise maximum of the actual projection.
The projection of some cut φ B,t+1 (·) to X t can be described as the value function of a linear program where e denotes a vector of ones of dimension K t . The dual problem to (2) yields Note that the dual feasible region does not depend on x t and has a finite number of extreme points. Therefore, the cut projection is piecewise linear and concave. As problem (2) is feasible and bounded for any x t ∈ X t , this also holds for the dual problem (3). Therefore, in a dual optimal solution, η t and μ t are bounded. Note that this bound may change with the binary approximation precision β t , though, and that, if we would generate tight cuts for Q i t+1 (·, Q i+1 t+2 ), those cuts may become infinitely steep close to discontinuities. However, as we can bound π t in the Lagrangian dual subproblem independent of β t , see Remark 3.9, and thus construct cuts which at least overestimate the regularized value function Q R,i t+1 (·, Q i+1 t+2 ) at the anchor point x i B,t , such cases should be ruled out.
We formalize this by assuming the existence of a global bound for η t .
For example, if we obtain cuts which are, in fact, tight for Q and consider only basic solutions in the Lagrangian dual, the gradient of the cuts is bounded by σ t+1 . With Assumption (A4) it follows that the linear cuts φ B,t+1 (·) derived in the binary space yield a nonlinear, but Lipschitz continuous projection φ t+1 (·) in the original space. To express this projection by mixed-integer linear constraints, we use the KKT conditions to problems (2) and (3). To emphasize that these conditions are considered for the projection of one specific cut r (the index denoting the r -th cut constructed), we index all occurring variables and coefficients by r .
The complementary slackness constraints (9) and (10) are nonlinear, but componentwise can be expressed linearly using a Big-M formulation (alternatively, SOS-1 constraints may be used): For all components k, we can choose M 1k = 1 and M 3k = −1 due to λ tk ∈ [0, 1]. Moreover, using (A4), we are able to obtain explicit choices for M 2k and M 4k as well. Proof See Appendix B.

Lemma 3.11 The cut approximation Q t+1 (·) is Lipschitz continuous in X t with Lipschitz constant ρ t .
The cut projection requires to introduce the variables λ r

Stopping and refining
At the end of the backward pass, a lower bound v i is determined by solving the firststage subproblem ( P i 1 (0, Q i+1 2 )). Here, no Lagrangian dual is solved, since no cuts have to be derived. The lower bound is non-decreasing because the cut approximation is only improved.
If the updated bounds are sufficiently close to each other, i.e., if for some predefined tolerance ε > 0, an approximately optimal point of problem ( P) has been determined. We show in the following section that this is the case after finitely many iterations i. If the gap between the bounds does not meet the stopping criteria yet, two cases are possible: In the first case, the algorithm has not determined the best possible approximation for the given binary approximation precision, yet. New cuts have been determined in iteration i such that the lower bound v i has been updated, and the forward solution will change in iteration i + 1 as the previous one is cut away.
In the second case, despite not meeting the stopping criterion, the forward solution does not change at the beginning of iteration i + 1. This case is related to the binary approximation. It can occur if the binary approximation is too coarse and therefore, for all t, the determined cuts at x i Bt do not improve the approximation at x i t . Moreover, it can occur if in subsequent iterations the same cuts are constructed, since x i B,t−1 = x i+1 B,t−1 . Finally, it can also occur if all possible cuts have been generated: For a fixed binary approximation, there exist only finitely many points x Bt . If we restrict the Lagrangian dual subproblem to basic solutions, then only finitely many different cuts can be determined [55].
In the second case, at the beginning of the backward pass of iteration i, the binary approximation is refined. The refinement is computed by increasing K t j by +1 for all components j and all stages t with For simplicity, we refine in Algorithm 1 all stages and components equally by +1. Note that each refinement requires the introduction of an additional vector λ t , as described in the previous subsection. As all previously generated cuts have been projected to the original space X t , they remain valid and have not to be recomputed when refining the binary approximation. This is computationally important.

The outer loop problem
Once the inner loop is left, we set v : ..,T . To approximate the optimal value v of ( P), we solve subproblems in a forward manner for t = 1, . . . , T with x 0 ≡ 0 and x t := x t , where x t is an optimal solution of ( P t (x t−1 , Q t+1 )) for t. Here, we exploit that the cut approximation Q t (·), constructed in the inner loop, is valid for Q t (·) by design as well. By solving these subproblems, we obtain a feasible solution (z t , x t , y t ) t=1,...,T for ( P) and we can determine a valid upper bound for v as subproblems ( P t (x t−1 , Q t+1 )) are non-convex MINLP problems. This means that in order to solve the original non-convex problem ( P), easier, but still non-convex subproblems have to be solved to optimality for each stage t in each outer loop iteration . This might be a hard challenge by itself. We make the following assumption for the remainder of this article: (A5). An oracle exists that is able to solve subproblems ( P t (x t−1 , Q t+1 )) to global optimality.
In case that no such global optimization algorithm is available, one can solve appropriate inner approximations of ( P t (x t−1 , Q t+1 )), which are improved in the course of the algorithm.
If v − v ≤ ε, then NC-NBD terminates and (z t , x t , y t ) t=1,...,T is an ε-optimal solution for ( P). Otherwise, the cut approximations Q t+1 (·) are not sufficiently good underestimators for the true value functions, even though they give a good approximation of Q t (·). This implies that the piecewise linear relaxations have to be improved. Instead of refining them everywhere, they are refined dynamically where it is promising, i.e., in a neighborhood of the approximate optimal solution ( z t , x t , y t ) t=1,...,T of ( P ). In refining the piecewise linear relaxations in its neighborhood, the current solution can be excluded in the next inner loop and the lower bound v improves.

Remark 3.12
Instead of v , an even better lower bound for v is given by the optimal value of the first stage subproblem ( P 1 (x 0 , Q 2 )).

Refining the piecewise linear relaxations
The refinement consists of two steps: (1) the piecewise linear approximations are refined and (2) the corresponding MILP ( P ) is updated -in such a way that the new MILP ( P +1 ) again yields a relaxation of ( P).
Different strategies are possible to achieve this. For a thorough overview, we refer to [18]. In the following, we make use of a specific adaptive refinement scheme for triangulations from [9] for any nonlinear function γ t ∈ Γ t . The given piecewise linear approximation at iteration is defined by a triangulation T of X t−1 × X t × Y t (or a subspace) and the corresponding function values of γ t . Instead of refining this triangulation everywhere now, the main idea is to only refine it in a neighborhood of ( z t , x t , y t ) t=1,...,T . Therefore, first, the simplex in T containing this point is identified. It is then divided by a longest-edge bisection, yielding a refined triangulation, for which a new MILP model can be set up. As proven in [9], this refinement strategy has some favorable properties with respect to convergence, see Sect. 4.2.
It is important that the obtained relaxation ( P +1 ) is tighter than ( P ) so that the corresponding value functions improve monotonically. This is required to ensure that previously generated cuts remain valid in later iterations. For concave functions, this is always satisfied using the presented refinement strategy. For other functions, e.g., convex ones, a more careful determination of the relaxation is required or the MILP models for earlier relaxations have to be kept instead of being replaced. For our theoretical results, it is sufficient that such monotonically improving relaxations can always be determined.
After refining the piecewise linear relaxations, a new iteration + 1 is started, beginning with the inner loop.

Convergence results
In this section, we prove the convergence of the NC-NBD algorithm. We start proving the convergence of the inner loop to an optimal solution of ( P ) based on some results on the binary refinements. Afterwards, we prove that the outer loop converges to an optimal solution of the original problem ( P).

Convergence of the inner loop
As explained in Sect. 3.3.3, within NC-NBD the cuts are not generated at the trial points x i t−1 , but instead at anchor points x i B,t−1 := B t−1 λ i t−1 . This means that the generated cuts, and with that also the cut approximations Q t (·), implicitly depend on the binary approximation precision β t . However, Lemma 3.6 implies that x i t−1 and x i B,t−1 should become equal asymptotically in the refinements of the binary approximations. Therefore, asymptotically, the cuts are guaranteed to overestimate Q R,i t ( x i t−1 , Q i+1 t+1 ) and, due to their Lipschitz continuity, for some sufficiently small precision, they are at least ε Bt -close. This, in turn, leads to convergence of the inner loop, as we formalize and prove below.
Prior to this, let us introduce two useful ideas. Firstly, using the Lipschitz continuity results from Lemma 3.3, page 12 and Lemma 3.11, we can bound the cut approximation error in x i t−1 as follows:

Lemma 4.1 With Assumption (A4), for any iteration i and stage t it follows
Proof See Appendix C.
Secondly, for any stage t and any fixed binary approximation, if we restrict to basic solutions in the Lagrangian duals, only finitely many different realizations of cut approximations Q t (·) can be generated. Thus, after a finite number of iterations, the binary approximation is refined. Assuming that the inner loop does not terminate for ε = 0, we can then observe infinitely many such refinements. Hence, with j → ∞, we also get β t → 0 for all t = 1, . . . , T . Now, we address convergence of the inner loop of NC-NBD to an optimal solution of ( P). First, we provide a preliminary result, which can be proven by backward induction using Lemmas 3.11 and 4.1.

Lemma 4.2
Suppose that the inner loop does not terminate for ε = 0. Then, the infinite sequence of forward pass trial solutions ( x i ) i∈N possesses some cluster point x * with a corresponding convergent subsequence ( x i j ) j∈N . This subsequence satisfies Proof See Appendix D.
Using this result, convergence can be proven.

Theorem 4.3
Suppose that the inner loop does not terminate for ε = 0. Then, the sequence ( v i ) i∈N of lower bounds determined by the algorithm converges to v and every cluster point of the sequence of feasible forward pass solutions generated by the inner loop is an optimal solution of ( P).
Note that with a similar argument it can be shown that the inner loop terminates as soon as .., T . Considering that the inner loop is integrated into an outer loop improving the MILP approximations of ( P), infinite convergence is not directly useful. Moreover, infinitely many binary refinements are not computationally feasible. However, we can deduce that an approximately optimal solution of ( P) is determined in a finite number of iterations.

Corollary 4.4
For any stopping tolerance ε > 0, the inner loop stops in a finite number of iterations with an ε-optimal solution of ( P).

Convergence of the outer loop
We start our convergence analysis of the outer loop with a feasibility result for the solutions determined in the inner loop, which follows from the convergence results in [9]. The main idea is that, as the domain is bounded for all functions γ ∈ Γ , using a longest-edge bisection, after a finite number of steps, all considered simplices become sufficiently small (since in the worst case all simplices have been refined sufficiently often). Next we show that with decreasing the feasibility error also the deviation in the optimal value between ( P ) and ( P) can be controlled.
As a preliminary result, we obtain that for sufficiently small feasibility tolerances ε γ for all γ ∈ Γ , there exists a neighborhood of the optimal solution x := ( z t , x t , y t ) t=1,...,T of problem ( P ) containing a feasible point x := ( z t , x t , y t ) t=1,...,T of ( P). This follows primarily from the continuity of all functions in ( P).

Lemma 4.6
For any δ > 0, there exists someˆ ∈ N such that for all ≥ˆ there exists some feasible point x of ( P) with x − x 2 ≤ δ.
Applying Lemma 4.6 yields the following result with respect to the deviation in the optimal value between ( P ) and ( P). Theorem 4.7 There exists someˆ ∈ N such that for all ≥ˆ we have Proof The proof makes use of the Lipschitz continuity of f t , Lemma 4.5 and Lemma 4.6 to bound v − v from above by L f δ + T t=1 ε f t (with ε f t deduced from ε γ with γ = f t ). The assertion then follows with ε := L f δ + T t=1 ε f t . For a detailed proof see Appendix F.
We obtain the central convergence result for NC-NBD: (b) Let ε = ε > 0. Then, if NC-NBD does not terminate, it converges to an ε-optimal solution of ( P). (c) For any ε > ε > 0, NC-NBD terminates with an ε-optimal solution of ( P) after a finite number of steps.
Proof See Appendix G.

Computational results
We illustrate the adequacy of using the NC-NBD to solve multistage non-convex MINLPs by applying it to moderate-sized instances of a unit commitment problem. NC-NBD is implemented in Julia-1.5.3 [7] based on the SDDP.jl package [12], which provides an existing implementation for SDDP. More implementation details are presented in Appendix H. The considered unit commitment problem is formally described in detail in Appendix I. Importantly, the considered problem contains binary state variables, but also continuous state variables, such that a binary approximation of the state variables is required in the backward pass of NC-NBD. Additionally, all instances contain a nonlinear function in the objective. In the base instances, we consider a concave quadratic emission cost curve in the objective. In the valve-point instances, additionally, we consider a non-convex fuel cost curve with a sinusoidal term. In both cases, we analyze instances with 2 to 36 stages and 3 to 10 generators, resulting in 6 to 20 state variables. More details on our parameter settings and the complete test results for all instances are presented in Appendix I.
The results show that NC-NBD succeeds to solve multistage non-convex MINLPs with a moderate number of stages and state variables to (approximate) global optimality. It converges to the globally minimal point for each of the instances and, considering our 1% tolerance, terminates with valid upper and lower bounds for v.
For the base instances, we observe long computation times of several minutes compared to state-of-the-art solvers for MINLPs, which solve the problems in a few seconds, though. We address some of the reasons and possible solutions for this behavior at the end of this section. As the results for problems with a small number of state variables, but many stages look most promising, for our valve-point instance tests we focus on such instances.
For these instances, the sinusoidal terms in the objective exclude many existing general purpose solvers from application. A sample of the obtained results is presented in Table 1, for the complete ones, see Appendix I. The results show that NC-NBD is less efficient than existing solvers for problems with few stages, but becomes competitive with a larger number of stages. Especially for the instances with 36 stages, conventional solvers have difficulties closing the optimality gap while NC-NBD manages to solve the instances in reasonable time.
These results confirm that NC-NBD should be best suited for multistage problems with a large number of stages, but a relatively small number of state variables, as the obtained subproblems remain sufficiently small even for a larger number of iterations, while general purpose solvers may start to struggle due to the combination of many stages and nonlinear terms. Therefore, NC-NBD may also be useful for stochastic programs where the deterministic equivalent becomes computationally infeasible for monolith approaches. To investigate this is left for future research.
While some of the test results look promising, we still see substantial potential for improvement. This should also help to make NC-NBD more efficient and competitive for problems with a larger number of state variables. It is a known drawback of SDDiP [55], which is inherited by NC-NBD, that existing methods to solve the Lagrangian dual problems may take extremely long to converge. To some extent, this could possibly be mitigated by additionally using different cut types such as strengthened Benders cuts [55], thus, only constructing tight cuts every few iterations. Yet, developing more efficient solution methods is an important open research question.
Additionally, with each projected cut, the considered subproblems become considerably larger. While we implemented a simple cut selection scheme to reduce the subproblem size, more sophisticated approaches may be required to keep the subproblems tractable for applications with many state variables.
Finally, so far, we assume that the outer loop MINLPs are solved to global optimality (A5) directly. In a more efficient implementation of NC-NBD, these subproblems should be approximated as well.

Conclusion
We propose the non-convex nested Benders decomposition (NC-NBD) method to solve multistage non-convex MINLPs. The method is based on combining piecewise linear relaxations, regularization, binary approximation and cutting-plane techniques in a unique and dynamic way. We are able to prove that NC-NBD is guaranteed to compute an ε-optimal solution for the original MINLP in a finite number of steps. Computational results for some moderate-sized instances of a unit commitment problem demonstrate its applicability to multistage problems.
We require all constraints to be continuous and the objective function to be Lipschitz continuous, which are common assumptions in nonlinear optimization. We also assume complete recourse for the multistage problem. Moreover, the regularization factors are assumed to be sufficiently large to ensure exact penalization in the regularized subproblems. If this is not the case, the factors can be increased gradually within NC-NBD.
In contrast to previous approaches to solve multistage non-convex problems, we do not require the value functions to be monotonic in the state variables [36] and allow the state variables to enter not only the objective function, but also the constraints. The latter avoids the assumption of oracles to handle indicator functions [52].
In NC-NBD, we combine dynamic binary approximation of the state variables, cutting-plane techniques tailor-made for binary state variables and a projection from the binary to the original space. This way, we are able to obtain non-convex, piecewise linear cuts to approximate the non-convex value functions of multistage MILPs. Using some additional regularization, this is even possible if those value functions are not (Lipschitz) continuous. Together with piecewise linear relaxations, this yields non-convex underestimators for the non-convex value functions of MINLPs. All approximations are refined dynamically and, by careful design, it is ensured that all cuts remain valid even with such refinements.
The proposed method can be enhanced to solve stochastic MINLPs as well. In particular, a sampling-based approach like in SDDP could be used. In such case some adaptions have to be made with respect to the refinement criteria (forward solutions may remain unchanged for several iterations until the right scenarios are sampled) or the convergence checks, though.
While the presented version of NC-NBD already uses approximations which are dynamically refined, different strategies may be even more dynamic and efficient in practice. For instance, the piecewise linear relaxations could be refined dynamically in the inner loop as well.
The main drawback of NC-NBD is that the considered subproblems can become severely large, since for binary approximation, for piecewise linear approximations and for cut projection, a high number of additional variables and constraints may have to be introduced. This can become problematic, especially, if a very high binary expansion precision is required to approximate the value functions sufficiently good in the forward solutions. Recent results show that the number of binary variables K required grows linearly with the dimension n t of state variables and logarithmically with the inverse of the binary precision β t [55].
Therefore, in its current form, NC-NBD is best applicable to multistage MINLPs which are too large to solve in their extensive form, but for which each subproblem is sufficiently small and contains only a few nonlinear functions.
in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Proof of Lemma 3.8
Proof We start proving the second inequality. We have The inequality follows from B t−1 being the induced matrix norm to the used vector norm. The last equality is obtained by choosing the same norm and α t := σ t B t−1 as regularization factor in ( P R,i Bt (λ i t−1 , Q i+1 t+1 )). To show the first inequality, we construct a dual vector componentwise by This vector is feasible, as it satisfies π t ∞ ≤ l t . By feasibility and by definition of the Lagrangian dual ( D i Moreover, by construction we have π t j (λ i t−1, j − z t j ) = l t |λ i t−1, j − z t j | in each component j. Inserting this into (14) and choosing l t ≥ α t , we obtain the first inequality.

B Proof of Lemma 3.10
Proof Consider line (4) in the KKT conditions. By rearranging and taking norms on both sides, we obtain ν t − μ t = π t+1 + B t η t ≤ π t+1 + B t η t ≤ π t+1 + B t η t . (15) The inequalities follow with the triangle inequality and with the compatibility of the matrix norm.
We can bound all three norms in (15) individually. In the Lagrangian dual, we have π t+1 ≤ σ t+1 B t . With Remark 3.9, we can bound B t and with Assumption (A4), we have η t ≤ ρ t .
For example, using the ∞-norm, we obtain By the equivalence of norms, we obtain similar bounds using other norms. This means that every entry of ν t − μ t is bounded by this constant. Moreover, since in each component only ν t or μ t can be non-zero, this also implies that the components of ν t and μ t are bounded by this constant.

C Proof of Lemma 4.1
Proof From the Lipschitz continuity of Q R,i t we have Analogously, using Assumption (A4) and Lemma 3.11, for the cut approximation we obtain Starting with (17) it follows The second inequality follows from the definition of Q i+1 t (·). The third inequality follows from Lemma 3.8 and the last one is obtained using (16).

D Proof of Lemma 4.2
Proof The structure of the proof is inspired by the proof for Lemma 4 in [1].
As the inner loop does not terminate and X is compact, there exists an infinite sequence of forward pass trial solutions ( x i ) i∈N with cluster points. Let x * ∈ X be such cluster point and ( x i j ) j∈N a subsequence of ( x i ) i∈N with lim j→∞ x i j = x * .
We show that lim j Q For t = T + 1, this relation is trivially true, since no stage follows after T . Now, assume it already holds for stage t + 1, i.e., We consider two subsequent indices in the subsequence ( where the first inequality follows from the monotonicity of Q t (·) in i and the second inequality uses Lemma 3.11. By adding zero, we obtain We can now also use the monotonicity of Q R t (·) in i and apply Lemma 4.1 to obtain Moreover, we expand and with corresponding optimal points ( z ) and (z t ,x t ,ỹ t ). Then with (20) it follows as the solution from (19) is feasible. Thus, ).

Rearranging yields
With inserting (21) in (18), we obtain We take limits on both sides. ( * ) converges to Q R t ( x * t−1 ), since the function is continuous. (#) becomes greater than or equal to zero by the induction hypothesis. (+) tends to zero with Lemma 3.6 since with j to ∞, the binary precision β t goes to 0. (−) tends to zero as x i j t−1 and x i j−1 t−1 both converge to x * t−1 . Thus, the induction is proven for t. As this result holds for any cluster point of ( x i ) i∈N , the assertion follows.

E Proof of Theorem 4.3
Proof Consider the first stage optimal value v F P,i of the forward pass. By recursion we obtain As in the proof of Lemma 4.2, let ( x i j ) j∈N denote a convergent subsequence of ( x i ) i∈N , with lim j x i j = x * . Applying (22) to this subsequence and taking limits on both sides, yields but also the anchor points are used to determine dominated cuts. Our code is available on GitHub [15]. All computations are performed on a machine with Intel(R) Xeon(R) E5-1630 v4 CPU and 128 GB RAM. All benchmark runs using state-of-the-art solvers are executed in GAMS 32.1.0.

I Unit commitment problem formulation and results
We consider a unit commitment problem with thermal generators based on [2] for some first tests of NC-NBD. To formulate this problem, we define the following elements: Sets: -G: set of thermal generators The objective is to minimize the total costs of electricity generation, which consists of different cost components. For all instances, the objective function is nonlinear due to a concave quadratic function modeling emission costs with a g < 0 for all g ∈ G [11].
Additionally, we consider two different types of fuel cost function. In the first case (base instances), the fuel cost function is linear  are solved to an optimality tolerance of 10 −3 . We use the lower bound provided by the MINLP solver to ensure that still a valid lower bound is obtained in the outer loop. The Lagrangian duals are solved with an optimality tolerance of 10 −4 . In case that σ t is not chosen large enough for some stage t from the beginning, it is increased iteratively within the solution procedure once identified. For the base instances, we use BARON [42,48] to solve the outer loop subproblem, for the valve-point instances we draw on LINDOGlobal [44], as BARON does not support sinusoidal functions. For the same reason, ANTIGONE [32], SCIP [16] and Gurobi [20] cannot be applied to the valve-point instances, so that we refer to LINDOGlobal and Couenne [5] as benchmarks.
The obtained upper bounds (UB) and lower bounds (LB) for the base instances are summarized in Table 2 and compared with the optimal point obtained by BARON.
All test instances can be solved by benchmark solvers in a few seconds, thus, outperforming NC-NBD. Still, these results can be regarded as a proof of concept for applying NC-NBD to multistage non-convex MINLPs, as in each case, the globally minimal point is succesfully approximated.
For the valve-point instances, we consider a larger number of stages, but only 3 or 4 generators, i.e., 6 or 8 state variables, to focus on cases, in which NC-NBD looks most promising. For cases with many stages, we test differently scaled demand time series, as this seems to have a considerable effect on solution times. All instances are solved with a maximum solution time of two hours. The results are summarized in Table 3. If a solver does not terminate within the time limit, this is indicated by "-".
For a small number of stages, NC-NBD takes significantly more time than conventional solvers. With a larger number of stages, this difference vanishes, though. For 36 stages, NC-NBD manages to solve all considered instances within less than two hours, while LINDOGlobal and Couenne show more variance in computation time. For one instance, when terminated after two hours, they still show a 5% optimality gap.