Mixed-integer optimal control problems with switching costs: a shortest path approach

We investigate an extension of Mixed-Integer Optimal Control Problems by adding switching costs, which enables the penalization of chattering and extends current modeling capabilities. The decomposition approach, consisting of solving a partial outer convexification to obtain a relaxed solution and using rounding schemes to obtain a discrete-valued control can still be applied, but the rounding turns out to be difficult in the presence of switching costs or switching constraints as the underlying problem is an Integer Program. We therefore reformulate the rounding problem into a shortest path problem on a parameterized family of directed acyclic graphs (DAGs). Solving the shortest path problem then allows to minimize switching costs and still maintain approximability with respect to the tunable DAG parameter θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document}. We provide a proof of a runtime bound on equidistant rounding grids, where the bound is linear in time discretization granularity and polynomial in θ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta $$\end{document}. The efficacy of our approach is demonstrated by a comparison with an integer programming approach on a benchmark problem.


Introduction
This work considers optimal control problems of the form  The continuous function J assigns costs to the state vector y and the function C : L ∞ ((0, T ), R n V ) → R assigns costs to the control function v. We leave its regularity unspecified here and note that a natural choice in this article is a measure of switching costs, like for example the total variation or a weighted variant thereof.
Several approaches to deal with (MSCP) have been considered. Reformulation methods such as variable-time transformation [7,25] necessitate the sometimes difficult evaluation of the costate but allow insertion of new switches, while bilevel optimization approaches require an a-priori fixed maximal number of switchings [6]. The presented approach is complementary to the aforementioned reformulation methods. We use a two-step approach [13] and aim to extend the modeling capabilities of the second step with respect to switching costs and combinatorial constraints.
The challenging feature of (MSCP) is that the discrete-valued variable v is distributed in the time horizon [0, T ]. Consequently, a direct discretization of (MSCP) gives an integer optimization problem, in which the number of integer variables grows with the mesh refinement. In the absence of the costs C, the infimum of (MSCP) can be approximated by deriving and solving a continuous relaxation of (MSCP). The solution of the relaxation is then rounded to a discrete-valued control satisfying an approximation property with respect to the relaxed solution.
While solving a discretization of (MSCP) to optimality is often considered intractable in practice, roundings can be computed efficiently. What is more, the infimum in (MSCP) can be approximated arbitrarily well using this methodology [17,18]. Consequently, rounding approaches have been used to solve a variety of problems [11,13,20,21,23].
Still, there are some drawbacks. First, the infimum is approximated in the limit of the mesh refinements, necessitating very fine grids in practice. Related to this, the rounding algorithms inevitably introduce an (often undesired) chattering behavior into the resulting discrete control because the fractional control is approximated in the weak * topology of L ∞ ((0, T ), R n V ) (see [18]). Second, the approximation properties usually do not hold anymore once the control costs C are present in (MSCP).
The authors of [13] propose to solve an IP optimizing the approximation of a rounding of the relaxation. Recently this approach was extended to constrain the total variation of the rounded control to reduce chattering [24].
We continue these steps towards practical applicability but take a slightly different point of view. For a given discretization mesh, we fix the approximation quality for the controls as the constraint and seek for a binary valued control such that the control costs C are minimized. Therefore the approach allows to incorporate additional combinatorial constraints into the problem. A first version of this idea has been presented by the authors in the short proceedings article [2]. We highlight that the switching costs are considered in the rounding process and are not part of the continuous relaxation. This allows to maintain the approximation properties of the decomposition approach, see Proposition 3 and Remark 4 for a summary of the related results.
While there has been quite some research on approximation guarantees for rounding algorithms, see e.g., [14,23,24], there has not been much work on complexity and theoretic runtime analysis in case the rounding algorithm necessitates the solution of an IP [2,13]. We address this question in detail for a generalization of the rounding algorithm proposed in [2] in the case of equidistant grids that decompose the domain.

Contribution
We reconsider the IP-based rounding algorithm introduced in [2]. While the algorithm provides solutions satisfying established approximation criteria with minimum costs, the computational effort of solving the IP is substantial. In order to improve computational efficiency, we exploit the structure of the IP. We show that the IP corresponds to a shortest path problem in a directed acyclic graph (DAG), which is reduced by means of an equivalence relation. The shortest path problem is known to be solvable in linear time in the size of the graph (see [5]). We prove bounds of O(N (2θ + 3) M ) and O(N (2θ + 3) 2M ) on the number of vertices and arcs of the reduced DAG, respectively, where N denotes the number of intervals discretizing the time horizon [0, T ], and the tuning parameter θ ≥ 1 may be chosen independently of the coarseness of the discretization grid on which the discrete-valued control is defined. We obtain an O(N (2θ + 3) 2M ) algorithm computing the rounded control.
We show that the algorithm can be adapted to obey additional constraints, such as minimum dwell times (see [26]) and vanishing constraints without any increase in complexity. What is more, the objective function to be optimized can be adapted to mirror problems such as the Combinatorial Integral Approximation (CIA) problem (see [13]). It is also possible to include the approximation quality as an objective to obtain solutions approximating the fractional control as least as well as those provided by the well known Sum-Up Rounding (SUR) algorithm from [20].
We conclude by conducting a computational experiment. It shows that the developed algorithm reduces the running times by several orders of magnitude compared to the use of a state-of-the-art solver for the original IP formulation.

Structure of the remainder
The remainder of this article is structured as follows. Section 2 introduces the approximation of (MSCP) and gives a short introduction to rounding algorithms for MIOCPs. In Sect. 3 we reformulate the problem and derive the DAG formulation as well as a labeling scheme, which will be used throughout the article. We state that the optimal solution of a shortest path problem leads to a control for the original problem (MSCP) minimizing the costs C and staying within chosen bounds for the relaxed solution. Afterwards, Sect. 4 provides the worst case analysis for our approach and the proof that the runtime is linear in the grid discretization. Section 5 provides a computational comparison. We close with a summary of the article in Sect. 6.

Notation
The symbol e i denotes the i-th canonical basis vector in an Euclidean space. For an equivalence relation denoted by ∼ on the set V , we denote the quotient set by V ∼ , and the equivalence class

Approximation of (MSCP)
We follow the idea of partial outer convexification (see [21]) to derive an equivalent reformulation as well as a continuous relaxation of (MSCP). Then, we state the abstract consistency property, Definition 2, that is required for rounding algorithms, Proposition 3, and the following main result of the described two step approximation methodology.

Reformulation and relaxation of (MSCP)
We reformulate (MSCP) equivalently and replace the control function v by a {0, 1} Mvalued function ω that models a one-hot or special ordered set of type 1 (SOS-1) activation of the different control realizations v 1 ,.
This formulation allows to move the activation of the different control realizations in the second argument of f to the different right hand sides of the ordinary differential equation (ODE). The problem BC is stated below. inf y,ω The problem (BC) can be relaxed straightforwardly to a continuous optimal control problem by relaxing the SOS-1 constraint to convex coefficients. Moreover, we omit the switching cost term here, because important approximation properties cannot be sustained, see Proposition 3 below, and transfer it to the rounding problem (SCARP) instead. The resulting continuous relaxation (RC) of (BC) is stated below.
By changing the infimum in the formulation of (BC) to a minimum in (RC), we highlight that we tacitly assume that (RC) admits a solution throughout the remainder of the manuscript. We note that the approximation method is not restricted to optimal solutions of (RC) but can be applied to all feasible points. As in [17], we call a function ω * ∈ L ∞ ((0, T ), R M ) satisfying the second constraint of (BC) a binary control. Similarly, a function α * ∈ L ∞ ((0, T ), R M ) satisfying the second constraint of (RC) is a relaxed control. From now on we denote binary and relaxed controls that are functions in L ∞ ((0, T ), R M ) with * and discretized functions using piecewise constant discretizations as matrices without * .

Rounding algorithms
Naturally, binary controls are piecewise constant functions. Therefore, the rounding algorithms operate on grids, which we call rounding grids from now on and which are defined formally below together with the rounding algorithms themselves. A function or algorithm is called a rounding algorithm if it maps a relaxed control and a rounding grid to a binary control that is constant per interval.
We define a consistency property for rounding algorithms that leverages the approximation relationship between (BC) and (RC).

Definition 2
A rounding algorithm is called consistent if there exists a constant θ > 0 such that for all relaxed controls α * and all rounding grids, the produced control ω * satisfies d(ω * , α * ) ≤ θ h, where d is the pseudo-metric given as The consistency property gives rise to the main approximation relationship between (BC) and (RC) in the proposition below [17,18]. This relationship constitutes the theoretical justification for the described approximation methodology. Then, the solutions y (n) of the IVPs in (BC) for the control inputs ω (n), * satisfy We note that the algorithms Sum-Up Rounding (SUR) [20,23], Next-forced Rounding (NFR) [12], and the CIA [13] satisfy Definition 2 and consequently, the claim of Proposition 3 holds for them for refinements of the rounding grids such that h → 0, that is the mesh size vanishes.

Remark 4
The last identity in Proposition 3 usually does not hold for objectives of the form J (y) + C(ω). Therefore, we consider the switching cost term C in the rounding process instead, which is described and analyzed in the next section. For example consider the total variation as switching costs. Then, the discrete-valued approximation of a fractional-valued control function always leads to a divergence of the switching costs towards infinity if the mesh is refined because the control becomes a rapidly oscillating function, see also [18, Section 4, Figure 1] for visualization.

Switching cost aware rounding
Let α * be a relaxed control. We seek an approximating binary control ω * ∈ L ∞ ((0, T ), R M ), such that the costs C in the resulting controls v are minimal. More-over, we seek guarantees on the consistency principle in Definition 2. We propose to use the solution of the optimization problem min ω * feasible for ( BC) as the rounding algorithm for a given rounding grid with mesh size h and a given relaxed control α * ∈ L ∞ ((0, T ), R M ). The slack-parameter θ > 0 steers the tradeoff between the approximation accuracy and the costs of the optimized control function ω * . The existence of solutions is proven for a value of θ ≥ M i=2 1/i in [14] and for θ = 1 in [12].
For a given rounding grid, we formulate (SCARP) as a finite-dimensional IP and provide elementary consistency properties in Sect. 3.1. Then, we derive the graphbased formulation in Sect. 3.2.

Switching cost aware rounding as an integer program
Switching Cost Aware Rounding was introduced in the form of an IP in [2]. Throughout the remainder of this article, we optimize (SCARP) for a fixed rounding grid, implying that the resulting control ω * ∈ L ∞ ((0, T ), R M ) is constant on the intervals that make up the rounding grid. We briefly repeat the steps required to derive the IP formulation.
First, notice that the function α * − ω * is monotone in each component per interval because α * is nonnegative and ω * is interval-wise constant. Consequently, the condition d(ω * , α * ) ≤ θ h is equivalent to the set of linear constraints where the matrix α ∈ R N ×M is defined by , that is α k ∈ R M is the average of the function α * ∈ L ∞ ((0, T ), R M ) over the k-th interval. Analogously, the resulting matrix ω ∈ {0, 1} N ×M allows to reconstruct the piecewise-constant control function ω * . Second, many cost functions, for instance the switch-on of control i from interval k − 1 to k, which may be modeled by nonlinear functional formulations such as c i (1 − ω k−1,i )ω k,i can be rewritten as linear terms after adding additional binary variables to the problem formulation, see for example the formulation of (SCARP) in [2]. We assume that the cost function C can be included into the IP in this way, arriving at the following formulation in which binary variables model the control ω: Based on this formulation, for a given relaxed control α * and a given rounding grid, the solution of (SCARP-IP) yields a piecewise constant binary control defined on the rounding grid. This control minimizes the switching costs while being in a θ h proximity of α * with respect to d. The following proposition follows immediately from the definitions: . It then holds that the function ω * : The literature guarantees that (SCARP-IP) always has a feasible point for θ ≥ 1, see [2,12], which thereby proves the desired consistency from Definition 2 that is required in Proposition 3.

Proposition 6 (Proposition 3.1 in [2]) Let a rounding grid with mesh size h
In particular, the rounding algorithm defined by solving (SCARP-IP), for given relaxed control and rounding grid, is consistent.
If the costs C increase with the number of switches, the optimal costs of the solution of (SCARP-IP) become unbounded when h → 0. However, the maximal frequency of switching (often (min k {h k }) −1 ) is subject to some physical (mechanical or electrical) limits in practice. This may imply that h → 0 cannot be reasonably assumed. The problem formulation (SCARP-IP) gives us the interpretation of θ as a trade-off parameter to balance the deviation of ω from an optimal relaxed control with the induced increase in switching costs.
One of the key ingredients of branch-and-bound solvers for IPs are good continuous relaxations of the IP formulation. If one considers (SCARP-IP) and minimizes θ , we obtain the problem CIA from [13]. An initial solution, originating from the continuous relaxation, is given by α with the worst lower bound zero on the objective value. Consequently the continuous relaxation does not add any information and using blackbox branch-and-bound solvers is not advisable. This observation is confirmed by the long computing times we observe when solving (SCARP-IP) for higher numbers of intervals N in [2], and the speedup reported when the software package pycombina [4] is applied to CIA from [13] with SCIP [1].

Switching cost aware rounding as shortest path in a directed acyclic graph
We propose to consider (SCARP) as a shortest path problem on a DAG. This allows us to use nonlinear cost functions of the form where c s are start costs, c t are intermediate costs from interval t − 1 to t and c f are final costs. This formulation covers switching costs, the main motivation for this article. We now set forth to transform the set defined in terms of the two constraints of (SCARP-IP), namely We begin by defining the set of binary feasible solutions (that is binary controls as elements of the set {0, 1} N ×M ) for a given relaxed control, a given rounding grid, and a given trade-off parameter value. We also say that V t (α, θ ) are the feasible solutions that are spanned by α, θ and This enables us to define vertices and arcs for our DAG. Following the definition of (SCARP), we obtain, given a relaxed control α, a rounding grid and a trade-off parameter value θ , one DAG. Our DAG can be grouped into N layers such that arcs can only exist between vertices of two subsequent layers.
For t ∈ [N ], the vertices in the t-th layer are the matrices ω ∈ {0, 1} t×M that are binary feasible until the t-th interval, that is ω ∈ V t (α, θ ). For t ∈ [N − 1], an arc exists between two nodes ω a ∈ V t (α, θ ) and . This means that a BinFS until the t-th interval ω a ∈ {0, 1} t,M can be extended to a BinFS until the t + 1-th interval, ω b ∈ {0, 1} t+1,M . Before reducing the graph, we formalize this idea in Definition 8 below.

Definition 8
Let α ∈ [0, 1] N ×M satisfy (SOS-1), and let θ > 0. We define the vertex set V as and the corresponding set of arcs A : For t ∈ [N − 1], we define the costs for an arc Note that the size of the DAG, given by |V | + |A|, determines the time required to traverse the graph in a shortest path search. Moreover, arc costs are considered as part of the input of (SCARP-IP), and are not used in the DAG construction. Therefore they do not influence the complexity description of the DAG. Thus the size of the DAG is independent of the value of C. A second observation is facilitated by the following assumption and enables the identification of control realizations with vertices. This assumption allows to factor (Slack) by h, which we do throughout the remainder of the article. We now observe that under Assumption 9 and with (SOS-1), a columnwise sum of ω ∈ V t (α, θ ) yields the number of intervals in which the different control realizations v 1 ,. . .,v M are switched on. We call the resulting vectors the labels of ω.
Definition 10 (Labels of Binary Solutions) Let Assumption 9 hold. Let t ∈ [N ], α ∈ [0, 1] t×M , let θ > 0, and ω ∈ V t (α, θ ). Then the vector L(ω) ∈ N M defined as We observe that if two BinFSs until the t-th interval have the same label value, they also have the same emerging arcs in the DAG in the sense that they may be extended by the same SOS-1 control vector without losing feasibility in terms of (Slack). This also implies that the costs induced by subsequent layers are the same. Moreover, the t-th layer of the DAG consists of a subset of the BinFSs with a sum of label entries of exactly t, that is that the 1-norm of the label is t. We formalize these observations in the following proposition.
Proof Because of ω b ∈ V k and Assumption 9, we obtain This proves the first claim. The second claim follows inductively from the first. The third claim follows from the SOS-1 constraint and the definition of L.
We may observe that the function L induces an equivalence relation ∼ on the set of vertices of the t-th layer V t (α, θ ) as well as on the whole set of vertices V (α, θ ). For the equivalence relation, we define an induced quotient DAG G ∼ := (V ∼ , A). Definition 12 Let Assumption 9 hold, let α ∈ [0, 1] N ×M satisfy (SOS-1) and let θ > 0. Let G = (V , A) be a DAG as defined in Definition 8 with label function L from Definition 10 and induced equivalence relation ∼. The vertex set V ∼ for the quotient DAG G ∼ to G is defined as and the set of arcs is Moreover, there exists a bijection between a (sub)path of vertices (v 1 , . . . , v t ) through (V , A, C) and a corresponding sequence (L(v 1 ), . . . , L(v t )). Finally, the quotient set V ∼ of V decomposes consistently into the quotient sets V k,∼ of V k for k ∈ [N ]. We formalize these observations in the following proposition.  (V , A), that is

∼ be a subpath and let the costs defined by
Proof The first claim is verified straightforwardly. The second claim follows from Proposition 11.3, and in turn L(v) 1 = k for v ∈ V k . The third claim follows from the second. For the fourth claim, we strive an inversion of (v 1 , . . . , v t ) → (L(v 1 ), . . . , L(v t )). We observe that | [v]| = 1 for v ∈ V 1 , that is no information is lost because of the equivalence relation. Thus, L(v 1 ) = L 1 has exactly one solution v 1 ∈ V 1 . Thus, with d k := L k+1 − L(v k ), we can recover the change from layer k to layer k + 1 by virtue of Proposition 11.1, and obtain the injectivity inductively with the formula . The fifth claim follows inductively from Definition 8 and the fourth claim.
The recursive structure of C yields that a shortest path algorithm only needs to store the information on the current and previous vertex of the path in (V ∼ , A, C) to be able to evaluate the costs for an arc in the quotient DAG. Let (v 1 , . . . , v t ) be a path in the DAG (V , A, C). Then, the cost of the path is given by (3.4) By virtue of Propositions 11 and 13 , we are able to prove the main result of Sect. 3, which establishes the equivalence between solving (SCARP-IP) and solving a shortest path over the (quotient) DAG.
Proof The equivalence of 1. and 2. follows from the construction of (V , A, C) and Proposition 11. The equivalence of 2. and 3. follows from Proposition 13.
The Theorem 14 allows to apply a shortest path algorithm to the DAG constructed from a relaxed solution and a parameter θ > 0. The solution can then be used to construct a feasible binary control. Before briefly discussing the applied algorithm we note that the used DAG construction induces a topological order on the vertices of V and V ∼ as all outgoing arcs for vertex [v] ∈ V t,∼ (or its equivalent in V t ) are incident to vertices in V t+1,∼ (V t+1 ) and there exist no outgoing arcs from V t,∼ (V t ) to V t−1,∼ (V t−1 ). With this in mind the general procedure for the shortest-path algorithm, using elements of Dijkstras algorithm, is as follows, see also [5, p. 655]. Because of the topological order of G ∼ the runtime of Algorithm 3.1 is O(|V ∼ | + | A|). The cardinalities |V ∼ | and | A| will be estimated in the next Sect. 4. Theorem 14 now allows to convert the result from applying Algorithm 3.1 to G ∼ into a binary control for the MSCP, which is feasible with respect to (Slack) and optimal with respect to the switching cost function. Furthermore, we observe that, if no path of length N through the DAG can be found, then infeasibility of the instance for the combination of combinatorial constraints and the slack-parameter θ is certified.

Remark 15
We would like to point out that the DAG approach to rounding algorithms offers a lot of modeling flexibility, for example: 1. The presented approach is able to handle additional pointwise vanishing constraints in (RC), see also [14]. In this case the constraint that an arc from v ∈ V t to w ∈ V t+1 with L(w) = L(v) + e i can only be traversed, if α t,i > 0, has to be added. 2. Minimum dwell time constraints [26] can be incorporated by choosing a path through the DAG which satisfies the dwell times for all controls. 3. The CIA problem from [13] can be solved efficiently with our approach by including changing costs C.
Note that these modifications do not negatively affect complexity: On the one hand, a shortest path through a DAG can be computed in linear time for any cost function, in particular negative ones [5]. On the other hand, the removal of arcs or vertices from the DAG only makes the shortest path problem easier.
Furthermore, we observe that, if no path of length N through the DAG can be found, then we have an infeasibility certificate of the instance for the combination of combinatorial constraints and the slack-parameter θ .

Runtime complexity estimates
Section 4.1 establishes a worst-case scenario for the approach with regards to runtime for calculating an optimal solution. In Sect. 4.2 it is shown that the time needed to calculate the optimal solution with respect to α and θ is linear in N and an upper bound on the total computing time in dependency of M, N and θ is given for the established worst-case.

Worst case for the graph based algorithm
From the previous section we know that to establish the runtime of the proposed approach knowledge of the maximal cardinalities of |V ∼ | and |Ã| are necessary. As arcs are only defined in between vertex sets of subsequent time intervals, the underlying graph is bipartite. Thus |Ã| is only dependent on |V ∼ |, which in turn is dependent on the cardinality of the set of BinFSs spanned by a relaxed solution. Therefore the worst case for the proposed shortest path search is the relaxed solution which spans the largest set of BinFSs.
We define the set of all relaxed solutions α, that is the ones satisfying (SOS-1), below.
We finish at a finite subset R such that max α∈R |V ∼ (α, θ + 1)| can be bounded favorably in Sect. 4.2. Remembering Assumption 9, we define the following two subsets of S(M, t), which assume the role of R in the remainder.

Outline of the proof of Theorem 18
We divide the proof into two parts. First, we prove the inequality (4.4) for the choices α ∈ S(M, t) and α ∈ I (M, t). Second, we improve the statement to the desired claim by proving (4.4) for the choices α ∈ I (M, t) and α ∈ S I (M, t).

Lemma 19 Let Assumption 9 hold and let θ ≥ 1. Then for all t ∈ [N ] and all α ∈ S(M, t), there exists an α ∈ I (M, t) such that
To prove Lemma 19, we will require an α ∈ I (M, t) that satisfies two important bounds, which are introduced below in (4.6). The existence follows with the help of the Next-forced Rounding (NFR) algorithm from [12].

Lemma 20 For all t ∈ [N ] there exists α ∈ I (M, t) such that for all i ∈ [M]
(4.6) Proof We use the (NFR) algorithm from [12] to deduce that there exists an α ∈ I (M, t) that satisfies for all s ∈ [t] (Proposition 4.8, [12]). Furthermore, by definition of · it holds that b ≤ c implies b ≤ c for b ∈ N and c ∈ R. Analogously, a ≤ b implies a ≤ b for a ∈ R and b ∈ N. Because of α ∈ I (M, t), we have s k=1 α k,i ∈ N for all s ∈ [t] and i ∈ [M] and the claim follows.

Proof of Lemma 19
There is nothing to prove if α ∈ I (M, t). Thus we restrict to α ∈ S(M, t)\I (M, t). Lemma 20 yields the existence of an α ∈ I (M, t) such that the inequalities (4.6) hold. Let v ∈ V t ( α, θ) be arbitrary. Then for all i ∈ [M] at an arbitrary grid point t ∈ [N ] it holds that The last inequality in both (4.8) and (4.9) holds by Definition 7 and (Slack), while the first inequality holds because of the chain of inequalities (4.6). Because v was arbitrary by Definition 7 the claim follows.

Proof of inequality (4.4) for the choices˛∈ I(M, t) and˛∈ SI(M, t)
To complete the proof of Theorem 18 it remains to show that inequality (4.4) holds for α ∈ I (M, t) and α ∈ S I (M, t), which is the purpose of the following lemma.
The base case is laid out in Lemma 26 and the step is elaborated in Lemma 27, which together conclude the proof.
The constraint (Slack) allows to obtain upper and lower bounds on the number of intervals k where α k,i = 1 holds. We introduce corresponding notation below.
We can therefore start to construct the last row of the desired ω ∈ V t (α, θ). by setting ω t, j = 1 and ω t, = 0 for = j. Then we repeat the described procedure backwards in t with L − e j to determine the entries of the row ω t−1 ∈ Z M until t = 1. We obtain a recursively defined ω ∈ V t (α, θ) with L(ω) = L.
In our arguments, we require a means to quantify the distance between a given α ∈ I (M, t) and the set S I (M, t) in terms of the label function L. We introduce a distance function that satisfies our needs. 1. There exists a minimizer α of (D). In particular the value d t (α) is well defined.

The difference between two consecutive integral label distances satisfies d t
Proof For the first claim, we notice that minimizing (D) reduces to enumerating the set S I (M, t), which is finite by the construction from Definition 17. For the second claim, we notice that α satisfies (SOS-1) because of α ∈ I (M, t). Therefore the integral label distance can differ by at most 1 in between two consecutive distances d t (α) and d t−1 (α) due to sum-norm and the factor 1 2 in front of it. We are ready to prove the base case and induction step of Lemma 21. The arguments bear similarities, and we begin with the base case, which is less technical. We proceed by distinguishing between two cases for (4.17). In case t k=1 α k,i − t k=1 α k, j = 1, we obtain the identities t k=1 α k, j (4.14) Therefore, we may conclude for g ∈ [M] that j by (4.16). (4.20) Let ω ∈ V t (α, θ). We define the vector L as From equations (4.20) we deduce that θ − t (α, θ ) ≤ L ≤ θ + t (α, θ ). Hence, we apply Lemma 23 to deduce the existence of an ω ∈ V t (α, θ ) such that L(ω) = L. Thus [ω] ∈ V t,∼ (α, θ ). This procedure can be repeated for all ω ∈ V t (α, θ). Thus, factorizing by label values using ∼ we conclude that the claimed inequality (4.13) holds.
For the second case, t k=1 α k,i − t k=1 α k, j ≤ 0, it follows that We conclude the proof by means of a case distinction on the sign of θ − t, j (α, θ).
Case θ − t, j (α, θ) > 0. For all ω ∈ V t (α, θ) there exists an element ω ∈ V t (α, θ ) with L(ω) = L := L(ω) + e i − e j by virtue of Lemma 23 because Combining this with the fact that Z M → + e i − e j ∈ Z M is a bijection on the codomain of the labeling function L, the claimed inequality (4.13) follows.
For L it holds that Thus by virtue of Lemma 23, we find ω ∈ V t (α, θ ) with L(ω) = L. In this case defines a bijection on the set of label values, where is the label of ω ∈ V t (α, θ) interpreted as a vector and i , j are the entries associated to control i and j respectively.
Lemma 26 states that if d t (α) = 1, there exist an element of S I (M, t) whose set of BinFSs is at least as large as the one from α when factorizing for identical label values. We continue with the induction step in which we construct the labels similar to the base case to prove the claim for higher values of d t (α).
The claim for the first case follows analogously to the corresponding case in Lemma 26 because (4.14) and (4.15) follow from (4.28)-(4.30).
The case θ − k, j (α, θ) = 0 can be handled analogously to the corresponding case in Lemma 26 as well because the equality (4.22) and inequality (4.24) follow from the chain of inequalities (4.31).
This concludes the proof of the assumptions made in the proof of Lemma 21 and therefore the second part of the proof of Theorem 18. Theorem 18 is now a result of putting Lemmas 19 and 21 together.
and concludes the proof of Theorem 18.

Remark 28
Note that relaxed solutions in computational practice are often fractionalvalued. Due to Theorem 18 the worst-case estimate on the number of vertices is attained by an already binary-valued function. Additionally the set of worst-case relaxed solutions S I (·, ·) has a peculiar structure, see Definition 12, which is also scarcely seen in practice. Thus the worst-case for our approach occurs rarely in practice.
To prove (4.34), let k > M θ . It is immediate that each ω ∈ V k−1 (α, θ ) can be extended to ω ∈ V k (α, θ ) with ω = ω for all ∈ [k − 1]. It remains to prove |V k,∼ (α, θ )| ≤ |V k−1,∼ (α, θ )|. We choose again i ∈ [M] such that α k,i = 1 and α k, j = 0 for all j ∈ [M]\{i}. Because α ∈ S I (M, k) and k > M θ , every column of α contains at least θ nonzero entries. Thus, (4.36) We pick an arbitrary ω ∈ V k (α, θ ). We set L = L(ω) − e i and note that L i ≥ 0 holds because of inequality (4.36). The feasibility inequalities θ − k−1,i (α, θ ) ≤ L j ≤ θ + k−1,i are satisfied by construction. Moreover, We apply Lemma 23 to deduce that there exists ω ∈ V k−1 (α, θ ). Because of (4.36), we may use the bijectivity of the mapping Z M → − e i ∈ Z M to deduce that  Naturally for α ∈ S I (M, N ) any solution of the system (4.41) represents one element of V M θ ,∼ (α, θ ), because the (Slack) condition is encoded in 0 ≤ x i ≤ 2θ , while the (SOS-1) condition is enforced through the sum. A closed form for the number of solutions can be determined by using formal power series [19] in combination with coefficient extractions and partial geometric sums.
Unfortunately the structure of Eq. (4.42) is not suitable for a precise calculation. However, using Stirling's formula and the local limit theorem, Eger [8] has shown the following asymptotic approximation for (4.42).

Theorem 32 ([8]) Let M be a positive integer and
Using Theorem 32 in correspondence with the bipartite graph structure constructed in Sect. 3.1 allows to obtain a worst case runtime.
Theorem 33 (Runtime of the graph based algorithm for (SCARP-IP)) Let assumption 9 hold and let θ ≥ 1. Then searching for a binary feasible solution which is optimal in a radius of θ around a given solution α of the (discretized) relaxed problem (RC) has the worst case runtime of Thus proving the claim.  1 Computation times of (SCARP) and SUR [20] for different values of the mesh size h θ , we ran benchmarks with values of θ ∈ 5 6 · {1, 1.5, 2}, that is we chose values of θ for which existence of a solution is guaranteed, see Sect. 3, [12,14]. Furthermore, we included the running time of the SUR rounding algorithm [20] as a baseline. As noted before, we do not consider the switching cost term in (LV) but transfer the minimization of switching costs to the rounding step.
The continuous relaxed problem was solved by means of a first discretize, then optimize methodology using direct multiple shooting to discretize the dynamics [3]. We chose equidistant discretizations with h ∈ {2 −1 T , . . . , 2 −10 T }. We employed the solver Muscod-II, see [15], and solved (SCARP-IP) using version 8.1 of the IP solver Gurobi [16]. The shortest path algorithm 3.1 from [5] to solve (SCARP-IP) was implemented in the C++ programming language compiled using the GNU C++ compiler with the optimizing option -O2. All experiments were conducted on an Intel Core i7-965 clocked at 3.20 Ghz.
The running times of the algorithms are depicted in Fig. 1. Running times below 1 S were not reported by the IP-solver, which explains the incomplete data lines for (SCARP-IP). Sum-Up-Rounding, having linear running time, performs best in terms of runtime, averaging a running time of below 1 ms (drawn solid). Conversely, the IP-solver takes a significant time to solve (SCARP-IP), with a maximum of more than 16 h, increasing both in θ and 1/h (depicted dotted). The shortest path (SP-SCARP) implementation of (SCARP-IP) presents a vast improvement over the IP formulation in terms of running time, being far less sensitive to the parameters with a maximum running time of below 10 ms (drawn dashed).
The cardinality of the vertex and arc set and its importance for the presented approach was discussed in Sect. 4. Figure 2 shows the cardinality of the vertex and arc sets visited during the (SP-SCARP) computations and compares them with the proven worst-case bound from Theorem 33. As expected, the number of vertices and arcs increases linearly with the refinement of the mesh size h, but the number of vertices and arcs actually seen during the computation are significantly lower than the theoretical bound, Theorem 18 and Remark 28. was used with respect to the common relaxed solution α for controls determined by SUR and (SP-SCARP). We omit the results of (SCARP-IP) here, because they are identical to the results calculated with (SP-SCARP). In general it is possible that multiple (SCARP) results have the same switching costs but different (LV) values, but comparing the (LV) values can be done quickly as the corresponding controls are known for the whole time horizon, see predecessors P from Algorithm 3.1. Figure 3 visualizes the behavior of the relative error. Since the considered initial value problem is nonlinear, the error does not necessarily decrease monotone over the grid refinements. The convergence property guaranteed by Proposition 6 is validated and the objective value for (LV) grows as the θ parameter is increased, as predicted in Sect. 3. The switching costs of the optimal control solutions are visualized in Fig. 4. We observe that for all θ the switching costs computed by (SP-SCARP) are significantly lower than the switching costs of (SUR) and that a higher value of θ leads to smaller switching costs because the controls can be chosen farther away from the relaxed solution. This validates the trade-off character of θ in this example and shows one of the features with which (SP-SCARP) enhances the current capabilities of the relaxation approach. Others are mentioned in Remark 15.
Comparing Figs. 3 and 4, one observes that simultaneously minimizing the switching costs and driving the approximation error to 0 is not possible in our example. Note that including C(·) in the objective of (LV) leads to divergence of the relative error e (h) quantity towards infinity unless the optimized control of the continuous relaxation is already a binary control.
We want to point out that we could have just as easily chosen to optimize the approximation quality instead of adding switching costs in order to obtain solutions approximating the fractional control at least as well as the SUR solution. For a given discretization grid this corresponds to replacing C(·) by θ in (SCARP), which allows us to use the DAG structure and the shortest path algorithm. We also note that if the Fig. 3 Validation of theoretical results regarding the approximations of the objective for (SCARP) and SUR [20] for different values of h. The grey line visualizes the desired linear decrease of the error Fig. 4 Switching costs of (SCARP) and SUR [20] for different values of the mesh size h control deviation parameter θ is chosen too small or combinatorial constraints can never be satisfied, then an infeasibility certificate can be easily provided by (SCARP) as no path from t 0 to t N will exist.
For the purpose of applying rounding algorithms, we recommend computing controls from both approaches and comparing the results if one is not sure which algorithm to prefer a-priori. Note that the costs are still negligible compared to solving a large mixed-integer problem and that the shortest path approach offers much flexibility with respect to constraints, switching costs and performance guarantee for the integral control deviation.

Conclusion
We have shown that a feasible control for (MSCP) minimizing a general switching cost term can be derived by solving an IP. The IP is based on the relaxed solution α of the partial outer convexified counterpart to (MSCP). The feasible set of this IP has a DAG structure whose size is governed by a user-defined slack-parameter θ , which governs the approximation quality with respect to α. By computing a solution of the corresponding shortest path problem, an optimal solution or an infeasibility certificate with respect to the chosen slack-parameter θ and relaxed solution α of the IP can be found. Additional combinatorial constraints on the permitted switching structures accelerate our approach as they thin out the DAG we have to process, Remark 15. We have proven favorable bounds on the worst case for the number of vertices in the DAG in the case of an equidistant grid, Theorem 18, which is linear in time, polynomial in θ and exponential in the number of binary controls, Theorem 33. Our approach promises to be beneficial in practice as we have exemplarily demonstrated a speed up of several orders of magnitude over a naive IP-based approach.