1 Introduction and literature

Coordinate descent methods have a long history in optimization. In the continuous case the simplest form is the Gauß–Seidel method. Hildreth (1957) analyzed this method for quadratic programs and proved that the sequence of objective values converges to the optimal objective value and that, if the constraint matrix has full rank, also the sequence of solutions converges to an optimal solution. Two years later, D’Esopo (1959) generalized the procedure to convex optimization problems and proved that under rather strong conditions (the feasible set is a box, all subproblems have a unique optimal solution) the sequence of objective values converges to the optimal objective value and every accumulation point of the sequence of generated solutions is an optimal solution. Blockwise coordinate descent, in which a group of variables is optimized simultaneously in each iteration, was introduced by Warga (1963), who transferred the convergence result of D’Esopo. Subsequently, several weaker conditions that ensure convergence to an optimal solution appeared. For example, Grippo and Sciandrone (2000) showed that if the feasible region is the product of the closed, convex feasible regions of the subproblems, the objective function is continuously differentiable and pseudoconvex, and the set of feasible solutions whose objective value is at most the objective value of the initial solution is compact, then the sequence of solutions generated by the blockwise coordinate descent method has limit points and every limit point is a global minimizer. Furthermore, Jäger (2016) proved that the compactness assumption can be replaced by the assumption that the feasible sets of the subproblems are either polyhedra or intersections of sublevel sets of strictly convex functions and that then the sequence converges. For non-pseudoconvex objective functions Zangwill (1969) gave a condition that only guarantees coordinatewise optimality of the limit points because convergence to a globally optimal solution does not hold in this case. This result was generalized to blockwise coordinate descent by Tseng (2001), who showed that if the feasible region is the product of the closed, convex feasible regions of the subproblems, the objective function is continuous, the set of feasible solutions whose objective value is at most the objective value of the initial solution is compact, and every subproblem has a unique optimal solution, then every limit point of the sequence generated by the blockwise coordinate descent algorithm is blockwise optimal. Similar algorithms are compass search and the Frank–Wolfe algorithm. The notion of blockwise optimality is related to the notions of an equilibrium point from equilibrium programming and of a Nash equilibrium from game theory; namely, blockwise optimal solutions correspond to Nash equilibria of potential games, introduced in Monderer and Shapley (1996). Coordinate descent applied to find them is known under the name of best-response paths and is treated, for example, by Voorneveld (2000).

In discrete optimization there are many approaches in which a problem is solved by fixing parts of the variables in turn and solving the resulting subproblem. This is done in many publications under many different names. We list a few examples. In facility location, blockwise coordinate descent has become popular as Cooper’s method (Cooper 1964). Here a set of new facilities is randomly chosen as starting solution, then the assignment of the customers to the facilities is determined, and then this assignment is kept fixed and the facilities are re-optimized. Cooper’s method is still under research, see, e.g., Drezner et al. (2015), Drezner and Salhi (2017). A similar iterative method is also used in statistics for fitting multiple regression lines or for trimmed regression, see, e.g., Rousseeuw (1987), Hawkins (1994). Here a (set of) regression lines is randomly chosen and fixed, then the assignment and/or the outliers are determined and the lines are further optimized according to these fixed assignments. Using a large number of starting solutions provides good results. The same procedure is used for k-means clustering and is there known as Lloyd’s algorithm (Lloyd 1982). It has been shown to converge to the so-called centroidal Voronoi tessellation in Du et al. (2006). Under the name iterative variable fixing heuristics, blockwise coordinate descent is furthermore popular in many applications in logistics, e.g., for solving a combined blending and distribution planning problem (Bilgen 2007) or in transportation planning when combining vehicle and crew scheduling (Gintner et al. 2005). A general scheme on how to apply blockwise coordinate descent in problems of transportation planning is proposed in the eigenmodel in Schöbel (2017).

2 Problem definition

We consider integer programming problems given as

$$\begin{aligned} \text{(P) } \ \ \min \{f(x): G(x) \le 0, \quad \ x \in {\mathbb {Z}}^n\} \end{aligned}$$

for \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\), \(G:{\mathbb {R}}^n \rightarrow {\mathbb {R}}^m\). We denote the feasible set of \({\mathrm {P}}\) by \({\mathfrak {X}}:=\{x \in {\mathbb {Z}}^n: G(x) \le 0\}\). Note that (without loss of generality) we require the functions f and G to be defined everywhere on \({\mathbb {R}}^n\).

Let \(I \subseteq \{1,\ldots ,n\}\) be a group of variables, called a block, and \(I^C:=\{1,\ldots ,n\} {{\setminus }} I\) its complement. We split the variables into a part \(x_I \in {\mathbb {R}}^{|I|}\) and a part \(x_{I^C} \in {\mathbb {R}}^{n-|I|}\) such that we may write \(f(x)=f(x_I,x_{I^C})\) and \(G(x)=G(x_I,x_{I^C})\).

For any fixed \(y \in {\mathbb {R}}^{n-|I|}\) we receive a subproblem

$$\begin{aligned} (\mathrm{P}_I(y)) \ \ \min \{f(x,y): G(x,y) \le 0, \quad \ x \in {\mathbb {Z}}^{|I|}\} \end{aligned}$$

in which we only determine values for the variables with indices in I. We often write subproblem I referring to the subproblem \(\hbox {P}_I\) determined by the variables in set I.

In the blockwise coordinate descent method we iteratively solve a sequence of such subproblems. To this end, denote the family of all subsets of \(\{1,\ldots , n\}\) by \({\mathcal {P}}(\{1,2,\ldots ,n\})\) and let

$$\begin{aligned} I_1,I_2,\ldots ,I_p \in {{\mathcal {P}}}(\{1,2,\ldots ,n\}) \end{aligned}$$

be the indices of the variables of the respective p subproblems. We require that every variable appears in at least one of the subproblems, i.e., \(\bigcup _{j=1,\ldots ,p} I_j = \{1,\ldots ,n\}\). We write \({\mathfrak {J}}:=\{I_1,\ldots ,I_p\}\) for the set of subproblems to be considered. Note that \({\mathfrak {J}}\) is a cover of \(\{1,\ldots ,n\}\). Let \(n_j:=|I_j|\) be the number of variables of subproblem j.

In order to simplify notation, we use the index j instead of the subproblem \(I_j\) and formulate the subproblem \(\mathrm {P}_{I_j}\) in which the variables from \(I_j^C\) are fixed and we only determine new values for the variables in \(I_j\) as

$$\begin{aligned} (\mathrm{P}_j(x_{-j})) \ \ \min \{f(x_{+j},x_{-j}): G(x_{+j},x_{-j}) \le 0, \quad \ x_{+j} \in {\mathbb {Z}}^{n_j} \}, \end{aligned}$$

where

$$\begin{aligned} \mathrm{P}_j&:= \mathrm{P}_{I_j} \ \text{ is the subproblem with variables only in } I_j,\\ x_{+j}&:= x_{I_j} =(x_i: i \in I_j) \in {\mathbb {R}}^{n_j} \ \text{ is its set of variables, and}\\ x_{-j}&:= x_{I^C_j} =(x_i: i \not \in I_j) \in {\mathbb {R}}^{n-n_j} \ \text{ are the remaining variables.} \end{aligned}$$

The blockwise coordinate descent method can then be formulated as follows.

Algorithm BCD (figure a)

We call a complete execution of the For-loop a round and a single execution of Steps 1 and 2 an iteration. Unless stated otherwise, we use the following stopping criterion: we terminate if in one round all problems \(I_1,\ldots ,I_p\) have been executed without an improvement in the objective function.
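Since the algorithm box is only referenced above, the following minimal sketch illustrates the structure of Algorithm BCD together with the stopping criterion just described. It is an illustration only: the interface (a list of blocks and a routine `solve_subproblem` returning an optimal solution of \(\mathrm {P}_j(x_{-j})\)) is our assumption, not the original pseudocode.

```python
def bcd(blocks, solve_subproblem, f, x0, max_rounds=1000):
    """Minimal sketch of Algorithm BCD (interface and names are illustrative).

    blocks           -- the cover J = [I_1, ..., I_p] of the variable indices
    solve_subproblem -- solve_subproblem(j, x) returns an optimal solution of P_j(x_{-j}),
                        i.e. the best feasible point whose entries outside I_j equal those of x
    f                -- objective function of P
    x0               -- feasible starting solution x^(0)
    """
    x = tuple(x0)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(blocks)):        # one round = one pass over all subproblems
            x_new = solve_subproblem(j, x)  # one iteration: re-optimize the variables in I_j
            if f(x_new) < f(x):
                x, improved = tuple(x_new), True
        if not improved:                    # stopping criterion: a whole round without
            return x                        # improvement of the objective function
    return x
```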

As an illustration we use the planning process in public transportation. Here, one looks for a line plan, a timetable, and schedules for the vehicles covering all trips. Usually, this is done sequentially: First, the line planning problem is solved, then a timetable is determined and finally the vehicle schedules are planned. But why should one stop here? One could now fix the vehicle schedule and the timetable and try to improve the line plan, then fix the line plan and the vehicle schedule and improve the timetable, and then improve the vehicle schedule while the other two plans are fixed, see Fig. 5. This is described in more detail in Sect. 6.

3 Properties of the sequence

We denote by \(x^{(k)}\) the solution at the end of the k-th iteration of Algorithm BCD.

3.1 Existence of optimal solutions in the subproblems

Since the solution \(x^{(k)}\) of iteration k is a feasible solution to \(\text{ P }_j(x^{(k)}_{-j})\) for all \(j=1,\ldots ,p\), all subproblems occurring during the blockwise coordinate descent have a feasible solution if the starting solution \(x^{(0)}\) is feasible for \({\mathrm {P}}\). However, if \({\mathfrak {X}}\) is an unbounded set it might happen that some of the problems \(\mathrm {P}_j(x)\) are unbounded and hence admit no optimal solution. We investigate this case for integer linear programs.

Lemma 1

Let \({\mathrm {P}}\) be an integer linear program with rational data, i.e., \(\min \{c^{\mathrm {T}} x \mid A x = b,\ x \in {\mathbb {Z}}_{\ge 0}\}\) with \(A \in {\mathbb {Q}}^{m \times n}\), \(b \in {\mathbb {Q}}^m\), and \(c \in {\mathbb {Q}}^n\). Let \(j \in \{1,\cdots ,p\}\), and let \(x, y \in {\mathfrak {X}}\) such that \({\mathrm {P}}_j(x_{-j})\) and \({\mathrm {P}}_j(y_{-j})\) are feasible. Then \({\mathrm {P}}_j(x_{-j})\) is unbounded if and only if \({\mathrm {P}}_j(y_{-j})\) is unbounded.

Proof

Suppose that \({\mathrm {P}}_j(y_{-j})\) is unbounded. Then also its LP relaxation is unbounded. By the decomposition theorem for polyhedra (Minkowski’s theorem, see, e.g., Nemhauser and Wolsey 1988), its feasible region can be written as \({{\,\mathrm{conv}\,}}(\{v^{(1)}, \cdots , v^{(g)}\}) + {{\,\mathrm{conic}\,}}(\{r^{(1)}, \cdots , r^{(h)}\})\) for some \(v^{(1)}, \cdots , v^{(g)}, r^{(1)}, \cdots , r^{(h)} \in {\mathbb {R}}^{n_j}\), and \(c_{j}^t r^{(k)} < 0\) for some \(k \in \{1,\cdots ,h\}\). The vector \(((r^{(k)})^{{\mathrm {T}}}, 0_{-j}^{{\mathrm {T}}})^{{\mathrm {T}}}\) lies in the recession cone (see, e.g., Rockafellar 1970) of the LP relaxation of \({\mathrm {P}}\), so \(r^{(k)}\) lies in the recession cone of the LP relaxation of \({\mathrm {P}}_j(x_{-j})\), i.e., of the convex hull of the feasible region of \({\mathrm {P}}_j(x_{-j})\). As \(r^{(k)}\) has a negative objective value, the latter is unbounded, so the problem \({\mathrm {P}}_{j}(x_{-j})\) must be unbounded as well. \(\square \)

Note that the assumption of linearity is crucial in Lemma 1 since for integer non-linear programs, the lemma does not hold in general as the following example shows.

Example 2

Consider the integer non-linear program

$$\begin{aligned} ({\mathrm {P}}) \quad \min \{x_1 + x_2 \mid x_1 \cdot x_2 \ge -1 , \ x_1 \le 0, \ x_1, x_2 \in {\mathbb {Z}}\}. \end{aligned}$$

For \(I=\{1\}\) and \(x_2=1\) the subproblem \({\mathrm {P}}_I(1)\) has the optimal solution \(x_1 = -1\), but for \(x_2 \le 0\) the subproblem \({\mathrm {P}}_I(x_2)\) is unbounded.

For integer linear programs with rational data, Lemma 1 tells us the following: If Algorithm BCD finds an optimal solution for every problem in the first round (i.e., when every subproblem \(I \in {\mathfrak {J}}\) has been considered once), then it will find an optimal solution in every subproblem considered later. This can be tested by only looking at the first feasible solution \(x^{(0)}\).

Corollary 3

Let \({\mathrm {P}}\) be an integer linear program with rational data. If \(x^{(0)}\) is feasible for \({\mathrm {P}}\) and \({\mathrm {P}}_j(x^{(0)}_{-j})\) is bounded for all \(j=1,\ldots ,p\), then all subproblems \({\mathrm {P}}_j(x_{-j})\) which appear when executing Algorithm BCD have an optimal solution.

3.2 Objective values

If all subproblems have an optimal solution, then the sequence \((f(x^{(k)}))_{k \in {\mathbb {N}}}\) of objective values is monotonically decreasing since the solution \(x^{(k)}\) of iteration k is a feasible solution to \(\text{ P }_j(x^{(k)}_{-j})\) for all \(j=1,\ldots ,p\). This implies convergence of the sequence if the optimization problem \({\mathrm {P}}\) is bounded. Together with Corollary 3 we hence get the following condition for convergence.

Corollary 4

Let \({\mathrm {P}}\) be an integer linear program with rational data. If \(x^{(0)}\) is feasible for \({\mathrm {P}}\) and \(\mathrm {P}_j(x^{(0)}_{-j})\) is bounded for all \(j=1,\ldots ,p\), then the sequence \((f(x^{(k)}))_{k \in {\mathbb {N}}}\) of objective values generated by Algorithm BCD is monotonically decreasing. Furthermore, if \({\mathrm {P}}\) is bounded, the sequence converges.

Note that boundedness of all subproblems \(\text{ P }_j(x_{-j})\) for all \(x \in {\mathfrak {X}}\) does not imply boundedness of P, not even in the integer linear programming case of Corollary 4; hence the assumption that P is bounded is in fact necessary. We next investigate the question whether the sequence \((f(x^{(k)}))_{k \in {\mathbb {N}}}\) (if it converges) eventually becomes constant. In this case, we can be sure that Algorithm BCD terminates if we stop after a round without any improvement. This need not hold in general even if \({\mathrm {P}}\) has an optimal solution, as the next example demonstrates.

Example 5

Consider the integer nonlinear program

$$\begin{aligned} \ \ \min \left\{ \frac{2x_1+2x_2-4}{(x_1+x_2)^2} : -x_1-x_2 \le -1, \ x_1-x_2 \le 1, \ -x_1+x_2 \le 1, \quad \ x_1, x_2 \in {\mathbb {Z}}\right\} . \end{aligned}$$

An optimal solution to it is (0, 1) with objective value \(-2\). However, for the initial solution \(x^{(0)}=(3,3)\) the sequence generated by Algorithm BCD for \(I_1=\{1\}\) and \(I_2=\{2\}\) is given by

$$\begin{aligned} x^{(k)}=\left\{ \begin{array}{ll} (k+3,k+2) &{} \text{ if } k \text{ is } \text{ odd } \\ (k+2,k+3) &{} \text{ if } k \text{ is } \text{ even } \end{array} \right. \text{ for } k=1,2,\ldots \end{aligned}$$

and the resulting sequence of objective values for \(k \ge 1\) is \(f(x^{(k)})=\tfrac{6+4k}{(5+2k)^2} \rightarrow 0\) which converges to zero but never becomes constant.
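For readers who want to reproduce this sequence, the following brute-force sketch (our illustration; variable and function names are ours) iterates the two one-dimensional subproblems of Example 5:

```python
def f(x):
    x1, x2 = x
    return (2*x1 + 2*x2 - 4) / (x1 + x2)**2

def feasible(x):
    x1, x2 = x
    return -x1 - x2 <= -1 and x1 - x2 <= 1 and -x1 + x2 <= 1

x = (3, 3)                        # starting solution x^(0)
for k in range(1, 7):             # a few iterations with blocks I_1 = {1}, I_2 = {2}
    j = (k - 1) % 2               # index of the free coordinate in this iteration
    fixed = x[1 - j]
    # the constraints |x_1 - x_2| <= 1 confine the free coordinate to a window of three integers
    cands = []
    for v in range(fixed - 1, fixed + 2):
        cand = (v, fixed) if j == 0 else (fixed, v)
        if feasible(cand):
            cands.append(cand)
    x = min(cands, key=f)
    print(k, x, f(x))             # reproduces x^(k) and f(x^(k)) = (6+4k)/(5+2k)^2
```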

The next lemma gives some cases in which the sequence \((f(x^{(k)}))_{k \in {\mathbb {N}}}\) becomes constant.

Lemma 6

Let \({\mathrm {P}}\) be given and let the sequence \((x^{(k)})_{k \in {\mathbb {N}}}\) exist (i.e., assume that Algorithm BCD is executable). The sequence of objective values \((f(x^{(k)}))_{k \in {\mathbb {N}}}\) generated by Algorithm BCD becomes constant if one of the following conditions hold.

  (i) The set \({\mathfrak {X}}\) is bounded.

  (ii) The sublevel set \(\{x \in {\mathfrak {X}}: f(x) \le f(x^{(l)}) \}\) is bounded for some \(x^{(l)}\).

  (iii) The function f is a polynomial with rational coefficients and \({\mathrm {P}}\) is bounded.

Proof

We show that \((f(x^{(k)}))\) can only take finitely many different values. Since it is monotonically decreasing, it then must become constant. For (i) this is clear, since a bounded set of integers is finite. For (ii) we use that \((f(x^{(k)}))_{k \in {\mathbb {N}}}\) is decreasing, i.e., \(f(x^{(k)}) \le f(x^{(l)})\) for all \(k \ge l\) and hence the sequence \((x^{(k)})\) remains for \(k \ge l\) in the bounded set of feasible solutions with objective value at most \(f(x^{(l)})\), which is a finite set. In case (iii) this holds because at integer points \(f(x^{(k)})\) can only take values that are integer multiples of the reciprocal of the product of the denominators of the coefficients of f (e.g., if all coefficients have denominator 2 or 3, every value is an integer multiple of \(\tfrac{1}{6}\)), and again there are only finitely many such values between \(f(x^{(0)})\) and a finite lower bound. \(\square \)

In particular, for integer linear programming, f is certainly a polynomial. We hence obtain the following condition for termination of Algorithm BCD.

Corollary 7

Let \({\mathrm {P}}\) be a bounded integer linear program with rational coefficients. Algorithm BCD terminates if \(x^{(0)}\) is feasible for \({\mathrm {P}}\) and \(\mathrm {P}_j(x^{(0)}_{-j})\) is bounded for all \(j=1,\ldots ,p\).

Proof

If \(x^{(0)}\) is feasible for \({\mathrm {P}}\) and \(\mathrm {P}_j(x^{(0)}_{-j})\) is bounded for all \(j=1,\ldots ,p\), then the sequence generated by Algorithm BCD exists due to Corollary 3. Hence the result follows from Lemma 6, (iii). \(\square \)

We remark that Lemma 6 and Corollary 7 rely on the integrality of the variables: neither of them holds for continuous linear programs.

3.3 Sequence elements

We now investigate the localization of the sequence elements. For continuous strictly quasiconcave optimization problems, the following is well known: If there exists a minimum then all minimizers lie on the boundary \(\partial {\mathfrak {X}}\) of the feasible set \({\mathfrak {X}}\). In this case, the sequence of points \((x^{(k)})_{k \in {\mathbb {N}}}\) (if it exists) is contained in the boundary of \({\mathfrak {X}}\). For continuous quasiconcave functions, if a minimum exists, a minimizer can be chosen as a boundary point, which is in fact common for most algorithms (e.g. for the Simplex algorithm in linear programming). If such an algorithm is chosen for solving the subproblems we again receive that \((x^{(k)})_{k \in {\mathbb {N}}} \subseteq \partial {\mathfrak {X}}\). In this section we transfer this localization result to integer problems.

Definition 8

Let \({\mathrm {P}}\) be an integer optimization problem with feasible set \({\mathfrak {X}}\subseteq {\mathbb {Z}}^n\). We call \(x \in {\mathfrak {X}}\) close-to-boundary if there is some \(y \in {\mathbb {Z}}^n {{\setminus }} {\mathfrak {X}}\) with \(\Vert x - y \Vert _1 = 1\).

Note that \(\Vert x-y\Vert _1=1\) if and only if there exists \(i \in \{1,\ldots ,n\}\) such that \(y=x+e_i\) or \(y=x-e_i\) (\(e_i \in {\mathbb {R}}^n\) being the i-th unit vector). Let us first give some meaning to this definition for the case that \({\mathfrak {X}}=\{x \in {\mathbb {Z}}^n: x \in F\}\) is given as the set of integer points of some set \(F\). For example, \(F\) might be the feasible set of the linear programming relaxation of an integer program.

Consider a close-to-boundary point \(x \in {\mathfrak {X}}\) and a point \(y \in {\mathbb {Z}}^n {\setminus } {\mathfrak {X}}\) with \(\Vert x-y\Vert _1=1\), say, \(y=x + e_i\) for some \(i \in \{1,\ldots ,n\}\). Since \(x \in F\) and \(y \not \in F\), there exists a point \(z \in \partial F\) on the straight line between x and y which satisfies \(\Vert z-x\Vert _2=\Vert z-x\Vert _1 \le \Vert y-x\Vert _1=1\). This means that the (Euclidean) distance between a close-to-boundary point x and the boundary \(\partial F\) satisfies

$$\begin{aligned} d(x, \partial F) =\inf _{z' \in \partial F} \Vert z'-x\Vert _2 \le \Vert x -z\Vert _2 \le 1, \end{aligned}$$

i.e., all close-to-boundary points in \({\mathfrak {X}}\) are located within a strip along the boundary of \(F\) of width at most one.
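Definition 8 can be checked directly given a membership oracle for \({\mathfrak {X}}\). The following small sketch does so; the oracle `in_X` and the disc-shaped set \(F\) are hypothetical choices for illustration only:

```python
def close_to_boundary(x, in_X):
    """Definition 8: x is close-to-boundary if some integer point y with ||x - y||_1 = 1,
    i.e. some neighbour x +/- e_i, lies outside the feasible set (membership oracle in_X)."""
    for i in range(len(x)):
        for step in (+1, -1):
            y = list(x)
            y[i] += step
            if not in_X(tuple(y)):
                return True
    return False

# hypothetical example: X = integer points of the disc F = {x : x_1^2 + x_2^2 <= 25}
in_X = lambda x: x[0]**2 + x[1]**2 <= 25
print(close_to_boundary((3, 4), in_X))   # True: the neighbour (4, 4) lies outside F
print(close_to_boundary((0, 0), in_X))   # False: all four unit neighbours are feasible
```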

The next lemma shows that for strictly quasiconcave functions the sequence \((x^{(k)})_{k \in {\mathbb {N}}}\) generated by Algorithm BCD is contained in such a strip along the boundary of \(F\).

Lemma 9

Let \({\mathrm {P}}\) be given with a strictly quasiconcave objective function f and let the sequence \((x^{(k)})_{k \in {\mathbb {N}}}\) exist. Then \(x^{(k)}\) is a close-to-boundary point for all \(k \ge 1\).

For an illustration of the lemma, see Fig. 1.

Fig. 1

Let \({\mathfrak {X}}\) be the integer points contained in F. The close-to-boundary points are the filled circles. They are contained in a strip along the boundary \(\partial F\)

Proof

Let \(x=x^{(k)}\) be a solution generated by Algorithm BCD, say after optimizing the variables \(x_{+j}\) of subproblem \(I_j \in {\mathfrak {J}}\). This means that \(f(x) =f(x_{+j},x_{-j}) \le f(y,x_{-j})\) for all y with \((y,x_{-j}) \in {\mathfrak {X}}\). Now, choose \(i \in I_j\) as one of the variables of subproblem \(I_j\) and consider the two points \(x':=x-e_i\) and \(x'':=x+e_i\). Since \(i \in I_j\) we have that \(x'=\left( (x-e_i)_{+j},x_{-j}\right) \) and \(x''=\left( (x+e_i)_{+j},x_{-j}\right) \). Furthermore, x is a convex combination of \(x'\) and \(x''\) and f is strictly quasiconcave; hence we get

$$\begin{aligned} f(x) > \min \{f(x-e_i), f(x+e_i)\} = \min \{f( (x-e_i)_{+j},x_{-j}), f( (x+e_i)_{+j},x_{-j})\}, \end{aligned}$$

and due to the optimality of x we conclude that

$$\begin{aligned} f(y,x_{-j}) > \min \{f( (x-e_i)_{+j},x_{-j}), f( (x+e_i)_{+j},x_{-j})\} \text{ for } \text{ all } y \text{ with } (y,x_{-j}) \in {\mathfrak {X}}. \end{aligned}$$

Hence, either \(x-e_i\) or \(x+e_i\) has to be infeasible and x is close-to-boundary. \(\square \)

4 Properties of the resulting solutions

4.1 Notions of optimality

Definition 10

A feasible solution \(x^*\) of \({\mathrm {P}}\) is called (globally) optimal if for every \(x' \in {\mathfrak {X}}\) we have \(f(x^*) \le f(x')\).

We certainly would wish to receive an optimal solution, and we will see that in special cases this is in fact the result of the Algorithm BCD. However, in general the algorithm converges to some suboptimal solution which can be characterized as follows.

Definition 11

(see, e.g., Tseng 2001) A feasible solution \(x^*\) of \({\mathrm {P}}\) is called blockwise optimal if for every \(j \in \{1,\cdots ,p\}\) and every \(x_{+j}^\prime \in {\mathbb {Z}}^{n_j}\) such that \((x_{+j}^\prime , x_{-j}^*)\) is feasible, it holds that \(f(x^*) \le f(x_{+j}^\prime , x_{-j}^*)\).

If Algorithm BCD terminates, i.e., if the objective value of the solution is not improved during a whole round, then the resulting solution is obviously blockwise optimal. In its game-theoretic interpretation, blockwise optimality is a stable situation since none of the subproblems can be improved if only its own variables may be changed. Here, we assume that all players have the same objective function. A more detailed description of the relation to game theory can for example be found in Kleinberg and Tardos (2014, Section 13.7). The question arising is: what happens if also other variables can be changed? A state in which no subproblem can improve its objective value without worsening at least one other subproblem would be a Pareto solution. In game theory and multi-objective optimization, Pareto solutions are of particular interest. In the following we show that Pareto optimality is stronger than just blockwise optimality if the set \({\mathfrak {J}}\) is a partition, but that this need not be the case otherwise. To use the concept of Pareto optimality, we first have to define objective functions for each of the subproblems. This is easy if the subproblems \({\mathfrak {J}}=\{I_1,\ldots ,I_p\}\) form a partition and the objective function f is additively separable with respect to this partition, i.e.,

$$\begin{aligned} f(x)=\sum _{j=1}^p f_j(x_{+j}) \end{aligned}$$

can be decomposed into a sum of functions \(f_j\) where each function \(f_j\) only depends on the variables from the subproblem \(I_j \in {\mathfrak {J}}\). However, \({\mathfrak {J}}\) does not need to be a partition of the variables \(\{1,\ldots ,n\}\) but may be just a cover. We now transfer the notion of separability to this situation.

To this end, we need the coarsest partition of \(\{1,\ldots ,n\}\) that conforms to the given cover \({\mathfrak {J}}\).

Definition 12

For a given cover \({\mathfrak {J}}=\{I_1,\ldots ,I_p\}\) we define its induced partition as

$$\begin{aligned} {\mathfrak {K}}:=\Bigl \{\Bigl (\bigcap _{j \in J} I_j\Bigr ) {\setminus } \Bigl (\bigcup _{j \notin J} I_j\Bigr ) : J \subseteq \{1,\cdots ,p\}\Bigr \} {\setminus } \{\emptyset \}. \end{aligned}$$

\({\mathfrak {K}}\) contains for every subset \(J \subseteq \{1,\cdots ,p\}\) the set of all variable indices that lie in exactly the sets \(I_j\) with \(j\in J\) if this is non-empty. This is a partition of \(\{1,\cdots ,n\}\) into sets \(K_1,\cdots ,K_q\) which satisfies

$$\begin{aligned} K_{\ell } \subseteq I_j \text{ or } K_{\ell } \subseteq I_{j}^C \text{ for } \text{ all } \quad \ell =1,\ldots ,q,\ j=1,\ldots ,p \end{aligned}$$

and is the coarsest partition (i.e., the one with the smallest number of sets) under this condition.

The identification of the sets \(K_1,\ldots ,K_q\) is demonstrated in two examples.

Example 13

Let us consider five variables \(\{1,2,3,4,5\}\) and three subproblems \({\mathfrak {J}}=\{I_1,I_2,I_3\}\) with \(I_1=\{1,2,3\}\), \(I_2=\{3,4,5\}\), and \(I_3=\{1,2,4,5\}\). This results in

$$\begin{aligned} {\mathfrak {K}}=\left\{ \{1,2\}, \{3\}, \{4,5\} \right\} \ \text{ and } \ q=3 \end{aligned}$$

since for \(J=\{1,2\}\) we receive \((I_1 \cap I_2) {{\setminus }} I_3=\{3\}\), \(J=\{1,3\}\) gives \((I_1 \cap I_3) {{\setminus }} I_2 = \{1,2\}\), \(J=\{2,3\}\) gives \((I_2 \cap I_3) {{\setminus }} I_1 = \{4,5\}\), and all other subsets J result in empty sets.
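The induced partition can be computed by grouping the variable indices according to the exact subset of cover members containing them. The following small sketch (our illustration, with the cover given as a list of Python sets) reproduces the sets of Example 13:

```python
def induced_partition(cover, n):
    """Group the indices 1..n by the exact set of cover members containing them (Definition 12)."""
    classes = {}
    for i in range(1, n + 1):
        signature = frozenset(j for j, I in enumerate(cover) if i in I)  # the index set J
        classes.setdefault(signature, set()).add(i)
    return sorted(classes.values(), key=min)

cover = [{1, 2, 3}, {3, 4, 5}, {1, 2, 4, 5}]     # I_1, I_2, I_3 from Example 13
print(induced_partition(cover, 5))               # [{1, 2}, {3}, {4, 5}], i.e. q = 3
```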

Example 14

Changing \(I_3\) to \(I'_3=\{1,5\}\) also changes the partition \({\mathfrak {K}}\), in this case we receive for \({\mathfrak {J}}=\{I_1,I_2,I'_3\}\) that

$$\begin{aligned} {\mathfrak {K}}=\left\{ \{1\}, \{2\}, \{3\}, \{4\}, \{5\} \right\} \ \text{ and } \ q=5 \end{aligned}$$

with, e.g., \(J=\{1\}\) leading to \(I_1 {{\setminus }} (I_2 \cup I'_3)=\{2\}\).

In case that the sets \({\mathfrak {J}}=\{I_1,\ldots ,I_p\}\) already form a partition we receive \({\mathfrak {K}}={\mathfrak {J}}\) and \(q=p\). Hence, the next definition is a proper generalization of separability as it is known for partitions.

Definition 15

Let \({\mathfrak {J}}\) be a cover of \(\{1,\ldots ,n\}\) and \({\mathfrak {K}}\) be its induced partition with q sets. A function \(f :{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) is called additively separable with respect to \({\mathfrak {J}}\) if \(f(x) = \sum _{\ell =1}^q g_\ell (x_{K_\ell })\) for some functions \(g_\ell :{\mathbb {R}}^{|K_\ell |} \rightarrow {\mathbb {R}}\).

If the objective function f is additively separable, we can assign one objective function to each of the subproblems \(I_j \in {\mathfrak {J}}\), namely, we determine which of the sets \(K_\ell \) are contained in \(I_j\) and sum up their corresponding functions \(g_\ell \). This is formalized next.

Definition 16

Let \({\mathfrak {J}}\) be given and let \({\mathfrak {K}}\) be its induced partition. Let f be additively separable (with respect to \({\mathfrak {J}}\)) with \(f(x) = \sum _{\ell =1}^q g_\ell (x_{K_\ell })\), and let

$$\begin{aligned} f_j(x_{+j}) := \sum _{\ell \in \{1,\cdots ,q\} : K_\ell \subseteq I_j} g_\ell (x_{K_\ell }). \end{aligned}$$

A feasible solution \(x^*\) to \({\mathrm {P}}\) is called

  • weakly Pareto optimal if there does not exist a solution \(x^\prime \) which is feasible for \({\mathrm {P}}\) such that \(f_j(x^\prime _{+j}) < f_j(x^*_{+j})\) for all \(j \in \{1,\ldots ,p\}\).

  • Pareto optimal if there does not exist a solution \(x^\prime \) which is feasible for \({\mathrm {P}}\) and dominates \(x^*\), i.e., such that \(f_j(x^\prime _{+j}) \le f_j(x^*_{+j})\) for all \(j \in \{1,\cdots ,p\}\) and such that \(f_j(x^\prime _{+j}) < f_j(x^*_{+j})\) for some \(j \in \{1,\cdots ,p\}\).

Let us demonstrate the construction of the objective functions for the subproblems by continuing Example 13.

Example 17

We continue Example 13, i.e., we have \(n=5\), \({\mathfrak {J}}=\{I_1,I_2,I_3\}\) with \(I_1=\{1,2,3\}\), \(I_2=\{3,4,5\}\), and \(I_3=\{1,2,4,5\}\) and \({\mathfrak {K}}=\left\{ \{1,2\}, \{3\}, \{4,5\} \right\} \). Let us consider the polynomial

$$\begin{aligned} f(x)= & {} x_2^2 + 2x_3^2 + x_5^2 + x_1x_2 - 3x_4x_5 + 3x_1 + x_4 + 2 x_5 \\= & {} \underbrace{3x_1 + x_2^2 + x_1x_2}_{=:g_1(x_1,x_2)}+ \underbrace{2x_3^2}_{=:g_2(x_3)} +\underbrace{x_5^2 - 3x_4x_5 + x_4 + 2 x_5}_{=:g_3(x_4,x_5)} \end{aligned}$$

The resulting functions \(f_1,f_2,f_3\) for the subproblems \(I_1,I_2,I_3\) are then given as

$$\begin{aligned} f_1(x_1,x_2,x_3)= & {} g_1(x_1,x_2) + g_2(x_3) = 3x_1 + x_2^2 + x_1x_2 + 2x_3^2\\ f_2(x_3,x_4,x_5)= & {} g_2(x_3) + g_3(x_4,x_5) = 2x_3^2 + x_5^2 - 3x_4x_5 + x_4 + 2 x_5\\ f_3(x_1,x_2,x_4,x_5)= & {} g_1(x_1,x_2) + g_3(x_4,x_5) \\&= 3x_1 + x_2^2 + x_1x_2 + x_5^2 - 3x_4x_5 + x_4 + 2 x_5 \end{aligned}$$

Note that \(\sum _{j=1}^3 f_j(x_{+j})=2 f(x)\) since each of the variables \(x_i\) appears in exactly two of the sets \(I_1,I_2,I_3\).
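As a quick numerical sanity check (our illustration, not part of the original example), one can verify the decomposition \(f=g_1+g_2+g_3\) and the identity \(\sum _{j=1}^3 f_j(x_{+j})=2f(x)\) on random integer points:

```python
import random

def f(x):
    x1, x2, x3, x4, x5 = x
    return x2**2 + 2*x3**2 + x5**2 + x1*x2 - 3*x4*x5 + 3*x1 + x4 + 2*x5

g1 = lambda x1, x2: 3*x1 + x2**2 + x1*x2          # block K_1 = {1, 2}
g2 = lambda x3: 2*x3**2                           # block K_2 = {3}
g3 = lambda x4, x5: x5**2 - 3*x4*x5 + x4 + 2*x5   # block K_3 = {4, 5}

f1 = lambda x1, x2, x3: g1(x1, x2) + g2(x3)           # blocks contained in I_1 = {1, 2, 3}
f2 = lambda x3, x4, x5: g2(x3) + g3(x4, x5)           # blocks contained in I_2 = {3, 4, 5}
f3 = lambda x1, x2, x4, x5: g1(x1, x2) + g3(x4, x5)   # blocks contained in I_3 = {1, 2, 4, 5}

for _ in range(5):
    x1, x2, x3, x4, x5 = (random.randint(-5, 5) for _ in range(5))
    assert f((x1, x2, x3, x4, x5)) == g1(x1, x2) + g2(x3) + g3(x4, x5)
    assert f1(x1, x2, x3) + f2(x3, x4, x5) + f3(x1, x2, x4, x5) == 2 * f((x1, x2, x3, x4, x5))
print("decomposition and sum_j f_j = 2 f verified")
```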

4.2 Relations between the notions

In this section we investigate the relations between the three notions of optimality, blockwise optimality, and Pareto optimality for the case that the objective function is additively separable and every element of \({\mathfrak {K}}\) is contained in exactly \(\pi \in {\mathbb {N}}\) elements of \({\mathfrak {J}}\).

Clearly, Pareto optimality implies weak Pareto optimality. However, in general there is no relation between Pareto optimality and blockwise optimality: Pareto optimality does not imply blockwise optimality, nor does blockwise optimality imply weak Pareto optimality.

Example 18

In the integer linear program

with the two subproblems given by \(I_1 = \{1\}\) and \(I_2 = \{2\}\) the solution (1, 1) is blockwise optimal since in \({\mathrm {P}}_1(1)\) the only feasible solution is \(x_1=1\) and in \({\mathrm {P}}_2(1)\) the only feasible solution is \(x_2=1\). For discussing Pareto optimality we look at \(f_1(x_1)=x_1\) and \(f_2(x_2)=x_2\) as the two parts of the additively separable objective function and receive that (1, 1) is not weakly Pareto optimal since (0, 0) is a strict improvement for both subproblems.

Example 19

We consider the integer linear program

with the groups \(I_1 = \{1,2\}\), \(I_2 = \{2,3\}\), and \(I_3 = \{1,3\}\), yielding the partition \({\mathfrak {K}}=\{ \{1\},\{2\},\{3\}\}\), and hence the objective functions for the subproblems are \(f_1(x_1,x_2)=x_1+x_2\), \(f_2(x_2,x_3)=x_2+x_3\), and \(f_3(x_1,x_3)=x_1+x_3\). The solution \(x^*=(0,0,0)\) is then Pareto optimal: To see this, we assume that \(x'\) is also feasible. If \(x'_2=0\) then \(x'=x^*\). If \(x'_2 > 0\) then \(f_2(x'_2,x'_3)>0=f_2(x^*_2,x^*_3)\), so \(x'\) does not dominate \(x^*\), and \(x^*\) is Pareto optimal.

On the other hand, \(x^*\) is not blockwise optimal, since an optimal solution to

$$\begin{aligned} ({\mathrm {P}}_1(0)) \ \ \min \{ x_1 + x_2: x_1 + 2 x_2 =0, x_2 \le 2, x_2 \ge 0 \} \end{aligned}$$

is given as \(x'_1=-4,\ x'_2=2\) with an objective function value of \(f(x'_1,x'_2,0) =-2 < 0= f(0,0,0)\).

Nevertheless, we now give conditions under which Pareto optimality implies blockwise optimality and under which blockwise optimality implies weak Pareto optimality.

Theorem 20

Let \({\mathrm {P}}\) be given with a partition \(\{I_1, \cdots , I_p\}\) of the variable indices and an additively separable objective function f. Then every Pareto optimal solution of \({\mathrm {P}}\) is blockwise optimal.

Proof

Since \(\{I_1,\ldots ,I_p\}\) is a partition we have \({\mathfrak {K}}={\mathfrak {J}}\) and \(f(x)=\sum _{j=1}^p f_j(x_{+j})\) with \(x_{+j}\) containing the variables from \(I_j\), \(j=1,\ldots ,p\). Assume that \(x^*\) is not blockwise optimal, i. e., that there is a \(j_0 \in \{1,\cdots ,p\}\) and an \(x_{+j_0}^\prime \) such that \((x_{+j_0}^\prime , x_{-j_0}^*)\) is feasible and \(f(x_{+j_0}^\prime , x_{-j_0}^*) < f(x^*)\). Define \(x^\prime := (x_{+j_0}^\prime , x_{-j_0}^*)\). Then for all \(j \in \{1,\cdots ,p\} {{\setminus }} \{j_0\}\) it holds that \(x_{+j}^\prime = x_{+j}^*\), whence \(f_j(x_{+j}^\prime ) = f_j(x_{+j}^*)\), and for \(j_0\) we have

$$\begin{aligned} f_{j_0}(x_{+j_0}^\prime )= & {} f(x^\prime ) - \sum _{j \in \{1,\cdots ,p\} {{\setminus }} \{j_0\}} f_j(x_{+j}^\prime ) < f(x^*)\\&- \sum _{j \in \{1,\cdots ,p\} {{\setminus }} \{j_0\}} f_j(x_{+j}^*) = f_{j_0}(x_{+j_0}^*). \end{aligned}$$

So \(x^*\) is not Pareto optimal. \(\square \)

The theorem says that Pareto optimality is a stronger concept than blockwise optimality in case that the subproblems form a partition and the objective function is additively separable. The interpretation is the following: if it is not possible to improve any of the subproblems without worsening at least one other subproblem, then also none of the subproblems can improve its solution by changing only its own variables. Intuitively, the reverse direction will usually not hold. This can be seen in Example 18 showing that blockwise optimality does not imply weak Pareto optimality even for linear programs and subproblems forming a partition.

However, as the next theorem shows, there exist cases in which the solutions of integer programs with linear objective function found by Algorithm BCD are weakly Pareto optimal. We need the following condition.

Definition 21

Let a set of subproblems \({\mathfrak {J}}=\{I_1, \cdots , I_p\}\) with corresponding induced partition \({\mathfrak {K}}=\{K_1,\ldots ,K_q\}\) be given. We say that \({\mathfrak {J}}\) has uniform structure if every set \(K_\ell \) is included in the same number \(\pi \) of sets in \({\mathfrak {J}}\), i.e., \(|\{j \in \{1,\cdots ,p\} \mid K_\ell \subseteq I_j\}| = \pi \) for every \(\ell \in \{1,\cdots ,q\}\).

We remark that \({\mathfrak {J}}\) has uniform structure in Example 13 (\(\pi =2\)), in Example 18 (\(\pi =1\)), and in Example 19 (again, \(\pi =2\)). Also in typical settings in which \({\mathfrak {J}}\) is chosen as the set of all groups of variables with the same number of elements, i.e.,

$$\begin{aligned} {\mathfrak {J}}=\{ J \subseteq \{1,\ldots ,n\}: |J|=k\} \end{aligned}$$

for some given (usually small) number \(k \in {\mathbb {N}}\), it has uniform structure (with \(\pi =\binom{n-1}{k-1}\), since every variable is contained in the same number of k-element subsets).
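Whether a given cover has uniform structure can be checked without computing \({\mathfrak {K}}\) explicitly: since all indices of one class \(K_\ell \) lie in exactly the same cover sets, it suffices to count, for every variable index, the number of cover sets containing it. A small sketch of ours:

```python
def uniform_structure_pi(cover, n):
    """Return pi if the cover has uniform structure (Definition 21), otherwise None.
    All indices of one induced class K_ell share the same containing cover sets,
    so per-index counts coincide with per-class counts."""
    counts = {sum(1 for I in cover if i in I) for i in range(1, n + 1)}
    return counts.pop() if len(counts) == 1 else None

print(uniform_structure_pi([{1, 2, 3}, {3, 4, 5}, {1, 2, 4, 5}], 5))  # 2    (Example 13)
print(uniform_structure_pi([{1, 2}, {2, 3}, {1, 3}], 3))              # 2    (Example 19)
print(uniform_structure_pi([{1, 2}, {2, 3}], 3))                      # None (no uniform structure)
```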

Using the uniform structure we now can give a sufficient condition under which each blockwise optimal solution is weakly Pareto optimal.

Theorem 22

Let \({\mathrm {P}}\) be given with \({\mathfrak {J}}=\{I_1, \cdots , I_p\}\) and induced partition \({\mathfrak {K}}=\{K_1,\ldots ,K_q\}\). Let \({\mathfrak {J}}\) have uniform structure. If the objective function f of \({\mathrm {P}}\) is linear and the feasible set is given as

$$\begin{aligned} {\mathfrak {X}}=\{x \in {\mathbb {Z}}^n: a^{{\mathrm {T}}} x \le b,\ G_{\ell }(x_{K_{\ell }}) \le 0 \text{ for } \quad \ell \in \{1,\ldots ,q\}\} \end{aligned}$$

with one coupling constraint \(a^{{\mathrm {T}}} x \le b\) for \(a \in {\mathbb {R}}^n, b\in {\mathbb {R}}\) and for every \(\ell \in \{1,\cdots ,q\}\) some \(m_\ell \in {\mathbb {N}}\) constraints \(G_{\ell }:{\mathbb {R}}^{|K_\ell |} \rightarrow {\mathbb {R}}^{m_\ell }\) only containing variables from the same set \(K_\ell \in {\mathfrak {K}}\), then every blockwise optimal solution to \({\mathrm {P}}\) is weakly Pareto optimal.

Proof

Since f is linear, it is additively separable, i.e., \(f(x) = \sum _{\ell =1}^q g_\ell (x_{K_\ell })\) for some linear functions \(g_\ell \). Consequently, we have for every subproblem \(I_j\) and \((x_{+j},0_{-j})\) that

$$\begin{aligned} f(x_{+j},0_{-j}) = \sum _{\ell \in \{1,\cdots ,q\} : K_\ell \subseteq I_j} g_\ell (x_{K_\ell }) + \sum _{\ell \in \{1,\cdots ,q\} : K_\ell \subseteq I^C_j} \underbrace{g_\ell (0_{K_\ell })}_{=0} = f_j(x_{+j}) \end{aligned}$$

with \(f_j(x_{+j})\) the objective for subproblem \(I_j\) from Definition 16.

Let \(x^* \in {\mathbb {Z}}^n\) be feasible, but not weakly Pareto optimal. So there is an \(x^\prime \in {\mathbb {Z}}^n\) with \(a^{\mathrm {T}} x^\prime \le b\), \(G_{\ell }(x^\prime _{K_\ell }) \le 0\), \(\ell =1,\ldots ,q\), and \(f_j(x_{+j}^\prime ) < f_j(x_{+j}^*)\) for all \(j \in \{1,\cdots ,p\}\). Let \(y := x^\prime - x^*\), and \(\beta := \max \{0, a^{{\mathrm {T}}} y\}\). Since every variable is contained in exactly \(\pi \) blocks, we have that

$$\begin{aligned}y = \frac{1}{\pi }\cdot \sum _{j=1}^p (y_{+j},0_{-j}).\end{aligned}$$

Hence,

$$\begin{aligned} \beta \ge a^{\mathrm {T}} y = \frac{1}{\pi }\sum _{j=1}^p a^{\mathrm {T}} (y_{+j},0_{-j}) \ge \frac{p}{\pi }\min _{j \in \{1,\cdots ,p\}} a_{+j}^{\mathrm {T}} y_{+j}. \end{aligned}$$

Let \(j_0\) be an index where the minimum is attained, so \(\frac{p}{\pi }a^{\mathrm {T}} (y_{+j_0}, 0_{-j_0}) \le \beta \). Since \(\pi \le p\) and \(\beta \ge 0\), this implies that \(a^{\mathrm {T}} (y_{+j_0}, 0_{-j_0}) \le \beta \). Define

$$\begin{aligned} x^{\prime \prime } := x^* + (y_{+j_0},0_{-j_0})= (x_{+j_0}',x_{-j_0}^*), \end{aligned}$$

i.e., \(x^{\prime \prime }\) coincides with \(x^\prime \) for all variables \(i \in I_{j_0}\) and it coincides with \(x^*\) for all variables \(i \not \in I_{j_0}\). We obtain that \(x^{\prime \prime }\) is feasible:

  • For every set \(K_\ell \) we have either \(K_\ell \subseteq I_{j_0}\) or \(K_\ell \subseteq I_{j_0}^C\). In the former case \(G_\ell (x^{\prime \prime })=G_\ell (x^\prime ) \le 0\) and in the latter case \(G_\ell (x^{\prime \prime })=G_\ell (x^*) \le 0\).

  • \(a^{\mathrm {T}} x^{\prime \prime } = a^{\mathrm {T}} x^* + a^{\mathrm {T}} (y_{+j_0},0_{-j_0}) \le a^{\mathrm {T}} x^* + \beta = \max \{a^{\mathrm {T}} x^*, a^{\mathrm {T}} x^\prime \} \le b\).

Furthermore, using that f is linear and the construction of \(x''\), we obtain

$$\begin{aligned} f(x^{\prime \prime }) = f(x^*) + f(y_{+j_0},0_{-j_0}) = f(x^*) +f_{j_0}(x_{+j_0}^\prime ) - f_{j_0}(x_{+j_0}^*) < f(x^*). \end{aligned}$$

So \(x^*\) was not blockwise optimal. \(\square \)

We now apply Theorem 22 to integer linear programs and assume that \({\mathfrak {J}}\) forms a partition, i.e., \({\mathfrak {J}}={\mathfrak {K}}\).

Corollary 23

Let \({\mathrm {P}}\) be an integer linear program with a partition \({\mathfrak {J}}=\{I_1, \cdots , I_p\}\) of the variable indices with \(n_j:=|I_j|\), which may have

  • \(m_j\) constraints involving variables from the same set \(I_j \in {\mathfrak {J}}\),

  • but only one coupling constraint,

i.e.,

$$\begin{aligned} \min \ \ & \sum _{j=1}^p c_{+j}^{{\mathrm {T}}} x_{+j}\\ \text{ s.t. } \ \ & A_{j} x_{+j} \le b_j, \quad j=1,\ldots ,p,\\ & a^{{\mathrm {T}}}(x_{+1},\ldots ,x_{+p}) \le b,\\ & x_{+j} \in {\mathbb {Z}}^{n_j},\quad j=1,\ldots ,p, \end{aligned}$$

with \(c_{+j} \in {\mathbb {R}}^{n_j}\), \(A_j \in {\mathbb {R}}^{m_j \times n_j}\), \(b_j \in {\mathbb {R}}^{m_j}\), \(a \in {\mathbb {R}}^n\), and \(b \in {\mathbb {R}}\).

Then Algorithm BCD terminates with a weakly Pareto optimal solution for P.

Note that Algorithm BCD terminates with a global optimum in the case of Corollary 23 if no coupling constraint is present.

We finally investigate the relation to (globally) optimal solutions. By definition, each optimal solution is blockwise optimal; otherwise we could improve it by a step of Algorithm BCD. The relation between optimal solutions and Pareto optimal solutions is more interesting.

Lemma 24

Let \({\mathrm {P}}\) be given with \({\mathfrak {J}}=\{I_1, \cdots , I_p\}\) and induced partition \({\mathfrak {K}}=\{K_1,\ldots ,K_q\}\). Let f be additively separable with respect to \({\mathfrak {J}}\) and let \({\mathfrak {J}}\) have uniform structure. Then every (globally) optimal solution is Pareto optimal.

Proof

Assume that x is a feasible solution to \({\mathrm {P}}\) which is optimal, but not Pareto optimal. This means that there exists a solution \(x'\) such that \(f_j(x'_{+j}) \le f_j(x_{+j})\) for all sets \(I_j\), \(j=1,\ldots ,p\), and the inequality is strict for one \({\bar{j}} \in \{1,\ldots ,p\}\). Since \({\mathfrak {J}}\) has uniform structure, we use separability to see that

$$\begin{aligned} \sum _{j=1}^p f_j(x_{+j})= & {} \sum _{j=1}^p \ \ \sum _{\ell \in \{1,\cdots ,q\} : K_\ell \subseteq I_j} g_\ell (x_{K_\ell })\\= & {} \sum _{\ell = 1,\cdots ,q} \pi g_\ell (x_{K_\ell }) \ = \ \pi \sum _{\ell = 1}^q g_\ell (x_{K_\ell })\\= & {} \pi f(x) \end{aligned}$$

We conclude that

$$\begin{aligned} f(x')=\frac{1}{\pi } \sum _{j=1}^p f_j(x'_{+j}) < \frac{1}{\pi } \sum _{j=1}^p f_j(x_{+j}) = f(x), \end{aligned}$$

a contradiction to x being an optimal solution to \({\mathrm {P}}\). \(\square \)

Finally, the next example shows that optimal solutions need not be Pareto optimal in general if the assumption of Lemma 24 does not hold.

Example 25

We consider the integer linear program

with the groups \(I_1 = \{1,2\}\) and \(I_2 = \{2,3\}\), yielding the partition \({\mathfrak {K}}=\{ \{1\},\{2\},\{3\}\}\), and hence the objective functions for the subproblems are \(f_1(x_1,x_2)=x_1+x_2\), \(f_2(x_2,x_3)=x_2+x_3\).

By reducing \({{\mathrm {P}}}\) to \(\min \{\frac{7}{2} x_1: x_1 \ge 0\}\) we easily see that \(x=(0,0,0)\) is the optimal solution to the problem. For the two subproblems \(I_1\) and \(I_2\) it yields the objective values \(f_1(x)=f_2(x)=0\).

However, this solution is not Pareto optimal. To see this, consider \(x'=(2,-3,2)\). This is a feasible solution with objective function value \(f(x')=1 > f(x)\), hence it is not optimal. However, it dominates \(x\) in both subproblems: \(f_1(x'_{+1})=x'_1+x'_2=-1 < 0 = f_1(x_{+1})\) and, analogously, \(f_2(x'_{+2})=-1 < 0 = f_2(x_{+2})\).

In Fig. 2 we summarize our findings: In the general case, i.e., if \({\mathfrak {J}}\) does not form a partition, each (globally) optimal solution is blockwise optimal, but an optimal solution need not be Pareto optimal, see Example 25. If we strengthen our assumptions and require uniform structure of \({\mathfrak {J}}\), then each optimal solution is both blockwise optimal and Pareto optimal. Finally, in the case of \({\mathfrak {J}}\) being a partition, each optimal solution is Pareto optimal and each Pareto optimal solution is blockwise optimal.

Fig. 2

Summary of the relations between the notions of optimality. Left: for general problems. Middle: for problems with uniform structure. Right: for problems with \({\mathfrak {J}}\) being a partition

At the end of this section, we mention that for matroid optimization problems, the blockwise coordinate descent method in fact converges to an optimal solution. This is mainly of theoretical interest since algorithmically one would use the well-known greedy approach to find an optimal solution of a matroid optimization problem.

Recall that the matroid optimization problem is defined as follows: Let \((E,{\mathcal {I}})\) be a matroid, i.e., E is a ground set with n elements and \({\mathcal {I}}\subseteq {{\mathcal {P}}}(E)\) contains the so-called independent sets of E, and let weights \(w_e\) for all \(e \in E\) be given. The matroid optimization problem asks for an independent set \(A \subseteq E\) with maximum weight, i.e.,

$$\begin{aligned} \max \left\{ f(A)=\sum _{e \in A} w_e: A \in {\mathcal {I}}\right\} . \end{aligned}$$

A matroid optimization problem can be written as integer program with variables \(x_e\) being one if and only if e is contained in the independent set:

$$\begin{aligned} (P) \ \ \max \Big \{ \sum _{e \in E} w_e x_e: \sum _{e \in A} x_e \le r(A) \text{ for all } A \subseteq E,\ x_e \in \{0,1\} \text{ for all } e \in E\Big \} \end{aligned}$$

where \(r:{\mathcal {P}}(E)\rightarrow {\mathbb {Z}}_{\ge 0}\) is the (submodular) rank function of the matroid. Each variable of \({\mathrm {P}}\) corresponds to an element \(e \in E\). We hence can use the elements of E directly to define the subsets of variables in our optimization problems \(\text{ P }_{I_j}\).

For matroid optimization problems, it is known (see, e.g., Dasgupta et al. 2008) that one can start with any independent set and by the exchange property one can reach any other independent set by two-element swaps. Such a sequence of exchanges can easily be constructed. For blockwise coordinate descent, however, we assume that the sequence of subproblems is fixed, i.e., it cannot be chosen according to the current independent set. Still, Algorithm BCD finds an optimal solution in matroid optimization problems when all two-element sets of variables are contained in \({\mathfrak {J}}\).

Theorem 26

Let \({\mathrm {P}}\) be a matroid optimization problem. Let all two-element subsets be contained in \({\mathfrak {J}}\), i.e.,

$$\begin{aligned} \{ \{e,e'\}: e,e' \in E \text{ and } e \not =e'\} \subseteq {\mathfrak {J}}. \end{aligned}$$

Then Algorithm BCD terminates with a (globally) optimal solution of \({\mathrm {P}}\) for any feasible starting solution.

Proof

(Sketch): Let \(x^{\mathrm{BCD}}\) be a non-optimal solution, i.e., there exists some solution \(x^*\) to the matroid optimization problem \({\mathrm {P}}\) with \(f(x^*) > f(x^{\mathrm{BCD}})\). Let \(e^*\) be an element with highest weight which is not contained in the solution \(x^{\mathrm{BCD}}\), but contained in \(x^*\) (i.e., with \(x^{\mathrm{BCD}}_{e^*}=0\) and \(x^*_{e^*}=1\)). Based on the exchange property for matroids, we exchange \(e^*\) with another element \(e'\). It can be justified that we receive a feasible solution with better objective value. Hence, Algorithm BCD cannot have terminated with \(x^{\mathrm{BCD}}\) since for \(I:=\{e',e^*\} \in {\mathfrak {J}}\) the algorithm would have found this improvement. \(\square \)
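The behaviour stated in Theorem 26 can be observed on a toy instance. The following sketch (our illustration on a uniform matroid, where the independent sets are simply all subsets with at most r elements; it is not the general construction of the proof) runs BCD over all two-element blocks and reaches the optimum from an arbitrary feasible start:

```python
from itertools import combinations

def bcd_uniform_matroid(weights, r, start):
    """BCD over all two-element blocks {e, e'} for the uniform matroid U_{r,n}:
    feasible 0/1 vectors are exactly those selecting at most r elements."""
    E = range(len(weights))
    obj = lambda s: sum(weights[e] * s[e] for e in E)
    x = list(start)
    improved = True
    while improved:
        improved = False
        for e, e2 in combinations(E, 2):            # the blocks I = {e, e'} of Theorem 26
            best = x
            for a in (0, 1):
                for b in (0, 1):
                    cand = list(x)
                    cand[e], cand[e2] = a, b
                    if sum(cand) <= r and obj(cand) > obj(best):
                        best = cand
            if best is not x:
                x, improved = best, True
    return x

w = [5, 1, 4, 3, 2]
print(bcd_uniform_matroid(w, r=2, start=[0, 1, 0, 0, 1]))   # [1, 0, 1, 0, 0]: the two largest weights
```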

5 Relation to the continuous relaxation

In many applications, even the subproblems \({\mathrm {P}}_{I}(x)\) are hard to solve (in particular in the application mentioned in the next section). Hence, one could try to find a solution by not solving the integer programming problems, but their relaxations in every step of Algorithm BCD. Here we investigate what we can conclude from a solution obtained by applying Algorithm BCD to the continuous relaxation of \({\mathrm {P}}\) instead of applying it to the problem \({\mathrm {P}}\) itself.

Let \({\mathrm {R}}\) denote the continuous relaxation of \({\mathrm {P}}\). Clearly, applying Algorithm BCD to \({\mathrm {R}}\) instead of \({\mathrm {P}}\) does not yield an upper bound on the optimal objective value of \({\mathrm {P}}\). The following example of an integer linear program shows that it neither yields a lower bound, even if

  • Algorithm BCD is started at the same feasible solution \(x^{(0)}\) when solving \({\mathrm {R}}\) and \({\mathrm {P}}\),

  • it is applied to the relaxation of the ideal formulation of \({\mathrm {P}}\), and

  • it terminates with an integer point when applied to the continuous relaxation \({\mathrm {R}}\).

Example 27

Consider the integer linear program

This is the ideal formulation for the feasible set because every feasible point of its linear programming relaxation \({\mathrm {R}}\) can be written as a convex combination of feasible points of \({\mathrm {P}}\). We apply Algorithm BCD to this problem as well as to its continuous relaxation \({\mathrm {R}}\) w.r.t. the subproblems \(I_1 = \{1\}\) and \(I_2 = \{2\}\). If \(x^{(0)} = (9,9)\), then for the integer program, we obtain \(x^{(1)}_{{\mathrm {P}}} = (5,9), \ x^{(2)}_{{\mathrm {P}}} = (5,0)\), and \(x^{(3)}_{{\mathrm {P}}} = (0,0)\), where the algorithm terminates. This point has objective value 0. Applied to the linear relaxation, we obtain \(x^{(1)}_{{\mathrm {R}}} = \bigl (\frac{9}{2},9 \bigr ), \ x^{(2)}_{{\mathrm {R}}} = \bigl (\frac{9}{2},-1\bigr )\), and \(x^{(3)}_{{\mathrm {R}}} = (2,-1)\), which has objective value 1 and is thus worse than the value obtained from applying Algorithm BCD to \({\mathrm {P}}\). Both sequences are illustrated in Fig. 3.

Fig. 3

Illustration of Example 27, on the left Algorithm BCD applied to \({\mathrm {P}}\), and on the right to the continuous relaxation \({\mathrm {R}}\) of its ideal formulation

Note that in the example, the solution obtained by Algorithm BCD applied to the continuous relaxation is still blockwise optimal for the original problem \({\mathrm {P}}\). In case Algorithm BCD happens to find a solution which is feasible for \({\mathrm {P}}\) this holds in general, as the next lemma shows. Moreover, the lemma gives a condition under which also Pareto optimality for a solution to \({\mathrm {R}}\) is transferred to \({\mathrm {P}}\).

Lemma 28

Let \(Y \subset {\mathbb {R}}^n\) be a set containing the feasible set X of \({\mathrm {P}}\), and let

$$\begin{aligned} \begin{array}{rrl} ({\mathrm {R}}) &{} \min &{} f(x) \\ &{}\text {s. t.} &{}x \in Y. \end{array} \end{aligned}$$

Let \({\mathfrak {J}}\) be the subproblems to be considered.

  1. Let \(x^*\) be a (strictly) blockwise optimal solution to \({\mathrm {R}}\) which is feasible to \({\mathrm {P}}\). Then \(x^*\) is (strictly) blockwise optimal to \({\mathrm {P}}\).

  2. Let f be additively separable with respect to \({\mathfrak {J}}\), and let \(x^*\) be a (weakly) Pareto optimal solution to \({\mathrm {R}}\) which is feasible to \({\mathrm {P}}\). Then \(x^*\) is (weakly) Pareto optimal to \({\mathrm {P}}\).

Proof

  1. If \(x^*\) were not blockwise optimal for \({\mathrm {P}}\), there would be a \(j \in \{1,\cdots ,p\}\) and an \(x_{+j}^\prime \) with \((x_{+j}^\prime , x_{-j}^*) \in X\) and \(f(x_{+j}^\prime , x_{-j}^*) < f(x^*)\). Then \((x_{+j}^\prime , x_{-j}^*) \in Y\) so that \(x^*\) would not be blockwise optimal for \({\mathrm {R}}\). For strict blockwise optimality, the proof is analogous.

  2. If \(x^*\) were not weakly Pareto optimal for \({\mathrm {P}}\), there would be an \(x^\prime \in X\) with \(f_j(x_{+j}^\prime ) < f_j(x_{+j}^*)\) for all \(j \in \{1,\cdots ,p\}\). Since \(x^\prime \in Y\), the solution \(x^*\) would not be weakly Pareto optimal to \({\mathrm {R}}\). The statement for Pareto optimality can be proven in the same way.

\(\square \)

From Lemma 28 we obtain the following consequences for Algorithm BCD.

Corollary 29

Let \({\mathrm {R}}\) be the continuous relaxation of \({\mathrm {P}}\), and let \(x_{{\mathrm {R}}}\) be the solution obtained by applying Algorithm BCD to \({\mathrm {R}}\). We then have:

  • If \(x_{{\mathrm {R}}}\) is feasible for \({\mathrm {P}}\), it is blockwise optimal for \({\mathrm {P}}\).

  • Under the conditions of Theorem 22, if \(x_{{\mathrm {R}}}\) is feasible for \({\mathrm {P}}\), then it is weakly Pareto optimal for \({\mathrm {P}}\).

6 Application of Algorithm BCD for improving solutions of sequential processes

Many applications deal with multi-stage optimization problems in which the stages are optimized sequentially because it is computationally too complex to solve the integrated problem as a whole. Let us look at such a series of p optimization problems. The sequential process starts by solving the first problem \((\text{ P }'_1)\), i.e., it determines the variables \(x_{+1} \in {\mathbb {Z}}^{n_1}\) appearing in the first stage. The obtained solution is then fixed in the second stage in which the variables \(x_{+2}\) are determined. The process continues until in the last step, the values of all previous stages, i.e., \(x_{+1},\ldots ,x_{+(p-1)}\) are fixed and a solution \(x_{+p}\) to the last stage is determined. This gives the sequential solution \(x^{\mathrm{seq}}=(x_{+1},\ldots ,x_{+p})\) to the multi-stage optimization problem.

$$\begin{aligned} (\mathrm{P}'_1) \quad &\min \{f_1(x_{+1}): G_1(x_{+1}) \le 0,\ x_{+1} \in {\mathbb {Z}}^{n_1}\} \\ (\mathrm{P}'_2)(x_{+1}) \quad &\min \{f_2(x_{+1},x_{+2}): G_2(x_{+1},x_{+2}) \le 0,\ x_{+2} \in {\mathbb {Z}}^{n_2}\} \\ (\mathrm{P}'_3)(x_{+1},x_{+2}) \quad &\min \{f_3(x_{+1},x_{+2},x_{+3}): G_3(x_{+1},x_{+2},x_{+3}) \le 0,\ x_{+3} \in {\mathbb {Z}}^{n_3}\} \\ &\qquad \vdots \\ (\mathrm{P}'_p)(x_{+1},\ldots ,x_{+(p-1)}) \quad &\min \{f_p(x_{+1},\ldots ,x_{+p}): G_p(x_{+1},\ldots ,x_{+p}) \le 0,\ x_{+p} \in {\mathbb {Z}}^{n_p}\} \end{aligned}$$

In many applications the resulting solution \(x^{\mathrm{seq}}\) of the sequential process is then returned as solution to the problem. However, it is usually not optimal for the (integrated) multi-stage optimization problem

$$\begin{aligned} \text{(MP) } \ \ \min \{f(x_{+1},\ldots ,x_{+p}): G(x_{+1},\ldots ,x_{+p}) \le 0, \quad \ x \in {\mathbb {Z}}^{n}\} \end{aligned}$$
(1)

where \(n:=\sum _{i=1}^p n_i\) is the number of all variables of all stages, \(G:{\mathbb {R}}^n \rightarrow {\mathbb {R}}^M\) contains all constraints of the single stages, and f is the objective function given as a positive linear combination of the objective functions of the p planning steps, i.e.,

$$\begin{aligned} f(x_{+1},\ldots ,x_{+p}):=\sum _{i=1}^p \alpha _i f_i(x_{+1},\ldots ,x_{+i}). \end{aligned}$$
(2)

In order to improve the sequential solution \(x^{\mathrm{seq}}\) we propose to apply Algorithm BCD with subproblems \(I_1,\ldots ,I_p\) corresponding to the variables \(x_{+1},\ldots ,x_{+p}\) of the stages \(1,\ldots ,p\), i.e., \(|I_i|=n_i\). The p subproblems arising during the execution of Algorithm BCD are

$$\begin{aligned} (\mathrm{P}_i) \ \ \ \min \{f(x_{+i},x_{-i}): G(x_{+i},x_{-i}) \le 0,\ x=(x_{+i},x_{-i}) \in {\mathbb {Z}}^{n}\}, \quad \ i=1,\ldots ,p \end{aligned}$$

The problems \((\text{ P }_i)\) and \((\text{ P }'_i)\) have the same variables and may be similar to each other such that algorithms for solving the subproblems \((\text{ P }_i)\) may be derived from known algorithms used in the sequential planning process. In fact, \(\text{ P }_p\) and \(\text{ P }'_p\) coincide. The resulting algorithm is depicted in Fig. 4, where the construction of the starting solution is shown in the first p steps, before Algorithm BCD iterates along a cycle to improve the starting solution. As indicated in Fig. 4, the iterations of BCD can be depicted as a path in a network whose nodes are the subproblems of (MP). The full network including all possible own (“eigen”) subproblems of (MP) is called Eigenmodel and is proposed as a systematic way to formalize coordinate descent methods in applications (Schöbel 2017). Algorithm BCD generates a path within the network for any sequence of subproblems. Note that changing the order in which the subproblems are solved to construct the starting solution generates other paths (which are not depicted in Fig. 4). For public transportation such other paths have been investigated in Pätzold et al. (2017). For general problems, such starting paths are under investigation in Schiewe and Schöbel (2019).

Fig. 4

Construction of a starting solution and applying Algorithm BCD for a multi-stage optimization problem MP
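The scheme of Fig. 4 can be summarized as follows in a rough sketch, under the assumption that each stage provides a solver for its sequential problem \(\mathrm {P}'_i\); the interface and names are ours, not part of the original model. The sequential solution then serves as starting solution for the bcd routine sketched in Sect. 2.

```python
def sequential_start(stage_solvers):
    """Construct the sequential solution x^seq = (x_{+1}, ..., x_{+p}) of Fig. 4.
    stage_solvers[i](fixed) returns the optimal variable values of stage i+1 as a tuple,
    given the tuple of already fixed variables of the earlier stages (hypothetical interface)."""
    fixed = ()
    for solve_stage in stage_solvers:
        fixed = fixed + tuple(solve_stage(fixed))   # fix what has been planned so far
    return fixed                                    # x^(0) for Algorithm BCD

# x0 = sequential_start(stage_solvers)              # the first p steps in Fig. 4
# x  = bcd(blocks, solve_subproblem, f, x0)         # then iterate BCD over the stage blocks I_1, ..., I_p
```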

In the following we summarize what the results derived in this paper mean for improving a sequential solution if all problems \((\text{ P }_i)\) are linear integer optimization problems. We apply Algorithm BCD for the subproblems \({\mathfrak {J}}=\{I_1,\ldots ,I_p\}\) which correspond to the different stages \(1,\ldots ,p\) as described above. This is a special case since \(I_1,\ldots ,I_p\) form a partition, and the objective function (2) is a linear combination of the single objectives of the p planning stages and hence always additively separable. Using this, Theorem 20 and Lemma 24 give the following result.

Corollary 30

Let the multi-stage optimization problem (MP) be given. Then every optimal solution to (MP) is Pareto optimal, and every Pareto optimal solution to (MP) is blockwise optimal (i.e., we have the situation in the rightmost picture of Fig. 2).

We now summarize the results when we apply Algorithm BCD to integer linear multi-stage optimization problems.

Corollary 31

Let the multi-stage optimization problem (MP) be a linear integer optimization problem with rational coefficients. Let \(x^{(0)}=(x^{(0)}_{+1},\ldots ,x^{(0)}_{+p})\) be the starting solution obtained by solving the problems \(\mathrm {P}'_1,\mathrm {P}'_2(x^{(0)}_{+1}),\ldots ,\mathrm {P}'_p(x^{(0)}_{+1},\ldots ,x^{(0)}_{+(p-1)})\) sequentially. Then the following hold:

  1. Algorithm BCD is executable if one of the following two conditions is satisfied:

    • \(\mathrm {P}_i(x^{(0)}_{-i})\) is a bounded optimization problem for all \(i=1,\ldots ,p\).

    • The p problems of the first round of Algorithm BCD have optimal solutions.

  2. If Algorithm BCD is executable, then we furthermore have:

    (i) If the objective function f is not constant with respect to any of the variables, then all sequence elements are close-to-boundary.

    (ii) If (MP) is bounded, then the sequence of objective function values converges and becomes constant, i.e., Algorithm BCD terminates.

  3. Finally, if Algorithm BCD terminates with a solution x, then we have:

    (i) x is blockwise optimal.

    (ii) If G contains only one coupling constraint, x is weakly Pareto optimal.

Proof

The first statement follows from Lemma 1 and Corollary 3. Statement 2(i) follows from Lemma 9 since the (linear) objective function is strictly quasiconcave if it is not constant with respect to any of the variables. Statement 2(ii) follows from Corollary 7. Finally, Statement 3(i) is always true, and Statement 3(ii) follows from Theorem 22. \(\square \)

For the special case that (MP) is a combinatorial optimization problem, boundedness of the problem and of all subproblems is guaranteed. We can hence strengthen the result of Corollary 31 as follows.

Corollary 32

Let the multi-stage optimization problem (MP) be a combinatorial optimization problem with rational data. Let \(x^{(0)}=(x^{(0)}_{+1},\ldots ,x^{(0)}_{+p})\) be the starting solution obtained by solving the problems \(\mathrm {P}'_1,\mathrm {P}'_2(x^{(0)}_{+1}),\ldots ,\mathrm {P}'_p(x^{(0)}_{+1},\ldots ,x^{(0)}_{+(p-1)})\) sequentially. Then Algorithm BCD is executable, stops after a finite number of iterations and the solution obtained is blockwise optimal.

We finally illustrate the application of Algorithm BCD using the multi-stage optimization problems arising in planning of public transportation. The planning process in public transportation can be split into the following planning stages: After the design of the public transportation network, the lines have to be planned. After that, the timetable can be designed, followed by vehicle scheduling, crew scheduling, and crew rostering. For all of these planning problems, models are known and advanced solution techniques are available. However, going through all these stages sequentially leads only to suboptimal solutions. This motivates the tremendous amount of recent research on integrated planning in which two or even more of the planning stages are treated simultaneously, see, e.g., Liebchen and Möhring (2007), Guihaire and Hao (2008), Steinzen et al. (2010), Cadarso and Marin (2012), Petersen et al. (2013), Schmidt and Schöbel (2015), Schmid and Ehmke (2016), Burggraeve et al. (2017), Pätzold et al. (2017), Meng et al. (2018), Fonseca et al. (2018).

In Schöbel (2017), Algorithm BCD has been set up for the three stages line planning, timetabling, and vehicle scheduling, and Gattermann et al. (2016), Schiewe and Schiewe (2018) present experimental results showing that the solution quality can be improved by Algorithm BCD in this application. Figure 5 shows the process: First, a starting solution is found by the classic sequential approach in which first the line planning problem is solved, then a timetable is determined and finally the vehicle schedules are planned. This sequential solution is then treated as starting solution and improved by iterating through the cycle depicted in Fig. 5. It should be noted that modeling these three highly intertwined improvement stages as integer programs is not possible with a partition \({\mathfrak {J}}\); rather, there are several groups of variables which appear in two of the three integer programs.

Fig. 5

Construction of a starting solution and applying Algorithm BCD for the case of the three stages line planning, timetabling, and vehicle scheduling in public transportation

We finally give an interpretation of the results for multi-stage problems when applied to the problem MP arising in public transportation depicted in Fig. 5. The notion of blockwise optimality from Definition 11 means that we cannot improve any of the planning stages without changing another one. E.g., we cannot improve the timetable without changing the vehicle schedules or the line plan. Looking at Pareto optimality, we may change all variables, and not only the variables of one of the stages. A solution is Pareto optimal if we cannot improve one of the planning stages without worsening another one. In general, blockwise optimal solutions need not be Pareto optimal. For example, we might be able to improve the timetable when making a few changes to the vehicle schedules, and this may be done without increasing the costs of the vehicle schedule. According to Theorem 22, weak Pareto optimality only follows from blockwise optimality if there is only one coupling constraint, i.e., when the subproblems are almost independent of each other.

Finally, planning a public transportation system involves integer linear optimization problems with only a finite number of feasible solutions. For this case we can use Corollaries 31 and 32 showing that Algorithm BCD is always executable and terminates after a finite number of iterations with a blockwise optimal solution.

7 Conclusion and further research

In this paper we presented a systematic analysis of blockwise coordinate descent methods applied to integer (linear or nonlinear) programs. We analyzed the existence of the sequence elements and the convergence of the algorithm, and discussed properties of the (local) minima found. In particular, we discussed blockwise coordinate descent not only on a partition of the variables but also on a cover.

For further research we plan to discuss how to construct a good set of subproblems \({\mathfrak {J}}\) as input for Algorithm BCD. Clearly, nested sets \(I_j \subseteq I_{j+1}\) can be avoided in an efficient sequence since the smaller one is always redundant. We also performed some first experiments regarding the choice of the sets in \({\mathfrak {J}}\). To this end, we applied BCD to an easy and a hard combinatorial problem, namely to the minimum spanning tree and to the knapsack problem. We tested several possibilities on how to choose the sets in \({\mathfrak {J}}\) (e.g., disjoint blocks of equal or different sizes, several ways of choosing overlapping blocks, or choosing all k-element subsets). As a result, for both combinatorial problems it turned out that choosing overlapping blocks is superior to choosing disjoint blocks (of the same size). That is, using covers instead of partitions yields better results in less time. Details can be found in Spühler (2018). Using this as a starting point, different strategies on how to choose the cover \({\mathfrak {J}}\) should be discussed, among them all equal-size subsets of the variables (yielding \({\mathfrak {J}}\) with uniform structure), larger or smaller sets, and sets with large or small overlap with others.

Based on the observations on matroids and on the decomposition results, we furthermore aim at identifying problem classes in which the gap between the solution obtained by Algorithm BCD and an optimal solution can be bounded. For the performance of BCD it is also of crucial interest how to find a (good) starting solution. Along the lines depicted in Fig. 4, different sequential orders for finding starting solutions are under investigation (Schiewe and Schöbel 2019).

Finally, blockwise coordinate descent methods should be further refined and used in applications, e.g., for improving the sequential process in public transportation as recently done in Pätzold et al. (2017), Amberg et al. (2018), or for finding solutions to interwoven systems (Klamroth et al. 2017). Also, location problems and supply chain management problems are classes with practical applications that are suitable for studying the performance of the respective approaches.