1 Introduction

We consider the problem

$$\begin{aligned} \min _{\textbf{x}\in \mathcal {C}} f(\textbf{x}) \,, \end{aligned}$$
(1)

with objective f having L-Lipschitz regular gradient, and feasible set \(\mathcal {C}\subseteq \mathbb {R}^n\) closed and convex. Furthermore, we assume that \(\mathcal {C}\) is block separable, that is

$$\begin{aligned} \mathcal {C}= \mathcal {C}_{(1)} \times ... \times \mathcal {C}_{(m)} \end{aligned}$$
(2)

with \(\mathcal {C}_{(i)} \subset \mathbb {R}^{n_i}\) closed and convex for \( i \in [1\!: \! m]\), and of course \(\sum _{i=1}^m n_i=n\).

Notice that problem (1) falls in the class of composite optimization problems

$$\begin{aligned} \min _{\textbf{x}\in \mathcal {C}} [f(\textbf{x}) + g(\textbf{x})] \end{aligned}$$
(3)

with f smooth and \(g(\textbf{x}) =\sum _{i=1}^m \chi _{\mathcal {C}_{(i)}}(\textbf{x}^{(i)})\) convex and block separable (see, e.g., [37] for an overview of methods for this class of problems); here \(\chi _D: \mathbb {R}^d \rightarrow [0,+\infty ]\) denotes the indicator function of a convex set \(D\subseteq \mathbb {R}^d\), and for a block vector \(\textbf{x}\in \mathbb {R}^{n} = \mathbb {R}^{n_1}\times ... \times \mathbb {R}^{n_m}\) we denote by \(\textbf{x}^{(i)} \in \mathbb {R}^{n_i}\) the component corresponding to the i-th block, so that \(\textbf{x}= (\textbf{x}^{(1)},..., \textbf{x}^{(m)})\).

Problems of this type arise in a wide number of real-world applications like, e.g., traffic assignment [31], structural SVMs [28], trace-norm based tensor completion [32], reduced rank nonparametric regression [22], semi-relaxed optimal transport [23], structured submodular minimization [26], group fused lasso [1], and dictionary learning [18].

Block-coordinate gradient descent (BCGD) strategies (see, e.g., [4]) represent a standard approach to solve problem (1) in the convex case. When dealing with non-convex objectives, those methods can still be used as an efficient tool to perform local searches in probabilistic global optimization frameworks (see, e.g., [33] for further details). The way BCGD approaches work is easy to describe: at each iteration, those methods build a suitable model of the original function for a block of variables and then perform a projection onto the feasible set related to that block.

Projection-based strategies (see, e.g., [6, 19, 21, 40] for further details, with significant contributions by Daniela di Serafino) are in practice widely used also in a block-coordinate fashion (see, e.g., [35]). However, they might be costly even when the projection is performed over some structured sets like, e.g., the flow polytope, the nuclear-norm ball, the Birkhoff polytope, or the permutahedron (see, e.g., [20]). This is the reason why, in recent years, projection-free methods (see, e.g., [14, 25, 29]) have been massively used when dealing with those structured constraints.

These methods simply rely on a suitable oracle that minimizes, at each iteration, a linear approximation of the function over the original feasible set, returning a point in

$$\begin{aligned} \text {argmin}_{\textbf{x}\in \mathcal {C}} \langle \textbf{g}, \textbf{x}\rangle . \end{aligned}$$

When \(\mathcal {C}\) is defined as in (2), this decomposes into m independent problems thanks to the block separable structure of the feasible set. In turn, the resulting problems on the blocks can then be solved in parallel, a possibility that has widely been explored in the literature, especially in the context of traffic assignment (see, e.g., [31]). In a big data context, performing a full update of the variables might still represent a computational bottleneck that needs to be properly handled in practice. This is the reason why block-coordinate variants of the classic Frank–Wolfe (FW) method have been recently proposed (see, e.g., [28, 36, 41]). The block-coordinate FW method proposed in [28] for structured support vector machine training randomly selects a block at each iteration and performs an FW update on that block. Several improvements on this algorithm, e.g., adaptive block sampling, the use of pairwise and away-step directions, or oracle call caching, are described in [36]; these variants work in a sequential fashion.

However, in case one wants to take advantage of modern multicore architectures or of distributed clusters, parallel and distributed versions of the block-coordinate FW algorithm are also available [41]. It is important to highlight that all the papers mentioned above only consider convex programming problems and use random sampling variants as the main block selection strategy.
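To make the block decomposition of the linear minimization oracle concrete, the following minimal Python sketch (function and variable names are ours) assumes, purely for illustration, that every block \(\mathcal {C}_{(i)}\) is a standard simplex, so each block subproblem is solved by picking a single vertex; the loop over blocks could equally well be run in parallel.

```python
import numpy as np

def block_lmo(grad, blocks):
    """Sketch: the linear minimization over C = C_(1) x ... x C_(m) splits into m block LMOs.

    Purely for illustration, every block C_(i) is assumed to be a standard simplex, so the
    i-th block LMO is solved by the vertex minimizing the corresponding block of grad.
    blocks is a list of slices, one per block of variables.
    """
    s = np.zeros_like(grad, dtype=float)
    for sl in blocks:
        j = int(np.argmin(grad[sl]))   # block-wise argmin of the linearized objective
        s[sl.start + j] = 1.0          # vertex e_j of the i-th simplex
        # the m subproblems are independent, so this loop could be run in parallel
    return s

# usage: two simplex blocks of sizes 3 and 2
print(block_lmo(np.array([0.3, -1.0, 0.5, 2.0, -0.2]), [slice(0, 3), slice(3, 5)]))
```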

Furthermore, as noticed in [36], the standard convergence analysis for FW variants (e.g., pairwise and away-step FW) cannot be easily extended to the block-coordinate case. In particular, there has been no extension in this setting of the well-known linear convergence rate guarantees for FW variants applied to strongly convex objectives (see [14] and references therein). This is mainly due to the difficulties in handling the bad/short steps (i.e., those steps that do not give good progress and are taken to guarantee feasibility of the iterate) within a block-coordinate framework. In [36], the authors hence extend the convergence analysis of FW variants to the block-coordinate setting under the strong assumption that there are no bad steps, claiming that novel proof techniques are required to carry out the analysis in general and close the gap between FW and BCFW in this context.

Here we focus on the non-convex case and define a new general block-coordinate algorithmic framework that gives flexibility in the use of both block selection strategies and FW-like directions. Such flexibility is mainly obtained thanks to the way we perform approximate minimizations in the blocks. At each iteration, after selecting at least one block, we use the Short Step Chain (SSC) procedure described in [39], which skips gradient computations in consecutive short steps until proper conditions are satisfied, to perform the approximate minimization in the selected blocks.

Concerning the block selection strategies, we explore three different options. The first one is a parallel or Jacobi-like strategy (see, e.g., [5]), where the SSC procedure is performed for all blocks. This obviously reduces the computational burden with respect to the use of the SSC in the whole variable space (see, e.g., [39]) and enables the use of multicore architectures to perform those tasks in parallel. The second one is random sampling (see, e.g., [28]), where the SSC procedure is performed at each iteration on a randomly selected subset of blocks. Finally, we have a variant of the Gauss–Southwell rule (see, e.g., [34]), where we perform the SSC in all blocks and then select a block that violates the optimality conditions the most. Such a greedy rule may make more progress in the objective function, since it uses first-order information to choose the right block, but it is, in principle, more expensive than the other options mentioned before (notice that the SSC is performed, at each iteration, for all blocks).

Furthermore, we consider the following projection-free strategies: the Away-step Frank–Wolfe (AFW), the Pairwise Frank–Wolfe (PFW), and the Frank–Wolfe method with in-face directions (FDFW); see, e.g., [39] and references therein for further details. The AFW and PFW strategies depend on a set of “elementary atoms” A such that \(\mathcal {C}= \text {conv}(A)\). Given A, for a base point \(\textbf{x}\in \mathcal {C}\) we can define

$$\begin{aligned} S_\textbf{x}= \{S \subset A: \textbf{x}\text { is a proper convex combination of all the elements in }S \} \,, \end{aligned}$$

the family of possible active sets for a given point \(\textbf{x}\). For \(\textbf{x}\in \mathcal {C}\) and \(S \in S_\textbf{x}\), \(\textbf{d}^{\text {PFW}}\) is a PFW direction with respect to the active set S and gradient \(-\textbf{g}\) if and only if

$$\begin{aligned} \textbf{d}^{\text {PFW}} = \textbf{s}- {\textbf{q}}\text { with } \textbf{s}\in \text {argmax}_{\textbf{s}\in \mathcal {C}} \langle \textbf{s}, \textbf{g}\rangle \text { and } {\textbf{q}}\in \text {argmin}_{{\textbf{q}}\in S} \langle {\textbf{q}}, \textbf{g}\rangle \,. \end{aligned}$$
(4)

Similarly, given \(\textbf{x}\in \mathcal {C}\) and \(S \in S_\textbf{x}\), \(\textbf{d}^{\text {AFW}}\) is an AFW direction with respect to the active set S and gradient \(-\textbf{g}\) if and only if

$$\begin{aligned} \textbf{d}^{\text {AFW}} \in \text {argmax}\{\langle \textbf{g}, \textbf{d}\rangle : \textbf{d}\in \{\textbf{d}^{\text {FW}}, \textbf{d}^{\text {AS}}\} \} \,, \end{aligned}$$
(5)

where \(\textbf{d}^{\text {FW}}\) is a classic Frank–Wolfe direction

$$\begin{aligned} \textbf{d}^{\text {FW}} = \textbf{s}- \textbf{x}\text { with } \textbf{s}\in {\text {argmax}}_{\textbf{s}\in \mathcal {C}} \langle \textbf{s}, \textbf{g}\rangle \,, \end{aligned}$$
(6)

and \(\textbf{d}^{\text {AS}}\) is the away direction

$$\begin{aligned} \textbf{d}^{\text {AS}} = \textbf{x}-{\textbf{q}}\text { with } {\textbf{q}}\in \text {argmin}_{{\textbf{q}}\in S} \langle {\textbf{q}}, \textbf{g}\rangle \,. \end{aligned}$$
(7)

The FDFW only requires the current point \(\textbf{x}\) and the gradient \(-\textbf{g}\) to select a descent direction (i.e., it does not need to keep track of the active set). It relies on the in-face direction

$$\begin{aligned} \textbf{d}^{F} = \textbf{x}- \textbf{x}_{F} \text { with } \textbf{x}_{F} \in \text {argmin}\{\langle \textbf{g}, \textbf{y}\rangle : \textbf{y}\in \mathcal {F}(\textbf{x}) \} \end{aligned}$$

for \(\mathcal {F}(\textbf{x})\) the minimal face of \(\mathcal {C}\) containing \(\textbf{x}\). The selection criterion is then analogous to the one used by the AFW:

$$\begin{aligned} \textbf{d}^{\text {FD}} \in \text {argmax} \{ \langle \textbf{g}, \textbf{d}\rangle : \textbf{d}\in \{\textbf{d}^{F}, \textbf{d}^{\text {FW}} \} \} \,. \end{aligned}$$
(8)
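To fix ideas, the following minimal Python sketch (names are ours) computes the directions (4)–(7) and the AFW selection (5) for a block whose feasible set is the convex hull of a finite matrix of atoms; since a linear function attains its maximum over \(\mathcal {C}\) at an atom, the oracle in (6) can be evaluated by scanning the columns of that matrix. The FDFW direction (8) additionally requires the minimal face \(\mathcal {F}(\textbf{x})\) and is omitted here.

```python
import numpy as np

def fw_directions(x, g, atoms, active_idx):
    """FW (6), away (7), pairwise (4) and AFW (5) directions for C = conv(atoms).

    atoms: (n, N) array whose columns are the elementary atoms in A;
    active_idx: indices of the active set S (x is a proper convex combination of them);
    g: the vector such that the gradient at x is -g, as in the text.
    """
    active_idx = np.asarray(active_idx)
    scores = atoms.T @ g                                          # <a, g> for every atom a
    s = atoms[:, int(np.argmax(scores))]                          # argmax_{s in C} <s, g>
    q = atoms[:, int(active_idx[np.argmin(scores[active_idx])])]  # argmin_{q in S} <q, g>
    d_fw = s - x                                                  # classic FW direction (6)
    d_as = x - q                                                  # away direction (7)
    d_pfw = s - q                                                 # pairwise direction (4)
    d_afw = d_fw if g @ d_fw >= g @ d_as else d_as                # AFW selection rule (5)
    return d_fw, d_as, d_pfw, d_afw
```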

From a theoretical point of view, this new algorithmic framework enables us to give:

  • a local linear convergence rate for any choice of block selection strategy and FW-like direction. This result is obtained under a Kurdyka-Łojasiewicz (KL) property (see, e.g., [3, 7, 8]) and a tailored angle condition (see, e.g., [39]). Thanks to the way we handle short steps in our framework, we are thus able to extend the analysis given for FW variants to the block-coordinate case, closing the relevant gap in the theory highlighted in [36].

  • a local active set identification result (see, e.g., [12, 13, 15, 24]) for a specific structure of the Cartesian product defining the feasible set \(\mathcal {C}\), suitable choices of projection-free strategy (namely, the AFW direction), and general smooth non-convex objectives. In particular, we prove that our framework identifies in finite time the support of a solution. Such a theoretical feature makes it possible to reduce the dimension of the problem at hand and, consequently, the overall computational cost of the optimization procedure.

This is, to the best of our knowledge, the first time that both a (bad step free) linear convergence rate and an active set identification result are given for block-coordinate FW variants. In particular, we solve the open question from [36] discussed above, proving that the linear convergence rate of FW variants can indeed be extended to the block coordinate setting. Furthermore, our results guarantee, for the first time in the literature of projection free optimization methods, identification of the local active set in a single iteration without a tailored active set strategy.

We also report some preliminary numerical results on a specific class of structured problems with a block separable feasible set. Those results show that the proposed framework outperforms the classic block-coordinate FW and, thanks to its flexibility, can be effectively embedded into a probabilistic global optimization framework, thus significantly boosting its performance.

The paper is organized as follows. Section 2 describes the details of our new algorithmic framework. An in-depth analysis of its convergence properties is reported in Sect. 3. An active set identification result is reported in Sect. 4. Preliminary numerical results, focusing on the computational analysis of both the local identification and the convergence properties of our framework, are reported in Sect. 5. Finally, some concluding remarks are included in Sect. 6.

1.1 Notation

For a closed and convex set \(C \subset \mathbb {R}^h\) we denote by \(\pi (C, \textbf{x})\) the Euclidean projection of \(\textbf{x}\in \mathbb {R}^h\) onto C, and by \(T_{C}(\textbf{x})\) the tangent cone to C at \(\textbf{x}\in C\), itself again a closed convex set:

$$\begin{aligned} {T_C(\textbf{x})= \text{ closure }\{ t(\textbf{v}- \textbf{x}): t\ge 0, \textbf{v}\in C\}\,.} \end{aligned}$$

For \(\textbf{g}\in \mathbb {R}^h\) we also use \(\pi _\textbf{x}(\textbf{g})\) as a shorthand for \(\Vert \pi (T_{C}(\textbf{x}), \textbf{g})\Vert \). We denote by \(\hat{\textbf{y}}\) the vector \(\frac{\textbf{y}}{\Vert \textbf{y}\Vert }\) for \(\textbf{y}\ne \textbf{o}\), and \(\hat{\textbf{y}}=\textbf{o}\) otherwise. We finally denote by \(\bar{B}_r(\textbf{x})\) and \(B_r(\textbf{x})\) the closed and open balls of radius r centered at \(\textbf{x}\).

2 A new block-coordinate projection-free method

The block-coordinate framework we consider here applies the Short Step Chain (SSC) procedure from [39], described below as Algorithm 2, to some of the blocks at every iteration. A detailed scheme is specified as Algorithm 1; recall notation \(\textbf{x}= (\textbf{x}^{(1)},..., \textbf{x}^{(m)})\) with \(\textbf{x}^{(i)}\in \mathcal {C}_{(i)}\), all \(i\in [1\!: \! m]\).

Algorithm 1: Block coordinate method with Short Step Chain (SSC) procedure

In Algorithm 1, we perform two main operations at each iteration. First, in Step 3, we pick a suitable subset of blocks \(\mathcal {M}_k\) according to a given block selection strategy. We then update (Steps 4 and 5) the variables related to the selected blocks by means of the SSC procedure, while keeping all the variables in the other blocks unchanged.
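The following Python sketch is our schematic reading of Algorithm 1 as just described (all names and signatures are ours): the variables are assumed to be stored in a single array with one slice per block, select_blocks stands for the block selection strategy of Step 3, and ssc_block for the per-block SSC of Steps 4 and 5.

```python
import numpy as np

def block_coordinate_ssc(x0, grad_f, blocks, select_blocks, ssc_block, max_iter=100):
    """Sketch of Algorithm 1: at each iteration, run the SSC only on the selected blocks."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = -grad_f(x)                      # the gradient is -g, kept fixed within the iteration
        M_k = select_blocks(x, g, blocks)   # Step 3: block selection strategy
        for i in M_k:                       # Steps 4-5: approximate minimization via the SSC
            sl = blocks[i]
            x[sl] = ssc_block(i, x[sl], g[sl])
        # variables in the remaining blocks are left unchanged;
        # a stationarity test would be used to stop the loop in practice
    return x
```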

We now briefly recall in Algorithm 2 the SSC procedure from [39], designed to recycle the gradient in consecutive bad steps until suitable stopping conditions are met. Here and in the sequel we denote by \( \alpha _{\max }(\textbf{y}_j, \textbf{d}_j) \) the maximal feasible stepsize at \(\textbf{y}_j\) along the direction \(\textbf{d}_j\).

Algorithm 2: Short Step Chain (SSC) procedure

By \(\mathcal {A}\) we indicate a projection-free strategy to generate first-order feasible descent directions for smooth functions on the block where the SSC is applied (e.g., FW, PFW, AFW directions). Since the gradient, \(-\textbf{g}\), is constant during the SSC procedure, it is easy to see that the procedure represents an application of \(\mathcal {A}\) to minimize the linearized objective \(f_\textbf{g}(\textbf{z}) = \langle - \textbf{g}, \textbf{z}- \bar{\textbf{x}} \rangle + f(\bar{\textbf{x}})\), with suitable stepsizes and stopping condition. More specifically, after a stationarity check (see Steps 2–4), the stepsize \(\alpha _j\) is the minimum of an auxiliary stepsize \(\beta _j>0\) and the maximal stepsize \(\alpha ^{(j)}_{\max }\) (which we always assume to be strictly positive). The point \(\textbf{y}_{j + 1}\) generated at Step 7 is always feasible since \(\alpha _j \le \alpha ^{(j)}_{\max }\). Notice that if the method \(\mathcal {A}\) used in the SSC performs an FW step (see equation (6) for the definition), then the SSC terminates, with \(\alpha _j = \beta _j\) or with \(\textbf{y}_{j + 1}\) a global minimizer of \(f_\textbf{g}\).

The auxiliary step size \(\beta _j\) (see Step 5 of the SSC procedure) is thus defined as the maximal feasible stepsize (at \(\textbf{y}_j\)) for the trust region

$$\begin{aligned} \varOmega _j = B_{\Vert \textbf{g}\Vert /2L}(\bar{\textbf{x}} + \frac{\textbf{g}}{2L}) \cap B_{\langle \textbf{g}, \hat{\textbf{d}}_j \rangle /L}(\bar{\textbf{x}}) \,. \end{aligned}$$
(9)

This guarantees the sufficient decrease condition

$$\begin{aligned} f(\textbf{y}_j) \le f(\textbf{x}_k) - \frac{L}{2}\Vert \textbf{x}_k - \textbf{y}_j\Vert ^2 \end{aligned}$$
(10)

and hence a monotone decrease of f in the SSC. For further details see [39].
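A simplified Python sketch may help fix ideas; it is our schematic reading of the SSC, not a verbatim transcription of Algorithm 2 (the stationarity test and the stopping rule are condensed). The auxiliary stepsize \(\beta _j\) is obtained by intersecting the ray from \(\textbf{y}_j\) along \(\hat{\textbf{d}}_j\) with the two balls in (9), and the chain stops when the step is limited by the trust region rather than by feasibility. Wrapping this routine with a block's direction oracle and feasibility routine yields the ssc_block used in the sketch after Algorithm 1.

```python
import numpy as np

def _max_step_in_ball(y, d_hat, center, radius):
    # Largest t >= 0 with ||y + t*d_hat - center|| <= radius (0 if the ray never enters the ball).
    w = y - center
    b = 2.0 * (d_hat @ w)
    c = w @ w - radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0.0 or c > 1e-12:        # the ray misses the ball, or y is already strictly outside
        return 0.0
    return max(0.0, (-b + np.sqrt(disc)) / 2.0)

def ssc(x_bar, g, L, direction, max_feasible_step, tol=1e-12, max_steps=1000):
    """Simplified SSC sketch: chain short steps while recycling the frozen vector g."""
    y = np.array(x_bar, dtype=float)
    for _ in range(max_steps):
        d = direction(y, g)                           # e.g. an AFW/PFW direction w.r.t. g
        if np.linalg.norm(d) <= tol or g @ d <= tol:  # stationarity check (Steps 2-4)
            return y
        d_hat = d / np.linalg.norm(d)
        beta = min(                                   # maximal stepsize inside the trust region (9)
            _max_step_in_ball(y, d_hat, x_bar + g / (2 * L), np.linalg.norm(g) / (2 * L)),
            _max_step_in_ball(y, d_hat, x_bar, (g @ d_hat) / L),
        )
        alpha = min(max_feasible_step(y, d_hat), beta)
        y = y + alpha * d_hat                         # Step 7: feasible since alpha <= alpha_max
        if alpha == beta:                             # step limited by the trust region: stop
            return y
    return y
```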

2.1 Block selection strategies

As briefly mentioned in the introduction, we will consider three different block selection strategies in our analysis. The first one is a parallel or Jacobi-like strategy (see, e.g., [5]). In this case, we select all the blocks at each iteration. As we already observed, this is computationally cheaper than handling the whole variable space at once. Furthermore, multicore architectures can be exploited to perform those tasks in parallel. A definition of the strategy is given below:

Definition 1

(Parallel selection) Set \(\mathcal {M}_k = [1 \!: \! m]\).

The second strategy is a variant of the GS rule (see, e.g., [34]), where we first perform the SSC in all blocks and then select a block that violates the optimality conditions the most. The formal definition is reported below.

Definition 2

(Gauss–Southwell (GS) selection) Set \(\mathcal {M}_k= \{i(k)\}\), with

$$\begin{aligned} i(k) \in \text {argmax}_{i \in [1: m]} \langle \textbf{g}^{(i)}, \text {SSC}(\textbf{x}_k^{(i)}, -\nabla f(\textbf{x}_k)^{(i)})-\textbf{x}_{k}^{(i)} \rangle . \end{aligned}$$

Finally, we have random sampling (see, e.g., [28]). Here we randomly generate one index at each iteration with uniform probability distribution. The definition we have in this case is the following:

Definition 3

(Random sampling) Set \(\mathcal {M}_k = \{i(k)\}\), with i(k) index chosen uniformly at random in \([1 \!: \! m]\).
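For illustration, the three rules can be sketched in Python as follows (same hedged conventions as in the earlier sketches: \(\textbf{g}\) denotes the negative gradient and ssc_block the per-block SSC; names are ours). Note that the GS rule tentatively runs the SSC on every block before choosing a single one.

```python
import numpy as np

def parallel_selection(x, g, blocks):
    return list(range(len(blocks)))              # Definition 1: select all blocks

def random_selection(x, g, blocks, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    return [int(rng.integers(len(blocks)))]      # Definition 3: one uniformly random block

def make_gs_selection(ssc_block):
    # Definition 2: tentatively run the SSC on every block and keep the most violating one
    def gs_selection(x, g, blocks):
        scores = []
        for i, sl in enumerate(blocks):
            y_i = ssc_block(i, x[sl], g[sl])     # tentative SSC output on block i
            scores.append(g[sl] @ (y_i - x[sl])) # <g^(i), SSC(x^(i), g^(i)) - x^(i)>
        return [int(np.argmax(scores))]
    return gs_selection
```

These functions can be passed directly as the select_blocks argument of the sketch given after Algorithm 1 (the GS rule first being instantiated via make_gs_selection).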

3 Convergence analysis

In this section, we analyze the convergence properties of our algorithmic framework. In particular, we show that, under a suitably defined angle condition on the blocks and a local KL condition on the objective function, we get a linear convergence rate for any block selection strategy used. The convergence analysis presented in this section extends the results given in [39] to the block-coordinate setting, a task which is by no means straightforward. Some novel arguments are hence required for this extension; they are introduced below and described in detail in the appendix.

Our convergence framework makes use of the angle condition introduced in [38, 39]. Such a condition ensures that the slope of the descent direction selected by the method is optimal up to a constant. We now recall this angle condition. For \(\textbf{x}\in \mathcal {C}\) and \(\textbf{g}\in \mathbb {R}^n\) we first define the directional slope lower bound as

$$\begin{aligned} \text {DSB}_{\mathcal {A}}(\mathcal {C}, \textbf{x}, \textbf{g}) = \inf _{\textbf{d}\in \mathcal {A}(\textbf{x},\textbf{g})} \frac{\langle \textbf{g}, \textbf{d}\rangle }{\pi _\textbf{x}(\textbf{g}) \Vert \textbf{d}\Vert }, \end{aligned}$$
(11)

if \(\textbf{x}\) is not stationary for \(-\textbf{g}\), otherwise we set \(\text {DSB}_{\mathcal {A}}(\mathcal {C}, \textbf{x}, \textbf{g}) = 1\). Given a subset P of \(\mathcal {C}\), we then define the slope lower bound as

$$\begin{aligned} \text {SB}_{\mathcal {A}}(\mathcal {C}, P) = \inf _{\begin{array}{c} \textbf{g}\in \mathbb {R}^n \\ \textbf{x}\in P \end{array}} \text {DSB}_{\mathcal {A}}(\mathcal {C}, \textbf{x}, \textbf{g}) = \inf _{\begin{array}{c} \textbf{g}: \pi _\textbf{x}(\textbf{g}) \ne 0 \\ \textbf{x}\in P \end{array}} \text {DSB}_{\mathcal {A}}(\mathcal {C}, \textbf{x}, \textbf{g}) \,. \end{aligned}$$
(12)

We use \(\text {SB}_{\mathcal {A}}(\mathcal {C})\) as a shorthand for \(\text {SB}_{\mathcal {A}}(\mathcal {C}, \mathcal {C})\), and say that the angle condition holds for the method \(\mathcal {A}\) if

$$\begin{aligned} \text {SB}_{\mathcal {A}}(\mathcal {C}) = \tau > 0 \,. \end{aligned}$$
(13)

Remark 1

AFW, PFW and FDFW all satisfy the angle condition when \(\mathcal {C}\) is a polytope. A detailed proof of this result is reported in [39], together with some other examples, described in [38], of methods satisfying the angle condition for convex sets with smooth boundary.

We now report the local KL condition used to analyze the convergence of our algorithm. The same condition was previously used in [39, Assumption 2.1].

Assumption 1

Given a stationary point \(\textbf{x}_* \in \mathcal {C}\), there exist \(\eta , \delta , {\mu } > 0\) such that for every \(\textbf{x}\in B_{\delta }(\textbf{x}_*)\) with \(f(\textbf{x}_*)< f(\textbf{x}) < f(\textbf{x}_*) + \eta \) we have

$$\begin{aligned} \pi _\textbf{x}(-\nabla f(\textbf{x})) \ge \sqrt{2\mu }[f(\textbf{x}) - f(\textbf{x}_*)]^{\frac{1}{2}} \,. \end{aligned}$$
(14)

When dealing with convex programming problems, a Hölderian error bound with exponent 2 on the solution set implies condition (14), see [9, Corollary 6]. Therefore, our assumption holds when dealing with \(\mu \)-strongly convex functions (see, e.g., [27]), and in particular for the setting of the open question from [36] discussed in the introduction. It is however important to highlight that the error bound (14) holds in a variety of both convex and non-convex settings (see [39] for a detailed discussion on this matter). An interesting example for our analysis is the setting where f is (non-convex) quadratic, i.e., \(f(\textbf{x}) = \textbf{x}^{\top }{\mathsf Q}\textbf{x}+ \textbf{b}^{\top }\textbf{x}\), and \(\mathcal {C}\) is a polytope.
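For completeness, here is a short derivation (ours) of why \(\mu \)-strong convexity implies (14): combining the strong convexity inequality at \(\textbf{x}_*\) with \(\langle -\nabla f(\textbf{x}), \textbf{d}\rangle \le \pi _\textbf{x}(-\nabla f(\textbf{x})) \Vert \textbf{d}\Vert \) for every \(\textbf{d}\in T_{\mathcal {C}}(\textbf{x})\) (a consequence of the Moreau decomposition) and the fact that \(\textbf{x}_*-\textbf{x}\in T_{\mathcal {C}}(\textbf{x})\), we get

$$\begin{aligned} f(\textbf{x}) - f(\textbf{x}_*) \le \langle -\nabla f(\textbf{x}), \textbf{x}_*-\textbf{x}\rangle - \frac{\mu }{2}\Vert \textbf{x}_*-\textbf{x}\Vert ^2 \le \pi _\textbf{x}(-\nabla f(\textbf{x}))\Vert \textbf{x}_*-\textbf{x}\Vert - \frac{\mu }{2}\Vert \textbf{x}_*-\textbf{x}\Vert ^2 \le \frac{\pi _\textbf{x}(-\nabla f(\textbf{x}))^2}{2\mu } \,, \end{aligned}$$

which rearranges to (14).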

We now report our main convergence result.

Theorem 1

Let Assumption 1 hold at \(\textbf{x}_*\). Let us consider the sequence \(\{\textbf{x}_k\}\) generated by Algorithm 1. Assume that:

  • the angle condition (13) holds in every block for the same \(\tau > 0\);

  • the SSC procedure always terminates in a finite number of steps;

  • \(f(\textbf{x}_*)\) is a minimum in the connected component of \( \{ \textbf{x}\in \mathcal {C}: f(\textbf{x}) \le f(\textbf{x}_0)\} \) containing \(\textbf{x}_0\).

Then, there exists \(\tilde{\delta } > 0\) such that, if \(\textbf{x}_0 \in B_{\tilde{\delta }}(\textbf{x}_*)\):

  • for the parallel block selection strategy, we have

    $$\begin{aligned} f(\textbf{x}_k) - f(\textbf{x}_*) \le (q_P)^k [f(\textbf{x}_0) - f(\textbf{x}_*)] \,, \end{aligned}$$
    (15)

    and \(\textbf{x}_k \rightarrow \tilde{\textbf{x}}_*\) with

    $$\begin{aligned} \Vert \textbf{x}_k - \tilde{\textbf{x}}_*\Vert \le \frac{\sqrt{2-2q_P}}{\sqrt{L}(1-\sqrt{q_P})}\, (q_P)^{\frac{k}{2}}[f(\textbf{x}_0) - f(\tilde{\textbf{x}}_*)]^{\frac{1}{2}} \,, \end{aligned}$$
    (16)

    for

    $$\begin{aligned} q_P= 1 - \frac{\mu \tau ^2}{4L(1 + \tau )^2}\,.\end{aligned}$$
    (17)
  • for the GS block selection strategy, we have

    $$\begin{aligned} f(\textbf{x}_k) - f(\textbf{x}_*) \le (q_{GS})^k [f(\textbf{x}_0) - f(\textbf{x}_*)] \,, \end{aligned}$$
    (18)

    and \(\textbf{x}_k \rightarrow \tilde{\textbf{x}}_*\) with

    $$\begin{aligned} \Vert \textbf{x}_k - \tilde{\textbf{x}}_*\Vert \le \frac{\sqrt{2-2q_{GS}}}{\sqrt{L}(1-\sqrt{q_{GS}})}\, (q_{GS})^{\frac{k}{2}}[f(\textbf{x}_0) - f(\tilde{\textbf{x}}_*)]^{\frac{1}{2}} \,, \end{aligned}$$
    (19)

    for

    $$\begin{aligned} q_{GS}= 1 - \frac{\mu \tau ^2}{4mL(1 + \tau )^2} \,, \end{aligned}$$
    (20)
  • for the random block selection strategy we have, under the additional condition that

    $$\begin{aligned} \min \{ f(\textbf{x}): \Vert \textbf{x}- \textbf{x}_*\Vert = \delta \}> f(\textbf{x}_*)\end{aligned}$$
    (21)

    holds for some \( \delta > 0\), that

    $$\begin{aligned} \mathbb {E}[f(\textbf{x}_k) - f(\textbf{x}_*)] \le (q_R)^k [f(\textbf{x}_0) - f(\textbf{x}_*)] \,, \end{aligned}$$
    (22)

    and \(\textbf{x}_k \rightarrow \tilde{\textbf{x}}_*\) almost surely with

    $$\begin{aligned} \mathbb {E}[\Vert \textbf{x}_k - \tilde{\textbf{x}}_*\Vert ] \le \frac{\sqrt{2-2q_R}}{\sqrt{L}(1-\sqrt{q_R})}\, (q_R)^{\frac{k}{2}}[f(\textbf{x}_0) - f(\tilde{\textbf{x}}_*)]^{\frac{1}{2}} \end{aligned}$$
    (23)

    for \(q_R= q_{GS}\).

This convergence result extends [39, Theorem 4.2] to our block coordinate setting. However, since the SSC is here applied independently to different blocks, we cannot directly apply the results from [39]. Instead, we combine the properties of the SSC applied in single blocks by exploiting the structure of the tangent cone for product domains:

$$\begin{aligned} T_{\mathcal {C}}(\textbf{x}) = T_{\mathcal {C}_{(1)}} (\textbf{x}^{(1)}) \times \cdots \times T_{\mathcal {C}_{(m)}} (\textbf{x}^{(m)})\,. \end{aligned}$$
(24)
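In particular, since the Euclidean projection onto a Cartesian product splits across the factors, the stationarity measure used in the analysis decomposes block-wise; as a direct consequence of (24) (our own restatement, applying the notation of Sect. 1.1 block by block),

$$\begin{aligned} \pi (T_{\mathcal {C}}(\textbf{x}), \textbf{g}) = \bigl (\pi (T_{\mathcal {C}_{(1)}}(\textbf{x}^{(1)}), \textbf{g}^{(1)}),..., \pi (T_{\mathcal {C}_{(m)}}(\textbf{x}^{(m)}), \textbf{g}^{(m)})\bigr ) \quad \text {and} \quad \pi _\textbf{x}(\textbf{g})^2 = \sum _{i=1}^m \pi _{\textbf{x}^{(i)}}(\textbf{g}^{(i)})^2 \,. \end{aligned}$$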

This requires proving stronger properties for the sequence generated by the SSC than those presented in [39]. The details with references to relevant results from [39] can be found in the appendix. Finite termination of the SSC procedure is instead directly ensured by the results proved in [38, 39], in particular for the AFW, PFW and FDFW applied on polytopes.

Remark 2

If the feasible set \(\mathcal {C}\) is a polytope and the objective function f satisfies condition (14) at every point generated by the algorithm, with fixed \(f(\textbf{x}_*)\), then Algorithm 1 with the AFW (PFW or FDFW) in the SSC converges at the rates given above. Condition (14) holds for \(\mu \)-strongly convex functions, and hence in those cases our algorithm converges globally at the rates given in Theorem 1.

Remark 3

Both the parallel and the GS strategy give the same kind of rate, with different constants. In particular, the constant ruling the GS case depends on the number of blocks used (the larger the number of blocks, the worse the rate) and is larger than the one we have for the parallel case.

Remark 4

The random block selection strategy has the same rate as the GS strategy, but it is given in expectation. In particular, the constant ruling the rate is the same as the GS one, hence depends on the number of blocks used. Note that a further technical assumption (21) on \(\textbf{x}_*\) is needed in this case.

4 Active set identification

We now report an active set identification result for our framework, assuming that the sets in the Cartesian product have a specific structure:

$$\begin{aligned} \mathcal {C}= \varDelta ^{n_1} \times ... \times \varDelta ^{n_m}, \end{aligned}$$
(25)

so that the set \(\mathcal {C}_{(i)}\) is the \((n_i - 1)\)-dimensional standard simplex

$$\begin{aligned} \varDelta ^{n_i}=\{ \textbf{x}\in \mathbb {R}^{n_i}_+: \textbf{x}^\top \textbf{e}^{(i)}=1\}\,,\quad i \in [1 \!: \! m]\,, \end{aligned}$$

for \(\textbf{e}\in \mathbb {R}^{n}\) the vector with components all equal to 1. We only focus on Algorithm 1 with the AFW in the SSC and assume that strict complementarity holds at a stationary point \(\textbf{x}_*\), i.e., for every block j and every index i belonging to that block, either \((x_*)_i=0\) or \(\frac{\partial f}{\partial x_i}(\textbf{x}_*)=\langle \textbf{x}_*^{(j)}, \nabla f(\textbf{x}_*)^{(j)} \rangle \) holds, but not both simultaneously. As usual, the support is \(\text {supp}~\textbf{x}= \{i \in [1 \!: \! n]: x_i>0\}\). We now report our main identification result. A detailed proof is included in the appendix.
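For concreteness, a small Python check of these two ingredients might look as follows (a sketch under our block-wise reading of the conditions, with names of our choosing); x_blocks and grad_blocks collect the blocks of a candidate stationary point and of its gradient.

```python
import numpy as np

def support(x, tol=1e-10):
    """Indices of the (numerical) support of a vector x."""
    return set(np.flatnonzero(np.asarray(x) > tol))

def strict_complementarity_holds(x_blocks, grad_blocks, tol=1e-8):
    """Sketch: block-wise strict complementarity check on a product of simplices."""
    for x_i, g_i in zip(x_blocks, grad_blocks):
        lam = float(x_i @ g_i)                 # multiplier of the i-th simplex at a stationary point
        for xj, gj in zip(x_i, g_i):
            zero = xj <= tol
            tight = abs(gj - lam) <= tol
            if zero == tight:                  # exactly one of the two conditions must hold
                return False
    return True
```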

Theorem 2

Under the above assumptions on \(\mathcal {C}\), let \(\mathcal {A}^{(i)}\) be the AFW for \(i \in [1 \!: \! m]\), and let strict complementarity conditions hold at \(\textbf{x}_* \in \mathcal {C}\).

  • If \(\{\textbf{x}_k\}\) is generated by Algorithm 1 with parallel selection, then there exists a neighborhood U of \(\textbf{x}_*\) such that if \(\textbf{x}_k \in U\) then \(\text {supp}(\textbf{x}_{k + 1}) = \text {supp}(\textbf{x}_*)\).

  • If \(\{\textbf{x}_k\}\) is generated by Algorithm 1 with randomized or GS selection, then there exists a neighborhood U of \(\textbf{x}_*\) such that if \(\textbf{x}_k \in U\) then \(\text {supp}(\textbf{x}_{k + 1}^{i(k)}) = \text {supp}(\textbf{x}_*^{i(k)})\).

When the sequence generated by our algorithm converges to the point \(\textbf{x}_*\), it is then easy to see that the support of the iterate matches the final support of \(\textbf{x}_*\) for k large enough.

Corollary 1

Under the above assumptions on \(\mathcal {C}\), let \(\mathcal {A}^{(i)}\) be the AFW for \(i \in [1 \!: \! m]\), and let strict complementarity conditions hold at \(\textbf{x}_* \in \mathcal {C}\). If \(\textbf{x}_k \rightarrow \textbf{x}_*\) (almost surely), then for parallel and GS selection (for random sampling) we have \(\text {supp}(\textbf{x}_k) = \text {supp}(\textbf{x}_*)\) for k large enough.

This result has relevant practical implications, especially when handling sparse optimization problems. Since the algorithm iterates have a constant support when k is large, we can restrict attention to the few components in the support and ignore the others. We can hence exploit this feature by embedding sophisticated tools (e.g., caching strategies, second-order methods) in the algorithm, thus obtaining a significant speed-up in the end.

5 Numerical results

We report here some preliminary numerical results for a non-convex quadratic optimization problem referred to as Multi-StQP [16] on a product of (here identical) simplices, that is

$$\begin{aligned} \min \left\{ \textbf{x}^\top {\mathsf Q}\textbf{x}: \textbf{x}\in (\varDelta ^l)^m \right\} \,. \end{aligned}$$
(26)

The matrix \({\mathsf Q}\) was generated in such a way that the solutions of problem (26) are sparse but are not vertices of the feasible set. This is in fact the setting where FW variants have proved to be most effective [15, 39]. In order to obtain the desired property, we consider a perturbation of a stochastic StQP [11]. Given \(\{\bar{{\mathsf Q}}_i\}_{i \in [1: m]}\) representing m possible StQPs, with \(\bar{{\mathsf Q}}_i \in \mathbb {R}^{l \times l}\) for \(i \in [1 \!: \! m]\), the corresponding stochastic StQP with sample space \([1 \!: \! m]\) is given by

$$\begin{aligned} \max \left\{ \sum _{i = 1}^{m} p_i \textbf{y}_i^\top \bar{{\mathsf Q}}_i \textbf{y}_i: \textbf{y}_i \in \varDelta ^l\quad \text {for all } i \in [1 \!: \! m] \right\} \,. \end{aligned}$$
(27)

with \(p_i\) the probability of StQP i. Equivalently, (27) is an instance of problem (26) with \({\mathsf Q}= \bar{{\mathsf Q}}\), for

$$\begin{aligned} \bar{{\mathsf Q}} = \begin{bmatrix} -p_1\bar{{\mathsf Q}}_1 &{} 0 &{} \cdots &{} 0 \\ 0 &{} -p_2\bar{{\mathsf Q}}_2 &{} \cdots &{} 0 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \cdots &{} -p_m\bar{{\mathsf Q}}_m \end{bmatrix} \,. \end{aligned}$$
(28)

In our tests, we added to the stochastic StQP a perturbation coupling the blocks. More precisely, the matrix \({\mathsf Q}\) was set equal to \(\bar{{\mathsf Q}} + \varepsilon \tilde{{\mathsf Q}} \), for \(\tilde{{\mathsf Q}}\) a random matrix with standard Gaussian independent entries. The coefficient \(\varepsilon \) was set equal to \(\frac{1}{2m^2}\). We set \(\bar{{\mathsf Q}}_i =\bar{{\mathsf A}}_i + \alpha {\mathsf I}_l \), for \(\alpha = 0.5\) and \(\bar{{\mathsf A}}_i\) the adjacency matrix of an Erdős–Rényi random graph, where each pair of vertices has probability p of being connected by an edge, independently of the other pairs. Hence, for \(i\in [1 \!: \! m]\) the problem

$$\begin{aligned} \min \left\{ - \textbf{y}^\top \bar{{\mathsf Q}}_i \textbf{y}: \textbf{y}\in \varDelta ^l \right\} \end{aligned}$$
(29)

is a regularized maximum-clique formulation, where each maximal clique corresponds to a unique strict local maximizer with support equal to its vertices, and conversely (see [10] and references therein). The probability p is set as follows

$$\begin{aligned} p = \left( {\begin{array}{c}l\\ s\end{array}}\right) ^{-\frac{2}{s(s-1)}} \,, \end{aligned}$$
(30)

for s the nearest integer to 0.4l, so that the expected number of cliques of size \( \approx 0.4l \) is 1 (see, e.g., [2]). Notice that the perturbation term \(\tilde{{\mathsf Q}}\) ensures that problem (26) cannot be solved by optimizing each block separately.
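A minimal Python sketch of this instance generator, under our reading of the construction above, could look as follows; the scenario probabilities \(p_i\) are not specified here, so a uniform choice \(p_i = 1/m\) is assumed purely for illustration.

```python
import numpy as np
from math import comb

def multi_stqp_instance(l, m, alpha=0.5, seed=0):
    """Sketch of the Multi-StQP instance (26): block-diagonal stochastic StQP plus coupling."""
    rng = np.random.default_rng(seed)
    s = int(round(0.4 * l))
    p = comb(l, s) ** (-2.0 / (s * (s - 1)))          # edge probability (30)
    n = l * m
    Q_bar = np.zeros((n, n))
    prob = np.full(m, 1.0 / m)                        # scenario probabilities p_i (assumed uniform)
    for i in range(m):
        A = np.triu((rng.random((l, l)) < p).astype(float), 1)
        A = A + A.T                                   # Erdos-Renyi adjacency matrix
        Q_i = A + alpha * np.eye(l)                   # regularized block, Q_bar_i = A_bar_i + alpha*I
        sl = slice(i * l, (i + 1) * l)
        Q_bar[sl, sl] = -prob[i] * Q_i                # block-diagonal part, cf. (28)
    eps = 1.0 / (2 * m ** 2)
    return Q_bar + eps * rng.standard_normal((n, n))  # coupling perturbation
```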

We remark here that different ways to build large StQPs starting from smaller instances while preserving the structure of their solutions have been discussed in [17]. However, while the resulting problems do not decouple on the feasible set of the larger problem, they still decouple on the product of the feasible sets of the smaller instances, and for our purposes are equivalent to the block-diagonal structure.

We tested four methods in total: AFW + SSC with parallel, GS and randomized updates (PAFW + SSC, GSAFW + SSC, BCAFW + SSC respectively), and FW with randomized updates (BCFW, coinciding with the block coordinate FW introduced in [28]). Our tests focused on the local identification and on the convergence properties of our methods.

The code was written in Python using the numpy package, and the tests were performed on an Intel Core i9-12900KS CPU 3.40GHz, 32GB RAM. The codes relevant to the numerical tests are available at the following link: https://github.com/DamianoZeffiro/Projection-free-product-domain.

Fig. 1: Comparison using multistart between GSAFW + SSC, PAFW + SSC and BCAFW + SSC. \(l = m = 100\)

5.1 Multistart

We first considered a multistart approach, where the results are averaged across 20 runs, choosing 4 starting points for each of 5 random initializations of the objective.

We measure both the optimality gap (error estimate) and the sparsity (number of nonzero components, \(\ell _0\) norm) of the iterates, reporting average and standard deviation in the plots. The estimated global optimum used in the optimality gap is obtained by subtracting \(10^{-5}\) from the best local solution found by the algorithms. We mostly consider the performance with respect to block gradient computations, with one gradient counted each time the SSC is performed in one of the blocks, as in previous works (see, e.g., [28]). In some tests involving GSAFW + SSC, we consider instead block updates, with one block update counted each time the algorithm modifies the current iterate in one of the blocks. It is important to highlight that, since at each block update the gradient is constant and only one linear minimization is required at the beginning of the SSC, the number of gradient computations for our algorithms also coincides with the number of linear minimizations on the blocks for the FW variants we consider.

We first compare PAFW + SSC, BCAFW + SSC and GSAFW + SSC (Fig. 1). As expected, while GSAFW + SSC shows good performance with respect to block updates, it performs very poorly with respect to block gradient computations, since at every iteration m gradients must be computed to update a single block. We then compared PAFW + SSC, BCAFW + SSC and BCFW. The results (Fig. 2) clearly show that PAFW + SSC and BCAFW + SSC outperform BCFW. All these findings are consistent with the theoretical results described in the “An active set identification criterion” section.

Fig. 2: Comparison using multistart between BCFW, PAFW + SSC and BCAFW + SSC. \(l = m = 100\) in the first row, \(l=40\) and \(m = 250\) in the second row, \(l=250\) and \(m=40\) in the third row

5.2 Monotonic basin hopping

We then consider the monotonic basin hopping approach (see, e.g., [30, 33]) described in Algorithm 3. The method computes a local optimizer \(\textbf{x}_{*, i}\) close to the current iterate \(\bar{\textbf{x}}_i\) (Step 2). Here \(\mathcal {M}\) is a local optimization algorithm and, given as input \(\mathcal {M}\) and \(\bar{\textbf{x}}_i\), the subroutine LO returns the result of applying \(\mathcal {M}\) starting from \(\bar{\textbf{x}}_i\), with a suitable stopping criterion, which in our case is given by a limit on the number of gradient computations, set to 10m. The sequence of best points found in the first i iterations \(\{\bar{\textbf{x}}_{*, i}\}\) is updated in Step 3, and in Step 5, \(\bar{\textbf{x}}_{i + 1}\) is chosen in a neighborhood of \(\bar{\textbf{x}}_{*, i}\). The neighborhood \(B(\textbf{x}, \gamma )\) for \(\textbf{x}\in \mathcal {C}\) and \(\gamma \in (0, 1]\) is given by

$$\begin{aligned} B(\textbf{x}, \gamma ) = \{\textbf{x}+ \gamma (\textbf{y}- \textbf{x}): \textbf{y}\in \mathcal {C}\} \,. \end{aligned}$$
(31)
Algorithm 3: Monotonic basin hopping

In the tests, we chose \(\textbf{y}\) uniformly at random in \(\mathcal {C}\) and set \(\bar{\textbf{x}}_{i + 1} = \bar{\textbf{x}}_{*, i} + \gamma (\textbf{y}- \bar{\textbf{x}}_{*, i})\), with \(\gamma = 0.25\). The methods we consider as subroutines in Step 2 are PAFW + SSC, BCAFW + SSC and BCFW. We set \(i_{\max } = 9\), and perform 10 runs of Algorithm 3, randomly initializing the starting point. We plot once again (Fig. 3) average and standard deviation for \(\{f(\bar{\textbf{x}}_{*, i}) - \tilde{f}^*\}\), with \(\tilde{f}^*\) estimating the global optimum (obtained by subtracting \(10^{-1}\) from the best solution found by the methods).
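The following Python sketch mirrors Algorithm 3 as described above (names and signatures are ours): local_opt stands for the local solver of Step 2 (PAFW + SSC, BCAFW + SSC or BCFW with the 10m gradient budget), f for the objective, and sample_C for a uniform sampler over \(\mathcal {C}\).

```python
import numpy as np

def monotonic_basin_hopping(x0, f, local_opt, sample_C, gamma=0.25, i_max=9, seed=0):
    """Sketch of the monotonic basin hopping scheme (Algorithm 3) described in the text."""
    rng = np.random.default_rng(seed)
    x_bar = np.array(x0, dtype=float)
    x_best = None
    for _ in range(i_max + 1):
        x_star = local_opt(x_bar)                    # Step 2: local optimizer close to x_bar
        if x_best is None or f(x_star) < f(x_best):  # Step 3: keep the best point found so far
            x_best = x_star
        y = sample_C(rng)                            # a uniformly random point in C
        x_bar = x_best + gamma * (y - x_best)        # Step 5: restart inside B(x_best, gamma), cf. (31)
    return x_best
```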

The results again show that PAFW + SSC and BCAFW + SSC find better solutions than BCFW, with BCAFW + SSC outperforming PAFW + SSC in most instances if \(l \le m\).

Fig. 3: Comparison using Monotonic Basin Hopping with BCFW, PAFW + SSC and BCAFW + SSC. From left to right: \(l = m = 100\); \(l=40\) and \(m = 250\); \(l=250\) and \(m=40\)

6 Conclusions

For a quite general optimization problem on product domains, we offer a seemingly new convergence theory, which ensures both convergence of objective values and (local) linear convergence of the iterates under widely accepted conditions, for block-coordinate FW variants. Convergence is global for \(\mu \)-strongly convex objectives, but we mainly focus on the non-convex case. In the case of randomized selection of the blocks, all results hold in expectation and require a further technical assumption. As usual, constants and rates are specified in terms of the Lipschitz constant L of the gradient map, the constant \(\mu \) used in the local Kurdyka-Łojasiewicz condition, and the parameter \(\tau \) in the so-called angle condition.

The results are complemented by an active set identification result for a specific structure of the product domain and suitable choices of a projection-free strategy (FW-approach with away steps for the search direction): it is proved that our framework identifies the support of a solution in a finite number of iterations.

To the best of our knowledge, this is the first time that both a linear convergence rate and an active set identification result are given for (bad step-free) block-coordinate FW variants, in an effort to narrow the research gap observed in [36].

In our preliminary experiments, numerical evidence clearly points out the advantages of our strategy of exploiting structural knowledge. On randomly generated non-convex Multi-StQPs, where easy instances were carefully avoided, our approach (AFW with parallel or randomized updates, both combined with the Short Step Chain strategy SSC) dominates the block-coordinate FW method with randomized updates.

We tested the resilience of our observations by employing two experimental setups, pure multistart and monotonic basin hopping. The same effects seem to prevail in both.

Instance construction was motivated by a stochastic variant of the StQP, varying both the domain dimension l and the number m of possible scenarios. In case \(l\le m\), there seems to be a slight edge towards the combination of AFW with randomized updates and SSC, compared to the parallel variant. This effect does not seem to occur when l is large in comparison to m, but it does not change the superiority over traditional block-coordinate FW methods.