Projection-free methods on product domains

Projection-free block-coordinate methods avoid high per-iteration computational cost while exploiting the particular structure of product domains. Frank-Wolfe-like approaches rank among the most popular methods of this type. However, as observed in the literature, there has been a gap between the classical Frank-Wolfe theory and the block-coordinate case, and most previous research has concentrated on convex objectives. This study also deals with the non-convex case and narrows the above-mentioned theory gap by combining a new, fully developed convergence theory with novel active set identification results, which ensure that the inherent sparsity of solutions can be exploited efficiently. Preliminary numerical experiments seem to justify our approach and also show promising results for obtaining global solutions in the non-convex case.


Introduction
We consider the problem min_{x∈C} f(x) (1), with objective f having an L-Lipschitz gradient, and feasible set C ⊆ R^n closed and convex. Furthermore, we assume that C is block separable, that is, C = C^(1) × ... × C^(m) (2), with C^(i) ⊂ R^{n_i} closed and convex for i ∈ [1 : m]. Equivalently, (1) can be written as the minimization of f(x) + g(x), with f smooth and g(x) = Σ_{i=1}^m χ_{C^(i)}(x^(i)) convex and block separable (see, e.g., [37] for an overview of methods for this class of problems); here χ_D : R^d → [0, +∞] denotes the indicator function of a convex set D ⊆ R^d, and for a block vector x ∈ R^n = R^{n_1} × ... × R^{n_m} we denote by x^(i) ∈ R^{n_i} the component corresponding to the i-th block, so that x = (x^(1), ..., x^(m)).
Block-coordinate gradient descent (BCGD) strategies (see, e.g., [4]) represent a standard approach to solving problem (1) in the convex case. When dealing with non-convex objectives, these methods can still be used as an efficient tool to perform local searches in probabilistic global optimization frameworks (see, e.g., [33] for further details). The way BCGD approaches work is easy to understand: at each iteration, they build a suitable model of the original function for a block of variables and then perform a projection onto the feasible set related to that block.
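As a concrete illustration of such a BCGD update, the following is a minimal sketch (not the paper's algorithm; the simplex blocks, the projection routine and the step size are our assumptions for the example):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex {x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def bcgd_step(x, grad_block, i, blocks, lr):
    """One BCGD iteration: gradient step on block i, then projection
    onto that block's feasible set (here assumed to be a unit simplex);
    all other blocks are left unchanged."""
    x = x.copy()
    lo, hi = blocks[i]
    x[lo:hi] = project_simplex(x[lo:hi] - lr * grad_block)
    return x
```

The projection step is exactly the per-iteration cost that the projection-free methods studied in this paper avoid.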
Projection-free (Frank-Wolfe-like) methods, by contrast, rely on a suitable oracle that minimizes, at each iteration, a linear approximation of the function over the original feasible set, returning a point in argmin_{x∈C} ⟨g, x⟩.
When C is defined as in (2), this decomposes into m independent problems thanks to the block separable structure of the feasible set. In turn, the resulting problems on the blocks can be solved in parallel, a possibility that has been widely explored in the literature, especially in the context of traffic assignment (see, e.g., [31]). In a big data context, performing a full update of the variables might still represent a computational bottleneck that needs to be properly handled in practice. This is the reason why block-coordinate variants of the classic Frank-Wolfe (FW) method have recently been proposed (see, e.g., [28,36,41]). The variant proposed in [28] for structured support vector machine training randomly selects a block at each iteration and performs an FW update on that block. Several improvements on this algorithm, e.g., adaptive block sampling, use of pairwise and away-step directions, or oracle call caching, are described in [36]; all of these work in a sequential fashion.
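The block decomposition of the linear minimization oracle mentioned above can be sketched as follows (assuming, purely for illustration, that each block is a unit simplex; names are ours):

```python
import numpy as np

def lmo_simplex(g):
    """Linear minimization oracle on the unit simplex:
    argmin_{x in simplex} <g, x> is the vertex e_i with i = argmin g."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

def lmo_product(g, blocks):
    """On a block-separable set C = C(1) x ... x C(m) the LMO decomposes:
    each block subproblem can be solved independently (even in parallel)."""
    v = np.zeros_like(g)
    for lo, hi in blocks:
        v[lo:hi] = lmo_simplex(g[lo:hi])
    return v
```

The per-block loop could be replaced by a parallel map on a multicore machine, which is the point made in the text.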
However, in case one wants to take advantage of modern multicore architectures or of distributed clusters, parallel and distributed versions of the block-coordinate FW algorithm are also available [41]. It is important to highlight that all the papers mentioned above only consider convex programming problems and use random sampling variants as the main block selection strategy.
Furthermore, as noticed in [36], the standard convergence analysis for FW variants (e.g., pairwise and away-step FW) cannot be easily extended to the block-coordinate case. In particular, there has been no extension in this setting of the well-known linear convergence rate guarantees for FW variants applied to strongly convex objectives (see [14] and references therein). This is mainly due to the difficulties in handling the bad/short steps (i.e., those steps that do not make good progress and are taken to guarantee feasibility of the iterate) within a block-coordinate framework. In [36], the authors hence extend the convergence analysis of FW variants to the block-coordinate setting under the strong assumption that there are no bad steps, claiming that novel proof techniques are required to carry out the analysis in general and close the gap between FW and BCFW in this context.
Here we focus on the non-convex case and define a new general block-coordinate algorithmic framework that gives flexibility in the use of both block selection strategies and FW-like directions. Such flexibility is mainly obtained thanks to the way we perform approximate minimizations in the blocks. At each iteration, after selecting at least one block, we use the Short Step Chain (SSC) procedure described in [39], which skips gradient computations in consecutive short steps until proper conditions are satisfied, to carry out the approximate minimization in the selected blocks.
Concerning the block selection strategies, we explore three different options. The first one we consider is a parallel or Jacobi-like strategy (see, e.g., [5]), where the SSC procedure is performed for all blocks. This obviously reduces the computational burden with respect to the use of the SSC in the whole variable space (see, e.g., [39]) and eventually enables the use of multicore architectures to perform those tasks in parallel. The second one is random sampling (see, e.g., [28]), where the SSC procedure is performed at each iteration on a randomly selected subset of blocks. Finally, we have a variant of the Gauss-Southwell rule (see, e.g., [34]), where we perform the SSC in all blocks and then select a block that most violates the optimality conditions. Such a greedy rule may make more progress in the objective function, since it uses first-order information to choose the right block, but is, in principle, more expensive than the other options mentioned before (notice that the SSC is performed, at each iteration, for all blocks). Furthermore, we consider the following projection-free strategies: Away-step Frank-Wolfe (AFW), Pairwise Frank-Wolfe (PFW), and Frank-Wolfe with in-face directions (FDFW); see, e.g., [39] and references therein for further details. The AFW and PFW strategies depend on a set of "elementary atoms" A such that C = conv(A). Given A, for a base point x ∈ C we can define S_x = {S ⊆ A : x is a proper convex combination of all the elements in S}, the family of possible active sets for a given point x. For x ∈ C and S ∈ S_x, d_PFW is a PFW direction with respect to the active set S and gradient −g if and only if d_PFW = s − q with s ∈ argmax_{s∈C} ⟨s, g⟩ and q ∈ argmin_{q∈S} ⟨q, g⟩.
Similarly, given x ∈ C and S ∈ S_x, d_AFW is an AFW direction with respect to the active set S and gradient −g if and only if ⟨g, d_AFW⟩ ≥ max{⟨g, d_FW⟩, ⟨g, d_AS⟩}, where d_FW = s − x is a classic Frank-Wolfe direction and d_AS = x − q is the away direction. The FDFW only requires the current point x and the gradient −g to select a descent direction (i.e., it does not need to keep track of the active set), with the away vertex selected in F(x), the minimal face of C containing x; the selection criterion is then analogous to the one used by the AFW.

From a theoretical point of view, this new algorithmic framework enables us to give:
- a local linear convergence rate for any choice of block selection strategy and FW-like direction. This result is obtained under a Kurdyka-Lojasiewicz (KL) property (see, e.g., [3], [7] and [8]) and a tailored angle condition (see, e.g., [39]). Thanks to the way we handle short steps in our framework, we are able to extend the analysis given for FW variants to the block-coordinate case and thus to close the relevant gap in the theory highlighted in [36];
- a local active set identification result (see, e.g., [12,13,15,24]) for a specific structure of the Cartesian product defining the feasible set C, suitable choices of projection-free strategy (i.e., the AFW direction is used), and general smooth non-convex objectives. In particular, we prove that our framework identifies the support of a solution in finite time. This theoretical feature allows us to reduce the dimension of the problem at hand and, consequently, the overall computational cost of the optimization procedure.

This is, to the best of our knowledge, the first time that both a (bad step free) linear convergence rate and an active set identification result are given for block-coordinate FW variants. In particular, we solve the open question from [36] discussed above, proving that the linear convergence rate of FW variants can indeed be extended to the block-coordinate setting. Furthermore, our results guarantee, for the first time in the literature of projection-free optimization methods, identification of the local active set in a single iteration without a tailored active set strategy.
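The direction definitions above can be made concrete for a single simplex block whose atoms A are the standard basis vectors (a sketch under that assumption; all function names are illustrative, not from the paper):

```python
import numpy as np

def fw_away_pairwise_directions(x, g, vertices):
    """Candidate directions on a simplex block, following the text:
    g = -gradient, the active set S consists of the vertices with
    positive weight in x.  Returns (d_FW, d_AS, d_PFW)."""
    scores = vertices @ g
    s = vertices[np.argmax(scores)]                  # FW vertex: argmax <s, g>
    active = [v for v in vertices if x @ v > 1e-12]  # active atoms (simplex case)
    q = min(active, key=lambda v: v @ g)             # away vertex: argmin <q, g>
    d_fw = s - x
    d_as = x - q
    d_pfw = s - q
    return d_fw, d_as, d_pfw

def afw_direction(x, g, vertices):
    """AFW picks the better of d_FW and d_AS w.r.t. the slope <g, d>."""
    d_fw, d_as, _ = fw_away_pairwise_directions(x, g, vertices)
    return d_fw if g @ d_fw >= g @ d_as else d_as
```

For general atom sets A the active set must be tracked explicitly, as the text notes; the simplex case is special because the active atoms can be read off the support of x.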
We also report some preliminary numerical results on a specific class of structured problems with a block separable feasible set. Those results show that the proposed framework outperforms the classic block-coordinate FW and, thanks to its flexibility, can be effectively embedded into a probabilistic global optimization framework, thus significantly boosting its performance.
The paper is organized as follows. Section 2 describes the details of our new algorithmic framework. An in-depth analysis of its convergence properties is reported in Section 3. An active set identification result is reported in Section 4. Preliminary numerical results, focusing on the computational analysis of both the local identification and the convergence properties of our framework, are reported in Section 5. Finally, some concluding remarks are included in Section 6.

Notation
For a closed and convex set C ⊂ R^h we denote by π(C, x) the projection of x ∈ R^h onto C, and by T_C(x) the tangent cone to C at x ∈ C. For g ∈ R^h we also use π_x(g) as a shorthand for ∥π(T_C(x), g)∥. We denote by ŷ the vector y/∥y∥ for y ≠ 0, and ŷ = 0 otherwise. We finally denote by B̄_r(x) and B_r(x) the closed and open balls of radius r centered at x.
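The normalization convention above can be stated as a tiny helper (the name is ours):

```python
import numpy as np

def hat(y):
    """The normalized vector y/||y|| for y != 0, and the zero vector
    otherwise, matching the Notation paragraph."""
    nrm = np.linalg.norm(y)
    return y / nrm if nrm > 0 else np.zeros_like(y)
```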

A new block-coordinate projection-free method
The block-coordinate framework we consider here applies the Short Step Chain (SSC) procedure from [39], described below as Algorithm 2, to some of the blocks at every iteration. A detailed scheme is specified as Algorithm 1 (Block-coordinate method with Short Step Chain (SSC) procedure); recall the notation x = (x^(1), ..., x^(m)) with x^(i) ∈ R^{n_i}. In Algorithm 1, we perform two main operations at each iteration. First, in Step 3, we pick a suitable subset of blocks M_k according to a given block selection strategy. We then update (Steps 4 and 5) the variables related to the selected blocks by means of the SSC procedure, while keeping all the variables in the other blocks unchanged.
We now briefly recall the SSC procedure from [39], designed to recycle the gradient in consecutive bad steps until suitable stopping conditions are met, in Algorithm 2.

Algorithm 2 Short Step Chain procedure - SSC(x, g)
1: Initialize y_0 = x, j = 0
2: Compute d_j = A(y_j, g) and the maximal stepsize α_max^(j) = α_max(y_j, d_j)
3: if d_j = 0 then return y_j
4: end if
5: compute an auxiliary step size β_j
6: let α_j = min(α_max^(j), β_j)
7: y_{j+1} = y_j + α_j d_j
8: if α_j = β_j then return y_{j+1}
9: end if
10: j = j + 1, go to Step 2

By A we indicate a projection-free strategy to generate first-order feasible descent directions for smooth functions on the block where the SSC is applied (e.g., FW, PFW, AFW directions). Since the gradient, −g, is constant during the SSC procedure, it is easy to see that the procedure represents an application of A to minimize the linearized objective f_g(z) = ⟨−g, z − x⟩ + f(x), with suitable stepsizes and stopping condition. More specifically, after a stationarity check (see Steps 2-4), the stepsize α_j is the minimum of an auxiliary stepsize β_j > 0 and the maximal stepsize α_max^(j) (which we always assume to be strictly positive). The point y_{j+1} generated at Step 7 is always feasible since α_j ≤ α_max^(j). Notice that if the method A used in the SSC performs a FW step (see equation (6) for the definition of FW step), then the SSC terminates, either with α_j = β_j or with y_{j+1} a global minimizer of f_g.
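The control flow of Algorithm 2 can be sketched as follows (direction_fn, max_step_fn and beta_fn are placeholders for the method A, the maximal feasible step size and the auxiliary step size; the stationarity tolerance is our assumption):

```python
import numpy as np

def ssc(x, g, direction_fn, max_step_fn, beta_fn, max_iter=100):
    """Short Step Chain sketch: the (negative) gradient g is frozen, and the
    method A (direction_fn) is applied until either a stationary point of the
    linearized objective is reached (d = 0) or the auxiliary step size beta
    is not truncated by the maximal feasible step size."""
    y = x.copy()
    for _ in range(max_iter):
        d = direction_fn(y, g)
        if np.linalg.norm(d) < 1e-12:
            return y                      # stationarity check (Steps 2-4)
        a_max = max_step_fn(y, d)         # maximal feasible step size
        beta = beta_fn(y, d)              # auxiliary (trust-region) step size
        alpha = min(a_max, beta)
        y = y + alpha * d                 # always feasible: alpha <= a_max
        if alpha == beta:                 # chain stops when beta wins
            return y
    return y
```

A single gradient evaluation thus serves an entire chain of short steps, which is the computational point of the procedure.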
The auxiliary step size β_j (see Step 5 of the SSC procedure) is defined as the maximal feasible stepsize (at y_j, along d_j) for the trust region B̄_{∥g∥/(2L)}(x + g/(2L)). This guarantees the sufficient decrease condition and hence a monotone decrease of f in the SSC. For further details see [39].

Block selection strategies
As briefly mentioned in the introduction, we will consider three different block selection strategies in our analysis. The first one is a parallel or Jacobi-like strategy (see, e.g., [5]). In this case, we select all the blocks at each iteration. As we already observed, this is computationally cheaper than handling the whole variable space at once. Furthermore, multicore architectures might be used to perform those tasks in parallel. A definition of the strategy is given below. The second strategy is a variant of the GS rule (see, e.g., [34]), where we first perform the SSC in all blocks and then select a block that most violates the optimality conditions. The formal definition is reported below.
Finally, we have random sampling (see, e.g., [28]). Here we randomly generate one index at each iteration with uniform probability distribution. The definition we have in this case is the following:
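A minimal sketch of the three selection rules just described (the Gauss-Southwell scores, i.e., the per-block optimality-violation measures, are assumed to be computed by the caller; names are ours):

```python
import numpy as np

def select_blocks(strategy, m, rng=None, scores=None):
    """Return the indices of the blocks to update at the current iteration.
    'parallel': all blocks; 'random': one uniformly sampled block;
    'gs': a Gauss-Southwell-style pick of the block with the largest
    optimality-violation score."""
    if strategy == "parallel":
        return list(range(m))
    if strategy == "random":
        rng = rng or np.random.default_rng()
        return [int(rng.integers(m))]
    if strategy == "gs":
        return [int(np.argmax(scores))]
    raise ValueError(strategy)
```

Note that the GS rule is the expensive one: computing the scores requires first-order information on every block, as the text points out.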

Convergence analysis
In this section, we analyze the convergence properties of our algorithmic framework. In particular, we show that, under a suitably defined angle condition on the blocks and a local KL condition on the objective function, we obtain a linear convergence rate for any block selection strategy used. The convergence analysis presented in this section extends the results given in [39] to the block-coordinate setting, a demanding task which is by no means straightforward. Hence, some novel arguments are required for this extension; they are introduced here and then described in detail in the appendix.
Our convergence framework makes use of the angle condition introduced in [38,39]. Such a condition ensures that the slope of the descent direction selected by the method is optimal up to a constant. We now recall this angle condition. For x ∈ C and g ∈ R^n we first define the directional slope lower bound DSB_A(C, x, g) if x is not stationary for −g; otherwise we set DSB_A(C, x, g) = 1. We then define the slope lower bound as SB_A(C, P) = inf over x ∈ P of the directional slope lower bound. We use SB_A(C) as a shorthand for SB_A(C, C), and say that the angle condition holds for the method A if SB_A(C) = τ > 0.

Remark 1 AFW, PFW and FDFW all satisfy the angle condition when C is a polytope. A detailed proof of this result is reported in [39], together with some other examples of methods satisfying the angle condition for convex sets with smooth boundary described in [38].
We now report the local KL condition used to analyze the convergence of our algorithm. The same condition was used previously in [39, Assumption 2.1].
Assumption 1 Given a stationary point x* ∈ C, there exist η, δ > 0 such that for every x ∈ B_δ(x*) ∩ C with f(x) < f(x*) + η the KL-type error bound f(x) − f(x*) ≤ π_x(−∇f(x))²/(2µ) holds (14). When dealing with convex programming problems, a Hölderian error bound with exponent 2 on the solution set implies condition (14); see [9, Corollary 6]. Therefore, our assumption holds for µ-strongly convex functions (see, e.g., [27]), and in particular in the setting of the open question from [36] discussed in the introduction. It is, however, important to highlight that the error bound (14) holds in a variety of both convex and non-convex settings (see [39] for a detailed discussion on this matter). An interesting example for our analysis is the setting where f is a (non-convex) quadratic, i.e., f(x) = x^⊤Qx + b^⊤x, and C is a polytope.
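For instance, if f is µ-strongly convex, a bound of the form used in (14) follows from a short computation (a sketch; we use that x* − x lies in the tangent cone T_C(x), so that ⟨−∇f(x), x* − x⟩ ≤ π_x(−∇f(x)) ∥x* − x∥):

```latex
f(x^*) \;\ge\; f(x) + \langle \nabla f(x),\, x^* - x\rangle + \tfrac{\mu}{2}\,\|x^*-x\|^2
\;\Longrightarrow\;
f(x) - f(x^*) \;\le\; \langle -\nabla f(x),\, x^* - x\rangle - \tfrac{\mu}{2}\,\|x^*-x\|^2,

f(x) - f(x^*) \;\le\; \max_{t \ge 0}\Big( \pi_x(-\nabla f(x))\, t \;-\; \tfrac{\mu}{2}\, t^2 \Big)
\;=\; \frac{\pi_x(-\nabla f(x))^2}{2\mu}.
```

The maximization over t = ∥x* − x∥ in the last step is attained at t = π_x(−∇f(x))/µ.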
We now report our main convergence result.
Theorem 1 Let Assumption 1 hold at x*. Consider the sequence {x_k} generated by Algorithm 1, and assume that: the angle condition (13) holds in every block for the same τ > 0; the SSC procedure always terminates in a finite number of steps.
Then, there exists δ̄ > 0 such that, if x_0 ∈ B_δ̄(x*), we obtain a linear convergence rate on f(x_k) − f(x*) for the parallel block selection strategy as well as for the GS block selection strategy; for the random block selection strategy we obtain, under the additional condition that (19) holds for some δ > 0, a linear rate in expectation, together with x_k → x* almost surely.

This convergence result extends [39, Theorem 4.2] to our block-coordinate setting. However, since the SSC is here applied independently to different blocks, we cannot directly apply the results from [39]. Instead, we combine the properties of the SSC applied in single blocks by exploiting the structure of the tangent cone for product domains, T_C(x) = T_{C^(1)}(x^(1)) × ... × T_{C^(m)}(x^(m)). This requires proving stronger properties for the sequence generated by the SSC than those presented in [39]. The details, with references to relevant results from [39], can be found in the appendix. Finite termination of the SSC procedure is instead directly ensured by the results proved in [38,39], in particular for the AFW, PFW and FDFW applied on polytopes.
Remark 2 If the feasible set C is a polytope and the objective function f satisfies condition (14) at every point generated by the algorithm, with fixed f(x*), then Algorithm 1 with AFW (PFW or FDFW) in the SSC converges at the rates given above. Condition (14) holds for µ-strongly convex functions, and hence in those cases our algorithm globally converges at the rates given in Theorem 1.
Remark 3 Both the parallel and the GS strategy give the same rate with different constants. In particular, the constant ruling the GS case depends on the number of blocks used (the larger the number of blocks, the worse the rate) and is larger than the one for the parallel case.

Remark 4
The random block selection strategy has the same rate as the GS strategy, but it holds in expectation. In particular, the constant ruling the rate is the same as in the GS case, and hence depends on the number of blocks used. Note that a further technical assumption (19) on x* is needed in this case.

Active set identification
We now report an active set identification result for our framework. We focus on Algorithm 1 with AFW in the SSC, and assume that strict complementarity holds and that the sets in the Cartesian product have a specific structure (here e ∈ R^n denotes the vector with all components equal to 1). We now state our main identification result; a detailed proof is included in the appendix.
Theorem 2 Under the above assumptions on C, let A^(i) be the AFW for i ∈ [1 : m], and let strict complementarity conditions hold at x* ∈ C.
- If {x_k} is generated by Algorithm 1 with parallel selection, then there exists a neighborhood U of x* such that if x_k ∈ U then supp(x_{k+1}) = supp(x*).
- If {x_k} is generated by Algorithm 1 with randomized or GS selection, then there exists a neighborhood U of x* such that, if x_k ∈ U, the support is identified on the blocks updated at iteration k.

When the sequence generated by our algorithm converges to the point x*, it is then easy to see that the support of the iterates matches the final support of x* for k large enough. This result has relevant practical implications, especially when handling sparse optimization problems. Since the algorithm iterates have a constant support when k is large, we can simply focus on the few support components and discard the others in this case. We can hence exploit this by embedding sophisticated tools (e.g., caching strategies, second-order methods) in the algorithm, thus obtaining a significant speed-up in the end.
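In practice, the identification property can be monitored by tracking the support of the iterates; a sketch (the tolerance and helper names are ours, not the paper's):

```python
import numpy as np

def support(x, tol=1e-10):
    """Indices of (numerically) nonzero components."""
    return frozenset(np.nonzero(np.abs(x) > tol)[0])

def identification_iteration(iterates, x_star, tol=1e-10):
    """First iteration index after which supp(x_k) stays equal to supp(x*),
    mirroring the finite identification guaranteed by Theorem 2."""
    target = support(x_star, tol)
    for k in reversed(range(len(iterates))):
        if support(iterates[k], tol) != target:
            return k + 1
    return 0
```

Once the support has stabilized, the remaining computation can be restricted to the identified components, which is the speed-up discussed above.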

Numerical results
We report here some preliminary numerical results for a non-convex quadratic optimization problem, referred to as Multi-StQP [16], on a product of (here identical) simplices. The matrix Q was generated in such a way that the solutions of problem (24) have sparse components but differ from vertices. This is in fact the setting where FW variants have proved to be most effective [15,39]. In order to obtain the desired property, we consider a perturbation of a stochastic StQP [11], given {Q̄_i}_{i∈[1:m]} with p_i the probability of StQP i. Equivalently, (25) is an instance of problem (24) with block-diagonal structure. In our tests, we added to the stochastic StQP a perturbation coupling the blocks. More precisely, the matrix Q was set equal to Q̄ + ε Q̃, for Q̃ a random matrix with standard Gaussian independent entries. The coefficient ε was set equal to 1/(2m²). We set Q̄_i = Ā_i + αI_l, for α = 0.5 and Ā_i the adjacency matrix of an Erdős-Rényi random graph, where each pair of vertices has probability p of being connected by an edge, independently from the other pairs. Hence, for i ∈ [1 : m], the i-th block problem is a regularized maximum-clique formulation, where each maximal clique corresponds to a unique strict local maximizer with support equal to its vertices, and conversely (see [10] and references therein). The probability p is set, for s the nearest integer to 0.4l, so that the expected number of cliques of size ≈ 0.4l is 1 (see, e.g., [2]). Notice that the perturbation term Q̃ ensures that problem (24) cannot be solved by optimizing each block separately.
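The instance construction just described can be sketched as follows (a loose illustration: the edge probability and scenario weights are free parameters here, whereas the text ties them to the expected clique count and to the stochastic StQP; we also symmetrize the Gaussian perturbation, which is our choice):

```python
import numpy as np

def multi_stqp_matrix(l, m, alpha=0.5, eps=None, p_edge=0.5, seed=0):
    """Block-diagonal part with blocks p_i * (A_i + alpha*I), for A_i an
    Erdos-Renyi adjacency matrix, plus a Gaussian coupling perturbation
    eps * Qtilde with eps = 1/(2 m^2) by default."""
    rng = np.random.default_rng(seed)
    eps = eps if eps is not None else 1.0 / (2 * m**2)
    n = l * m
    p = rng.dirichlet(np.ones(m))          # scenario probabilities (assumed)
    Q = np.zeros((n, n))
    for i in range(m):
        upper = np.triu(rng.random((l, l)) < p_edge, 1)
        A = (upper + upper.T).astype(float)  # symmetric 0/1 adjacency
        sl = slice(i * l, (i + 1) * l)
        Q[sl, sl] = p[i] * (A + alpha * np.eye(l))
    Qtilde = rng.standard_normal((n, n))
    return Q + eps * (Qtilde + Qtilde.T) / 2
```

The off-diagonal perturbation is what couples the blocks, so that the problem cannot be solved block by block.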
We remark here that different ways to build large StQPs starting from smaller instances while preserving the structure of their solutions have been discussed in [17]. However, while the resulting problems do not decouple on the feasible set of the larger problem, they still decouple on the product of the feasible sets of the smaller instances, and for our purposes they are equivalent to the block-diagonal structure.
We tested four methods in total: AFW + SSC with parallel, GS and randomized updates (PAFW + SSC, GSAFW + SSC and BCAFW + SSC, respectively), and FW with randomized updates (BCFW, coinciding with the block-coordinate FW introduced in [28]). Our tests focused on the local identification and convergence properties of our methods.
The code was written in Python using the numpy package, and the tests were performed on an Intel Core i9-12900KS CPU 3.40GHz, 32GB RAM.The codes relevant to the numerical tests are available at the following link: https://github.com/DamianoZeffiro/Projection-free-product-domain.

Multistart
We first considered a multistart approach, where the results are averaged across 20 runs, choosing 4 starting points for each of 5 random initializations of the objective.
We measure both the optimality gap (error estimate) and the sparsity (number of nonzero components, ℓ0 norm) of the iterates, reporting average and standard deviation in the plots. The estimated global optimum used in the optimality gap is obtained by subtracting 10^-5 from the best local solution found by the algorithms. We mostly consider the performance with respect to block gradient computations, with one gradient counted each time the SSC is performed in one of the blocks, as in previous works (see, e.g., [28]). In some tests involving the GSAFW + SSC, we consider instead block updates, with one block update counted each time the algorithm modifies the current iterate in one of the blocks. It is important to highlight that, since at each block update the gradient is constant and only one linear minimization is required at the beginning of the SSC, the number of gradient computations for our algorithms also coincides with the number of linear minimizations on the blocks for the FW variants we consider. We first compare PAFW + SSC, BCAFW + SSC and GSAFW + SSC (Figure 1). As expected, while GSAFW + SSC shows good performance with respect to block updates, it performs very poorly with respect to block gradient computations, since at every iteration m gradients must be computed to update a single block. We then compared PAFW + SSC, BCAFW + SSC and BCFW. The results (Figure 2) clearly show that PAFW + SSC and BCAFW + SSC outperform BCFW. All these findings are consistent with the theoretical results described in Section 7.2.
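The two metrics used in the plots can be computed as in the following sketch (the 10^-5 shift matches the description above; the function names are ours):

```python
import numpy as np

def optimality_gap(f_vals, f_best, shift=1e-5):
    """Estimated gap f(x_k) - (f_best - shift), with f_best the best
    local solution value found by the algorithms."""
    return np.asarray(f_vals) - (f_best - shift)

def sparsity(x, tol=1e-10):
    """l0 'norm': number of (numerically) nonzero components of the iterate."""
    return int(np.count_nonzero(np.abs(np.asarray(x)) > tol))
```

The shift keeps the gap strictly positive, so that it can be plotted on a logarithmic scale.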

Monotonic basin hopping
We then consider the monotonic basin hopping approach (see, e.g., [30,33]) described in Algorithm 3. The method computes a local optimizer x*,i close to the current iterate x̄_i (Step 2). Here M is a local optimization algorithm and, given as input M and x̄_i, the subroutine LO returns the result of applying M starting from x̄_i, with a suitable stopping criterion, which in our case is a limit on the number of gradient computations, set to 10m. The sequence of best points found in the first i iterations, {x*,i}, is updated in Step 3, and in Step 5, x̄_{i+1} is chosen in a neighborhood of x*,i. The neighborhood B(x, γ) for x ∈ C and γ ∈ (0, 1] is given below.
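The monotonic basin hopping loop can be sketched as follows (local_opt and perturb stand in for the subroutine LO with its gradient-computation budget and for sampling in the neighborhood B(x, γ); names are ours):

```python
import numpy as np

def basin_hopping(f, local_opt, perturb, x0, n_iter=50, rng=None):
    """Monotonic basin hopping sketch: locally optimize, keep the best
    point found so far, and restart from a perturbation of it."""
    rng = rng or np.random.default_rng()
    x_best = local_opt(x0)
    for _ in range(n_iter):
        x = local_opt(perturb(x_best, rng))
        if f(x) < f(x_best):              # monotone: accept only improvements
            x_best = x
    return x_best
```

In this scheme the local optimizer would be one of the block-coordinate FW variants studied above, used as the local search tool mentioned in the introduction.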

Conclusions
For a quite general optimization problem on product domains, we offer a seemingly new convergence theory, which ensures both convergence of objective values and (local) linear convergence of the iterates under widely accepted conditions, for block-coordinate FW variants. Convergence is global for µ-strongly convex objectives, but we mainly focus on the non-convex case. In the case of randomized selection of the blocks, all results hold in expectation and need a further technical assumption. As usual, constants and rates are specified in terms of the Lipschitz constant L of the gradient map, the constant µ used in the local Kurdyka-Lojasiewicz condition, and the parameter τ in the so-called angle condition.
The results are complemented by an active set identification result for a specific structure of the product domain and suitable choices of a projection-free strategy (an FW approach with away steps for the search direction): it is proved that our framework identifies the support of a solution in a finite number of iterations.
To the best of our knowledge, this is the first time that both a linear convergence rate and an active set identification result are given for (bad step-free) block-coordinate FW variants, in an effort to narrow the research gap observed in [36].
In our preliminary experiments, numerical evidence clearly points out the advantages of our strategy in exploiting structural knowledge. On randomly generated non-convex Multi-StQPs, where easy instances were carefully avoided, our approach (AFW with parallel or randomized updates, both combined with the Short Step Chain strategy SSC) dominates the block-coordinate FW method with randomized updates.
We tested the resilience of our reported observations by employing two experimental setups, pure multistart and monotonic basin hopping. The same effects seem to prevail in both.
Instance construction was motivated by a stochastic variant of the StQP, varying both the domain dimension l and the number m of possible scenarios. In the case l ≤ m there seems to be a slight edge for the combination of AFW with randomized updates and SSC over the parallel variant. This effect does not seem to occur when l is large in comparison to m, but it does not change the superiority over traditional block-coordinate FW methods.

Proofs
In the rest of this section, we always assume that the SSC terminates in a finite number of steps and that the angle condition holds. The first lemma is related to the single-block setting, and strengthens some of the properties proved for the SSC in [39, Proposition 4.1].
Proof Let B = B̄_{∥g∥/(2L)}(x + g/(2L)) and let T be such that w_{k+1} = y_T. By [39, (4.4)] we have that (33) holds for every z ∈ B (in place of y), and therefore, as desired, for every such z, where the inequality follows from the bound on ⟨g, d_l⟩. Thus, for proving (30), in the rest of the proof it will be enough to prove (35). Furthermore, since by definition of the SSC the scalar product ⟨g, y_j⟩ is increasing in j, we have the corresponding monotonicity along the chain. We distinguish four cases, according to how the SSC terminates. In the first two, we show that we can choose the last step, w̄ = y_T; in the third, the penultimate choice w̄ = y_{T−1} satisfies all conditions; and in the fourth case, an intermediate step is an appropriate choice. In Cases 1 and 2, the abbreviation introduced above gives exactly (35), so that w̄_k = w_{k+1} = y_T satisfies the desired conditions. In Case 3, we obtain (35) for l = T − 1; combining (37) with (38) we also obtain the remaining condition, so that in particular we can take w̄_k = y_{T−1}.

Case 4: y_T ∈ ∂B. The condition w_{k+1} = y_T ∈ ∂B can be rewritten accordingly. For every j ∈ [0 : T] we have (40) and (41). We now want to prove that (42) holds for every j ∈ [0 : T]. Indeed, we have the chain of relations where we used (40) in the first equality, (41) in the second, ⟨g, d_j⟩ ≥ 0 for every j in the first inequality, and y_j ∈ B in the second equality, which proves (42). We also have the bound where we used (43) in the first inequality and (40) in the second (equality).

⊓ ⊔
We denote by SSC(w_k, g) a point w̄_k with the properties stated in the above lemma. It is also useful to define U_0 as the connected component of {x ∈ C : f(x) ≤ f(x_0)} containing x_0. The next result shows how, in our block-coordinate setting, the assumption of Theorem 1 on U_0 allows us to retrieve a lower bound on the objective for points generated by the SSC. This lower bound is analogous to the one required in [39, Theorem 4.2].
where we used the standard Descent Lemma in the first inequality, and the second inequality follows by the definition of BC_k. From (45) it follows that {f(x_k)} is decreasing, and by induction we can conclude {x_{k+1}, x̄_k} ⊂ U_0. Finally, f(y) ∈ [f(x*), f(x_k)] for y ∈ {x_{k+1}, x̄_k}, where the lower bound follows from the assumption that f(x*) is a minimum in U_0, and the upper bound follows from (45).
⊓⊔ In the following lemma, the properties of the SSC proved in Lemma 1 for single blocks are combined to obtain analogous properties on the whole product of blocks, and the KL condition is then used to lower bound suitable improvement measures by an optimality gap for the objective. We would like to highlight that, unlike in the single-block case, this optimality gap is measured with respect to an auxiliary point which is not necessarily among those generated by the algorithm. The proof of the linear convergence rate hence requires proper handling in this case.
Lemma 3 Let {x_k} be a sequence generated by Algorithm 1, and assume that the angle condition holds for the method A^(i) with the same τ, for all i ∈ [1 : m]. If the KL property (14) holds at x̄_k, then, abbreviating g = −∇f(x_k), inequalities (46) and (47) hold.

Proof Observe that by the Lipschitz continuity of the gradient we have inequality (48), and thus (49), where we applied Jensen's inequality to (48) in the first inequality, and (30) together with (31) in the second. Thus we can write the chain of inequalities where we used (49) in the first inequality and the KL property in the second. This proves (46). Using the standard Descent Lemma, we can give the upper bound (51), where we used (33) in the second inequality. We can finally prove (47), where we used (32) in the first inequality and (51) in the second one.

⊓ ⊔
The next result, which directly follows from the previous lemma, explicitly bounds the improvement of the objective from below by the optimality gap introduced above.
Lemma 4 Let {x_k} be a sequence generated by Algorithm 1, and assume that the angle condition holds for the method A^(i) with the same τ, for all i ∈ [1 : m]. Let x̄_k = (SSC(x_k^(i), g^(i)))_{i=1}^m. Then, if the KL property (14) holds at x̄_k, corresponding bounds hold for parallel updates, for GS updates, and for random updates.

Proof We first prove the inequality for parallel updates. We have the chain where the first inequality follows from (45) and the second inequality from (46), since with the notation introduced in Lemma 3 we have by definition x_{k+1} = x̄_k. For GS updates, we have the chain where in the first inequality we used the standard Descent Lemma and (33) in the second inequality; the equality follows by definition of GS updates; in the fourth inequality we applied (33) again, and (46) in the last one. Finally, for random updates we have, denoting by i(k) = j the event that the index chosen at step k is j, the chain where the first inequality follows from (45), we used P({i(k) = j}) = 1/m in the second equality, and (46) in the last inequality.
⊓⊔ In the next two lemmas, we relate the improvement measured with respect to the auxiliary point to the true improvement of the objective, and thus manage to extend the linear convergence rate of [39, Lemma 4.3] to the block-coordinate setting.
Lemma 5 Let {x k } be a sequence generated by Algorithm 1, and assume that the angle condition holds for the method A (i) with the same τ for all i ∈ [1 : m]. Then corresponding estimates hold for parallel updates, for GS updates and for random updates. Proof For parallel updates, we use the standard Descent Lemma in the first inequality, (33) in the second inequality, and (47) in the last one. The proof follows analogously for GS updates, after noticing the bound on ⟨g, ·⟩ shown in (57), and for random updates, using the corresponding estimates respectively. ⊓ ⊔

Lemma 6 Let {x k } be a sequence generated by Algorithm 1, and assume that the angle condition holds for the method A (i) with the same τ for all i ∈ [1 : m]. Then, if the KL property (14) holds at x k , contraction estimates on f (x k ) − f (x * ) hold for parallel updates, for GS updates and for random updates. Proof First observe that, since τ ∈ [0, 1] and µ ≤ L, the required bound holds. Then, combining (53) and (59) and rearranging, the claim follows for parallel updates. The thesis follows analogously for GS and random updates.

⊓ ⊔
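For reference, the standard Descent Lemma invoked repeatedly in the proofs above reads, for f with L-Lipschitz gradient (the paper's numbered version of this inequality is not reproduced in this excerpt):

```latex
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\|y - x\|^2
\qquad \text{for all } x, y \in \mathbb{R}^n .
```

In the lemmas above it is applied to consecutive iterates to bound the per-iteration decrease of the objective.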
Given the previous results for the block-coordinate setting, the remaining part of the proof is a straightforward adaptation of the arguments used in the proof of [39, Theorem 4.2].
Proof (of Theorem 1) We need to prove that the KL property (14) holds at the points of {x k }. The bounds on f (x k ) − f (x * ) then follow immediately by induction from Lemma 6, and in turn the bounds on ∥x k − x * ∥ follow as in the proof of [39, Lemma 4.3].
For random updates, we can take δ̄ < δ small enough so that f (x 0 ) < f (x * ) + η. Then by construction the KL property (14) holds in U 0 , and since {x k } is contained in U 0 by Lemma 2, (14) holds in particular at the points of {x k }.
For parallel updates, thanks to Lemma 2 we have that {f (x k )} is decreasing and f (x k ), f (x̄ k ) ≥ f (x * ). It can then be proved, with an argument analogous to the proof of [39, Theorem 4.2], that for δ small enough (14) holds at the points of {x k }. We include the argument here for completeness. Let f k = f (x k ) − f (x * ), and let δ̄ < δ/2 be defined as in the proof of [39, Theorem 4.2], so that (71) holds with q = q P here. We now want to prove that ⋃ i∈[0:k−1] {x i , x̄ i } ∪ {x k } ⊂ B δ̄ (x * ) for every k ∈ N, by induction on k. Notice that x 0 ∈ B δ̄ (x * ) by construction. For the inductive step, we use (45) in the first inequality, while the second inequality can be derived from [39, Lemma 8.1] as in the proof of [39, Theorem 4.2]. But then the desired bound follows, where we used (72) together with (45) in the second inequality, f k+1 = f (x k+1 ) − f (x * ) ≥ 0 in the third inequality, and (71) together with f 0 ≥ f k in the last inequality. We now have x i ≥ min i∈supp(x * ) x * ,i + o(1), where we used | d̄ i | ≤ ∥ d̄∥ ≤ 1 in the first inequality, supp( d̄) ⊆ supp(x * ) in the second inequality, and x i → x * ,i > 0 in the third one.

⊓ ⊔
For x ∈ C, we define the quantities λ i (x, g) and the corresponding Lagrange multiplier vector λ(x, g). We notice that strict complementarity holds at a stationary point x * ∈ C for ∇f (x * ) if and only if it holds, for every i ∈ [1 : m], at x (i) * ∈ C (i) for ∇f (x * ) (i) .

Lemma 9 Assume that strict complementarity holds at x * . Then the AFW applied to the simplex has active-set-related directions at x * in the sense of Definition 4.
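The numbered definition of λ i is not reproduced in this excerpt; on the unit simplex a standard convention, consistent with the expression ⟨g, x − e î ⟩ = λ î (x, g) used in the proof of Lemma 9, is λ i (x, g) = ⟨g, x⟩ − g i with g = −∇f (x), i.e. λ i = ∇f (x) i − ⟨∇f (x), x⟩. A minimal sketch under this assumption (function names are ours):

```python
import numpy as np

def multipliers(grad, x):
    """Lagrange multipliers for the unit simplex at a stationary point x:
    lambda_i = grad_i - <grad, x>. This is one standard convention; the
    paper's exact numbered definition is not shown in this excerpt."""
    return grad - grad @ x

def strict_complementarity(grad, x, tol=1e-10):
    """True iff lambda_i > 0 for every index outside supp(x)."""
    lam = multipliers(grad, x)
    off_support = x <= tol
    return bool(np.all(lam[off_support] > tol))

# Stationary point of f(x) = 0.5*||x - c||^2 over the simplex: the
# multipliers vanish on supp(x*) and are positive off the support.
x_star = np.array([0.5, 0.5, 0.0])
grad = x_star - np.array([0.5, 0.5, -1.0])    # grad f(x*) = x* - c
lam = multipliers(grad, x_star)
print(lam)                                    # [0. 0. 1.]
print(strict_complementarity(grad, x_star))   # True
```

At a stationary point the multipliers are zero on the support and nonnegative elsewhere; strict complementarity requires them to be strictly positive off the support, as in the hypotheses of Lemma 9.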

Corollary 1 Under the above assumptions on C, let A (i) be the AFW for i ∈ [1 : m], and let strict complementarity hold at x * ∈ C. If x k → x * , then for parallel and GS selection we have supp(x k ) = supp(x * ) for k large enough; for random sampling, the same conclusion holds almost surely whenever x k → x * almost surely.
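A toy illustration of this support identification on a single simplex block: a minimal away-step Frank-Wolfe on a quadratic (the solver and test problem below are our own sketch, not the paper's block-coordinate Algorithm 1), whose iterates recover supp(x * ) after finitely many steps.

```python
import numpy as np

def afw_simplex(grad_f, x0, iters=100):
    """Minimal away-step Frank-Wolfe on the unit simplex, with exact line
    search for a quadratic objective whose Hessian is the identity."""
    x = x0.astype(float)
    for _ in range(iters):
        g = grad_f(x)
        s = int(np.argmin(g))                  # Frank-Wolfe vertex
        supp = np.flatnonzero(x > 1e-12)
        v = int(supp[np.argmax(g[supp])])      # away vertex
        d_fw = -x.copy(); d_fw[s] += 1.0       # e_s - x
        d_aw = x.copy();  d_aw[v] -= 1.0       # x - e_v
        if g @ d_fw <= g @ d_aw:               # pick the steeper direction
            d, amax = d_fw, 1.0
        else:
            d, amax = d_aw, x[v] / (1.0 - x[v])
        if g @ d >= -1e-12:                    # (near-)stationary: stop
            break
        alpha = min(amax, -(g @ d) / (d @ d))  # exact step, identity Hessian
        x = x + alpha * d
    return x

c = np.array([0.6, 0.6, -0.2])                 # minimizer over simplex: [0.5, 0.5, 0]
x = afw_simplex(lambda y: y - c, np.full(3, 1 / 3))
print(np.flatnonzero(x > 1e-8))                # [0 1]: supp(x_k) = supp(x*)
```

Here the first away step is a drop step that zeroes the coordinate outside supp(x * ), after which the iterates stay in the optimal face, in line with the corollary.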

Case 1: T = 0 or d T = o. Since there are no descent directions, w k+1 = y T must be stationary for −g; equivalently, p T = ∥π(T C (i) (w k+1 ), g)∥ = 0. Finally, it is clear that if T = 0 then d 0 = o, since y 0 must be stationary for −g. Thus, taking w̄ k = y T , the desired properties follow.

Before examining the remaining cases, we remark that if the SSC terminates in Phase II, then ᾱ T −1 = β T −1 must be maximal with respect to the conditions y T ∈ B T −1 or y T ∈ B̄. If α T −1 = 0, then y T −1 = y T , and in this case we cannot have y T −1 ∈ ∂ B̄, since otherwise the SSC would have terminated in Phase II of the previous cycle. Therefore necessarily y T = y T −1 ∈ int(B T −1 ) c (Case 2). If β T −1 = α T −1 > 0, we must have y T −1 ∈ C T −1 = B T −1 ∩ B̄, and y T ∈ ∂B T −1 (Case 3) or y T ∈ ∂ B̄ (Case 4), respectively.

Case 2: y T −1 = y T ∈ int(B T −1 ) c . We can rewrite the condition as
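The maximality conditions on ᾱ T −1 = β T −1 in this case analysis amount to computing the step at which a ray leaves a ball. A minimal sketch of that computation, assuming Euclidean balls (the helper name is ours, not from the paper):

```python
import numpy as np

def max_step_in_ball(y, d, center, r):
    """Largest beta >= 0 with ||y + beta*d - center|| <= r, i.e. the step
    at which the ray y + beta*d exits the ball B(center, r). Solves the
    quadratic ||u + beta*d||^2 = r^2 with u = y - center; the discriminant
    is nonnegative whenever y lies in the ball."""
    u = y - center
    a = d @ d
    b = u @ d
    disc = b * b + a * (r * r - u @ u)
    return (-b + np.sqrt(disc)) / a

y = np.zeros(2)
d = np.array([1.0, 0.0])
beta = max_step_in_ball(y, d, np.zeros(2), 1.0)
print(beta)   # 1.0: the exit point y + beta*d lies on the boundary
```

A step equal to this value places the next trial point exactly on the boundary of the corresponding ball, which is the situation distinguished in Cases 3 and 4.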

where we used ∥x̄ k − x k ∥ ≤ ∥x k+1 − x k ∥ in the second inequality, and the last inequality follows as in (73). Thus x̄ k ∈ B δ̄ (x * ) as well, and the induction is complete. For GS updates, the proof that {x k } ⊂ B δ̄ (x * ) is analogous. ⊓ ⊔

An active set identification criterion

We prove in this section Theorem 2, proposing a general active set identification criterion for Algorithm 1 in the special case where the feasible set C is a product of simplices. With the notation introduced in Section 4, let C * = {x ∈ C : supp(x) = supp(x * )} and S * = {x ∈ R n : supp(x) = supp(x * )} be the subset of points in C and the subspace of directions with the same support as x * , respectively.

Definition 4 We say that the method Ā has active-set-related directions at x * if it can do a bounded number of consecutive maximal steps and if, for some neighborhood V of x * and for x → x * , g → −∇f (x * ), d ∈ Ā(x, g), the following hold: if x ∈ C * , then d ∈ S * with α max (x, d) = Θ(1); if x ∈ C \ C * , then ⟨g, d⟩ = Θ(1).

Lemma 7 Under the assumptions of Definition 4: if x ∈ C \ C * , we have α max (x, d) = o(1); if x ∈ C * , then ⟨g, d⟩ = o(1).

Proof Notice that 0 ≤ α max (x, d)⟨g, d⟩ = ⟨g, (x + α max (x, d) d) − x⟩ = ⟨−∇f (x), (x + α max (x, d) d) − x⟩ + o(1) ≤ ⟨−∇f (x), x * − x⟩ + o(1) = o(1), so that α max (x, d) = o(1), where we used ⟨g, d⟩ = Θ(1) by assumption. This proves the first part of the claim. As for the second part, we have ⟨g, d⟩ = ⟨−∇f (x * ), d⟩ + o(1) = o(1), (77) where we used ⟨−∇f (x * ), d⟩ = 0 in the second equality, guaranteed by stationarity conditions since d ∈ S * . ⊓ ⊔

Thus for y 0 close enough to x * we must have β T * +1 < α (T * +1) max , (83) and the claim is proved. Since the SSC terminates either with y T * +1 or y T * +2 , and both of these points are in C * , the thesis follows. ⊓ ⊔

Lemma 8 For x → x * , g → −∇f (x * ), if x ∈ C * , d ∈ S * and α max (x, d) coincides with the maximal feasible stepsize, then α max (x, d) = Θ(1).

Proof We have α max (x, d) = min{x i /(−d i ) : d i < 0} ≥ min i∈supp(x * ) x i , where we used |d i | ≤ ∥d∥ ≤ 1 and supp(d) ⊆ supp(x * ); since x i → x * ,i > 0 for i ∈ supp(x * ), the right-hand side is bounded away from zero, so α max (x, d) = Θ(1). ⊓ ⊔
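On the unit simplex, the maximal feasible stepsize appearing in the proof of Lemma 8 only involves the coordinates that d decreases. A short sketch of this computation (helper name ours):

```python
import numpy as np

def alpha_max_simplex(x, d, tol=1e-12):
    """Maximal feasible stepsize along d from x in the unit simplex:
    when sum(d) == 0, only the nonnegativity constraints matter, so
    alpha_max = min over {i : d_i < 0} of x_i / (-d_i)."""
    neg = d < -tol
    if not np.any(neg):
        return np.inf
    return float(np.min(x[neg] / (-d[neg])))

# For x in C* and d in S* (support contained in supp(x*)) with ||d|| <= 1,
# each ratio x_i / (-d_i) is at least x_i, so alpha_max is bounded below by
# min_{i in supp(x*)} x_i -- the Theta(1) behavior asserted in Lemma 8.
x = np.array([0.5, 0.5, 0.0])
d = np.array([1.0, -1.0, 0.0])   # stays in the support subspace S*
print(alpha_max_simplex(x, d))   # 0.5
```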

Proof (of Lemma 9) For x ∈ C \ C * , by [13, Lemma 3.2(a)] we have that the descent direction selected by the AFW satisfies d = (x − e î )/∥x − e î ∥ for some î ∈ argmax{λ i (x, g) : i ∈ supp(x)} ⊂ [1 : n] \ supp(x * ). Therefore ⟨g, d⟩ = ⟨g, (x − e î )/∥x − e î ∥⟩ = λ î (x, g)/∥x − e î ∥ = Θ(1) (86) for x → x * and g → −∇f (x * ). As for the case x ∈ C * , if x and g are close enough to x * and −∇f (x * ) respectively, we must have λ i (x, g) > 0 for every i in [1 : n] \ supp(x * ). Therefore, by [13, Lemma 3.2(b)], if y is obtained from x with a FW update we must have y i = 0 for i ∈ [1 : n] \ supp(x * ), which is equivalent to saying that the update direction must be in S * . The property α max (x, d) = Θ(1) follows by Lemma 8. ⊓ ⊔

Proof (of Corollary 1) Follows by applying the property proved in Lemma 9 to each block selected by the method. ⊓ ⊔

Declarations

Funding and/or Conflicts of interests/Competing interests Nothing to declare by all of the authors.