Affine-invariant contracting-point methods for Convex Optimization

In this paper, we develop new affine-invariant algorithms for solving composite convex minimization problems with bounded domain. We present a general framework of Contracting-Point methods, which solve at each iteration an auxiliary subproblem restricting the smooth part of the objective function to a contraction of the initial domain. This framework provides us with a systematic way of developing optimization methods of different order, endowed with global complexity bounds. We show that, using an appropriate affine-invariant smoothness condition, it is possible to implement one iteration of the Contracting-Point method by one step of the pure tensor method of degree $p \geq 1$. The resulting global rate of convergence in the functional residual is then ${\cal O}(1 / k^p)$, where $k$ is the iteration counter. It is important that all constants in our bounds are affine-invariant. For $p = 1$, our scheme recovers the well-known Frank-Wolfe algorithm, providing it with a new interpretation from the general perspective of tensor methods. Finally, within our framework, we present an efficient implementation and total complexity analysis of the inexact second-order scheme $(p = 2)$, called the Contracting Newton method. It can be seen as a proper implementation of the trust-region idea. Preliminary numerical results confirm its good practical performance both in the number of iterations and in computational time.


Introduction
Motivation. In recent years, we have seen an increasing interest in new frameworks for the derivation and justification of different methods for Convex Optimization, provided with a worst-case complexity analysis (see, for example, [3,16,4,19,21,11,6,15,23,22]). It appears that the accelerated proximal tensor methods [2,21] can be naturally explained through the framework of high-order proximal-point schemes [22], which require the solution of a nontrivial auxiliary problem at every iteration.
This possibility serves as a departure point for the results presented in this paper. Indeed, the main drawback of proximal tensor methods is the necessity of using a fixed Euclidean structure for measuring distances between points. However, multi-dimensional Taylor polynomials are defined by directional derivatives, which are affine-invariant objects. Can we construct a family of tensor methods that does not depend on the choice of the coordinate system in the space of variables? The results of this paper give a positive answer to this question.
Our framework extends the initial results presented in [19] and in [8]. In [19], it was shown that the classical Frank-Wolfe algorithm can be generalized to the case of composite objective functions [18] using a contraction of the feasible set towards the current test point. This operation was also used there for justifying a second-order method with contraction, which looks similar to the classical trust-region methods [5], but with an asymmetric trust region. The convergence rates for the second-order methods with contractions were significantly improved in [8]. In this paper, we extend the contraction technique to the whole family of tensor methods. However, in the vein of [22], we first analyze a conceptual scheme that solves at each iteration an auxiliary optimization problem formulated in terms of the initial objective function.
The results of this work can also be seen as an affine-invariant counterpart of the Contracting Proximal Methods from [6]. In the latter algorithms, one needs to fix in advance a prox function suitable for the geometry of the problem, and the parameters of the problem class are also usually required. Last but not least, all methods of the present paper are universal and parameter-free.
Contents. The paper is organized as follows.
In Section 2, we present the general framework of Contracting-Point methods. We provide two conceptual variants of our scheme for different conditions of inexactness for the solution of the subproblem: using a point with a small residual in the function value, and using a stronger condition that involves the gradients. For both schemes, we establish global bounds on the functional residual of the initial problem. These bounds lead to global convergence guarantees under a suitable choice of the parameters. For the scheme with the second condition of inexactness, we also provide a computable accuracy certificate. It can be used to estimate the functional residual directly within the method.
Section 3 contains smoothness conditions, which are useful for analysing affine-invariant high-order schemes. We present some basic inequalities and examples related to the new definitions.
In Section 4, we show how to implement one iteration of our methods by computing an (inexact) affine-invariant tensor step. For the methods of degree p ≥ 1, we establish global convergence in the functional residual of the order O(1/k^p), where k is the iteration counter. For p = 1, this recovers the well-known result on the global convergence of the classical Frank-Wolfe algorithm [10,19]. For p = 2, we obtain the Contracting-Domain Newton Method from [8]. Thus, our analysis also extends the results of these works to the case when the corresponding subproblem is solved inexactly.
In Section 5, we present a two-level optimization scheme, called the Inexact Contracting Newton Method. It is an implementation of the inexact second-order method in which the steps are computed by the first-order Conditional Gradient Method. For the resulting algorithm, we establish a global complexity of O(1/ε^{1/2}) calls of the oracle for the smooth part (computing the gradient and the Hessian of the smooth part of the objective), and O(1/ε) calls of the linear minimization oracle for the composite part, where ε > 0 is the required accuracy in the functional residual. Additionally, we address an effective implementation of our method for optimization over the standard simplex.
Section 6 contains numerical experiments. In Section 7, we discuss our results and highlight some open questions for future research.

Notation.
In what follows, we denote by E a finite-dimensional real vector space and by E* its dual space, which is the space of linear functions on E. The value of a function s ∈ E* at a point x ∈ E is denoted by ⟨s, x⟩. For a smooth function f : dom f → R, where dom f ⊆ E, we denote by ∇f(x) its gradient and by ∇²f(x) its Hessian, evaluated at a point x ∈ dom f. For p ≥ 1, we denote by D^p f(x)[h]^p the pth directional derivative of f at x along a direction h, for all x ∈ dom f and h ∈ E, and by
    Ω_p(f, x; y) := f(x) + Σ_{i=1}^{p} (1/i!) D^i f(x)[y − x]^i
the Taylor polynomial of degree p of f around x. For its gradient in y, we use the notation ∇Ω_p(f, x; y). In particular, Ω_1(f, x; y) = f(x) + ⟨∇f(x), y − x⟩ and ∇Ω_1(f, x; y) = ∇f(x).

Contracting-point methods
Consider the following composite minimization problem:
    min_{x ∈ dom ψ} { F(x) := f(x) + ψ(x) },  (1)
where ψ : E → R ∪ {+∞} is a simple proper closed convex function with bounded domain, and the function f(x) is convex and p (≥ 1) times continuously differentiable at every point x ∈ dom ψ.
In this section, we propose a conceptual optimization scheme for solving (1). At each step of our method, we choose a contracting coefficient γ_k ∈ (0, 1] restricting the nontrivial part of our objective f(·) to a contracted domain. At the same time, the domain for the composite part remains unchanged.
Namely, at a point x_k ∈ dom ψ, define
    S_k(y) := γ_k ψ( x_k + (1/γ_k)(y − x_k) ),  y ∈ x_k + γ_k (dom ψ − x_k).
Note that S_k(y) = γ_k ψ(v) for y = (1 − γ_k)x_k + γ_k v with v ∈ dom ψ. Consider the following exact iteration:
    x*_{k+1} ∈ Argmin_y { f(y) + S_k(y) : y = (1 − γ_k)x_k + γ_k v, v ∈ dom ψ }.  (2)
Of course, when γ_k = 1, the exact step from (2) solves the initial problem. However, we are going to look at inexact minimizers. In this case, the choice of {γ_k}_{k≥0} should take into account the efficiency of solving the auxiliary subproblem. Denote by F_k(·) the objective of the auxiliary problem (2), that is,
    F_k(y) := f(y) + S_k(y).
We are going to use the point x̄_{k+1} = (1 − γ_k)x_k + γ_k v̄_{k+1}, with v̄_{k+1} ∈ dom ψ having a small residual in the function value:
    F_k(x̄_{k+1}) − F_k(x*_{k+1}) ≤ δ_{k+1},  (3)
with some fixed δ_{k+1} ≥ 0.
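As a minimal illustration of the conceptual iteration (2)–(3), the sketch below runs the contracting-point scheme on a one-dimensional toy problem. The quadratic f, the grid-based inexact subproblem solver (which produces a small residual δ), and the schedule γ_k = 2/(k+2) are all illustrative assumptions, not part of the method's specification.

```python
import numpy as np

def f(x):                        # smooth part: a convex quadratic (assumption)
    return (x - 0.8) ** 2

V = np.linspace(0.0, 1.0, 1001)  # grid over dom psi = [0, 1]; psi = 0 there

x = 0.0                          # x_0 in dom psi
for k in range(100):
    gamma = 2.0 / (k + 2)                    # contracting coefficient gamma_k
    y = (1 - gamma) * x + gamma * V          # contracted domain for f
    v_bar = V[int(np.argmin(f(y)))]          # inexact minimizer of F_k (grid search)
    x_new = (1 - gamma) * x + gamma * v_bar
    if f(x_new) <= f(x):                     # optional monotonicity test (Step 3)
        x = x_new
print(f(x) < 1e-6)               # prints True
```

Since γ_0 = 1, the very first subproblem already covers the whole domain here; the later steps only refine the iterate within progressively smaller contractions.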
Lemma 1 For all k ≥ 0 and v ∈ dom ψ, we have
    F_k(x̄_{k+1}) ≤ (1 − γ_k) f(x_k) + γ_k F(v) + δ_{k+1}.
Proof: Indeed, for any v ∈ dom ψ, we have, by convexity of f(·),
    F_k((1 − γ_k)x_k + γ_k v) = f((1 − γ_k)x_k + γ_k v) + γ_k ψ(v) ≤ (1 − γ_k) f(x_k) + γ_k F(v).
Therefore,
    F_k(x̄_{k+1}) ≤ F_k(x*_{k+1}) + δ_{k+1} ≤ (1 − γ_k) f(x_k) + γ_k F(v) + δ_{k+1}. □
Let us write down our method in an algorithmic form.
In Step 3 of this method, we add a simple test for ensuring monotonicity in the function value.This step is optional.
It is more convenient to describe the rate of convergence of this scheme with respect to another sequence of parameters. Let us introduce an arbitrary sequence of positive numbers {a_k}_{k≥1}, denote A_k := Σ_{i=1}^{k} a_i (with A_0 := 0), and define the contracting coefficients as
    γ_k := a_{k+1}/A_{k+1}.  (6)

Theorem 1 For all points of the sequence {x_k}_{k≥0}, generated by process (5), we have the following relation:
    A_k ( F(x_k) − F* ) ≤ B_k := Σ_{i=1}^{k} A_i δ_i.  (7)
Proof: Indeed, for k = 0 we have A_0 = 0 and B_0 = 0, hence (7) is valid. Assume it is valid for some k ≥ 0. Then, by convexity of ψ(·) and by Lemma 1 applied with v = x*,
    A_{k+1} F(x_{k+1}) ≤ A_{k+1} F(x̄_{k+1}) ≤ A_{k+1} [ F_k(x̄_{k+1}) + (1 − γ_k) ψ(x_k) ]
        ≤ A_{k+1} [ (1 − γ_k) F(x_k) + γ_k F* + δ_{k+1} ]
        = A_k F(x_k) + a_{k+1} F* + A_{k+1} δ_{k+1}
        ≤ A_{k+1} F* + B_{k+1}. □

From bound (7), we can see that
    F(x_k) − F* ≤ (1/A_k) Σ_{i=1}^{k} A_i δ_i.  (8)
Hence, the actual rate of convergence of method (5) depends on the growth of the coefficients {A_k}_{k≥1} relative to the level of the inaccuracies {δ_k}_{k≥1}. Potentially, this rate can be arbitrarily high. Since we have not yet assumed anything about our objective function, this means that we have just retransmitted the complexity of solving problem (1) to a lower level: the level of computing the point x̄_{k+1} satisfying condition (3). We discuss different possibilities for that in Sections 4 and 5.

Now, let us endow method (5) with a computable accuracy certificate. For this purpose, for the sequence of test points {x̄_k}_{k≥1} ⊂ dom ψ, we introduce the following Estimating Function (see [20]):
    φ_k(v) := Σ_{i=1}^{k} a_i [ f(x̄_i) + ⟨∇f(x̄_i), v − x̄_i⟩ + ψ(v) ].
By convexity of f(·), we have φ_k(v) ≤ A_k F(v) for all v ∈ dom ψ. Hence, defining the accuracy certificate
    ℓ_k := F(x_k) − (1/A_k) min_{v ∈ dom ψ} φ_k(v),
we get, for all k ≥ 1, the following bound for the functional residual:
    F(x_k) − F* ≤ ℓ_k.  (9)
The complexity of computing the value of ℓ_k usually does not exceed the complexity of computing the next iterate of our method, since it requires just one call of the linear minimization oracle. Let us show that an appropriate rate of decrease of the estimates ℓ_k can be guaranteed by sufficiently accurate steps of the method (2).
For that, we need a stronger condition on the point x̄_{k+1}, that is,
    ⟨∇f(x̄_{k+1}), v − v̄_{k+1}⟩ + ψ(v) − ψ(v̄_{k+1}) ≥ −δ_{k+1},  for all v ∈ dom ψ,  (10)
with some δ_{k+1} ≥ 0. Note that, for δ_{k+1} = 0, condition (10) ensures the exactness of the corresponding step of method (2). Let us now consider the following algorithm.
This scheme differs from the previous method (5) only in the characteristic condition (10) for the next test point.
Theorem 2 For all points of the sequence {x_k}_{k≥0}, generated by the process (11), we have
    A_k ℓ_k ≤ B_k = Σ_{i=1}^{k} A_i δ_i.  (12)
Proof: For k = 0, relation (12) is valid since both sides are zero. Assume that (12) holds for some k ≥ 0; by the definition of ℓ_k, this means min_{v ∈ dom ψ} φ_k(v) ≥ A_k F(x_k) − B_k. Then, for any v ∈ dom ψ, we have
    φ_{k+1}(v) = φ_k(v) + a_{k+1} [ f(x̄_{k+1}) + ⟨∇f(x̄_{k+1}), v − x̄_{k+1}⟩ + ψ(v) ]
        ≥ A_k F(x_k) − B_k + a_{k+1} [ f(x̄_{k+1}) + ⟨∇f(x̄_{k+1}), v − x̄_{k+1}⟩ + ψ(v) ]
        (*) ≥ A_{k+1} f(x̄_{k+1}) + A_k ψ(x_k) + a_{k+1} [ ⟨∇f(x̄_{k+1}), v − v̄_{k+1}⟩ + ψ(v) ] − B_k
        (10) ≥ A_{k+1} f(x̄_{k+1}) + A_k ψ(x_k) + a_{k+1} ψ(v̄_{k+1}) − a_{k+1} δ_{k+1} − B_k
        (**) ≥ A_{k+1} F(x̄_{k+1}) − B_{k+1} ≥ A_{k+1} F(x_{k+1}) − B_{k+1}.
Here, the inequalities (*) and (**) are justified by convexity of f(·) and ψ(·), correspondingly; in (*) we also use the identity A_k(x_k − x̄_{k+1}) + a_{k+1}(v − x̄_{k+1}) = a_{k+1}(v − v̄_{k+1}). Minimizing the left-hand side over v ∈ dom ψ, we get (12) for the next step. Thus, (12) is proved for all k ≥ 0. □
Combining now (9) with (12), we obtain
    F(x_k) − F* ≤ ℓ_k ≤ (1/A_k) Σ_{i=1}^{k} A_i δ_i.  (13)
We see that the right-hand side of (13) is the same as that of (8). However, this convergence is stronger, since it provides a bound for the accuracy certificate ℓ_k.
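The accuracy certificate ℓ_k is easy to maintain in code. The sketch below runs a Frank-Wolfe-type instance of the scheme on an assumed toy quadratic over the standard simplex (the weights a_k = k and the test problem are illustrative assumptions) and checks that ℓ_k indeed upper-bounds the true residual:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
c = rng.random(n); c /= c.sum()          # minimizer inside the simplex, so F* = 0
f  = lambda x: 0.5 * np.dot(x - c, x - c)
df = lambda x: x - c

x = np.eye(n)[0]                         # x_0: a vertex of the simplex
A_sum, phi_lin, phi_const = 0.0, np.zeros(n), 0.0
for k in range(50):
    a = k + 1.0                          # a_{k+1}, giving gamma_k = 2/(k+2)
    gamma = a / (A_sum + a)
    v = np.eye(n)[int(np.argmin(df(x)))] # linear minimization oracle over S_n
    x = (1 - gamma) * x + gamma * v      # next iterate (psi = 0 on the simplex)
    g = df(x)
    phi_lin  += a * g                    # aggregate the Estimating Function
    phi_const += a * (f(x) - g @ x)
    A_sum += a
# min of the affine phi_k over the simplex is attained at a coordinate vector
ell = f(x) - (phi_const + phi_lin.min()) / A_sum
print(ell >= f(x) - 1e-12)               # prints True: l_k bounds the residual
```

One extra linear minimization (the `phi_lin.min()` above) is all that is needed to evaluate the certificate, in line with the remark on its computational cost.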
Affine-invariant high-order smoothness conditions

We are going to describe the efficiency of solving the auxiliary problem in (2) by affine-invariant characteristics of variation of the function f(·) over compact convex sets. For a convex set Q, define
    Δ^{(p)}_Q(f) := sup_{x,v ∈ Q, t ∈ (0,1]} (p!/t^{p+1}) | f(x + t(v − x)) − Ω_p(f, x; x + t(v − x)) |.
Note that for p = 1 this characteristic was considered in [14] for the analysis of the classical Frank-Wolfe algorithm.
In many situations, it is more convenient to use an upper bound for Δ^{(p)}_Q(f) given by the full variation of the (p + 1)th derivative of f over the set Q:
    V^{(p+1)}_Q(f) := sup_{x,y,v ∈ Q} | D^{p+1} f(x)[y − v]^{p+1} |.
Indeed, by the Taylor formula, we have
    f(x + t(v − x)) − Ω_p(f, x; x + t(v − x)) = (1/p!) ∫_0^1 (1 − τ)^p D^{p+1} f(x + τt(v − x))[t(v − x)]^{p+1} dτ.

Hence, Δ^{(p)}_Q(f) ≤ (1/(p+1)) V^{(p+1)}_Q(f).

Sometimes, in order to exploit the primal-dual structure of the problem, we need to work with dual objects (gradients), as in method (11). In this case, we need a characteristic of variation of the gradient ∇f(·) over the set Q:
    Γ^{(p)}_Q(f) := sup_{x,y,v ∈ Q, t ∈ (0,1]} (p!/t^p) | ⟨∇f(x + t(y − x)) − ∇Ω_p(f, x; x + t(y − x)), y − v⟩ |.
Since the Taylor remainder vanishes at t = 0, integrating its derivative along the segment [x, y] yields Δ^{(p)}_Q(f) ≤ (1/(p+1)) Γ^{(p)}_Q(f). At the same time, by the Taylor formula applied to the gradient, we again obtain an upper bound in terms of the variation of the (p + 1)th derivative, up to a constant depending only on p. See Proposition 1 in the Appendix for the proof of the last inequality. Hence, the value of V^{(p+1)}_Q(f) is the biggest of these characteristics. However, in many cases it is the most convenient one to use.
Example 1 For a fixed self-adjoint positive-definite linear operator B : E → E*, define the corresponding Euclidean norm ||x|| := ⟨Bx, x⟩^{1/2}, x ∈ E. Let the pth derivative of the function f(·) be Lipschitz continuous with respect to this norm:
    || D^p f(x) − D^p f(y) || ≤ L_p ||x − y||,  for all x, y ∈ Q,
and let Q be a compact set with diameter D := max_{x,y ∈ Q} ||x − y|| < +∞. Then, we have
    V^{(p+1)}_Q(f) ≤ L_p D^{p+1}.
In some situations, we can obtain much better estimates.
Example 2 Let A ⪰ 0, and consider the quadratic function f(x) = (1/2)⟨Ax, x⟩ over the standard simplex S_n = { x ∈ R^n_+ : Σ_{i=1}^n x^{(i)} = 1 }. For measuring distances in the standard simplex, we choose the ℓ1-norm: ||x||_1 = Σ_{i=1}^n |x^{(i)}|. In this case, D = D_{||·||_1}(S_n) = 2 and L_1 = max_{1≤i≤n} A^{(i,i)}. On the other hand,
    V^{(2)}_{S_n}(f) = max_{1≤i,j≤n} ⟨A(e_i − e_j), e_i − e_j⟩,
where e_k denotes the kth coordinate vector of R^n. Thus, V^{(2)}_{S_n}(f) ≤ L_1 D² = 4L_1. However, for some matrices, the value
    V^{(2)}_{S_n}(f) = max_{1≤i,j≤n} ( A^{(i,i)} − 2A^{(i,j)} + A^{(j,j)} )
can be much smaller than 4L_1.
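The inequality Δ^{(p)}_Q(f) ≤ V^{(p+1)}_Q(f)/(p + 1) can be sanity-checked numerically for p = 1 on a random quadratic over the simplex. The normalization Δ^{(1)} = sup (1/t²)|f(x + t(v − x)) − f(x) − t⟨∇f(x), v − x⟩| used below is an assumption matching the definitions of this section; for a quadratic this remainder is exact, so the check is tight.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n)); A = M @ M.T      # A >= 0
f  = lambda x: 0.5 * x @ A @ x
df = lambda x: A @ x

# V^{(2)}_{S_n}(f): for a convex quadratic the sup is attained at vertex pairs
E = np.eye(n)
V2 = max((E[i] - E[j]) @ A @ (E[i] - E[j])
         for i in range(n) for j in range(n))

# empirical Delta^{(1)}: (1/t^2) |f(x + t(v-x)) - f(x) - t <df(x), v - x>|
worst = 0.0
for _ in range(2000):
    x = rng.dirichlet(np.ones(n)); v = rng.dirichlet(np.ones(n))
    t = rng.uniform(1e-3, 1.0)
    y = x + t * (v - x)
    worst = max(worst, abs(f(y) - f(x) - t * df(x) @ (v - x)) / t ** 2)

print(worst <= 0.5 * V2 + 1e-9)   # prints True: Delta^{(1)} <= (1/2) V^{(2)}
```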
Example 3 Let the given vectors a_1, . . ., a_m span the whole R^n, and consider the log-sum-exp objective
    f(x) = ln( Σ_{i=1}^m e^{⟨a_i, x⟩} ).
Then its Hessian satisfies the bounds of Example 1 in [7], which yield Lipschitz constants for the derivatives of f(·) with respect to the ℓ1-norm. At the same time, the affine-invariant variation V^{(2)}_Q(f) is controlled by the maximal difference between variations of the coordinates of the vectors a_i, which can be much smaller than the corresponding norm-based bound L_1 D². Moreover, by the estimates of Example 1 in [7], an analogous conclusion holds for the third derivative as well. Hence, for this objective, the affine-invariant characteristics can be much better than their norm-based counterparts.

Contracting-point tensor methods
In this section, we show how to implement the Contracting-Point methods by using affine-invariant tensor steps. At each iteration of (2), we approximate f(·) by its Taylor polynomial of degree p ≥ 1 around the current point x_k:
    Ω_p(f, x_k; y) = f(x_k) + Σ_{i=1}^{p} (1/i!) D^i f(x_k)[y − x_k]^i.
Thus, we need to solve the following auxiliary problem:
    M*_k := min_y { M_k(y) := Ω_p(f, x_k; y) + S_k(y) : y = (1 − γ_k)x_k + γ_k v, v ∈ dom ψ }.  (20)
Note that this global minimum M*_k is well defined, since dom ψ is bounded. Let us take
    x̄_{k+1} = (1 − γ_k)x_k + γ_k v̄_{k+1},
where v̄_{k+1} is an inexact solution to (20) in the following sense:
    M_k(x̄_{k+1}) − M*_k ≤ ξ_{k+1}.  (21)
Then, this point serves as a good candidate for the inexact step of our method.
For p = 1 and ψ(·) being the indicator function of a compact convex set, this is the well-known Frank-Wolfe algorithm [10]. For p = 2, this is the Contracting-Domain Newton Method from [8].
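The p = 1 case is easy to verify directly: since Ω_1 is linear in v, minimizing the model over the contracted domain selects the same vertex as the classical linear minimization oracle. Below is a tiny numerical check of this equivalence over the simplex; the random data and the value of γ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
x = rng.dirichlet(np.ones(n))        # current point x_k in the simplex
g = rng.standard_normal(n)           # stands for grad f(x_k)
gamma = 0.3

# Step of method (22) for p = 1: minimize Omega_1(f, x_k; y) over
# y = (1 - gamma) x_k + gamma v, v a vertex of S_n.
E = np.eye(n)
vals = [g @ ((1 - gamma) * x + gamma * E[j]) for j in range(n)]
v_model = E[int(np.argmin(vals))]

v_fw = E[int(np.argmin(g))]          # classical Frank-Wolfe direction

print(np.allclose(v_model, v_fw))    # prints True
```

The constant term ⟨g, (1 − γ)x⟩ and the positive factor γ do not affect the argmin, which explains why the two selections coincide.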
A straightforward consequence of our observations is the following.

Theorem 4 Let the coefficients be chosen so that γ_k = a_{k+1}/A_{k+1} = (p+1)/(k+p+1), i.e., A_k = k(k+1)···(k+p), and let the inner accuracies satisfy ξ_{k+1} ≤ cγ_k^{p+1} for some constant c ≥ 0. Then, for all iterations {x_k}_{k≥1} generated by method (22), we have
    F(x_k) − F* ≤ O( (Δ^{(p)}_{dom ψ}(f) + c) / k^p ),
where the constant in O(·) depends only on p.
Proof: Combining (8) with Theorem 3, and using the bounds A_{i+1}γ_i^{p+1} ≤ (p+1)^{p+1} and A_k ≥ k^{p+1}, we get the required inequality. □

It is important that the required level of accuracy ξ_{k+1} for solving the subproblem is not static: it changes along the iterations. Indeed, from the practical perspective, there is no need to use high accuracy during the first iterations, but it is natural to improve the precision while approaching the optimum. Inexact proximal-type tensor methods with dynamic inner accuracies were studied in [9].
Let us note that the objective M_k(y) from (20) is generally nonconvex for p ≥ 3, and it may be nontrivial to find its global minimum. Because of that, we propose an alternative condition for the next point, which requires only finding an inexact stationary point of Ω_p(f, x_k; y). That is, a point x̄_{k+1} satisfying
    ⟨∇Ω_p(f, x_k; x̄_{k+1}), v − v̄_{k+1}⟩ + ψ(v) − ψ(v̄_{k+1}) ≥ −ξ_{k+1},  for all v ∈ dom ψ,  (23)
for some tolerance value ξ_{k+1} ≥ 0.
Its convergence analysis is straightforward.

Proof: Combining inequality (13) with the statement of Theorem 5, and then using the same reasoning as in the proof of Theorem 4, we get the required inequality. □

Inexact contracting Newton method
In this section, let us present an implementation of our method (22) for p = 2, in which at each step the subproblem is solved inexactly by a variant of the first-order Conditional Gradient Method. The entire algorithm looks as follows.
In Step 4-e, if the inner stopping condition does not hold, we set t = t + 1 and go to 4-a; otherwise, we go to 5.
We provide an analysis of the total number of oracle calls for f (step 2), and of the total number of linear minimization oracle calls for the composite component ψ (step 4-c), required to solve problem (1) up to a given accuracy level.
Theorem 7 Let γ_k = 3/(k+3). Then, for the iterations {x_k}_{k≥1} generated by method (25), we have
    F(x_k) − F* ≤ O( (V^{(2)}_{dom ψ}(f) + c) / k² ).  (26)
Therefore, for any ε > 0, it is enough to perform
    K = O( ( (V^{(2)}_{dom ψ}(f) + c) / ε )^{1/2} )  (27)
iterations of the method in order to get F(x_K) − F* ≤ ε. And the total number N_K of linear minimization oracle calls during these iterations is bounded as
    N_K = O( K + V^{(2)}_{dom ψ}(f) (V^{(2)}_{dom ψ}(f) + c) / (cε) ).  (28)
Proof: Let us fix an arbitrary iteration k ≥ 0 of our method and consider the following objective:
    g_k(v) := ⟨∇f(x_k), v − x_k⟩ + (γ_k/2)⟨∇²f(x_k)(v − x_k), v − x_k⟩ + ψ(v),  v ∈ dom ψ.
We need to find a point v̄_{k+1} such that
    g_k(v̄_{k+1}) − min_{v ∈ dom ψ} g_k(v) ≤ cγ_k².  (29)
Note that if we set x̄_{k+1} := γ_k v̄_{k+1} + (1 − γ_k)x_k, then from (29) we obtain bound (21) satisfied with ξ_{k+1} = cγ_k³. Thus, we would obtain an iteration of Algorithm (22) for p = 2, and Theorem 4 gives the required rate of convergence (26). We are about to show that steps 4-a – 4-e of our algorithm aim exactly at finding such a point v̄_{k+1}.
Let us introduce the auxiliary sequences A_t := t(t + 1) and a_{t+1} := A_{t+1} − A_t = 2(t + 1). Then we have the following representation of the Estimating Functions of the inner process, for every t ≥ 0:
    φ_{t+1}(w) = Σ_{i=0}^{t} a_{i+1} [ g_k(z_i) + ⟨∇g_k(z_i), w − z_i⟩ ].
By convexity of g_k(·), we have φ*_{t+1} := min_w φ_{t+1}(w) ≤ A_{t+1} g*_k, where g*_k := min_{v ∈ dom ψ} g_k(v). Therefore, we obtain the following upper bound for the residual in (29):
    g_k(z_{t+1}) − g*_k ≤ g_k(z_{t+1}) − φ*_{t+1}/A_{t+1}.  (30)
Now, let us show by induction that
    A_{t+1} g_k(z_{t+1}) − φ*_{t+1} ≤ Σ_{i=1}^{t+1} (a_i²/(2A_i)) γ_k V^{(2)}_{dom ψ}(f) ≤ 2(t + 1) γ_k V^{(2)}_{dom ψ}(f).  (31)
It obviously holds for t = 0. Assuming it holds for some t ≥ 0 and using the quadratic structure of g_k(·) together with the step z_{t+1} = (1 − a_{t+1}/A_{t+1}) z_t + (a_{t+1}/A_{t+1}) w_{t+1}, we obtain (31) for the next step as well. Therefore, (31) is established for all t ≥ 0. Combining (30) with (31), we get the following guarantee for the inner steps 4-a – 4-e:
    g_k(z_{t+1}) − g*_k ≤ 2γ_k V^{(2)}_{dom ψ}(f) / (t + 2).
Therefore, all iterations of our method are well defined. We exit the inner loop at step 4-e as soon as t + 2 ≥ 2V^{(2)}_{dom ψ}(f)/(cγ_k), and then the point v̄_{k+1} ≡ z_{t+1} satisfies (29). Hence, we obtain (26) and (27). The total number of linear minimization oracle calls can be estimated as
    N_K ≤ Σ_{k=0}^{K−1} ( 1 + 2V^{(2)}_{dom ψ}(f)/(cγ_k) ) = O( K + V^{(2)}_{dom ψ}(f) K²/c ),
which leads to (28). □
According to the result of Theorem 7, in order to solve problem (1) up to accuracy ε > 0, we need to perform O(1/ε) total computations of step 4-c of the method (estimate (28)). This is the same number of linear minimization oracle calls as required by the classical Frank-Wolfe algorithm [19]. However, this estimate can be overly pessimistic for our method. Indeed, it comes as the product of the worst-case complexity bounds for the outer and the inner optimization schemes, and it seems to be very rare to meet the worst-case instances at both levels simultaneously. Thus, the practical performance of our method can be much better.
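Up to absolute constants, the two oracle-complexity estimates follow from a short calculation; the bound on the number of inner linear-minimization steps per outer iteration is taken here as an assumption consistent with the discussion above (V stands for V^{(2)}_{dom ψ}(f)):

```latex
K \;=\; \mathcal{O}\!\Big(\sqrt{\tfrac{V + c}{\varepsilon}}\Big)
\quad\text{outer iterations, from the rate } \mathcal{O}\!\big(\tfrac{V + c}{k^{2}}\big),
\qquad
t_k \;=\; \mathcal{O}\!\Big(\tfrac{V}{c\,\gamma_k}\Big)
      \;=\; \mathcal{O}\!\Big(\tfrac{V k}{c}\Big)
\quad\text{inner LMO calls at outer iteration } k,
\]
\[
N_K \;=\; \sum_{k=0}^{K-1} t_k
      \;=\; \mathcal{O}\!\Big(\tfrac{V K^{2}}{c}\Big)
      \;=\; \mathcal{O}\!\Big(\tfrac{V (V + c)}{c\,\varepsilon}\Big).
```

Each outer iteration requires one gradient and one Hessian of f, so the smooth-part oracle is called only O(1/ε^{1/2}) times, while the total LMO count matches the O(1/ε) bound.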
At the same time, the total number of gradient and Hessian computations is only O(1/ε^{1/2}) (estimate (27)). This can lead to a significant acceleration over the first-order Frank-Wolfe algorithm when the gradient computation is a bottleneck (see our experimental comparison in the next section).
The only parameter that remains to be chosen in method (25) is the tolerance constant c > 0. Note that the right-hand side of (28) is convex in c; hence, its approximate minimization provides us with a suitable choice. In practical applications, we may not know some of the corresponding constants. However, in many cases they are small. Therefore, an appropriate choice of c is a small constant.
Finally, let us discuss an efficient implementation of our method when the composite part is the {0, +∞}-indicator function of the standard simplex:
    S_n := { x ∈ R^n_+ : Σ_{i=1}^n x^{(i)} = 1 }.  (33)
This is an example of a set with a finite number of atoms, which in this case are the standard coordinate vectors: S_n = Conv{e_1, . . ., e_n}. See [14] for more examples of atomic sets in the context of the Frank-Wolfe algorithm. The maximization of a convex function over such sets can be implemented very efficiently, since the maximum is always attained at a corner (one of the atoms).
At iteration k ≥ 0 of method (25), we need to minimize over S_n the quadratic function
    g_k(v) = ⟨∇f(x_k), v − x_k⟩ + (γ_k/2)⟨∇²f(x_k)(v − x_k), v − x_k⟩.
Assume that we keep the vector
    ∇g_k(z_t) = ∇f(x_k) + γ_k ∇²f(x_k)(z_t − x_k) ∈ R^n
for the current point z_t, t ≥ 0, of the inner process, as well as its aggregation used in the stopping condition. Then, at step 4-c we need to find an index j of a minimal element of h_t := ∇g_k(z_t) and to set w_{t+1} := e_j. For z_{t+1} = (1 − α_t)z_t + α_t e_j, the new gradient is equal to
    ∇g_k(z_{t+1}) = (1 − α_t)∇g_k(z_t) + α_t ( ∇f(x_k) + γ_k ∇²f(x_k)e_j − γ_k ∇²f(x_k)x_k ),
and the function value can be expressed using the gradient as follows:
    g_k(z_{t+1}) = ⟨ (1/2)( ∇g_k(z_{t+1}) + ∇f(x_k) ), z_{t+1} − x_k ⟩.
The product ∇²f(x_k)e_j is just the jth column of the matrix. Hence, preparing in advance the following objects: ∇f(x_k) ∈ R^n, ∇²f(x_k) ∈ R^{n×n}, and the Hessian-vector product ∇²f(x_k)x_k ∈ R^n, we are able to perform each iteration of the inner loop (steps 4-a – 4-e) very efficiently, in O(n) arithmetical operations.
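Putting the pieces together, here is a compact sketch of method (25) over the standard simplex with the O(n) inner updates described above. The log-sum-exp test problem, the value of the tolerance constant c, and the use of the Frank-Wolfe gap as an implementable inner stopping rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 50
A = rng.uniform(-1, 1, (m, n)); b = rng.uniform(-1, 1, m)

def oracle(x):
    """Value, gradient and Hessian of f(x) = log sum exp(<a_i, x> - b_i)."""
    t = A @ x - b
    tm = t.max()
    w = np.exp(t - tm); s = w.sum(); w /= s          # stable softmax weights
    g = A.T @ w
    H = A.T @ (w[:, None] * A) - np.outer(g, g)
    return tm + np.log(s), g, H

x = np.full(n, 1.0 / n)
c = 0.1                                   # inner tolerance constant (assumption)
f0, _, _ = oracle(x)
for k in range(30):
    gamma = 3.0 / (k + 3)
    _, g, H = oracle(x)                   # one gradient/Hessian call per iteration
    Hx = H @ x
    z, Hz = x.copy(), Hx.copy()           # maintain H @ z incrementally
    for t_in in range(500):
        grad_g = g + gamma * (Hz - Hx)    # gradient of the model g_k at z
        j = int(np.argmin(grad_g))        # linear minimization oracle: a vertex
        if grad_g @ z - grad_g[j] <= c * gamma ** 2:
            break                         # Frank-Wolfe gap small enough
        alpha = 2.0 / (t_in + 2)
        z *= 1 - alpha; z[j] += alpha
        Hz = (1 - alpha) * Hz + alpha * H[:, j]   # O(n): jth Hessian column
    x = (1 - gamma) * x + gamma * z

fK, gK, _ = oracle(x)
gap = gK @ x - gK.min()                   # dual gap: F(x) - F* <= gap
```

After the outer loop, `gap` plays the role of a computable accuracy certificate for the final point, while the Hessian is formed only once per outer iteration, matching the O(1/ε^{1/2}) count of smooth-oracle calls.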

Numerical experiments
Let us consider the problem of minimizing the log-sum-exp function (SoftMax) over the standard simplex S_n (33). The coefficients {a_i}_{i=1}^m and b are generated randomly from the uniform distribution on [−1, 1]. We compare the performance of the Inexact Contracting Newton Method (25) with that of the classical Frank-Wolfe algorithm, for different values of the parameters. The results are shown in Figures 1–3.
We see that the new method works significantly better in terms of the outer iterations (oracle calls). This confirms our theory. At the same time, for many values of the parameters, it shows better performance in terms of total computational time as well.

Discussion
In this paper, we present a new general framework of Contracting-Point methods, which can be used for developing affine-invariant optimization algorithms of different orders. For the methods of order p ≥ 1, we prove the global convergence rate O(1/k^p) in the functional residual. This is the same rate as that of the basic high-order Proximal-Point scheme [22]. However, the methods of our paper are free from using norms or any other characteristic parameters of the problem. This nice property makes Contracting-Point methods favourable for solving optimization problems over sets with a non-Euclidean geometry (e.g., over the simplex or over a general convex polytope). At the same time, it is known that in the Euclidean case the prox-type methods can be accelerated, achieving the O(1/k^{p+1}) global rate of convergence [2,21,6,22]. Using an additional one-dimensional search at each iteration, this rate can be improved up to O(1/k^{(3p+1)/2}) (see [11,22]). The latter rate is shown to be optimal [1,20]. To the best of our knowledge, the lower bounds for high-order methods in the general non-Euclidean case remain unknown. However, the worst-case oracle complexity of the classical Frank-Wolfe algorithm (the case p = 1 in our framework) is proven to be near-optimal for smooth minimization over ℓ∞-balls [13].
Another open question is the possibility of an efficient implementation of our methods for the case p ≥ 3. In view of the absence of an explicit regularizer (contrary to the prox-type methods), the subproblem in (20) can be nonconvex. Hence, it seems hard to find its global minimizer. We hope that for some problem classes it is still feasible to satisfy the inexact stationarity condition (23) with a reasonable amount of computations. We leave this question for further investigation.

Appendix

If it happens that our bilinear form is positive semidefinite (e.g., it is determined by the Hessian of a convex function), the constant in (45) can be improved to be 1, so that both supremums are equal.
Proof: Indeed, by the Eigenvalue Decomposition, for some r ≥ 0, there exist linear forms a_1, . . ., a_r ∈ E* and positive numbers λ_1, . . ., λ_r > 0 such that
    ⟨Bu, v⟩ = Σ_{i=1}^{r} λ_i ⟨a_i, u⟩⟨a_i, v⟩,  u, v ∈ S.