Inexact accelerated high-order proximal-point methods

In this paper, we present a new framework of bi-level unconstrained minimization for the development of accelerated methods in Convex Programming. These methods use approximations of high-order proximal points, which are solutions of some auxiliary parametric optimization problems. For computing these points, we can use different methods, in particular, lower-order schemes. This opens a possibility for the latter methods to surpass the traditional limits of Complexity Theory. As an example, we obtain a new second-order method with the convergence rate $O(k^{-4})$, where $k$ is the iteration counter. This rate is better than the maximal possible rate of convergence for this type of method when applied to functions with Lipschitz continuous Hessian. We also present new methods with an exact auxiliary search procedure, which have the rate of convergence $O(k^{-(3p+1)/2})$, where $p \ge 1$ is the order of the proximal operator. The auxiliary problem at each iteration of these schemes is convex.


Introduction
Motivation. In the last decade, Convex Optimization has seen intense activity in the development of accelerated high-order methods and in proving the corresponding lower complexity bounds (see [1,2,5,9,12]). At this moment, for methods of any order there exists a natural problem class for which accelerated methods are known. For example, functions with Lipschitz continuous gradients can be naturally minimized by first-order schemes, which in this case demonstrate the unimprovable convergence rate $O(k^{-2})$, where $k$ is the iteration counter. For functions with Lipschitz continuous Hessians, we can apply second-order methods with rates of convergence going up to $O(k^{-7/2})$, etc.
This one-to-one correspondence between the type of method and a particular problem class allows us to speak about optimal methods of a certain order, which have an unimprovable convergence rate. However, new results presented recently in [21] break down this peaceful picture. It was shown there that special superfast second-order methods can converge at the rate $O(k^{-4})$, which is faster than the lower bound $O(k^{-7/2})$ for this type of scheme. Of course, there is no contradiction with Complexity Theory. The classical lower bound for second-order schemes was obtained for functions with Lipschitz continuous Hessian, while in [21] we worked with functions having Lipschitz continuous third derivative. In any case, this is the first example of a successful expansion of lower-order methods into territory traditionally reserved for higher-order schemes. In this paper, we try to analyze and explain this phenomenon within a general framework.
At each iteration of the superfast methods from [21], we need to solve a nontrivial auxiliary problem requiring additional calls of the oracle (the number of these calls is bounded by the logarithm of the desired accuracy). Therefore, in our developments we decided to employ one of the most expensive operations of Convex Optimization, the proximal-point iteration.
The proximal approximation of a function $f(\cdot)$, defined by
$$ \varphi_\lambda(x) \;=\; \min_{y} \left\{ f(y) + \frac{1}{2\lambda}\,\|y - x\|^2 \right\}, \quad \lambda > 0, \qquad (1) $$
was introduced by Moreau [15]. Later on, Martinet [14] proposed the first proximal-point method based on this operation. The importance of this construction for computational practice remained questionable up to the developments of Rockafellar [25], who used the proximal-point iteration in the dual space for justifying the Augmented Lagrangians. This dual scheme was accelerated by Güler [10], who introduced into this method some elements of the Fast Gradient Method from [16]. Some attempts were made by Teboulle and others [11,26] to study the proximal-point iteration with a non-quadratic non-Euclidean kernel. However, for decades this idea was mainly considered a theoretical achievement that could hardly be used in efficient optimization algorithms.
In this paper, we come back to this old idea, having in mind another type of kernel function. Our goal is the development of accelerated methods for Unconstrained Convex Optimization. Therefore, we suggest replacing $\|y - x\|^2$ in (1) by $\|y - x\|^{p+1}$ with $p \ge 1$. We call the corresponding proximal step the $p$th-order proximal-point operation. This terminology is justified by two facts.
Firstly, in Sect. 2 we show that the corresponding simple proximal-point method converges as $O(k^{-p})$. The rate of convergence of the accelerated version of this method is $O(k^{-(p+1)})$. In both cases, we can use appropriate approximations of the proximal point.
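To make this operation concrete, the following minimal sketch computes a $p$th-order proximal-point step on a toy problem by a generic numerical solver. The test function `f`, the values of `H` and `p`, and the use of `scipy.optimize.minimize` are illustrative assumptions, not part of the method specification.

```python
# Sketch: pth-order proximal-point step, i.e. an approximate minimizer of
#     f(y) + H/(p+1) * ||y - x_bar||^(p+1).
import numpy as np
from scipy.optimize import minimize

def f(y):
    # A simple smooth convex test function (an assumption for this demo).
    return np.log(1.0 + np.exp(y[0] + y[1])) + 0.5 * y[0] ** 2

def prox_step(x_bar, H=1.0, p=3):
    phi = lambda y: f(y) + H / (p + 1) * np.linalg.norm(y - x_bar) ** (p + 1)
    return minimize(phi, x_bar, method="BFGS").x

x_bar = np.array([1.0, -2.0])
print(prox_step(x_bar))   # approximate high-order proximal point
```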
For a smooth function $f : \mathbb{E} \to \mathbb{R}$, denote by $\nabla f(x)$ its gradient and by $\nabla^2 f(x)$ its Hessian evaluated at a point $x \in \mathbb{E}$. Using the Euclidean norm $\|\cdot\|$, we can define the standard Euclidean prox-functions
$$ d_{p+1}(x) \;=\; \frac{1}{p+1}\,\|x\|^{p+1}, $$
where $p \ge 1$ is an integer parameter. These functions have the following derivatives:
$$ \nabla d_{p+1}(x) \;=\; \|x\|^{p-1} x, \qquad \nabla^2 d_{p+1}(x) \;=\; \|x\|^{p-1} I + (p-1)\,\|x\|^{p-3}\, x x^{T}. $$
Note that the function $d_{p+1}(\cdot)$ is uniformly convex of degree $p+1$ (see, for example, Lemma 4.2.3 in [18]):
$$ \langle \nabla d_{p+1}(x) - \nabla d_{p+1}(y),\, x - y \rangle \;\ge\; \left(\tfrac{1}{2}\right)^{p-1} \|x - y\|^{p+1}, \quad x, y \in \mathbb{E}. $$
In what follows, we often work with directional derivatives. For $p \ge 1$, denote by $D^p f(x)[h_1, \dots, h_p]$ the $p$th directional derivative of $f$ at $x$ along directions $h_1, \dots, h_p \in \mathbb{E}$; it is a symmetric $p$-linear form. Its norm is defined in the standard way:
$$ \|D^p f(x)\| \;=\; \max_{\|h_i\| \le 1,\; 1 \le i \le p} \bigl| D^p f(x)[h_1, \dots, h_p] \bigr|. $$
If all directions $h_1, \dots, h_p$ are the same, we use the notation $D^p f(x)[h]^p$. Note that, in general, we have (see, for example, Appendix 1 in [23])
$$ \|D^p f(x)\| \;=\; \max_{\|h\| \le 1} \bigl| D^p f(x)[h]^p \bigr|. $$
In this paper, we work with functions from the problem classes $\mathcal{F}_p$, which are convex and $p$ times continuously differentiable on $\mathbb{E}$. Denote by $M_p(f)$ the uniform upper bound for the $p$th derivative:
$$ M_p(f) \;=\; \sup_{x \in \mathbb{E}} \|D^p f(x)\|. $$
One of our main results is based on the following relation between the second, third and fourth derivatives of a convex function (see Lemma 3 in [20]):
$$ D^3 f(x)[h]^3 \;\le\; \frac{\xi}{2}\,\langle \nabla^2 f(x) h, h \rangle \;+\; \frac{1}{\xi}\, M_4(f)\,\|h\|^4, \quad h \in \mathbb{E}, \qquad (4) $$
where $\xi$ is an arbitrary positive number.
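For completeness, here is a short sketch of why a bound of this type holds. The constants shown are the ones produced by this elementary argument; the normalization in [20] may differ by a renaming of $\xi$.

```latex
% Fix x, h and consider g(t) = < \nabla^2 f(x + t h) h, h >.
% Convexity of f gives g(t) >= 0, and |g''(t)| <= M_4(f) ||h||^4.
% A nonnegative function with bounded second derivative satisfies
%     g'(0) <= ( 2 M_4(f) ||h||^4 g(0) )^{1/2}.
% Since g'(0) = D^3 f(x)[h]^3 and g(0) = < \nabla^2 f(x) h, h >,
% Young's inequality  sqrt(2ab) <= (xi/2) a + (1/xi) b  yields, for any xi > 0,
\[
  D^3 f(x)[h]^3
  \;\le\; \sqrt{\,2\,M_4(f)\,\|h\|^4\,\langle \nabla^2 f(x)h, h\rangle\,}
  \;\le\; \frac{\xi}{2}\,\langle \nabla^2 f(x)h, h\rangle
          \;+\; \frac{1}{\xi}\,M_4(f)\,\|h\|^4 .
\]
```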

Inexact high-order proximal-point steps
Consider the following optimization problem:
$$ \min_{x \in \mathbb{E}} \; f(x), \qquad (8) $$
where $f(\cdot)$ is a differentiable closed convex function. Denote by $x^*$ one of its optimal solutions and let $f^* = f(x^*)$.
All methods presented in this paper are based on the $p$th-order proximal-point operators, defined as follows:
$$ \mathrm{prox}^p_{f/H}(\bar x) \;=\; \arg\min_{y \in \mathbb{E}} \left\{ f(y) + \frac{H}{p+1}\,\|y - \bar x\|^{p+1} \right\}, \qquad (9) $$
where $H > 0$ and $p \ge 1$. The properties of the standard first-order proximal-point operator are studied very well in the literature (e.g. [24]). However, we will see that the high-order proximal-point methods converge much faster. The main goal of this paper is to establish the global rate of convergence of these methods in accelerated and non-accelerated forms, and to complement this information with the complexity of computing an appropriate inexact proximal-point step (9). Indeed, very often the proximal-point operator (9) cannot be computed in a closed form. Instead, we have to use an approximate solution of this problem, obtained by an auxiliary optimization procedure. Let us introduce the set of acceptable solutions to problem (9), that is
$$ \mathcal{A}^p_H(\bar x, \beta) \;=\; \Bigl\{ T \in \mathbb{E} :\; \bigl\| \nabla f(T) + H \|T - \bar x\|^{p-1} (T - \bar x) \bigr\| \;\le\; \beta H \|T - \bar x\|^p \Bigr\}, \qquad (10) $$
where $\beta \in [0, 1)$ is a tolerance parameter. Note that $\mathrm{prox}^p_{f/H}(\bar x) \in \mathcal{A}^p_H(\bar x, 0) \subseteq \mathcal{A}^p_H(\bar x, \beta)$.

Lemma 1 Let $T \in \mathcal{A}^p_H(\bar x, \beta)$ and denote $r = \|T - \bar x\|$. Then
$$ \|\nabla f(T)\| \;\le\; (1+\beta)\, H r^p, \qquad (11) $$
$$ \langle \nabla f(T), \bar x - T \rangle \;\ge\; \frac{1}{2 H r^{p-1}}\,\|\nabla f(T)\|^2 + \frac{1-\beta^2}{2}\, H r^{p+1}, \qquad (12) $$
and, if $\beta \le \frac{1}{p}$,
$$ \langle \nabla f(T), \bar x - T \rangle \;\ge\; \frac{1}{[(1+\beta) H]^{1/p}}\,\|\nabla f(T)\|^{\frac{p+1}{p}}. \qquad (13) $$

Proof Indeed, inequality (11) follows from the representation $\nabla f(T) = \bigl[\nabla f(T) + H r^{p-1}(T - \bar x)\bigr] - H r^{p-1}(T - \bar x)$ and the triangle inequality. Further, squaring both parts of inequality (10), we have
$$ \|\nabla f(T)\|^2 + 2 H r^{p-1} \langle \nabla f(T), T - \bar x \rangle + H^2 r^{2p} \;\le\; \beta^2 H^2 r^{2p}. $$
This inequality can be rewritten as follows:
$$ \langle \nabla f(T), \bar x - T \rangle \;\ge\; \frac{1}{2 H r^{p-1}}\,\|\nabla f(T)\|^2 + \frac{1 - \beta^2}{2}\, H r^{p+1}, $$
and this is inequality (12). Denote $g = \|\nabla f(T)\|$ and consider the right-hand side of (12) as a function of $r$:
$$ \kappa(r) \;=\; \frac{g^2}{2 H r^{p-1}} + \frac{1 - \beta^2}{2}\, H r^{p+1}. $$
Let us compute the derivative of $\kappa(\cdot)$:
$$ \kappa'(r) \;=\; -\,\frac{(p-1)\, g^2}{2 H r^{p}} + \frac{(p+1)(1-\beta^2)}{2}\, H r^{p}. $$
Note that $r \ge \bigl[\frac{g}{(1+\beta)H}\bigr]^{1/p}$ in view of (11), and for $\beta \le \frac{1}{p}$ the derivative $\kappa'(r)$ is nonnegative on this range. Hence, $\kappa(r)$ is bounded below by its value at $r = \bigl[\frac{g}{(1+\beta)H}\bigr]^{1/p}$, which equals $\frac{g^{(p+1)/p}}{[(1+\beta)H]^{1/p}}$, and this is inequality (13).
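In computations, membership in the set (10) is a simple norm check. A minimal sketch follows; the test gradient and all parameter values are illustrative assumptions.

```python
# Check whether a candidate T is an acceptable inexact proximal point,
# i.e. whether T belongs to A_H^p(x_bar, beta) as defined in (10).
import numpy as np

def grad_f(y):
    # Gradient of a toy convex function f(y) = log(1 + exp(y1 + y2)) (assumption).
    s = 1.0 / (1.0 + np.exp(-(y[0] + y[1])))
    return np.array([s, s])

def is_acceptable(T, x_bar, H, p, beta):
    d = T - x_bar
    r = np.linalg.norm(d)
    residual = grad_f(T) + H * r ** (p - 1) * d   # gradient of the model in (9)
    return np.linalg.norm(residual) <= beta * H * r ** p
```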
The following corollary is a trivial consequence of the convexity of $f(\cdot)$ and inequality (13).

Corollary 1 Let $\beta \le \frac{1}{p}$. Then, for any $T \in \mathcal{A}^p_H(\bar x, \beta)$, we have
$$ f(\bar x) - f(T) \;\ge\; \langle \nabla f(T), \bar x - T \rangle \;\ge\; \frac{1}{[(1+\beta)H]^{1/p}}\,\|\nabla f(T)\|^{\frac{p+1}{p}}. $$
Let us now justify the rate of convergence of the basic inexact high-order proximal-point method:
$$ \text{Choose } x_0 \in \mathbb{E}. \quad \text{For } k \ge 0, \text{ compute } x_{k+1} \in \mathcal{A}^p_H(x_k, \beta). \qquad (17) $$
For analyzing this scheme, we need the following Lemma A.1 from [19].

Lemma 2
Let the sequence of positive numbers $\{\xi_k\}_{k \ge 0}$ satisfy the following condition:
$$ \xi_k - \xi_{k+1} \;\ge\; \alpha\, \xi_{k+1}^{\frac{p+1}{p}}, \quad k \ge 0, $$
where $\alpha \in (0, 1]$. Then for any $k \ge 0$ we have
$$ \xi_k \;\le\; \frac{\xi_0}{\bigl(1 + \frac{\alpha}{p}\,\xi_0^{1/p}\, k\bigr)^{p}}. $$
Denote by $R_0 = \max_x \bigl\{ \|x - x^*\| : f(x) \le f(x_0) \bigr\}$ the radius of the initial level set of the objective function in problem (8).
Theorem 1 Let the sequence $\{x_k\}_{k \ge 0}$ be generated by method (17) with $\beta \le \frac{1}{p}$. Then for any $k \ge 1$ we have
$$ f(x_k) - f^* \;\le\; \frac{(1+\beta)\, H\, p^{\,p}\, R_0^{p+1}}{k^{p}}. \qquad (20) $$
Proof Indeed, in view of Corollary 1, for any $k \ge 0$ we have
$$ f(x_k) - f(x_{k+1}) \;\ge\; \frac{1}{[(1+\beta)H]^{1/p}}\,\|\nabla f(x_{k+1})\|^{\frac{p+1}{p}}. $$
Since the values $f(x_k)$ are decreasing, all iterates belong to the initial level set, and by convexity $f(x_{k+1}) - f^* \le \|\nabla f(x_{k+1})\|\, R_0$. Therefore, for $\xi_k = f(x_k) - f^*$ we obtain
$$ \xi_k - \xi_{k+1} \;\ge\; \frac{1}{\bigl[(1+\beta)\, H\, R_0^{p+1}\bigr]^{1/p}}\;\xi_{k+1}^{\frac{p+1}{p}}, \qquad (18) $$
valid for all $k \ge 0$. Hence, in view of Lemma 2, we get inequality (20).
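As an illustration of scheme (17), the sketch below runs the basic method on a toy convex function, computing each proximal step by a numerical inner solver. The test function, the values of `H` and `p`, and the inner solver with its default accuracy are all assumptions made for the demonstration.

```python
# Basic inexact high-order proximal-point method (17) on a toy problem:
# x_{k+1} approximately minimizes f(y) + H/(p+1) * ||y - x_k||^(p+1).
import numpy as np
from scipy.optimize import minimize

H, p = 1.0, 3

def f(y):
    # Toy smooth convex objective (an assumption for this demo).
    return np.log(np.exp(y[0]) + np.exp(-y[0]) + np.exp(y[1])) + 0.25 * y.dot(y)

x = np.array([3.0, -2.0])
for k in range(20):
    model = lambda y, c=x.copy(): f(y) + H / (p + 1) * np.linalg.norm(y - c) ** (p + 1)
    x = minimize(model, x, method="BFGS").x   # inexact proximal step
    print(k, f(x))   # values decrease monotonically, roughly as O(k^-p)
```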
Note that the rate of convergence (20) of method (17) does not depend on the properties of the function $f(\cdot)$. This means that the actual complexity of problem (8) for this method is reflected in the complexity of finding the point $x_{k+1} \in \mathcal{A}^p_H(x_k, \beta)$. We will discuss this issue in the remaining part of this paper. To conclude this section, let us present an accelerated variant of the Inexact Proximal-Point Method. Our presentation is very similar to the justification of the Accelerated Tensor Methods in Section 2 of [21]; therefore, we omit some technical details. Let us choose $\beta \in [0, \frac{1}{p}]$, so that inequality (13) holds for every proximal step, and define the sequence of scaling coefficients $\{A_k\}_{k \ge 0}$ with increments $a_{k+1} = A_{k+1} - A_k$. The resulting inequality can then be rewritten in a form suitable for the construction of the estimate functions. Consider the following high-order proximal method.

Inexact Accelerated pth-Order Proximal-Point Method
Note that the computation of the point $v_k$ at Step 1 can be done in closed form.
Theorem 2 Let the sequence $\{x_k\}_{k \ge 0}$ be generated by method (25). Then, for any $k \ge 1$, we have the rate of convergence (26). Proof First of all, note that by induction it is easy to justify relation (27); in particular, it provides us with a bound on $\psi^*_k$. Let us prove by induction relation (29). For $k = 0$, we have $\psi^*_0 = 0$ and $A_0 = 0$; hence, (29) is valid. Assume that it is valid for some $k \ge 0$. Then the rules of the method give an estimate for $\psi^*_{k+1}$, where the last equality follows from the equation defining the coefficients $a_{k+1}$. Further, for all $x \in \mathbb{E}$ we have the standard lower bound for the estimate functions (see, for example, Lemma 2 in [17]). Putting all these inequalities together, we obtain relation (29) for $k+1$. It remains to note that, in view of relations (27) and (29), we get the bound (26) for $f(x_k) - f^*$. In order to get the remaining bound for $v_k$, we need to apply inequalities (28) and (29) with $x = x^*$.
We can see that method (25) is much faster than the basic method (17). Its rate of convergence is also independent of the properties of the objective function. Hence, in order to evaluate its actual performance, we need to investigate the complexity of finding a point $T \in \mathcal{A}^p_H(x, \beta)$. This will be done in the remaining sections of the paper.

Approximating proximal-point operator by tensor step
Let us show that, with appropriate values of the parameters, the inclusion $T \in \mathcal{A}^p_H(x, \beta)$ can be ensured by a single step of the Basic Tensor Method of degree $p$. Firstly, let us recall some simple facts from the theory of tensor methods.
For a function $f(\cdot)$, let us assume that $M_{p+1}(f) < +\infty$. Define its Taylor approximation at a point $x \in \mathbb{E}$:
$$ \Phi_{x,p}(y) \;=\; f(x) + \sum_{i=1}^{p} \frac{1}{i!}\, D^i f(x)[y - x]^i. $$
Define now the augmented Taylor polynomial
$$ \Omega_{x,M}(y) \;=\; \Phi_{x,p}(y) + \frac{M}{(p+1)!}\,\|y - x\|^{p+1}, $$
which is convex for $M \ge p M_{p+1}(f)$ (see [20]). Therefore, we are able to compute the tensor step
$$ T_M(x) \;=\; \arg\min_{y \in \mathbb{E}} \; \Omega_{x,M}(y). $$
Let us allow some inexactness in the computation of this point. Namely, we assume that we can compute a point $T$ satisfying the following condition:
$$ \|\nabla \Omega_{x,M}(T)\| \;\le\; \gamma\, \|\nabla f(T)\|, \qquad (33) $$
where $\gamma \in \bigl(0, \frac{\beta}{1+\beta}\bigr)$ is the tolerance parameter. (This inequality was used as a termination criterion in [5].) Since $\|\nabla f(T) - \nabla \Phi_{x,p}(T)\| \le \frac{M_{p+1}(f)}{p!}\,\|T - x\|^p$, condition (33) ensures that, for an appropriate choice of $M$ and $H = \frac{M}{p!}$, we get
$$ \bigl\| \nabla f(T) + H \|T - x\|^{p-1} (T - x) \bigr\| \;\le\; \beta H \|T - x\|^p. $$
In other words, these values of the parameters ensure the inclusion $T \in \mathcal{A}^p_H(x, \beta)$. For such $M$, the convexity condition $M \ge p M_{p+1}(f)$ is satisfied when $\beta \le \frac{1}{p-1}$. Thus, the accelerated and non-accelerated tensor methods from [20] can be seen as particular implementations of the inexact high-order proximal-point methods, and their efficiency bounds can be obtained from Theorems 1 and 2.
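For $p = 2$, the tensor step is the cubically regularized Newton step, which can be computed by a one-dimensional dual search. The following sketch implements this for the convex case; the factorial normalization $M/6$ matches the augmented polynomial above, while the test matrix and vector are assumptions. It assumes the Hessian is symmetric positive definite.

```python
# Tensor step for p = 2: minimize <g, h> + 0.5 <A h, h> + (M/6) ||h||^3
# via the stationarity condition (A + (M r / 2) I) h = -g with r = ||h||.
import numpy as np
from scipy.optimize import brentq

def cubic_newton_step(g, A, M):
    n = len(g)
    def norm_h(r):
        return np.linalg.norm(np.linalg.solve(A + 0.5 * M * r * np.eye(n), -g))
    # phi(r) = ||h(r)|| - r is decreasing; bracket its root and solve.
    r_hi = 1.0
    while norm_h(r_hi) > r_hi:
        r_hi *= 2.0
    r = brentq(lambda r: norm_h(r) - r, 0.0, r_hi)
    return np.linalg.solve(A + 0.5 * M * r * np.eye(n), -g)

A = np.array([[2.0, 0.3], [0.3, 1.0]])   # toy positive definite Hessian
g = np.array([1.0, -0.5])                # toy gradient
print(cubic_newton_step(g, A, M=1.5))    # step h; the tensor point is x + h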

Bi-level unconstrained minimization
In solving problem (8) by the inexact high-order proximal-point methods from Sect. 2, we have two degrees of freedom. Firstly, we need to decide on the order $p$ of the proximal-point method. This defines the rate of convergence of the upper-level process. Note that for obtaining the rates (20) or (26), we do not need any assumptions on the properties of the objective function.
After that, we have to choose the lower-level method for computing a point $T \in \mathcal{A}^p_H(x, \beta)$. For analyzing the efficiency of the latter method, we do need some assumptions on the objective function. Thus, the overall complexity of this bi-level scheme depends on the efficiency bounds of both processes. Note that the objective function in the auxiliary problem (9) has some structure (composite form, uniform convexity), which can help to increase the efficiency of the lower-level method. We call this framework Bi-Level Unconstrained Minimization (BLUM). Let us show that it opens new horizons in the development of very efficient optimization methods.
Indeed, as we have seen in Sect. 3, condition (33) can be satisfied by one step of a tensor method. This strategy does not require additional calls of the oracle. However, high-order tensor methods need computations of high-order derivatives, and therefore they are quite often impractical. In this case, it is reasonable to solve the auxiliary problem in (9) by a cheaper method, based on derivatives of a smaller degree than the order of the underlying proximal-point scheme.
In this section, we present an example where this strategy works very well. We consider a third-order proximal-point method implemented by a second-order scheme. The first confirmation that this is possible was obtained in [21], using approximations of the third derivative along two vectors by finite differences of gradients. In the remaining part of this section, we discuss a simpler approach, based on a direct application of the relative non-degeneracy condition [4,13] to the auxiliary problem (9).
Let us consider the following unconstrained minimization problem:
$$ \min_{x \in \mathbb{E}} \Bigl\{ f_{\bar x, H}(x) \;=\; f(x) + \frac{H}{4}\,\|x - \bar x\|^4 \Bigr\}, \qquad (34) $$
with constant $H > 0$ and central point $\bar x \in \mathbb{E}$. As compared with notation (9), we drop the index $p$, since in this section we always have $p = 3$. In what follows, we assume that the fourth derivative of the function $f(\cdot)$ is bounded on $\mathbb{E}$ by a constant $M_4$. Our main tool for solving problem (34) is the gradient method based on the relative non-degeneracy condition. This condition is formulated in terms of the Bregman distance. Recall that this is a (non-symmetric) distance function between two points $x$ and $y$ from $\mathbb{E}$, computed with respect to some convex scaling function $\rho(\cdot)$. It is defined as follows:
$$ \beta_\rho(x, y) \;=\; \rho(y) - \rho(x) - \langle \nabla \rho(x), y - x \rangle. $$
We say that a function $\varphi(\cdot)$ is relatively non-degenerate on $\mathbb{E}$ with respect to the scaling function $\rho(\cdot)$ if there exist two constants $0 < \mu \le L$ such that
$$ \mu\, \beta_\rho(x, y) \;\le\; \beta_\varphi(x, y) \;\le\; L\, \beta_\rho(x, y), \quad x, y \in \mathbb{E}. \qquad (36) $$
The value $\kappa = \frac{\mu}{L}$ is called the condition number of the function $\varphi(\cdot)$ with respect to $\rho(\cdot)$. Recall that there exists a convenient sufficient condition for relations (36), namely
$$ \mu\, \nabla^2 \rho(x) \;\preceq\; \nabla^2 \varphi(x) \;\preceq\; L\, \nabla^2 \rho(x), \quad x \in \mathbb{E}. \qquad (37) $$
It appears that for the function $f_{\bar x, H}(\cdot)$ in problem (34), we can point out a simple scaling function ensuring the validity of condition (36) with a good value of $\kappa$.
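Before stating this, here is a concrete numerical illustration of the two definitions above: the sketch computes a Bregman distance and spot-checks condition (37) through a generalized eigenvalue problem. The quartic scaling function and all constants below are plausible stand-ins chosen for the demo, not the construction of the next lemma.

```python
# Bregman distance and a numerical spot-check of the sufficient
# condition (37) for a toy pair (phi, rho).
import numpy as np
from scipy.linalg import eigh

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # toy positive definite matrix

def rho(x):        # scaling function: quadratic plus a quartic term (assumption)
    return 0.5 * x.dot(Q @ x) + 4.5 / 4.0 * np.linalg.norm(x) ** 4

def grad_rho(x):
    return Q @ x + 4.5 * np.linalg.norm(x) ** 2 * x

def bregman(x, y):
    return rho(y) - rho(x) - grad_rho(x).dot(y - x)

def hess(c, x):    # Hessian of 0.5 x'Qx + (c/4)||x||^4
    return Q + c * (np.linalg.norm(x) ** 2 * np.eye(2) + 2.0 * np.outer(x, x))

rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.standard_normal(2)
    # Generalized eigenvalues of (hess phi, hess rho) must lie in [mu, L].
    w = eigh(hess(1.5, x), hess(4.5, x), eigvals_only=True)
    print(bregman(x, -x), w.min(), w.max())
```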

Lemma 3 There exists a simple scaling function $\rho_{\bar x}(\cdot)$ such that it and the function $f_{\bar x, H}(\cdot)$ satisfy condition (37) on $\mathbb{E}$ with constants $\mu$ and $L$,
where $\xi \ge 1$ is the unique solution of a certain quadratic equation in the parameters of the problem. Proof For the sake of notation, denote $M_4 = M_4(f)$ and assume that $\bar x = 0 \in \mathbb{E}$. Then, for any $x \in \mathbb{E}$, inequality (4) allows us to bound the third derivative of $f$ by a combination of $\nabla^2 f(x)$ and $M_4 \|x\|^2$. Therefore, we obtain an upper bound on $\nabla^2 f_{\bar x, H}(x)$ in terms of $\nabla^2 \rho_{\bar x}(x)$. Similarly, using again (4), we obtain the corresponding lower bound. From now on, we fix the values of our parameters according to (41). Note that these values satisfy relations (39) and (40). Consequently, we can use a simpler notation for the corresponding scaling function. Let us present now an optimization method for solving problem (34) efficiently. For our goals, the most appropriate variant of this method can be found in [19].
For $k \ge 0$, iterate:
$$ x_{k+1} \;=\; \arg\min_{x \in \mathbb{E}} \Bigl\{ \langle \nabla f_{\bar x, H}(x_k), x - x_k \rangle + L\, \beta_{\rho}(x_k, x) \Bigr\}. \qquad (43) $$
Note that this is a first-order method for solving problem (34), provided that the Hessian $\nabla^2 f(\bar x)$ is represented in an appropriate basis (this can be done before the iterations start). It forms a sequence of points $\{x_k\}_{k \ge 0}$ with monotonically decreasing values of the objective function.
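A minimal sketch of such a Bregman gradient iteration follows, with the step subproblem solved numerically. The objective, the scaling function, and the constant `L` are illustrative assumptions consistent with the relative smoothness requirement, not the exact choices of this section.

```python
# Bregman gradient method in the spirit of (43): each step minimizes a
# linearization of the objective plus L times a Bregman distance.
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
H = 1.5
x_bar = np.array([1.0, -1.0])

def phi(x):        # auxiliary objective: quadratic f plus the quartic term of (34)
    return 0.5 * x.dot(Q @ x) + H / 4.0 * np.linalg.norm(x - x_bar) ** 4

def grad_phi(x):
    return Q @ x + H * np.linalg.norm(x - x_bar) ** 2 * (x - x_bar)

def rho(x):        # scaling function: same quadratic plus a stronger quartic
    return 0.5 * x.dot(Q @ x) + 3.0 * H / 4.0 * np.linalg.norm(x - x_bar) ** 4

def grad_rho(x):
    return Q @ x + 3.0 * H * np.linalg.norm(x - x_bar) ** 2 * (x - x_bar)

L = 1.0            # valid here since hess(phi) <= hess(rho) for this toy pair
x = x_bar.copy()
for k in range(15):
    g, xc = grad_phi(x), x.copy()
    step = lambda y: g.dot(y - xc) + L * (rho(y) - rho(xc) - grad_rho(xc).dot(y - xc))
    x = minimize(step, xc, method="BFGS").x
    print(k, phi(x))   # monotonically decreasing objective values
```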
Applying now Lemma 3 in [19], we come to the following result.

Lemma 4
Let the sequence $\{x_k\}_{k \ge 0}$ be generated by method (43). Then, for any $k \ge 1$ and any $x \in \mathrm{dom}\, \psi$, we have
$$ \varphi(x_k) - \varphi(x) \;\le\; \frac{\mu\, \beta_{\rho}(x_0, x)}{\bigl(\frac{L}{L - \mu}\bigr)^{k} - 1}, $$
where $\mu$ and $L$ are the constants of the relative non-degeneracy condition (36). Let us show how this method can be used on the lower level of the proximal-point method (25) with $p = 3$. Our optimization problem (8) is characterized by the following parameters: the bounds $M_2(f)$ and $M_4(f)$ on the derivatives, the radius $R_0$ of the initial level set, and its diameter $D_0$. For our analysis, the parameters $M_4(f)$ and $R_0$ are critical. The remaining parameters $M_2(f)$ and $D_0$ appear in the efficiency bounds only inside the logarithms. Using the constant $M_2(f)$, we can bound the variation of the objective function on the initial level set:
$$ f(x) - f^* \;\le\; \tfrac{1}{2}\, M_2(f)\, \|x - x^*\|^2 \;\le\; \tfrac{1}{2}\, M_2(f)\, R_0^2. $$
Let us write down the full version of the combination of method (25) with (43). We choose $\beta = \frac{1}{p} = \frac{1}{3}$ and the other parameters by (41). Define the sequences of coefficients by (47), and compute the point $x_{k+1} \in \mathcal{A}^3_H(y_k, \frac{1}{3})$ by the following procedure.

Inexact Accelerated 3rd-Order Proximal-Point Method
• Set $x_{k,0} = y_k$. For $i \ge 0$, iterate the Bregman gradient step (43) applied to the auxiliary function $\varphi_k(\cdot)$, stopping at the first index $i = i^*_k$ for which the acceptance condition is satisfied. The major difference between this method and the earlier tensor methods [20,21] consists in the necessity to call the oracle of the objective function at each iteration of the internal loop.
Clearly, this is a second-order method, which implements the inexact third-order proximal-point method (25). Let us assume for a moment that at each upper-level iteration of this scheme the numbers $i^*_k$ are well defined. Then, by Theorem 2, we get the rate of convergence (49). Thus, it remains to find an upper bound for the numbers $i^*_k$. For that, we need an upper bound for the size of the points $x_{k,i}$. We will do this under the following assumption:
$$ f(x_k) - f^* \;\ge\; \epsilon, \qquad (50) $$
where $\epsilon > 0$ is the desired accuracy of the solution to problem (8). Note that we need this assumption only for estimating the number of steps that are necessary to violate it. Assume that at some iteration $k \ge 0$ the points $x_k$ and $v_k$ are well defined. Since $f(x_k) \le f(x_0)$, in view of Theorem 2 we obtain a bound on the distances of the current iterates from $x^*$. However, in view of Lemma 4, $\varphi_k(x_{k,i}) \to \min_{x \in \mathbb{E}} \varphi_k(x)$ as $i \to \infty$. This guarantees that the acceptance condition is eventually satisfied, ensuring that the auxiliary minimization process at iteration $k$ is finite and $x_{k+1}$ and $v_{k+1}$ are well defined. Let us estimate its length. In view of Lemma 3.2 in [21], for all $u \in \mathbb{E}$ with $\|u\| \le D$, we obtain a uniform bound on the second derivatives. Therefore, we get a Lipschitz-type bound for the gradients at all $x$ with $\|x - y_k\| \le D_1$ and $y \in \mathbb{E}$. At the same time, in view of Lemma 3.3 in [21], the last bound and Theorem 3 imply a linear rate of decrease of the residuals of $\varphi_k$. Hence, since $\|x^*_k - y_k\| \le D_1$, we can get an upper bound for $i^*_k$ from the corresponding inequality. Using Lemma 7 in [21], we can estimate, for the function $\theta(\tau) = \frac{a}{2}\tau^2 + \frac{b}{4}\tau^4$, its dual function as follows:
$$ \theta_*(s) \;=\; \max_{\tau \ge 0} \bigl\{ s\tau - \theta(\tau) \bigr\} \;\le\; \min\left\{ \frac{s^2}{2a},\; \frac{3\, s^{4/3}}{4\, b^{1/3}} \right\}, \quad s \ge 0. $$
In our case, $a = M_2(f) + 15 M_4(f) D_1^2$ and $b = 6 M_4(f)$. Therefore, we can see that all values $i^*_k$ are bounded by $O(\ln \frac{1}{\epsilon})$. A similar reasoning shows that the length of the last iteration, stopped at the moment when condition (50) is violated, is also bounded by $O(\ln \frac{1}{\epsilon})$. Hence, we have proved the following theorem.
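For completeness, here is a one-line derivation of the two components of this dual bound; it is elementary and matches the quoted estimate up to the normalization used in Lemma 7 of [21].

```latex
% Since theta(tau) >= (a/2) tau^2, we get
%   theta_*(s) <= max_tau { s tau - (a/2) tau^2 } = s^2 / (2a).
% Since theta(tau) >= (b/4) tau^4, maximizing s tau - (b/4) tau^4
% over tau >= 0 gives tau_* = (s/b)^{1/3}, and hence
\[
  \theta_*(s) \;\le\; s\,\tau_* - \tfrac{b}{4}\,\tau_*^4
  \;=\; \frac{s^{4/3}}{b^{1/3}}\Bigl(1 - \tfrac14\Bigr)
  \;=\; \frac{3\,s^{4/3}}{4\,b^{1/3}} .
\]
```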

Theorem 4 The second-order method (48) finds an $\epsilon$-solution of problem (8) in
$$ O\left( \Bigl( \frac{H R_0^{4}}{\epsilon} \Bigr)^{1/4} \right) $$
iterations. At each iteration, it calls the second-order oracle once and the first-order oracle at most $O\bigl(\ln \frac{1}{\epsilon}\bigr)$ times.
Let us now discuss the implementation details of method (48). At each inner iteration of this scheme, it is necessary to solve an auxiliary optimization problem for finding the point $x_{k,i+1} \in \mathbb{R}^n$. For doing this efficiently, it is reasonable to start with the computation of a tri-diagonal factorization of the matrix $\nabla^2 f(y_k)$:
$$ \nabla^2 f(y_k) \;=\; U_k \Lambda_k U_k^T, $$
where $U_k \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $\Lambda_k \in \mathbb{R}^{n \times n}$ is a symmetric tri-diagonal matrix. Then we can change variables,
$$ x \;=\; y_k + U_k w, \quad w \in \mathbb{R}^n, $$
and minimize the function $\hat\varphi_k(w) = \varphi_k(y_k + U_k w)$. The advantage of this formulation is that in the new variables the scaling function $\rho_k(\cdot)$ becomes very simple:
$$ \hat\rho_k(w) \;=\; \tfrac{1}{2}\, \langle \Lambda_k w, w \rangle + \tfrac{\tau}{4}\, \|w\|_{(2)}^4, $$
where $\|\cdot\|_{(2)}$ is the standard Euclidean norm in $\mathbb{R}^n$ and $\tau$ is the coefficient of the fourth-order term fixed in (41). Thus, the computation of the new point $w_{k,i+1} = U_k^T (x_{k,i+1} - y_k)$ can be done in linear time. Therefore, the total complexity of each iteration of the inner method is quadratic in $n$ (plus one call of the first-order oracle).
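As a sketch of this preprocessing step, one can use a Hessenberg reduction, which for a symmetric matrix produces exactly a tri-diagonal factorization of the above form; the matrix below is a random stand-in for the Hessian.

```python
# Tri-diagonal factorization A = U T U^T of a symmetric matrix via
# Hessenberg reduction (tri-diagonal in the symmetric case).
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B + B.T                           # symmetric stand-in for the Hessian at y_k

T, U = hessenberg(A, calc_q=True)     # A = U @ T @ U.T, with T tri-diagonal
print(np.allclose(A, U @ T @ U.T))    # True
print(np.allclose(np.triu(T, 2), 0))  # entries beyond the three diagonals vanish
```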
Note that in method (48) we have the possibility of computing lower bounds for the optimal value of the objective function, provided that we have an upper bound $R$ for the distance to the minimum, $\|x_0 - x^*\| \le R$. Then, for $k \ge 1$, we can compute the value $\ell^*_k$ of the best available lower bound and use it in the termination criterion. Note that with this value inequality (49) remains valid if we replace $f^*$ by $\ell^*_k$ and $R_0$ by $R$. If the bound $R$ is not known, we can update the initial guess dynamically, using the observed distance between $x_k$ and $x_0$.

High-order proximal-point methods with line search
In this section, we consider new methods for solving problem (8), which are based on the $p$th-order proximal-point operator with line search ($p \ge 1$). It is defined as follows:
$$ \mathrm{prox}^p_{f/H}(\bar x, u) \;\in\; \operatorname{Arg}\min_{x \in \mathbb{E},\, \tau \in \mathbb{R}} \left\{ f(x) + \frac{H}{p+1}\,\|x - \bar x - \tau u\|^{p+1} \right\}, \qquad (53) $$
where the point $\bar x$ and the direction $u$ belong to $\mathbb{E}$, and the proximal coefficient $H$ is positive. Note that the value of this operator is a solution of a convex optimization problem. As compared with operation (9), we have increased the dimension of the search variable by one; hence, this should not create significant additional complexity. In this paper, we analyze only the exact computation of (53). Let us mention the main properties of operator (53); they follow directly from the first-order optimality conditions. Finally, for proving the remaining inequality, we choose in the optimization problem in (53) $x = \bar x$ and $\tau = 0$.
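A minimal numerical sketch of this operator, under the joint-minimization reading of (53) given above, follows; the test function, the parameter values, and the solver are all assumptions made for illustration.

```python
# Sketch: pth-order proximal-point step with line search, minimizing
# jointly over the point y and the scalar tau along direction u.
import numpy as np
from scipy.optimize import minimize

def f(y):
    return np.log(1.0 + np.exp(y[0] - y[1])) + 0.25 * y.dot(y)  # toy convex f

def prox_line_search(x_bar, u, H=1.0, p=3):
    def obj(z):                      # z = (y_1, ..., y_n, tau)
        y, tau = z[:-1], z[-1]
        return f(y) + H / (p + 1) * np.linalg.norm(y - x_bar - tau * u) ** (p + 1)
    z = minimize(obj, np.append(x_bar, 0.0), method="BFGS").x
    return z[:-1], z[-1]             # the point and the optimal step size

y, tau = prox_line_search(np.array([1.0, 2.0]), np.array([1.0, -1.0]))
print(y, tau)
```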
Clearly, the smaller $H$ is, the better the result of (53). However, small values of $H$ make this computation more difficult. Thus, a reasonable choice of $H$ must be dictated by the problem class and by the auxiliary methods used for solving (53) approximately. We leave the detailed analysis of the different possibilities for future research.
Consider the following optimization scheme.

Define $a_{k+1}$ by the corresponding equation of the scheme; the resulting estimate functions satisfy inequality (58).
Proof Let us prove this relation by induction. For $k = 0$, we have $A_0 = 0$, $B_0 = 0$, and $\psi_0(x) = \frac{1}{2}\|x - x_0\|^2$. Thus, in this case inequality (58) is valid. Let us assume that it is valid for some $k \ge 0$. Then the inductive step follows from the updating rules of the scheme. Let us prove now the main result of this section. In the proof, we closely follow the arguments justifying Lemma 4.3.5 in [18].

Theorem 5
For any $k \ge 1$, we have the bound $f(x_k) - f^* = O\bigl(k^{-(3p+1)/2}\bigr)$, with the constant depending on $H$ and the initial residual. Proof From the rules of the method we obtain the following bound on the quantities $\xi_i$. We now need to minimize the sum $\sum_{i=0}^{k-1} \frac{1}{\xi_i}$ subject to this bound. Since the bound is active, we can introduce for it a Lagrange multiplier $\lambda > 0$ and find the optimal $\xi_i$ from the corresponding stationarity equation. This means that we have proved inequality (61). Choosing the parameter $\theta$ appropriately in (61), we get $C_{k+1} \ge C_k + \frac{2\gamma}{p+1}$, and we conclude that $C_k \ge \gamma + \frac{2\gamma (k-1)}{p+1}$ for $k \ge 1$. Substituting this bound into (61), we get the required rate. It remains to note that, in view of inequality (58), this yields the announced bound on $f(x_k) - f^*$.

Conclusion
In this paper, we present a new framework, BLUM, in which the development of an accelerated minimization scheme consists of two steps. Firstly, we choose the order of the proximal-point iteration. At this moment, we are not restricted by any properties of the objective function except its differentiability (which can even be dropped) and convexity.
Secondly, the higher the order of the proximal-point operator, the better the rate of the upper-level process. This means that our complexity scale, instead of being one-dimensional, should be at least two-dimensional. Of course, all these questions need further investigation. We hope that our results open interesting directions of research related to a further increase of the efficiency of lower-order methods as applied to problem classes which were traditionally out of their scope. During the time which was necessary for the acceptance of this paper by Mathematical Programming, we managed to advance in this direction. The interested reader can consult our next publication [22], where we present a second-order scheme for minimizing functions with Lipschitz continuous third derivative, which has a convergence rate of the upper level of the order $O(\frac{1}{k^5})$, while keeping the logarithmic complexity of the lower-level process.