Abstract
In this paper, we present new second-order methods with convergence rate \(O\left( k^{-4}\right) \), where k is the iteration counter. This is faster than the existing lower bound for this type of scheme (Agarwal and Hazan in Proceedings of the 31st conference on learning theory, PMLR, pp. 774–792, 2018; Arjevani and Shiff in Math Program 178(1–2):327–360, 2019), which is \(O\left( k^{-7/2} \right) \). Our progress can be explained by a finer specification of the problem class. The main idea of this approach consists in implementing the third-order scheme from Nesterov (Math Program 186:157–183, 2021) using a second-order oracle. At each iteration of our method, we solve a nontrivial auxiliary problem by a linearly convergent scheme based on the relative non-degeneracy condition (Bauschke et al. in Math Oper Res 42:330–348, 2016; Lu et al. in SIOPT 28(1):333–354, 2018). During this process, the Hessian of the objective function is computed once, and the gradient is computed \(O\left( \ln {1 \over \epsilon }\right) \) times, where \(\epsilon \) is the desired accuracy of the solution of our problem.
1 Introduction
In recent years, the theory of high-order methods in convex optimization was developed seemingly up to its natural limits. After the discovery of the simple fact that the auxiliary problem in tensor methods can be posed as a problem of minimizing a convex multivariate polynomial [15], the performance of these methods was quickly pushed to the maximal limits [6, 7, 9], given by the theoretical lower complexity bounds [1, 2].
It is interesting that the first accelerated tensor methods were analyzed in the unpublished paper [3], where the author did not express any hope for their practical implementation in the future. In [3] and [15], it was shown that the p-th order methods can be accelerated up to the level \(O\left( k^{-(p+1)}\right) \), where k is the iteration counter. The main advantage of the theory in [15] is that it corresponds to methods with convex polynomial subproblems.
However, the fastest tensor methods [6, 7, 9] are based on the trick discovered in [11] for the second-order methods. It makes it possible to increase the rate of convergence of tensor methods up to the level \(O\left( k^{-(3p+1)/2}\right) \), which matches the lower complexity bounds for functions with Lipschitz-continuous pth derivative. Thus, for example, the best possible rate of convergence of the second-order methods on the corresponding problem class is of the order \(O\left( k^{-7/2}\right) \).
Unfortunately, this advanced technique requires finding at each iteration a root of a univariate nonlinear non-monotone equation defined by inverse Hessians of the objective function. Hence, from the practical point of view, the methods proposed in [15] remain the most attractive.
The developments of this paper are based on one simple observation. In [15], it was shown that the accelerated tensor method of degree three with the rate of convergence \(O\left( k^{-4}\right) \) can be implemented by using at each iteration a simple gradient method based on the relative non-degeneracy condition [4, 10]. This auxiliary method has to minimize an augmented Taylor polynomial of degree three, computed at the current test point \(x \in {\mathbb {R}}^n\):
At each iteration of this linearly convergent scheme, we need to compute the gradient of the auxiliary objective function in h. The only non-trivial part of this gradient comes from the gradient of the third derivative. This is the vector \(D^3f(x)[h]^2 \in {\mathbb {R}}^n\). It is the only place where we need the third-order information. However, it is well known that
In other words, the vector \(D^3f(x)[h]^2\) can be approximated with any accuracy by the first-order information. This means that we have a chance to implement the third-order method with the convergence rate \(O\left( k^{-4}\right) \) using only the second-order information.
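Concretely, the approximation used later in Sect. 4 is the central difference of gradients, \(D^3f(x)[h]^2 \approx {1 \over \tau ^2}\left[ \nabla f(x+\tau h) + \nabla f(x - \tau h) - 2 \nabla f(x)\right] \). A minimal numerical sketch on a toy convex quartic (the test function and the step size \(\tau \) are illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy smooth convex function f(x) = sum(x_i^4)/12 + 0.5*||x||^2.
# Its gradient is x^3/3 + x (elementwise), and the vector
# D^3 f(x)[h]^2 equals 2 * x * h^2 (elementwise).
def grad_f(x):
    return x**3 / 3.0 + x

def d3f_exact(x, h):
    return 2.0 * x * h**2

# Central-difference approximation from gradients only:
# D^3 f(x)[h]^2 ~ (1/tau^2) [grad f(x + tau h) + grad f(x - tau h) - 2 grad f(x)],
# with O(tau^2) error whenever the fourth derivative is bounded.
def d3f_approx(x, h, tau=1e-3):
    return (grad_f(x + tau * h) + grad_f(x - tau * h) - 2.0 * grad_f(x)) / tau**2

rng = np.random.default_rng(0)
x, h = rng.standard_normal(5), rng.standard_normal(5)
err = np.max(np.abs(d3f_approx(x, h) - d3f_exact(x, h)))
```

For this quartic test function the central difference is exact up to floating-point cancellation, so `err` is tiny.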
So, formally, our method will be of order two. However, its rate of convergence is higher than the formal lower bound \(O\left( k^{-7/2}\right) \) for the second-order schemes. Of course, the reason for this is that it works with the problem class initially reserved for the third-order methods. However, interestingly enough, our method demonstrates on this class the same rate of convergence as the third-order schemes.
In order to turn our hint into rigorous statements, we need to introduce into the constructions of Section 5 in [15] some modifications related to the inexactness of the available information. This is the subject of the remaining sections of this paper.
Contents. The paper is organized as follows. In Sect. 2, we introduce a convenient definition of an acceptable neighborhood of the exact tensor step. It differs from the previous ones (e.g. [5, 8, 13]) since its verification requires calling the oracle of the main objective function. However, we will see that it significantly simplifies the overall complexity analysis. We prove that every point from this neighborhood ensures a good decrease of the objective function, which is sufficient for implementing the Basic Tensor Method and its accelerated version without spoiling their rates of convergence.
In Sect. 3, we analyze the rate of convergence of the gradient method based on the relative smoothness condition [4, 10], under the assumption that the gradient of the objective function is computed with a small absolute error. We need this analysis for replacing the exact value of the third derivative along two vectors by a finite difference of the gradients. We show that the perturbed method converges linearly to a small neighborhood of the exact solution.
In Sect. 4, we put all our results together in order to justify a second-order implementation of the accelerated third-order tensor method. The rate of convergence of the resulting algorithm is of the order \(O\left( k^{-4}\right) \), where k is the iteration counter. At each iteration, we compute the Hessian once and the gradient is computed \(O\left( \ln {1 \over \epsilon }\right) \) times, where \(\epsilon \) is the desired accuracy of the solution of the main problem. Recall that this rate of convergence is impossible for the second-order schemes working with the functions with Lipschitz-continuous third derivative (see [1, 2]). However, our problem class is smaller (see Lemma 4.1).
In Sect. 5, we show how to ensure boundedness of the constants, essential for our minimization schemes. Finally, we conclude the paper with Sect. 6, containing a discussion of our results and directions for future research.
Notation and generalities. In what follows, we denote by \({\mathbb {E}}\) a finite-dimensional real vector space and by \({\mathbb {E}}^*\) its dual space composed of linear functions on \({\mathbb {E}}\). For such a function \(s \in {\mathbb {E}}^*\), we denote by \(\langle s, x \rangle \) its value at \(x \in {\mathbb {E}}\).
If it is not mentioned explicitly, we measure distances in \({\mathbb {E}}\) and \({\mathbb {E}}^*\) in a Euclidean norm. For that, using a self-adjoint positive-definite operator \(B: {\mathbb {E}}\rightarrow {\mathbb {E}}^*\) (notation \(B = B^* \succ 0\)), we define
\[ \Vert x \Vert = \langle B x, x \rangle ^{1/2}, \; x \in {\mathbb {E}}, \qquad \Vert s \Vert _* = \langle s, B^{-1} s \rangle ^{1/2}, \; s \in {\mathbb {E}}^*. \]
In the formulas involving products of linear operators, it will be convenient to treat \(x \in {\mathbb {E}}\) as a linear operator from \({\mathbb {R}}\) to \({\mathbb {E}}\), and \(x^*\) as a linear operator from \({\mathbb {E}}^*\) to \({\mathbb {R}}\). In this case, \(xx^*\) is a linear operator from \({\mathbb {E}}^*\) to \({\mathbb {E}}\), acting as follows:
\[ (x x^*)\, s = x \, (x^* s) = \langle s, x \rangle \, x, \qquad s \in {\mathbb {E}}^*. \]
For a smooth function \(f: \mathrm{dom \,}f \rightarrow {\mathbb {R}}\) with convex and open domain \(\mathrm{dom \,}f \subseteq {\mathbb {E}}\), denote by \(\nabla f(x)\) its gradient, and by \(\nabla ^2 f(x)\) its Hessian evaluated at point \(x \in \mathrm{dom \,}f \subseteq {\mathbb {E}}\). Note that
In our analysis, we use the Bregman divergence of a function \(f(\cdot )\), defined as follows:
\[ \beta _f(x,y) = f(y) - f(x) - \langle \nabla f(x), y - x \rangle , \qquad x, y \in \mathrm{dom \,}f. \]
We often work with directional derivatives. For \(p \ge 1\), denote by
the directional derivative of f at x along directions \(h_i \in {\mathbb {E}}\), \(i = 1, \dots , p\). Note that \(D^p f(x)[ \cdot ]\) is a symmetric p-linear form. Its norm is defined as follows:
In terms of our previous notation, for any \(x \in \mathrm{dom \,}f\) and \(h_1, h_2 \in {\mathbb {E}}\), we have
\[ Df(x)[h_1] = \langle \nabla f(x), h_1 \rangle , \qquad D^2f(x)[h_1,h_2] = \langle \nabla ^2 f(x) h_1, h_2 \rangle . \]
For the Hessian, this gives the spectral norm of a self-adjoint linear operator (the maximal modulus of its eigenvalues, computed with respect to the operator B).
If all directions \(h_1, \dots , h_p\) are the same, we apply notation
Then, Taylor approximation of function \(f(\cdot )\) at \(x \in \mathrm{dom \,}f\) can be written as
Note that, in general, we have (see, for example, Appendix 1 in [16])
Similarly, since for \(x, y \in \mathrm{dom \,}f\) being fixed, the form \(D^pf(x)[\cdot , \dots , \cdot ] - D^pf(y)[\cdot , \dots , \cdot ]\) is p-linear and symmetric, we also have
In this paper, we consider functions from the problem classes \({{{\mathcal {F}}}}_p\), which are convex and p times differentiable on \({\mathbb {E}}\). Denote by \(L_p\) a uniform bound for the Lipschitz constant of their pth derivative:
If an ambiguity can arise, we use notation \(L_p(f)\). Sometimes it is more convenient to work with uniform bounds on the derivatives:
If both values are well defined, then \(L_p(f) = M_{p+1}(f)\), \(p \ge 1\).
Let \(F(\cdot )\) be a sufficiently smooth vector function, \(F: \mathrm{dom \,}F \rightarrow {\mathbb {E}}_2\). Then, by the well-known Taylor formula, we have
Hence, we can bound the following residual:
For the same reason, for the functions \(\nabla f(\cdot )\) and \(\nabla ^2 f(\cdot )\), we get
which are valid for all \(x, y \in \mathrm{dom \,}f\).
Finally, for simplifying long expressions, we often use the trivial inequality
\[ (a+b)^p \;\le \; 2^{p-1} \left( a^p + b^p \right) , \]
which is valid for all \(a, b \ge 0\) and \(p \ge 1\).
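Presumably the inequality in question is the standard bound above, which follows from convexity of \(\tau \mapsto \tau ^p\). It can be checked numerically on a grid of sample values:

```python
import itertools

# Check (a + b)^p <= 2^(p-1) * (a^p + b^p) for a, b >= 0 and p >= 1
# (a consequence of convexity of t -> t^p) on a grid of sample values.
grid = [0.5 * i for i in range(11)]        # a, b in [0, 5]
powers = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]    # exponents p >= 1
ok = all(
    (a + b) ** p <= 2 ** (p - 1) * (a ** p + b ** p) + 1e-12
    for a, b in itertools.product(grid, grid)
    for p in powers
)
```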
2 Tensor Methods with Inexact Iteration
Consider the following unconstrained optimization problem:
where \(f(\cdot )\) is a convex function with Lipschitz-continuous pth derivative:
In this section, we work only with Euclidean norms.
We are going to solve problem (12) by tensor methods. Their performance crucially depends on the ability to achieve a significant improvement in the objective function at the current test point.
Definition 2.1
We say that point \(T \in {\mathbb {E}}\) ensures \(\underline{p \,th-order \,improvement}\) of some point \(x \in {\mathbb {E}}\) with factor \(c > 0\) if it satisfies the following inequality:
This terminology has the following justification. Consider the augmented Taylor polynomial of degree \(p \ge 1\):
By (8), for \(H \ge L_p\), this function gives us an upper estimate for the objective. Moreover, for \(H \ge p L_p\) this function is convex (see Theorem 1 in [15]).
We are going to generate new test point T as a close approximation to the minimum of function \(\hat{\varOmega }_{x,p,H}(\cdot )\). Namely, we are interested in points from the following nested neighborhoods:
where \(\gamma \in \left[ 0,1\right) \) is an accuracy parameter. The smallest set \({{{\mathcal {N}}}}^0_{p,H}(x)\) contains only the exact minimizers of the augmented Taylor polynomial. Note that \(\nabla {\hat{\varOmega }}_{x,p,H}(x) = \nabla f(x)\). Hence, if \(\nabla f(x) \ne 0\), then \(x \not \in {{{\mathcal {N}}}}^{\gamma }_{p,H}(x)\) for any \(\gamma \in [0,1)\).
These neighborhoods are important for the following reason.
Theorem 2.1
Let \(x \in {\mathbb {E}}\) and parameters \(\gamma \), H satisfy the following condition:
Then, any point \(T \in {{{\mathcal {N}}}}^{\gamma }_{p,H}(x)\) ensures a pth-order improvement of x with factor
Consequently, we have
Proof
Let \(T \in {{{\mathcal {N}}}}^{\gamma }_{p,H}(x)\). Denote \(r = \Vert x - T \Vert \). Then,
Therefore,
In other words,
Function \(\varkappa (r)\) is convex in \(r \ge 0\). Its derivative in r is
Note that
Thus, \(r \ge r_* {\mathop {=}\limits ^{\mathrm {def}}}\left[ {(1-\gamma )p! \,\Vert \nabla f(T) \Vert _*\over L_p+H} \right] ^{1 \over p}\). At the same time,
So by convexity of \(\varkappa (\cdot )\) and \(r \ge r_*\), we have \(\varkappa (r) \ge \varkappa (r_*)\). Therefore,
Inequality (18) is valid since our function is convex:
We have proved that the pth-order improvement at point \(x \in {\mathbb {E}}\) can be ensured by inexact minimizers of the augmented Taylor polynomials of degree \(p \ge 1\). Let us present the efficiency estimates for the corresponding methods.
From now on, let us assume that the constant \(L_p\) is known. For the sake of notation, we fix the following values of the parameters:
Then, we can use a shorter notation for the following objects:
As a consequence of all these specifications, we have the following result.
Corollary 2.1
For any \(x \in {\mathbb {E}}\), all points from the neighborhood \({{{\mathcal {N}}}}_p(x)\) ensure the pth-order improvement of x with factor \(c_p\).
Let us start with the simplest Inexact Basic Tensor Method:
Denote \(R(x_0) = \max \limits _{y \in {\mathbb {E}}} \{ \Vert y - x^* \Vert : \; f(y) \le f(x_0) \}\).
Theorem 2.2
Let the sequence \(\{ x_k \}_{k \ge 0}\) be generated by method (21). Then, for any \(k \ge 1\) we have
Proof
In view of inequality (18), we have \(f(x_k) \le f(x_0)\) for all \(k \ge 0\). Therefore,
Consequently,
Denoting \(\xi _k = {c^p_p \over R^{p+1}_0}(f(x_k)- f^*)\), we get inequality \(\xi _k - \xi _{k+1} \ge \xi _{k+1}^{p+1 \over p}\). Hence, in view of Lemma 11 in [13], we have
This is exactly the estimate (22). \(\square \)
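The decay guaranteed by the recursion \(\xi _k - \xi _{k+1} \ge \xi _{k+1}^{p+1 \over p}\) can be explored numerically: taking it with equality (the worst case) and solving for \(\xi _{k+1}\) by bisection exhibits the \(O\left( k^{-p}\right) \) behavior behind estimate (22). The initial value \(\xi _0 = 1\) is an arbitrary assumption:

```python
# Worst case of the recursion xi_k - xi_{k+1} >= xi_{k+1}^((p+1)/p):
# take equality and solve t + t^((p+1)/p) = xi_k for t = xi_{k+1} by bisection
# (the left-hand side is increasing in t, so bisection applies).
def next_xi(xi, p):
    lo, hi = 0.0, xi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mid + mid ** ((p + 1) / p) > xi:
            hi = mid
        else:
            lo = mid
    return lo

p = 3                        # the third-order case used later in the paper
xis = [1.0]                  # xi_0 = 1 (an arbitrary starting value)
for _ in range(1000):
    xis.append(next_xi(xis[-1], p))

# Empirically, k^p * xi_k stays bounded, i.e. xi_k = O(k^{-p});
# the continuous analogue xi' = -xi^((p+1)/p) suggests the limit p^p.
ratio = 1000 ** p * xis[1000]
```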
Let us present a convergence analysis for Inexact Accelerated Tensor Method. We need to choose the degree of the method and define the prox-function
This is a uniformly convex function of degree \(p+1\): for all \(x,y \in {\mathbb {E}}\) we have
(see, for example, Lemma 4.2.3 in [14]). Define the sequence
Note that for all values \(B_k = \left( k \over p+1\right) ^{p+1}\) with \(k \ge 0\) we have
Therefore, the elements of sequence \(\{ A_k \}_{k \ge 0}\) satisfy the following inequality:
First of all, note that by induction it is easy to see that
In particular, for \(\psi ^*_k {\mathop {=}\limits ^{\mathrm {def}}}\min \limits _{x \in {\mathbb {E}}} \psi _k(x)\) and all \(x \in {\mathbb {E}}\), we have
Let us prove by induction the following relation:
For \(k = 0\), we have \(\psi _0^* = 0\) and \(A_0 = 0\). Hence, (29) is valid. Assume it is valid for some \(k \ge 0\). Then,
Note that
Further, in view of inequality \({\alpha \over p+1} \tau ^{p+1} - \beta \tau \ge - {p \over p+1} \alpha ^{-1/p} \beta ^{(p+1)/p}\), \(\tau \ge 0\), for all \(x \in {\mathbb {E}}\) we have
Finally, since \(x_{k+1} \in {{{\mathcal {N}}}}_p(y_k)\), by Corollary 2.1 we get
Putting all these inequalities together, we obtain
Thus, we have proved the following theorem.
Theorem 2.3
Let the sequence \(\{ x_k \}_{k \ge 0}\) be generated by method (26). Then, for any \(k \ge 1\), we have
Proof
Indeed, in view of relations (27) and (29), we have
\(\square \)
3 Relative Non-degeneracy and Approximate Gradients
In this section, we measure distances in \({\mathbb {E}}\) by general norms. Consider the following composite minimization problem:
where the convex function \(\varphi (\cdot )\) is differentiable, and \(\psi (\cdot )\) is a simple closed convex function. The most important example of function \(\psi (\cdot )\) is an indicator function for a closed convex set. Denote by \(x^*\) one of the optimal solutions of problem (31), and let \(F^* = F(x^*)\).
Let \(\varphi (\cdot )\) be non-degenerate with respect to some scaling function \(d(\cdot )\):
where \(0 \le \mu _d (\varphi ) \le L_d(\varphi )\). Denote by \(\gamma _d(\varphi ) = {\mu _d (\varphi ) \over L_d(\varphi )} \le 1\) the condition number of function \(\varphi (\cdot )\) with respect to the scaling function \(d(\cdot )\). Sometimes it is more convenient to work with the second-order variant of the condition (32):
We are going to solve problem (31) using an approximate gradient of the smooth part of the objective function. Namely, at each point \(x \in {\mathbb {E}}\) we use a vector \(g_{\varphi }(x)\) such that
where \(\delta \ge 0\) is an accuracy parameter.
Our first goal is to describe the influence of the parameter \(\delta \) on the quality of the computed approximate solutions to problem (31). For this, we need to assume that the function \(d(\cdot )\) is uniformly convex of degree \(p+1\) with \(p \ge 1\):
Consider the following Bregman Distance Gradient Method (BDGM), working with inexact information.
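As a rough illustration of one BDGM step with \(\psi \equiv 0\): the linearization of \(\varphi \) at \(x_k\), built from the inexact gradient \(g_{\varphi }(x_k)\), is minimized together with \(L_d(\varphi )\) times the Bregman distance of the scaling function. With the simplest Euclidean scaling \(d(x) = {1 \over 2}\Vert x \Vert ^2\) this reduces to an inexact gradient step; the quadratic test problem and the noise model below are illustrative assumptions (the paper itself later uses a quartic scaling function):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative test problem: phi(x) = 0.5 * x^T A x with A positive definite.
A = np.diag([1.0, 4.0, 9.0])
phi = lambda x: 0.5 * x @ A @ x
grad_phi = lambda x: A @ x

delta = 1e-6              # absolute accuracy of the inexact gradient, as in (34)
L = np.max(np.diag(A))    # relative-smoothness constant for d(x) = 0.5*||x||^2

x = np.ones(3)
for _ in range(200):
    # Inexact gradient g with ||g - grad_phi(x)|| = delta.
    noise = rng.standard_normal(3)
    g = grad_phi(x) + delta * noise / np.linalg.norm(noise)
    # Bregman step with Euclidean scaling:
    # x_{k+1} = argmin_y { <g, y - x> + L * beta_d(x, y) } = x - g / L.
    x = x - g / L

# Linear convergence down to a delta-size neighborhood of x* = 0.
final_gap = phi(x)
```

The residual `final_gap` stagnates at a level controlled by \(\delta \), which is exactly the behavior quantified by Lemma 3.1.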
Lemma 3.1
Let the approximate gradient \(g_{\varphi }(x_k)\) satisfy the condition (34). Then, for any \(x \in {\mathbb {E}}\) and \(k \ge 0\) we have
where \({\hat{\delta }} {\mathop {=}\limits ^{\mathrm {def}}}{2p \over p+1} \delta ^{p+1 \over p} \left( {(p+1)(2+\gamma _d(\varphi )) \over \sigma _{p+1}(d) \, \gamma _d(\varphi )}\right) ^{1 \over p}\).
Proof
The first-order optimality condition defining \(x_{k+1}\) is as follows:
for all \(x \in \mathrm{dom \,}\psi \). Therefore, denoting \(r_k(x) = \beta _d(x_k,x)\), we have
Note that \(\langle g_{\varphi }(x_k), x - x_{k+1} \rangle = \langle g_{\varphi }(x_k) - \nabla \varphi (x_k) , x - x_{k+1} \rangle + \langle \nabla \varphi (x_k) , x - x_{k+1} \rangle \), and
Hence,
Since \(\Vert x \Vert = \Vert - x \Vert \) for all x in \({\mathbb {E}}\), the minimum in \(x_k\) of the expression in brackets is attained at some \(x_k = (1-\alpha ) x_{k+1} + \alpha x\) with \(\alpha \in (0,1)\). On the other hand, the minimum of the function
is attained at \({\bar{\alpha }} ={ \beta \over 1 + \beta }\) with \(\beta = \left( {1 \over 2}\gamma _d(\varphi ) \right) ^{1 \over p}\). This is
Thus,
Applying inequality (37) with \(x = x^*\) recursively to all \(k = 0, \dots ,T-1\), we get the following relation:
where \(\gamma = {1 \over 4} \gamma _d(\varphi )\), and \(S_T = \sum \limits _{k=0}^{T-1} (1- \gamma )^{T-k-1} \; = \; {1 \over \gamma } \Big ( 1 - (1-\gamma )^{T}\Big )\).
Thus, denoting \(F^*_T = \min \limits _{0 \le k \le T} F(x_k)\), we get the following bound:
Note that \(\lim \limits _{\gamma \downarrow 0} {\gamma (1-\gamma )^T \over 1 - (1-\gamma )^T} = {1 \over T}\). Hence, for \(\mu _d(\varphi ) = 0\) we get the convergence rate
\(\square \)
In our main application, presented in Sect. 4, we need to generate points with small norm of the gradient. In order to achieve this goal with method (36), we need one more assumption on the scaling function \(d(\cdot )\).
From now on, we consider unconstrained minimization problems. This means that in (31) we have \(\psi (x) = 0\) for all \(x \in {\mathbb {E}}\).
Definition 3.1
We call the scaling function \(d(\cdot )\) \(\underline{{norm-dominated}}\) on the set \(S \subseteq {\mathbb {E}}\) if there exists a convex function \(\theta _{S}(\cdot ): {\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) with \(\theta _{S}(0)=0\) such that
for all \(x \in S\) and \(y \in {\mathbb {E}}\).
Clearly, if function \(d(\cdot )\) is norm-dominated by function \(\theta _S(\cdot )\) and \(\eta _S(\tau ) \ge \theta _S(\tau )\) for all \(\tau \ge 0\), then \(d(\cdot )\) is also norm-dominated by function \(\eta _S(\cdot )\).
Let us give an important example of a norm-dominated scaling function.
Lemma 3.2
Function \(d_4(\cdot )\) is norm-dominated on the Euclidean ball
by the function
Proof
Let \(x \in B_R\) and \(y = x + h \in {\mathbb {E}}\). Then,
Thus, we can take \(\theta _R(\tau ) = {1 \over 4} (\tau ^2 + 2 R \tau )^2 + {1 \over 2}R^2 \tau ^2\). \(\square \)
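Assuming \(d_4(x) = {1 \over 4} \Vert x \Vert ^4\) and that norm-domination means \(d(y) - d(x) - \langle \nabla d(x), y - x \rangle \le \theta _S(\Vert y - x \Vert )\) for all \(x \in S\), \(y \in {\mathbb {E}}\) (both are readings inferred from the proof above), the bound of Lemma 3.2 can be checked numerically:

```python
import numpy as np

# Assumed definitions: d4(x) = ||x||^4 / 4, and norm-domination of d on S
# meaning  d(y) - d(x) - <grad d(x), y - x>  <=  theta_S(||y - x||)
# for all x in S and all y.
d4 = lambda x: 0.25 * np.dot(x, x) ** 2
grad_d4 = lambda x: np.dot(x, x) * x
theta = lambda tau, R: 0.25 * (tau**2 + 2 * R * tau) ** 2 + 0.5 * R**2 * tau**2

rng = np.random.default_rng(2)
R = 2.0
ok = True
for _ in range(1000):
    x = rng.standard_normal(4)
    x *= R * rng.random() / np.linalg.norm(x)   # random x in the ball B_R
    h = 3.0 * rng.standard_normal(4)            # arbitrary displacement y - x
    bregman = d4(x + h) - d4(x) - grad_d4(x) @ h
    ok = ok and bregman <= theta(np.linalg.norm(h), R) + 1e-9
```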
Note that the statement of Lemma 3.2 can be extended to all convex polynomial scaling functions.
Norm-dominated scaling functions are important in view of the following.
Lemma 3.3
Let scaling function \(d(\cdot )\) be norm-dominated on the level set
by some function \(\theta (\cdot )\). Then, for any \(x \in {{{\mathcal {L}}}}_{\varphi }({\bar{x}})\) we have:
where \(\theta ^*(\lambda ) = \max \limits _{\tau \ge 0} [ \lambda \tau - \theta (\tau )]\).
Proof
Indeed, for any \( x \in {{{\mathcal {L}}}}_{\varphi }({\bar{x}})\) and \(y \in {\mathbb {E}}\) we have
Therefore,
Thus, for norm-dominated scaling functions, the rate of convergence in function value can be transformed into a rate of decrease of the norm of the gradient of function \(\varphi (\cdot )\). This feature is very important for practical implementations of the Inexact Tensor Methods presented in Sect. 2. In the next section, we discuss in detail how it works for inexact third-order methods.
4 Second-Order Implementations of the Third-Order Methods
In this section, we are going to solve the unconstrained minimization problem
where the objective function is convex and smooth, using second-order implementations of third-order methods. For pure second-order methods, the standard assumption on the objective function in (45) is Lipschitz continuity of the second derivative (see, for example, [12, 17]). We are going to replace it by a stronger assumption, using the following fact.
Lemma 4.1
Let constants \(M_2(f)\) and \(M_4(f)\) be finite. Then
Proof
Let \(x \in \mathrm{dom \,}f\). Then, for any direction \(h \in {\mathbb {E}}\) and \(\tau > 0\) small enough, we have \(x - \tau h \in \mathrm{dom \,}f\) and
Thus, \(D^3f(x) [h]^3 \le {1 \over \tau } \langle \nabla ^2 f(x) h, h \rangle + {\tau \over 2} M_4(f) \Vert h \Vert ^4\). Minimizing this inequality in \(\tau > 0\) and taking the supremum of the result in \(h \in {\mathbb {E}}\), we get (46). \(\square \)
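The minimization in \(\tau \) can be spelled out. Minimizing \({A \over \tau } + {\tau \over 2} B\) over \(\tau > 0\) gives \(\sqrt{2AB}\); with \(A = \langle \nabla ^2 f(x) h, h \rangle \) and \(B = M_4(f) \Vert h \Vert ^4\), this yields (presumably the content of (46), whose display is elided):

```latex
D^3 f(x)[h]^3
   \;\le\; \min_{\tau > 0}
   \left[ \frac{1}{\tau}\,\langle \nabla^2 f(x)\, h, h \rangle
        + \frac{\tau}{2}\, M_4(f)\, \| h \|^4 \right]
   \;=\; \sqrt{2\, \langle \nabla^2 f(x)\, h, h \rangle \, M_4(f)}\; \| h \|^2 .
```

Taking the supremum over \(\Vert h \Vert = 1\) then gives \(M_3(f) \le \left[ 2 M_2(f) M_4(f) \right] ^{1/2}\).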
Thus, from now on, we assume that
The assumption \(M_2(f) < +\infty \) is not really essential. We will discuss different variants of its replacement in Sect. 5.
In our situation, we can apply to (45) the third-order tensor method \(\hbox {ATMI}_3\) (see (26)). At each iteration of this method, we need to minimize the augmented third-order Taylor polynomial \(\hat{\varOmega }_{x,3,H}(\cdot )\). As was shown in [15], this can be done by an auxiliary scheme based on the relative smoothness condition. This approach relies on the following matrix inequality (see Lemma 3 in [15]):
which is valid for all \(x \in \mathrm{dom \,}f\), \(h \in {\mathbb {E}}\) and \(\xi > 0\).
As compared with [15], our situation is more complicated. Firstly, we are not going to use the exact minimum of function \({\hat{\varOmega }}_{x,3,H}(\cdot )\). And secondly, we are going to minimize this function using its approximate gradients.
Let us start with the second issue. Fix a parameter \(\tau > 0\) and, for all \(x,y \in {\mathbb {E}}\), consider the following vector functions:
the finite-difference approximations of the third derivative along the direction \([x-y]^2\).
Lemma 4.2
For any \(x , y \in {\mathbb {E}}\), we have
Proof
Denote \(h = \tau (x-y)\). Then, by Taylor formula we have
Applying a uniform upper bound for the fourth derivative to the right-hand side of this representation, we get inequality (49). Further,
Adding these two representations, we get
and we obtain inequality (50). If the fourth derivative is Lipschitz continuous, then
and this is inequality (51). \(\square \)
In this paper, we usually employ the approximation \(g_y^{\tau }(\cdot )\). Note that
where \(h = x-y\). Thus, we can easily compute approximate gradients of function \({\hat{\varOmega }}_{y,3,H}(\cdot )\) using the first-order information on function \(f(\cdot )\). Let us show that this can help us to minimize the augmented Taylor polynomial of degree three by the machinery presented in Sect. 3.
At each iteration k of \(\hbox {ATMI}_3\), we need to find a point \(x_{k+1} \in {{{\mathcal {N}}}}_3(y_k)\). For simplicity of notation, let us assume that \(y_k = 0\). Thus, we need to find a point \(x_+ \in {{{\mathcal {N}}}}_3(0)\) by minimizing the function
Thus, our auxiliary problem is as follows:
Denote \(x^*_k = \arg \min \limits _{x \in {\mathbb {E}}} \varphi _k(x)\) and \(\varphi ^*_k = \varphi _k(x^*_k)\). Note that
Therefore,
Now it is clear that in our case a good scaling function is as follows:
Indeed, applying the relations (56) with \(\xi = \sqrt{2}\), we get
Thus, we can take
and obtain for function \(\varphi _k(\cdot )\) the condition number bounded by a constant:
The second condition for applicability of method (36) is the uniform convexity of the Bregman distance. In our case, this is true since
Thus, in terms of inequality (35), we have \(\sigma _4(\rho _k) = {1 \over 4}L_3\). This property is important for bounding the size of the set
Lemma 4.3
For any \(x \in {{{\mathcal {L}}}}_k\), we have
Proof
Indeed,
Consequently, we have the following bound:
Further, for \(x \in {{{\mathcal {L}}}}_k\), we have
Thus, \(\Vert x \Vert \le \left[ {16 \over \mu L_3} \Vert \nabla f(0) \Vert _* \right] ^{1 \over 3} = 2^{1/3} R_k\). \(\square \)
The third condition is the possibility of approximating the gradient of function \(\varphi _k(\cdot )\). In our case, in view of Lemma 4.2, we can take
where \(g^{\tau }_0(x) = {1 \over \tau ^2} [\nabla f(\tau x) + \nabla f(-\tau x) - 2 \nabla f(0)]\). In this case,
Thus, in order to ensure condition (34) and keep \(\tau \) separated from zero (this is necessary for stability of the process), we need to guarantee the boundedness of the minimizing sequence for function \(\varphi _k(\cdot )\). However, since we know an explicit upper bound (60) on the size of the optimal point, it is possible to ensure this by introducing an additional constraint on the size of variables. Let us replace the problem (53) by the following one:
In view of Lemma 4.3, the optimal solutions of problems (53) and (64) coincide.
Consider a variant of method (36) with \(\psi \equiv 0\) and accuracy \(\delta > 0\).
Note that the auxiliary problem in this method now has an additional ball constraint (64). However, this does not significantly increase its complexity, since the Euclidean norm is already present in the objective function.
Let us mention the main properties of this minimization process. First of all, since all points \(x_i\) belong to \(S_k\), for all \(i \ge 0\) we have
This means, in particular, that the stopping criterion at Step 2 of method (65) is correct: if it is satisfied, then
which implies \(x_i \in {{{\mathcal {N}}}}_3(0)\).
Moreover, we can apply Lemma 3.1 to the following objects:
Therefore, in our case, inequality (37) with \(p = 3\) can be rewritten as
In view of (57), \(\beta _{\rho _k}(x_0, x ) \le {1 \over 2}L_1R_k^2 + {1 \over 4} L_3 R_k^4\). Hence, by (40) we have
where \(L_1\) is any upper estimate for the value \(\Vert \nabla ^2f(0) \Vert \).
From this bound, we have a natural limit for the number of iterations of method (65), sufficient for obtaining the following inequality:
where \({\hat{x}}_T = \arg \min \limits _x \Big \{ \varphi _k(x): \; x \in \{ 0, x_1, \dots , x_T\} \Big \} \in {{{\mathcal {L}}}}_k\). Indeed, for this it is enough to have
Hence, we have the following bound:
However, the upper-level method \(\hbox {ATMI}_3\) needs a point with small gradient:
In order to derive this bound from inequality (70) with an appropriate value of \(\hat{\delta }_+\), we use the fact that our scaling function \(\rho _k(\cdot )\) is norm-dominated. Indeed, in view of Lemma 3.2 and representation (57), this function is norm-dominated on any Euclidean ball \(B_r\) by the following function:
Hence, in view of Lemma 4.3, our scaling function \(\rho _k(\cdot )\) is norm-dominated on the set \({{{\mathcal {L}}}}_k\) by \(\theta _{{\hat{r}}_k}(\cdot )\) with
Thus, in order to apply Lemma 3.3, we need to estimate from above the inverse of its conjugate function.
Lemma 4.4
For any \(r>0\), we have
Proof
Consider the primal function \(\theta (\tau ) = {a \tau ^2 \over 2} + {b \tau ^4 \over 4} \) with \(a, b \ge 0\). Then, its conjugate function is defined as follows:
We need to find \(\lambda \ge 0\) from the equation \(\xi = \theta ^*(\lambda )\).
Note that the optimal solution \(\tau = \tau (\lambda )\) in the above maximization problem can be found from the equation
Therefore,
Thus, we can write down \(\tau (\lambda )\) as a function of \(\xi \):
Hence,
It remains to use the actual values \(a = L_1 + 5 L_3 r^2\) and \(b = 2 L_3\). \(\square \)
Now we can write down the condition on our parameter \(\delta \) which ensures the desired inequality (72). Indeed, in view of inequalities (70) and (44), after \(T_k(\delta )\) inner steps (see (71)) we can guarantee that
where \(L {\mathop {=}\limits ^{(67)}} 1 + {1 \over \sqrt{2}}\). In order to stop method (65) at this moment, we need to guarantee that the norm of the approximate gradient is small enough. Hence, our condition for parameter \(\delta \) can be derived from the following reasoning. Since
in order to satisfy the condition \(\Vert g_{\varphi _k,\tau }({\hat{x}}_T) \Vert _* \le {1 \over 6} \Vert \nabla f({\hat{x}}_T) \Vert _* - \delta \), by Lemma 4.4, it is sufficient to satisfy the inequality
where \(\epsilon _g > 0\) is a lower bound for the norm of the gradients of the objective function during the whole minimization process. Recall that
Hence, this inequality can be rewritten in the following form:
Using the upper integer bounds on the coefficients, it can be strengthened:
where we take \(L_1 = \Vert \nabla ^2 f(0) \Vert \) since this corresponds to the actual role of this constant in the complexity analysis of method (65).
This means that, in accordance with (78), we need to choose
Since \(\Vert \nabla f(0) \Vert _* \ge \epsilon _g\), we always have \(\delta \le O(\epsilon _g)\).
Note that all coefficients in the condition (78) are known (provided that we have a good estimate for the Lipschitz constant \(L_3\)). Thus, we have
where G and H are the uniform upper bounds for the norms of the gradients and Hessians computed at the points generated by the main process. Validity of the assumption on finiteness of these bounds is discussed in Sect. 5.
Let us write down our inexact algorithmic schemes (21) and (26), employing the inner procedure (65). These methods have only one parameter \(\delta >0\), which must be chosen in accordance with (78). They also need the constant \(L_3\).
We start from the variant of Inexact Basic Tensor Method (21).
At each iteration of this method, we have \(O\left( \ln {G +H \over \epsilon _g}\right) \) iterations of the inner scheme. Each of them needs three calls of the oracle of the main objective function (two for computing the approximate gradient of function \(\varphi _k(\cdot )\) and one for verifying the stopping criterion). In view of Theorem 2.2, the rate of convergence of the main process is as follows:
Thus, the analytical complexity bound of the method (80) is of the order
where \(\epsilon _f>0\) is the desired accuracy in the function value. Note that this method uses only the second-order oracle.
Let us look now at the accelerated scheme.
As before, each iteration of this method needs at most \(O\left( \ln {G +H \over \epsilon _g}\right) \) iterations of the inner scheme. In view of Theorem 2.3, the rate of convergence of the main process in (83) is as follows:
Thus, the analytical complexity bound of this method is of the order
Recall that method (83) is a second-order scheme.
5 Bounds for the Derivatives
The complexity analysis in Sect. 4 is valid only if we can guarantee the finiteness of the constants G and H. The simplest way of doing this consists in considering the following class of functions:
This is a nontrivial class, but it is quite restrictive. In this section, we show that it is possible to derive the finiteness of G and H from our main assumption (47) and the properties of the minimization schemes.
Indeed, we can easily bound derivatives at test points from a bounded set. Let us present a trivial result, which follows from Taylor formula (7).
Lemma 5.1
For any \(x \in B_D(x_0) {\mathop {=}\limits ^{\mathrm {def}}}\{ x \in {\mathbb {E}}: \; \Vert x - x_0 \Vert \le D \}\), we have
We can use the right-hand sides of inequalities (87) as our constants G and H provided that the distance between \(x_0\) and the test points does not exceed some \(D < +\infty \). Note that we do not use D, G, and H in our methods. They appear only in the bounds for the number of inner steps and stay inside the logarithm. The important criterion (78), defining an appropriate value of the parameter \(\delta > 0\), is based on the available information about the first and second derivatives at the current test point.
Thus, we need to prove that the sequences of test points in our methods are bounded. Let us start with the Inexact Basic Tensor Method (80). For this method, the situation is very simple. We have already assumed that the size of the level set \(R(x_0)\) is finite. Since method (80) is monotone, for any \(x_k\) generated by this scheme, we have
Thus, we can take in (87) \(D = 2R(x_0)\).
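This choice follows from the triangle inequality: since \(f(x_k) \le f(x_0)\), both \(x_k\) and \(x_0\) lie within distance \(R(x_0)\) of the minimizer \(x^*\), hence

```latex
\[
  \Vert x_k - x_0 \Vert
  \;\le\; \Vert x_k - x^* \Vert + \Vert x^* - x_0 \Vert
  \;\le\; R(x_0) + R(x_0)
  \;=\; 2R(x_0).
\]
```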
Let us look now at the Inexact Accelerated Tensor Method. For proving the boundedness of the sequence of test points \(\{ y_k \}_{k \ge 0}\), it is better to consider its monotone variant. The additional Step 4 of this method ensures monotonicity of the sequence \(\{ f(x_k) \}_{k\ge 0}\).
The complexity analysis presented in Sect. 2 remains valid also for the monotone variant (88). Indeed, in the right-hand side of relation (29), we can replace the point \(x_k\) by any point with a better value of the objective function.
Lemma 5.2
Let points \(\{ y_k \}_{k \ge 0}\) be generated by the method (88). Then,
Proof
Indeed, choosing in the relation (28) \(p = 3\) and \(x = x^*\), we get
At the same time, since \(f(x_k) \le f(x_0)\), we have \(\Vert x_k - x^* \Vert \le R(x_0)\). Hence, in view of the definition of \(y_k\) at Step 1 in (88),
Thus, for accelerated method (88) we can take \(D = (1+\sqrt{2})R(x_0)\). \(\square \)
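A plausible route to this constant is the following sketch, under the assumptions that relation (28) with \(p=3\), \(x = x^*\) yields \(\Vert v_k - x^* \Vert \le \sqrt{2}\, R(x_0)\) for the auxiliary points \(v_k\), and that Step 1 of (88) defines \(y_k\) as a convex combination of \(x_k\) and \(v_k\):

```latex
% Bounds on the two endpoints of the convex combination:
\[
  \Vert x_k - x_0 \Vert \;\le\; 2R(x_0),
  \qquad
  \Vert v_k - x_0 \Vert
  \;\le\; \Vert v_k - x^* \Vert + \Vert x^* - x_0 \Vert
  \;\le\; (1+\sqrt{2})\,R(x_0).
\]
% Since 2 <= 1 + sqrt(2), any convex combination y_k of x_k and v_k satisfies
\[
  \Vert y_k - x_0 \Vert \;\le\; (1+\sqrt{2})\,R(x_0).
\]
```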
6 Conclusion
From our results, we conclude that the existing classification of problem classes, optimization schemes, and complexity bounds is not perfect. Traditionally, we put in one-to-one correspondence the types of numerical schemes (classified by their order) and the problem classes (classified by the Lipschitz condition on the highest derivative). In this way, we attach the \(1{\text{ st }}\)-order methods to functions with Lipschitz-continuous gradients, the \(2{\text{ nd }}\)-order methods to functions with Lipschitz-continuous Hessians, etc.
This picture allows us to speak about the optimal methods. For example, we say that the Fast Gradient Methods (FGM) with the convergence rate \(O\left( k^{-2}\right) \) are the optimal \(1{\text{ st }}\)-order methods. However, the only reason why FGM could be called optimal is that they implement the lower bound for a certain problem class, which is considered to be the natural field of application for the \(1{\text{ st }}\)-order methods only.
Now it is clear that the above over-simplified picture of the world must be replaced by something more elaborate. We have seen that there exist problem classes for which the \(2{\text{ nd }}\)- and the \(3{\text{ rd }}\)-order methods demonstrate the same rate of convergence. So, the correct classification of problem classes and optimization methods must be at least two-parametric. This is, of course, an interesting topic for further research.
Another interesting question is related to the \(1{\text{ st }}\)-order schemes. Indeed, if we managed to accelerate the \(2{\text{ nd }}\)-order methods above their "natural" complexity limits, maybe there exists a similar possibility for the \(1{\text{ st }}\)-order schemes? In our opinion, the answer is negative. Indeed, the lower complexity bounds for the \(1{\text{ st }}\)-order methods are supported by a worst-case quadratic function. Quadratic functions already have zero high-order derivatives; therefore, no assumption on the high-order derivatives can eliminate this bad function from the problem class. For the \(2{\text{ nd }}\)-order methods, the worst-case function has a discontinuous third derivative (see, for example, Section 4.3.1 in [14]). Therefore, assumptions on the fourth derivative can help.
References
Agarwal, N., Hazan, E.: Lower bounds for higher-order convex optimization. In: Proceedings of the 31st Conference On Learning Theory, PMLR, vol. 75, pp. 774–792 (2018)
Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization. Math. Program. 178(1–2), 327–360 (2019)
Baes, M.: Estimate sequence methods: extensions and approximations. Manuscript, IFOR, ETH Zürich (2009)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first order methods revisited and applications. Math. Oper. Res. 42, 330–348 (2016)
Birgin, E.G., Gardenghi, J.L., Martinez, J.M., Santos, S.A., Toint, P.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. 163, 359–368 (2017)
Bubeck, S., Jiang, Q., Lee, Y.T., Li, Y., Sidford, A.: Near-optimal method for highly nonsmooth convex optimization. In: Conference on Learning Theory, pp. 492–507 (2019)
Gasnikov, A., Gorbunov, E., Kovalev, D., Mohammed, A., Chernousova, E.: The global rate of convergence for optimal tensor methods in smooth convex optimization. arXiv:1809.00382 (2018)
Grapiglia, G.N., Nesterov, Yu.: On inexact solution of auxiliary problems in tensor methods for convex optimization. Optim. Methods Softw. 36(1), 145–170 (2021)
Jiang, B., Wang, H., Zhang, S.: An optimal high-order tensor method for convex optimization. In: Conference on Learning Theory, pp. 1799–1801 (2019)
Lu, H., Freund, R., Nesterov, Yu.: Relatively smooth convex optimization by first-order methods, and applications. SIOPT 28(1), 333–354 (2018)
Monteiro, R.D.C., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to the second-order methods. SIOPT 23(2), 1092–1125 (2013)
Nesterov, Y.: Accelerating the cubic regularization of Newton's method on convex problems. Math. Program. 112(1), 159–181 (2008)
Nesterov, Y.: Inexact basic tensor methods. CORE Discussion Paper 2019/23 (2019)
Nesterov, Y.: Lectures on Convex Optimization. Springer, Berlin (2018)
Nesterov, Y.: Implementable tensor methods in unconstrained convex optimization. Math. Program. 186, 157–183 (2021)
Nesterov, Y., Nemirovskii, A.: Interior Point Polynomial Methods in Convex Programming: Theory and Applications. SIAM, Philadelphia (1994)
Nesterov, Y., Polyak, B.: Cubic regularization of Newton's method and its global performance. Math. Program. 108(1), 177–205 (2006)
Acknowledgements
This paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 788368). It was also supported by Multidisciplinary Institute in Artificial intelligence MIAI@Grenoble Alpes (ANR-19-P3IA-0003). The author would like to thank Alexander Gasnikov for discussions. The comments of two anonymous referees were extremely useful.
Communicated by Anil Aswani.
Nesterov, Y. Superfast Second-Order Methods for Unconstrained Convex Optimization. J Optim Theory Appl 191, 1–30 (2021). https://doi.org/10.1007/s10957-021-01930-y