1 Introduction

Motivation In the last years, we can see an increasing interest in new frameworks for derivation and justification of different methods for Convex Optimization, provided with a worst-case complexity analysis (see, for example, [3, 4, 6, 11, 14, 15, 18, 20,21,22]). It appears that the accelerated proximal tensor methods [2, 20] can be naturally explained through the framework of high-order proximal-point schemes [21] requiring solution of nontrivial auxiliary problem at every iteration.

This possibility serves as a departure point for the results presented in this paper. Indeed, the main drawback of proximal tensor methods is the necessity of using a fixed Euclidean structure for measuring distances between points. However, the multi-dimensional Taylor polynomials are defined by directional derivatives, which are affine-invariant objects. Can we construct a family of tensor methods which do not depend on the choice of the coordinate system in the space of variables? The results of this paper give a positive answer to this question.

Our framework extends the initial results presented in [8, 18]. In [18], it was shown that the classical Frank–Wolfe algorithm can be generalized to the case of a composite objective function [17] by using a contraction of the feasible set towards the current test point. This operation was also used there to justify a second-order method with contraction, which looks similar to the classical trust-region methods [5], but with an asymmetric trust region. The convergence rates for the second-order methods with contractions were significantly improved in [8]. In this paper, we extend the contraction technique to the whole family of tensor methods. However, in the vein of [21], we first start by analysing a conceptual scheme that solves at each iteration an auxiliary optimization problem formulated in terms of the initial objective function.

The results of this work can also be seen as an affine-invariant counterpart of the Contracting Proximal Methods from [6]. In the latter algorithms, one needs to fix in advance a prox function which is suitable for the geometry of the problem. The parameters of the problem class are also usually required. Last but not least, all methods from this work do not fix any particular prox function, and they are parameter-free.

Contents The paper is organized as follows.

In Sect. 2, we present a general framework of Contracting-Point methods. We provide two conceptual variants of our scheme for different conditions of inexactness for the solution of the subproblem: using a point with small residual in the function value, and using a stronger condition which involves the gradients. For both schemes we establish global bounds for the functional residual of the initial problem. These bounds lead to global convergence guarantees under a suitable choice of the parameters. For the scheme with the second condition of inexactness, we also provide a computable accuracy certificate. It can be used to estimate the functional residual directly within the method.

Section 3 contains smoothness conditions which are useful for analysing affine-invariant high-order schemes. We present some basic inequalities and examples related to the new definitions.

In Sect. 4, we show how to implement one iteration of our methods by computing an (inexact) affine-invariant tensor step. For the methods of degree \(p \ge 1\), we establish global convergence in the functional residual of the order \({\mathcal {O}}(1 / k^p)\), where k is the iteration counter. For \(p = 1\), this recovers a well-known result about the global convergence of the classical Frank–Wolfe algorithm [10, 18]. For \(p = 2\), we obtain the Contracting-Domain Newton Method from [8]. Thus, our analysis also extends the results from these works to the case when the corresponding subproblem is solved inexactly.

In Sect. 5, we present a two-level optimization scheme, called the Inexact Contracting Newton Method. This is an implementation of the inexact second-order method, in which the steps are computed by the first-order Conditional Gradient Method. For the resulting algorithm, we establish a global complexity of \({\mathcal {O}}(1 / \varepsilon ^{1/2})\) calls of the smooth part oracle (computing the gradient and Hessian of the smooth part of the objective), and \({\mathcal {O}}(1 / \varepsilon )\) calls of the linear minimization oracle of the composite part, where \(\varepsilon > 0\) is the required accuracy in the functional residual. Additionally, we address an efficient implementation of our method for optimization over the standard simplex.

Section 6 contains numerical experiments.

In Sect. 7, we discuss our results and highlight some open questions for future research.

Notation In what follows we denote by \(\mathbb {E}\) a finite-dimensional real vector space, and by \(\mathbb {E}^*\) its dual space, which is the space of linear functions on \(\mathbb {E}\). The value of a linear function \(s \in \mathbb {E}^{*}\) at a point \(x \in \mathbb {E}\) is denoted by \(\langle s, x \rangle \).

For a smooth function \(f: \mathrm{dom}\,f \rightarrow \mathbb {R}\), where \(\mathrm{dom}\,f \subseteq \mathbb {E}\), we denote by \(\nabla f(x)\) its gradient and by \(\nabla ^2 f(x)\) its Hessian, evaluated at point \(x \in \mathrm{dom}\,f \subseteq \mathbb {E}\). Note that

$$\begin{aligned} \nabla f(x) \in \mathbb {E}^{*}, \quad \nabla ^2 f(x) h \;\; \in \;\; \mathbb {E}^{*}, \end{aligned}$$

for all \(x \in \mathrm{dom}\,f\) and \(h \in \mathbb {E}\). For \(p \ge 1\), we denote by \( D^p f(x)[h_1, \dots , h_p] \) the pth directional derivative of f along directions \(h_1, \dots , h_p \in \mathbb {E}\). Note that \(D^p f(x)\) is a p-linear symmetric form on \(\mathbb {E}\). If \(h_i = h\) for all \(1 \le i \le p\), a shorter notation \(D^p f(x)[h]^p\) is used. For its gradient in h, we use the following notation:

$$\begin{aligned} D^p f(x)[h]^{p - 1} \; {\mathop {=}\limits ^{\mathrm {def}}}\; {1 \over p}\nabla D^p f(x)[h]^p \in \mathbb {E}^{*}, \quad h \in \mathbb {E}. \end{aligned}$$

In particular, \(D^1 f(x)[h]^{0} \equiv \nabla f(x)\), and \(D^2 f(x)[h]^{1} \equiv \nabla ^2 f(x)h\).
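As a simple illustration of this notation (added here for the reader's convenience), consider the quadratic function \(f(x) = \frac{1}{2} \langle A x, x \rangle \) with a self-adjoint operator \(A: \mathbb {E}\rightarrow \mathbb {E}^{*}\). Then, for all \(x, h \in \mathbb {E}\),

$$\begin{aligned} D^1 f(x)[h] = \langle A x, h \rangle , \quad D^2 f(x)[h]^2 = \langle A h, h \rangle , \quad D^2 f(x)[h]^1 = A h, \quad D^q f(x) \equiv 0, \;\; q \ge 3. \end{aligned}$$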

2 Contracting-point methods

Consider the following composite minimization problem

$$\begin{aligned} F^* \; {\mathop {=}\limits ^{\mathrm {def}}}\; \min \limits _{x \in \mathrm{dom}\,\psi } \Big [F(x) = f(x) + \psi (x) \Big ], \end{aligned}$$
(1)

where \(\psi : \mathbb {E}\rightarrow \mathbb {R}\cup \{ +\infty \}\) is a simple proper closed convex function with bounded domain, and the function f(x) is convex and p times (\(p \ge 1\)) continuously differentiable at every point \(x \in \mathrm{dom}\,\psi \).

The main requirement is that \(\psi \) should have a simple structure, which means that the corresponding auxiliary subproblems are efficiently solvable. We will see examples of such subproblems when discussing the implementation of the methods. Typically, we substitute some polynomial model for f, while the composite component remains unchanged.

In this section, we propose a conceptual optimization scheme for solving (1) and provide the motivation for the idea. At each step of our method, we choose a contracting coefficient \(\gamma _k \in (0, 1]\) restricting the nontrivial part of our objective \(f(\cdot )\) to a contracted domain. At the same time, the domain for the composite part remains unchanged.

Namely, at point \(x_k \in \mathrm{dom}\,\psi \), define

$$\begin{aligned} S_k(y) {\mathop {=}\limits ^{\mathrm {def}}}\gamma _k \psi \bigl ( x_k + \frac{1}{\gamma _k}(y - x_k) \bigr ), \quad y = x_k + \gamma _k(v - x_k), \quad v \in \mathrm{dom}\,\psi . \end{aligned}$$

Note that \(S_k(y) = \gamma _k \psi (v)\). Consider the following exact iteration:

$$\begin{aligned} x_{k + 1}^{*} \; \in \; \mathop {\mathrm {Argmin}}\limits _{y} \Bigl \{ f(y) + S_k(y) : \;\; y = (1 - \gamma _k) x_k + \gamma _k v, \; v \in \mathrm{dom}\,\psi \Bigr \}. \end{aligned}$$
(2)

Of course, when \(\gamma _k = 1\), the exact step from (2) solves the initial problem. However, we are going to use an inexact minimizer. In this case, the choice of \(\{ \gamma _k \}_{k \ge 0}\) should take into account the efficiency of solving the auxiliary subproblem.
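For intuition about the contraction, consider the case where \(\psi \) is the \(\{0, +\infty \}\)-indicator of a bounded convex set Q (as in Sects. 5 and 6). Then the contracted composite part is again an indicator function, now of the feasible set contracted towards the current point:

$$\begin{aligned} S_k(y) \; = \; \gamma _k \, \text {Ind}_Q\bigl ( x_k + \tfrac{1}{\gamma _k}(y - x_k) \bigr ) \; = \; \text {Ind}_{(1 - \gamma _k) x_k + \gamma _k Q}(y), \end{aligned}$$

so the auxiliary problem in (2) amounts to minimizing f over the contracted set \((1 - \gamma _k) x_k + \gamma _k Q\).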

Let us consider the function \(v \mapsto g_k(v) := f((1 - \gamma _k) x_k + \gamma _k v)\). Note that its derivatives are as follows:

$$\begin{aligned} D^{q} g_k(v) = \gamma _k^{q} D^q f( (1 - \gamma _k) x_k + \gamma _k v ), \quad q \ge 1. \end{aligned}$$
(3)

The smoothness characteristics of the objective (i.e. the Lipschitz constants) are defined by its derivatives. Hence, we can hope that the smoothness properties of \(g_k(\cdot )\) are better than those of \(f(\cdot )\) when \(\gamma _k < 1\). Indeed, we see from (3) that for smaller \(\gamma _k\) we have smaller derivatives. The idea is to choose \(\gamma _k\) so as to make a trade-off between the smoothness and the quality of approximation of the initial objective. The result of employing the contracted objective should be combined with the progress made by the optimization algorithm up to the current iterate \(x_k\).

Denote by \(F_k(\cdot )\) the objective in the auxiliary problem (2), that is

$$\begin{aligned} F_k(y) {\mathop {=}\limits ^{\mathrm {def}}}f(y) + S_k(y), \quad y \; = \; (1-\gamma _k) x_k + \gamma _k v, \quad v \in \mathrm{dom}\,\psi . \end{aligned}$$

Let us fix a point \(\bar{v}_{k + 1} \in \mathrm{dom}\,\psi \) that is an approximate minimizer of \(F_k\) in v. Thus, we assume that the point \(\bar{x}_{k + 1} = (1-\gamma _k) x_k + \gamma _k \bar{v}_{k+1}\) has a small residual in the function value:

$$\begin{aligned} F_k(\bar{x}_{k + 1}) - F_k(x_{k+1}^*) \le \delta _{k + 1}, \end{aligned}$$
(4)

with some fixed \(\delta _{k + 1} \ge 0\).

Lemma 1

For all \(k \ge 0\) and \(v \in \mathrm{dom}\,\psi \), we have

$$\begin{aligned} F(\bar{x}_{k+1}) \le (1-\gamma _k) F(x_k) + \gamma _k F(v) + \delta _{k + 1}. \end{aligned}$$
(5)

Proof

Indeed, for any \(v \in \mathrm{dom}\,\psi \), we have

$$\begin{aligned} F_k(\bar{x}_{k+1}) \;\; {\mathop {\le }\limits ^{(4)}} \;\; F_k(x_{k+1}^{*}) + \delta _{k+1} \;\; \le \;\; F_k\bigl ((1-\gamma _k) x_k + \gamma _k v\bigr ) + \delta _{k+1} \\ = \; f\bigl ((1-\gamma _k) x_k + \gamma _k v\bigr ) + \gamma _k \psi (v) + \delta _{k+1} \;\; \le \;\; (1-\gamma _k) f(x_k) + \gamma _k F(v) + \delta _{k+1}, \end{aligned}$$

where the last inequality follows from the convexity of \(f(\cdot )\).
Therefore,

$$\begin{aligned} F(\bar{x}_{k+1})= & {} F_k(\bar{x}_{k+1}) + \psi (\bar{x}_{k+1}) - \gamma _k\psi (\bar{v}_{k + 1}) \\\le & {} (1-\gamma _k) f(x_k) + \gamma _k F(v) + \delta _{k+1} + \psi (\bar{x}_{k+1}) - \gamma _k \psi (\bar{v}_{k+1})\\\le & {} (1-\gamma _k) F(x_k) + \gamma _k F(v) + \delta _{k+1}. \end{aligned}$$

\(\square \)

Let us write down our method in an algorithmic form.

$$\begin{aligned} \begin{array}{|c|} \hline \\ \mathbf{Conceptual \ Contracting-Point \ Method, I}\\ \\ \hline \\ \begin{array}{l} \mathbf{Initialization.}\,\, \text{ Choose }\, x_0 \in \mathrm{dom}\,\psi .\\ \\ \mathbf{Iteration}\, k \ge 0.\\ \\ \text{1: } \text{ Choose }\, \gamma _k \in (0,1].\\ \\ \text{2: } \text{ For } \text{ some }\, \delta _{k + 1} \ge 0,\, \text{ find }\, \bar{x}_{k+1}\, \text{ satisfying } (4). \\ \\ \text{3: } \text{ If }\, F(\bar{x}_{k + 1}) \le F(x_k),\, \text{ then } \text{ set }\, x_{k + 1} = \bar{x}_{k + 1}.\,\, \text{ Else } \text{ choose }\, x_{k+1} = x_k.\\ \end{array}\\ \\ \hline \end{array} \end{aligned}$$
(6)

In Step 3 of this method, we add a simple test for ensuring monotonicity in the function value. This step is optional. Moreover, looking at algorithm (6), one may think that we discard the points \(\bar{x}_{k + 1}\) whenever the function value increases, \(F(\bar{x}_{k + 1}) > F(x_k)\), and thus lose some computations. However, even if the point \(\bar{x}_{k + 1}\) has not been taken as \(x_{k + 1}\), we shall use it internally as a starting point for computing the next point \(\bar{x}_{k + 2}\) (see also [9] for the concept of a monotone inexact step).

It is more convenient to describe the rate of convergence of this scheme with respect to another sequence of parameters. Let us introduce an arbitrary sequence of positive numbers \(\{ a_k \}_{k \ge 1}\) and denote \(A_k {\mathop {=}\limits ^{\mathrm {def}}}\sum _{i = 1}^k a_i\). Then, we can define the contracting coefficients as follows

$$\begin{aligned} \gamma _k {\mathop {=}\limits ^{\mathrm {def}}}\frac{a_{k + 1}}{A_{k + 1}}. \end{aligned}$$
(7)

Theorem 1

For all points of the sequence \(\{ x_k \}_{k \ge 0}\), generated by process (6), we have the following relation:

$$\begin{aligned} A_kF(x_k) \le A_k F^{*} + B_k, \quad \text {with} \quad B_k \; {\mathop {=}\limits ^{\mathrm {def}}}\; \sum _{i = 1}^k A_i \delta _i. \end{aligned}$$
(8)

Proof

Indeed, for \(k = 0\), we have \(A_k = 0\), \(B_k = 0\). Hence, (8) is valid. Assume it is valid for some \(k \ge 0\). Then

$$\begin{aligned} A_{k+1} F(x_{k+1})&{\mathop {\le }\limits ^{\text{ Step } \text{3 }}} A_{k+1} F(\bar{x}_{k+1}) \\&\le A_{k+1} \Big ( (1-\gamma _k) F(x_k) + \gamma _k F^* + \delta _{k+1} \Big )\\&{\mathop {=}\limits ^{(7)}}A_k F(x_k) + a_{k+1} F^* + A_{k+1} \delta _{k+1} \\&{\mathop {\le }\limits ^{(8)}} A_{k+1} F^* + B_{k+1}. \end{aligned}$$

\(\square \)

From bound (8), we can see that

$$\begin{aligned} F(x_k) - F^{*} \le \frac{1}{A_k}\, \sum _{i = 1}^k A_i \delta _i, \quad k \ge 1. \end{aligned}$$
(9)

Hence, the actual rate of convergence of method (6) depends on the growth of the coefficients \(\{ A_k \}_{k \ge 1}\) relative to the level of inaccuracies \(\{ \delta _k \}_{k \ge 1}\). Potentially, this rate can be arbitrarily high. Since we have not assumed anything yet about our objective function, this means that we have just transferred the complexity of solving problem (1) to a lower level: that of computing a point \(\bar{x}_{k+1}\) satisfying condition (4). We are going to discuss different possibilities for that in Sects. 4 and 5.

Now, let us endow the method (6) with a computable accuracy certificate. For this purpose, for a sequence of given test points \(\{ \bar{x}_k \}_{k \ge 1} \subset \mathrm{dom}\,\psi \), we introduce the following Estimating Function (see [19]):

$$\begin{aligned} \varphi _k(v) {\mathop {=}\limits ^{\mathrm {def}}}\sum \limits _{i = 1}^k a_i \bigl [ f(\bar{x}_i) + \langle \nabla f(\bar{x}_i), v - \bar{x}_i \rangle + \psi (v) \bigr ]. \end{aligned}$$

By convexity of \(f(\cdot )\), we have \(A_k F(v) \ge \varphi _k(v)\) for all \(v \in \mathrm{dom}\,\psi \). Hence, for all \(k \ge 1\), we can get the following bound for the functional residual:

$$\begin{aligned} F(x_k) - F^{*} \le \ell _k \; {\mathop {=}\limits ^{\mathrm {def}}}\; F(x_k) - \frac{1}{A_k}\, \varphi _k^{*}, \quad \varphi _k^* \; {\mathop {=}\limits ^{\mathrm {def}}}\; \min \limits _{v \in \mathrm{dom}\,\psi } \; \varphi _k(v). \end{aligned}$$
(10)
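To make the certificate concrete, here is a minimal Python sketch of how \(\ell _k\) could be evaluated in the special case where \(\psi \) is the indicator of the standard simplex, so that minimizing \(\varphi _k\) reduces to a single linear minimization (taking the smallest entry of an aggregated gradient). The class and its members are illustrative names, not part of the paper's algorithmic description.

```python
import numpy as np

class AccuracyCertificate:
    """Maintains the estimating function
        phi_k(v) = sum_i a_i * [ f(x_i) + <g_i, v - x_i> + psi(v) ]
    for psi being the indicator of the standard simplex, and evaluates
    the bound l_k from (10)."""

    def __init__(self):
        self.A = 0.0          # A_k = sum of the coefficients a_i
        self.const = 0.0      # sum_i a_i * ( f(x_i) - <g_i, x_i> )
        self.g_sum = None     # sum_i a_i * g_i  (aggregated gradients)

    def update(self, a, f_val, grad, x):
        """Add the term a * [ f(x) + <grad, v - x> ] to phi_k."""
        self.A += a
        self.const += a * (f_val - grad.dot(x))
        self.g_sum = a * grad if self.g_sum is None else self.g_sum + a * grad

    def bound(self, F_of_xk):
        """Return l_k = F(x_k) - phi_k^* / A_k; over the simplex, the minimum of
        the aggregated linear function is attained at its smallest entry."""
        phi_star = self.const + self.g_sum.min()
        return F_of_xk - phi_star / self.A
```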

The complexity of computing the value of \(\ell _k\) usually does not exceed the complexity of computing the next iterate of our method, since it requires just one call of the linear minimization oracle. Let us show that an appropriate rate of decrease of the estimates \(\ell _k\) can be guaranteed by sufficiently accurate steps of method (2). For that, we need a stronger condition on the point \(\bar{x}_{k + 1}\), that is

$$\begin{aligned}&\langle \nabla f(\bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle + \psi (v) \ge \psi (\bar{v}_{k + 1}) - \frac{1}{\gamma _k} \delta _{k + 1}, \quad v \in \mathrm{dom}\,\psi , \nonumber \\&\quad \bar{x}_{k + 1} = (1-\gamma _k) x_k + \gamma _k \bar{v}_{k+1}, \end{aligned}$$
(11)

with some \(\delta _{k + 1} \ge 0\). Note that, for \(\delta _{k + 1} = 0\), condition (11) ensures the exactness of the corresponding step of method (2).

Let us consider now the following algorithm.

$$\begin{aligned} \begin{array}{|c|} \hline \\ \mathbf{Conceptual}\,\,\mathbf{Contracting-Point}\,\,\mathbf{Method,}\,\,\mathbf{II}\\ \\ \hline \\ \begin{array}{l} \mathbf{Initialization.} \text{ Choose }\, x_0 \in \mathrm{dom}\,\psi .\\ \\ \mathbf{Iteration}\, k \ge 0.\\ \\ \text{1: } \text{ Choose }\, \gamma _k \in (0,1].\\ \\ \text{2: } \text{ For } \text{ some }\, \delta _{k + 1} \ge 0,\, \text{ find }\, \bar{x}_{k+1}\, \text{ satisfying } \text{(11) }. \\ \\ \text{3: } \text{ If }\, F(\bar{x}_{k + 1}) \le F(x_k),\, \text{ then } \text{ set }\, x_{k + 1} = \bar{x}_{k + 1}.\,\, \text{ Else } \text{ choose }\, x_{k+1} = x_k.\\ \end{array}\\ \\ \hline \end{array} \end{aligned}$$
(12)

This scheme differs from the previous method (6) only in the characteristic condition (11) for the next test point.

Theorem 2

For all points of the sequence \(\{ x_k \}_{k \ge 0}\), generated by the process (12), we have

$$\begin{aligned} \varphi _k^{*} \ge A_k F(x_k) - B_k, \quad k \ge 0. \end{aligned}$$
(13)

Proof

For \(k = 0\), relation (13) is valid since both sides are zeros. Assume that (13) holds for some \(k \ge 0\). Then, for any \(v \in \mathrm{dom}\,\psi \), we have

$$\begin{aligned} \varphi _{k + 1}(v)&\equiv \varphi _k(v) + a_{k + 1} \bigl [ f(\bar{x}_{k + 1}) + \langle \nabla f( \bar{x}_{k + 1} ), v - \bar{x}_{k + 1} \rangle + \psi (v) \bigr ] \\&{\mathop {\ge }\limits ^{(13)}} A_k F(x_k) - B_k + a_{k + 1} \bigl [ f(\bar{x}_{k + 1}) + \langle \nabla f( \bar{x}_{k + 1} ), v - \bar{x}_{k + 1} \rangle + \psi (v) \bigr ] \\&\overset{(*)}{\ge } A_{k + 1} \bigl [ f(\bar{x}_{k + 1}) + \langle \nabla f(\bar{x}_{k + 1}), \frac{a_{k + 1} v + A_k x_k}{A_{k + 1}} - \bar{x}_{k + 1} \rangle \bigr ] \\&\qquad + A_k \psi (x_k) + a_{k + 1} \psi (v) - B_k \\&= A_{k + 1} f(\bar{x}_{k + 1}) + a_{k + 1} \bigl [ \langle \nabla f(\bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle + \psi (v)\bigr ] \\&\qquad + A_k \psi (x_k) - B_k \\&{\mathop {\ge }\limits ^{(11)}} A_{k + 1} f(\bar{x}_{k + 1}) + a_{k + 1} \psi (\bar{v}_{k + 1}) + A_k \psi (x_k) - B_{k + 1} \\&\overset{(**)}{\ge } A_{k + 1} F(\bar{x}_{k + 1}) - B_{k + 1} \;\; {\mathop {\ge }\limits ^{\text{ Step } \text{3 }}} \;\; A_{k + 1} F(x_{k + 1}) - B_{k + 1}. \end{aligned}$$

Here, the inequalities \((*)\) and \((**)\) are justified by convexity of \(f(\cdot )\) and \(\psi (\cdot )\), correspondingly. Thus, (13) is proved for all \(k \ge 0\). \(\square \)

Combining now (10) with (13), we obtain

$$\begin{aligned} F(x_k) - F^{*} \le \ell _k \;\; \le \;\; \frac{1}{A_k} \, \sum _{i = 1}^k A_i \delta _i, \quad k \ge 1. \end{aligned}$$
(14)

We see that the right-hand side in (14) is the same as that in (9). However, this guarantee is stronger, since it provides a bound for the computable accuracy certificate \(\ell _k\).

3 Affine-invariant high-order smoothness conditions

We are going to study the complexity of solving the auxiliary problem in (2), and how it depends on the contracting parameter. For that, we use affine-invariant characteristics of the variation of the function \(f(\cdot )\) over compact convex sets. For a convex set Q, define

$$\begin{aligned} \varDelta ^p_Q(f) {\mathop {=}\limits ^{\mathrm {def}}}\mathop {\sup }\limits _{\begin{array}{c} { x, v \in Q,} \\ {t \in (0, 1] } \end{array}} \frac{1}{t^{p + 1}}\Big | f( x + t(v - x)) - f(x) - \sum \limits _{i = 1}^p \frac{t^i}{i!}D^i f(x)[v - x]^i \Big |. \end{aligned}$$
(15)

Note that for \(p = 1\) this characteristic was considered in [13] for the analysis of the classical Frank–Wolfe algorithm.

In many situations, it is more convenient to use an upper bound for \(\varDelta ^p_Q(f)\), given by the full variation of the \((p+1)\)th derivative of f over the set Q:

$$\begin{aligned} {\mathcal {V}}^{p+1}_Q(f) {\mathop {=}\limits ^{\mathrm {def}}}\sup \limits _{x, y, v\in Q} \Big | D^{p+1}f(y)[v - x]^{p+1} \Big |. \end{aligned}$$
(16)

Indeed, by Taylor formula, we have

$$\begin{aligned}&\frac{1}{t^{p + 1}}\Big [f(x + t(v - x)) - f(x) - \sum \limits _{i = 1}^p \frac{t^i}{i!}D^i f(x)[v - x]^i\Big ]\\&\quad = {1 \over p!}\int \limits _0^1 (1-\tau )^p D^{p+1}f(x + \tau t(v-x))[v-x]^{p+1} d \tau . \end{aligned}$$

Hence,

$$\begin{aligned} \varDelta ^p_Q(f) \le {1 \over (p+1)!} {\mathcal {V}}^{p+1}_Q(f). \end{aligned}$$
(17)

Sometimes, in order to exploit a primal-dual structure of the problem, we need to work with the dual objects (gradients), as in method (12). In this case, we need a characteristic of variation of the gradient \(\nabla f(\cdot )\) over the set Q:

$$\begin{aligned} \varGamma ^p_Q(f)&{\mathop {=}\limits ^{\mathrm {def}}}&\mathop {\sup }\limits _{\begin{array}{c} { x, y, v \in Q,} \\ {t \in (0, 1] } \end{array}} \frac{1}{t^p}\Big | \langle \nabla f(x + t(v - x)) - \nabla f(x) \nonumber \\&- \sum \limits _{i = 2}^{p} \frac{t^{i - 1}}{(i - 1)!}D^i f(x)[v - x]^{i - 1}, v - y \rangle \Big |. \end{aligned}$$
(18)

Since

$$\begin{aligned}&\frac{1}{t} \Bigl [ f(x + t(v - x)) - f(x) - \sum \limits _{i = 1}^p \frac{t^i}{i!}D^i f(x)[v - x]^i \Bigr ] \\&\quad = \frac{1}{t}\Bigl [ \int \limits _0^1 \langle \nabla f(x + \tau t(v - x)), t (v - x) \rangle d\tau - \sum \limits _{i = 1}^p \frac{t^i}{i!} D^i f( x )[v - x]^i \Bigr ] \\&\quad = \int \limits _0^1 \langle \nabla f( x + \tau t (v - x) ) - \sum \limits _{i = 1}^p \frac{(\tau t)^{i - 1}}{(i - 1)!} D^{i} f(x)[v - x]^{i - 1}, v - x\rangle d\tau , \end{aligned}$$

we conclude that

$$\begin{aligned} \varDelta _Q^p(f) \le \frac{1}{p + 1}\varGamma _Q^p (f). \end{aligned}$$
(19)

At the same time, by Taylor formula, we get

$$\begin{aligned}&\frac{1}{t^p}\Bigl [ \nabla f(x + t(v - x)) - \nabla f(x) - \sum \limits _{i = 2}^{p} \frac{t^{i - 1}}{(i - 1)!} D^{i} f(x)[v - x]^{i - 1}\Bigr ] \nonumber \\&\qquad = \frac{1}{(p - 1)!} \int \limits _0^1 (1 - \tau )^{p - 1} D^{p + 1} f(x + \tau t(v - x) )[v - x]^{p} d \tau . \end{aligned}$$
(20)

Therefore, again we have an upper bound in terms of the variation of \((p + 1)\)th derivative, that is

$$\begin{aligned} \varGamma ^p_Q(f) {\mathop {\le }\limits ^{(20)}} \frac{1}{p!} \sup \limits _{x, y, z, v \in Q} \Big | \langle D^{p+1}f(z)[v-x]^{p}, v - y\rangle \Big | \;\; \le \;\; \frac{2(p + 1)^p}{(p!)^2} {\mathcal {V}}^{p+1}_Q(f). \end{aligned}$$

The last inequality can be justified by simple arguments from Linear Algebra (see also Section 2.3 in [16]). Hence, the value \({\mathcal {V}}_Q^{p + 1}(f)\) is the largest of these characteristics. However, in many cases it is the most convenient one to work with.

Example 1

For a fixed self-adjoint positive-definite linear operator \(B: \mathbb {E}\rightarrow \mathbb {E}^{*}\), define the corresponding Euclidean norm as \(\Vert x\Vert := \langle Bx, x \rangle ^{1/2}, \; x \in \mathbb {E}\). Let \(Q \subset \mathbb {E}\) be a compact set with diameter

$$\begin{aligned} \mathscr {D}= & {} \mathscr {D}_{\Vert \cdot \Vert }(Q) \; {\mathop {=}\limits ^{\mathrm {def}}}\; \max \limits _{x, y \in Q} \Vert x - y\Vert \; < \; +\infty . \end{aligned}$$

Let W be an open set containing it: \(Q \subset W \subseteq \mathbb {E}\). Assume that function f is \((p + 1)\)-times continuously differentiable on W, and its p-th derivative is Lipschitz continuous on W (w.r.t. \(\Vert \cdot \Vert \)):

$$\begin{aligned} \Vert D^p f(x) - D^p f(y) \Vert {\mathop {=}\limits ^{\mathrm {def}}}\max \limits _{h \in \mathbb {E}: \Vert h\Vert \le 1} | (D^p f(x) - D^p f(y))[h]^p | \; \le \; L_p \Vert y - x\Vert , \end{aligned}$$

for all \(x, y \in W\).

Then, we have

$$\begin{aligned} {\mathcal {V}}^{p + 1}_Q(f)\le & {} L_p \mathscr {D}^{p + 1}. \end{aligned}$$

\(\square \)

In some situations we can obtain much better estimates.

Example 2

Let \(A \succeq 0\), and \(f(x) = {1 \over 2}\langle A x, x \rangle \) with

$$\begin{aligned} x \in \mathbb {S}_n&{\mathop {=}\limits ^{\mathrm {def}}}&\{ x \in \mathbb {R}^n_+: \sum \limits _{i=1}^n x^{(i)} = 1 \}. \end{aligned}$$

For measuring distances in the standard simplex, we choose \(\ell _1\)-norm:

$$\begin{aligned} \Vert h \Vert= & {} \sum \limits _{i=1}^n |h^{(i)}|, \quad h \in \mathbb {R}^n. \end{aligned}$$

In this case, \(\mathscr {D}= \mathscr {D}_{\Vert \cdot \Vert }(\mathbb {S}_n) = 2\), and \(L_1 = \max \limits _{1 \le i \le n} A^{(i,i)}\). On the other hand,

$$\begin{aligned} {\mathcal {V}}^2_{\mathbb {S}_n}(f)= & {} \max \limits _{1 \le i, j \le n} \langle A(e_i-e_j), e_i - e_j \rangle \\\le & {} \max \limits _{1 \le i, j \le n} [2 \langle Ae_i,e_i \rangle + 2 \langle A e_j, e_j \rangle ] \; = \; 4L_1, \end{aligned}$$

where \(e_k\) denotes the kth coordinate vector in \(\mathbb {R}^n\). Thus, \({\mathcal {V}}^2_{\mathbb {S}_n}(f) \le L_1 \mathscr {D}^2\).

However, for some matrices, the value \({\mathcal {V}}^2_{\mathbb {S}_n}(f)\) can be much smaller than \(L_1\mathscr {D}^2\). Indeed, let \(A = a a^T\) for some \(a \in \mathbb {R}^n\). Then \(L_1 = \max \limits _{1 \le i \le n} (a^{(i)})^2\), and

$$\begin{aligned} {\mathcal {V}}^2_{\mathbb {S}_n}(f)= & {} \left[ \max \limits _{1 \le i \le n} a^{(i)} - \min \limits _{1 \le i \le n} a^{(i)} \right] ^2, \end{aligned}$$

which can be much smaller than \(4L_1\). \(\square \)
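This effect is easy to check numerically. The following small script (an illustration, not taken from the paper) compares both quantities for a rank-one matrix whose generating vector has entries clustered around a common value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = 1.0 + 0.1 * rng.standard_normal(n)   # entries clustered around a common value
A = np.outer(a, a)                       # A = a a^T, positive semidefinite

L1 = np.max(np.diag(A))                  # L1 = max_i (a^(i))^2
V2 = (a.max() - a.min()) ** 2            # V^2_{S_n}(f) for f(x) = 0.5 * <A x, x>

print(f"L1 * D^2 = {4 * L1:.3f}")        # of order 4-6 for this data
print(f"V^2      = {V2:.3f}")            # typically two orders of magnitude smaller
```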

Example 3

For given vectors \(a_1, \dots , a_m \in \mathbb {R}^n\), consider the objective

$$\begin{aligned} f(x)= & {} \ln \left( \sum \limits _{k = 1}^m e^{\langle a_k, x \rangle } \right) , \quad x \in \mathbb {S}_n. \end{aligned}$$

Then, it holds

$$\begin{aligned} \langle \nabla ^2 f(x)h, h \rangle\le & {} \max \limits _{1 \le k, l \le m} \langle a_k - a_l, h \rangle ^2 \\\le & {} \max \limits _{1 \le k, l \le m} \Vert a_k - a_l\Vert _{\infty }^2 \Vert h\Vert _1^2, \quad h \in \mathbb {R}^n \end{aligned}$$

(see Example 1 in [7] for the first inequality). Therefore, in \(\ell _1\)-norm we have

$$\begin{aligned} L_1= & {} \max \limits _{1 \le k, l \le m} \max \limits _{1 \le i \le n} \left[ a_k^{(i)} - a_l^{(i)} \right] ^2. \end{aligned}$$

At the same time,

$$\begin{aligned} {\mathcal {V}}_{\mathbb {S}_n}^2(f)= & {} \sup \limits _{x \in \mathbb {S}_n} \max \limits _{1 \le i, j \le n} \langle \nabla ^2 f(x) (e_i - e_j), e_i - e_j \rangle \\\le & {} \max \limits _{1 \le k, l \le m} \max \limits _{1 \le i, j \le n} \left[ \bigl ( a_k^{(i)} - a_k^{(j)} \bigr ) - \bigl ( a_l^{(i)} - a_l^{(j)} \bigr ) \right] ^2. \end{aligned}$$

The last expression is the maximal difference between variations of the coordinates. It can be much smaller than \(L_1 \mathscr {D}^2 = 4 L_1\).

Moreover, we have (see Example 1 in [7]):

$$\begin{aligned} |D^3 f(x)[h]^3|\le & {} \max \limits _{1 \le k, l \le m} |\langle a_k - a_l, h \rangle |^3, \quad h \in \mathbb {R}^n. \end{aligned}$$

Hence, we obtain

$$\begin{aligned} {\mathcal {V}}_{\mathbb {S}_n}^3(f)\le & {} \max \limits _{1 \le k, l \le m} \max \limits _{1 \le i, j \le n} \Bigl | \bigl ( a_k^{(i)} - a_k^{(j)} \bigr ) - \bigl ( a_l^{(i)} - a_l^{(j)} \bigr ) \Bigr |^3. \end{aligned}$$

\(\square \)

4 Contracting-point tensor methods

In this section, we show how to implement the Contracting-point methods by using affine-invariant tensor steps. At each iteration of (2), we approximate \(f(\cdot )\) by its Taylor polynomial of degree \(p \ge 1\) around the current point \(x_k\):

$$\begin{aligned} f(y)\approx & {} \varOmega _p(f, x_k; y) \;\; {\mathop {=}\limits ^{\mathrm {def}}}\;\; f(x_k) + \sum \limits _{i = 1}^p \frac{1}{i!} D^i f(x_k) [y - x_k]^i. \end{aligned}$$

Thus, we need to solve the following auxiliary problem:

$$\begin{aligned} \min \limits _{v \in \mathrm{dom}\,\psi } \Bigl \{ M_k(y) {\mathop {=}\limits ^{\mathrm {def}}}\varOmega _p(f, x_k; y) + S_k(y) : \; y = (1 - \gamma _k )x_k + \gamma _k v \Bigr \}. \end{aligned}$$
(21)

Note that the global minimum \(M_k^{*}\) of this problem is well defined, since \(\mathrm{dom}\,\psi \) is bounded. Let us take

$$\begin{aligned} \bar{x}_{k + 1}= & {} (1 - \gamma _k )x_k + \gamma _k \bar{v}_{k+1}, \end{aligned}$$

where \(\bar{v}_{k+1}\) is an inexact solution to (21) in the following sense:

$$\begin{aligned} M_k(\bar{x}_{k + 1}) - M_k^{*} \le \xi _{k + 1}. \end{aligned}$$
(22)

Then, this point serves as a good candidate for the inexact step of our method.

Theorem 3

Let \(\xi _{k + 1} \le c\gamma _k^{p + 1}\), for some constant \(c \ge 0\). Then

$$\begin{aligned} F_k(\bar{x}_{k + 1}) - F_k^{*}\le & {} \delta _{k + 1}, \end{aligned}$$

for \(\delta _{k + 1} = (c + 2\varDelta _{\mathrm{dom}\,\psi }^p(f))\gamma _k^{p + 1}\).

Proof

Indeed, by definition (15) with \(t = \gamma _k\), for any point of the form \(y = x_k + \gamma _k (v - x_k)\) with arbitrary \(v \in \mathrm{dom}\,\psi \), we have \(| f(y) - \varOmega _p(f, x_k; y) | \le \varDelta _{\mathrm{dom}\,\psi }^p(f) \gamma _k^{p + 1}\). Therefore,

$$\begin{aligned} F_k(\bar{x}_{k + 1}) \; = \; f(\bar{x}_{k + 1}) + S_k(\bar{x}_{k + 1}) \; \le \; M_k(\bar{x}_{k + 1}) + \varDelta _{\mathrm{dom}\,\psi }^p(f) \gamma _k^{p + 1} \\ {\mathop {\le }\limits ^{(22)}} \; M_k(y) + \xi _{k + 1} + \varDelta _{\mathrm{dom}\,\psi }^p(f) \gamma _k^{p + 1} \; \le \; F_k(y) + \xi _{k + 1} + 2 \varDelta _{\mathrm{dom}\,\psi }^p(f) \gamma _k^{p + 1}. \end{aligned}$$

Minimizing the right-hand side over \(v \in \mathrm{dom}\,\psi \) and using \(\xi _{k + 1} \le c \gamma _k^{p + 1}\), we obtain \(F_k(\bar{x}_{k + 1}) \le F_k^{*} + (c + 2\varDelta _{\mathrm{dom}\,\psi }^p(f)) \gamma _k^{p + 1}\).
\(\square \)

Thus, we come to the following minimization scheme.

$$\begin{aligned} \begin{array}{|c|} \hline \\ \text{ Contracting-Point } \text{ Tensor } \text{ Method, } \text{ I }\\ \\ \hline \\ \begin{array}{l} \mathbf{Initialization.}\,\, \text{ Choose }\, x_0 \in \mathrm{dom}\,\psi , c \ge 0.\\ \\ \mathbf{Iteration}\, k \ge 0.\\ \\ \text{1: } \text{ Choose }\, \gamma _k \in (0,1].\\ \\ \text{2: } \text{ For } \text{ some }\, \xi _{k + 1} \le c \gamma _k^{p + 1}, \text{ find }\, \bar{x}_{k+1}\, \text{ satisfying } \text{(22) }. \\ \\ \text{3: } \text{ If }\, F(\bar{x}_{k + 1}) \le F(x_k),\, \text{ then } \text{ set }\, x_{k + 1} = \bar{x}_{k + 1}.\, \text{ Else } \text{ choose }\, x_{k+1} = x_k.\\ \end{array}\\ \\ \hline \end{array} \end{aligned}$$
(23)

For \(p = 1\) and \(\psi (\cdot )\) being the indicator function of a compact convex set, this is the well-known Frank–Wolfe algorithm [10]. For \(p = 2\), this is the Contracting-Domain Newton Method from [8].
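To make the connection with the Frank–Wolfe algorithm explicit, here is a minimal Python sketch of the \(p = 1\) case of method (23) over the standard simplex. For \(p = 1\) the subproblem (21) is linear in v, so it is solved exactly by a vertex; the function handles `F` and `grad_f` are illustrative assumptions.

```python
import numpy as np

def contracting_point_step_p1(x, k, grad_f):
    """One iteration of method (23) for p = 1 over the standard simplex.
    This is the classical Frank-Wolfe step with gamma_k = 2 / (k + 2)."""
    gamma = 2.0 / (k + 2)                    # gamma_k = (p + 1) / (k + p + 1), p = 1
    g = grad_f(x)
    v = np.zeros_like(x)
    v[np.argmin(g)] = 1.0                    # vertex minimizing <grad f(x_k), v> over S_n
    return (1.0 - gamma) * x + gamma * v     # x_bar_{k+1}

def minimize_over_simplex(F, grad_f, n, iters=100):
    """Outer loop of (23) for p = 1, with the monotone test of Step 3."""
    x = np.ones(n) / n                       # x_0: center of the simplex
    for k in range(iters):
        x_bar = contracting_point_step_p1(x, k, grad_f)
        if F(x_bar) <= F(x):
            x = x_bar
    return x
```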

A straightforward consequence of our observations is the following theorem.

Theorem 4

Let \(A_k {\mathop {=}\limits ^{\mathrm {def}}}k \cdot (k + 1) \cdot \ldots \cdot (k + p)\), and consequently \(\gamma _k = \frac{p + 1}{k + p + 1}\). Then, for all iterations \(\{ x_k \}_{k \ge 1}\) generated by method (23), we have

$$\begin{aligned} F(x_k) - F^{*}\le & {} (p + 1)^{p + 1} \cdot (c + 2\varDelta _{\mathrm{dom}\,\psi }^p(f)) \cdot k^{-p}. \end{aligned}$$

Proof

Combining (9) with Theorem 3, we have

$$\begin{aligned} F(x_k) - F^{*}\le & {} \frac{(c + 2\varDelta _{\mathrm{dom}\,\psi }^p (f))}{A_k} \sum \limits _{i = 1}^k \frac{a_i^{p + 1}}{A_i^p}, \quad k \ge 1. \end{aligned}$$

Since

$$\begin{aligned} \frac{1}{A_k} \sum \limits _{i = 1}^k \frac{a_{i}^{p + 1}}{A_i^p}= & {} \frac{1}{A_k} \sum \limits _{i = 1}^k \frac{(p + 1)^{p + 1} A_i}{(p + i)^{p + 1}} \;\; \le \;\; \frac{(p + 1)^{p + 1} k}{A_k} \;\; \le \;\; \frac{(p + 1)^{p + 1}}{k^p}, \end{aligned}$$

we get the required inequality. \(\square \)
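For completeness, let us verify the formula for \(\gamma _k\) used in Theorems 4 and 6 (and, for \(p = 2\), in Theorem 7 below). With \(A_k = k (k + 1) \cdots (k + p)\), we have

$$\begin{aligned} a_{k + 1} \; = \; A_{k + 1} - A_k \; = \; (k + 1) \cdots (k + p) \bigl [ (k + p + 1) - k \bigr ] \; = \; (p + 1)(k + 1) \cdots (k + p), \end{aligned}$$

and therefore, by (7),

$$\begin{aligned} \gamma _k \; = \; \frac{a_{k + 1}}{A_{k + 1}} \; = \; \frac{(p + 1)(k + 1) \cdots (k + p)}{(k + 1) \cdots (k + p + 1)} \; = \; \frac{p + 1}{k + p + 1}. \end{aligned}$$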

It is important that the required level of accuracy \(\xi _{k + 1}\) for solving the subproblem is not static: it changes with the iterations. Indeed, from the practical perspective, there is no need to use high accuracy during the first iterations, but it is natural to improve the precision while approaching the optimum. Inexact proximal-type tensor methods with dynamic inner accuracies were studied in [9].

Let us note that the objective \(M_k(y)\) from (21) is generally nonconvex for \(p \ge 3\), and it may be nontrivial to find its global minimum. Because of that, we propose an alternative condition for the next point. It requires just finding a point satisfying an (inexact) first-order necessary condition for local optimality of \(\varOmega _p(f, x_k; y)\). That is a point \(\bar{x}_{k + 1}\) satisfying, for all \(v \in \mathrm{dom}\,\psi \),

$$\begin{aligned}&\langle \nabla \varOmega _p(f, x_k; \bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle + \psi (v) \ge \psi (\bar{v}_{k + 1}) - \frac{1}{\gamma _k} \xi _{k + 1}, \nonumber \\&\quad \bar{x}_{k + 1} = (1-\gamma _k) x_k + \gamma _k \bar{v}_{k+1}, \end{aligned}$$
(24)

for some tolerance value \(\xi _{k + 1} \ge 0\).

Theorem 5

Let point \(\bar{x}_{k + 1}\) satisfy condition (24) with

$$\begin{aligned} \xi _{k + 1}\le & {} c \gamma _k^{p + 1}, \end{aligned}$$

for some constant \(c \ge 0\). Then it satisfies inexact condition (11) of the Conceptual Contracting-Point Method with

$$\begin{aligned} \delta _{k + 1}= & {} (c + \varGamma _{\mathrm{dom}\,\psi }^p(f))\gamma _k^{p + 1}. \end{aligned}$$

Proof

Indeed, for any \(v \in \mathrm{dom}\,\psi \), we have

$$\begin{aligned}&\langle \nabla f(\bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle + \psi (v) \\&\quad =\langle \nabla \varOmega _p (f, x_k; \bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle + \psi (v) \\&\qquad + \langle \nabla f(\bar{x}_{k + 1}) - \nabla \varOmega _p(f, x_k; \bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle \\&\quad {\mathop {\ge }\limits ^{(24)}} \psi (\bar{v}_{k + 1}) - c \gamma _k^p + \langle \nabla f(\bar{x}_{k + 1}) - \nabla \varOmega _p(f, x_k; \bar{x}_{k + 1}), v - \bar{v}_{k + 1} \rangle \\&\quad {\mathop {\ge }\limits ^{(18)}} \psi (\bar{v}_{k + 1}) - (c + \varGamma _{\mathrm{dom}\,\psi }^p(f))\gamma _k^p \;\; = \;\; \psi (\bar{v}_{k + 1}) - \frac{1}{\gamma _k} \delta _{k + 1}. \end{aligned}$$

\(\square \)

Note the appearance of \(\gamma _k^{p + 1}\) in both Theorems 3 and 5. It comes from the form of the derivatives of the contracted objective (3), where we substitute \(q = p + 1\) to bound the error of the p-th order Taylor approximation.

Now, changing inexactness condition (22) in method (23) by condition (24), we come to the following algorithm.

$$\begin{aligned} \begin{array}{|c|} \hline \\ \mathbf{Contracting-Point\,\, Tensor\,\, Method, II}\\ \\ \hline \\ \begin{array}{l} \mathbf{Initialization.}\,\, \text{ Choose }\, x_0 \in \mathrm{dom}\,\psi , c \ge 0.\\ \\ \mathbf{Iteration}\,\, k \ge 0.\\ \\ \text{1: } \text{ Choose }\, \gamma _k \in (0,1].\\ \\ \text{2: } \text{ For } \text{ some }\, \xi _{k + 1} \le c \gamma _k^{p + 1},\, \text{ find }\, \bar{x}_{k+1}\, \text{ satisfying } \text{(24) }. \\ \\ \text{3: } \text{ If }\, F(\bar{x}_{k + 1}) \le F(x_k),\, \text{ then } \text{ set }\, x_{k + 1} = \bar{x}_{k + 1}.\, \text{ Else } \text{ choose }\, x_{k+1} = x_k.\\ \end{array}\\ \\ \hline \end{array} \end{aligned}$$
(25)

Its convergence analysis is straightforward.

Theorem 6

Let \(A_k {\mathop {=}\limits ^{\mathrm {def}}}k \cdot (k + 1) \cdot \ldots \cdot (k + p)\), and consequently \(\gamma _k = \frac{p + 1}{k + p + 1}\). Then, for all iterations \(\{x_k \}_{k \ge 1}\) of method (25), we have

$$\begin{aligned} F(x_k) - F^{*}\le & {} \ell _k \;\; \le \;\; (p + 1)^{p + 1} \cdot (c + \varGamma _{\mathrm{dom}\,\psi }^p(f)) \cdot k^{-p}. \end{aligned}$$

Proof

Combining inequality (14) with the statement of Theorem 5, we have

$$\begin{aligned} F(x_k) - F^{*}\le & {} \ell _k \;\; \le \;\; \frac{c + \varGamma _{\mathrm{dom}\,\psi }^p (f)}{A_k} \sum \limits _{i = 1}^k \frac{a_i^{p + 1}}{A_i^p}, \quad k \ge 1. \end{aligned}$$

It remains to use the same reasoning, as in the proof of Theorem 4. \(\square \)

Finally, let us discuss a trust-region interpretation of our methods. In the exact form (\(\xi _k \equiv 0\)), the iterations of the Contracting-Point Tensor Methods can be rewritten as follows, for \(k \ge 0\):

$$\begin{aligned} x_{k + 1} \; \in \; \mathop {\mathrm {Argmin}}\limits _{y} \Bigl \{ \varOmega _p(f, x_k; y) + S_k(y) : \;\; y = (1 - \gamma _k) x_k + \gamma _k v, \; v \in \mathrm{dom}\,\psi \Bigr \}. \end{aligned}$$

For \(\psi (x) \equiv \text {Ind}_Q(x)\), where Q is a bounded convex set, the constraint becomes \(y \in (1 - \gamma _k) x_k + \gamma _k Q\), and this method can be seen as a trust-region scheme [5] with the p-th order Taylor model of the objective function, regularized by the contraction of the feasible set Q towards the current point.

5 Inexact contracting Newton method

In this section, let us present an implementation of our method (23) for \(p = 2\), in which at each step we solve the subproblem inexactly by a variant of the first-order Conditional Gradient Method. The entire algorithm looks as follows.

(26)

We provide an analysis of the total number of oracle calls for f (step 2) and the total number of linear minimization oracle calls for the composite component \(\psi \) (step 4-c), required to solve problem (1) up to the given accuracy level.
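Since the display (26) is referenced below only through its step numbers (steps 1, 2 and 4-a – 4-e), here is a minimal Python sketch of the two-level scheme as it can be reconstructed from the analysis in this section: the outer loop performs contracting Newton steps with \(\gamma _k = 3/(k+3)\), and the inner loop is a Conditional Gradient method with coefficients \(\alpha _t = 2/(t+2)\) and a stopping test based on the estimating-function gap. The structure, the generic `lmo` linear-minimization oracle and the stopping rule are an interpretation of the text, not a verbatim copy of the original display.

```python
import numpy as np

def inexact_contracting_newton(F, grad_f, hess_f, lmo, x0, c=0.1, K=50):
    """Two-level scheme: outer contracting Newton steps (p = 2) with
    gamma_k = 3 / (k + 3), inner Conditional Gradient loop on the model
    m_k(v) = g_k(v) + psi(v).  Here psi is assumed to be the indicator of a
    bounded convex set, accessed via lmo(s) = argmin_{w in dom psi} <s, w>,
    so that F = f on the feasible set; x0 must be feasible."""
    x = x0.copy()
    for k in range(K):
        gamma = 3.0 / (k + 3)                          # step 1
        g0, H = grad_f(x), hess_f(x)                   # step 2: smooth-part oracle

        def g_val(v):                                  # g_k(v) from the text
            d = v - x
            return g0.dot(d) + 0.5 * gamma * d.dot(H.dot(d))

        z = x.copy()                                   # z_0 = x_k
        agg_lin = np.zeros_like(x)                     # aggregated linear part of phi
        agg_const, A_t, t = 0.0, 0.0, 0
        while True:                                    # steps 4-a -- 4-e
            alpha = 2.0 / (t + 2)                      # 4-a: alpha_t = a_{t+1} / A_{t+1}
            gz = g0 + gamma * H.dot(z - x)             # 4-b: grad g_k(z_t)
            a_next = 2.0 * (t + 1)                     # a_{t+1}, since A_t = t (t + 1)
            A_t += a_next
            agg_lin += a_next * gz
            agg_const += a_next * (g_val(z) - gz.dot(z))
            w = lmo(agg_lin)                           # 4-c: linear minimization oracle
            z = alpha * w + (1.0 - alpha) * z          # 4-d: z_{t+1}
            phi_star = (agg_const + agg_lin.dot(w)) / A_t
            if g_val(z) - phi_star <= c * gamma ** 2:  # 4-e: test based on (30)-(31)
                break
            t += 1

        x_bar = x + gamma * (z - x)                    # x_bar_{k+1}
        if F(x_bar) <= F(x):                           # monotone test (Step 3)
            x = x_bar
    return x
```

For the standard simplex, `lmo(s)` simply returns the coordinate vector \(e_j\) at the index of the smallest entry of s, which is exactly the specialization discussed at the end of this section.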

Theorem 7

Let \(\gamma _k = \frac{3}{k + 3}\). Then, for iterations \(\{ x_k \}_{k \ge 1}\) generated by method (26), we have

$$\begin{aligned} F(x_k) - F^{*} \le 27 \cdot (c + 2 \varDelta ^{(2)}_{\mathrm{dom}\,\psi }(f)) \cdot k^{-2}. \end{aligned}$$
(27)

Therefore, for any \(\varepsilon > 0\), it is enough to perform

$$\begin{aligned} K = \biggl \lceil \sqrt{\frac{27(c + 2\varDelta ^{(2)}_{\mathrm{dom}\,\psi }(f))}{\varepsilon }} \; \biggr \rceil \end{aligned}$$
(28)

iterations of the method, in order to get \(F(x_K) - F^{*} \le \varepsilon \). The total number \(N_K\) of linear minimization oracle calls during these iterations is bounded as

$$\begin{aligned} N_K \le 2 \cdot \Bigl ( 1 + \frac{2 {\mathcal {V}}^{(2)}_{\mathrm{dom}\,\psi }(f)}{c} \Bigr ) \cdot \Bigl ( 1 + \frac{ 27(c + 2 \varDelta ^{(2)}_{\mathrm{dom}\,\psi }(f)) }{\varepsilon } \Bigr ). \end{aligned}$$
(29)

Proof

Let us fix an arbitrary iteration \(k \ge 0\) of our method and consider the following objective:

$$\begin{aligned} m_k(v)= & {} g_k(v) + \psi (v) \\= & {} \langle \nabla f(x_k), v - x_k \rangle + \frac{\gamma _k}{2} \langle \nabla ^2 f(x_k)(v - x_k), v - x_k \rangle + \psi (v). \end{aligned}$$

We need to find the point \(\bar{v}_{k + 1}\) such that

$$\begin{aligned} m_k(\bar{v}_{k + 1}) - m_k^{*} \le c \gamma _k^{2}. \end{aligned}$$
(30)

Note that if we set \(\bar{x}_{k + 1} := \gamma _k \bar{v}_{k + 1} + (1 - \gamma _k) x_k\), then from (30) we obtain bound (22) satisfied with \(\xi _{k + 1} = c \gamma _k^{3}\). Thus, we would obtain an iteration of Algorithm (23) for \(p = 2\), and Theorem 4 gives the required rate of convergence (27). We are about to show that steps 4-a – 4-e of our algorithm aim at finding such a point \(\bar{v}_{k + 1}\).

Let us introduce auxiliary sequences \(A_t {\mathop {=}\limits ^{\mathrm {def}}}t \cdot (t + 1)\) and \(a_{t + 1} {\mathop {=}\limits ^{\mathrm {def}}}A_{t + 1} - A_t\) for \(t \ge 0\). We use these sequences for analysing the inner method.

Then, \(\alpha _t \equiv \frac{a_{t + 1}}{A_{t + 1}}\), and we have the following representation of the Estimating Functions, for every \(t \ge 0\)

$$\begin{aligned} \phi _{t + 1}(w)= & {} \frac{1}{A_{t + 1}} \sum \limits _{i = 0}^{t} a_{i + 1} \Bigl [ g_k(z_i) + \langle \nabla g_k(z_i), w - z_i \rangle + \psi (w) \Bigr ]. \end{aligned}$$

By convexity of \(g_k(\cdot )\), we have

$$\begin{aligned} m_k (w)\ge & {} \phi _{t + 1}(w), \quad w \in \mathrm{dom}\,\psi . \end{aligned}$$

Therefore, we obtain the following upper bound for the residual (30), for any \(v \in \mathrm{dom}\,\psi \)

$$\begin{aligned} m_k(v) - m_k^{*} \le m_k(v) - \phi _{t + 1}^{*}, \end{aligned}$$
(31)

where \(\phi _{t + 1}^{*} = \min _{w} \phi _{t + 1}(w) = \phi _{t + 1}(w_{t + 1})\).

Now, let us show by induction, that

$$\begin{aligned} A_{t}\phi _{t}^{*} \ge A_{t} m_k(z_{t}) \; - \; B_{t}, \quad t \ge 0, \end{aligned}$$
(32)

for \(B_t := \frac{\gamma _k {\mathcal {V}}_{\mathrm{dom}\,\psi }^2 (f)}{2} \sum _{i = 0}^t \frac{a_{i + 1}^2}{A_{i + 1}}\). It obviously holds for \(t = 0\). Assume that it holds for some \(t \ge 0\). Then,

$$\begin{aligned} A_{t + 1} \phi _{t + 1}^{*}= & {} A_{t + 1} \phi _{t + 1}(w_{t + 1}) \\= & {} A_t \phi _t(w_{t + 1}) + a_{t + 1} \bigl [ g_k(z_{t}) + \langle \nabla g_k(z_{t}), w_{t + 1} - z_{t} \rangle + \psi (w_{t + 1}) \bigr ] \\&{\mathop {\ge }\limits ^{(32)}} A_t m_k(z_t) + a_{t + 1} \bigl [ g_k(z_{t}) + \langle \nabla g_k(z_{t}), w_{t + 1} - z_{t} \rangle + \psi (w_{t + 1}) \bigr ] - B_t \\= & {} A_{t + 1} \Bigl [ g_k(z_{t}) + \alpha _t \langle \nabla g_k(z_t), w_{t + 1} - z_t \rangle + \alpha _t \psi (w_{t + 1}) + (1 - \alpha _t) \psi (z_t) \Bigr ] - B_t \\\ge & {} A_{t + 1} \Bigl [ g_k(z_{t}) + \alpha _t \langle \nabla g_k(z_t), w_{t + 1} - z_t \rangle + \psi (z_{t + 1}) \Bigr ] - B_t. \end{aligned}$$

Note that

$$\begin{aligned} g_k(z_{t + 1})= & {} g_k(z_t + \alpha _t(w_{t + 1} - z_t)) \\= & {} g_k(z_t) + \alpha _t \langle \nabla g_k(z_t), w_{t + 1} - z_t \rangle \\&+ \frac{\alpha _t^2 \gamma _k}{2} \langle \nabla ^2 f(x_k)(w_{t + 1} - z_t), w_{t + 1} - z_t \rangle . \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} A_{t + 1} \phi _{t + 1}^{*}\ge & {} A_{t + 1} m_k(z_{t + 1}) - B_t - \frac{a_{t + 1}^2}{A_{t + 1}} \cdot \frac{\gamma _k {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f)}{2}, \end{aligned}$$

and this is (32) for the next step. Therefore, we have (32) established for all \(t \ge 0\).

Combining (31) with (32), we get the following guarantee for the inner steps 4-a – 4-e:

$$\begin{aligned} m_k(z_{t + 1}) - m_k^{*}\le & {} m_k(z_{t + 1}) - \phi _{t + 1}^{*} \;\; \le \;\; \frac{\gamma _k {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f)}{2 A_{t + 1}} \sum \limits _{i = 0}^{t} \frac{a_{i + 1}^2}{A_{i + 1}} \\\le & {} \frac{2 \gamma _k {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f)}{{t + 1}}. \end{aligned}$$

Therefore, all iterations of our method are well defined. We exit from the inner loop at step 4-e after

$$\begin{aligned} t \ge \frac{2 {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f) }{c \gamma _k} - 1 = \frac{2(k + 3) {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f) }{3 c} - 1, \end{aligned}$$
(33)

and the point \(\bar{v}_{k + 1} \equiv z_{t + 1}\) satisfies (30).

Hence, we obtain (27) and (28). The total number of linear minimization oracle calls can be estimated as follows

$$\begin{aligned} N_K&{\mathop {\le }\limits ^{(33)}} \sum \limits _{k = 0}^{K - 1} \Bigl ( 1 + \frac{2(k + 3) {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f)}{3c} \Bigr ) \;\; = \;\; K \Bigl ( 1 + \frac{{\mathcal {V}}^{(2)}_{\mathrm{dom}\,\psi }(f) }{3c}\bigl ( K + 5 \bigr ) \Bigr ) \\&\le K^2 \Bigl ( 1 + \frac{2 {\mathcal {V}}^{(2)}_{\mathrm{dom}\,\psi }(f)}{c} \Bigr ) \;\; \le \;\; 2 \cdot \Bigl ( 1 + \frac{2 {\mathcal {V}}^{(2)}_{\mathrm{dom}\,\psi }(f)}{c} \Bigr ) \cdot \Bigl ( 1 + \frac{ 27(c + 2 \varDelta ^{(2)}_{\mathrm{dom}\,\psi }(f)) }{\varepsilon } \Bigr ). \end{aligned}$$

\(\square \)

According to the result of Theorem 7, in order to solve problem (1) up to accuracy \(\varepsilon > 0\), we need to perform \({\mathcal {O}}(\frac{1}{\varepsilon })\) computations of step 4-c of the method in total (estimate (29)). This is the same number of linear minimization oracle calls as required by the classical Frank–Wolfe algorithm [18]. However, this estimate can be over-pessimistic for our method. Indeed, it comes as the product of the worst-case complexity bounds for the outer and the inner optimization schemes. It seems rare to encounter the worst-case instance at both levels simultaneously. Thus, the practical performance of our method can be much better.

At the same time, the total number of gradient and Hessian computations is only \({\mathcal {O}}(\frac{1}{\varepsilon ^{1/2}})\) (estimate (28)). This can lead to a significant acceleration over the first-order Frank–Wolfe algorithm when the gradient computation is a bottleneck (see our experimental comparison in the next section).

The only parameter which remains to be chosen in method (26) is the tolerance constant \(c > 0\). Note that the right-hand side of (29) is convex in c. Hence, its approximate minimization provides us with the following choice:

$$\begin{aligned} c = 2 \sqrt{ {\mathcal {V}}_{\mathrm{dom}\,\psi }^{(2)}(f) \, \varDelta ^{(2)}_{\mathrm{dom}\,\psi }(f) }. \end{aligned}$$
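To see where this choice comes from, note that for small \(\varepsilon \) the dominating part of the right-hand side of (29) is

$$\begin{aligned} \Bigl ( 1 + \frac{2 {\mathcal {V}}}{c} \Bigr ) \cdot \frac{27 (c + 2 \varDelta )}{\varepsilon } \; = \; \frac{27}{\varepsilon } \Bigl ( c + 2 \varDelta + 2 {\mathcal {V}} + \frac{4 {\mathcal {V}} \varDelta }{c} \Bigr ), \end{aligned}$$

where we abbreviate \({\mathcal {V}} = {\mathcal {V}}^{(2)}_{\mathrm{dom}\,\psi }(f)\) and \(\varDelta = \varDelta ^{(2)}_{\mathrm{dom}\,\psi }(f)\). Minimizing \(c + 4 {\mathcal {V}} \varDelta / c\) in c gives exactly \(c = 2 \sqrt{{\mathcal {V}} \varDelta }\).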

In practical applications, we may not know some of these constants. However, in many cases they are small. Therefore, an appropriate choice of c is a small constant.

Finally, let us discuss an efficient implementation of our method when the composite part is the \(\{0, +\infty \}\)-indicator of the standard simplex:

$$\begin{aligned} \mathrm{dom}\,\psi \;\; = \;\; \mathbb {S}_n {\mathop {=}\limits ^{\mathrm {def}}}\Bigl \{ x \in \mathbb {R}^n_{+} \; : \; \sum \limits _{i = 1}^n x^{(i)} = 1 \Bigr \}. \end{aligned}$$
(34)

This is an example of a set with a finite number of atoms, which are the standard coordinate vectors in this case:

$$\begin{aligned} \mathbb {S}_n= & {} \mathrm{Conv}\,\{e_1, \dots , e_n \}. \end{aligned}$$

See [13] for more examples of atomic sets in the context of the Frank–Wolfe algorithm. The maximization of a convex function (in particular, the minimization of a linear function) over such sets can be implemented very efficiently, since the optimum is always attained at one of the atoms (a corner of the set).

At iteration \(k \ge 0\) of method (26), we need to minimize over \(\mathbb {S}_n\) the quadratic function

$$\begin{aligned} g_k(v)= & {} \langle \nabla f(x_k), v - x_k \rangle + \frac{\gamma _k}{2} \langle \nabla ^2 f(x_k)(v - x_k), v - x_k \rangle , \end{aligned}$$

whose gradient is

$$\begin{aligned} \nabla g_k(v) = \nabla f(x_k) + \gamma _k \nabla ^2 f(x_k)(v - x_k). \end{aligned}$$

Assume that we keep the vector \(\nabla g_k(z_t) \in \mathbb {R}^n\) for the current point \(z_t\), \(t \ge 0\) of the inner process, as well as its aggregation

$$\begin{aligned} h_t&{\mathop {=}\limits ^{\mathrm {def}}}&\alpha _t \nabla g_k(z_t) + (1 - \alpha _t) h_{t - 1}, \quad h_{-1} \;\; {\mathop {=}\limits ^{\mathrm {def}}}\;\; 0 \in \mathbb {R}^n. \end{aligned}$$

Then, at step 4-c we need to compute the vector

$$\begin{aligned} w_{t + 1} \; \in \; \mathop {\mathrm {Argmin}}\limits _{w \in \mathbb {S}_n} \; \langle h_t, w \rangle . \end{aligned}$$
It is enough to find an index j of a minimal element of \(h_t\) and to set \(w_{t + 1} := e_j\). The new gradient is equal to

$$\begin{aligned} \nabla g_k(z_{t + 1})&\overset{\text {Step 4-d}}{=}&\nabla g_k(\alpha _t w_{t + 1} + (1 - \alpha _t) z_t ) \\= & {} \alpha _t\Bigl ( \nabla f(x_k) + \gamma _k \nabla ^2 f(x_k) (e_j - x_k) \Bigr ) + (1 - \alpha _t) \nabla g_k(z_t), \end{aligned}$$

and the function value can be expressed using the gradient as follows

$$\begin{aligned} g_k(z_{t + 1})= & {} \frac{1}{2}\langle \nabla f(x_k) + \nabla g_k(z_{t + 1}), z_{t + 1} - x_k \rangle . \end{aligned}$$

The product \(\nabla ^2 f(x_k) e_j\) is just the j-th column of the matrix. Hence, preparing in advance the following objects: \(\nabla f(x_k) \in \mathbb {R}^n\), \(\nabla ^2 f(x_k) \in \mathbb {R}^{n \times n}\) and the Hessian-vector product \(\nabla ^2 f(x_k) x_k \in \mathbb {R}^n\), we are able to perform each iteration of the inner loop (steps 4-a – 4-e) very efficiently, in \({\mathcal {O}}(n)\) arithmetical operations.
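A minimal Python sketch of this \({\mathcal {O}}(n)\) inner iteration (with illustrative variable names, and assuming the inner process is started from \(z_0 = x_k\), \(\nabla g_k(z_0) = \nabla f(x_k)\) and \(h_{-1} = 0\)) could look as follows; the precomputed objects are exactly those listed above.

```python
import numpy as np

def inner_iteration(t, z, grad_gz, h_prev, xk, grad_fx, H, Hx, gamma):
    """One pass of steps 4-a -- 4-e over the standard simplex in O(n) operations,
    touching only one column of the Hessian; the precomputed objects are
    grad_fx = grad f(x_k), H = hess f(x_k) and Hx = H @ x_k."""
    alpha = 2.0 / (t + 2)                                 # 4-a
    h = alpha * grad_gz + (1.0 - alpha) * h_prev          # aggregated gradient h_t
    j = int(np.argmin(h))                                 # 4-c: w_{t+1} = e_j
    z_new = (1.0 - alpha) * z                             # 4-d: z_{t+1} = alpha e_j + (1 - alpha) z_t
    z_new[j] += alpha
    grad_new = alpha * (grad_fx + gamma * (H[:, j] - Hx)) + (1.0 - alpha) * grad_gz
    g_new = 0.5 * (grad_fx + grad_new).dot(z_new - xk)    # g_k(z_{t+1}) recovered from the gradient
    return z_new, grad_new, h, g_new
```

The returned value of \(g_k(z_{t+1})\) can then be combined with the aggregated linear model to evaluate the stopping test of step 4-e.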

6 Numerical experiments

Let us consider the problem of minimizing the log-sum-exp function (SoftMax)

$$\begin{aligned} f_{\mu }(x)= & {} \mu \ln \biggl ( \sum \limits _{i = 1}^m \exp \Bigl ( \frac{\langle a_i, x \rangle - b_i}{\mu } \Bigr ) \biggr ), \quad x \in \mathbb {R}^n, \end{aligned}$$

over the standard simplex \(\mathbb {S}_n\) (34). The coefficients \(\{ a_i \}_{i = 1}^m\) and b are generated randomly from the uniform distribution on \([-1, 1]\). We compare the performance of the Inexact Contracting Newton Method (26) with that of the classical Frank–Wolfe algorithm, for different values of the parameters. The results are shown in Figs. 1, 2 and 3.
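For reference, here is a short sketch of the oracle for this test function (value, gradient and Hessian, computed in a numerically stable way through the soft-max weights), as needed, e.g., by step 2 of method (26); the interface is illustrative.

```python
import numpy as np

def softmax_oracle(A, b, mu, x):
    """f_mu(x) = mu * log( sum_i exp( (<a_i, x> - b_i) / mu ) ), rows a_i of A.
    Returns the value, the gradient and the Hessian of the smooth part."""
    s = (A.dot(x) - b) / mu
    s_max = s.max()                          # shift for numerical stability
    w = np.exp(s - s_max)
    pi = w / w.sum()                         # soft-max weights
    f = mu * (np.log(w.sum()) + s_max)
    grad = A.T.dot(pi)                       # grad f = A^T pi
    hess = ((A.T * pi).dot(A) - np.outer(grad, grad)) / mu
    return f, grad, hess
```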

We see that the new method works significantly better in terms of the outer iterations (oracle calls). This confirms our theory. At the same time, for many values of the parameters, it shows better performance in terms of the total computational time as well.

Fig. 1 \(n = 100, \; m = 1000\)

Fig. 2 \(n = 100, \; m = 2500\)

Fig. 3 \(n = 500, \; m = 2500\)

7 Discussion

In this paper, we present a new general framework of Contracting-Point methods, which can be used for developing affine-invariant optimization algorithms of different orders. For the methods of order \(p \ge 1\), we prove the following global convergence rate:

$$\begin{aligned} F(x_k) - F^{*}\le & {} {\mathcal {O}}\bigl ( 1 / k^{p} \bigr ), \quad k \ge 1. \end{aligned}$$

This is the same rate as that of the basic high-order Proximal-Point scheme [21]. However, the methods in our paper are free from using norms or any other characteristic parameters of the problem. This nice property makes Contracting-Point methods favourable for solving optimization problems over sets with a non-Euclidean geometry (e.g. over the simplex or over a general convex polytope).

At the same time, it is known that in the Euclidean case the prox-type methods can be accelerated, achieving an \({\mathcal {O}}(1 / k^{p + 1})\) global rate of convergence [2, 6, 20, 21]. Using an additional one-dimensional search at each iteration, this rate can be improved up to \({\mathcal {O}}(1 / k^{\frac{3p + 1}{2}})\) (see [11, 21]). The latter rate is shown to be optimal [1, 19]. To the best of our knowledge, the lower bounds for high-order methods in the general non-Euclidean case remain unknown. However, the worst-case oracle complexity of the classical Frank–Wolfe algorithm (the case \(p = 1\) in our framework) is proven to be near-optimal for smooth minimization over \(\Vert \cdot \Vert _{\infty }\)-balls [12].

Another open question is the possibility of an efficient implementation of our methods for the case \(p \ge 3\). In view of the absence of an explicit regularizer (contrary to the prox-type methods), the subproblem in (21) can be nonconvex. Hence, it seems hard to find its global minimizer. We hope that for some problem classes it is still feasible to satisfy the inexact necessary condition for local optimality (24) with a reasonable amount of computations. We keep this question for further investigation.