1 Introduction

Let \({\mathcal {X}},{\mathcal {Y}}\) be real, finite-dimensional Hilbert spaces. In this work, we consider saddle-point problems of the form

$$\begin{aligned} \min _{x\in {\mathcal {X}}}\max _{y\in {\mathcal {Y}}} \varPhi (x,y):= \psi (x) + f(x,y) - \zeta (y), \end{aligned}$$
(1)

where

  • \(f:{\mathcal {X}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}\) is convex–concave and continuously differentiable.

  • \(\psi :{\mathcal {X}}\rightarrow (-\infty ,+\infty ],\zeta :{\mathcal {Y}}\rightarrow (-\infty ,+\infty ]\) are proper, lower semicontinuous (l.s.c.), and convex.

Since many non-smooth optimisation problems can be cast in the form of (1), it is a useful and heavily studied tool in its own right [1, 2, 20, 25, 37, 44]. However, rather than attempting to solve (1) in its current form, it is far more convenient, even within the aforementioned references, to follow in the footsteps of Korpelevič [34] and Popov [47] by casting it as the Variational Inequality (VI):

$$\begin{aligned} \text {Find }z^*\in {{\,\mathrm{{\mathcal {H}}}\,}}\text { such that } \langle F(z^*),z-z^*\rangle + g(z) - g(z^*) \ge 0 \quad \forall z\in {{\,\mathrm{{\mathcal {H}}}\,}}, \end{aligned}$$
(2)

where \(z=(x,y)\in {\mathcal {H}}:={\mathcal {X}}\oplus {\mathcal {Y}}\), and

$$\begin{aligned} F(z) = (\nabla _x f(x,y), -\nabla _y f(x,y)), \quad g(z) = \psi (x) + \zeta (y). \end{aligned}$$
(3)

The point \(z^*:=(x^*,y^*)\) in (2) then characterises the solutions \((x^*,y^*)\) of (1).

Many methods (see, for instance, [21, 43, 46]) for solving (2) require global Lipschitz continuity of the operator F. However, this assumption is often too strong to hold in practice. Even when F is globally Lipschitz continuous, knowledge of its Lipschitz constant is usually required as input to the chosen algorithm, and determining this constant is typically more difficult than solving the original problem. Moreover, even if F is globally Lipschitz and its global Lipschitz constant is known, a constant step-size rule can be too conservative, since the step size is inversely related to the Lipschitz constant. Such a rule is particularly wasteful if the generated sequence lies entirely within a region where a local Lipschitz constant is small relative to the global constant.

Therefore, it is beneficial to instead define a step-size sequence which approximates a local Lipschitz constant along the iterates. The standard approach is to generate such a sequence via a backtracking procedure (see [11, 15, 29, 38, 39, 42] and references therein). While avoiding each of the shortcomings listed above, such methods can become expensive in terms of overall run-time, since each iteration may require an arbitrarily large number of backtracking steps. An emerging alternative is that of adaptive step sizes [3, 40, 41], which accomplish the same goals without the need for backtracking, i.e. the step-size update is fully explicit. In particular, the adaptive Golden RAtio ALgorithm (aGRAAL) [40] (stated as Algorithm 1), named for its relationship with the Golden Ratio \(\varphi =\frac{1+\sqrt{5}}{2}\), solves (2) and is the method we focus on here.

Algorithm 1 The Adaptive Golden RAtio ALgorithm (aGRAAL) [40].
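In brief (see [40] for the full listing), aGRAAL uses a fully explicit step-size rule: given initial points \(z_0,{\overline{z}}_0\), an initial step size \(\lambda _0>0\) and parameters \(\phi >1\), \(\rho \ge 1\) and \(\lambda _{\max }>0\), each iteration takes the form

$$\begin{aligned} \lambda _k&= \min \left\{ \rho \lambda _{k-1},\, \frac{\phi \theta _{k-1}}{4\lambda _{k-1}}\frac{\Vert z_k - z_{k-1}\Vert ^2}{\Vert F(z_k) - F(z_{k-1})\Vert ^2},\, \lambda _{\max }\right\} , \qquad \theta _k = \phi \frac{\lambda _k}{\lambda _{k-1}},\\ {\overline{z}}_k&= \frac{(\phi -1)z_k + {\overline{z}}_{k-1}}{\phi }, \qquad z_{k+1} = {{\,\mathrm{\textbf{prox}}\,}}_{\lambda _k g}\left( {\overline{z}}_k - \lambda _k F(z_k)\right) . \end{aligned}$$

Up to notation, this is the Euclidean case \(h=\frac{1}{2}\Vert \cdot \Vert ^2\) (so \(\sigma =1\)) of the Bregman-adaptive method presented as Algorithm 3 in Sect. 4.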

Another way to potentially improve such methods is to replace the Euclidean distance in the proximal operator with a family of non-Euclidean distance-like functions known as Bregman distances. Methods of this kind for solving (2) can be found in the existing literature [15, 22, 23, 26, 27, 30–32, 45, 48]. Interestingly, most of these methods require a Lipschitz assumption but do not require knowledge of the Lipschitz constant. However, they employ a backtracking procedure and/or a non-increasing step-size sequence, such as that found in [28], whereas the step size of our new method is fully explicit and is allowed to increase slightly at each iteration.

In this paper, we investigate Bregman modifications to Algorithm 1. To this end, we begin by proposing the Bregman-Golden RAtio ALgorithm (B-GRAAL), a Bregman version of the fixed step-size Golden RAtio ALgorithm (GRAAL), and prove convergence of our new method in full. We then present an adaptive version of B-GRAAL, or, equivalently, a Bregman modification of Algorithm 1, which we refer to as the Bregman-adaptive Golden RAtio ALgorithm (B-aGRAAL). Although we only provide a convergence analysis of B-aGRAAL in a restrictive setting, we observe it to work numerically outside of this setting.

One advantage of our new method is the flexibility provided by the Bregman proximal operator. In the context of convex–concave games, for instance, this modified operator arises as the projection onto the probability simplex, which has a simple closed-form expression with respect to the Kullback–Leibler (KL) divergence but not with respect to the standard Euclidean distance. In fact, the Euclidean projection requires an \(O(n\log n)\) time algorithm in n dimensions [18, 19, 50], whereas the KL projection only requires O(n) time (see, for instance, [10, Section 5] and [35, Section 4.4]). Another advantage of these modifications in the constrained optimisation case is that it is sometimes possible to choose a Bregman distance whose domain is the constraint set so as to make the feasibility of the iterates implicit.

The remainder of this paper is structured as follows. In Sect. 2, we collect preliminary results for use in our analysis. In Sect. 3, we present our fixed step-size method and a proof of convergence. In Sect. 4, we present our adaptive method with some partial analysis. Section 5 contains experimental results. Firstly, we compare the fixed step-size method with the Euclidean distance and the KL divergence on a matrix game between two players. Secondly, we make the same comparison for the adaptive method on a power allocation problem in a Gaussian communication channel. Finally, we apply the adaptive method to an N-person Cournot oligopoly model with appropriately chosen Bregman distances over a closed box. We then conclude this paper by presenting some directions for further research.

2 Preliminaries

Throughout this work, \({{\,\mathrm{{\mathcal {H}}}\,}}\) denotes a real, finite-dimensional Hilbert space with inner-product \(\langle \cdot ,\cdot \rangle \) and induced norm \(\Vert \cdot \Vert \). Given an extended real-valued function \(f:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\), its domain is denoted \({{\,\textrm{dom}\,}}f:= \{x\in {{\,\mathrm{{\mathcal {H}}}\,}}:f(x)<+\infty \}\). Its subdifferential at \(x\in {{\,\textrm{dom}\,}}f\) is given by

$$\begin{aligned} \partial f(x):= \{\nu \in {{\,\mathrm{{\mathcal {H}}}\,}}:f(x) - f(y) - \langle \nu ,x-y\rangle \le 0\quad \forall y\in {{\,\mathrm{{\mathcal {H}}}\,}}\}, \end{aligned}$$

and defined as \(\partial f(x):=\emptyset \) for \(x\not \in {{\,\textrm{dom}\,}}f\). The indicator function of a set \(K\subseteq {{\,\mathrm{{\mathcal {H}}}\,}}\) is written \(\iota _K\) and takes the value 0 for \(x\in K\) and \(+\infty \) otherwise.

A proper, l.s.c., convex function \(h:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) is called Legendre [12, Definition 7.1.1] if it is strictly convex on every convex subset of \({{\,\textrm{dom}\,}}\partial h:=\{x\in {{\,\mathrm{{\mathcal {H}}}\,}}:\partial h(x)\ne \emptyset \}\) and differentiable on \({{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\ne \emptyset \) such that \(\Vert \nabla h(x)\Vert \rightarrow \infty \) whenever x approaches the boundary of \({{\,\textrm{dom}\,}}h\). The convex conjugate of h, written \(h^*:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\), is given by

$$\begin{aligned} h^*(x^*):= \sup _{x\in {{\,\mathrm{{\mathcal {H}}}\,}}}\{\langle x^*,x\rangle - h(x)\}. \end{aligned}$$

When h is merely differentiable on \({{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\), the Bregman distance generated by h is the function \(D_h:{{\,\mathrm{{\mathcal {H}}}\,}}\times {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\rightarrow (-\infty ,+\infty ]\) given by

$$\begin{aligned} D_h(x,y):= h(x) - h(y) - \langle \nabla h(y),x-y\rangle . \end{aligned}$$

When h is also convex, \(D_h\) is nonnegative, and when h is \(\sigma \)-strongly convex, \(D_h\) satisfies

$$\begin{aligned} D_h(x,y)\ge \frac{\sigma }{2}\Vert x-y\Vert ^2\quad \forall (x,y)\in {{\,\mathrm{{\mathcal {H}}}\,}}\times {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h. \end{aligned}$$
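For example, the negative entropy \(h(x)=\sum _{i}x_i\log x_i\) is 1-strongly convex on the unit simplex, where the induced Bregman distance is the Kullback–Leibler divergence; indeed, by Pinsker's inequality, \(D_h(x,y)\ge \frac{1}{2}\Vert x-y\Vert _1^2\ge \frac{1}{2}\Vert x-y\Vert ^2\) for x, y in the simplex.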

We begin by collecting some general properties of the Bregman distance.

Proposition 2.1

(Properties of the Bregman distance) Let \(h:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) be Legendre. Then, the following assertions hold.

  1. (a)

    (three-point identity) For all \(x,y\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) and \(z\in {{\,\textrm{dom}\,}}h\), we have

    $$\begin{aligned} D_h(z,x) - D_h(z,y) - D_h(y,x) = \langle \nabla h(x) - \nabla h(y), y-z\rangle . \end{aligned}$$
  2. (b)

    For all \(x,y\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\), we have

    $$\begin{aligned} D_h(x,y) = D_{h^*}(\nabla h(y), \nabla h(x)) \end{aligned}$$
  3. (c)

    Let \(x\in {{\,\textrm{dom}\,}}h\), \(y,u,v\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\), and \(\alpha \in {\mathbb {R}}\). Suppose additionally that \(\nabla h(y) = \alpha \nabla h(u) + (1-\alpha )\nabla h(v)\). Then,

    $$\begin{aligned} D_h(x,y) = \alpha \Big [D_h(x,u) - D_h(y,u)\Big ] + (1-\alpha )\Big [D_h(x,v) - D_h(y,v)\Big ]. \end{aligned}$$

Proof

(a) See, for instance, [49, Lemma 2.2] and the paragraph immediately after. (b) See, for instance, [4, Theorem 3.7(v)]. (c) By using the definition of \(D_h\), together with the assumption \(\nabla h(y) = \alpha \nabla h(u) + (1-\alpha )\nabla h(v)\), we obtain

$$\begin{aligned} D_h(x,y)&= h(x) - h(y) - \langle \nabla h(y),x-y\rangle \\&= \alpha \Big [h(x) - h(y) - \langle \nabla h(u),x-y\rangle \Big ] + (1-\alpha )\Big [h(x) - h(y) - \langle \nabla h(v),x-y\rangle \Big ]\\&= \alpha \Big [h(x) - h(u) - \langle \nabla h(u), x-u\rangle - h(y) + h(u) - \langle \nabla h(u),u-y\rangle \Big ]\\&\quad + (1-\alpha )\Big [h(x) - h(v) - \langle \nabla h(v), x-v\rangle - h(y) + h(v) - \langle \nabla h(v),v-y\rangle \Big ] \\&= \alpha \Big [D_h(x,u) - D_h(y,u)\Big ] + (1-\alpha )\Big [D_h(x,v) - D_h(y,v)\Big ]. \end{aligned}$$

This completes the proof. \(\square \)

Remark 2.1

When \(h=\frac{1}{2}\Vert \cdot \Vert ^2\), the expression shown in Proposition 2.1(c) simplifies to the Euclidean identity

$$\begin{aligned} \forall x,y\in {{\,\mathrm{{\mathcal {H}}}\,}}, \alpha \in {\mathbb {R}} \qquad \Vert \alpha x + (1-\alpha )y\Vert ^2 = \alpha \Vert x\Vert ^2 + (1-\alpha )\Vert y\Vert ^2 - \alpha (1-\alpha )\Vert x-y\Vert ^2. \end{aligned}$$

We now turn our attention to operators defined in terms of the Bregman divergence. The (left) Bregman proximal operator of a function \(f:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) is the (potentially set-valued) operator given by

$$\begin{aligned} {{\,\mathrm{\textbf{prox}}\,}}_f^h(y):= \mathop {{\textrm{argmin}}}\limits _{x\in {{\,\mathrm{{\mathcal {H}}}\,}}}\left\{ f(x) + D_h(x,y)\right\} \quad \forall y\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h. \end{aligned}$$
(7)

Since we will only require the left Bregman proximal operator in this work, we omit the qualifier “left” from here onwards. For further details on the analogous “right Bregman proximal operator”, the reader is referred to [13]. The Bregman projection onto a set \(C\subseteq {{\,\mathrm{{\mathcal {H}}}\,}}\) is the Bregman proximal operator of \(\iota _C\), that is,

$$\begin{aligned} P^h_C(y):= {{\,\mathrm{\textbf{prox}}\,}}_{\iota _C}^h(y) = \mathop {{\textrm{argmin}}}\limits _{x\in C} D_h(x,y)\quad \forall y\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h. \end{aligned}$$

Next, we collect properties of the Bregman proximal operator for use in our subsequent algorithm analysis.

Proposition 2.2

(Bregman proximal operator) Let \(f:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) be proper, l.s.c., convex and let \(h:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) be Legendre such that \({{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}f{\ne }\emptyset \). Then, the following assertions hold.

  1. (a)

    \({{\,\textrm{range}\,}}({{\,\mathrm{\textbf{prox}}\,}}^h_f)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}f\).

  2. (b)

    \({{\,\mathrm{\textbf{prox}}\,}}^h_f\) is single-valued on \({{\,\textrm{dom}\,}}({{\,\mathrm{\textbf{prox}}\,}}^h_f)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Moreover, if \(h+f\) is supercoercive, that is, \(\lim _{\Vert x\Vert \rightarrow +\infty }\frac{h(x) + f(x)}{\Vert x\Vert } = +\infty \), then \({{\,\textrm{dom}\,}}({{\,\mathrm{\textbf{prox}}\,}}^h_f)={{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\).

  3. (c)

    Let \(y\in {{\,\textrm{dom}\,}}({{\,\mathrm{\textbf{prox}}\,}}^h_f)\). Then, \(x={{\,\mathrm{\textbf{prox}}\,}}^h_f(y)\) if and only if, for all \(u\in {{\,\mathrm{{\mathcal {H}}}\,}}\), we have

    $$\begin{aligned} f(x) - f(u) \le \langle \nabla h(y) - \nabla h(x),x-u\rangle . \end{aligned}$$
    (8)
  4. (d)

    Let \(y,y'\in {{\,\textrm{dom}\,}}({{\,\mathrm{\textbf{prox}}\,}}^h_f)\), \(x={{\,\mathrm{\textbf{prox}}\,}}_f^h(y)\) and \(x'={{\,\mathrm{\textbf{prox}}\,}}_f^h(y')\). Then

    $$\begin{aligned} 0\le \langle \nabla h(x) - \nabla h(x^\prime ), x-x^\prime \rangle \le \langle \nabla h(y) - \nabla h(y^\prime ), x-x^\prime \rangle . \end{aligned}$$
    (9)

Proof

(a) See [7, Proposition 3.23(v)(b)], noting that the sum rule holds for f and h since \({{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}f{\ne }\emptyset \) and thus \({{\,\textrm{dom}\,}}\partial \left( f{+}h\right) {=} {{\,\textrm{dom}\,}}\left( \partial f {+} \nabla h\right) {\subseteq } {{\,\textrm{dom}\,}}f{\cap }{{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). (b) The first part follows by combining (a) and [7, Proposition 3.22 (ii)(d)]. For the second part, see [7, Proposition 3.21(vii)]. (c) Since the sum rule holds for f and h, the first-order optimality condition implies \(x={{\,\mathrm{\textbf{prox}}\,}}_f^h(y)\) if and only if \(\nabla h(y)-\nabla h (x)\in \partial f(x)\). The latter is equivalent to (8). (d) The first inequality in (9) follows from convexity of h. To show the second, we apply (8) with \(u=x^\prime \) to see

$$\begin{aligned} f(x)-f(x^\prime )\le \langle \nabla h(y)-\nabla h(x),x-x^\prime \rangle , \end{aligned}$$

and similarly,

$$\begin{aligned} f(x^\prime ) - f(x)\le \langle \nabla h(y^\prime ) - \nabla h(x^\prime ),x^\prime -x\rangle . \end{aligned}$$

Then, adding these inequalities gives the desired result. \(\square \)

Remark 2.2

Parts (a), (b), (d) of Proposition 2.2 also apply to the Bregman resolvent [13], which is defined for a set-valued operator \(A:{{\,\mathrm{{\mathcal {H}}}\,}}\rightrightarrows {{\,\mathrm{{\mathcal {H}}}\,}}\) as \(R^h_{A} = \left( \nabla h + A\right) ^{-1}\circ \nabla h\). To see that the resolvent generalises the proximal operator, we refer to [7, Proposition 3.22 (ii)(a)].

3 The Bregman-Golden Ratio Algorithm

In this section, we consider the Variational Inequality (VI) problem

$$\begin{aligned} \langle F(z^*),z-z^*\rangle + g(z) - g(z^*) \ge 0 \quad \forall z\in {{\,\mathrm{{\mathcal {H}}}\,}}, \end{aligned}$$
(10)

where we assume that

A.1:

\(g:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) is proper, l.s.c., convex.

A.2:

\(h:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow (-\infty ,+\infty ]\) is continuously differentiable, Legendre, and \(\sigma \)-strongly convex. In addition, we will also require that \(D_h(x,x_n)\rightarrow 0\) for every sequence \((x_n)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) that converges to some \(x\in {{\,\textrm{dom}\,}}h\).

A.3:

\(F:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow {{\,\mathrm{{\mathcal {H}}}\,}}\) is monotone over \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\ne \emptyset \).

A.4:

\(\varOmega :=S\cap {{\,\textrm{dom}\,}}h \ne \emptyset \) where S denotes the solution set of (10).

Remark 3.1

Assumption A.2 is common in the literature concerning Bregman first-order methods [8, 9, 17, 49]. In particular, the limit condition holds when \(\nabla h\) is continuous and \({{\,\textrm{dom}\,}}h\) is open. Indeed, in this case, \(\sigma \)-strong convexity of h implies that \(h^*\) is \(\frac{1}{\sigma }\)-smooth [33, Theorem 6], and so applying Proposition 2.1(b) gives

$$\begin{aligned} D_h(x,x_n)=D_{h^*}\left( \nabla h(x_n), \nabla h(x)\right) \le \frac{1}{2\sigma }\Vert \nabla h(x_n) - \nabla h(x)\Vert ^2 \rightarrow 0. \end{aligned}$$

The significance of \({{\,\textrm{dom}\,}}h\) being open here is that h is differentiable at \(x\in {{\,\textrm{dom}\,}}h\); however, we observe that the same condition can still hold if \({{\,\textrm{dom}\,}}h\) is closed. In particular, it also holds for the KL divergence (see, for instance, [17, Example 2.1]).

Our proposed algorithm for solving (10) when F is L-Lipschitz is called the Bregman-Golden RAtio ALgorithm (B-GRAAL) and is stated in Algorithm 2. Recall that \(\varphi :=\frac{1+\sqrt{5}}{2}\) denotes the Golden Ratio, which satisfies \(\varphi ^2=\varphi +1\).

Algorithm 2 The Bregman-Golden RAtio ALgorithm (B-GRAAL).
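In brief, given \(z_0,{\overline{z}}_0\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) and a fixed step size \(\lambda \le \frac{\sigma \varphi }{2L}\), each iteration of B-GRAAL performs the two updates

$$\begin{aligned} \nabla h({\overline{z}}_k)&= \frac{(\varphi -1)\nabla h(z_k) + \nabla h({\overline{z}}_{k-1})}{\varphi },\\ z_{k+1}&= \mathop {{\textrm{argmin}}}\limits _{z\in {{\,\mathrm{{\mathcal {H}}}\,}}}\left\{ \lambda \left( \langle F(z_k),z\rangle + g(z)\right) + D_h(z,{\overline{z}}_k)\right\} , \end{aligned}$$

which are the steps referred to as (11) and (12), respectively, in Lemma 3.1 and Remark 3.2 below.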

The following lemma establishes the well-definedness of the sequences generated by the Bregman-GRAAL.

Lemma 3.1

Suppose Assumptions A.1–A.2 hold. Then, the sequences \(({\overline{z}}_k)\) and \((z_k)\) generated by Algorithm 2 are well defined. Moreover, \(({\overline{z}}_k)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) and \((z_k)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}g\).

Proof

Suppose by way of induction that \(z_{k},{\overline{z}}_{k-1}\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) for some \(k\ge 1\). Since the gradient \(\nabla h:{{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\rightarrow {{\,\textrm{int}\,}}({{\,\textrm{dom}\,}}h^*)\) is a bijection [12, Theorem 7.3.7], it follows that \(\nabla h(z_k),\nabla h({\overline{z}}_{k-1})\in {{\,\textrm{int}\,}}({{\,\textrm{dom}\,}}h^*)\). As \({{\,\textrm{int}\,}}({{\,\textrm{dom}\,}}h^*)\) is a convex set, (11) implies that \(\nabla h({\overline{z}}_k) \in {{\,\textrm{int}\,}}({{\,\textrm{dom}\,}}h^*)\) which establishes that \({\overline{z}}_k\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Next, we observe that \(z_{k+1} = {{\,\mathrm{\textbf{prox}}\,}}_{\lambda f}^h({\overline{z}}_k)\), where \(f(z):=\langle F(z_k), z-z_k\rangle + g(z)\). Note that \({{\,\textrm{dom}\,}}f={{\,\textrm{dom}\,}}g\). Since \(\lambda f+h\) is \(\sigma \)-strongly convex, it is supercoercive by [5, Corollary 11.16] and so Proposition 2.2(a)-(b) shows that \({{\,\mathrm{\textbf{prox}}\,}}^h_{\lambda f}\) is single-valued with \({{\,\textrm{range}\,}}({{\,\mathrm{\textbf{prox}}\,}}^h_{\lambda f})\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}g\) and therefore \(z_{k+1}\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}g\). \(\square \)

Remark 3.2

The Bregman proximal step shown in (12) can be expressed in terms of the Bregman proximal operator: \(z_{k+1}\!=\!{{\,\mathrm{\textbf{prox}}\,}}_{\lambda f}^h({\overline{z}}_k)\), where \(f(z)\!=\!\langle F(z_k), z\!-\!z_k\rangle \!+g(z)\). Equivalently, \(z_{k+1} = {{\,\mathrm{\textbf{prox}}\,}}_{\lambda g}^h(\left( \nabla h\right) ^{-1}(\nabla h({\overline{z}}_k) - \lambda F(z_k)))\), due to the first-order optimality condition in Proposition 2.2(c).
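For instance, when h is the negative entropy \(h(z)=\sum _{i}z_i\log z_i\) with \({{\,\textrm{dom}\,}}h={\mathbb {R}}^n_+\), we have \(\nabla h(z)=(1+\log z_i)_i\) and \((\nabla h)^{-1}(w)=(e^{w_i-1})_i\), so the two steps of Algorithm 2 become componentwise

$$\begin{aligned} {\overline{z}}_{k,i} = z_{k,i}^{(\varphi -1)/\varphi }\,{\overline{z}}_{k-1,i}^{1/\varphi } \quad \text {and}\quad z_{k+1} = {{\,\mathrm{\textbf{prox}}\,}}_{\lambda g}^{h}(w_k), \quad \text {where } w_{k,i} = {\overline{z}}_{k,i}\,e^{-\lambda F(z_k)_i}. \end{aligned}$$

In particular, if g is the indicator function of the unit simplex, the proximal step reduces to the normalisation \(z_{k+1}=w_k/\Vert w_k\Vert _1\) (cf. Sect. 5.1).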

The following lemma is key in our convergence analysis.

Lemma 3.2

Suppose Assumptions A.1–A.4 hold and that F is L-Lipschitz continuous on \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Let \(z^*\in \varOmega \) be arbitrary. Then, the sequences \((z_k), ({\overline{z}}_k)\) generated by Algorithm 2 satisfy

$$\begin{aligned} 0\le & {} (\varphi +1)D_h(z^*, {\overline{z}}_{k+1}) + \frac{\varphi }{2}D_h(z_{k+1}, z_k) - \varphi D_h(z_{k+1}, {\overline{z}}_{k+1}) \nonumber \\\le & {} (\varphi +1)D_h(z^*, {\overline{z}}_k) + \frac{\varphi }{2}D_h(z_k, z_{k-1}) - \varphi D_h(z_k, {\overline{z}}_k)\nonumber \\{} & {} \quad - \left( 1-\frac{1}{\varphi }\right) D_h(z_{k+1}, {\overline{z}}_k). \end{aligned}$$
(13)

Proof

By first applying Proposition 2.2(c) with \(f(z):= \lambda (\langle F(z_k), z-z_k\rangle + g(z))\), \(u:=z\in {{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}g\) arbitrary, \(x:=z_{k+1}\) and \(y:={\overline{z}}_k\), followed by the three-point identity (Proposition 2.1(a)) we obtain

$$\begin{aligned} \lambda&\left( \langle F(z_k),z_{k+1}-z\rangle + g(z_{k+1}) - g(z)\right) \nonumber \\&\le \langle \nabla h({\overline{z}}_k) - \nabla h(z_{k+1}), z_{k+1} - z\rangle \nonumber \\&= D_h(z, {\overline{z}}_k) - D_h(z, z_{k+1}) - D_h(z_{k+1}, {\overline{z}}_k). \end{aligned}$$
(14)

Shifting the index in (14) (by setting \(k\equiv k-1\)), setting \(z:=z_{k+1}\), and using \(\varphi \nabla h({\overline{z}}_k) = (\varphi - 1)\nabla h(z_k) + \nabla h({\overline{z}}_{k-1})\) followed by the three-point identity (Proposition 2.1(a)) gives

$$\begin{aligned} \begin{aligned}&\lambda \left( \langle F(z_{k-1}), z_k - z_{k+1}\rangle + g(z_k) - g(z_{k+1})\right) \\&\quad \le \langle \nabla h({\overline{z}}_{k-1}) - \nabla h(z_k), z_k - z_{k+1}\rangle \\&\quad = \varphi \langle \nabla h({\overline{z}}_k) - \nabla h(z_k), z_k - z_{k+1}\rangle \\&\quad =\varphi \left[ D_h(z_{k+1}, {\overline{z}}_k) - D_h(z_{k+1}, z_k) - D_h(z_k, {\overline{z}}_k)\right] . \end{aligned} \end{aligned}$$
(15)

Let \(z^*\in \varOmega \). Setting \(z:=z^*\) in (14), summing with (15) and rearranging yields

$$\begin{aligned} \begin{aligned}&\lambda \left( \langle F(z_k), z_k - z^*\rangle + g(z_k) - g(z^*)\right) \\&\quad \le D_h(z^*,{\overline{z}}_k) - D_h(z^*,z_{k+1}) - D_h(z_{k+1},{\overline{z}}_k)\\&\qquad + \varphi \Big [D_h(z_{k+1},{\overline{z}}_k) - D_h(z_{k+1},z_k) - D_h(z_k,{\overline{z}}_k)\Big ]\\&\qquad +\lambda \langle F(z_k) - F(z_{k-1}), z_k - z_{k+1}\rangle . \end{aligned} \end{aligned}$$
(16)

We observe that the left side of (16) is nonnegative as a consequence of (10) and A.3:

$$\begin{aligned} 0\le \langle F(z^*),z_k-z^*\rangle + g(z_k) - g(z^*)\le \langle F(z_k),z_k-z^*\rangle + g(z_k) - g(z^*). \end{aligned}$$
(17)

To estimate the final term in (16), we use the Cauchy–Schwarz inequality, L-Lipschitz continuity of F, \(\sigma \)-strong convexity of h and the inequality \(\lambda \le \frac{\sigma \varphi }{2L}\) to obtain

$$\begin{aligned} \begin{aligned} \lambda \langle F(z_k) - F(z_{k-1}), z_k - z_{k+1}\rangle&\le \lambda L\Vert z_k - z_{k-1}\Vert \Vert z_k - z_{k+1}\Vert \\&\le \frac{\lambda L}{2}\left( \Vert z_k - z_{k-1}\Vert ^2+\Vert z_k - z_{k+1}\Vert ^2\right) \\&\le \frac{\varphi }{2}\left( D_h(z_k, z_{k-1}) + D_h(z_{k+1}, z_k)\right) . \end{aligned} \end{aligned}$$
(18)

Combining (16), (17) and (18) gives

$$\begin{aligned} D_h(z^*, z_{k+1})\le & {} D_h(z^*, {\overline{z}}_k) + (\varphi -1)D_h(z_{k+1}, {\overline{z}}_k) - \varphi D_h(z_k, {\overline{z}}_k)\nonumber \\{} & {} \quad - \frac{\varphi }{2}D_h(z_{k+1}, z_k) + \frac{\varphi }{2}D_h(z_k,z_{k-1}). \end{aligned}$$
(19)

Now, applying Proposition 2.1(c) with

$$\begin{aligned}\nabla h(z_{k+1}) = \frac{\varphi \nabla h({\overline{z}}_{k+1}) - \nabla h({\overline{z}}_k)}{\varphi -1} = (\varphi +1)\nabla h({\overline{z}}_{k+1}) - \varphi \nabla h({\overline{z}}_k) \end{aligned}$$

and rearranging yields

$$\begin{aligned} (\varphi +1)D_h(z^*, {\overline{z}}_{k+1})= & {} D_h(z^*, z_{k+1}) + (\varphi +1)D_h(z_{k+1}, {\overline{z}}_{k+1}) \nonumber \\{} & {} + \varphi \Big [D_h(z^*, {\overline{z}}_k) - D_h(z_{k+1}, {\overline{z}}_k)\Big ]. \end{aligned}$$
(20)

Combining (19) and (20), followed by collecting like-terms, gives

$$\begin{aligned}&(\varphi +1)D_h(z^*, {\overline{z}}_{k+1}) + \frac{\varphi }{2}D_h(z_{k+1}, z_k) - \varphi D_h(z_{k+1}, {\overline{z}}_{k+1}) \nonumber \\&\quad \le (\varphi +1)D_h(z^*, {\overline{z}}_k) + \frac{\varphi }{2}D_h(z_k, z_{k-1}) - \varphi D_h(z_k, {\overline{z}}_k) \nonumber \\&\qquad + D_h(z_{k+1}, {\overline{z}}_{k+1}) - D_h(z_{k+1}, {\overline{z}}_k). \end{aligned}$$
(21)

Since \(\nabla h({\overline{z}}_{k+1})=\frac{\varphi -1}{\varphi }\nabla h(z_{k+1})+\frac{1}{\varphi }\nabla h({\overline{z}}_k)\), Proposition 2.1(c) gives

$$\begin{aligned} D_h(z_{k+1}, {\overline{z}}_{k+1})= & {} \frac{\varphi -1}{\varphi }\big [D_h(z_{k+1}, z_{k+1}) - D_h({\overline{z}}_{k+1}, z_{k+1})\big ] \nonumber \\{} & {} + \frac{1}{\varphi }\big [D_h(z_{k+1}, {\overline{z}}_k) - D_h({\overline{z}}_{k+1}, {\overline{z}}_k)\big ] \nonumber \\\le & {} \frac{1}{\varphi }D_h(z_{k+1}, {\overline{z}}_k). \end{aligned}$$
(22)

Combining (21) and (22) establishes the second inequality in (13). To show the first inequality in (13), we apply the three-point identity (Proposition 2.1(a)) to see that

$$\begin{aligned} \begin{aligned}&D_h(z^*,z_{k+1}) + D_h(z_{k+1},{\overline{z}}_{k+1}) \\&\quad = D_h(z^*,{\overline{z}}_{k+1}) + \langle \nabla h({\overline{z}}_{k+1}) - \nabla h(z_{k+1}), z^*-z_{k+1}\rangle \\&\quad = D_h(z^*,{\overline{z}}_{k+1}) + \frac{1}{\varphi }\langle \nabla h({\overline{z}}_k) - \nabla h(z_{k+1}), z^*-z_{k+1}\rangle .\\&\quad \le D_h(z^*, {\overline{z}}_{k+1}) + \frac{\lambda }{\varphi }\left( \langle F(z_k), z^* - z_{k+1}\rangle - g(z_{k+1}) + g(z^*)\right) \\&\quad = D_h(z^*, {\overline{z}}_{k+1}) + \frac{\lambda }{\varphi }\langle F(z_k) - F(z_{k+1}), z^* - z_{k+1}\rangle \\&\qquad +\frac{\lambda }{\varphi }(\langle F(z_{k+1}), z^* - z_{k+1}\rangle - g(z_{k+1}) + g(z^*)) \\&\quad \le D_h(z^*, {\overline{z}}_{k+1}) + \frac{\lambda }{\varphi }\langle F(z_k) - F(z_{k+1}), z^* - z_{k+1}\rangle . \end{aligned} \end{aligned}$$
(23)

Using L-Lipschitz continuity of F and \(\sigma \)-strong convexity of h gives

$$\begin{aligned} \frac{\lambda }{\varphi }\langle F(z_k) - F(z_{k+1}), z^*- z_{k+1}\rangle \le \frac{1}{2}D_h(z_{k+1}, z_k) + \frac{1}{2}D_h(z^*, z_{k+1}). \end{aligned}$$
(24)

On substituting (24) back into (23) and rearranging, we obtain

$$\begin{aligned} \begin{aligned} D_h(z_{k+1}, {\overline{z}}_{k+1})&\le D_h(z^*, {\overline{z}}_{k+1}) + \frac{1}{2}D_h(z_{k+1}, z_k) -\frac{1}{2}D_h(z^*, z_{k+1})\\&\le D_h(z^*, {\overline{z}}_{k+1}) + \frac{1}{2}D_h(z_{k+1}, z_k), \end{aligned} \end{aligned}$$
(25)

and therefore

$$\begin{aligned}{} & {} 0 \le D_h(z^*, {\overline{z}}_{k+1}) + \frac{1}{2}D_h(z_{k+1}, z_k) - D_h(z_{k+1}, {\overline{z}}_{k+1}) \\{} & {} \quad \implies 0\le (\varphi +1)D_h(z^*, {\overline{z}}_{k+1}) + \frac{\varphi }{2}D_h(z_{k+1}, z_k) - \varphi D_h(z_{k+1}, {\overline{z}}_{k+1}), \end{aligned}$$

which establishes the first inequality of (13) and thus completes the proof. \(\square \)

The following is our main result regarding convergence of the Bregman GRAAL with fixed step size.

Theorem 3.1

Suppose Assumptions A.1–A.4 hold and that F is L-Lipschitz continuous on \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Then, the sequences \((z_k)\) and \(({\overline{z}}_k)\) generated by Algorithm 2 converge to a point in \( \varOmega \).

Proof

Let \(z^*\in \varOmega \) be arbitrary, and let \((\eta _k)\) denote the sequence given by

$$\begin{aligned} \eta _k:= (\varphi +1)D_h(z^*, {\overline{z}}_k) + \frac{\varphi }{2}D_h(z_k, z_{k-1}) - \varphi D_h(z_k, {\overline{z}}_k)\quad \forall k\in {\mathbb {N}}. \end{aligned}$$
(26)

Lemma 3.2 implies that \(\lim _{k\rightarrow \infty }\eta _k\) exists and \(D_h(z_{k+1}, {\overline{z}}_k) \rightarrow 0\) as \(k\rightarrow \infty \). Referring to (22), it follows that \(D_h(z_{k+1}, {\overline{z}}_{k+1}) \rightarrow 0\). Also, by applying Proposition 2.1(c) with the identity \(\nabla h(z_k) = (\varphi +1)\nabla h({\overline{z}}_k) - \varphi \nabla h({\overline{z}}_{k-1})\), we obtain

$$\begin{aligned} D_h(z_{k+1}, z_k)&= (\varphi +1)\big [D_h(z_{k+1}, {\overline{z}}_k) - D_h(z_k,{\overline{z}}_k)\big ]\\&\quad - \varphi \big [D_h(z_{k+1}, {\overline{z}}_{k-1}) - D_h(z_k, {\overline{z}}_{k-1})\big ] \\&\le (\varphi +1)D_h(z_{k+1},{\overline{z}}_k)+\varphi D_h(z_k,{\overline{z}}_{k-1}) \rightarrow 0. \end{aligned}$$

Altogether, we have that

$$\begin{aligned} \lim _{k\rightarrow \infty }\eta _k = (\varphi +1)\lim _{k\rightarrow \infty }D_h(z^*, {\overline{z}}_k), \end{aligned}$$

and, in particular, \(\lim _{k\rightarrow \infty }D_h(z^*, {\overline{z}}_k)\) exists.

Next, using \(\sigma \)-strong convexity of h, we deduce that \(z_{k+1}-{\overline{z}}_k\rightarrow 0\), and that \((z_k)\) and \(({\overline{z}}_k)\) are bounded. Thus, let \({\overline{z}}\in {{\,\mathrm{{\mathcal {H}}}\,}}\) be a cluster point of \(({\overline{z}}_k)\). Then, there exists a subsequence \(({\overline{z}}_{k_j})\) such that \({\overline{z}}_{k_j}\rightarrow {\overline{z}}\) and \(z_{k_j+1}\rightarrow {\overline{z}}\) as \(j\rightarrow \infty \). Now, recalling (14) gives

$$\begin{aligned}{} & {} \lambda \left( \langle F(z_{k_j}),z_{{k_j}+1}-z\rangle + g(z_{{k_j}+1}) - g(z)\right) \\{} & {} \quad \le \langle \nabla h({\overline{z}}_{k_j}) - \nabla h(z_{{k_j}+1}), z_{{k_j}+1} - z\rangle \quad \forall z\in {{\,\mathrm{{\mathcal {H}}}\,}}, \end{aligned}$$

and taking the limit inferior of both sides as \(j\rightarrow \infty \) shows that \({\overline{z}}\in \varOmega \). Since \(z^*\in \varOmega \) was chosen in Lemma 3.2 to be arbitrary, we can now set \(z^*={\overline{z}}\). It then follows that \(\lim _{j\rightarrow \infty }D_h(z^*,{\overline{z}}_{k_j}) = 0\), and consequently, \(\lim _{j\rightarrow \infty }\eta _{k_j} = 0\). Also note that for \(n\ge k_j\), we have \(\eta _n\le \eta _{k_j}\) from Lemma 3.2, and therefore,

$$\begin{aligned} (\varphi +1)\lim _{n\rightarrow \infty }D_h(z^*, {\overline{z}}_n) = \lim _{n\rightarrow \infty }\eta _n \le \lim _{j\rightarrow \infty }\eta _{k_j} = 0, \end{aligned}$$

and therefore \({\overline{z}}_k\rightarrow z^*\) from strong convexity. The fact that \(z_k\rightarrow z^*\) follows since \(z_k-{\overline{z}}_k\rightarrow 0\). \(\square \)

Remark 3.3

In the special case where \(h=\frac{1}{2}\Vert \cdot \Vert ^2\), Algorithm 2 recovers the Euclidean GRAAL with fixed step size from [40, Section 2] and the conclusions of Theorem 3.1 recover [40, Theorem 1]. Despite this, the proof provided here is new and not the same as the one in [40, Theorem 1] even when specialised to the Euclidean case. Indeed, [40, Theorem 1] proceeds by establishing the inequality

$$\begin{aligned}{} & {} (\varphi +1)\Vert {\overline{z}}_{k+1}-z^*\Vert ^2 + \frac{\varphi }{2}\Vert z_{k+1}-z_k\Vert ^2 \nonumber \\{} & {} \qquad \qquad \le (\varphi +1)\Vert {\overline{z}}_{k}-z^*\Vert ^2 + \frac{\varphi }{2}\Vert z_{k}-z_{k-1}\Vert ^2 - \varphi \Vert z_k-{\overline{z}}_k\Vert ^2, \end{aligned}$$
(27)

which is different to Lemma 3.2. Interestingly, (27) can be deduced from (21) in Lemma 3.2 by using the equality

$$\begin{aligned} \Vert z_{k+1}-{\overline{z}}_{k+1}\Vert ^2 = \frac{1}{\varphi ^2}\Vert z_{k+1} - {\overline{z}}_k\Vert ^2, \end{aligned}$$

which applies in the Euclidean case, in place of (22) followed by the identity \(\varphi ^2=\varphi +1\). Note also that the inequality (22) is already weaker than the inequality \(\Vert z_{k+1}-{\overline{z}}_{k+1}\Vert ^2 \le \frac{1}{\varphi ^2}\Vert z_{k+1} - {\overline{z}}_k\Vert ^2\).

3.1 Linear Convergence of B-GRAAL

In this subsection, we investigate linear convergence of Algorithm 2. Recall that a sequence \({(z_k)\subset {{\,\mathrm{{\mathcal {H}}}\,}}}\) is said to converge Q-linearly to \(z\in {{\,\mathrm{{\mathcal {H}}}\,}}\) if there exists some \(q\in (0,1)\) such that \(\Vert z_{k+1} - z\Vert < q\Vert z_k - z\Vert \) for all k sufficiently large. A sequence \((z_k)\) is said to converge R-linearly to z if \(\Vert z_k - z\Vert \le \gamma _k\) for all sufficiently large k, where \((\gamma _k)\subseteq {\mathbb {R}}\) is a sequence converging Q-linearly to 0.

In [40, Section 2.3], the sequences generated by Algorithm 1 were shown to converge R-linearly under the following error bound condition: there exist \(\alpha ,\beta >0\) such that

$$\begin{aligned} \Vert z - {{\,\mathrm{\textbf{prox}}\,}}_{\lambda g}(z - \lambda F(z) )\Vert \le \alpha \implies {{\,\textrm{dist}\,}}(z,S)\le \beta \Vert z - {{\,\mathrm{\textbf{prox}}\,}}_{\lambda g}(z - \lambda F(z))\Vert \end{aligned}$$

for all z. However, the proof in [40, Section 2.3] relies heavily on the (Euclidean) triangle inequality and thus does not generalise directly to the Bregman setting. Instead, we shall assume that, for some \(\mu >0\), F satisfies

$$\begin{aligned} \langle F(x) - F(y),x-y\rangle \ge \mu D_h(x,y)\qquad \forall (x,y)\in {{\,\textrm{dom}\,}}h\times {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h. \end{aligned}$$
(28)

When \(h=\frac{1}{2}\Vert \cdot \Vert ^2\), this property reduces to the standard definition of \(\mu \)-strong monotonicity. When the operator F is continuous, condition (28) is weaker than the notion of \(\mu \)-strongly monotone relative to h introduced in [24]. The latter has been used to show linear convergence of the Bregman proximal point algorithm [24, Theorem 3.3].
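For example, \(F=\nabla h\) satisfies (28) with \(\mu =1\) (for \(x,y\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\)), since the definition of \(D_h\) gives \(\langle \nabla h(x)-\nabla h(y),x-y\rangle = D_h(x,y)+D_h(y,x)\ge D_h(x,y)\).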

The following is our main result concerning linear convergence of Algorithm 2 under condition (28).

Theorem 3.2

Suppose Assumptions A.1–A.4 hold, and that F is L-Lipschitz continuous on \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Additionally, suppose that F satisfies condition (28) for some \(\mu >0\), and that \(\lambda <\frac{\sigma \varphi }{2L}\). Then, the sequences \((z_k)\) and \(({\overline{z}}_k)\) generated by Algorithm 2 converge R-linearly to the unique point in \(\varOmega \).

Proof

By decreasing the value of \(\mu >0\) if necessary, we assume without loss of generality that

$$\begin{aligned} \mu <\frac{1}{2}~~\text {and}~~\lambda \le \frac{\sigma \varphi \sqrt{1-\mu }}{2L}. \end{aligned}$$

Let \(z^*\in \varOmega \). Since F satisfies (28), we have

$$\begin{aligned} 0\le & {} \langle F(z^*),z_k-z^*\rangle + g(z_k) - g(z^*)\nonumber \\\le & {} \langle F(z_k),z_k-z^*\rangle + g(z_k) - g(z^*) - \mu D_h(z^*,z_k). \end{aligned}$$
(29)

Using the Cauchy–Schwarz inequality, L-Lipschitz continuity of F, \(\sigma \)-strong convexity of h and the inequality \(\lambda \le \frac{\sigma \varphi \sqrt{1-\mu }}{2L}\) yields

$$\begin{aligned} \begin{aligned}&\lambda \langle F(z_k) - F(z_{k-1}), z_k - z_{k+1}\rangle \\&\quad \le \frac{\sigma \varphi }{2}\left( \sqrt{1-\mu }\Vert z_k - z_{k-1}\Vert \right) \cdot \Vert z_k - z_{k+1}\Vert \\&\quad \le \frac{\sigma \varphi }{2}\left( \frac{(1-\mu )}{2}\Vert z_k - z_{k-1}\Vert ^2+\frac{1}{2}\Vert z_k - z_{k+1}\Vert ^2\right) \\&\quad \le \frac{(1-\mu )\varphi }{2} D_h(z_k, z_{k-1}) + \frac{\varphi }{2}D_h(z_{k+1}, z_k). \end{aligned} \end{aligned}$$
(30)

By following the same steps as the proof of Lemma 3.2, but with (29) in place of (17) and (30) in place of (18), the analogue of (19) becomes

$$\begin{aligned} D_h(z^*, z_{k+1})\le & {} D_h(z^*, {\overline{z}}_k) + (\varphi -1)D_h(z_{k+1}, {\overline{z}}_k) - \varphi D_h(z_k, {\overline{z}}_k) - \mu D_h(z^*,z_k)\nonumber \\{} & {} -\frac{\varphi }{2}D_h(z_{k+1}, z_k) + \frac{(1-\mu )\varphi }{2} D_h(z_k,z_{k-1}). \end{aligned}$$
(31)

Applying Proposition 2.1(c) with \(\nabla h(z_{k+1})=(\varphi +1)\nabla h({\overline{z}}_{k+1})-\varphi \nabla h({\overline{z}}_k)\) gives

$$\begin{aligned} D_h(z^*,z_{k+1})= & {} (\varphi +1)\left[ D_h(z^*,{\overline{z}}_{k+1}) - D_h(z_{k+1},{\overline{z}}_{k+1})\right] \nonumber \\{} & {} - \varphi \left[ D_h(z^*,{\overline{z}}_k) - D_h(z_{k+1},{\overline{z}}_k)\right] . \end{aligned}$$
(32)

Let \((\eta _k)\) denote the sequence given by (26). Then, using the identity (32) to eliminate \(D_h(z^*,z_{k+1})\) and \(D_h(z^*,z_k)\) from (31) gives

$$\begin{aligned} \eta _{k+1}^\prime&:=(\varphi +1)D_h(z^*,{\overline{z}}_{k+1}) + \frac{\varphi }{2} D_h(z_{k+1},z_k) - D_h(z_{k+1},{\overline{z}}_{k+1})\\&\le (1-\mu )(\varphi +1)D_h(z^*,{\overline{z}}_k)+\frac{(1-\mu )\varphi }{2} D_h(z_k,z_{k-1})+(\mu (1+\varphi )-\varphi )D_h(z_k,{\overline{z}}_k)\\&\quad +\mu \varphi D_h(z^*,{\overline{z}}_{k-1})-\mu \varphi D_h(z_k,{\overline{z}}_{k-1})+\varphi D_h(z_{k+1},{\overline{z}}_{k+1})-D_h(z_{k+1},{\overline{z}}_k)\\&\le (1-\mu )\eta _k + \mu \varphi D_h(z^*,{\overline{z}}_{k-1}), \end{aligned}$$

where we note that the final inequality uses \(D_h(z_k,{\overline{z}}_k)\le \frac{1}{\varphi }D_h(z_k,{\overline{z}}_{k-1})\). Since \((\eta _k)\) is non-increasing by Lemma 3.2 and \(\frac{(1-\mu )\varphi }{1-\mu (2-\varphi )}\ge 1\), it follows that

$$\begin{aligned} \eta _{k+1}^\prime&\le (1-\mu )\eta _{k-1} + \mu (\varphi -1)(\varphi +1) D_h(z^*,{\overline{z}}_{k-1}) \\&= \bigl (1-\mu (2-\varphi )\bigr ) (\varphi +1)D_h(z^*,{\overline{z}}_{k-1}) + (1-\mu )\frac{\varphi }{2}D_h(z_{k-1},z_{k-2})\\&\quad -(1-\mu )\varphi D_h(z_{k-1},{\overline{z}}_{k-1})\\&\le \bigl (1-\mu (2-\varphi )\bigr )\bigl [(\varphi +1)D_h(z^*,{\overline{z}}_{k-1}) +\frac{\varphi }{2}D_h(z_{k-1},z_{k-2}) - D_h(z_{k-1},{\overline{z}}_{k-1})\bigr ] \\&= q\eta _{k-1}^\prime \text {~~where~~}q:=\bigl (1-\mu (2-\varphi )\bigr )\in (0,1). \end{aligned}$$

Thus, we have established R-linear convergence of \((\eta _k^\prime )\) to zero. Using (25) and \(\sigma \)-strong convexity of h, we then deduce that

$$\begin{aligned} \eta _{k+1}^\prime= & {} \varphi D_h(z^*,{\overline{z}}_{k+1}) + \frac{\varphi -1}{2} D_h(z_{k+1},z_k) \\{} & {} + \left[ D_h(z^*,{\overline{z}}_{k+1}) + \frac{1}{2} D_h(z_{k+1},z_k) - D_h(z_{k+1},{\overline{z}}_{k+1}) \right] \\\ge & {} \frac{\sigma \varphi }{2}\Vert z^*-{\overline{z}}_{k+1}\Vert ^2 + \frac{\sigma (\varphi -1)}{4}\Vert z_{k+1}-z_k\Vert ^2. \end{aligned}$$

From this, it follows that \(({\overline{z}}_k)\) converges R-linearly to \(z^*\) and \(\Vert z_{k+1}-z_k\Vert \) converges R-linearly to 0. The latter then implies that \((z_k)\) converges R-linearly to \(z^*\). Since \(z^*\) was chosen arbitrarily from \(\varOmega \), it must be unique. \(\square \)

4 The Adaptive Bregman-Golden Ratio Algorithm

In this section, we present an adaptive modification to Algorithm 2 and analyse its convergence. As with the Euclidean aGRAAL, our Bregman adaptive modification has a fully explicit step-size rule. It is presented in Algorithm 3.

Algorithm 3 The Bregman adaptive Golden RAtio ALgorithm (B-aGRAAL).
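In brief, given \(z_0,{\overline{z}}_0\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\), an initial step size \(\lambda _0>0\) and parameters \(\phi >1\), \(\rho \ge 1\), \(\lambda _{\max }>0\), each iteration of B-aGRAAL computes

$$\begin{aligned} \lambda _k&= \min \left\{ \rho \lambda _{k-1},\, \frac{\sigma \phi \theta _{k-1}}{4\lambda _{k-1}}\frac{\Vert z_k - z_{k-1}\Vert ^2}{\Vert F(z_k) - F(z_{k-1})\Vert ^2},\, \lambda _{\max }\right\} , \qquad \theta _k = \phi \frac{\lambda _k}{\lambda _{k-1}},\\ \nabla h({\overline{z}}_k)&= \frac{(\phi -1)\nabla h(z_k) + \nabla h({\overline{z}}_{k-1})}{\phi },\\ z_{k+1}&= \mathop {{\textrm{argmin}}}\limits _{z\in {{\,\mathrm{{\mathcal {H}}}\,}}}\left\{ \lambda _k\left( \langle F(z_k),z\rangle + g(z)\right) + D_h(z,{\overline{z}}_k)\right\} , \end{aligned}$$

where the first line is the step-size rule referred to as (33) and the last is the Bregman proximal step (35) (cf. (36) and the proofs of Lemmas 4.1–4.3).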

We must note that in the case of \(F(z_k) = F(z_{k-1})\) in (33), we adopt the convention \(\frac{x}{0} = +\infty \) for all \(x\in {\mathbb {R}}\), and therefore, \(\lambda _k = \min \{\rho \lambda _{k-1},+\infty ,\lambda _{\max }\} = \min \{\rho \lambda _{k-1},\lambda _{\max }\}\).

Observe that the step-size sequence \((\lambda _k)\) in Algorithm 3 approximates the inverse of a local Lipschitz constant in the following sense:

$$\begin{aligned}{} & {} \lambda _k \le \frac{\sigma \phi \theta _{k-1}}{4\lambda _{k-1}}\frac{\Vert z_k - z_{k-1}\Vert ^2}{\Vert F(z_k) - F(z_{k-1})\Vert ^2} = \frac{\sigma \theta _k\theta _{k-1}}{4\lambda _k}\frac{\Vert z_k - z_{k-1}\Vert ^2}{\Vert F(z_k) - F(z_{k-1})\Vert ^2} \nonumber \\{} & {} \quad \implies \lambda _k\Vert F(z_k) - F(z_{k-1})\Vert \le \frac{\sqrt{\sigma \theta _k\theta _{k-1}}}{2}\Vert z_k - z_{k-1}\Vert . \end{aligned}$$
(36)

Before giving our main convergence result for Algorithm 3, we require some preparatory lemmas. The first two are concerned with well-definedness of the algorithm and boundedness of the step-size sequence. In particular, the proof of the latter is similar to that of [40, Lemma 2]; the only differences are that we account for the updated step-size rule and that we derive explicit bounds which will be useful later.

Lemma 4.1

Suppose Assumptions A.1–A.2 hold. Then, the sequences \(({\overline{z}}_k)\) and \((z_k)\) generated by Algorithm 3 are well defined. Moreover, \(({\overline{z}}_k)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) and \((z_k)\subseteq {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}g\).

Proof

Follows by an analogous argument to that of Lemma 3.1 but with \(\lambda \) replaced by \(\lambda _k\) and \(\varphi \) replaced by \(\phi \). \(\square \)

Lemma 4.2

If \((z_k)\) generated by Algorithm 3 is bounded and F is locally Lipschitz, then both \((\lambda _k)\) and \((\theta _k)\) are bounded and separated from 0. In fact, there exists some \(L>0\) satisfying \(\Vert F(z_k)-F(z_{k-1})\Vert \le L\Vert z_k-z_{k-1}\Vert \) for all \(k\in {\mathbb {N}}\), such that

$$\begin{aligned} \lambda _k\ge \frac{\sigma \phi ^2}{4L^2\lambda _{\max }}\text {~~and~~} \theta _k\ge \frac{\sigma \phi ^3}{4L^2\lambda _{\max }^2}\quad \forall k\in {\mathbb {N}}. \end{aligned}$$

Proof

First we note that \(\lambda _k\le \lambda _{\max }\) by definition, and that \(\theta _k\le \rho \phi \le 1+\frac{1}{\phi }\). Since \((z_k)\) is bounded and F is locally Lipschitz continuous, there exists \(L>0\) such that \(\Vert F(z_k)-F(z_{k-1})\Vert \le L\Vert z_k-z_{k-1}\Vert \) for all \(k\in {\mathbb {N}}\). Without loss of generality, choose L sufficiently large so that \(\lambda _i\ge \frac{\sigma \phi ^2}{4\,L^2\lambda _{\max }}\) for \(i=0,1\). Now, by way of induction, suppose that \(\lambda _i\ge \frac{\sigma \phi ^2}{4 L^2\lambda _{\max }}\) for all \(i=0,\dots ,k-1\). Then, there are three cases:

$$\begin{aligned} \lambda _k&= \lambda _{\max }\ge \lambda _{k-1}\ge \frac{\sigma \phi ^2}{4 L^2\lambda _{\max }},\\ \lambda _k&= \rho \lambda _{k-1}>\lambda _{k-1}\ge \frac{\sigma \phi ^2}{4 L^2\lambda _{\max }},\text {~or,}\\ \lambda _k&= \frac{\sigma \phi \theta _{k-1}}{4\lambda _{k-1}}\frac{\Vert z_k - z_{k-1}\Vert ^2}{\Vert F(z_k) - F(z_{k-1})\Vert ^2}\ge \frac{\sigma \phi ^2}{4L^2\lambda _{k-1}}\ge \frac{\sigma \phi ^2}{4L^2\lambda _{\max }}. \end{aligned}$$

In each case, the desired lower bound holds, and the bound \(\theta _k\ge \frac{\sigma \phi ^3}{4L^2\lambda _{\max }^2}\) follows immediately. \(\square \)

Our next result establishes an inequality similar, but not completely analogous, to its fixed step-size counterpart in Lemma 3.2.

Lemma 4.3

Suppose Assumptions A.1–A.4 hold. Let \(z^*\in \varOmega \) be arbitrary. Then, the sequences \((z_k)\), \(({\overline{z}}_k)\) generated by Algorithm 3 satisfy

$$\begin{aligned}{} & {} \frac{\phi }{\phi -1}D_h(z^*, {\overline{z}}_{k+1}) + \frac{\theta _k}{2}D_h(z_{k+1}, z_k) - D_h(z_{k+1}, {\overline{z}}_{k+1}) \nonumber \\{} & {} \qquad \qquad \le \frac{\phi }{\phi -1}D_h(z^*, {\overline{z}}_k) + \frac{\theta _{k-1}}{2}D_h(z_k, z_{k-1})- \theta _k D_h(z_k, {\overline{z}}_k)\nonumber \\{} & {} \qquad \qquad \quad +\left( \theta _k-1-\frac{1}{\phi }\right) D_h(z_{k+1}, {\overline{z}}_k). \end{aligned}$$
(37)

Proof

We proceed in a similar fashion to the proof of Lemma 3.2. By first applying Proposition 2.2(c) with \(f(z):= \lambda _k(\langle F(z_k), z-z_k\rangle + g(z))\), \(u:=z\in {{\,\textrm{dom}\,}}h\cap {{\,\textrm{dom}\,}}g\) arbitrary, \(x:=z_{k+1}\) and \(y:={\overline{z}}_k\), followed by the three-point identity (Proposition 2.1(a)), we obtain

$$\begin{aligned} \begin{aligned}&\lambda _k\left( \langle F(z_k),z_{k+1}-z\rangle + g(z_{k+1}) - g(z)\right) \\ {}&\quad \le \langle \nabla h({\overline{z}}_k) - \nabla h(z_{k+1}), z_{k+1} - z\rangle \\&\quad = D_h(z, {\overline{z}}_k) - D_h(z, z_{k+1}) - D_h(z_{k+1}, {\overline{z}}_k). \end{aligned} \end{aligned}$$
(38)

Shifting the index in (38) (by setting \(k\equiv k-1\)), setting \(z:=z_{k+1}\), and using the fact that \(\phi \nabla h({\overline{z}}_k) = (\phi -1)\nabla h(z_k) + \nabla h({\overline{z}}_{k-1})\), followed by the three-point identity (Proposition 2.1(a)), gives

$$\begin{aligned} \begin{aligned}&\lambda _{k-1}\left( \langle F(z_{k-1}), z_k - z_{k+1}\rangle + g(z_k) - g(z_{k+1})\right) \\&\quad \le \langle \nabla h({\overline{z}}_{k-1}) - \nabla h(z_k), z_k - z_{k+1}\rangle \\&\quad = \phi \langle \nabla h({\overline{z}}_k) - \nabla h(z_k), z_k - z_{k+1}\rangle \\&\quad = \phi \left[ D_h(z_{k+1},{\overline{z}}_k)-D_h(z_{k+1},z_k) -D_h(z_k,{\overline{z}}_k)\right] . \end{aligned} \end{aligned}$$
(39)

Now, multiplying both sides of (39) by \(\tfrac{\lambda _k}{\lambda _{k-1}}\) then gives

$$\begin{aligned}{} & {} \lambda _k\left( \langle F(z_{k-1}), z_k - z_{k+1}\rangle + g(z_k) - g(z_{k+1})\right) \nonumber \\{} & {} \quad \le \theta _k\left[ D_h(z_{k+1}, {\overline{z}}_k) - D_h(z_{k+1}, z_k) - D_h(z_k, {\overline{z}}_k)\right] . \end{aligned}$$
(40)

Let \(z^*\in \varOmega \). Setting \(z=z^*\) in (38), summing with (40) and rearranging yields

$$\begin{aligned} \begin{aligned}&\lambda _k\left( \langle F(z_k), z_k - z^*\rangle + g(z_k) - g(z^*)\right) \\&\quad \le D_h(z^*,{\overline{z}}_k) - D_h(z^*,z_{k+1}) - D_h(z_{k+1},{\overline{z}}_k)\\&\qquad +\theta _k\Big [D_h(z_{k+1},{\overline{z}}_k) - D_h(z_{k+1},z_k) - D_h(z_k,{\overline{z}}_k)\Big ]\\&\qquad +\lambda _k\langle F(z_k) - F(z_{k-1}), z_k - z_{k+1}\rangle . \end{aligned} \end{aligned}$$
(41)

We observe that the left side of (41) is nonnegative as a consequence of (10) and A.3:

$$\begin{aligned} 0\le \langle F(z^*),z_k-z^*\rangle + g(z_k) - g(z^*)\le \langle F(z_k),z_k-z^*\rangle + g(z_k) - g(z^*). \end{aligned}$$
(42)

To estimate the final term of (41), we use the Cauchy–Schwarz inequality, the local Lipschitz estimate (36) and \(\sigma \)-strong convexity of h to obtain

$$\begin{aligned} \begin{aligned}&\lambda _k\langle F(z_k) - F(z_{k-1}), z_k - z_{k+1}\rangle \\&\quad \le \frac{\sqrt{\sigma \theta _k\theta _{k-1}}}{2}\Vert z_k - z_{k-1}\Vert \Vert z_k - z_{k+1}\Vert \\&\quad \le \frac{\sigma \theta _{k-1}}{4}\Vert z_k - z_{k-1}\Vert ^2+\frac{\sigma \theta _k}{4}\Vert z_k - z_{k+1}\Vert ^2\\&\quad \le \frac{\theta _{k-1}}{2}D_h(z_k, z_{k-1}) + \frac{\theta _k}{2}D_h(z_{k+1}, z_k). \end{aligned} \end{aligned}$$
(43)

Combining (41), (42) and (43) gives

$$\begin{aligned} D_h(z^*, z_{k+1})\le & {} D_h(z^*, {\overline{z}}_k) + (\theta _k-1)D_h(z_{k+1}, {\overline{z}}_k) - \theta _k D_h(z_k, {\overline{z}}_k)\nonumber \\{} & {} - \frac{\theta _k}{2}D_h(z_{k+1}, z_k) + \frac{\theta _{k-1}}{2}D_h(z_k,z_{k-1}). \end{aligned}$$
(44)

Now, applying Proposition 2.1(c) with \(\nabla h(z_{k+1}) = \frac{\phi \nabla h({\overline{z}}_{k+1}) - \nabla h({\overline{z}}_k)}{\phi -1}\) and rearranging yields

$$\begin{aligned} \frac{\phi }{\phi -1}D_h(z^*, {\overline{z}}_{k+1})= & {} D_h(z^*, z_{k+1}) + \frac{\phi }{\phi -1}D_h(z_{k+1}, {\overline{z}}_{k+1}) \nonumber \\{} & {} + \frac{1}{\phi -1}\Big [D_h(z^*, {\overline{z}}_k) - D_h(z_{k+1}, {\overline{z}}_k)\Big ]. \end{aligned}$$
(45)

Combining (44) and (45), followed by collecting like-terms and rearranging, gives

$$\begin{aligned}{} & {} \frac{\phi }{\phi -1}D_h(z^*, {\overline{z}}_{k+1}) + \frac{\theta _k}{2}D_h(z_{k+1}, z_k) - D_h(z_{k+1}, {\overline{z}}_{k+1}) \nonumber \\{} & {} \qquad \qquad \qquad \le \frac{\phi }{\phi -1}D_h(z^*, {\overline{z}}_k) + \frac{\theta _{k-1}}{2}D_h(z_k, z_{k-1})- \theta _k D_h(z_k, {\overline{z}}_k) \nonumber \\{} & {} \qquad \qquad \qquad \quad +\left( \theta _k - \frac{\phi }{\phi -1}\right) D_h(z_{k+1}, {\overline{z}}_k) + \frac{1}{\phi -1}D_h(z_{k+1}, {\overline{z}}_{k+1}). \end{aligned}$$
(46)

Next we apply Proposition 2.1(c) once again to see that

$$\begin{aligned} D_h(z_{k+1},{\overline{z}}_{k+1})= & {} \frac{\phi -1}{\phi }\left[ D_h(z_{k+1},z_{k+1}) - D_h({\overline{z}}_{k+1},z_{k+1})\right] \nonumber \\{} & {} + \frac{1}{\phi }\left[ D_h(z_{k+1},{\overline{z}}_k) - D_h({\overline{z}}_{k+1},{\overline{z}}_k)\right] \nonumber \\\le & {} \frac{1}{\phi }D_h(z_{k+1},{\overline{z}}_k). \end{aligned}$$
(47)

The final line of (46) can therefore be estimated as

$$\begin{aligned} \begin{aligned}&\left( \theta _k - \frac{\phi }{\phi -1}\right) D_h(z_{k+1}, {\overline{z}}_k) + \frac{1}{\phi -1}D_h(z_{k+1}, {\overline{z}}_{k+1}) \\&\qquad \qquad \qquad \quad \le \left( \theta _k - \frac{\phi }{\phi -1} + \frac{1/\phi }{\phi -1}\right) D_h(z_{k+1}, {\overline{z}}_k) \\&\quad \qquad \qquad \qquad = \left( \theta _k -1-\frac{1}{\phi }\right) D_h(z_{k+1}, {\overline{z}}_k). \end{aligned} \end{aligned}$$
(48)

Substituting (48) into (46) gives (37), which completes the proof. \(\square \)

The following is our main result regarding convergence of the Bregman-adaptive GRAAL.

Theorem 4.1

Suppose Assumptions A.1–A.4 hold, and F is locally Lipschitz continuous on the bounded set \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Choose \(\phi \in \left( 1,\varphi \right) \) and \(\rho \in \left[ 1,\frac{1}{\phi }+\frac{1}{\phi ^2}\right) \). Then, the sequences \((z_k)\) and \(({\overline{z}}_k)\) generated by Algorithm 3 converge to a point in \(\varOmega \) whenever \(\lambda _{\textrm{max}}>0\) is sufficiently small.

Proof

Since \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\) is bounded and F is locally Lipschitz, there exists \(L>0\) such that F is L-Lipschitz continuous on \({{\,\textrm{dom}\,}}g\cap {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Suppose \(\lambda _{\textrm{max}}\) is sufficiently small so that \(\lambda _{\max }\le \frac{\phi \sqrt{\phi \sigma }}{2L}\) holds. Then, applying Lemma 4.2 gives \(\theta _k\ge \frac{\sigma \phi ^3}{4L^2\lambda _{\max }^2}\ge 1\) for all \(k\in {\mathbb {N}}\). Also, since \(\theta _k \le \rho \phi < 1+\frac{1}{\phi }\), there exists an \(\epsilon >0\) such that \(\theta _k-1-\frac{1}{\phi } \le -\epsilon \) for all \(k\in {\mathbb {N}}\).

Now, let \(z^*\in \varOmega \) be arbitrary and denote by \((\eta _k)\) the sequence

$$\begin{aligned} \eta _k:= \frac{\phi }{\phi -1}D_h(z^*, {\overline{z}}_k) +\frac{\theta _{k-1}}{2}D_h(z_k, z_{k-1}) - D_h(z_k, {\overline{z}}_k). \end{aligned}$$

Applying Lemma 4.3 yields

$$\begin{aligned} \eta _{k+1} \le \eta _k -\epsilon D_h(z_{k+1},{\overline{z}}_k). \end{aligned}$$
(49)

Next, we show that \((\eta _k)\) is bounded below. To this end, using the three-point identity (Proposition 2.1(a)) followed by (38) gives

$$\begin{aligned} \begin{aligned}&D_h(z^*,z_{k+1}) + D_h(z_{k+1},{\overline{z}}_{k+1}) \\&\quad = D_h(z^*,{\overline{z}}_{k+1}) + \langle \nabla h({\overline{z}}_{k+1}) - \nabla h(z_{k+1}), z^*-z_{k+1}\rangle \\&\quad = D_h(z^*,{\overline{z}}_{k+1}) + \frac{1}{\phi }\langle \nabla h({\overline{z}}_k) - \nabla h(z_{k+1}), z^*-z_{k+1}\rangle \\&\quad \le D_h(z^*, {\overline{z}}_{k+1}) + \frac{\lambda _k}{\phi }\left( \langle F(z_k), z^* - z_{k+1}\rangle - g(z_{k+1}) + g(z^*)\right) \\&\quad = D_h(z^*, {\overline{z}}_{k+1}) + \frac{\lambda _k}{\phi }\langle F(z_k) - F(z_{k+1}), z^* - z_{k+1}\rangle \\&\qquad +\frac{\lambda _k}{\phi }(\langle F(z_{k+1}), z^* - z_{k+1}\rangle - g(z_{k+1}) + g(z^*)). \end{aligned} \end{aligned}$$
(50)

Now, as in (42) (with \(k+1\) in place of k), the final term in (50) is nonpositive. Since \((\lambda _k)\) and \((\theta _k)\) are bounded and separated from 0 by Lemma 4.2, there exists a constant \(M>0\) such that

$$\begin{aligned} \begin{aligned}&\lambda _k\langle F(z_k) - F(z_{k+1}), z^* - z_{k+1}\rangle \\&\le \frac{\lambda _k}{\lambda _{k+1}}\lambda _{k+1}\Vert F(z_k) - F(z_{k+1})\Vert \Vert z^*-z_{k+1}\Vert \\&\le \frac{\lambda _k}{\lambda _{k+1}}\frac{\sqrt{\sigma \theta _{k+1} \theta _k}}{2}\Vert z_k-z_{k+1}\Vert \Vert z^*-z_{k+1}\Vert \le M. \end{aligned} \end{aligned}$$
(51)

Combining (50) and (51), noting that \(\phi <\frac{\phi }{\phi -1}\) then gives

$$\begin{aligned} D_h(z_{k+1},{\overline{z}}_{k+1})&\le D_h(z^*, {\overline{z}}_{k+1}) + \frac{M}{\phi }\nonumber \\&\le \phi D_h(z^*,{\overline{z}}_{k+1}) + M \implies \eta _{k+1} \ge -M, \end{aligned}$$
(52)

which establishes that \((\eta _k)\) is bounded below.

Next, by telescoping the inequality (49), we deduce that \((\eta _k)\) is bounded and \(D_h(z_{k+1},{\overline{z}}_k)\rightarrow 0\) as \(k\rightarrow \infty \). Referring to (47), it follows that \(D_h(z_{k+1},{\overline{z}}_{k+1})\rightarrow 0\). Also, by applying Proposition 2.1(c) with the identity \(\nabla h(z_k) = \frac{\phi \nabla h({\overline{z}}_k) - \nabla h({\overline{z}}_{k-1})}{\phi -1}\), we obtain

$$\begin{aligned} D_h(z_{k+1},z_k)&= \frac{\phi }{\phi -1}\left[ D_h(z_{k+1},{\overline{z}}_k) - D_h(z_k,{\overline{z}}_k)\right] \\&\quad - \frac{1}{\phi -1}\left[ D_h(z_{k+1},{\overline{z}}_{k-1}) - D_h(z_k,{\overline{z}}_{k-1})\right] \\&\le \frac{\phi }{\phi -1}D_h(z_{k+1},{\overline{z}}_k) + \frac{1}{\phi -1}D_h(z_k,{\overline{z}}_{k-1}) \rightarrow 0. \end{aligned}$$

Altogether, noting that \((\theta _k)\) is bounded, we deduce that

$$\begin{aligned} \lim _{k\rightarrow \infty }\eta _k = \frac{\phi }{\phi -1}\lim _{k\rightarrow \infty }D_h(z^*,{\overline{z}}_k), \end{aligned}$$

and so, in particular, \(\lim _{k\rightarrow \infty }D_h(z^*,{\overline{z}}_k)\) exists.

Next, using \(\sigma \)-strong convexity of h, we deduce that \(z_{k+1}-{\overline{z}}_k\rightarrow 0\). Let \({\overline{z}}\in {{\,\mathrm{{\mathcal {H}}}\,}}\) be a cluster point of the bounded sequence \(({\overline{z}}_k)\). Then, there exists a subsequence \(({\overline{z}}_{k_j})\) such that \({\overline{z}}_{k_j}\rightarrow {\overline{z}}\) and \(z_{k_j+1}\rightarrow {\overline{z}}\) as \(j\rightarrow \infty \). Now, recalling (38) gives

$$\begin{aligned}{} & {} \lambda _{k_j}\left( \langle F(z_{k_j}),z_{{k_j}+1}-z\rangle + g(z_{{k_j}+1}) - g(z)\right) \\{} & {} \quad \le \langle \nabla h({\overline{z}}_{k_j}) - \nabla h(z_{{k_j}+1}), z_{{k_j}+1} - z\rangle \quad \forall z\in {{\,\mathrm{{\mathcal {H}}}\,}}, \end{aligned}$$

and taking the limit inferior of both sides as \(j\rightarrow \infty \) shows that \({\overline{z}}\in \varOmega \). Since \(z^*\in \varOmega \) was chosen in Lemma 4.3 to be arbitrary, we can now set \(z^*={\overline{z}}\). It then follows that \(\lim _{j\rightarrow \infty }D_h(z^*,{\overline{z}}_{k_j}) = 0\), and consequently, \(\lim _{j\rightarrow \infty }\eta _{k_j} = 0\). Also note that for \(n\ge k_j\), we have \(\eta _n\le \eta _{k_j}\) from (49), and therefore,

$$\begin{aligned} \frac{\phi }{\phi -1}\lim _{n\rightarrow \infty }D_h(z^*, {\overline{z}}_n) = \lim _{n\rightarrow \infty }\eta _n \le \lim _{j\rightarrow \infty }\eta _{k_j} = 0, \end{aligned}$$

and \({\overline{z}}_k\rightarrow z^*\) from strong convexity. The fact that \(z_k\rightarrow z^*\) follows since \(z_k-{\overline{z}}_k\rightarrow 0\). \(\square \)

Remark 4.1

The energy \(\eta _k\) used in the proof of Theorem 4.1 (see (49)) is not completely analogous to the one used in the fixed step-size case given in Lemma 3.2. Notably, the final term of \(\eta _k\) has coefficient \(-1\), whereas the corresponding term in Lemma 3.2 has coefficient \(-\varphi \). Although it is unlikely to be useful in practice, another interesting feature of the proof is that it requires the maximum step size to satisfy the upper bound

$$\begin{aligned} \lambda _{\textrm{max}}\le \frac{\phi \sqrt{\phi \sigma }}{2L} = \sqrt{\frac{\phi }{\sigma }}\frac{\sigma \phi }{2L}. \end{aligned}$$

Note that when \(\phi >\sigma \), this upper bound is looser than the upper bound of \(\lambda <\frac{\sigma \phi }{2L}\) required for B-GRAAL (Algorithm 2). Thus, it is possible for the maximum step size of B-aGRAAL to be larger than the step size required for B-GRAAL. This is the case, for instance, if \(D_h\) is the KL divergence on the simplex, in which case \(\sigma =1<\phi \).

Also, the proof of Theorem 4.1 required L to be a Lipschitz constant for F. However, thanks to Lemma 4.2, it would have sufficed for L to satisfy the weaker Lipschitz-like inequality \(\Vert F(z_k)-F(z_{k-1})\Vert \le L\Vert z_k-z_{k-1}\Vert \) for all \(k\in {\mathbb {N}}\). In turn, this would allow for a larger upper bound on \(\lambda _{\textrm{max}}\), which is inversely related to L.

Remark 4.2

A similar analysis could be applied to infer weak convergence in infinite-dimensional spaces for Algorithms 2 and 3. However, this would firstly require the additional condition of weak–weak continuity of \(\nabla h\). Lemma 4.2 in this context would also require that F is Lipschitz continuous over bounded sets, which in infinite dimensions is a stronger condition than local Lipschitz continuity. We therefore decided to work in finite dimensions for simplicity.

5 Numerical Experiments

In this section, we present some experimental comparisons between Algorithms 2 and 3 and their respective Euclidean counterparts. To be precise, in Sect. 5.1 we compare the Kullback–Leibler version of Algorithm 2 to the Euclidean version, which is equivalent to GRAAL. In Sect. 5.2, we make the same comparison for Algorithm 3, recalling once again that aGRAAL is recovered in the Euclidean case. Finally, in Sect. 5.3, we compare the Euclidean aGRAAL to Algorithm 3 with two distinct choices of Bregman distances. All experiments are run in Python 3 on a Windows 10 machine with 8GB memory and an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz processor.

We consider three different problems, all of which can be formulated as the variational inequality (10) for different choices of the function g and the operator F. Note that solutions of the variational inequality (10) can also be characterised via the monotone inclusion

$$\begin{aligned} \text {find}~z^*\in {{\,\mathrm{{\mathcal {H}}}\,}}~\text {such that}~0\in F(z^*) + \partial g(z^*). \end{aligned}$$

Thus, by noting that (35) implies \(0 \in F(z_k) + \partial g(z_{k+1}) + \frac{1}{\lambda _k}\left( \nabla h(z_{k+1}) - \nabla h({\overline{z}}_k)\right) \), we monitor the quantity \(J_k\) given by

$$\begin{aligned} J_k:= \frac{1}{\lambda _k}\left( \nabla h({\overline{z}}_k) - \nabla h(z_{k+1})\right) + F(z_{k+1}) - F(z_k) \in F(z_{k+1}) + \partial g(z_{k+1}) \end{aligned}$$
(53)

as a natural residual for Algorithm 3. The analogous expression for Algorithm 2 is given by replacing \(\lambda _k\) in (53) with \(\lambda \).
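
For concreteness, the residual (53) can be evaluated directly from quantities already available at each iteration. The following Python sketch illustrates this; the callables grad_h and F are placeholders for the problem-specific \(\nabla h\) and operator F, and the variable names are illustrative rather than taken from our implementation.

```python
import numpy as np

def residual(grad_h, F, z_bar_k, z_k, z_k1, lam_k):
    """Natural residual J_k from (53) for Algorithm 3.

    grad_h : callable returning the gradient of the Bregman function h
    F      : callable returning the operator value F(z)
    For Algorithm 2, pass the fixed step size lambda in place of lam_k.
    """
    J_k = (grad_h(z_bar_k) - grad_h(z_k1)) / lam_k + F(z_k1) - F(z_k)
    return np.linalg.norm(J_k) ** 2  # the monitored quantity is min over i <= k of ||J_i||^2
```
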

In all experiments, we run each algorithm for the same (fixed) number of iterations on 10 random instances of the chosen problem. The figures in each section show the decrease in residual over time, with each faint line representing one instance and the bold line showing the mean behaviour. We use the parameters \(\phi =1.5,\lambda _{\max }=10^6\) throughout, with the initial step size and iterates as described in each section.

5.1 Matrix Games

To test the fixed step-size B-GRAAL (Algorithm 2), we first consider the following matrix game between two players

$$\begin{aligned} \min _{x\in \Delta ^n}\max _{y\in \Delta ^n}\langle Mx,y\rangle , \end{aligned}$$
(54)

where \(\Delta ^n:= \{x\in {\mathbb {R}}^n_+:\sum _{i=1}^n x_i = 1\}\) denotes the unit simplex and \(M\in {\mathbb {R}}^{n\times n}\) a given matrix. This problem is of the form specified in (1) and so can be formulated as the variational inequality (10).

In particular, we consider the specific problem, in the form of (54), of placing a server on a network \(G = (V,E)\) with n vertices in a way that minimises its response time. In this problem, a request will originate at some vertex \(v_j\in V\), which is not known ahead of time, and the objective is to place the server at a vertex \(v_i\in V\) such that the response time, as measured by the graphical distance \(d(v_i,v_j)\), is minimised. We consider the case where the request location \(v_j\in V\) is a decision made by an adversary. The decision variable \(x\in \Delta ^n\) (resp. \(y\in \Delta ^n\)) models mixed strategies for placement of the server (resp. request origin). In other words, \(x_t\) (resp. \(y_t\)) is the probability of the server (resp. request origin) being located at the node \(v_t\) for \(t=1,\dots ,n\). The matrix M is taken to be the distance matrix of the graph G, that is, \(M_{ij} = d(v_i,v_j)\) for all vertices \(v_i,v_j\in V\). In this way, the objective function in (54) measures the expected response time, which we would like to minimise while our adversary seeks to maximise it.
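
To illustrate how such an instance can be assembled, the sketch below builds the payoff matrix M from all-pairs graph distances and evaluates the corresponding operator \(F(x,y)=(M^\top y,\,-Mx)\) for (54). The random graph model used here (a connected Watts–Strogatz graph generated with networkx) is an illustrative assumption only and is not prescribed by the experiments above.

```python
import numpy as np
import networkx as nx

def matrix_game_instance(n, seed=0):
    """Payoff matrix for the server-placement game: M[i, j] = d(v_i, v_j).

    The random graph model (connected Watts-Strogatz) is an illustrative assumption.
    """
    G = nx.connected_watts_strogatz_graph(n, k=4, p=0.1, seed=seed)
    M = np.asarray(nx.floyd_warshall_numpy(G))  # all-pairs shortest-path distances
    return M

def F_matrix_game(z, M):
    """Operator for min_x max_y <Mx, y>: F(x, y) = (M^T y, -M x)."""
    n = M.shape[0]
    x, y = z[:n], z[n:]
    return np.concatenate([M.T @ y, -M @ x])
```
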

We compare Algorithm 2 with the squared norm \(h(z) = \frac{1}{2}\Vert z\Vert ^2\), which generates the squared Euclidean distance \(D_h(u,v) = \frac{1}{2}\Vert u-v\Vert ^2\), and the negative entropy \(h(z) = \sum _{i=1}^{2n} z_i\log z_i\), which generates the KL divergence \(D_h(u,v) = \sum _{i=1}^{2n}u_i\log \frac{u_i}{v_i} + v_i - u_i\), where \(z=(x,y)\in {\mathbb {R}}^{2n}\) is the concatenation of the two vectors x and y. Both of these choices for h are 1-strongly convex on \(\Delta ^n\). A potential advantage of the KL projection onto the simplex is that it has the simple closed-form expression \(x\mapsto \frac{x}{\Vert x\Vert _1}\), whereas the Euclidean projection has no closed form and takes \(O(n\log n)\) time (see, for instance, [18, 19, 50]). We run two experiments with \(n=500\) and \(n=1000\), and the results are shown, respectively, in Figs. 1 and 2. Initial points are chosen as \(x_0=y_0=\left( \frac{1}{n},\dots ,\frac{1}{n}\right) \in \Delta ^n\), and \(z_0=(x_0,y_0)\), then \({\overline{z}}_0\) as a random perturbation of \(z_0\). The Lipschitz constant of F is computed as \(L=\Vert M\Vert _2\), and the algorithm step size is taken as \(\lambda =\frac{\varphi }{2L}\). Under these conditions, convergence to a solution is guaranteed by Theorem 3.1.
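
Both projections admit short implementations. The following sketch records the closed-form KL projection and a standard \(O(n\log n)\) sort-based routine for the Euclidean projection; this is illustrative code rather than an excerpt from our implementation.

```python
import numpy as np

def kl_project_simplex(x):
    """KL projection of a positive vector onto the unit simplex: x / ||x||_1."""
    return x / np.sum(x)

def euclidean_project_simplex(x):
    """Euclidean projection onto the unit simplex via the usual O(n log n) sort-based scheme."""
    u = np.sort(x)[::-1]                                 # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, x.size + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]   # largest feasible index
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - tau, 0.0)
```
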

The residual, given by \(\min _{i\le k}\Vert J_i\Vert ^2\), is shown in Figs. 1a and 2a. The time per iteration is also shown in Figs. 1b and 2b. Despite the KL projection being faster than the Euclidean projection, overall the Euclidean method performed better in these experiments.

We now move onto experiments for the adaptive Algorithm 3.

Fig. 1 Matrix game results for \(n=500\)

Fig. 2 Matrix game results for \(n=1000\)

5.2 Gaussian Communication

We now turn our attention to maximising the information capacity of a noisy Gaussian communication channel [16, Chapter 9]. In this problem, the goal is to allocate a total power of P across m channels, represented by \(p\in {\mathbb {R}}^m_+\), to maximise the total information capacity of the channels in the presence of allocated noise, represented by \(n\in {\mathbb {R}}^m_+\). The information capacity of the ith channel, denoted \(C_i(p_i,n_i)\), is a function of the power \(p_i\) and the noise level \(n_i\), given by

$$\begin{aligned} C_i(p_i,n_i) = \log \left( 1+\frac{\beta _i p_i}{\mu _i+n_i}\right) , \end{aligned}$$

where \(\mu _i>0\) and \(\beta _i>0\) are given constants.

Assuming a total power level of P and a total noise level of N, optimising for the worst-case scenario by treating the noise allocation as an adversary gives the convex–concave game

$$\begin{aligned} \max _{p\in \Delta ^m_P}\min _{n\in \Delta ^m_N} \sum _{i=1}^m C_i(p_i,n_i), \end{aligned}$$

where \(\Delta ^m_T:= \{x\in {\mathbb {R}}^m_+:\sum _{i=1}^m x_i = T\}\) denotes a scaled simplex. This problem is also of the form specified in (1) and so can also be formulated as the variational inequality (10).
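
For reference, the partial derivatives of \(C_i\) have simple closed forms, so the operator F is cheap to evaluate. The sketch below assumes the ordering \(z=(p,n)\) and the monotone sign convention \(F=(-\nabla _p\sum _i C_i,\,\nabla _n\sum _i C_i)\); this convention is our reading of the formulation above and is stated here as an assumption.

```python
import numpy as np

def F_channel(z, beta, mu):
    """Operator for max_p min_n sum_i C_i(p_i, n_i) with z = (p, n).

    Assumed sign convention: F = (-grad_p sum_i C_i, grad_n sum_i C_i).
    """
    m = beta.size
    p, n = z[:m], z[m:]
    denom = mu + n + beta * p
    dC_dp = beta / denom                      # d/dp_i of log(1 + beta_i p_i / (mu_i + n_i))
    dC_dn = -beta * p / ((mu + n) * denom)    # d/dn_i of the same expression
    return np.concatenate([-dC_dp, dC_dn])
```
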

The Lipschitz constant of the operator F for this problem is not straightforward to compute, so we apply the adaptive algorithms. Similarly to Sect. 5.1, we compare the Euclidean and KL versions of Algorithm 3. Since \(x\mapsto x\log x\) is \(\frac{1}{M}\)-strongly convex for \(0<x\le M\), the strong convexity constant of the negative entropy over \(\Delta ^m_P\times \Delta ^m_N\) is \(\min \left\{ \frac{1}{P},\frac{1}{N}\right\} \). In our experiments, we set \((P,N)=(500,50)\) and generate \(\beta \in (0,P]^m\) and \(\mu \in (1,N+1]^m\) uniformly. The initial points are chosen as \(p_0=\left( \frac{P}{m},\dots ,\frac{P}{m}\right) \in \Delta ^m_P\) and \(n_0=\left( \frac{N}{m},\dots ,\frac{N}{m}\right) \in \Delta ^m_N\), with \(z_0=(p_0,n_0)\), and the initial step size is taken as \(\lambda _0=\frac{\Vert z_0-{\overline{z}}_0\Vert ^2}{\Vert F(z_0)-F({\overline{z}}_0)\Vert ^2}\), where \({\overline{z}}_0\) is again a small random perturbation of \(z_0\). It is also worth noting that for the KL version, we had to multiply \(\lambda _0\) by a small constant (\(10^{-2}\)) to avoid numerical instability issues.
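
A minimal sketch of this initialisation is given below; the perturbation scale is an illustrative choice, while the damping factor corresponds to the small constant (\(10^{-2}\)) mentioned above for the KL version.

```python
import numpy as np

def initial_step(F, z0, scale=1e-3, damping=1.0, seed=0):
    """lambda_0 = ||z0 - zbar0||^2 / ||F(z0) - F(zbar0)||^2 for a random perturbation zbar0.

    'scale' (perturbation size) is an illustrative choice; set damping = 1e-2 for the
    KL version, as described in the text, to avoid numerical instability.
    """
    rng = np.random.default_rng(seed)
    z_bar0 = z0 + scale * rng.standard_normal(z0.shape)
    lam0 = np.linalg.norm(z0 - z_bar0) ** 2 / np.linalg.norm(F(z0) - F(z_bar0)) ** 2
    return damping * lam0, z_bar0
```
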

Fig. 3 Gaussian communication channel results for \(m=100\)

Fig. 4 Gaussian communication channel results for \(m=200\)

We run two experiments, for \(m=100\) and \(m=200\), and plot the results in Figs. 3 and 4, respectively. The KL method is slower than the Euclidean method in terms of the number of iterations and time. However, unlike in the previous section, both methods reach a similar final accuracy.

5.3 Cournot Competition

Our final example is a standard N-player Cournot oligopoly model [14, Example 2.1]. This is a system in which N independent firms supply the market with a quantity of some common good or service. More formally, each firm seeks to maximise its utility subject to its capacity constraint, that is,

$$\begin{aligned} \displaystyle {\max _{0\le x_i\le C_i}} u_i(x_i,x_{-i}) = x_i P(x_T) - c_i x_i, \end{aligned}$$
(55)

where

  • \(x_i\ge 0\) is the quantity of the good supplied by the ith firm, \(i=1,\dots ,N\).

  • \(x_{-i} = (x_1,\dots ,x_{i-1},x_{i+1},\dots ,x_N)\) is the quantity of the good supplied by all other firms.

  • \(x_T:= \sum _{i=1}^N x_i\) is the total amount supplied.

  • \(C_i>0\) is the production capacity of the ith firm.

  • \(c_i>0\) is the production cost of the ith firm.

  • \(P:{\mathbb {R}}_+\rightarrow {\mathbb {R}}\) is the inverse demand curve.

In this section, we consider solutions to this problem in the sense of Nash equilibria; finding such an equilibrium is equivalent to solving the variational inequality (10) with

$$\begin{aligned} F = \left( -\frac{\partial u_1}{\partial x_1}, \dots , -\frac{\partial u_N}{\partial x_N}\right) ,\quad K = \left[ 0,C_1\right] \times \dots \times \left[ 0,C_N\right] ,\quad g = \iota _K. \end{aligned}$$

By choosing a function h such that \({{\,\textrm{dom}\,}}h=K\), it is possible to enforce the capacity constraints implicitly and so avoid performing projections. Two examples which satisfy our assumptions, defined over a single closed interval \([\alpha ,\beta ]\), present themselves:

$$\begin{aligned} h_1(x) = (x-\alpha )\log (x-\alpha ) + (\beta -x)\log (\beta -x), \quad h_2(x) = -\sqrt{(x-\alpha )(\beta -x)}. \end{aligned}$$

When \(\alpha =0\) and \(\beta =1\), \(h_1\) is called the Fermi-Dirac entropy, and similarly, when \(\alpha =-1\) and \(\beta =1\), \(D_{h_2}\) is called the Hellinger distance ([9, Example 2.2], [8, Example 1], [49, Example 2.1]). Then, we sum over the independent functions of one variable, with appropriately set intervals \(\alpha =0,\beta =C_i\), to create the Bregman distance over the closed box K. We also compute the strong convexity constants by minimising \(\nabla ^2 h_1\) over \((\alpha ,\beta )\), which gives \(\sigma = \frac{4}{\beta -\alpha }\), and similarly for \(h_2\), \(\sigma =\frac{2}{\beta -\alpha }\).
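
Implementing the resulting Bregman update only requires \(\nabla h\) and \((\nabla h)^{-1}\) componentwise, and both admit closed forms for \(h_1\) and \(h_2\). The sketch below records these expressions together with the interior update \(z_{k+1}=(\nabla h)^{-1}\left( \nabla h({\overline{z}}_k)-\lambda _k F(z_k)\right) \), which is our reading of (35) when \(g=\iota _K\) and \({{\,\textrm{dom}\,}}h=K\); the inverse-gradient formulas are our own derivation from the definitions above and should be checked against one's own implementation.

```python
import numpy as np

# Fermi-Dirac entropy h1(x) = (x-a)log(x-a) + (b-x)log(b-x) on (a, b)
def grad_h1(x, a, b):
    return np.log(x - a) - np.log(b - x)

def grad_h1_inv(t, a, b):
    # solves log((x-a)/(b-x)) = t, i.e. x = a + (b-a) * sigmoid(t)
    return a + (b - a) / (1.0 + np.exp(-t))

# Hellinger-type kernel h2(x) = -sqrt((x-a)(b-x)) on (a, b)
def grad_h2(x, a, b):
    return (2.0 * x - a - b) / (2.0 * np.sqrt((x - a) * (b - x)))

def grad_h2_inv(t, a, b):
    # solves (x-c)/sqrt(r^2 - (x-c)^2) = t with c = (a+b)/2, r = (b-a)/2
    c, r = 0.5 * (a + b), 0.5 * (b - a)
    return c + r * t / np.sqrt(1.0 + t ** 2)

def bregman_step(grad_h, grad_h_inv, z_bar, Fz, lam, C):
    """Interior update z_{k+1} = (grad h)^{-1}(grad h(z_bar) - lam * F(z_k)),
    applied componentwise over K = [0, C_1] x ... x [0, C_N] (alpha = 0, beta = C_i)."""
    return grad_h_inv(grad_h(z_bar, 0.0, C) - lam * Fz, 0.0, C)
```
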

In our experiments, P is taken to be a linear inverse demand curve given by \(P(x) = a-bx\) for \(a,b>0\). All parameters are generated by a log-normal distribution, except the cost vector \(c\in {\mathbb {R}}^N\), which is generated uniformly in \(\left[ \frac{C_1}{100},\frac{C_1}{5}\right] \times \dots \times \left[ \frac{C_N}{100},\frac{C_N}{5}\right] \). We run two experiments, with \(N=2000\) and \(N=5000\), the results of which are shown in Figs. 5 and 6. The initial points are chosen as \(z_0 = \frac{C}{2}\), and \(\lambda _0=1\).
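
For the linear inverse demand curve \(P(x)=a-bx\), the operator F from the Nash formulation above has the explicit componentwise form \(F_i(x)=-\partial u_i/\partial x_i = b(x_T+x_i)+c_i-a\). A short sketch, illustrative only, follows.

```python
import numpy as np

def F_cournot(x, a, b, c):
    """Operator for the Cournot game with linear inverse demand P(t) = a - b*t.

    From u_i(x) = x_i * (a - b * x_T) - c_i * x_i with x_T = sum(x), we get
    F_i(x) = -du_i/dx_i = b * (x_T + x_i) + c_i - a.
    """
    x_T = np.sum(x)
    return b * (x_T + x) + c - a
```
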

Fig. 5 Results for \(N=2000\)

Fig. 6 Results for \(N=5000\)

We observe here that the final accuracy depends heavily on the choice of Bregman distance. In both figures, the Hellinger method makes no further progress after approximately 200 iterations, while the Euclidean method achieves a modest final accuracy. Meanwhile, the Fermi-Dirac (bit entropy) method is very fast and accurate, converging to a near-zero residual in roughly 300 iterations in all instances. Finally, we note that all methods are roughly equal in terms of time per iteration. This is to be expected, since the Euclidean method requires a projection which evaluates \(\min \{\max \{x_i,0\}, C_i\}\) for each component, whereas the Bregman methods require evaluating \(\nabla h\) and \((\nabla h)^{-1}\) for each component, all of which take O(N) time.

6 Conclusion

In this paper, we extended the adaptive method aGRAAL (Algorithm 1) to the Bregman distance setting. We proposed two such extensions: the first, Algorithm 2, generalises the fixed step-size GRAAL and converges under the same assumptions for a strongly convex Bregman function \(h:{{\,\mathrm{{\mathcal {H}}}\,}}\rightarrow {\mathbb {R}}\). The second, Algorithm 3, generalises Algorithm 1 and converges in a more restrictive setting. We first examined the performance of Algorithm 2 and found that the KL version is less favourable than the Euclidean version, despite the reduced time per iteration. We then tested Algorithm 3 on a convex–concave game for Gaussian communication channels, where the KL version of our new method performed worse than the Euclidean method, although its run-time per iteration was again significantly shorter, as expected. Finally, we examined a Cournot competition model, where one of the Bregman-based methods reached a much higher accuracy very quickly.

We conclude by outlining directions for further research:

  • It would be interesting to know whether Algorithm 3 can be shown to converge in a more general setting than the one considered here, and if so, under what circumstances. The difficulties in our analysis arose from two issues: first, the estimate derived in (47) for the Bregman case is weaker than the Euclidean equality \(\Vert z_{k+1}-{\overline{z}}_{k+1}\Vert ^2 = \frac{1}{\phi ^2}\Vert z_{k+1}-{\overline{z}}_k\Vert ^2\) used in [40]; and second, \(\theta _k\) cannot be bounded from below without additional assumptions (the bound in Lemma 4.2 can be arbitrarily small in general).

  • Throughout this paper, h was assumed strongly convex, but whether or not this assumption can be relaxed is unclear. One potential replacement for strong convexity is considered in [6, 36] where h is twice differentiable such that the Hessian matrix \(\nabla ^2 h(z)\) is positive definite for all \(z\in {{\,\textrm{int}\,}}{{\,\textrm{dom}\,}}h\). Within the scope of twice differentiable functions, such a condition lies between strict and strong convexity, the main consequence being that \(\nabla h\) is locally Lipschitz and h is locally strongly convex. Indeed, for \(\sigma \)-strongly convex h with \(\eta \)-Lipschitz gradient, one can derive the estimates

    $$\begin{aligned} \frac{\sigma }{2}\Vert z-z'\Vert ^2 \le D_h(z,z') \le \frac{\eta }{2}\Vert z-z'\Vert ^2\quad \forall z,z'. \end{aligned}$$

    If such inequalities hold on a local scale, then it would remain to be seen whether these coefficients can be estimated and used in the same way that \(\lambda _k\) approximates an inverse of the local Lipschitz constant of F.

  • In the context of convex composite optimisation problems of the form \(\min _{x\in {{\,\mathrm{{\mathcal {H}}}\,}}}f(x) + g(x)\) where g is non-smooth and f is smooth, the Bregman proximal gradient algorithm [8, 49] is known to converge under assumptions more general than Lipschitz continuity. Specifically, L-Lipschitz continuity of \(\nabla f\) can be relaxed to convexity of the function \(Lh-f\), which indeed holds (for a suitably large L) if \(\nabla f\) is Lipschitz and h is strongly convex. It would be interesting to see whether a similar relaxation of Lipschitz continuity can be used in the context of the algorithms discussed here.