1 Introduction

Saddle point (or minimax) problems arise classically in game theory [23] and, for instance, in the determination of primal-dual pairs of optimal solutions of constrained convex optimisation problems [1]. In recent years, however, they have attracted increased interest due to many relevant and challenging applications in the field of machine learning, most prominently the training of Generative Adversarial Networks (GANs) [10]. Although problems encountered in practice often do not have this structure, in the classical setting the minimax objective comprises a smooth convex-concave coupling function with Lipschitz continuous gradient and a (potentially nonsmooth) regulariser in each variable, leading to a convex-concave objective in total.

Owing to its simplicity and computational efficiency, one well-established method in practice is Gradient Descent Ascent (GDA), either in a simultaneous or in an alternating variant (for a recent comparison of the convergence behaviour of the two schemes we refer to [24]). However, a naive application of GDA is known to lead to oscillatory behaviour or even divergence already in simple cases such as bilinear objectives. Most algorithms with convergence guarantees in the general convex-concave setting make use of the formulation of the first order optimality conditions as a monotone inclusion or variational inequality, treating both components in a symmetric fashion. Examples are the Extragradient method [12], whose application to minimax problems has been studied in [19] under the name of Mirror Prox, and the Forward-Backward-Forward method (FBF) [22], with application to saddle point problems in [3]. Both algorithms have even been applied successfully to the training of GANs (see [3, 9]) but, although they are single-loop methods, suffer in practice from requiring two gradient evaluations per iteration. A possible way to avoid this is to reuse previous gradients. Doing so for FBF, as shown in [3], recovers the Forward-Reflected-Backward method [15], which was applied to saddle point problems under the name of Optimistic Mirror Descent and to GAN training under the name of Optimistic Gradient Descent Ascent [5, 6, 14].

The first method treating general coupling functions with an asymmetric scheme is the Accelerated Primal-Dual Algorithm (APD) from [11], which performs an optimistic gradient ascent step in one component followed by a gradient descent step in the other. In the special case of a bilinear coupling function, APD recovers the Primal-Dual Hybrid Gradient Method (PDHG) [4]. When the minimax objective is strongly convex-concave, acceleration of PDHG is obtained in [4]; the same is done for APD in [11], however only under the rather restrictive assumption that the coupling function is linear in one component.

In this paper we introduce a novel algorithm, OGAProx, for solving a convex-concave saddle point problem in which the convex-concave coupling function is smooth in one variable and nonsmooth in the other, and is augmented by a nonsmooth regulariser in the smooth component. OGAProx consists of an optimistic gradient ascent step in the smooth component of the coupling function combined with a proximal step of the regulariser, followed by a proximal step of the coupling function in the nonsmooth component. We will also be able to accelerate our method in the convex-strongly concave setting without any linearity assumption on the coupling function. Furthermore, we prove linear convergence if the problem is strongly convex-strongly concave, yielding results similar to those for PDHG [4] in the bilinear case.

In most works so far, nonsmoothness is introduced only via regularisers, since the coupling function is typically accessed through gradient evaluations. Recently there has been another development, albeit for saddle point problems that are not convex-concave, in which the assumption of differentiability of the coupling function in both components is weakened to differentiability in only one component [2]. Since the evaluation of the proximal mapping does not require differentiability, we will likewise assume the coupling function to be smooth in only one component.

The remainder of the paper is organised as follows. Next we will introduce the precise problem formulation and the setting we will work with, formulate the proposed algorithm OGAProx and state our contributions. This will be followed by preliminaries in Sect. 2. Afterwards we will discuss the properties of our algorithm in the convex-concave and convex-strongly concave settings and state the respective convergence results in Sect. 3. After that we will investigate the convergence of the method under the additional assumption of strong convexity-strong concavity in Sect. 4. The paper will be concluded by numerical experiments in Sect. 5, where we treat a simple nonsmooth-linear saddle point problem, the training of multi-kernel support vector machines and a classification problem taking into account minimax group fairness.

1.1 Problem description

Consider the saddle point problem

$$\begin{aligned} \min _{x\in {{\mathcal {H}}}} \max _{y\in {{\mathcal {G}}}} \Psi (x,y) := \Phi (x,y) - g(y), \end{aligned}$$
(1)

where \({{\mathcal {H}}}, \, {{\mathcal {G}}}\) are real Hilbert spaces, \(\Phi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) is a coupling function with \({{\,\mathrm{dom}\,}}\Phi := \{(x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\ | \ \Phi (x,y) < +\infty \} \ne \emptyset\) and \(g: {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) a regulariser. Throughout the paper (unless otherwise specified) we will make the following assumptions:

  • g is proper, lower semicontinuous and convex with modulus \(\nu \ge 0\), i.e. \(g - \frac{\nu }{2} \left\Vert \, \cdot \, \right\Vert ^{2}\) is convex (notice that we allow the situation \(\nu = 0\), in which case g is merely convex; otherwise g is strongly convex);

  • for all \(y \in {{\,\mathrm{dom}\,}}g\), \(\Phi (\, \cdot \,,y): {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) is proper, convex and lower semicontinuous;

  • for all \(x \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) := \{ u \in {{\mathcal {H}}}\ | \ \exists y \in {{\mathcal {G}}}\ \text{ such } \text{ that } \ (u,y) \in {{\,\mathrm{dom}\,}}\Phi \}\) we have that \({{\,\mathrm{dom}\,}}\Phi (x, \, \cdot \,) = {{\mathcal {G}}}\) and \(\Phi (x,\, \cdot \,): {{\mathcal {G}}}\rightarrow \mathbb {R}\) is concave and Fréchet differentiable. Moreover, \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) is closed;

  • there exist \(L_{yx}, \, L_{yy} \ge 0\) such that for all \((x,y), \, (x',y') \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \times {{\,\mathrm{dom}\,}}g\) it holds

    $$\begin{aligned} \left\Vert \nabla _{y}\Phi (x, y) - \nabla _{y}\Phi (x', y') \right\Vert \le L_{yx} \left\Vert x-x' \right\Vert + L_{yy} \left\Vert y - y' \right\Vert . \end{aligned}$$
    (2)

By convention we set \(+\infty - (+\infty ) := +\infty\). Thus, the situation can be summarised by

$$\begin{aligned} \Psi (x, y) = {\left\{ \begin{array}{ll} - \infty &{} \text {if } x \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \text { and } y \notin {{\,\mathrm{dom}\,}}g,\\ \Phi (x,y) - g(y) &{} \text {if } x \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \text { and } y \in {{\,\mathrm{dom}\,}}g,\\ + \infty &{} \text {if } x \notin {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ). \end{array}\right. } \end{aligned}$$
(3)

We are interested in finding a saddle point of (1), which is a point \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) that fulfils the inequalities

$$\begin{aligned} \Psi (x^{*}, y) \le \Psi (x^{*}, y^{*})\le \Psi (x, y^{*}) \quad \forall (x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}. \end{aligned}$$
(4)

For the remainder we assume that such a saddle point exists.

The assumptions considered above ensure that for any saddle point \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) we have

$$\begin{aligned} x^{*}\in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ), \quad y^{*}\in {{\,\mathrm{dom}\,}}g \ \text{ and } \ \Psi (x^{*}, y^{*}) = \Phi (x^{*},y^{*}) - g(y^{*})\in \mathbb {R}. \end{aligned}$$

Finding a saddle point of (1) amounts to solving the necessary and sufficient first order optimality conditions, given by the following coupled inclusion problems

$$\begin{aligned} 0 \in \partial \left[ \Phi (\, \cdot \,, y^{*}) \right] (x^{*}) \quad \text {and} \quad 0 \in \text {-} \nabla _{y}\Phi (x^{*}, y^{*}) + \partial g(y^{*}). \end{aligned}$$

Remark 1

In case \(\Phi\) and g have full domain, \(\Psi\) is a convex-concave function with full domain and the set \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) is obviously closed. However, in order to allow more flexibility and to cover a wider range of problems (see also the last section with numerical experiments), our investigations are carried out in the more general setting given by the assumptions described above. Furthermore, these assumptions allow us to stay in the rigorous setting of the theory of convex-concave saddle functions as described by Rockafellar in [21] (see Definition 4 and Proposition 5 below).

Example 2

Consider the nonsmooth convex optimisation problem with inequality constraints

$$\begin{aligned} \begin{array}{rl} \min &{} f(x),\\ \text{ subject } \text{ to } &{} h_i(x) \le 0, i=1, ..., m\end{array} \end{aligned}$$
(5)

where \(f : {{\mathcal {H}}} \rightarrow {\mathbb {R}} \cup \{+\infty \}\) is a proper, convex and lower semicontinuous function and \(h_i : {{\mathcal {H}}} \rightarrow {\mathbb {R}}, i=1, ..., m,\) are convex and continuous functions. The Lagrangian attached to (5) reads

$$\begin{aligned} L : {{\mathcal {H}}} \times \mathbb {R}^m \rightarrow {\mathbb {R}} \cup \{+\infty \}, \quad L(x,\lambda _1, ..., \lambda _m) = f(x) + \sum _{i=1}^m \lambda _i h_i(x). \end{aligned}$$

Then the saddle point problem

$$\begin{aligned} \min _{x \in {{\mathcal {H}}}} \max _{(\lambda _1, ..., \lambda _m) \in \mathbb {R}_+^m} L(x,\lambda _1, ..., \lambda _m) = \min _{x \in {{\mathcal {H}}}} \max _{(\lambda _1, ..., \lambda _m) \in \mathbb {R}^m} L(x,\lambda _1, ..., \lambda _m) - \delta _{\mathbb {R}^m_+}(\lambda _1, ..., \lambda _m) \end{aligned}$$
(6)

exhibits the structure of saddle point problem (1). It is known that if \((x^{*}, \lambda _1^{*}, ...,\lambda _m^{*})\) is a saddle point of (6), then \(x^{*}\) is an optimal solution of the constrained convex optimisation problem (5) and \((\lambda _1^{*}, ...,\lambda _m^{*})\) is an optimal solution of its Lagrange dual.
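
To make the identification with (1) explicit (this is a direct reading of (6), not an additional assumption), one can take the coupling function and the regulariser as

$$\begin{aligned} \Phi (x, \lambda _1, ..., \lambda _m) = f(x) + \sum _{i=1}^m \lambda _i h_i(x) \quad \text{ and } \quad g(\lambda _1, ..., \lambda _m) = \delta _{\mathbb {R}^m_+}(\lambda _1, ..., \lambda _m). \end{aligned}$$

Then \(\nabla _{y}\Phi (x, \lambda _1, ..., \lambda _m) = (h_1(x), ..., h_m(x))\) does not depend on the multipliers, so (2) holds with \(L_{yy} = 0\) and with \(L_{yx}\) being a Lipschitz constant of \(x \mapsto (h_1(x), ..., h_m(x))\), provided such a constant exists; moreover, the proximal step with respect to g reduces to the projection onto \(\mathbb {R}^m_+\).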

1.2 Algorithm

The algorithm we investigate performs an optimistic gradient ascent step of \(\Phi\) followed by an evaluation of the proximal mapping of g in the variable y, while it carries out a purely proximal step of \(\Phi\) in x. We will call this method Optimistic Gradient Ascent – Proximal Point algorithm (OGAProx) in the following. For all \(k \ge 0\) we define

$$\begin{aligned} y_{k+1}&= {\text {prox}}^{}_{\sigma _{k} g}\left( y_{k} + \sigma _{k} \left[ (1 + \theta _{k}) \nabla _{y}\Phi (x_{k}, y_{k}) - \theta _{k} \nabla _{y}\Phi (x_{k-1}, y_{k-1})\right] \right) , \end{aligned}$$
(7)
$$\begin{aligned} x_{k+1}&= {\text {prox}}^{}_{\tau _{k} \Phi (\, \cdot \,, y_{k+1})}\left( x_{k} \right) , \end{aligned}$$
(8)

with the conventions \(x_{-1} := x_{0}\) and \(y_{-1} := y_{0}\) for starting points \(x_{0} \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) and \(y_{0} \in {{\,\mathrm{dom}\,}}g\). The particular choices of the sequences \((\sigma _{k})_{k \ge 0}, \, (\tau _{k})_{k \ge 0} \subseteq \mathbb {R}_{++}\) and \((\theta _{k})_{k \ge 0} \subseteq \left( 0, 1 \right]\) will be specified later.
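
To fix ideas, the following is a minimal Python sketch of iterations (7)–(8); the function and argument names (ogaprox, prox_g, prox_phi_x, grad_phi_y) are our own placeholders, and the user is assumed to supply the two proximal maps and the partial gradient as callables operating on NumPy arrays.

```python
import numpy as np

def ogaprox(x0, y0, prox_g, prox_phi_x, grad_phi_y, sigma, tau, theta, num_iter=100):
    """Sketch of OGAProx, iterations (7)-(8).

    prox_g(v, s)        -- proximal map of s * g evaluated at v         (assumed supplied)
    prox_phi_x(v, t, y) -- proximal map of t * Phi(., y) evaluated at v (assumed supplied)
    grad_phi_y(x, y)    -- gradient of Phi(x, .) at y                   (assumed supplied)
    sigma, tau, theta   -- callables k -> sigma_k, tau_k, theta_k
    """
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    g_prev = grad_phi_y(x, y)          # convention x_{-1} = x_0, y_{-1} = y_0
    for k in range(num_iter):
        g_cur = grad_phi_y(x, y)
        # optimistic (reflected) gradient ascent step followed by the prox of g, cf. (7)
        y = prox_g(y + sigma(k) * ((1 + theta(k)) * g_cur - theta(k) * g_prev), sigma(k))
        # proximal step of Phi(., y_{k+1}) in the x variable, cf. (8)
        x = prox_phi_x(x, tau(k), y)
        g_prev = g_cur
    return x, y
```

The step size callables can, for instance, return the constant parameters or the adaptive sequences specified later in Sect. 3; note that only \(\nabla _{y}\Phi\) and the two proximal maps are ever evaluated.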

1.3 Contribution

Let us summarise the main results of this paper:

  1. We introduce a novel algorithm to solve saddle point problems with a nonsmooth coupling function, which in general is not assumed to be linear in either component.

  2. We prove for the saddle function \(\Psi\) being

    (a) convex-concave (see Theorem 9):

      • weak convergence of the generated sequence \((x_{k}, y_{k})_{k \ge 0}\) to a saddle point \((x^{*}, y^{*})\) as \(k \rightarrow + \infty\);

      • convergence of the minimax gap \(\Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K})\) to zero like \(\mathcal {O}(\frac{1}{K})\) as \(K \rightarrow + \infty\), where \((\bar{x}_{K})_{K \ge 1}\) and \((\bar{y}_{K})_{K \ge 1}\) are the ergodic sequences obtained by averaging \((x_k)_{k \ge 1}\) and \((y_k)_{k \ge 1}\), respectively;

    (b) convex-strongly concave (see Theorem 12):

      • strong convergence of \((y_{k})_{k \ge 0}\) to \(y^{*}\) like \(\mathcal {O}(\frac{1}{k})\) as \(k \rightarrow + \infty\);

      • convergence of the minimax gap \(\Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K})\) to zero like \(\mathcal {O}(\frac{1}{K^{2}})\) as \(K \rightarrow + \infty\);

    (c) strongly convex-strongly concave (see Theorem 14):

      • linear convergence of \((x_{k}, y_{k})_{k \ge 0}\) to \((x^{*}, y^{*})\) like \(\mathcal {O}(\theta ^{k})\), with \(0 < \theta < 1\), as \(k \rightarrow + \infty\);

      • linear convergence of the minimax gap \(\Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K})\) to zero like \(\mathcal {O}(\theta ^{K})\) as \(K \rightarrow + \infty\).

2 Preliminaries

We recall some basic notions in convex analysis and monotone operator theory (see for example [1]). The real Hilbert spaces \({{\mathcal {H}}}\) and \({{\mathcal {G}}}\) are endowed with inner products \(\left\langle \, \cdot \,, \, \cdot \, \right\rangle _{{{\mathcal {H}}}}\) and \(\left\langle \, \cdot \,, \, \cdot \, \right\rangle _{{{\mathcal {G}}}}\), respectively. As it will be clear from the context which one is meant, we will drop the index for ease of notation and write \(\left\langle \, \cdot \,, \, \cdot \, \right\rangle\) for both. The norm induced by the respective inner products is defined by \(\left\Vert \, \cdot \, \right\Vert := \sqrt{\left\langle \, \cdot \,, \, \cdot \, \right\rangle }\).

A function \(f: {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) is said to be proper if \({{\,\mathrm{dom}\,}}f := \{x \in {{\mathcal {H}}}: f(x) < + \infty \} \ne \emptyset\). The (convex) subdifferential of the function \(f:{{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{ + \infty \}\) at \(x \in {{\mathcal {H}}}\) is defined by \(\partial f (x) := \{ u \in {{\mathcal {H}}}\quad \vert \, \left\langle y - x, u \right\rangle + f(x) \le f(y) \ \forall y \in {{\mathcal {H}}}\}\) if \(f(x) \in \mathbb {R}\) and by \(\partial f (x) := \emptyset\) otherwise. If the function f is convex and Fréchet differentiable at \(x \in {{\mathcal {H}}}\), then \(\partial f (x) = \{\nabla f (x)\}\). For the sum of a proper, convex and lower semicontinuous function \(f: {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) and a convex and Fréchet differentiable function \(h: {{\mathcal {H}}}\rightarrow \mathbb {R}\) we have \(\partial (f + h)(x) = \partial f(x) + \nabla h(x)\) for all \(x \in {{\mathcal {H}}}\). The subdifferential of the indicator function \(\delta _C\) of a nonempty closed convex set \(C \subseteq {{\mathcal {H}}}\), that is defined as \(\delta _C(x) = 0\) for \(x \in C\) and \(\delta _C(x) = +\infty\) otherwise, is denoted by \(N_C:=\partial \delta _C\) and is called the normal cone of the set C.

Let \(f: {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{ + \infty \}\) be proper, convex and lower semicontinuous. The proximal operator of f is defined by

$$\begin{aligned} {\text {prox}}^{}_{f}: {{\mathcal {H}}}\rightarrow {{\mathcal {H}}}, \quad {\text {prox}}^{}_{f}\left( x \right) := {{\,\mathrm{arg\,min}\,}}_{y \in {{\mathcal {H}}}} \left\{ f(y) + \frac{1}{2} \left\Vert y - x \right\Vert ^{2} \right\} . \end{aligned}$$

The proximal operator of the indicator function \(\delta _C\) of a nonempty closed convex set \(C \subseteq {{\mathcal {H}}}\) is the orthogonal projection \(P_C : {{\mathcal {H}}}\rightarrow C\) onto the set C.
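
Two standard closed-form instances (our own illustrations, not taken from the text), sketched in Python for vectors in \(\mathbb {R}^{n}\): the proximal map of a multiple of the \(\ell _{1}\)-norm is componentwise soft thresholding, and the proximal map of the indicator of the nonnegative orthant is the componentwise projection, in accordance with the statement above.

```python
import numpy as np

def prox_l1(x, lam):
    # proximal map of lam * ||.||_1: componentwise soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proj_nonneg(x):
    # proximal map of the indicator of the nonnegative orthant,
    # i.e. the orthogonal projection onto it
    return np.maximum(x, 0.0)
```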

A set-valued operator \(A: {{\mathcal {H}}}\rightrightarrows {{\mathcal {H}}}\) is said to be monotone if for all \((x, u)\), \((y, v) \in {{\,\mathrm{gra}\,}}A := \{ (z, w) \in {{\mathcal {H}}}\times {{\mathcal {H}}}\, \vert \, w \in A z \}\) we have \(\left\langle x - y, u - v \right\rangle \ge 0\). Furthermore, A is said to be maximal monotone if it is monotone and there exists no monotone operator \(B: {{\mathcal {H}}}\rightrightarrows {{\mathcal {H}}}\) such that \({{\,\mathrm{gra}\,}}A \subsetneqq {{\,\mathrm{gra}\,}}B\). The graph of a maximal monotone operator \(A: {{\mathcal {H}}}\rightrightarrows {{\mathcal {H}}}\) is sequentially closed in the strong \(\times\) weak topology, which means that if \((x_{k}, u_{k})_{k \ge 0}\) is a sequence in \({{\,\mathrm{gra}\,}}A\) such that \(x_{k} \rightarrow x\) and \(u_{k} \rightharpoonup u\) as \(k \rightarrow +\infty\), then \((x, u) \in {{\,\mathrm{gra}\,}}A\). The notation \(u_{k} \rightharpoonup u\) as \(k \rightarrow +\infty\) is used to denote convergence of the sequence \((u_k)_{k \ge 0}\) to u in the weak topology.

To show weak convergence of sequences in Hilbert spaces we use the following so-called Opial Lemma.

Lemma 3

(Opial Lemma [20]) Let \(C \subseteq {{\mathcal {H}}}\) be a nonempty set and \((x_{k})_{k \ge 0}\) a sequence in \({{\mathcal {H}}}\) such that the following two conditions hold:

  (a) for every \(x \in C\), \(\lim _{k \rightarrow + \infty } \left\Vert x_{k} - x \right\Vert\) exists;

  (b) every weak sequential cluster point of \((x_{k})_{k \ge 0}\) belongs to C.

Then \((x_{k})_{k \ge 0}\) converges weakly to an element in C.

In the following definition we adjust the term proper to the saddle point setting and refer to [21] for further considerations related to saddle functions.

Definition 4

A function \(\Psi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{ \pm \infty \}\) is called a saddle function if \(\Psi (\, \cdot \,, y)\) is convex for all \(y \in {{\mathcal {G}}}\) and \(\Psi (x, \, \cdot \,)\) is concave for all \(x \in {{\mathcal {H}}}\). A saddle function \(\Psi\) is called proper if there exists \((x', y') \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) such that \(\Psi (x', y) < + \infty\) for all \(y \in {{\mathcal {G}}}\) and \(- \infty < \Psi (x, y')\) for all \(x \in {{\mathcal {H}}}\).

We conclude the preliminary section with a useful result regarding the minimax objective from (1).

Proposition 5

The function \(\Psi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{ \pm \infty \}\) defined via (3) is a proper saddle function such that \(\Psi (\, \cdot \,, y)\) is lower semicontinuous for each \(y \in {{\mathcal {G}}}\) and \(\Psi (x, \, \cdot \,)\) is upper semicontinuous for each \(x \in {{\mathcal {H}}}\). Consequently, the operator

$$\begin{aligned} (x,y) \mapsto \partial [\Psi (\, \cdot \,,y)](x) \times \partial [\text {-} \Psi (x, \, \cdot \,)](y) \end{aligned}$$

is maximal monotone.

Proof

We choose \((x', y') \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and distinguish four cases.

Firstly, we look at the case \(y' \notin {{\,\mathrm{dom}\,}}g\). Then

$$\begin{aligned} \Psi (x, y') = {\left\{ \begin{array}{ll} - \infty &{} \text {if } x \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ),\\ + \infty &{} \text {if } x \notin {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ), \end{array}\right. } \end{aligned}$$

thus \(x \mapsto \Psi (x,y')\) is convex and lower semicontinuous, since \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) is convex and closed. Secondly, if \(y' \in {{\,\mathrm{dom}\,}}g\), then \(g(y') \in \mathbb {R}\) and

$$\begin{aligned} \Psi (x, y') = \Phi (x,y') - g(y') \ \forall x \in {{\mathcal {H}}}, \end{aligned}$$

which means that \(x \mapsto \Psi (x,y')\) is convex and lower semicontinuous. This proves that \(\Psi (\, \cdot \,, y)\) is convex and lower semicontinuous for all \(y \in {{\mathcal {G}}}\).

On the other hand, if \(x' \notin {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\), then

$$\begin{aligned} \Psi (x', y) = + \infty \ \forall y \in {{\mathcal {G}}}, \end{aligned}$$

which means that \(y \mapsto \Psi (x', y)\) is upper semicontinuous and concave. Finally, if \(x' \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\), then

$$\begin{aligned} \Phi (x', y) \in \mathbb {R}\ \text{ and } \ -\Psi (x', y) = -\Phi (x',y) + g(y) \ \forall y \in {{\mathcal {G}}}. \end{aligned}$$

Hence \(y \mapsto -\Psi (x', y)\) is proper, convex and lower semicontinuous, and so \(y \mapsto \Psi (x', y)\) is concave and upper semicontinuous. This proves that \(\Psi (x, \, \cdot \,)\) is concave and upper semicontinuous for all \(x \in {{\mathcal {H}}}\).

Moreover, \(\Psi\) is a proper saddle function. By assumption we have \(g(y) > - \infty\) for all \(y \in {{\mathcal {G}}}\) and there exists \(x' \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \ne \emptyset\) such that \({{\,\mathrm{dom}\,}}\Phi (x', \, \cdot \,) = {{\mathcal {G}}}\). Thus

$$\begin{aligned} \Psi (x', y) = \Phi (x', y) - g(y) < + \infty \ \forall y \in {{\mathcal {G}}}. \end{aligned}$$

Furthermore, by assumption there exists \(y' \in {{\,\mathrm{dom}\,}}g \subseteq {{\mathcal {G}}}\) such that \(g(y') < + \infty\) and for all \(x \in {{\mathcal {H}}}\) we have \(\Phi (x, y') > - \infty\). Hence,

$$\begin{aligned} \Psi (x, y') = \Phi (x, y') - g(y') > - \infty \ \forall x \in {{\mathcal {H}}}. \end{aligned}$$

The maximal monotonicity of \((x,y) \mapsto \partial [\Psi (\, \cdot \,,y)](x) \times \partial [\text {-} \Psi (x, \, \cdot \,)](y)\) follows from Corollary 1 and Theorem 3 in [21, pp. 248–249]. \(\square\)

3 Convex-(strongly) concave setting

First we will treat the case when the coupling function \(\Phi\) is convex-concave and g is convex with modulus \(\nu \ge 0\). In the case \(\nu = 0\) this corresponds to \(\Psi (x,y) = \Phi (x,y) - g(y)\) being convex-concave, while for \(\nu > 0\) the saddle function \(\Psi (x, y)\) is convex-strongly concave.

We will start by stating two assumptions on the step sizes of the algorithm which will be needed in the convergence analysis. These will be followed by a unified preparatory analysis for general \(\nu \ge 0\) that will serve as the basis for showing convergence of the iterates as well as of the minimax gap. After that we will introduce a choice of parameters that satisfies the aforementioned assumptions. The section will be closed by convergence results for the convex-concave (\(\nu = 0\)) and the convex-strongly concave (\(\nu > 0\)) setting.

Assumption 1

We assume that the step sizes \(\tau _{k}\), \(\sigma _{k}\) and the momentum parameter \(\theta _{k}\) satisfy

$$\begin{aligned} \tau _{k+1} \ge \frac{\tau _{k}}{\theta _{k+1}} \quad \text{ and } \quad \sigma _{k+1} \ge \frac{\sigma _{k}}{\theta _{k+1} (1 + \nu \sigma _{k})} \quad \text {for all } k \ge 0. \end{aligned}$$
(9)

Furthermore, we assume that there exist \(\delta > 0\) and \((\alpha _{k})_{k \ge 0} \subseteq \mathbb {R}_{++}\) such that

$$\begin{aligned} \frac{1 - \delta }{\tau _{k}} \ge \frac{L_{yx}}{\alpha _{k+1}} \quad \text{ and } \quad \frac{1 - \delta }{\sigma _{k}} \ge L_{yx} \alpha _{k} \theta _{k} + L_{yy} (1 + \theta _{k}) \quad \text {for all } k \ge 0, \end{aligned}$$
(10)

where \(\theta _{0} := 1\).

3.1 Preliminary considerations

In this subsection we will make some preliminary considerations that will play an important role when proving the convergence properties of the numerical scheme given by (7)–(8). For all \(k \ge 0\) we will use the notations

$$\begin{aligned} q_{k} := \nabla _{y}\Phi (x_{k}, y_{k}) - \nabla _{y}\Phi (x_{k-1}, y_{k-1}) \ \text{ and } \ s_{k} := \theta _{k} q_{k} + \nabla _{y}\Phi (x_{k}, y_{k}). \end{aligned}$$
(11)

We take an arbitrary \((x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and let \(k \ge 0\) be fixed. From (7) we derive

$$\begin{aligned} 0 \in \partial g(y_{k+1}) + \frac{1}{\sigma _{k}} (y_{k+1} - y_{k}) - s_{k}, \end{aligned}$$
(12)

and, as g is convex with modulus \(\nu\), this implies

$$\begin{aligned} \begin{aligned} g(y)&\ge g(y_{k+1}) + \left\langle s_{k}, y - y_{k+1} \right\rangle + \frac{1}{\sigma _{k}} \left\langle y_{k} - y_{k+1}, y - y_{k+1} \right\rangle + \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2}\\&= g(y_{k+1}) + \left\langle s_{k}, y - y_{k+1} \right\rangle + \frac{1}{2\sigma _{k}} \left( \left\Vert y_{k} - y_{k+1} \right\Vert ^{2} + \left\Vert y - y_{k+1} \right\Vert ^{2} - \left\Vert y - y_{k} \right\Vert ^{2}\right) + \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2}. \end{aligned} \end{aligned}$$
(13)

From (8) we get

$$\begin{aligned} 0 \in \partial [\Phi (\, \cdot \,, y_{k+1})](x_{k+1}) + \frac{1}{\tau _{k}} (x_{k+1} - x_{k}), \end{aligned}$$
(14)

hence the convexity of \(\Phi (\, \cdot \,, y)\) for \(y \in {{\,\mathrm{dom}\,}}g\) yields

$$\begin{aligned} \begin{aligned} \Phi (x,y_{k+1})&\ge \Phi (x_{k+1}, y_{k+1}) + \frac{1}{\tau _{k}} \left\langle x_{k} - x_{k+1}, x - x_{k+1} \right\rangle \\&= \Phi (x_{k+1}, y_{k+1}) + \frac{1}{2\tau _{k}} \left( \left\Vert x_{k} - x_{k+1} \right\Vert ^{2} + \left\Vert x - x_{k+1} \right\Vert ^{2} - \left\Vert x - x_{k} \right\Vert ^{2}\right) . \end{aligned} \end{aligned}$$
(15)

Combining (13) and (15) we obtain

$$\begin{aligned} \begin{aligned} \Psi (x_{k+1},y)-\Psi (x,y_{k+1})&= \Phi (x_{k+1}, y) - g(y) - \Phi (x, y_{k+1}) + g(y_{k+1})\\&\le \Phi (x_{k+1}, y) - \Phi (x, y_{k+1}) + \left\langle s_{k}, y_{k+1} - y \right\rangle - \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2}\\&\quad + \frac{1}{2\sigma _{k}} \left( {-} \left\Vert y_{k} - y_{k+1} \right\Vert ^{2} - \left\Vert y - y_{k+1} \right\Vert ^{2} + \left\Vert y - y_{k} \right\Vert ^{2} \right) \\&\le \Phi (x_{k+1}, y) - \Phi (x_{k+1}, y_{k+1}) + \left\langle s_{k}, y_{k+1} - y \right\rangle - \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2}\\&\quad + \frac{1}{2\tau _{k}} \left( {-} \left\Vert x_{k} - x_{k+1} \right\Vert ^{2} - \left\Vert x - x_{k+1} \right\Vert ^{2} + \left\Vert x - x_{k} \right\Vert ^{2} \right) \\&\quad + \frac{1}{2\sigma _{k}} \left( {-} \left\Vert y_{k} - y_{k+1} \right\Vert ^{2} - \left\Vert y - y_{k+1} \right\Vert ^{2} + \left\Vert y - y_{k} \right\Vert ^{2} \right) , \end{aligned} \end{aligned}$$

which, together with the concavity of \(\Phi\) in the second variable and (11), gives

$$\begin{aligned} \begin{aligned} \Psi (x_{k+1},y)-\Psi (x,y_{k+1}) \le&\ \theta _{k} \left\langle q_{k}, y_{k+1}-y \right\rangle - \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2}\\&- \left\langle \nabla _{y}\Phi (x_{k+1}, y_{k+1}), y_{k+1} - y \right\rangle + \left\langle \nabla _{y}\Phi (x_{k}, y_{k}), y_{k+1} - y \right\rangle \\&+ \frac{1}{2\tau _{k}} \left( {-} \left\Vert x_{k} - x_{k+1} \right\Vert ^{2} - \left\Vert x - x_{k+1} \right\Vert ^{2} + \left\Vert x - x_{k} \right\Vert ^{2} \right) \\&+ \frac{1}{2\sigma _{k}} \left( {-} \left\Vert y_{k} - y_{k+1} \right\Vert ^{2} - \left\Vert y - y_{k+1} \right\Vert ^{2} + \left\Vert y - y_{k} \right\Vert ^{2} \right) \\ =&\ {-} \left\langle q_{k+1}, y_{k+1} - y \right\rangle + \theta _{k} \left\langle q_{k}, y_{k} - y \right\rangle - \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2}\\&+ \frac{1}{2\tau _{k}} \left( {-} \left\Vert x_{k} - x_{k+1} \right\Vert ^{2} - \left\Vert x - x_{k+1} \right\Vert ^{2} + \left\Vert x - x_{k} \right\Vert ^{2} \right) \\&+ \frac{1}{2\sigma _{k}} \left( {-} \left\Vert y_{k} - y_{k+1} \right\Vert ^{2} - \left\Vert y - y_{k+1} \right\Vert ^{2} + \left\Vert y - y_{k} \right\Vert ^{2} \right) \\&+ \theta _{k} \left\langle q_{k}, y_{k+1} - y_{k} \right\rangle . \end{aligned} \end{aligned}$$
(16)

By using (2) we can estimate the last term in the above expression as follows

$$\begin{aligned} \begin{aligned} \vert \left\langle q_k, y-y_{k} \right\rangle \vert&\le \left\Vert q_k \right\Vert \, \left\Vert y-y_{k} \right\Vert \le \left( L_{yx} \left\Vert x_{k}-x_{k-1} \right\Vert + L_{yy} \left\Vert y_{k}-y_{k-1} \right\Vert \right) \left\Vert y-y_{k} \right\Vert \\&\le \frac{L_{yx}}{2} \left( \alpha _{k} \left\Vert y-y_{k} \right\Vert ^{2} + \frac{1}{\alpha _{k}} \left\Vert x_{k}-x_{k-1} \right\Vert ^{2} \right) + \frac{L_{yy}}{2} \left( \left\Vert y-y_{k} \right\Vert ^{2} + \left\Vert y_{k}-y_{k-1} \right\Vert ^{2} \right) , \end{aligned} \end{aligned}$$
(17)

with \(\alpha _{k} > 0\) chosen such that (10) holds.

Writing (17) for \(y := y_{k+1}\) and combining the resulting inequality with (16) we derive

$$\begin{aligned} \Psi (x_{k+1},y) - \Psi (x,y_{k+1}) \le a_{k}(x, y) - b_{k+1}(x, y) - c_{k}, \end{aligned}$$
(18)

where

$$\begin{aligned} a_k(x, y)&:= \frac{1}{2\tau _{k}} \left\Vert x-x_{k} \right\Vert ^{2} + \frac{1}{2\sigma _{k}} \left\Vert y-y_{k} \right\Vert ^{2} + \theta _{k} \left\langle q_k, y_{k}-y \right\rangle + \theta _{k} \frac{L_{yx}}{2\alpha _{k}} \left\Vert x_{k}-x_{k-1} \right\Vert ^{2}\\&\quad + \theta _{k} \frac{L_{yy}}{2} \left\Vert y_{k}-y_{k-1} \right\Vert ^{2}, \\ b_{k+1}(x, y)&:= \frac{1}{2\tau _{k}} \left\Vert x-x_{k+1} \right\Vert ^{2} + \frac{1}{2} \left( \frac{1}{\sigma _{k}} + \nu \right) \left\Vert y-y_{k+1} \right\Vert ^{2} + \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&\quad + \frac{L_{yx}}{2\alpha _{k+1}} \left\Vert x_{k+1}-x_{k} \right\Vert ^{2} + \frac{L_{yy}}{2} \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}, \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} c_{k}&:= \frac{1}{2} \left( \frac{1}{\tau _{k}} - \frac{L_{yx}}{\alpha _{k+1}} \right) \left\Vert x_{k+1}-x_{k} \right\Vert ^{2} + \frac{1}{2} \left( \frac{1}{\sigma _{k}} - L_{yy} - \theta _{k} (L_{yx} \alpha _{k} + L_{yy}) \right) \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}. \end{aligned} \end{aligned}$$

Now, let us define for all \(k \ge 0\)

$$\begin{aligned} t_{k} := \frac{\theta _{0}}{\theta _{0} \theta _{1} \cdots \theta _{k}} \end{aligned}$$
(19)

and notice that

$$\begin{aligned} \frac{t_{k}}{t_{k+1}} = \theta _{k+1}. \end{aligned}$$

Relation (9) from Assumption 1 is equivalent to

$$\begin{aligned} \frac{t_{k}}{\tau _{k}} \ge \frac{t_{k+1}}{\tau _{k+1}} \quad \text{ and } \quad t_{k}\left( \frac{1}{\sigma _{k}} + \nu \right) \ge \frac{t_{k+1}}{\sigma _{k+1}}, \end{aligned}$$
(20)

which will be used in telescoping arguments in the following.

Let \(K \ge 1\) and denote

$$\begin{aligned} T_{K} := \sum _{k=0}^{K-1} t_{k}, \quad \bar{x}_{K} := \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k}x_{k+1}, \quad \bar{y}_{K} := \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k}y_{k+1}. \end{aligned}$$
(21)

Multiplying both sides of (18) by \(t_{k} > 0\) as defined in (19) and summing the resulting inequalities for \(k = 0, \ldots , K-1\) gives

$$\begin{aligned} \sum _{k=0}^{K-1} t_{k} \left( \Psi (x_{k+1}, y) - \Psi (x, y_{k+1}) \right) \le \sum _{k=0}^{K-1} t_{k} \left( a_{k}(x, y) - b_{k+1}(x, y) - c_{k} \right) . \end{aligned}$$

By Jensen’s inequality, as \((u, v) \mapsto \Psi (u, y) - \Psi (x, v)\) is a convex function, we obtain

$$\begin{aligned} T_{K} \left( \Psi (\bar{x}_{K},y) - \Psi (x,\bar{y}_{K}) \right) \le \sum _{k=0}^{K-1} t_{k} \left( \Psi (x_{k+1}, y) - \Psi (x, y_{k+1}) \right) , \end{aligned}$$

and thus

$$\begin{aligned} T_{K} \left( \Psi (\bar{x}_{K},y) - \Psi (x,\bar{y}_{K}) \right) \le \sum _{k=0}^{K-1} t_{k} \left( a_{k}(x, y) - b_{k+1}(x, y) - c_{k} \right) . \end{aligned}$$
(22)

Furthermore, using (20), we get for all \(k \ge 0\)

$$\begin{aligned} \begin{aligned} t_{k} b_{k+1}(x, y)&= \frac{t_{k}}{2\tau _{k}} \left\Vert x-x_{k+1} \right\Vert ^{2} + \frac{t_{k}}{2} \left( \frac{1}{\sigma _{k}} + \nu \right) \left\Vert y-y_{k+1} \right\Vert ^{2} + t_{k} \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&\quad + t_{k} \frac{L_{yx}}{2\alpha _{k+1}} \left\Vert x_{k+1}-x_{k} \right\Vert ^{2} + t_{k} \frac{L_{yy}}{2} \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}\\&\ge \frac{t_{k+1}}{2\tau _{k+1}} \left\Vert x-x_{k+1} \right\Vert ^{2} + \frac{t_{k+1}}{2\sigma _{k+1}} \left\Vert y-y_{k+1} \right\Vert ^{2} + t_{k+1} \theta _{k+1} \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&\quad + t_{k+1} \theta _{k+1} \frac{L_{yx}}{2\alpha _{k+1}} \left\Vert x_{k+1}-x_{k} \right\Vert ^{2} + t_{k+1} \theta _{k+1} \frac{L_{yy}}{2} \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}\\&= t_{k+1} a_{k+1}(x, y). \end{aligned} \end{aligned}$$

Notice that by (10) in Assumption 1 there exists \(\delta > 0\) such that for all \(k \ge 0\)

$$\begin{aligned} c_{k} \ge \delta \left( \frac{1}{2\tau _{k}} \left\Vert x_{k+1}-x_{k} \right\Vert ^{2} + \frac{1}{2\sigma _{k}} \left\Vert y_{k+1}-y_{k} \right\Vert ^{2} \right) \ge 0. \end{aligned}$$
(23)

For the following recall that \(x_{-1} = x_{0}\) and \(y_{-1} = y_{0}\), which implies \(q_{0} = 0\). By using the above two inequalities in (22) and writing (17) for \(k = K\) we obtain

$$\begin{aligned} \begin{aligned} \Psi (\bar{x}_{K},y) - \Psi (x,\bar{y}_{K})&\le \frac{1}{T_{K}} \sum _{k=0}^{K-1} \left( t_{k} a_{k}(x, y) - t_{k+1} a_{k+1}(x, y) \right) = \frac{1}{T_{K}} \left( t_{0} a_{0}(x, y) - t_{K} a_{K}(x, y) \right) \\&= \frac{1}{T_{K}} \left( \frac{t_{0}}{2 \tau _{0}} \left\Vert x-x_{0} \right\Vert ^{2} + \frac{t_{0}}{2 \sigma _{0}} \left\Vert y-y_{0} \right\Vert ^{2} \right) - \frac{t_{K}}{T_{K}} \left( \frac{1}{2 \tau _{K}} \left\Vert x-x_{K} \right\Vert ^{2} + \frac{1}{2 \sigma _{K}} \left\Vert y-y_{K} \right\Vert ^{2} \right) \\&\quad - \frac{t_{K} \theta _{K}}{T_{K}} \left( \left\langle q_{K}, y_{K}-y \right\rangle + \frac{L_{yx}}{2\alpha _{K}} \left\Vert x_{K}-x_{K-1} \right\Vert ^{2} + \frac{L_{yy}}{2} \left\Vert y_{K}-y_{K-1} \right\Vert ^{2} \right) \\&\le \frac{1}{T_{K}} \left( \frac{t_{0}}{2 \tau _{0}} \left\Vert x-x_{0} \right\Vert ^{2} + \frac{t_{0}}{2 \sigma _{0}} \left\Vert y-y_{0} \right\Vert ^{2} \right) \\&\quad - \frac{t_{K}}{T_{K}} \left( \frac{1}{2 \tau _{K}} \left\Vert x-x_{K} \right\Vert ^{2} + \frac{1}{2} \left( \frac{1}{\sigma _{K}} - \theta _{K} ( L_{yx} \alpha _{K} + L_{yy}) \right) \left\Vert y-y_{K} \right\Vert ^{2} \right) . \end{aligned} \end{aligned}$$
(24)

By definition we have \(t_{0} = 1\), and by (10) the last term of the above inequality is nonpositive; hence the following estimate for the minimax gap function evaluated at the ergodic sequences holds

$$\begin{aligned} \Psi (\bar{x}_{K},y) - \Psi (x,\bar{y}_{K}) \le \frac{1}{T_{K}} \left( \frac{1}{2 \tau _{0}} \left\Vert x-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y-y_{0} \right\Vert ^{2} \right) \ \forall K \ge 1. \end{aligned}$$
(25)

With these considerations at hand, in particular (18), (24) and (25), we will be able to obtain convergence statements for the two settings \(\nu = 0\) and \(\nu > 0\).

3.2 Fulfilment of step size assumptions

In this subsection we will investigate a particular choice of parameters fulfilling Assumption 1, suitable for both cases \(\nu = 0\) and \(\nu > 0\).

Proposition 6

Let \(\nu \ge 0\), \(c_{\alpha } > L_{yx} \ge 0\), \(\theta _{0} = 1\) and \(\tau _{0}, \, \sigma _{0} > 0\) such that

$$\begin{aligned} \left( c_{\alpha } L_{yx} \tau _{0} + 2 L_{yy} \right) \sigma _{0} < 1. \end{aligned}$$

We define

$$\begin{aligned} \theta _{k+1} := \frac{1}{\sqrt{1 + \nu \sigma _{k}}}, \quad \tau _{k+1} := \frac{\tau _{k}}{\theta _{k+1}}, \quad \sigma _{k+1} := \theta _{k+1} \sigma _{k} \quad \text {for all } k \ge 0. \end{aligned}$$
(26)

Then the sequences \((\tau _{k})_{k \ge 0}\), \((\sigma _{k})_{k \ge 0}\) and \((\theta _{k})_{k \ge 0}\) fulfil (9) in Assumption 1 with equality and (10) for

$$\begin{aligned} \alpha _{k} := \left\{ \begin{array}{ll} c_{\alpha } \tau _{0} &{} \text {if } k = 0, \\ c_{\alpha } \tau _{k-1} &{} \text {if } k \ge 1, \end{array} \right. \end{aligned}$$
(27)

and

$$\begin{aligned} \delta := \min \left\{ 1 - \frac{L_{yx}}{c_{\alpha }}, \, 1 - \left( c_{\alpha } L_{yx} \tau _{0} + 2 L_{yy} \right) \sigma _{0} \right\} > 0. \end{aligned}$$
(28)

Furthermore, for \((t_k)_{k \ge 0}\) defined as in  (19) we have

$$\begin{aligned} t_{k} = \frac{\theta _{0}}{\theta _{0} \theta _{1} \cdots \theta _{k}} = \frac{\tau _{k}}{\tau _{0}} \quad \forall k \ge 0. \end{aligned}$$
(29)

Proof

First, we show that the particular choice (26) fulfils (9) in Assumption 1 with equality. We see that for all \(k \ge 0\)

$$\begin{aligned} \tau _{k+1} = \frac{\tau _{k}}{\theta _{k+1}}, \end{aligned}$$

as well as

$$\begin{aligned} \sigma _{k+1} = \theta _{k+1} \sigma _{k} = \frac{\sigma _{k}}{\theta _{k+1} \frac{1}{\theta _{k+1}^{2}}} = \frac{\sigma _{k}}{\theta _{k+1} (1 + \nu \sigma _{k})}, \end{aligned}$$

follow directly from the definition (26).

Next, we show that (10) in Assumption 1 holds for \(\delta\) defined in (28) with the choices (26) and (27). The first inequality of (10) is equivalent to

$$\begin{aligned} 1 - \delta \ge \frac{L_{yx}}{\alpha _{k+1}} \tau _{k} = \frac{L_{yx}}{c_{\alpha }} \quad \forall k \ge 0, \end{aligned}$$

which clearly is fulfilled as

$$\begin{aligned} \delta \le 1 - \frac{L_{yx}}{c_{\alpha }}. \end{aligned}$$

On the other hand, the second inequality of (10) is equivalent to

$$\begin{aligned} 1 - \delta \ge L_{yx} \alpha _{k} \theta _{k} \sigma _{k} + L_{yy} (1 + \theta _{k}) \sigma _{k} \quad \forall k \ge 0. \end{aligned}$$

By definition of the step size parameters (26) we have for all \(k \ge 0\)

$$\begin{aligned} \tau _{k+1} \sigma _{k+1} = \tau _{0} \sigma _{0}, \quad \theta _{k+1} \le 1 = \theta _{0}, \quad \sigma _{k+1} \le \sigma _{0}, \quad \theta _{k+1} \tau _{k+1} = \tau _{k}, \end{aligned}$$

and thus

$$\begin{aligned} \begin{aligned} 1 - \delta&\ge L_{yx} \alpha _{0} \theta _{0} \sigma _{0} + L_{yy} (1 + \theta _{0}) \sigma _{0} = c_{\alpha } L_{yx} \tau _{0} \sigma _{0} + 2 L_{yy} \sigma _{0} \ge c_{\alpha } L_{yx} \theta _{k+1}^{2} \tau _{k+1} \sigma _{k+1} + L_{yy} (1 + \theta _{k+1}) \sigma _{k+1}\\&= L_{yx} c_{\alpha } \tau _{k} \theta _{k+1} \sigma _{k+1} + L_{yy} (1 + \theta _{k+1}) \sigma _{k+1} = L_{yx} \alpha _{k+1} \theta _{k+1} \sigma _{k+1} + L_{yy} (1 + \theta _{k+1}) \sigma _{k+1}. \end{aligned} \end{aligned}$$

The first inequality in this chain holds since

$$\begin{aligned} \delta \le 1 - \left( c_{\alpha } L_{yx} \tau _{0} + 2 L_{yy} \right) \sigma _{0}. \end{aligned}$$

Finally, using the definition of \(t_{k}\) and (26) we conclude that for all \(k \ge 0\)

$$\begin{aligned} t_{k} = \frac{\theta _{0}}{\theta _{0} \theta _{1} \cdots \theta _{k}} = \frac{\frac{\tau _{0}}{\tau _{0}}}{ \frac{\tau _{0}}{\tau _{0}} \frac{\tau _{0}}{\tau _{1}} \cdots \frac{\tau _{k-1}}{\tau _{k}} } = \frac{\tau _{k}}{\tau _{0}}. \end{aligned}$$

\(\square\)
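
For illustration, the update rule (26)–(27) can be generated as follows (a minimal sketch; the helper name is ours and \(\tau _{0}\), \(\sigma _{0}\) are assumed to satisfy the initial condition of Proposition 6):

```python
import math

def ogaprox_parameters(nu, tau0, sigma0, num_iter):
    """Sketch of the parameter choice (26): returns the lists
    (tau_k), (sigma_k), (theta_k) for k = 0, ..., num_iter."""
    taus, sigmas, thetas = [tau0], [sigma0], [1.0]      # theta_0 := 1
    for _ in range(num_iter):
        theta = 1.0 / math.sqrt(1.0 + nu * sigmas[-1])  # theta_{k+1}
        thetas.append(theta)
        taus.append(taus[-1] / theta)                   # tau_{k+1} = tau_k / theta_{k+1}
        sigmas.append(theta * sigmas[-1])               # sigma_{k+1} = theta_{k+1} * sigma_k
    return taus, sigmas, thetas
```

For \(\nu = 0\) this reproduces the constant parameters of Proposition 8 below, while for \(\nu > 0\) it generates the adaptive sequences of Proposition 10; the corresponding \(\alpha _{k}\) from (27) is \(c_{\alpha } \tau _{k-1}\) for \(k \ge 1\) (and \(c_{\alpha } \tau _{0}\) for \(k = 0\)).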

Remark 7

The choice \(L_{yy} = 0\) in (2), which was considered in [11] in the convex-strongly concave setting, corresponds to the case when the coupling function \(\Phi\) is linear in y. We will prove convergence also for positive \(L_{yy}\), which makes our algorithm applicable to a much wider range of problems, as we will see in the section with the numerical experiments.

When the coupling function \(\Phi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\) is bilinear, that is, \(\Phi (x, y) = \left\langle y, Ax \right\rangle\) for some nonzero continuous linear operator \(A : {{\mathcal {H}}}\rightarrow {{\mathcal {G}}}\), then we are in the setting of [4]. In this situation one can choose \(L_{yy} = 0\) and \(L_{yx} = \Vert A\Vert\), and (28) yields

$$\begin{aligned} \delta = \min \left\{ 1 - \frac{\Vert A\Vert }{c_{\alpha }}, \, 1 - c_{\alpha } \Vert A\Vert \tau _{0} \sigma _{0} \right\} , \end{aligned}$$

with \(c_{\alpha } > \Vert A\Vert\). To guarantee \(\delta > 0\) we fix \(0< \varepsilon < 1\) and set

$$\begin{aligned} c_{\alpha } = (1 - \varepsilon )^{-1} \Vert A\Vert . \end{aligned}$$

Hence, we need to satisfy

$$\begin{aligned} \tau _{0} \sigma _{0}\Vert A\Vert ^{2} < 1 - \varepsilon , \end{aligned}$$

which closely resembles the step size condition of [4, Algorithm 2]. Since \({\text {prox}}^{}_{\gamma \Phi (\cdot , y)}\left( x \right) = x - \gamma A^*y\) for all \((x,y)\in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and all \(\gamma > 0\), our OGAProx scheme becomes the primal-dual algorithm PDHG from [4].
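
Spelling this out, substituting \(\nabla _{y}\Phi (x,y) = Ax\) and the above formula for the proximal map into (7)–(8) gives

$$\begin{aligned} y_{k+1} = {\text {prox}}^{}_{\sigma _{k} g}\left( y_{k} + \sigma _{k} A \left[ x_{k} + \theta _{k} (x_{k} - x_{k-1}) \right] \right) , \quad x_{k+1} = x_{k} - \tau _{k} A^{*} y_{k+1}, \end{aligned}$$

which for constant parameters and \(\theta _{k} = 1\) is the PDHG iteration for \(\min _{x} \max _{y} \left\langle y, Ax \right\rangle - g(y)\) with extrapolation carried out in the x variable.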

3.3 Convergence results

In this subsection we combine the preliminary considerations with the choice of parameters (26) from Proposition 6.

We will start with the case \(\nu = 0\) and constant step sizes, which gives weak convergence of the iterates to a saddle point \((x^{*}, y^{*})\) and convergence of the minimax gap evaluated at the ergodic iterates to zero like \(\mathcal {O}(\frac{1}{K})\). Afterwards we will consider the case \(\nu > 0\), which leads to an accelerated version of the algorithm with improved convergence results. In this setting we obtain convergence of \((y_{k})_{k \ge 0}\) to \(y^{*}\) like \(\mathcal {O}(\frac{1}{k})\) and convergence of the minimax gap evaluated at the ergodic iterates to zero like \(\mathcal {O}(\frac{1}{K^{2}})\).

3.3.1 Convex-concave setting

For the following we assume that the function g is convex with modulus \(\nu = 0\), meaning it is merely convex. Using the results of the previous subsection we will show that with the choice (26) all the parameters are constant.

Proposition 8

Let \(c_{\alpha } > L_{yx} \ge 0\) and \(\tau , \, \sigma > 0\) such that

$$\begin{aligned} \left( c_{\alpha } L_{yx} \tau + 2 L_{yy} \right) \sigma < 1. \end{aligned}$$

If \(\nu = 0\), then the sequences \((\tau _{k})_{k \ge 0}\), \((\sigma _{k})_{k \ge 0}\) and \((\theta _{k})_{k \ge 0}\) as defined in Proposition 6 are constant, in particular we have

$$\begin{aligned} \tau _{k} = \tau _{0} := \tau , \quad \sigma _{k} = \sigma _{0} := \sigma , \quad \theta _{k} = \theta _{0} = 1 \quad \text {for all } k \ge 0. \end{aligned}$$
(30)

Proof

As \(\nu = 0\), (26) gives for all \(k \ge 0\)

$$\begin{aligned} \theta _{k+1} = \frac{1}{\sqrt{1 + \nu \sigma _{k}}} = 1, \quad \tau _{k+1} = \frac{\tau _{k}}{\theta _{k+1}} = \tau _{0}, \quad \sigma _{k+1} = \theta _{k+1} \sigma _{k} = \sigma _{0}. \end{aligned}$$

\(\square\)

Next we will state and prove the convergence results in the convex-concave case.

Theorem 9

Let \(c_{\alpha } > L_{yx} \ge 0\) and \(\tau , \, \sigma > 0\) such that

$$\begin{aligned} \left( c_{\alpha } L_{yx} \tau + 2 L_{yy} \right) \sigma < 1. \end{aligned}$$

Then the sequence \((x_{k}, y_{k})_{k \ge 0}\) generated by OGAProx with the choice of constant parameters as in Proposition 8, namely,

$$\begin{aligned} \tau _{k} = \tau _{0} := \tau , \quad \sigma _{k} = \sigma _{0} := \sigma , \quad \theta _{k} = \theta _{0} = 1 \quad \text {for all } k \ge 0, \end{aligned}$$

converges weakly to a saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) of (1). Furthermore, let \(K \ge 1\) and denote

$$\begin{aligned} \bar{x}_{K} = \frac{1}{K} \sum _{k=0}^{K-1} x_{k+1} \quad \text{ and } \quad \bar{y}_{K} = \frac{1}{K} \sum _{k=0}^{K-1} y_{k+1}. \end{aligned}$$

Then for all \(K \ge 1\) and any saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) of (1) we have

$$\begin{aligned} 0 \le \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \le \frac{1}{K} \left( \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) . \end{aligned}$$

Proof

First we will show weak convergence of the sequence of iterates \((x_{k}, y_{k})_{k \ge 0}\) to some saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) of (1). For this we will use the Opial Lemma (see Lemma 3).

Let \(k \ge 0\) and \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point. From (18) together with the choice (30) of constant parameters \(\theta _{k} = 1\), \(\tau _{k} = \tau\), \(\sigma _{k} = \sigma\) and, by (27), \(\alpha _{k} = \alpha := c_{\alpha } \tau\) we obtain

$$\begin{aligned} \begin{aligned} 0 \le \Psi (x_{k+1}, y^{*}) - \Psi (x^{*}, y_{k+1}) \le a_{k}(x^{*}, y^{*}) - b_{k+1}(x^{*}, y^{*}) - c_{k} = a_{k}(x^{*}, y^{*}) - a_{k+1}(x^{*}, y^{*}) - c_{k}, \end{aligned} \end{aligned}$$
(31)

since

$$\begin{aligned} \begin{aligned} a_{k}(x^{*},y^{*})&= \frac{1}{2\tau } \left\Vert x^{*} - x_{k} \right\Vert ^{2} + \frac{1}{2\sigma } \left\Vert y^{*} - y_{k} \right\Vert ^{2} + \left\langle q_k, y_{k} - y^{*} \right\rangle + \frac{L_{yx}}{2\alpha } \left\Vert x_{k} - x_{k-1} \right\Vert ^{2} + \frac{L_{yy}}{2} \left\Vert y_{k} - y_{k-1} \right\Vert ^{2}\\&= b_{k}(x^{*}, y^{*}), \end{aligned} \end{aligned}$$
(32)

and

$$\begin{aligned} \begin{aligned} c_{k}&= \frac{1}{2} \left( \frac{1}{\tau } - \frac{L_{yx}}{\alpha } \right) \left\Vert x_{k+1} - x_{k} \right\Vert ^{2} + \frac{1}{2} \left( \frac{1}{\sigma } - L_{yy} - (L_{yx} \alpha + L_{yy} ) \right) \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}. \end{aligned} \end{aligned}$$

We see that (32), writing (17) with \(y = y^{*}\) and (10) in Assumption 1 yield

$$\begin{aligned} a_{k}(x^{*}, y^{*}) \ge \frac{1}{2\tau } \left\Vert x^{*} - x_{k} \right\Vert ^{2} + \frac{1}{2\sigma } \left( 1 - \sigma ( L_{yx} \alpha + L_{yy} ) \right) \left\Vert y^{*} - y_{k} \right\Vert ^{2} \ge 0. \end{aligned}$$
(33)

Furthermore, from (31) and (23) we deduce

$$\begin{aligned} a_{k}(x^{*}, y^{*}) \ge a_{k+1}(x^{*}, y^{*}) + \delta \left( \frac{1}{2\tau } \left\Vert x_{k+1} - x_{k} \right\Vert ^{2} + \frac{1}{2\sigma } \left\Vert y_{k+1} - y_{k} \right\Vert ^{2} \right) . \end{aligned}$$

Telescoping this inequality and taking into account (33) give

$$\begin{aligned} \lim _{k \rightarrow + \infty } (x_{k+1} - x_{k}) = \lim _{k \rightarrow + \infty } (y_{k+1} - y_{k}) = 0, \end{aligned}$$
(34)

as well as the existence of the limit \(\lim _{k \rightarrow + \infty } a_{k}(x^{*},y^{*}) \in \mathbb {R}\).

From (33) we get that \((x_{k})_{k \ge 0}\) and \((y_{k})_{k \ge 0}\) are bounded sequences. Moreover, by using (2) and (34) in definition (11) we obtain that

$$\begin{aligned} (q_k)_{k \ge 0} \text{ converges } \text{ strongly } \text{ to } 0. \end{aligned}$$
(35)

From the definition of \(a_{k}(x^{*}, y^{*})\) in (32), (34) and (35) we derive that

$$\begin{aligned} \exists \lim _{k \rightarrow + \infty } \left( \frac{1}{2\tau } \left\Vert x_{k}-x^{*} \right\Vert ^{2} + \frac{1}{2\sigma } \left\Vert y_{k}-y^{*} \right\Vert ^{2} \right) \in \mathbb {R}. \end{aligned}$$

Since this is true for an arbitrary saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\), we have that the first statement of the Opial Lemma holds.

Next we will show that all weak cluster points of \((x_{k},y_{k})_{k \ge 0}\) are in fact saddle points of (1). Assume that \((x_{k_n})_{n \ge 0}\) converges weakly to \(x^{*} \in {{\mathcal {H}}}\) and \((y_{k_n})_{n \ge 0}\) converges weakly to \(y^{*} \in {{\mathcal {G}}}\) as \(n \rightarrow + \infty\). From (14), (11) and (12) we have

$$\begin{aligned} \begin{aligned}&\left( \frac{1}{\tau }(x_{k_n} - x_{k_n+1}),\frac{1}{\sigma }(y_{k_n} - y_{k_n+1}) + q_{k_n} - q_{k_n+1} \right) \\&\quad \in \partial [\Phi (\, \cdot \,,y_{k_n+1})](x_{k_n+1}) \times \left( {-} \nabla _{y}\Phi (x_{k_n+1},y_{k_n+1}) + \partial g(y_{k_n+1}) \right) \\&\quad = \partial [\Psi (\, \cdot \,,y_{k_n+1})](x_{k_n+1}) \times \partial [ {-} \Psi (x_{k_n+1}, \, \cdot \,)](y_{k_n+1}), \end{aligned} \end{aligned}$$
(36)

where we used that for all \(k \ge 0\) we have \(x_{k} \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) and \(y_{k} \in {{\,\mathrm{dom}\,}}g\). The sequence on the left hand side of the inclusion (36) converges strongly to (0, 0) as \(n\rightarrow +\infty\) [according to (34) and (35)]. Notice that the operator \((x,y) \mapsto \partial [\Psi (\, \cdot \,,y)](x) \times \partial [ {-} \Psi (x, \, \cdot \,)](y)\) is maximal monotone (see Proposition 5), hence its graph is sequentially closed with respect to the strong \(\times\) weak topology. From here we deduce

$$\begin{aligned} (0,0) \in \partial [\Psi (\, \cdot \,,y^{*})](x^{*}) \times \partial [ {-} \Psi (x^{*}, \, \cdot \,)](y^{*}), \end{aligned}$$

from which we easily derive that \((x^{*},y^{*})\) is a saddle point, as it satisfies (4). This means that the second statement of the Opial Lemma is also fulfilled and we have weak convergence of \((x_{k}, y_{k})_{k \ge 0}\) to a saddle point \((x^{*}, y^{*})\).

It remains to show the convergence rate of the minimax gap evaluated at the ergodic sequences. Let \(K \ge 1\) and \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point. Writing (25) for \((x^{*}, y^{*})\) yields

$$\begin{aligned} 0 \le \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \le \frac{1}{T_{K}} \left( \frac{1}{2 \tau } \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) , \end{aligned}$$

with

$$\begin{aligned} T_{K} = \sum _{k=0}^{K-1} t_{k}, \quad \bar{x}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k}x_{k+1}, \quad \bar{y}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k}y_{k+1}. \end{aligned}$$

Using (29) to get \(t_{k} = 1\) for all \(k \ge 0\) in the above expressions gives

$$\begin{aligned} T_{K} = \sum _{k=0}^{K-1} t_{k} = K, \quad \bar{x}_{K} = \frac{1}{K} \sum _{k=0}^{K-1} x_{k+1}, \quad \bar{y}_{K} = \frac{1}{K} \sum _{k=0}^{K-1} y_{k+1}. \end{aligned}$$

Finally we derive for all \(K \ge 1\)

$$\begin{aligned} 0 \le \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \le \frac{1}{K} \left( \frac{1}{2 \tau } \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) . \end{aligned}$$

\(\square\)
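
To illustrate Theorem 9 on a small toy instance (entirely our own construction, not one of the numerical experiments of Sect. 5): take \(\Phi (x,y) = \left\langle y, Ax - b \right\rangle\) and \(g = \delta _{[-1,1]^{m}}\), so that \(\max _{y} \Psi (x,y) = \Vert Ax - b\Vert _{1}\) and, since b is chosen in the range of A, any saddle point \((x^{*}, y^{*})\) satisfies \(Ax^{*} = b\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = A @ rng.standard_normal(10)                 # b lies in the range of A

L_yx, L_yy = np.linalg.norm(A, 2), 0.0          # Lipschitz constants in (2)
c_alpha = 2.0 * L_yx                            # any c_alpha > L_yx
tau = sigma = 0.9 / np.sqrt(c_alpha * L_yx)     # (c_alpha*L_yx*tau + 2*L_yy)*sigma = 0.81 < 1
theta = 1.0                                     # constant parameters as in Proposition 8

x, y = np.zeros(10), np.zeros(20)
g_prev = A @ x - b                              # gradient at (x_{-1}, y_{-1}) = (x_0, y_0)
for _ in range(5000):
    g_cur = A @ x - b                                         # grad_y Phi(x_k, y_k)
    v = y + sigma * ((1.0 + theta) * g_cur - theta * g_prev)  # optimistic ascent step, cf. (7)
    y = np.clip(v, -1.0, 1.0)                                 # prox of sigma*g = projection onto the box
    x = x - tau * (A.T @ y)                                   # prox of tau*Phi(., y_{k+1}), cf. (8)
    g_prev = g_cur

print(np.linalg.norm(A @ x - b, 1))             # tends to zero as the iterates converge (Theorem 9)
```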

3.3.2 Convex-strongly concave setting

For the remainder of this section we assume that the function g is convex with modulus \(\nu > 0\), meaning it is \(\nu\)-strongly convex. In this case the choice (26) leads to adaptive parameters and accelerated convergence.

Proposition 10

Let \(c_{\alpha } > L_{yx} \ge 0\), \(\theta _{0} = 1\) and \(\tau _{0}, \, \sigma _{0} > 0\) such that

$$\begin{aligned} \left( c_{\alpha } L_{yx} \tau _{0} + 2 L_{yy} \right) \sigma _{0} < 1. \end{aligned}$$

If \(\nu > 0\) then \((\tau _{k})_{k \ge 0}\), \((\sigma _{k})_{k \ge 0}\) and \((\theta _{k})_{k \ge 0}\) as defined in Proposition 6 are adaptive, in particular we have

$$\begin{aligned} \theta _{k+1} = \frac{1}{\sqrt{1 + \nu \sigma _{k}}}< 1, \quad \tau _{k+1} = \frac{\tau _{k}}{\theta _{k+1}} > \tau _{k}, \quad \sigma _{k+1} = \theta _{k+1} \sigma _{k} < \sigma _{k} \quad \text {for all } k \ge 0. \end{aligned}$$
(37)

Proof

The statements follow directly from Proposition 6 for \(\nu > 0\). \(\square\)

To obtain statements regarding the (accelerated) convergence rates in the convex-strongly concave setting, we look at the behaviour of the sequences of step size parameters \((\tau _{k})_{k \ge 0}\) and \((\sigma _{k})_{k \ge 0}\) for \(k \rightarrow + \infty\).

Proposition 11

Let \(\theta _{0} = 1\), \(\tau _{0} > 0\),

$$\begin{aligned} 0 < \sigma _{0} \le \frac{9 + 3 \sqrt{13}}{2 \nu }, \end{aligned}$$

and for all \(k \ge 0\) denote

$$\begin{aligned} \gamma _{k} := \frac{\tau _{k}}{\sigma _{k}}. \end{aligned}$$

Then with the choice of adaptive parameters (37) we have for all \(k \ge 0\)

$$\begin{aligned} \gamma _{k} \ge \frac{\nu ^{2} \tau _{0} \sigma _{0}}{9} k^{2} \quad \text{ and } \quad \tau _{k} \ge \frac{\nu \tau _{0} \sigma _{0}}{3} k, \end{aligned}$$

and for all \(k \ge 1\)

$$\begin{aligned} \sigma _{k} \le \frac{3}{\nu } \frac{1}{k}. \end{aligned}$$

Proof

By (26) we conclude that for all \(k \ge 0\)

$$\begin{aligned} \gamma _{k+1} = \gamma _{k} (1 + \nu \sigma _{k}), \end{aligned}$$

and further

$$\begin{aligned} \sigma _{k+1} = \sigma _{k} \sqrt{ \frac{\gamma _{k}}{\gamma _{k+1}} }, \end{aligned}$$

which, applied recursively, gives

$$\begin{aligned} \sigma _{k} = \sigma _{0} \sqrt{ \frac{\gamma _{0}}{\gamma _{k}} } = \sqrt{ \tau _{0} \sigma _{0} } \frac{1}{\sqrt{\gamma _{k}}}. \end{aligned}$$

We obtain

$$\begin{aligned} \gamma _{k+1} = \gamma _{k} (1 + \nu \sigma _{k}) = \gamma _{k} + \nu \sqrt{ \tau _{0} \sigma _{0} } \sqrt{\gamma _{k}}, \end{aligned}$$

which we will use to show by induction that for all \(k \ge 0\)

$$\begin{aligned} \gamma _{k} \ge \frac{ \nu ^{2} \tau _{0} \sigma _{0} }{9} k^{2}. \end{aligned}$$
(38)

For \(k = 0\) the statement trivially holds, whereas for \(k = 1\) we need to verify that

$$\begin{aligned} \gamma _{1} = \gamma _{0} + \nu \sqrt{ \tau _{0} \sigma _{0} } \sqrt{\gamma _{0}} = \frac{\tau _{0}}{\sigma _{0}} \left( 1 + \nu \sigma _{0} \right) \ge \frac{ \nu ^{2} \tau _{0} \sigma _{0} }{9}, \end{aligned}$$

which is equivalent to the following quadratic inequality

$$\begin{aligned} \sigma _{0}^{2} - \frac{9}{\nu } \sigma _{0} - \frac{9}{\nu ^{2}} \le 0, \end{aligned}$$

which is guaranteed to hold by our initial choice of \(\sigma _{0} > 0\). Now let \(k \ge 1\) and assume that (38) holds. Then

$$\begin{aligned} \gamma _{k+1} = \gamma _{k} + \nu \sqrt{ \tau _{0} \sigma _{0} } \sqrt{\gamma _{k}} \ge \frac{ \nu ^{2} \tau _{0} \sigma _{0} }{9} k^{2} + \frac{ \nu ^{2} \tau _{0} \sigma _{0} }{3} k \ge \frac{ \nu ^{2} \tau _{0} \sigma _{0} }{9} (k+1)^{2}. \end{aligned}$$

This shows the validity of (38) for all \(k \ge 0\).

Now we can use inequality (38) to deduce the convergence behaviour of the sequences \((\tau _{k})_{k \ge 0}\) and \((\sigma _{k})_{k \ge 0}\) for \(k \rightarrow + \infty\). We get for all \(k \ge 0\)

$$\begin{aligned} \tau _{k} = \sigma _{k} \gamma _{k} = \sqrt{ \tau _{0} \sigma _{0} } \sqrt{\gamma _{k}} \ge \frac{ \nu \tau _{0} \sigma _{0} }{3} k, \end{aligned}$$
(39)

which, combined with

$$\begin{aligned} \tau _{k} \sigma _{k} = \frac{\tau _{k}^{2}}{\gamma _{k}} = \tau _{0} \sigma _{0}, \end{aligned}$$

gives for all \(k \ge 1\)

$$\begin{aligned} \sigma _{k} \le \frac{3}{\nu } \frac{1}{k}. \end{aligned}$$

\(\square\)
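
As a quick numerical sanity check of these bounds (our own illustration, with arbitrarily chosen values \(\nu = 2\) and \(\tau _{0} = \sigma _{0} = 0.5\), for which the condition on \(\sigma _{0}\) holds):

```python
import math

nu, tau0, sigma0 = 2.0, 0.5, 0.5      # sigma0 <= (9 + 3*sqrt(13)) / (2*nu) is satisfied
tau, sigma = tau0, sigma0
for k in range(1, 1001):
    theta = 1.0 / math.sqrt(1.0 + nu * sigma)   # update rule (26)
    tau, sigma = tau / theta, theta * sigma     # now tau = tau_k, sigma = sigma_k
    gamma = tau / sigma                         # gamma_k = tau_k / sigma_k
    assert gamma >= nu**2 * tau0 * sigma0 * k**2 / 9.0   # inequality (38)
    assert tau >= nu * tau0 * sigma0 * k / 3.0           # inequality (39)
    assert sigma <= 3.0 / (nu * k)                       # bound on sigma_k
```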

Now we are ready to prove the convergence results in the convex-strongly concave setting.

Theorem 12

Let \(c_{\alpha } > L_{yx} \ge 0\), \(\theta _{0} = 1\) and \(\tau _{0}, \, \sigma _{0} > 0\) such that

$$\begin{aligned} \left( c_{\alpha } L_{yx} \tau _{0} + 2 L_{yy} \right) \sigma _{0}< 1 \quad \text{ and } \quad 0 < \sigma _{0} \le \frac{9 + 3 \sqrt{13}}{2 \nu }. \end{aligned}$$

Let \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be a saddle point of (1). Then for \((x_{k}, y_{k})_{k \ge 0}\) being the sequence generated by OGAProx with the choice of adaptive parameters

$$\begin{aligned} \theta _{k+1} = \frac{1}{\sqrt{1 + \nu \sigma _{k}}}< 1, \quad \tau _{k+1} = \frac{\tau _{k}}{\theta _{k+1}} > \tau _{k}, \quad \sigma _{k+1} = \theta _{k+1} \sigma _{k} < \sigma _{k} \quad \text {for all } k \ge 0, \end{aligned}$$

we have for all \(K \ge 1\)

$$\begin{aligned} \left\Vert y^{*}-y_{K} \right\Vert \le \frac{c_{1}}{K} {\left( \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) }^{\frac{1}{2}}, \end{aligned}$$

with \(c_{1} := \sqrt{\frac{18}{\nu ^{2} \sigma _{0} \delta }}\), where \(\delta > 0\) is defined in (28). Furthermore, for \(K \ge 1\), denote

$$\begin{aligned} T_{K} = \sum _{k=0}^{K-1} t_{k}, \quad \bar{x}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k} x_{k+1}, \quad \bar{y}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k} y_{k+1}, \end{aligned}$$

where \(t_k = \frac{\tau _k}{\tau _0}\) for all \(k \ge 0\) [see also (29)]. Then for all \(K \ge 2\) it holds

$$\begin{aligned} 0 \le \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \le \frac{c_{2}}{K^{2}} \left( \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) , \end{aligned}$$

with \(c_{2} := \frac{12}{\nu \sigma _{0}}\).

Proof

Let \(K \ge 1\) and let \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point. First we will prove the convergence rate of the sequence of iterates \((y_{k})_{k \ge 0}\). Plugging the particular choice of parameters (37) into (24) for \((x^{*}, y^{*})\), we obtain

$$\begin{aligned} \begin{aligned} \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2}&\ge \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{K} \right\Vert ^{2} + \frac{\tau _{K}}{\sigma _{K}} \left( 1 - \sigma _{K} \theta _{K} ( L_{yx} \alpha _{K} + L_{yy} ) \right) \frac{1}{2 \tau _{0}} \left\Vert y^{*}-y_{K} \right\Vert ^{2}\\&\ge \gamma _{K} \frac{\delta }{2 \tau _{0}} \left\Vert y^{*}-y_{K} \right\Vert ^{2}, \end{aligned} \end{aligned}$$

where we use (10) in Assumption 1 for the last inequality. Combining this with (38) we derive

$$\begin{aligned} \left\Vert y^{*}-y_{K} \right\Vert \le \frac{c_{1}}{K} {\left( \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) }^{\frac{1}{2}}, \end{aligned}$$

with \(c_{1} := \sqrt{\frac{18}{\nu ^{2} \sigma _{0} \delta }}\).

Next we will show the convergence rate of the minimax gap at the ergodic sequences. Writing (25) for \((x^{*}, y^{*})\), we obtain

$$\begin{aligned} 0 \le \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \le \frac{1}{T_{K}} \left( \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) . \end{aligned}$$
(40)

Plugging the particular choice of \(t_{k} = \frac{\tau _{k}}{\tau _{0}}\) for all \(k \ge 0\) from (29) into the definition of \(T_{K}\), together with (39) yields

$$\begin{aligned} T_{K} = \frac{1}{\tau _{0}} \sum _{k=0}^{K-1} \tau _{k} \ge \frac{\nu \sigma _{0}}{3} \sum _{k=0}^{K-1} k = \frac{\nu \sigma _{0}}{6} K(K-1). \end{aligned}$$

Combining this inequality with (40), we obtain for all \(K \ge 2\)

$$\begin{aligned} 0 \le \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \le \frac{c_{2}}{K^{2}} \left( \frac{1}{2 \tau _{0}} \left\Vert x^{*}-x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma _{0}} \left\Vert y^{*}-y_{0} \right\Vert ^{2} \right) , \end{aligned}$$

with \(c_{2} := \frac{12}{\nu \sigma _{0}}\), which concludes the proof. \(\square\)
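For illustration, the behaviour of the adaptive parameters can also be checked numerically. The following minimal Python sketch (assuming NumPy; the concrete values of \(\nu\), \(\tau _{0}\) and \(\sigma _{0}\) are illustrative choices that merely satisfy the assumptions of Theorem 12) implements the update rule (37) and verifies the invariant \(\tau _{k} \sigma _{k} = \tau _{0} \sigma _{0}\) as well as the bounds derived from (38) and (39):

```python
import numpy as np

def adaptive_parameters(nu, tau0, sigma0, K):
    """Adaptive schedule of Theorem 12, cf. (37):
    theta_{k+1} = 1/sqrt(1 + nu*sigma_k), tau_{k+1} = tau_k/theta_{k+1},
    sigma_{k+1} = theta_{k+1}*sigma_k."""
    tau, sigma = np.empty(K + 1), np.empty(K + 1)
    tau[0], sigma[0] = tau0, sigma0
    for k in range(K):
        theta = 1.0 / np.sqrt(1.0 + nu * sigma[k])
        tau[k + 1] = tau[k] / theta
        sigma[k + 1] = sigma[k] * theta
    return tau, sigma

# illustrative values satisfying the assumptions of Theorem 12
nu, tau0, sigma0, K = 0.3, 1.0, 1.0, 1000
tau, sigma = adaptive_parameters(nu, tau0, sigma0, K)
k = np.arange(1, K + 1)
assert np.allclose(tau * sigma, tau0 * sigma0)                # tau_k * sigma_k is constant
assert np.all(tau[1:] >= nu * tau0 * sigma0 * k / 3 - 1e-10)  # linear growth, cf. (39)
assert np.all(sigma[1:] <= 3.0 / (nu * k) + 1e-10)            # O(1/k) decay of sigma_k
```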

4 Strongly convex-strongly concave setting

For this section we assume that the function g is convex with modulus \(\nu > 0\), that is, \(\nu\)-strongly convex. In addition to the assumptions made so far, we assume that for all \(y \in {{\,\mathrm{dom}\,}}g\) the function \(\Phi (\, \cdot \,,y): {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{ + \infty \}\) is strongly convex with modulus \(\mu > 0\). This means that the saddle function \((x,y) \mapsto \Psi (x, y)\) is strongly convex-strongly concave.

As in the previous section we first state the step size assumptions needed for the convergence analysis. These are again followed by preparatory observations and a result guaranteeing that the stated assumptions can be fulfilled. The section closes with the formulation and proof of the convergence results.

Assumption 2

We assume that the step sizes \(\tau _{k}\), \(\sigma _{k}\) and the momentum parameter \(\theta _{k}\) are constant

$$\begin{aligned} \theta _{k} = \theta _{0} =: \theta , \quad \tau _{k} = \tau _{0} =: \tau , \quad \sigma _{k} = \sigma _{0} =: \sigma \quad \forall k \ge 0, \end{aligned}$$

and satisfy

$$\begin{aligned} 1 + \mu \tau = \frac{1}{\theta }, \quad 1 + \nu \sigma = \frac{1}{\theta }, \end{aligned}$$
(41)

with

$$\begin{aligned} 0< \theta < 1. \end{aligned}$$
(42)

Furthermore, we assume that there exists \(\alpha > 0\) such that

$$\begin{aligned} \frac{L_{yx}}{\alpha } \le \frac{1}{\tau }, \quad L_{yy} \le \frac{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}{\sigma }, \end{aligned}$$
(43)

with

$$\begin{aligned} 1 - \theta \sigma (\alpha L_{yx} + L_{yy}) > 0. \end{aligned}$$
(44)

4.1 Preliminary considerations

We take an arbitrary \((x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and let \(k \ge 0\). Following similar considerations along (13)–(15), additionally taking into account the \(\mu\)-strong convexity of \(\Phi (\, \cdot \,, y)\) for \(y \in {{\,\mathrm{dom}\,}}g\), instead of (16) we derive

$$\begin{aligned} \begin{aligned} \Psi (x_{k+1},y)-\Psi (x,y_{k+1}) \le&\ \theta \left\langle q_{k}, y_{k} - y \right\rangle - \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&- \frac{\mu }{2} \left\Vert x - x_{k+1} \right\Vert ^{2} - \frac{\nu }{2} \left\Vert y - y_{k+1} \right\Vert ^{2} + \theta \left\langle q_{k}, y_{k+1} - y_{k} \right\rangle \\&+ \frac{1}{2\tau } \left( -\left\Vert x_{k} - x_{k+1} \right\Vert ^{2} - \left\Vert x - x_{k+1} \right\Vert ^{2} + \left\Vert x - x_{k} \right\Vert ^{2}\right) \\&+ \frac{1}{2\sigma } \left( -\left\Vert y_{k} - y_{k+1} \right\Vert ^{2} - \left\Vert y - y_{k+1} \right\Vert ^{2} + \left\Vert y - y_{k} \right\Vert ^{2}\right) \\ \le&\ \frac{1}{2 \tau } \left\Vert x - x_{k} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y - y_{k} \right\Vert ^{2} + \theta \left\langle q_{k}, y_{k} - y \right\rangle \\&- \frac{1 + \mu \tau }{2 \tau } \left\Vert x - x_{k+1} \right\Vert ^{2} - \frac{1 + \nu \sigma }{2 \sigma } \left\Vert y - y_{k+1} \right\Vert ^{2} - \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&+ \frac{\theta L_{yx}}{2 \alpha } \left\Vert x_{k} - x_{k-1} \right\Vert ^{2} - \frac{1}{2 \tau } \left\Vert x_{k+1} - x_{k} \right\Vert ^{2}\\&+ \frac{\theta L_{yy}}{2} \left\Vert y_{k} - y_{k-1} \right\Vert ^{2} - \frac{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}{2 \sigma } \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}. \end{aligned} \end{aligned}$$

By (41) in Assumption 2 and for \(\alpha >0\) fulfilling (43)–(44), we obtain

$$\begin{aligned} \begin{aligned} \Psi (x_{k+1},y)-\Psi (x,y_{k+1}) \le&\ \frac{1}{2 \tau } \left\Vert x - x_{k} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y - y_{k} \right\Vert ^{2} + \theta \left\langle q_{k}, y_{k} - y \right\rangle \\&- \frac{1}{2 \tau } \frac{1}{\theta } \left\Vert x - x_{k+1} \right\Vert ^{2} - \frac{1}{2 \sigma } \frac{1}{\theta } \left\Vert y - y_{k+1} \right\Vert ^{2} - \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&+ \frac{\theta L_{yx}}{2 \alpha } \left\Vert x_{k} - x_{k-1} \right\Vert ^{2} - \frac{1}{2 \tau } \left\Vert x_{k+1} - x_{k} \right\Vert ^{2}\\&+ \frac{\theta L_{yy}}{2} \left\Vert y_{k} - y_{k-1} \right\Vert ^{2} - \frac{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}{2 \sigma } \left\Vert y_{k+1} - y_{k} \right\Vert ^{2}, \end{aligned} \end{aligned}$$

which together with (43) and (44) gives

$$\begin{aligned} \begin{aligned} \Psi (x_{k+1},y)-\Psi (x,y_{k+1}) \le&\ \frac{1}{2 \tau } \left( \left\Vert x - x_{k} \right\Vert ^{2} - \frac{1}{\theta } \left\Vert x - x_{k+1} \right\Vert ^{2}\right) + \frac{1}{2 \sigma } \left( \left\Vert y - y_{k} \right\Vert ^{2} - \frac{1}{\theta } \left\Vert y - y_{k+1} \right\Vert ^{2} \right) \\&+ \theta \left\langle q_{k}, y_{k} - y \right\rangle - \left\langle q_{k+1}, y_{k+1} - y \right\rangle + \frac{1}{2 \tau } \left( \theta \left\Vert x_{k} - x_{k-1} \right\Vert ^{2} - \left\Vert x_{k+1} - x_{k} \right\Vert ^{2} \right) \\&+ \frac{1}{2 \tilde{\sigma }} \left( \theta \left\Vert y_{k} - y_{k-1} \right\Vert ^{2} - \left\Vert y_{k+1} - y_{k} \right\Vert ^{2} \right) , \end{aligned} \end{aligned}$$
(45)

where

$$\begin{aligned} \tilde{\sigma } := \frac{\sigma }{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}. \end{aligned}$$

Let \(K \ge 1\) and as in (21) denote

$$\begin{aligned} T_{K} = \sum _{k=0}^{K-1} t_{k}, \quad \bar{x}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k}x_{k+1}, \quad \bar{y}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} t_{k}y_{k+1}, \end{aligned}$$

with \(t_{k} > 0\) defined as in (19), in other words

$$\begin{aligned} t_{k} = \theta ^{-k} \quad \forall k \ge 0. \end{aligned}$$

Multiplying both sides of (45) by \(t_{k} > 0\) yields

$$\begin{aligned} \begin{aligned} \frac{1}{\theta ^{k}} \left( \Psi (x_{k+1},y)-\Psi (x,y_{k+1})\right)&\le \frac{1}{2 \tau } \left( \frac{1}{\theta ^{k}} \left\Vert x - x_{k} \right\Vert ^{2} - \frac{1}{\theta ^{k+1}} \left\Vert x - x_{k+1} \right\Vert ^{2}\right) \\&\quad + \frac{1}{2 \sigma } \left( \frac{1}{\theta ^{k}} \left\Vert y - y_{k} \right\Vert ^{2} - \frac{1}{\theta ^{k+1}} \left\Vert y - y_{k+1} \right\Vert ^{2} \right) \\&\quad + \frac{1}{\theta ^{k-1}}\left\langle q_{k}, y_{k} - y \right\rangle - \frac{1}{\theta ^{k}} \left\langle q_{k+1}, y_{k+1} - y \right\rangle \\&\quad + \frac{1}{2 \tau } \left( \frac{1}{\theta ^{k-1}} \left\Vert x_{k} - x_{k-1} \right\Vert ^{2} - \frac{1}{\theta ^{k}} \left\Vert x_{k+1} - x_{k} \right\Vert ^{2} \right) \\&\quad + \frac{1}{2 \tilde{\sigma }} \left( \frac{1}{\theta ^{k-1}} \left\Vert y_{k} - y_{k-1} \right\Vert ^{2} - \frac{1}{\theta ^{k}} \left\Vert y_{k+1} - y_{k} \right\Vert ^{2} \right) . \end{aligned} \end{aligned}$$

Summing up the above inequality for \(k = 0, \ldots , K-1\) and taking into account Jensen’s inequality for the convex function \(\Psi (\, \cdot \,, y) - \Psi (x, \, \cdot \,)\) give

$$\begin{aligned} \begin{aligned} T_{K} \left( \Psi (\bar{x}_{K},y)-\Psi (x,\bar{y}_{K}) \right)&\le \sum _{k=0}^{K-1} \frac{1}{\theta ^{k}} \left( \Psi (x_{k+1},y)-\Psi (x,y_{k+1}) \right) \\&\le \frac{1}{2 \tau } \left( \left\Vert x - x_{0} \right\Vert ^{2} - \frac{1}{\theta ^{K}} \left\Vert x - x_{K} \right\Vert ^{2} \right) + \frac{1}{2 \sigma } \left( \left\Vert y - y_{0} \right\Vert ^{2} - \frac{1}{\theta ^{K}} \left\Vert y - y_{K} \right\Vert ^{2} \right) \\&\quad - \frac{1}{\theta ^{K-1}} \left\langle q_{K}, y_{K} - y \right\rangle - \frac{1}{\theta ^{K-1}} \frac{1}{2 \tau } \left\Vert x_{K} - x_{K-1} \right\Vert ^{2} - \frac{1}{\theta ^{K-1}} \frac{1}{2 \tilde{\sigma }} \left\Vert y_{K} - y_{K-1} \right\Vert ^{2}\\&\le \frac{1}{2 \tau } \left( \left\Vert x - x_{0} \right\Vert ^{2} - \frac{1}{\theta ^{K}} \left\Vert x - x_{K} \right\Vert ^{2} \right) + \frac{1}{2 \sigma } \left( \left\Vert y - y_{0} \right\Vert ^{2} - \frac{1}{\theta ^{K}} \left\Vert y - y_{K} \right\Vert ^{2} \right) \\&\quad + \frac{1}{\theta ^{K-1}} \frac{L_{yx}}{2} \left( \frac{1}{\alpha } \left\Vert x_{K} - x_{K-1} \right\Vert ^{2} + \alpha \left\Vert y_{K} - y \right\Vert ^{2} \right) - \frac{1}{\theta ^{K-1}} \frac{1}{2 \tau } \left\Vert x_{K} - x_{K-1} \right\Vert ^{2}\\&\quad + \frac{1}{\theta ^{K-1}} \frac{L_{yy}}{2} \left( \left\Vert y_{K} - y_{K-1} \right\Vert ^{2} + \left\Vert y_{K} - y \right\Vert ^{2} \right) - \frac{1}{\theta ^{K-1}} \frac{1}{2 \tilde{\sigma }} \left\Vert y_{K} - y_{K-1} \right\Vert ^{2}\\&= \frac{1}{2 \tau } \left\Vert x - x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y - y_{0} \right\Vert ^{2}\\&\quad - \frac{1}{\theta ^{K}} \frac{1}{2 \tau } \left\Vert x - x_{K} \right\Vert ^{2} - \frac{1}{\theta ^{K}} \frac{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}{2 \sigma } \left\Vert y - y_{K} \right\Vert ^{2}\\&\quad - \frac{1}{2 \theta ^{K-1}} \left( \frac{1}{\tau } - \frac{L_{yx}}{\alpha } \right) \left\Vert x_{K} - x_{K-1} \right\Vert ^{2} - \frac{1}{2 \theta ^{K-1}} \left( \frac{1}{\tilde{\sigma }} - L_{yy} \right) \left\Vert y_{K} - y_{K-1} \right\Vert ^{2}, \end{aligned} \end{aligned}$$

where in the second inequality we use (17). Omitting the last two terms, which are nonpositive by (43), we obtain for all \(K \ge 1\)

$$\begin{aligned} \begin{aligned} T_{K} \theta ^{K} \big ( \Psi (\bar{x}_{K},y)-\Psi (x,\bar{y}_{K}) \big ) +&\frac{1}{2 \tau } \left\Vert x - x_{K} \right\Vert ^{2} + \frac{1}{2 \tilde{\sigma }} \left\Vert y - y_{K} \right\Vert ^{2} \le \theta ^{K} \left( \frac{1}{2 \tau } \left\Vert x - x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y - y_{0} \right\Vert ^{2} \right) , \end{aligned} \end{aligned}$$
(46)

which we will use to obtain our convergence results in the following.

4.2 Fulfilment of step size assumptions

In this subsection we will investigate a particular choice of parameters \(\tau\), \(\sigma\) and \(\theta\) such that Assumption 2 holds.

Proposition 13

For \(\alpha >0\) define

$$\begin{aligned} \tilde{\theta } := \max \left\{ \frac{L_{yx}}{\alpha \mu + L_{yx}}, \, \frac{\alpha L_{yx} + 2 L_{yy}}{\nu + \alpha L_{yx} + 2L_{yy}} \right\} . \end{aligned}$$
(47)

Let \(\theta > 0\) such that

$$\begin{aligned} 0 \le \tilde{\theta }< \theta < 1, \end{aligned}$$
(48)

and set

$$\begin{aligned} \tau = \frac{1}{\mu } \frac{1 - \theta }{\theta } \quad \text{ and } \quad \sigma = \frac{1}{\nu } \frac{1 - \theta }{\theta }. \end{aligned}$$
(49)

Then \(\tau\), \(\sigma\) and \(\theta\) fulfil Assumption 2.

Proof

If \(L_{yx} = L_{yy} =0\), then the conclusion follows immediately. Assume that \(L_{yx} + L_{yy} >0\). It is easy to verify that definition (47) yields

$$\begin{aligned} 0< \tilde{\theta } < 1 \end{aligned}$$

and that (49) is equivalent to (41) where (42) is ensured by (48). Furthermore, plugging the specific form of the step sizes (49) into (43) we obtain for the first inequality of (43)

$$\begin{aligned} \frac{L_{yx}}{\alpha } \le \frac{\mu \theta }{1 - \theta }, \end{aligned}$$

which is equivalent to

$$\begin{aligned} \theta \ge \frac{L_{yx}}{\alpha \mu + L_{yx}}. \end{aligned}$$

Note that by (48) we have

$$\begin{aligned} 0 \le \frac{L_{yx}}{\alpha \mu + L_{yx}} \le \tilde{\theta }< \theta < 1. \end{aligned}$$

Similarly, the second inequality of (43) is equivalent to the following quadratic inequality in \(\theta\)

$$\begin{aligned} \theta ^{2} - \frac{\alpha L_{yx} - \nu }{\alpha L_{yx} + L_{yy}} \theta - \frac{L_{yy}}{\alpha L_{yx} + L_{yy}} \ge 0. \end{aligned}$$

The nonnegative solution of the associated quadratic equation reads

$$\begin{aligned} \rho := \frac{1}{2} \left( \frac{\alpha L_{yx} - \nu }{\alpha L_{yx} + L_{yy}} + \sqrt{ \left( \frac{\alpha L_{yx} - \nu }{\alpha L_{yx} + L_{yy}} \right) ^{2} + \frac{4 L_{yy}}{\alpha L_{yx} + L_{yy}} } \right) \ge 0. \end{aligned}$$

Since

$$\begin{aligned} 0 \le \rho< \frac{\alpha L_{yx} + 2L_{yy}}{\nu + \alpha L_{yx} + 2L_{yy}} \le \tilde{\theta }< \theta < 1, \end{aligned}$$

the second inequality of (43) is also fulfilled. In order to see that

$$\begin{aligned} \rho < \frac{\alpha L_{yx} + 2L_{yy}}{\nu + \alpha L_{yx} + 2L_{yy}}, \end{aligned}$$

we notice that this inequality is equivalent to

$$\begin{aligned} \left( \frac{\alpha L_{yx} - \nu }{\alpha L_{yx} + L_{yy}} \right) ^{2} + \frac{4 L_{yy}}{\alpha L_{yx} + L_{yy}} < \frac{\big ( \nu ^{2} + 2 \nu L_{yy} + (\alpha L_{yx} + 2 L_{yy})^{2} \big )^{2}}{(\nu + \alpha L_{yx} + 2L_{yy})^{2}(\alpha L_{yx} + L_{yy})^{2}}, \end{aligned}$$

which holds if and only if

$$\begin{aligned}&\big ( (\alpha L_{yx} - \nu )^{2} + 4 L_{yy}(\alpha L_{yx} + L_{yy}) \big ) (\nu + \alpha L_{yx} + 2L_{yy})^{2}\\ <&\ (\nu ^{2} + 2 \nu L_{yy})^{2} + 2(\nu ^{2} + 2 \nu L_{yy})(\alpha L_{yx} + 2L_{yy})^{2} + (\alpha L_{yx} + 2 L_{yy})^{4} \end{aligned}$$

or, equivalently,

$$\begin{aligned} 0 < 4 \nu ^{2} (\alpha L_{yx} + L_{yy})^{2}, \end{aligned}$$

which is true since \(\nu > 0\) and \(\alpha L_{yx} + L_{yy} > 0\).

For the remaining condition (44) to hold we need to ensure

$$\begin{aligned} \theta > \frac{\alpha L_{yx} + L_{yy} - \nu }{\alpha L_{yx} + L_{yy}}. \end{aligned}$$

For this we observe that

$$\begin{aligned} \begin{aligned} \rho&\ge \frac{1}{2} \left( \frac{\alpha L_{yx} - \nu }{\alpha L_{yx} + L_{yy}} + \sqrt{ \left( \frac{\alpha L_{yx} + 2L_{yy} - \nu }{\alpha L_{yx} + L_{yy}} \right) ^{2} } \right) \\&\ge \frac{1}{2} \frac{\alpha L_{yx} - \nu + \alpha L_{yx} + 2L_{yy} - \nu }{\alpha L_{yx} + L_{yy}} = \frac{\alpha L_{yx} + L_{yy} - \nu }{\alpha L_{yx} + L_{yy}}. \end{aligned} \end{aligned}$$

In conclusion, we obtain the following chain of inequalities

$$\begin{aligned} \frac{\alpha L_{yx} + L_{yy} - \nu }{\alpha L_{yx} + L_{yy}} \le \rho< \frac{\alpha L_{yx} + 2L_{yy}}{\nu + \alpha L_{yx} + 2L_{yy}} \le \tilde{\theta }< \theta < 1, \end{aligned}$$

where the last inequalities follow from the definition of \(\tilde{\theta }\) and from (48); in particular, (44) is fulfilled. \(\square\)
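The parameter choice of Proposition 13 is straightforward to implement. The following is a minimal Python sketch (the function name and the factor used to place \(\theta\) strictly between \(\tilde{\theta }\) and 1 are our own illustrative choices, not prescribed by the proposition):

```python
def constant_parameters(mu, nu, L_yx, L_yy, alpha, slack=0.5):
    """Constant parameters theta, tau, sigma fulfilling Assumption 2,
    chosen as in Proposition 13; 'slack' in (0, 1) places theta strictly
    between theta_tilde and 1."""
    theta_tilde = max(L_yx / (alpha * mu + L_yx),
                      (alpha * L_yx + 2.0 * L_yy) / (nu + alpha * L_yx + 2.0 * L_yy))  # (47)
    theta = theta_tilde + slack * (1.0 - theta_tilde)   # (48)
    tau = (1.0 - theta) / (mu * theta)                  # (49)
    sigma = (1.0 - theta) / (nu * theta)                # (49)
    return theta, tau, sigma

# e.g. constant_parameters(mu=1.0, nu=0.5, L_yx=2.0, L_yy=0.0, alpha=1.0);
# theta**K is then the linear factor appearing in Theorem 14.
```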

4.3 Convergence results

Now we can combine the previous results and prove the convergence statements in the strongly convex-strongly concave setting.

Theorem 14

Let \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be a saddle point of (1). Then for \((x_{k}, y_{k})_{k \ge 0}\) being the sequence generated by OGAProx with the choice of parameters

$$\begin{aligned} \tau = \frac{1}{\mu } \frac{1 - \theta }{\theta }, \quad \sigma = \frac{1}{\nu } \frac{1 - \theta }{\theta }, \quad 0 \le \tilde{\theta }< \theta < 1, \end{aligned}$$

with

$$\begin{aligned} \tilde{\theta } = \max \left\{ \frac{L_{yx}}{\alpha \mu + L_{yx}}, \, \frac{\alpha L_{yx} + 2 L_{yy}}{\nu + \alpha L_{yx} + 2L_{yy}} \right\} , \end{aligned}$$

for \(\alpha >0\), we denote for \(K \ge 1\)

$$\begin{aligned} T_{K} = \sum _{k=0}^{K-1} \theta ^{-k}, \quad \bar{x}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} \theta ^{-k} x_{k+1}, \quad \bar{y}_{K} = \frac{1}{T_{K}} \sum _{k=0}^{K-1} \theta ^{-k} y_{k+1}, \end{aligned}$$

for which the following holds

$$\begin{aligned} 0 \le \theta \big ( \Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K}) \big ) + \frac{1}{2 \tau } \left\Vert x^{*} - x_{K} \right\Vert ^{2} + \frac{1}{2 \tilde{\sigma }} \left\Vert y^{*} - y_{K} \right\Vert ^{2} \le \theta ^{K} \left( \frac{1}{2 \tau } \left\Vert x^{*} - x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y^{*} - y_{0} \right\Vert ^{2} \right) , \end{aligned}$$

where \(\tilde{\sigma } := \frac{\sigma }{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}\).

Proof

Let \(K \ge 1\) and \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point of (1). Writing (46) for \((x^{*}, y^{*})\) we get

$$\begin{aligned} 0 \le&\ T_{K} \theta ^{K} \big ( \Psi (\bar{x}_{K},y^{*})-\Psi (x^{*},\bar{y}_{K}) \big ) + \frac{1}{2 \tau } \left\Vert x^{*} - x_{K} \right\Vert ^{2} + \frac{1}{2 \tilde{\sigma }} \left\Vert y^{*} - y_{K} \right\Vert ^{2} \\ \le&\ \theta ^{K} \left( \frac{1}{2 \tau } \left\Vert x^{*} - x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y^{*} - y_{0} \right\Vert ^{2} \right) . \end{aligned}$$

Using

$$\begin{aligned} T_{K} = \sum _{k=0}^{K-1} \frac{1}{\theta ^{k}} = \frac{1}{\theta ^{K-1}} \frac{1 - \theta ^{K}}{1 - \theta } \ge \frac{1}{\theta ^{K-1}}, \end{aligned}$$

finally we obtain for all \(K \ge 1\)

$$\begin{aligned} 0 \le \theta \big ( \Psi (\bar{x}_{K},y^{*})-\Psi (x^{*},\bar{y}_{K}) \big ) + \frac{1}{2 \tau } \left\Vert x^{*} - x_{K} \right\Vert ^{2} + \frac{1}{2 \tilde{\sigma }} \left\Vert y^{*} - y_{K} \right\Vert ^{2} \le \theta ^{K} \left( \frac{1}{2 \tau } \left\Vert x^{*} - x_{0} \right\Vert ^{2} + \frac{1}{2 \sigma } \left\Vert y^{*} - y_{0} \right\Vert ^{2} \right) , \end{aligned}$$

with \(0< \theta < 1\) as defined in (48). \(\square\)

5 Numerical experiments

In this section we treat three numerical applications of our method. The first one has a rather simple structure and serves to illustrate the convergence rates obtained in the previous sections. The second one concerns multi kernel support vector machines and validates OGAProx on a practically more relevant application, even though there are no theoretical guarantees for the “metric” reported there. The third application addresses a classification problem incorporating minimax group fairness, which reduces to solving a minimax problem with a nonsmooth coupling function.

5.1 Nonsmooth-linear problem

The first application serves as a simple proof of concept and showcases the convergence rates obtained in the previous sections. We consider the following nonsmooth-linear saddle point problem

$$\begin{aligned} \min _{x \in \mathbb {R}^{d}} \, \max _{y \in \mathbb {R}^{n}} \Psi (x, y) := \left\langle [x]_{+}, Ay \right\rangle - \left( \delta _{C}(y) + \frac{\nu }{2} \left\Vert y \right\Vert ^{2} \right) , \end{aligned}$$
(50)

with \(\nu \ge 0\), \(A \in \mathbb {R}^{d \times n}\) and \([ \, \cdot \,]_{+}\) denoting the component-wise positive part,

$$\begin{aligned} {[}x]_{+} = {\big ( \max \{ 0, x_{i} \} \big )}_{i=1}^{d}, \end{aligned}$$

and C being the following convex polyhedral cone

$$\begin{aligned} C := \{ y \in \mathbb {R}^{n} \ | \ A y \geqq 0 \}. \end{aligned}$$

For \(u = (u_{i})_{i=1}^{d}\), \(v = (v_{i})_{i=1}^{d} \in \mathbb {R}^{d}\) the relation \(u \geqq v\) denotes component-wise inequalities, namely,

$$\begin{aligned} u \geqq v \, \Leftrightarrow \, u_{i} \ge v_{i} \quad \text {for } 1 \le i \le d. \end{aligned}$$

Then \(g: \mathbb {R}^{n} \rightarrow \mathbb {R}\cup \{ + \infty \}\) with

$$\begin{aligned} g(y) := \delta _{C}(y) + \frac{\nu }{2} \left\Vert y \right\Vert ^{2} \end{aligned}$$

is proper, lower semicontinuous and convex with modulus \(\nu \ge 0\) and \({{\,\mathrm{dom}\,}}g = C\). Moreover, \(\Phi : \mathbb {R}^{d} \times \mathbb {R}^{n} \rightarrow \mathbb {R}\) with

$$\begin{aligned} \Phi (x, y) = \sum _{i = 1}^{d} \max \{ 0, x_{i} \} (Ay)_{i} \end{aligned}$$

has full domain; for all \(x \in \mathbb {R}^{d}\) the function \(\Phi (x, \, \cdot \,)\) is linear, and for all \(y \in {{\,\mathrm{dom}\,}}g = C\) the function \(\Phi (\, \cdot \,, y)\) is convex and continuous.

Furthermore, we obtain for all \((x, y), \, (x', y') \in \mathbb {R}^{d} \times {{\,\mathrm{dom}\,}}g\)

$$\begin{aligned} \left\Vert \nabla _{y}\Phi (x, y) - \nabla _{y}\Phi (x', y') \right\Vert = \left\Vert A^{T} \left( [x]_{+} - [x']_{+} \right) \right\Vert \le \left\Vert A \right\Vert \left\Vert x - x' \right\Vert , \end{aligned}$$

hence (2) holds with \(L_{yx} = \left\Vert A \right\Vert\) and \(L_{yy} = 0\).

The algorithm (7)–(8) iterates for \(k \ge 0\)

$$\begin{aligned} \left\{ \begin{array}{rcl} v_{k} &{}=&{} y_{k} + \sigma _{k} \left[ (1 + \theta _{k}) \nabla _{y}\Phi (x_{k}, y_{k}) - \theta _{k} \nabla _{y}\Phi (x_{k-1}, y_{k-1})\right] = y_{k} + \sigma _{k} A^{T} \big ( (1 + \theta _{k}) [x_{k}]_{+} - \theta _{k} [x_{k-1}]_{+} \big ),\\ y_{k+1} &{}=&{} {\text {prox}}^{}_{\sigma _{k} g}\left( v_{k} \right) = P_{C} \left( \frac{1}{1 + \nu \sigma _{k}} v_{k} \right) ,\\ x_{k+1} &{}=&{} {\text {prox}}^{}_{\tau _{k} \Phi (\, \cdot \,, y_{k+1})}\left( x_{k} \right) , \end{array} \right. \end{aligned}$$

where the calculation of the orthogonal projection onto the set C amounts to solving a simple quadratic program and

$$\begin{aligned} {\text {prox}}^{}_{\tau \Phi (\, \cdot \,, y)}\left( x \right) = \left( {\text {prox}}^{}_{\tau (Ay)_{i} \max \{ 0, \, \cdot \,\} }\left( x_{i} \right) \right) _{i = 1}^{d}, \end{aligned}$$

where, for \(i=1, ..., d\),

$$\begin{aligned} {\text {prox}}^{}_{\tau (Ay)_{i} \max \{ 0, \, \cdot \,\} }\left( x_{i} \right) = {\left\{ \begin{array}{ll} x_{i} &{} \text {if } x_{i} \le 0, \\ 0 &{} \text {if } 0 < x_{i} \le \tau (Ay)_{i}, \\ x_{i} - \tau (Ay)_{i} &{} \text {if } x_{i} > \tau (Ay)_{i}. \end{array}\right. } \end{aligned}$$
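For completeness, the resulting iteration can be implemented in a few lines. The following Python sketch (assuming NumPy; the projection onto C is delegated to cvxpy as one possible way of solving the corresponding quadratic program, which is our choice for this illustration and not prescribed by the method) reproduces the three updates above for problem (50):

```python
import numpy as np
import cvxpy as cp

def project_onto_C(v, A):
    """Orthogonal projection onto C = {y : A y >= 0}, computed as a small QP."""
    y = cp.Variable(v.shape[0])
    cp.Problem(cp.Minimize(cp.sum_squares(y - v)), [A @ y >= 0]).solve()
    return y.value

def prox_phi_x(x, tau, Ay):
    """Component-wise prox of u -> tau * (A y)_i * max{0, u} as given above."""
    out = x.copy()
    out[(x > 0) & (x <= tau * Ay)] = 0.0
    mask = x > tau * Ay
    out[mask] = x[mask] - tau * Ay[mask]
    return out

def ogaprox_step(x_prev, x, y, theta, tau, sigma, nu, A):
    """One OGAProx iteration (7)-(8) specialised to problem (50)."""
    grad = lambda z: A.T @ np.maximum(z, 0.0)        # grad_y Phi(z, .) = A^T [z]_+
    v = y + sigma * ((1.0 + theta) * grad(x) - theta * grad(x_prev))
    y_new = project_onto_C(v / (1.0 + nu * sigma), A)
    x_new = prox_phi_x(x, tau, A @ y_new)
    return x_new, y_new
```

Any other QP solver can be used in place of cvxpy for the projection; the remaining operations are available in closed form.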

By writing the first order optimality conditions and using Lagrange duality we obtain the following characterisation.

$$\begin{aligned} \begin{aligned} (x^{*}, y^{*}) \text { is a saddle point of } (50)&\Leftrightarrow \left\{ \begin{array}{l} 0 \in \partial \left( \left\langle [ \, \cdot \,]_{+}, Ay^{*} \right\rangle - \delta _{C}(y^{*}) - \frac{\nu }{2} \left\Vert y^{*} \right\Vert ^{2} \right) (x^{*}) \\ 0 \in \partial \left( {-} \left\langle A^{T}[x^{*}]_{+}, \, \cdot \, \right\rangle + \delta _{C}(\, \cdot \,) + \frac{\nu }{2} \left\Vert \, \cdot \, \right\Vert ^{2} \right) (y^{*}) \end{array} \right. \\&\Leftrightarrow \left\{ \begin{array}{l} 0 \in \sum _{i = 1}^{d} (Ay^{*})_{i} \, \partial \max \{ 0, \, \cdot \,\} (x^{*}_{i})\\ A^{T}[x^{*}]_{+} - \nu y^{*} \in N_{C}(y^{*}) \end{array} \right. \\&\Leftrightarrow \left\{ \begin{array}{l} \forall i = 1, \ldots , d : \big ( (Ay^{*})_{i}> 0 \text { and } x^{*}_{i} \le 0 \big ) \text { or }\\ \quad \big ( (Ay^{*})_{i} = 0 \text { and } x^{*}_{i} \in \mathbb {R}\big ) \\ \left\langle A^{T} [x^{*}]_{+} - \nu y^{*}, y^{*} \right\rangle = 0 \\ \nu y^{*} - A^{T} [x^{*}]_{+} \in A^{T} \left( \mathbb {R}^{d}_{+} \right) \end{array} \right. \\&\Leftrightarrow \left\{ \begin{array}{l} \forall i = 1, \ldots , d : \big ( (Ay^{*})_{i} > 0 \text { and } x^{*}_{i} \le 0 \big ) \text { or }\\ \quad \big ( (Ay^{*})_{i} = 0 \text { and } x^{*}_{i} \in \mathbb {R}\big ) \\ \nu \left\Vert y^{*} \right\Vert ^{2} = \left\langle A^{T} [x^{*}]_{+} , y^{*} \right\rangle = \left\langle [x^{*}]_{+}, A y^{*} \right\rangle = 0 \\ \nu y^{*} \in A^{T} \left( [x^{*}]_{+} + \mathbb {R}^{d}_{+} \right) . \end{array} \right. \end{aligned} \end{aligned}$$

This means that for \(\nu = 0\) we obtain

$$\begin{aligned} (x^{*}, y^{*}) \text { is a saddle point of } (50) \Leftrightarrow \left\{ \begin{array}{l} \forall i = 1, \ldots , d : \big ( (Ay^{*})_{i} > 0 \text { and } x^{*}_{i} \le 0 \big ) \text { or }\\ \quad \big ( (Ay^{*})_{i} = 0 \text { and } x^{*}_{i} \in \mathbb {R}\big ) \\ 0 \in A^{T} \left( [x^{*}]_{+} + \mathbb {R}^{d}_{+} \right) \end{array}, \right. \end{aligned}$$

whereas for \(\nu > 0\)

$$\begin{aligned} (x^{*}, y^{*}) \text { is a saddle point of } (50) \Leftrightarrow \left\{ \begin{array}{l} y^{*} = 0\\ 0 \in A^{T} \left( [x^{*}]_{+} + \mathbb {R}^{d}_{+} \right) \end{array}. \right. \end{aligned}$$

If \(A \in \mathbb {R}^{d \times n}\) has full row rank, the inclusion

$$\begin{aligned} 0 \in A^{T} \left( [x^{*}]_{+} + \mathbb {R}^{d}_{+} \right) \end{aligned}$$

is equivalent to

$$\begin{aligned} x^{*} \leqq 0. \end{aligned}$$
Fig. 1 Convergence of the minimax gap like \(\mathcal {O} (\frac{1}{K})\) for \(\nu = 0\)

For the experiments we choose dimensions \(d = 250\) and \(n = 350\). For easier validation of the solution \(x^{*}\) we ensure that the matrix \(A \in \mathbb {R}^{d \times n}\), with entries drawn from a uniform distribution on the interval \([-3, 3]\), has full row rank. The starting points \(x_{0} = x_{-1} \in \mathbb {R}^{d}\) and \(y_{0} \in \mathbb {R}^{n}\) have entries drawn from a uniform distribution on the interval \([-5, 5]\).

In the case \(\nu = 0\), i.e., the regulariser g being merely convex, we proved weak asymptotic convergence of the iterates to some saddle point \((x^{*}, y^{*})\) and convergence of the minimax gap at the ergodic sequences to zero like \(\mathcal {O} (\frac{1}{K})\) for any saddle point. The latter is illustrated in Fig. 1 for \((x^{*}, y^{*}) \in \mathbb {R}^{d} \times \mathbb {R}^{n}\) with \(x^{*} \leqq 0\) and \(y^{*} \in C\) with \(y^{*} \ne 0\) for a single random initialisation.

Fig. 2 Convergence of the minimax gap like \(\mathcal {O} (\frac{1}{K^{2}})\) and of the sequence \((y_{k})_{k \ge 0}\) in norm like \(\mathcal {O} (\frac{1}{K})\) for \(\nu > 0\)

Let \((x^{*}, y^{*}) \in \mathbb {R}^{d} \times \mathbb {R}^{n}\) be a saddle point. In the case \(\nu > 0\), i.e., the regulariser g being \(\nu\)-strongly convex, we proved convergence of the sequence \((y_{k})_{k \ge 0}\) to \(y^{*}\) in norm like \(\mathcal {O} (\frac{1}{K})\) and convergence of the minimax gap at the ergodic sequences to zero like \(\mathcal {O} (\frac{1}{K^{2}})\). The numerical behaviour of our method validating the theoretical claims for \(\nu > 0\) is highlighted in Fig. 2. The plots shown are for a single random initialisation and with the choice \(\nu = \frac{3}{10}\).

5.2 Multi kernel support vector machine

The second application to test our method in practice is to learn a combined kernel matrix for a multi kernel support vector machine (SVM). We have a set of labelled training data

$$\begin{aligned} S_{n} = \{ (a_{1}, b_{1}), \ldots , (a_{n}, b_{n}) \} \subseteq \mathbb {R}^{m} \times \{ -1, 1 \}, \end{aligned}$$

where we denote \(b = (b_{i})_{i = 1}^{n}\), and a set of unlabelled test data

$$\begin{aligned} T_{l} = \{ a_{n + 1}, \ldots , a_{n + l} \} \subseteq \mathbb {R}^{m}. \end{aligned}$$

We consider embeddings of the data according to a kernel function \(\kappa : \mathbb {R}^{m} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) with the corresponding symmetric and positive semidefinite kernel matrix

$$\begin{aligned} \mathcal {K} = \left( \begin{array}{cc} \mathcal {K}^{tr} &{} \mathcal {K}^{tr, t}\\ \mathcal {K}^{t,tr} &{} \mathcal {K}^{t} \end{array} \right) , \end{aligned}$$

where \(\mathcal {K}_{ij} = \kappa (a_{i}, a_{j})\) for \(i, \, j = 1, \ldots , n, n + 1, \ldots , n+l\).

In the following, e denotes a vector of ones of appropriate size. According to [13] the problem of interest is

$$\begin{aligned} \min _{\begin{array}{c} \mathcal {K} \in \mathbb {K} \\ {{\,\mathrm{trace}\,}}(\mathcal {K}) = c \end{array}} \quad \max _{\begin{array}{c} 0 \leqq \alpha \leqq C \\ \left\langle \alpha , b \right\rangle = 0 \end{array}} \, \alpha ^{T}e - \frac{1}{2} \alpha ^{T} G(\mathcal {K}^{tr}) \alpha - \frac{\nu }{2} \left\Vert \alpha \right\Vert ^{2}_{2}, \end{aligned}$$
(51)

where \(\mathbb {K}\) is the model class of kernel matrices, \(c \in (0, +\infty )\), \(C \in (0, +\infty ]\) and \(\nu \in [0, +\infty )\) are model parameters and we define \(G(\mathcal {K}^{tr}) := {{\,\mathrm{diag}\,}}(b) \mathcal {K}^{tr} {{\,\mathrm{diag}\,}}(b)\).

The set \(\mathbb {K}\) is restricted to be the set of positive semidefinite matrices that can be written as a nonnegative linear combination of given kernel matrices \(\mathcal {K}_{1}, \ldots , \mathcal {K}_{d}\), i.e.,

$$\begin{aligned} \mathbb {K} = \left\{ \mathcal {K} \in S^{n+l}_{+} \ \bigg | \ \mathcal {K} = \sum _{i=1}^{d} \eta _{i} \mathcal {K}_{i}, \, \eta _{i} \ge 0 \text { for } i=1, ..., d \right\} . \end{aligned}$$

With this choice (51) becomes

$$\begin{aligned} \min _{\begin{array}{c} \left\langle \eta , r \right\rangle = c \\ \eta \geqq 0 \end{array}} \quad \max _{\begin{array}{c} 0 \leqq \alpha \leqq C \\ \left\langle \alpha , b \right\rangle = 0 \end{array}} \, \alpha ^{T}e - \frac{1}{2} \sum _{i=1}^{d} \eta _{i} \alpha ^{T} G(\mathcal {K}^{tr}_{i}) \alpha - \frac{\nu }{2} \left\Vert \alpha \right\Vert ^{2}, \end{aligned}$$
(52)

where \(\eta = (\eta _{i})_{i=1}^{d}\) and \(r = (r_{i})_{i=1}^{d}\) with \(r_{i} = {{\,\mathrm{trace}\,}}(\mathcal {K}_{i})\) for \(i=1, ..., d\). Assume \((\eta ^{*}, \alpha ^{*}) \in \mathbb {R}^{d} \times \mathbb {R}^{n}\) to be a saddle point of (52) and write

$$\begin{aligned} \mathcal {K}^{*} = \sum _{j=1}^{d} \eta ^{*}_{j} \mathcal {K}_{j}. \end{aligned}$$

Following the considerations of [11] we compute for \(a_{k} \in T_{l}\) with \(k \in \{ n+1, \ldots , n+l \}\),

$$\begin{aligned} \mathcal {L} (a_{k}) = {{\,\mathrm{sgn}\,}}\left( \sum _{i = 1}^{n} b_{i} \alpha ^{*}_{i} \mathcal {K}^{*}_{ik} + \gamma \right) = {{\,\mathrm{sgn}\,}}\left( \sum _{i = 1}^{n} \sum _{j=1}^{d} b_{i} \alpha ^{*}_{i} \eta ^{*}_{j} \left( \mathcal {K}_{j} \right) _{ik} + \gamma \right) , \end{aligned}$$
(53)

with

$$\begin{aligned} \gamma = b_{j_{0}} (1 - \nu \alpha ^{*}_{j_{0}}) - \sum _{i=1}^{n} b_{i} \alpha ^{*}_{i} \mathcal {K}^{*}_{i j_{0}} = b_{j_{0}} (1 - \nu \alpha ^{*}_{j_{0}}) - \sum _{i=1}^{n} \sum _{j=1}^{d} b_{i} \alpha ^{*}_{i} \eta ^{*}_{j} \left( \mathcal {K}_{j} \right) _{ij_{0}}, \end{aligned}$$

for some \(j_{0} \in \{ 1, \ldots , n \}\) such that \(0< \alpha ^{*}_{j_{0}} < C\).
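Once an (approximate) saddle point \((\eta ^{*}, \alpha ^{*})\) is available, the prediction rule (53) is straightforward to evaluate. The following is a minimal Python sketch (assuming NumPy; the data layout, i.e. a list of the full \((n+l) \times (n+l)\) kernel matrices together with the indices of the test points, is an assumption made only for this illustration):

```python
import numpy as np

def predict_labels(alpha_star, eta_star, b, kernels, nu, C, test_idx):
    """Evaluation of the prediction rule (53); 'kernels' holds the full
    (n+l) x (n+l) matrices K_1, ..., K_d, with the first n rows/columns
    corresponding to the training data, and 'test_idx' contains n, ..., n+l-1."""
    n = len(b)
    K_star = sum(e * K for e, K in zip(eta_star, kernels))   # combined kernel K*
    # pick some j0 with 0 < alpha*_{j0} < C (assumed to exist) for the offset gamma
    j0 = int(np.argmax((alpha_star > 1e-8) & (alpha_star < C - 1e-8)))
    gamma = b[j0] * (1.0 - nu * alpha_star[j0]) - (b * alpha_star) @ K_star[:n, j0]
    scores = (b * alpha_star) @ K_star[:n, test_idx] + gamma
    return np.sign(scores)
```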

After writing \(x_{i} = \frac{r_{i} \eta _{i}}{c}\) for \(i=1, ..., d\) and augmenting the objective with an additional (strongly) convex penalisation term, we obtain

$$\begin{aligned} \min _{x \in \mathbb {R}^{d}} \max _{y \in \mathbb {R}^{n}} \, \delta _{\Delta }(x) + \frac{\mu }{2} \left\Vert x \right\Vert ^{2} - \frac{1}{2} \sum _{i=1}^{d} x_{i} y^{T} M_{i} y + y^{T} e - \left( \delta _{Y}(y) + \frac{\nu }{2} \left\Vert y \right\Vert ^{2} \right) , \end{aligned}$$
(54)

where \(\mu \ge 0\) and \(M_{i} := \frac{c}{r_{i}} G(\mathcal {K}_{i}^{tr})\) for \(i=1, ..., d\),

$$\begin{aligned} \Delta := \{ x \in \mathbb {R}^{d} \ | \ x \geqq 0, \, \left\langle x, e \right\rangle = 1 \} \end{aligned}$$

is the unit simplex in \(\mathbb {R}^{d}\) and

$$\begin{aligned} Y := \{ y \in \mathbb {R}^{n} \ | \ 0 \leqq y \leqq C, \, \left\langle y, b \right\rangle = 0 \} \end{aligned}$$

is the intersection of a box and a hyperplane.

In the notation of (1) we have \(\Phi : \mathbb {R}^{d} \times \mathbb {R}^{n} \rightarrow \mathbb {R}\cup \{ + \infty \}\) defined by

$$\begin{aligned} \Phi (x, y) = \delta _{\Delta }(x) + \frac{\mu }{2} \left\Vert x \right\Vert ^{2} - \frac{1}{2} \sum _{i=1}^{d} x_{i} y^{T} M_{i} y + y^{T} e, \end{aligned}$$

and \(g: \mathbb {R}^{n} \rightarrow \mathbb {R}\cup \{ +\infty \}\) given by

$$\begin{aligned} g(y) = \delta _{Y}(y) + \frac{\nu }{2} \left\Vert y \right\Vert ^{2}. \end{aligned}$$

We see that \(\Phi\) and g satisfy the assumptions considered for problem (1).

The algorithm (7)–(8) iterates as follows for \(k \ge 0\)

$$\begin{aligned} \begin{aligned} \left\{ \begin{array}{rcl} v_{k} &{}=&{} y_{k} + \sigma _{k} \left[ (1 + \theta _{k}) \nabla _{y}\Phi (x_{k}, y_{k}) - \theta _{k} \nabla _{y}\Phi (x_{k-1}, y_{k-1})\right] ,\\ y_{k+1} &{}=&{} {\text {prox}}^{}_{\sigma _{k} g}\left( v_{k} \right) = P_{Y} \left( \frac{1}{1 + \nu \sigma _{k}} v_{k} \right) ,\\ x_{k+1} &{}=&{} {\text {prox}}^{}_{\tau _{k} \Phi (\, \cdot \,, y_{k+1})}\left( x_{k} \right) = P_{\Delta } \left( \frac{1}{1 + \mu \tau _{k}} (x_{k} + \tau _{k} \xi ^{y_{k+1}}) \right) , \end{array} \right. \end{aligned} \end{aligned}$$

where

$$\begin{aligned} \nabla _{y}\Phi (x, y) = {-} \left( \frac{1}{2} \sum _{i = 1}^{d} x_{i} (M_{i} + M_{i}^{T}) \right) y + e = {-} \left( \sum _{i = 1}^{d} x_{i} M_{i} \right) y + e \ \text{ for } (x,y) \in \Delta \times \mathbb {R}^n \end{aligned}$$

and

$$\begin{aligned} \xi ^{y} := \left( \frac{1}{2} y^{T} M_{i} y \right) _{i=1}^{d}. \end{aligned}$$
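The two projections appearing above are inexpensive: \(P_{\Delta }\) is the classical projection onto the unit simplex, and \(P_{Y}\) reduces to a one-dimensional search for the multiplier of the hyperplane constraint. The following Python sketch (assuming NumPy and a finite C, as in our experiments; the sort-based routine for \(P_{\Delta }\) and the bisection-based routine for \(P_{Y}\) are standard choices of ours, not prescribed by the method) summarises one iteration for problem (54):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    lam = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - lam, 0.0)

def project_Y(v, b, C, iters=100):
    """Euclidean projection onto Y = {0 <= y <= C, <y, b> = 0} with b in {-1, 1}^n:
    y(l) = clip(v - l*b, 0, C) and <y(l), b> is nonincreasing in l, so we bisect."""
    lo, hi = -(np.abs(v).max() + C), np.abs(v).max() + C
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.clip(v - mid * b, 0.0, C) @ b > 0.0:
            lo = mid
        else:
            hi = mid
    return np.clip(v - 0.5 * (lo + hi) * b, 0.0, C)

def ogaprox_svm_step(x_prev, y_prev, x, y, theta, tau, sigma, mu, nu, Ms, b, C):
    """One OGAProx iteration for (54); Ms is the list of the matrices M_i."""
    grad = lambda xx, yy: -sum(xi * Mi for xi, Mi in zip(xx, Ms)) @ yy + 1.0
    v = y + sigma * ((1.0 + theta) * grad(x, y) - theta * grad(x_prev, y_prev))
    y_new = project_Y(v / (1.0 + nu * sigma), b, C)
    xi_y = np.array([0.5 * y_new @ Mi @ y_new for Mi in Ms])   # xi^{y_{k+1}}
    x_new = project_simplex((x + tau * xi_y) / (1.0 + mu * tau))
    return x_new, y_new
```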

To determine the correct step sizes and momentum parameter, we need to find Lipschitz constants for \(\nabla _{y}\Phi\), i.e., \(L_{yx}\), \(L_{yy} \ge 0\) such that (2) holds. Recall that we require for all \((x,y), \, (x',y') \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \times {{\,\mathrm{dom}\,}}g\)

$$\begin{aligned} \left\Vert \nabla _{y}\Phi (x, y) - \nabla _{y}\Phi (x', y') \right\Vert \le L_{yx} \left\Vert x-x' \right\Vert + L_{yy} \left\Vert y - y' \right\Vert , \end{aligned}$$

with \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) = \Delta\) and \({{\,\mathrm{dom}\,}}g = Y\).

Let \((x,y), \, (x',y') \in \Delta \times Y\). Then

$$\begin{aligned} \begin{aligned} \left\Vert \nabla _{y}\Phi (x, y) - \nabla _{y}\Phi (x', y') \right\Vert&= \left\Vert - \sum _{i = 1}^{d} x_{i} M_{i} y + e + \sum _{i = 1}^{d} x'_{i} M_{i} y' - e \right\Vert \\&= \left\Vert \sum _{i = 1}^{d} x_{i} M_{i} y' - \sum _{i = 1}^{d} x_{i} M_{i} y + \sum _{i = 1}^{d} x'_{i} M_{i} y' - \sum _{i = 1}^{d} x_{i} M_{i} y' \right\Vert \\&\le \left\Vert \sum _{i = 1}^{d} x_{i} M_{i} (y - y') \right\Vert + \left\Vert \sum _{i = 1}^{d} (x_{i} - x'_{i}) M_{i} y' \right\Vert \\&\le \sum _{i = 1}^{d} |x_{i} |\left\Vert M_{i} \right\Vert \left\Vert y - y' \right\Vert + \sum _{i = 1}^{d} |x_{i} - x'_{i} |\left\Vert M_{i} \right\Vert \left\Vert y' \right\Vert \\&\le \left( \left\Vert x \right\Vert _{1} \max _{1 \le i \le d} \left\Vert M_{i} \right\Vert \right) \left\Vert y - y' \right\Vert + \left( \left\Vert y' \right\Vert \max _{1 \le i \le d} \left\Vert M_{i} \right\Vert \right) \left\Vert x - x' \right\Vert _{1}\\&\le \left( \left\Vert x \right\Vert _{1} \max _{1 \le i \le d} \left\Vert M_{i} \right\Vert \right) \left\Vert y - y' \right\Vert + \left( \left\Vert y' \right\Vert \sqrt{d} \max _{1 \le i \le d} \left\Vert M_{i} \right\Vert \right) \left\Vert x - x' \right\Vert . \end{aligned} \end{aligned}$$

As \(x \in \Delta\), we have \(\left\Vert x \right\Vert _{1} = 1\) and since \(y' \in Y\) we get \(\left\Vert y' \right\Vert \le C \sqrt{n}\). Thus we obtain

$$\begin{aligned} \left\Vert \nabla _{y}\Phi (x, y) - \nabla _{y}\Phi (x', y') \right\Vert \le L_{yx} \left\Vert x-x' \right\Vert + L_{yy} \left\Vert y - y' \right\Vert , \end{aligned}$$

with

$$\begin{aligned} L_{yx} = C \sqrt{d n} \max _{1 \le i \le d} \left\Vert M_{i} \right\Vert , \quad L_{yy} = \max _{1 \le i \le d} \left\Vert M_{i} \right\Vert . \end{aligned}$$

For our experiments we use four different data sets from the “UCI Machine Learning Repository” [8]: the (original) Wisconsin breast cancer dataset [16] (699 total observations including 16 incomplete examples; 9 features), the Statlog heart disease data set (270 observations; 13 features), the Ionosphere data set (351 observations; 33 features) and the Connectionist Bench Sonar data set (208 observations; 60 features). All the data sets are normalised such that each feature column has zero mean and standard deviation equal to one.

Furthermore, we take \(d = 3\) given kernel functions, namely a polynomial kernel function \(k_{1}(a, a') = (1 + a^{T}a')^{2}\) of degree 2 for \(\mathcal {K}_{1}\), a Gaussian kernel function \(k_{2}(a, a') = \exp ( {-} \frac{1}{2}(a - a')^{T}(a - a')/\frac{1}{10})\) for \(\mathcal {K}_{2}\) and a linear kernel function \(k_{3} (a, a') = a^{T}a'\) for \(\mathcal {K}_{3}\). The resulting kernel matrices are normalised according to [13, Section 4.8], giving

$$\begin{aligned} r_{i} = {{\,\mathrm{trace}\,}}(\mathcal {K}_{i}) = n + l. \end{aligned}$$

The model parameter \(c > 0\) is chosen to be

$$\begin{aligned} c = \sum _{i = 1}^{d} r_{i} = d(n + l), \end{aligned}$$

and we set \(C = 1\).

On this application we test the three proposed versions of OGAProx. We refer to the version of OGAProx with constant parameters from Sect. 3.3.1 as OGAProx-C1, to the one with adaptive parameters from Sect. 3.3.2 as OGAProx-A and to the one from Sect. 4.3 giving linear convergence with constant parameters as OGAProx-C2. The results are compared with those obtained by APD1 and APD2 from [11]. In their experiments on multi kernel SVMs they showed superiority of their method compared to Mirror Prox by [19] in terms of accuracy, runtime and relative error. They also argued that APD yields decent approximations of the solutions of (52) obtained by interior point solvers such as MOSEK [18], while taking about the same amount of runtime.

The main difference between APD and our method OGAProx is that the former employs a gradient step in the first component, whereas the latter uses a purely proximal step. To be able to employ APD2 with adaptive parameters for \(\nu > 0\), the roles of x and y in (54) have to be switched, giving a method different from OGAProx-A. The runtime of both methods is nevertheless very similar, as both use the same number of gradient evaluations/storages and projections per iteration.

All algorithms are initialised with

$$\begin{aligned} x_{0} = x_{-1} = \frac{1}{d} e, \quad y_{0} = y_{-1} = 0. \end{aligned}$$

Each data set is randomly partitioned into 80 % training and 20 % test set. The test set is used to judge the quality of the obtained model by predicting the labels via (53) and computing the resulting test set accuracy (TSA). Note that our theoretical considerations, which only concern convergence of the iterates and of function values, do not guarantee that the TSA converges or increases at all. The reported TSA values are the average over 10 random partitions. Since the TSA occasionally exhibits rather dramatic outliers, we actually compute 12 runs but remove the minimum and maximum values before calculating the mean.

5.2.1 1-norm soft margin classifier

For \(\mu = \nu = 0\) the formulation (52) realises the so-called 1-norm soft margin classifier. In this case g is merely convex and we can only use the constant parameter choice from Sect. 3.3.1, i.e., OGAProx-C1. We compare the results with those obtained by APD1 from [11].

Table 1 TSA of 1-norm soft margin classifier (\(\mu = 0\), \(\nu = 0\), \(C = 1\)) trained with OGAProx-C1 and APD1, averaged over 10 random partitions

In the case of the 1-norm soft margin classifier the results reported in Table 1 paint a clear picture. OGAProx outperforms APD on three out of four data sets and ties on one, achieving maximum TSA values of 97.45 %, 82.78 %, 93.24 % and 85.95 % on Breast cancer, Heart disease, Ionosphere and Sonar, respectively.

5.2.2 2-norm soft margin classifier

For \(\mu = 0\) and \(\nu > 0\) from (52) we obtain the so-called 2-norm soft margin classifier with \(C = 1\). In this case g is \(\nu\)-strongly convex and we can use both the constant parameter choice from Sect. 3.3.1 and the adaptive one from Sect. 3.3.2, giving OGAProx-C1 and OGAProx-A, respectively. This time we compare the results with those obtained by APD1 as well as APD2 from [11].

Table 2 TSA of 2-norm soft margin classifier (\(\mu = 0\), \(\nu = \frac{1}{2}\), \(C = 1\)) trained with OGAProx-C1, OGAProx-A, APD1 and APD2, averaged over 10 random partitions

We see in Table 2 that the situation for the 2-norm soft margin classifier is more diverse than for the 1-norm soft margin classifier. Comparing the two constant methods, OGAProx-C1 and APD1, as well as the two adaptive methods, OGAProx-A and APD2, we see that in both cases OGAProx is better than APD on two out of four data sets and vice versa. Notice that, compared to the results of the 1-norm soft margin classifier with \(\nu = 0\), the two data sets with generally lower TSA, namely Heart disease and Sonar, seem to benefit from the regularising effect of \(\nu > 0\), while those with already very good results do not. In addition, note that the adaptive variant OGAProx-A improves on the result of OGAProx-C1 on three out of four data sets.

5.2.3 Regularised 2-norm soft margin classifier

For \(\mu > 0\) and \(\nu > 0\) from (52) we again obtain the so-called 2-norm soft margin classifier with \(C = 1\), this time, however, in a regularised version. Now not only g but also \(\Phi (\, \cdot \,, y)\) is strongly convex, and we can use all our parameter choices from Sects. 3.3.1, 3.3.2 and 4.3, yielding OGAProx-C1, OGAProx-A and OGAProx-C2, respectively. Once more we compare the results with those obtained by APD1 as well as APD2 from [11], pointing out that OGAProx-C2 has no APD counterpart harnessing the additional strong convexity of the problem.

Table 3 TSA of regularised 2-norm soft margin classifier (\(\mu = 1\), \(\nu = \frac{1}{2}\), \(C = 1\)) trained with OGAProx-C1, OGAProx-A, OGAProx-C2, APD1 and APD2, averaged over 10 random partitions

We see in Table 3 that for the regularised 2-norm soft margin classifier the situation is similar to the version without additional regulariser. This time, for the constant methods, OGAProx-C1 and APD1, OGAProx is better than APD on three data sets while APD is better on only one. On the contrary, for the adaptive methods, OGAProx-A and APD2, it is the other way round: APD performs better than OGAProx on three data sets while OGAProx is better on only one. For the second version of OGAProx with constant parameter choice, which exhibits linear convergence in both iterates and function values, there is no APD counterpart. Comparing the results of OGAProx-C2 to those of OGAProx-C1, we see that the TSA values become better in general, with improvements on three out of four data sets and one draw. On the Breast cancer data set OGAProx-C2 even delivers the maximum TSA over all considered methods.

5.3 Classification incorporating minimax group fairness

We want to classify labelled data \((a_{j}, b_{j})_{j=1}^{n} \subseteq \mathbb {R}^{d} \times \{ \pm 1 \}\), additionally taking into account so-called minimax group fairness [7, 17]. The data is divided into m groups \(G_{1}, ..., G_{m}\), such that for \(i \in [m] := \{ 1, ..., m \}\) we have \(G_{i} = (a_{i_{j}}, b_{i_{j}})_{j=1}^{n_{i}} \subseteq (a_{j}, b_{j})_{j=1}^{n}\) with \(n_{i} := \left|G_{i} \right|\) and \(i_{j} \in [n]\) for all \(i \in [m]\) and all \(j \in [n_{i}]\). Fairness is measured by worst-case outcomes across the considered groups. Hence we consider the following problem,

$$\begin{aligned} \min _{x \in \mathbb {R}^{d}} \max _{i \in [m]} f_{i}(x), \end{aligned}$$
(55)

with

$$\begin{aligned} f_{i}(x) = \frac{1}{n_{i}} \sum _{j=1}^{n_{i}} L(h_{x}(a_{i_{j}}), b_{i_{j}}), \end{aligned}$$

where \(h_{x}\) is a function parametrised by x, mapping features to predicted labels, and L is a loss function measuring the error between the predicted and true labels.

It is easy to see that (55) is equivalent to

$$\begin{aligned} \min _{x \in \mathbb {R}^{d}} \max _{y \in \Delta _{m}} \sum _{i=1}^{m} y_{i} f_{i}(x), \end{aligned}$$

where \(\Delta _{m} := \{ (v_{1}, ..., v_{m}) \in \mathbb {R}^{m} \ | \ \sum _{i=1}^{m} v_{i} = 1, \; v_{i} \ge 0 \text { for } i = 1, ..., m \}\) denotes the probability simplex in \(\mathbb {R}^{m}\). We will work with a linear (affine) predictor \(h_{x}: \mathbb {R}^{d} \rightarrow \mathbb {R}\) given by

$$\begin{aligned} h_{x}(a) = a^{T}x, \end{aligned}$$

with \(x \in \mathbb {R}^{d}\) and \(L: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) being the hinge loss, i.e.,

$$\begin{aligned} L(r,s) = \max \{ 0, 1 - sr \}, \end{aligned}$$

for r, \(s \in \mathbb {R}\).

Combining all of the above we get

$$\begin{aligned} \min _{x \in \mathbb {R}^{d}} \max _{y \in \mathbb {R}^{m}} \Phi (x, y) - g(y), \end{aligned}$$
(56)

with \(\Phi : \mathbb {R}^{d} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) defined by

$$\begin{aligned} \Phi (x, y) = \sum _{i=1}^{m} y_{i} \frac{1}{n_{i}} \sum _{j=1}^{n_{i}} \max \{ 0, 1 - b_{i_{j}} a_{i_{j}}^{T} x \}, \end{aligned}$$

and \(g: \mathbb {R}^{m} \rightarrow \mathbb {R}\cup \{ + \infty \}\) given by

$$\begin{aligned} g(y) = \delta _{\Delta _{m}}(y). \end{aligned}$$

The function g is proper, lower semicontinuous and convex (with modulus \(\nu = 0\)). Furthermore, we observe that \(\Phi (\cdot , y): \mathbb {R}^{d} \rightarrow \mathbb {R}\) is proper, convex and lower semicontinuous for all \(y \in {{\,\mathrm{dom}\,}}g = \Delta _{m}\) and for all \(x \in \Pr _{\mathbb {R}^{d}} ({{\,\mathrm{dom}\,}}\Phi ) = \mathbb {R}^{d}\) we have \({{\,\mathrm{dom}\,}}\Phi (x, \cdot ) = \mathbb {R}^{m}\) and \(\Phi (x, \cdot ): \mathbb {R}^{m} \rightarrow \mathbb {R}\) is concave and Fréchet differentiable. However, note that \(\Phi\) is not differentiable in its first component.

Moreover the Lipschitz condition on the gradient is fulfilled as well. Indeed, for \((x,y), \, (x',y') \in \mathbb {R}^{d} \times \Delta _{m}\) we have

$$\begin{aligned} \left\Vert \nabla _{y} \Phi (x,y) - \nabla _{y} \Phi (x',y') \right\Vert \le L_{yx} \left\Vert x-x' \right\Vert + L_{yy} \left\Vert y-y' \right\Vert , \end{aligned}$$

with

$$\begin{aligned} L_{yx} = \sqrt{ \sum _{i=1}^{m} \frac{1}{n_{i}} \sum _{j=1}^{n_{i}} \left\Vert a_{i_{j}} \right\Vert ^{2} } \quad \text{ and } \quad L_{yy} = 0. \end{aligned}$$

Additionally, with \(\tau > 0\) and \(y \in {{\,\mathrm{dom}\,}}g\), we have for \(x \in \mathbb {R}^{d}\)

$$\begin{aligned} {\text {prox}}^{}_{\tau \Phi (\cdot , y)}\left( x \right) = {{\,\mathrm{arg\,min}\,}}_{u \in \mathbb {R}^{d}} \left\{ \tau \sum _{i=1}^{m} y_{i} \frac{1}{n_{i}} \sum _{j=1}^{n_{i}} \max \{ 0, 1 - b_{i_{j}} a_{i_{j}}^{T} u \} + \frac{1}{2} \left\Vert u - x \right\Vert ^{2} \right\} . \end{aligned}$$

By introducing slack variables for the pointwise maximum, we see that the above minimisation problem is equivalent to the following quadratic program

$$\begin{aligned} \begin{aligned} \min _{\begin{array}{c} u \in \mathbb {R}^{d},\\ r_{ij} \in \mathbb {R}, \\ i \in [m], \; j \in [n_{i}] \end{array}}&\quad \tau \sum _{i=1}^{m} \sum _{j=1}^{n_{i}} y_{i} \frac{1}{n_{i}} r_{ij} + \frac{1}{2} \left\Vert u - x \right\Vert ^{2} \\ \text {s.t.} \quad&\ r_{ij} \ge 0 \quad \forall \, i \in [m], \, \forall \, j \in [n_{i}],\\&\ r_{ij} + b_{i_{j}} a_{i_{j}}^{T} u \ge 1 \quad \forall \, i \in [m], \, \forall \, j \in [n_{i}]. \end{aligned} \end{aligned}$$
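Such a quadratic program can be handed to any standard QP solver. As an illustration, the following minimal Python sketch uses cvxpy (our choice of solver interface; the data layout 'groups', a list of per-group feature matrices and label vectors, is likewise an assumption made only for this sketch):

```python
import numpy as np
import cvxpy as cp

def prox_phi_fairness(x, tau, y, groups):
    """prox_{tau * Phi(., y)}(x) for (56), solved via the QP above;
    'groups' is a list of pairs (A_i, b_i) holding the features (as rows)
    and labels of group G_i."""
    u = cp.Variable(x.shape[0])
    obj = 0.5 * cp.sum_squares(u - x)
    constraints = []
    for y_i, (A_i, b_i) in zip(y, groups):
        r_i = cp.Variable(len(b_i), nonneg=True)              # slacks r_{ij} >= 0
        constraints.append(r_i + cp.multiply(b_i, A_i @ u) >= 1.0)
        obj = obj + tau * (y_i / len(b_i)) * cp.sum(r_i)
    cp.Problem(cp.Minimize(obj), constraints).solve()
    return u.value
```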

For our practical experiments we consider the Statlog heart disease data set (270 observations; 13 features) from the "UCI Machine Learning Repository" [8] and two different groupings: one according to the sex of the patients, the other according to the patients' age. For "sex" we have two groups, namely female patients (Group S1) and male patients (Group S2), whereas for "age" we consider three groups, namely patients younger than 50 years (Group A1), patients at least 50 but younger than 60 years old (Group A2), and patients 60 years of age or older (Group A3). The data set is randomly partitioned into 80 % training data and 20 % test data. The results in Tables 4 and 5 are the achieved test set accuracy (TSA) values averaged over 5 random partitions. For each considered group we state the intragroup TSA together with the overall TSA for the entire test set.

Table 4 TSA of the affine classifier after k iterations of OGAProx for the groups according to “sex”, averaged over five random partitions
Table 5 TSA of the affine classifier after k iterations of OGAProx for the groups according to “age”, averaged over five random partitions

In each case we report the results obtained by the iterates of OGAProx applied to the minimax problem (56) taking the considered groups into account ("with fairness"), as well as the results obtained without taking minimax group fairness into account ("without fairness"), i.e., by solving the problem for a single all-encompassing group \(G_{1} = (a_{j}, b_{j})_{j=1}^{n}\) with \(n_{1} = n\). The latter amounts to minimising the average loss over the whole population and leads to an "ordinary" minimisation problem.

We see in Tables 4 and  5 that taking into account the groups regarding “sex” and “age”, respectively, is beneficial for training the affine classifier. In both cases “with fairness” achieves the highest TSA for each group and at the same time the highest overall TSA as well.