Abstract
In this work we aim to solve a convex-concave saddle point problem, where the convex-concave coupling function is smooth in one variable and nonsmooth in the other and not assumed to be linear in either. The problem is augmented by a nonsmooth regulariser in the smooth component. We propose and investigate a novel algorithm under the name of OGAProx, consisting of an optimistic gradient ascent step in the smooth variable coupled with a proximal step of the regulariser, which is alternated with a proximal step in the nonsmooth component of the coupling function. We consider the convex-concave, convex-strongly concave and strongly convex-strongly concave settings of the saddle point problem under investigation. Regarding the iterates we obtain (weak) convergence, a convergence rate of order \(\mathcal {O}(\frac{1}{K})\) and linear convergence like \(\mathcal {O}(\theta ^{K})\) with \(\theta < 1\), respectively. In terms of function values we obtain ergodic convergence rates of order \(\mathcal {O}(\frac{1}{K})\), \(\mathcal {O}(\frac{1}{K^{2}})\) and \(\mathcal {O}(\theta ^{K})\) with \(\theta < 1\), respectively. We validate our theoretical considerations on a nonsmooth-linear saddle point problem, the training of multi-kernel support vector machines and a classification problem incorporating minimax group fairness.
1 Introduction
Saddle point (or minimax) problems arise traditionally in game theory [23] or, for example, in the context of determining primal-dual pairs of optimal solutions of constrained convex optimisation problems [1]. In recent years, however, they have witnessed increased interest due to many relevant and challenging applications in the field of machine learning, most prominently the training of Generative Adversarial Networks (GANs) [10]. In the classical setting the minimax objective comprises a smooth convex-concave coupling function with Lipschitz continuous gradient and a (potentially nonsmooth) regulariser in each variable, leading to a convex-concave objective in total, even though problems arising in practice are often not of this form.
One method well established in practice, due to its simplicity and computational efficiency, is Gradient Descent Ascent (GDA), either in a simultaneous or in an alternating variant (for a recent comparison of the convergence behaviour of the two schemes we refer to [24]). However, a naive application of GDA is known to lead to oscillatory behaviour or even divergence already in simple cases such as bilinear objectives. Most algorithms with convergence guarantees in the general convex-concave setting make use of the formulation of the first order optimality conditions as a monotone inclusion or variational inequality, treating both components in a symmetric fashion. Examples are the Extragradient method [12], whose application to minimax problems has been studied in [19] under the name of Mirror Prox, and the Forward-Backward-Forward method (FBF) [22], with application to saddle point problems in [3]. Both algorithms have even been successfully applied to the training of GANs (see [3, 9]) but, despite being single-loop methods, suffer in practice from requiring two gradient evaluations per iteration. A possible way to avoid this is to reuse previous gradients. Doing this for FBF recovers, as shown in [3], the Forward-Reflected-Backward method [15], which was applied to saddle point problems under the name of Optimistic Mirror Descent and to GAN training under the name of Optimistic Gradient Descent Ascent [5, 6, 14].
The first method treating general coupling functions with an asymmetric scheme is the Accelerated Primal-Dual Algorithm (APD) by [11], involving an optimistic gradient ascent step in one component which is followed by a gradient descent step in the other one. In the special case of a bilinear coupling function APD recovers the Primal-Dual Hybrid Gradient Method (PDHG) [4]. In the case of the minimax objective being strongly convex-concave acceleration of PDHG is obtained in [4], which is also done for APD in [11], however only under the rather limiting assumption of linearity of the coupling function in one component.
In this paper we introduce a novel algorithm, OGAProx, for solving a convex-concave saddle point problem, where the convex-concave coupling function is smooth in one variable and nonsmooth in the other, and is augmented by a nonsmooth regulariser in the smooth component. OGAProx consists of an optimistic gradient ascent step in the smooth component of the coupling function combined with a proximal step of the regulariser, which is followed by a proximal step of the coupling function in the nonsmooth component. We will also be able to accelerate our method in the convex-strongly concave setting without a linearity assumption on the coupling function. Furthermore, we prove linear convergence if the problem is strongly convex-strongly concave, yielding results similar to those for PDHG [4] in the bilinear case.
In most works so far, nonsmoothness is introduced only via regularisers, as the coupling function is typically accessed through gradient evaluations. Recently there has been another development, although with the saddle point problem not being convex-concave, where the assumption of differentiability of the coupling function in both components is weakened to only one component [2]. As the evaluation of the proximal mapping does not require differentiability, we will assume the coupling function to be smooth in only one component, too.
The remainder of the paper is organised as follows. Next we will introduce the precise problem formulation and the setting we will work with, formulate the proposed algorithm OGAProx and state our contributions. This will be followed by preliminaries in Sect. 2. Afterwards we will discuss the properties of our algorithm in the convex-concave and convex-strongly concave setting and state respective convergence results in Sect. 3. After that we will investigate the convergence of the method under the additional assumption of strong convexity-strong concavity in Sect. 4. The paper will be concluded by numerical experiments in Sect. 5, where we treat a simple nonsmooth-linear saddle point problem, the training of multi kernel support vector machines and a classification problem taking into account minimax group fairness.
1.1 Problem description
Consider the saddle point problem
where \({{\mathcal {H}}}, \, {{\mathcal {G}}}\) are real Hilbert spaces, \(\Phi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) is a coupling function with \({{\,\mathrm{dom}\,}}\Phi := \{(x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\ | \ \Phi (x,y) < +\infty \} \ne \emptyset\) and \(g: {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) a regulariser. Throughout the paper (unless otherwise specified) we will make the following assumptions:
-
g is proper, lower semicontinuous and convex with modulus \(\nu \ge 0\), i.e. \(g - \frac{\nu }{2} \left\Vert \, \cdot \, \right\Vert ^{2}\) is convex (notice that we also allow and consider the situation \(\nu = 0\), in which case g is merely convex; for \(\nu > 0\) the function g is strongly convex);
-
for all \(y \in {{\,\mathrm{dom}\,}}g\), \(\Phi (\, \cdot \,,y): {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) is proper, convex and lower semicontinuous;
-
for all \(x \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) := \{ u \in {{\mathcal {H}}}\ | \ \exists y \in {{\mathcal {G}}}\ \text{ such } \text{ that } \ (u,y) \in {{\,\mathrm{dom}\,}}\Phi \}\) we have that \({{\,\mathrm{dom}\,}}\Phi (x, \, \cdot \,) = {{\mathcal {G}}}\) and \(\Phi (x,\, \cdot \,): {{\mathcal {G}}}\rightarrow \mathbb {R}\) is concave and Fréchet differentiable. Moreover, \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) is closed;
-
there exist \(L_{yx}, \, L_{yy} \ge 0\) such that for all \((x,y), \, (x',y') \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \times {{\,\mathrm{dom}\,}}g\) it holds
$$\begin{aligned} \left\Vert \nabla _{y}\Phi (x, y) - \nabla _{y}\Phi (x', y') \right\Vert \le L_{yx} \left\Vert x-x' \right\Vert + L_{yy} \left\Vert y - y' \right\Vert . \end{aligned}$$(2)
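To make the joint Lipschitz condition (2) concrete, the following sketch checks it numerically for a toy coupling function of our own choosing (not from the paper), \(\Phi(x,y) = y \sin(x)\) on \(\mathbb{R} \times \mathbb{R}\), for which \(\nabla_y \Phi(x,y) = \sin(x)\) and (2) holds with \(L_{yx} = 1\), \(L_{yy} = 0\):

```python
import math
import random

# Toy coupling (our own example, not from the paper): Phi(x, y) = y * sin(x).
# Then grad_y Phi(x, y) = sin(x), and |sin(x) - sin(x')| <= |x - x'|,
# so condition (2) holds with L_yx = 1 and L_yy = 0.
def grad_y_phi(x, y):
    return math.sin(x)

L_yx, L_yy = 1.0, 0.0

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    xp, yp = random.uniform(-10, 10), random.uniform(-10, 10)
    lhs = abs(grad_y_phi(x, y) - grad_y_phi(xp, yp))
    rhs = L_yx * abs(x - xp) + L_yy * abs(y - yp)
    assert lhs <= rhs + 1e-12   # condition (2) on sampled pairs
```

Note that \(L_{yy} = 0\) here because this toy coupling is linear in y; a coupling nonlinear in y would require \(L_{yy} > 0\).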
By convention we set \(+\infty - (+\infty ) := +\infty\). Thus, the situation can be summarised by
We are interested in finding a saddle point of (1), which is a point \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) that fulfils the inequalities
For the remainder we assume that such a saddle point exists.
The assumptions considered above ensure that for any saddle point \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) we have
Finding a saddle point of (1) amounts to solving the necessary and sufficient first order optimality conditions, given by the following coupled inclusion problems
Remark 1
In case \(\Phi\) and g have full domain, \(\Psi\) is a convex-concave function with full domain and the set \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) is obviously closed. However, in order to allow more flexibility and to cover a wider range of problems (see also the last section with numerical experiments), our investigations are carried out in the more general setting given by the assumptions described above. Furthermore, these assumptions allow us to stay in the rigorous setting of the theory of convex-concave saddle functions as described by Rockafellar in [21] (see Definition 4 and Proposition 5 below).
Example 2
Consider the nonsmooth convex optimisation problem with inequality constraints
where \(f : {{\mathcal {H}}} \rightarrow {\mathbb {R}} \cup \{+\infty \}\) is a proper, convex and lower semicontinuous function and \(h_i : {{\mathcal {H}}} \rightarrow {\mathbb {R}}, i=1, ..., m,\) are convex and continuous functions. The Lagrangian attached to (5) reads
Then the saddle point problem
exhibits the structure of saddle point problem (1). It is known that if \((x^{*}, \lambda _1^{*}, ...,\lambda _m^{*})\) is a saddle point of (6), then \(x^{*}\) is an optimal solution of the constrained convex optimisation problem (5) and \((\lambda _1^{*}, ...,\lambda _m^{*})\) is an optimal solution of its Lagrange dual.
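A numerical check of this saddle point characterisation, on a toy instance of (5) of our own choosing: minimise \(f(x) = x^2/2\) subject to \(h(x) = 1 - x \le 0\), whose Lagrangian has the saddle point \((x^*, \lambda^*) = (1, 1)\) by the KKT conditions:

```python
import random

# Toy instance of (5)-(6) (our own example, not from the paper):
#   minimise f(x) = x^2 / 2  subject to  h(x) = 1 - x <= 0.
# Lagrangian: L(x, lam) = x^2 / 2 + lam * (1 - x), with lam >= 0.
def lagrangian(x, lam):
    return 0.5 * x * x + lam * (1.0 - x)

# The KKT conditions give the saddle point (x*, lam*) = (1, 1).
x_star, lam_star = 1.0, 1.0

random.seed(1)
for _ in range(1000):
    x = random.uniform(-5, 5)
    lam = random.uniform(0, 5)     # dual feasibility: lam >= 0
    # saddle inequalities: L(x*, lam) <= L(x*, lam*) <= L(x, lam*)
    assert lagrangian(x_star, lam) <= lagrangian(x_star, lam_star) + 1e-12
    assert lagrangian(x_star, lam_star) <= lagrangian(x, lam_star) + 1e-12
```

Here \(x^* = 1\) is indeed the optimal solution of the constrained problem and \(\lambda^* = 1\) solves its Lagrange dual.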
1.2 Algorithm
The algorithm we investigate performs an optimistic gradient ascent step of \(\Phi\) followed by an evaluation of the proximal mapping of g in the variable y, while it carries out a purely proximal step of \(\Phi\) in x. We will call this method the Optimistic Gradient Ascent – Proximal Point algorithm (OGAProx) in the following. For all \(k \ge 0\) we define
with the conventions \(x_{-1} := x_{0}\) and \(y_{-1} := y_{0}\) for starting points \(x_{0} \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) and \(y_{0} \in {{\,\mathrm{dom}\,}}g\). The particular choices of the sequences \((\sigma _{k})_{k \ge 0}, \, (\tau _{k})_{k \ge 0} \subseteq \mathbb {R}_{++}\) and \((\theta _{k})_{k \ge 0} \subseteq \left( 0, 1 \right]\) will be specified later.
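The update formulas (7)–(8) are given as displays in the paper; the sketch below follows the textual description above (optimistic gradient ascent in y composed with the prox of g, then a proximal step of \(\Phi(\cdot, y_{k+1})\) in x) and illustrates it on a scalar bilinear toy problem of our own choosing, \(\Phi(x,y) = xy\) with \(g \equiv 0\), whose unique saddle point is (0, 0). For this \(\Phi\) the x-step is explicit (cf. Remark 7):

```python
# Sketch of the OGAProx iteration as described in the text:
#   y_{k+1} = prox_{sigma_k g}( y_k + sigma_k [ (1 + theta_k) grad_y Phi(x_k, y_k)
#                                               - theta_k grad_y Phi(x_{k-1}, y_{k-1}) ] )
#   x_{k+1} = prox_{tau_k Phi(., y_{k+1})}( x_k )
# Toy problem (our own example): Phi(x, y) = x * y, g = 0, saddle point (0, 0).
# Here grad_y Phi(x, y) = x, the prox of g = 0 is the identity, and
# prox_{tau Phi(., y)}(x) = x - tau * y (cf. Remark 7).
def ogaprox_bilinear(x0, y0, tau=0.3, sigma=0.3, theta=1.0, iters=500):
    x_prev = x0                 # convention x_{-1} = x_0
    x, y = x0, y0
    for _ in range(iters):
        # optimistic gradient ascent step in y
        y_new = y + sigma * ((1 + theta) * x - theta * x_prev)
        # proximal step of tau * Phi(., y_new) in x
        x_new = x - tau * y_new
        x_prev = x
        x, y = x_new, y_new
    return x, y

x, y = ogaprox_bilinear(1.0, 1.0)
assert abs(x) + abs(y) < 1e-4   # iterates approach the saddle point (0, 0)
```

With these constant parameters the scheme coincides with PDHG on this bilinear toy problem, consistent with Remark 7.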
1.3 Contribution
Let us summarize the main results of this paper:
-
1.
We introduce a novel algorithm to solve saddle point problems with a nonsmooth coupling function, which in general is not assumed to be linear in either component.
-
2.
We prove for the saddle function \(\Psi\) being
-
(a)
convex-concave (see Theorem 9):
-
weak convergence of the generated sequence \((x_{k}, y_{k})_{k \ge 0}\) to a saddle point \((x^{*}, y^{*})\) as \(k \rightarrow + \infty\);
-
convergence of the minimax gap \(\Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K})\) to zero like \(\mathcal {O}(\frac{1}{K})\) as \(K \rightarrow + \infty\), where \((\bar{x}_{K})_{K \ge 1}\) and \((\bar{y}_{K})_{K \ge 1}\) are the ergodic sequences obtained by averaging \((x_k)_{k \ge 1}\) and \((y_k)_{k \ge 1}\), respectively;
-
-
(b)
convex-strongly concave (see Theorem 12):
-
strong convergence of \((y_{k})_{k \ge 0}\) to \(y^{*}\) like \(\mathcal {O}(\frac{1}{k})\) as \(k \rightarrow + \infty\);
-
convergence of the minimax gap \(\Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K})\) to zero like \(\mathcal {O}(\frac{1}{K^{2}})\) as \(K \rightarrow + \infty\);
-
-
(c)
strongly convex-strongly concave (see Theorem 14):
-
linear convergence of \((x_{k}, y_{k})_{k \ge 0}\) to \((x^{*}, y^{*})\) like \(\mathcal {O}(\theta ^{k})\), with \(0< \theta < 1\), as \(k \rightarrow + \infty\);
-
linear convergence of the minimax gap \(\Psi (\bar{x}_{K},y^{*}) - \Psi (x^{*},\bar{y}_{K})\) to zero like \(\mathcal {O}(\theta ^{K})\) as \(K \rightarrow + \infty\).
-
2 Preliminaries
We recall some basic notions in convex analysis and monotone operator theory (see for example [1]). The real Hilbert spaces \({{\mathcal {H}}}\) and \({{\mathcal {G}}}\) are endowed with inner products \(\left\langle \, \cdot \,, \, \cdot \, \right\rangle _{{{\mathcal {H}}}}\) and \(\left\langle \, \cdot \,, \, \cdot \, \right\rangle _{{{\mathcal {G}}}}\), respectively. As it will be clear from the context which one is meant, we will drop the index for ease of notation and write \(\left\langle \, \cdot \,, \, \cdot \, \right\rangle\) for both. The norm induced by the respective inner products is defined by \(\left\Vert \, \cdot \, \right\Vert := \sqrt{\left\langle \, \cdot \,, \, \cdot \, \right\rangle }\).
A function \(f: {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) is said to be proper if \({{\,\mathrm{dom}\,}}f := \{x \in {{\mathcal {H}}}: f(x) < + \infty \} \ne \emptyset\). The (convex) subdifferential of the function \(f:{{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{ + \infty \}\) at \(x \in {{\mathcal {H}}}\) is defined by \(\partial f (x) := \{ u \in {{\mathcal {H}}}\quad \vert \, \left\langle y - x, u \right\rangle + f(x) \le f(y) \ \forall y \in {{\mathcal {H}}}\}\) if \(f(x) \in \mathbb {R}\) and by \(\partial f (x) := \emptyset\) otherwise. If the function f is convex and Fréchet differentiable at \(x \in {{\mathcal {H}}}\), then \(\partial f (x) = \{\nabla f (x)\}\). For the sum of a proper, convex and lower semicontinuous function \(f: {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{+\infty \}\) and a convex and Fréchet differentiable function \(h: {{\mathcal {H}}}\rightarrow \mathbb {R}\) we have \(\partial (f + h)(x) = \partial f(x) + \nabla h(x)\) for all \(x \in {{\mathcal {H}}}\). The subdifferential of the indicator function \(\delta _C\) of a nonempty closed convex set \(C \subseteq {{\mathcal {H}}}\), that is defined as \(\delta _C(x) = 0\) for \(x \in C\) and \(\delta _C(x) = +\infty\) otherwise, is denoted by \(N_C:=\partial \delta _C\) and is called the normal cone of the set C.
Let \(f: {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{ + \infty \}\) be proper, convex and lower semicontinuous. The proximal operator of f is defined by
The proximal operator of the indicator function \(\delta _C\) of a nonempty closed convex set \(C \subseteq {{\mathcal {H}}}\) is the orthogonal projection \(P_C : {{\mathcal {H}}}\rightarrow C\) onto the set C.
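Two classical one-dimensional instances of the proximal operator, useful to keep in mind (standard facts, see for example [1]): the prox of \(\gamma |\cdot|\) is soft-thresholding, and the prox of the indicator of a closed interval is the projection onto it.

```python
# prox_{gamma |.|}(x) = argmin_u gamma*|u| + 0.5*(u - x)^2  (soft-thresholding)
def prox_abs(x, gamma):
    if x > gamma:
        return x - gamma
    if x < -gamma:
        return x + gamma
    return 0.0

# prox of the indicator delta_[a, b] = orthogonal projection onto [a, b]
def proj_interval(x, a, b):
    return min(max(x, a), b)

assert prox_abs(3.0, 1.0) == 2.0        # shrunk towards zero by gamma
assert prox_abs(-0.5, 1.0) == 0.0       # set to zero inside [-gamma, gamma]
assert proj_interval(5.0, -1.0, 1.0) == 1.0
```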
A set-valued operator \(A: {{\mathcal {H}}}\rightrightarrows {{\mathcal {H}}}\) is said to be monotone if for all (x, u) , \((y, v) \in {{\,\mathrm{gra}\,}}A := \{ (z, w) \in {{\mathcal {H}}}\times {{\mathcal {H}}}\, \vert \, w \in A z \}\) we have \(\left\langle x - y, u - v \right\rangle \ge 0\). Furthermore, A is said to be maximal monotone if it is monotone and there exists no monotone operator \(B: {{\mathcal {H}}}\rightrightarrows {{\mathcal {H}}}\) such that \({{\,\mathrm{gra}\,}}A \subsetneqq {{\,\mathrm{gra}\,}}B\). The graph of a maximal monotone operator \(A: {{\mathcal {H}}}\rightrightarrows {{\mathcal {H}}}\) is sequentially closed in the strong \(\times\) weak topology, which means that if \((x_{k}, u_{k})_{k \ge 0}\) is a sequence in \({{\,\mathrm{gra}\,}}A\) such that \(x_{k} \rightarrow x\) and \(u_{k} \rightharpoonup u\) as \(k \rightarrow +\infty\), then \((x, u) \in {{\,\mathrm{gra}\,}}A\). The notation \(u_{k} \rightharpoonup u\) as \(k \rightarrow +\infty\) is used to denote convergence of the sequence \((u_k)_{k \ge 0}\) to u in the weak topology.
To show weak convergence of sequences in Hilbert spaces we use the following so-called Opial Lemma.
Lemma 3
(Opial Lemma [20]) Let \(C \subseteq {{\mathcal {H}}}\) be a nonempty set and \((x_{k})_{k \ge 0}\) a sequence in \({{\mathcal {H}}}\) such that the following two conditions hold:
-
(a)
for every \(x \in C\), \(\lim _{k \rightarrow + \infty } \left\Vert x_{k} - x \right\Vert\) exists;
-
(b)
every weak sequential cluster point of \((x_{k})_{k \ge 0}\) belongs to C.
Then \((x_{k})_{k \ge 0}\) converges weakly to an element in C.
In the following definition we adjust the term proper to the saddle point setting and refer to [21] for further considerations related to saddle functions.
Definition 4
A function \(\Psi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{ \pm \infty \}\) is called a saddle function if \(\Psi (\, \cdot \,, y)\) is convex for all \(y \in {{\mathcal {G}}}\) and \(\Psi (x, \, \cdot \,)\) is concave for all \(x \in {{\mathcal {H}}}\). A saddle function \(\Psi\) is called proper if there exists \((x', y') \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) such that \(\Psi (x', y) < + \infty\) for all \(y \in {{\mathcal {G}}}\) and \(- \infty < \Psi (x, y')\) for all \(x \in {{\mathcal {H}}}\).
We conclude the preliminary section with a useful result regarding the minimax objective from (1).
Proposition 5
The function \(\Psi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\cup \{ \pm \infty \}\) defined via (3) is a proper saddle function such that \(\Psi (\, \cdot \,, y)\) is lower semicontinuous for each \(y \in {{\mathcal {G}}}\) and \(\Psi (x, \, \cdot \,)\) is upper semicontinuous for each \(x \in {{\mathcal {H}}}\). Consequently, the operator
is maximal monotone.
Proof
We choose \((x', y') \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and distinguish four cases.
Firstly, we look at the case \(y' \notin {{\,\mathrm{dom}\,}}g\). Then
thus \(x \mapsto \Psi (x,y')\) is convex and lower semicontinuous, since \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) is convex and closed. Secondly, if \(y' \in {{\,\mathrm{dom}\,}}g\), then \(g(y') \in \mathbb {R}\) and
which means that \(x \mapsto \Psi (x,y')\) is convex and lower semicontinuous. This proves that \(\Psi (\, \cdot \,, y)\) is convex and lower semicontinuous for all \(y \in {{\mathcal {G}}}\).
On the other hand, if \(x' \notin {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\), then
which means that \(y \mapsto \Psi (x', y)\) is upper semicontinuous and concave. Finally, if \(x' \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\), then
Hence \(y \mapsto -\Psi (x', y)\) is proper, convex and lower semicontinuous, and so \(y \mapsto \Psi (x', y)\) is concave and upper semicontinuous. This proves that \(\Psi (x, \, \cdot \,)\) is concave and upper semicontinuous for all \(x \in {{\mathcal {H}}}\).
Moreover, \(\Psi\) is a proper saddle function. Indeed, by assumption we have \(g(y) > - \infty\) for all \(y \in {{\mathcal {G}}}\) and there exists \(x' \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \ne \emptyset\) such that \({{\,\mathrm{dom}\,}}\Phi (x', \, \cdot \,) = {{\mathcal {G}}}\). Thus
Furthermore, by assumption there exists \(y' \in {{\,\mathrm{dom}\,}}g \subseteq {{\mathcal {G}}}\) such that \(g(y') < + \infty\), and for all \(x \in {{\mathcal {H}}}\) we have \(\Phi (x, y') > - \infty\). Hence,
The maximal monotonicity of \((x,y) \mapsto \partial [\Psi (\, \cdot \,,y)](x) \times \partial [\text {-} \Psi (x, \, \cdot \,)](y)\) follows from Corollary 1 and Theorem 3 in [21, pp. 248–249]. \(\square\)
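For a smooth saddle function the operator in Proposition 5 reduces to \((x,y) \mapsto (\nabla_x \Psi(x,y), -\nabla_y \Psi(x,y))\). The following sketch checks its monotonicity numerically on a toy saddle function of our own choosing, \(\Psi(x,y) = \frac{\mu}{2}x^2 + xy - \frac{\nu}{2}y^2\) with \(\mu, \nu > 0\):

```python
import random

# Toy smooth saddle function (our own example):
#   Psi(x, y) = (mu/2) x^2 + x*y - (nu/2) y^2,  mu, nu > 0,
# with saddle operator T(x, y) = (grad_x Psi, -grad_y Psi).
mu, nu = 0.7, 1.3

def T(x, y):
    return (mu * x + y, -(x - nu * y))

random.seed(2)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    xp, yp = random.uniform(-5, 5), random.uniform(-5, 5)
    u, v = T(x, y)
    up, vp = T(xp, yp)
    # monotonicity: <T(z) - T(z'), z - z'> >= 0; here it equals
    # mu*(x - x')^2 + nu*(y - y')^2, so the bound holds exactly
    inner = (u - up) * (x - xp) + (v - vp) * (y - yp)
    assert inner >= -1e-12
```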
3 Convex-(strongly) concave setting
First we will treat the case when the coupling function \(\Phi\) is convex-concave and g is convex with modulus \(\nu \ge 0\). In the case \(\nu = 0\) this corresponds to \(\Psi (x,y) = \Phi (x,y) - g(y)\) being convex-concave, while for \(\nu > 0\) the saddle function \(\Psi (x, y)\) is convex-strongly concave.
We will start by stating two assumptions on the step sizes of the algorithm which will be needed in the convergence analysis. These will be followed by a unified preparatory analysis for general \(\nu \ge 0\) that will be the basis for showing convergence of the iterates as well as of the minimax gap. After that we will introduce a choice of parameters that satisfies the aforementioned assumptions. The section will be closed by convergence results for the convex-concave (\(\nu = 0\)) and the convex-strongly concave (\(\nu > 0\)) setting.
Assumption 1
We assume that the step sizes \(\tau _{k}\), \(\sigma _{k}\) and the momentum parameter \(\theta _{k}\) satisfy
Furthermore, we assume that there exist \(\delta > 0\) and \((\alpha _{k})_{k \ge 0} \subseteq \mathbb {R}_{++}\) such that
where \(\theta _{0} := 1\).
3.1 Preliminary considerations
In this subsection we will make some preliminary considerations that will play an important role when proving the convergence properties of the numerical scheme given by (7)–(8). For all \(k \ge 0\) we will use the notations
We take an arbitrary \((x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and let \(k \ge 0\) be fixed. From (7) we derive
and, as g is convex with modulus \(\nu\), this implies
From (8) we get
hence the convexity of \(\Phi (\, \cdot \,, y)\) for \(y \in {{\,\mathrm{dom}\,}}g\) yields
Combining (13) and (15) we obtain
which, together with the concavity of \(\Phi\) in the second variable and (11), gives
By using (2) we can evaluate the last term in the above expression as follows
with \(\alpha _{k} > 0\) chosen such that (10) holds.
Writing (17) for \(y := y_{k+1}\) and combining the resulting inequality with (16) we derive
where
and
Now, let us define for all \(k \ge 0\)
and notice that
Relation (9) from Assumption 1 is equivalent to
which will be used in telescoping arguments in the following.
Let \(K \ge 1\) and denote
Multiplying both sides of (18) by \(t_{k} > 0\) as defined in (19), followed by summing up the inequalities for \(k = 0, \ldots , K-1\) gives
By Jensen’s inequality, as \(\Psi (\, \cdot \,, y) - \Psi (x, \, \cdot \,)\) is a convex function, we obtain
and thus
Furthermore, using (20), we get for all \(k \ge 0\)
Notice that by (10) in Assumption 1 there exists \(\delta > 0\) such that for all \(k \ge 0\)
For the following recall that \(x_{-1} = x_{0}\) and \(y_{-1} = y_{0}\), which implies \(q_{0} = 0\). By using the above two inequalities in (22) and writing (17) for \(k = K\) we obtain
By definition we have \(t_{0} = 1\) and by (10) that the last term of the above inequality is nonpositive, hence the following estimate for the minimax gap function evaluated at the ergodic sequences holds
With these considerations at hand, in particular (18), (24) and (25), we will be able to obtain convergence statements for the two settings \(\nu = 0\) and \(\nu > 0\).
3.2 Fulfilment of step size assumptions
In this subsection we will investigate a particular choice of parameters that fulfils Assumption 1 and is suitable for both cases \(\nu = 0\) and \(\nu > 0\).
Proposition 6
Let \(\nu \ge 0\), \(c_{\alpha } > L_{yx} \ge 0\), \(\theta _{0} = 1\) and \(\tau _{0}, \, \sigma _{0} > 0\) such that
We define
Then the sequences \((\tau _{k})_{k \ge 0}\), \((\sigma _{k})_{k \ge 0}\) and \((\theta _{k})_{k \ge 0}\) fulfil (9) in Assumption 1 with equality and (10) for
and
Furthermore, for \((t_k)_{k \ge 0}\) defined as in (19) we have
Proof
First, we show that the particular choice (26) fulfils (9) in Assumption 1 with equality. We see that for all \(k \ge 0\)
as well as
follow straightforwardly from the definitions.
Next, we show that (10) in Assumption 1 holds for \(\delta\) defined in (28) with the choices (26) and (27). The first inequality of (10) is equivalent to
which clearly is fulfilled as
On the other hand, the second inequality of (10) is equivalent to
By definition of the step size parameters (26) we have for all \(k \ge 0\)
and thus
This chain of inequalities holds since
Finally, using the definition of \(t_{k}\) and (26) we conclude that for all \(k \ge 0\)
\(\square\)
Remark 7
The choice \(L_{yy} = 0\) in (2), which was considered in [11] in the convex-strongly concave setting, corresponds to the case when the coupling function \(\Phi\) is linear in y. We will prove convergence also for positive \(L_{yy}\), which makes our algorithm applicable to a much wider range of problems, as we will see in the section with the numerical experiments.
When the coupling function \(\Phi : {{\mathcal {H}}}\times {{\mathcal {G}}}\rightarrow \mathbb {R}\) is bilinear, that is \(\Phi (x, y) = \left\langle y, Ax \right\rangle\) for some nonzero continuous linear operator \(A : {{\mathcal {H}}}\rightarrow {{\mathcal {G}}}\), then we are in the setting of [4]. In this situation one can choose \(L_{yy} = 0\) and \(L_{yx} = \Vert A\Vert\), and (28) yields
with \(c_{\alpha } > \Vert A\Vert\). To guarantee \(\delta > 0\) we fix \(0< \varepsilon < 1\) and set
Hence, we need to satisfy
which closely resembles the step size condition of [4, Algorithm 2]. Since \({\text {prox}}^{}_{\gamma \Phi (\cdot , y)}\left( x \right) = x - \gamma A^*y\) for all \((x,y)\in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and all \(\gamma > 0\), our OGAProx scheme becomes the primal-dual algorithm PDHG from [4].
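The closed-form prox used here can be verified numerically: for \(\Phi(u, y) = \langle y, Au \rangle\) the objective \(\gamma \langle y, Au \rangle + \frac{1}{2}\Vert u - x \Vert^2\) is minimised exactly at \(u^* = x - \gamma A^* y\). A sketch with a small random matrix (dimensions chosen arbitrarily for illustration):

```python
import random

# Check prox_{gamma Phi(., y)}(x) = x - gamma * A^T y for bilinear
# Phi(u, y) = <y, A u>, with a small random A (illustrative dimensions).
random.seed(3)
m, n, gamma = 3, 4, 0.5
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
x = [random.uniform(-1, 1) for _ in range(n)]
y = [random.uniform(-1, 1) for _ in range(m)]

def objective(u):
    # gamma * <y, A u> + 0.5 * ||u - x||^2
    Au = [sum(A[i][j] * u[j] for j in range(n)) for i in range(m)]
    return gamma * sum(y[i] * Au[i] for i in range(m)) \
        + 0.5 * sum((u[j] - x[j]) ** 2 for j in range(n))

# closed form: u* = x - gamma * A^T y
u_star = [x[j] - gamma * sum(A[i][j] * y[i] for i in range(m))
          for j in range(n)]

# u* must beat random perturbations of itself (the objective is
# 0.5 * ||u - u*||^2 plus a constant, so u* is the exact minimiser)
for _ in range(200):
    u = [u_star[j] + random.uniform(-1, 1) for j in range(n)]
    assert objective(u_star) <= objective(u) + 1e-12
```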
3.3 Convergence results
In this subsection we combine the preliminary considerations with the choice of parameters (26) from Proposition 6.
We will start with the case \(\nu = 0\) and constant step sizes, which gives weak convergence of the iterates to a saddle point \((x^{*}, y^{*})\) and convergence of the minimax gap evaluated at the ergodic iterates to zero like \(\mathcal {O}(\frac{1}{K})\). Afterwards we will consider the case \(\nu > 0\), which leads to an accelerated version of the algorithm with improved convergence results. In this setting we obtain convergence of \((y_{k})_{k \ge 0}\) to \(y^{*}\) like \(\mathcal {O}(\frac{1}{K})\) and convergence of the minimax gap evaluated at the ergodic iterates to zero like \(\mathcal {O}(\frac{1}{K^{2}})\).
3.3.1 Convex-concave setting
For the following we assume that the function g is convex with modulus \(\nu = 0\), meaning it is merely convex. Using the results of the previous subsection we will show that with the choice (26) all the parameters are constant.
Proposition 8
Let \(c_{\alpha } > L_{yx} \ge 0\) and \(\tau , \, \sigma > 0\) such that
If \(\nu = 0\), then the sequences \((\tau _{k})_{k \ge 0}\), \((\sigma _{k})_{k \ge 0}\) and \((\theta _{k})_{k \ge 0}\) as defined in Proposition 6 are constant, in particular we have
Proof
As \(\nu = 0\), (26) gives for all \(k \ge 0\)
\(\square\)
Next we will state and prove the convergence results in the convex-concave case.
Theorem 9
Let \(c_{\alpha } > L_{yx} \ge 0\) and \(\tau , \, \sigma > 0\) such that
Then the sequence \((x_{k}, y_{k})_{k \ge 0}\) generated by OGAProx with the choice of constant parameters as in Proposition 8, namely,
converges weakly to a saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) of (1). Furthermore, let \(K \ge 1\) and denote
Then for all \(K \ge 1\) and any saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) of (1) we have
Proof
First we will show weak convergence of the sequence of iterates \((x_{k}, y_{k})_{k \ge 0}\) to some saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) of (1). For this we will use the Opial Lemma (see Lemma 3).
Let \(k \ge 0\) and \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point. From (18) together with the choice (30) of constant parameters \(\theta _{k} = 1\), \(\tau _{k} = \tau\), \(\sigma _{k} = \sigma\) and \(\alpha _{k} = \alpha\) we obtain
since
and
We see that (32), writing (17) with \(y = y^{*}\) and (9) in Assumption 1 yield
Furthermore, from (31) and (23) we deduce
Telescoping this inequality and taking into account (33) give
as well as the existence of the limit \(\lim _{k \rightarrow + \infty } a_{k}(x^{*},y^{*}) \in \mathbb {R}\).
From (33) we get that \((x_{k})_{k \ge 0}\) and \((y_{k})_{k \ge 0}\) are bounded sequences. Moreover, by using (2) and (34) in definition (11) we obtain that
From the definition of \(a_{k}(x^{*}, y^{*})\) in (32), (34) and (35) we derive that
Since this is true for an arbitrary saddle point \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\), we have that the first statement of the Opial Lemma holds.
Next we will show that all weak cluster points of \((x_{k},y_{k})_{k \ge 0}\) are in fact saddle points of (1). Assume that \((x_{k_n})_{n \ge 0}\) converges weakly to \(x^{*} \in {{\mathcal {H}}}\) and \((y_{k_n})_{n \ge 0}\) converges weakly to \(y^{*} \in {{\mathcal {G}}}\) as \(n \rightarrow + \infty\). From (14), (11) and (12) we have
where we used that for all \(k \ge 0\) we have \(x_{k} \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi )\) and \(y_{k} \in {{\,\mathrm{dom}\,}}g\). The sequence on the left hand side of the inclusion (36) converges strongly to (0, 0) as \(n\rightarrow +\infty\) [according to (34) and (35)]. Notice that the operator \((x,y) \mapsto \partial [\Psi (\, \cdot \,,y)](x) \times \partial [ {-} \Psi (x, \, \cdot \,)](y)\) is maximal monotone (see Proposition 5), hence its graph is sequentially closed with respect to the strong \(\times\) weak topology. From here we deduce
from which we easily derive that \((x^{*},y^{*})\) is a saddle point as it satisfies (4). This means that also the second statement of the Opial Lemma is fulfilled and we have weak convergence of \((x_{k}, y_{k})_{k \ge 0}\) to a saddle point \((x^{*}, y^{*})\).
The remaining part is to show the convergence rate of the minimax gap of the ergodic sequences. Let \(K \ge 1\) and \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point. Writing (25) for \((x^{*}, y^{*})\) yields
with
Using (29) to get \(t_{k} = 1\) for all \(k \ge 0\) in the above expressions gives
Finally we derive for all \(K \ge 1\)
\(\square\)
3.3.2 Convex-strongly concave setting
For the remainder of this section we assume that the function g is convex with modulus \(\nu > 0\), meaning it is \(\nu\)-strongly convex. In this case the choice (26) leads to adaptive parameters and accelerated convergence.
Proposition 10
Let \(c_{\alpha } > L_{yx} \ge 0\), \(\theta _{0} = 1\) and \(\tau _{0}, \, \sigma _{0} > 0\) such that
If \(\nu > 0\) then \((\tau _{k})_{k \ge 0}\), \((\sigma _{k})_{k \ge 0}\) and \((\theta _{k})_{k \ge 0}\) as defined in Proposition 6 are adaptive, in particular we have
Proof
The statements follow directly from Proposition 6 for \(\nu > 0\). \(\square\)
To obtain statements regarding the (accelerated) convergence rates in the convex-strongly concave setting, we look at the behaviour of the sequences of step size parameters \((\tau _{k})_{k \ge 0}\) and \((\sigma _{k})_{k \ge 0}\) for \(k \rightarrow + \infty\).
Proposition 11
Let \(\theta _{0} = 1\), \(\tau _{0} > 0\),
and for all \(k \ge 0\) denote
Then with the choice of adaptive parameters (37) we have for all \(k \ge 0\)
and for all \(k \ge 1\)
Proof
By (26) we conclude that for all \(k \ge 0\)
and further
which, applied recursively, gives
We obtain
which we will use to show by induction that for all \(k \ge 0\)
For \(k = 0\) the statement trivially holds, whereas for \(k = 1\) we need to verify that
which is equivalent to the following quadratic inequality
and guaranteed to hold by our initial choice of \(\sigma _{0} > 0\). Now let \(k \ge 1\) and assume that (38) holds. Then
This shows the validity of (38) for all \(k \ge 0\).
Now we can use inequality (38) to deduce the convergence behaviour of the sequences \((\tau _{k})_{k \ge 0}\) and \((\sigma _{k})_{k \ge 0}\) for \(k \rightarrow + \infty\). We get for all \(k \ge 0\)
which, combined with
gives for all \(k \ge 1\)
\(\square\)
Now we are ready to prove the convergence results in the convex-strongly concave setting.
Theorem 12
Let \(c_{\alpha } > L_{yx} \ge 0\), \(\theta _{0} = 1\) and \(\tau _{0}, \, \sigma _{0} > 0\) such that
Let \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be a saddle point of (1). Then for \((x_{k}, y_{k})_{k \ge 0}\) being the sequence generated by OGAProx with the choice of adaptive parameters
we have for all \(K \ge 1\)
with \(c_{1} := \sqrt{\frac{18}{\nu ^{2} \sigma _{0} \delta }}\), where \(\delta > 0\) is defined in (28). Furthermore, for \(K \ge 1\), denote
where \(t_k = \frac{\tau _k}{\tau _0}\) for all \(k \ge 0\) [see also (29)]. Then for all \(K \ge 2\) it holds
with \(c_{2} := \frac{12}{\nu \sigma _{0}}\).
Proof
Let \(K \ge 1\) and let \((x^{*},y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point. First we will prove the convergence rate of the sequence of iterates \((y_{k})_{k \ge 0}\). Plugging the particular choice of parameters (37) into (24) for \((x^{*}, y^{*})\), we obtain
where we use (10) in Assumption 1 for the last inequality. Combining this with (38) we derive
with \(c_{1} := \sqrt{\frac{18}{\nu ^{2} \sigma _{0} \delta }}\).
Next we will show the convergence rate of the minimax gap at the ergodic sequences. Writing (25) for \((x^{*}, y^{*})\), we obtain
Plugging the particular choice of \(t_{k} = \frac{\tau _{k}}{\tau _{0}}\) for all \(k \ge 0\) from (29) into the definition of \(T_{K}\), together with (39) yields
Combining this inequality with (40), we obtain for all \(K \ge 2\)
with \(c_{2} := \frac{12}{\nu \sigma _{0}}\), which concludes the proof. \(\square\)
4 Strongly convex-strongly concave setting
For this section we assume that the function g is convex with modulus \(\nu > 0\), meaning it is \(\nu\)-strongly convex. In addition to the assumptions made so far, we also assume that for all \(y \in {{\,\mathrm{dom}\,}}g\) the function \(\Phi (\, \cdot \,,y): {{\mathcal {H}}}\rightarrow \mathbb {R}\cup \{ + \infty \}\) is strongly convex with modulus \(\mu > 0\). This means that the saddle function \((x,y) \mapsto \Psi (x, y)\) is strongly convex-strongly concave.
As in the previous section we will state two step size assumptions that will be needed for the convergence analysis. These again will be followed by preparatory observations and a result to guarantee the validity of the stated assumptions. The section will be closed with the formulation and proof of convergence results.
Assumption 2
We assume that the step sizes \(\tau _{k}\), \(\sigma _{k}\) and the momentum parameter \(\theta _{k}\) are constant
and satisfy
with
Furthermore, we assume that there exists \(\alpha > 0\) such that
with
4.1 Preliminary considerations
We take an arbitrary \((x,y) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) and let \(k \ge 0\). Following similar considerations along (13)–(15), additionally taking into account the \(\mu\)-strong convexity of \(\Phi (\, \cdot \,, y)\) for \(y \in {{\,\mathrm{dom}\,}}g\), instead of (16) we derive
By (41) in Assumption 2 and for \(\alpha >0\) fulfilling (43)–(44), we obtain
which together with (43) and (44) gives
where
Let \(K \ge 1\) and as in (21) denote
with \(t_{k} > 0\) defined as in (19), in other words
Multiplying both sides of (45) by \(t_{k} > 0\) yields
Summing up the above inequality for \(k = 0, \ldots , K-1\) and taking into account Jensen’s inequality for the convex function \(\Psi (\, \cdot \,, y) - \Psi (x, \, \cdot \,)\) give
where in the second inequality we use (17). Omitting the last two terms, which are nonpositive by (43), we obtain for all \(K \ge 1\)
which we will use to obtain our convergence results in the following.
4.2 Fulfilment of step size assumptions
In this subsection we will investigate a particular choice of parameters \(\tau\), \(\sigma\) and \(\theta\) such that Assumption 2 holds.
Proposition 13
For \(\alpha >0\) define
Let \(\theta > 0\) such that
and set
Then \(\tau\), \(\sigma\) and \(\theta\) fulfil Assumption 2.
Proof
If \(L_{yx} = L_{yy} =0\), then the conclusion follows immediately. Assume that \(L_{yx} + L_{yy} >0\). It is easy to verify that definition (47) yields
and that (49) is equivalent to (41) where (42) is ensured by (48). Furthermore, plugging the specific form of the step sizes (49) into (43) we obtain for the first inequality of (43)
which is equivalent to
Note that by (48) we have
Similarly, the second inequality of (43) is equivalent to the following quadratic inequality
The nonnegative solution of the associated quadratic equation reads
Since
the second inequality of (43) is also fulfilled. In order to see that
we notice that this inequality is equivalent to
which holds if and only if
or, equivalently,
For the remaining condition (44) to hold we need to ensure
For this we observe that
In conclusion, we obtain the following chain of inequalities
which is satisfied by (48). \(\square\)
4.3 Convergence results
Now we can combine the previous results and prove the convergence statements in the strongly convex-strongly concave setting.
Theorem 14
Let \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be a saddle point of (1). Then for \((x_{k}, y_{k})_{k \ge 0}\) being the sequence generated by OGAProx with the choice of parameters
with
for \(\alpha >0\), we denote for \(K \ge 1\)
for which the following holds
where \(\tilde{\sigma } := \frac{\sigma }{1 - \theta \sigma (\alpha L_{yx} + L_{yy})}\).
Proof
Let \(K \ge 1\) and \((x^{*}, y^{*}) \in {{\mathcal {H}}}\times {{\mathcal {G}}}\) be an arbitrary but fixed saddle point of (1). Writing (46) for \((x^{*}, y^{*})\) we get
Using
finally we obtain for all \(K \ge 1\)
with \(0< \theta < 1\) as defined in (48). \(\square\)
5 Numerical experiments
In this section we treat three numerical applications of our method. The first one is of rather simple structure and serves to highlight the convergence rates obtained in the previous sections. The second one concerns multi kernel support vector machines and validates OGAProx on a practically more relevant application, even though there are no theoretical guarantees for the “metric” reported there. The third one addresses a classification problem incorporating minimax group fairness, which reduces to solving a minimax problem with nonsmooth coupling function.
5.1 Nonsmooth-linear problem
The first application serves to showcase the convergence rates obtained in the previous sections and acts as a simple proof of concept. We look at the following nonsmooth-linear saddle point problem
with \(\nu \ge 0\) and \(A \in \mathbb {R}^{d \times n}\), \([ \, \cdot \,]_{+}\) being the component-wise positive part,
and C being the following convex polytope
For \(u = (u_{i})_{i=1}^{d}\), \(v = (v_{i})_{i=1}^{d} \in \mathbb {R}^{d}\) the relation \(u \geqq v\) denotes component-wise inequalities, namely,
Then \(g: \mathbb {R}^{n} \rightarrow \mathbb {R}\cup \{ + \infty \}\) with
is proper, lower semicontinuous and convex with modulus \(\nu \ge 0\) and \({{\,\mathrm{dom}\,}}g = C\). Moreover, \(\Phi : \mathbb {R}^{d} \times \mathbb {R}^{n} \rightarrow \mathbb {R}\) with
has full domain, for all \(x \in \mathbb {R}^{d}\) we have that \(\Phi (x, \, \cdot \,)\) is linear and for all \(y \in {{\,\mathrm{dom}\,}}g = C\) the function \(\Phi (\, \cdot \,, y)\) is convex and continuous.
Furthermore, we obtain for all \((x, y), \, (x', y') \in \mathbb {R}^{d} \times {{\,\mathrm{dom}\,}}g\)
hence (2) holds with \(L_{yx} = \left\Vert A \right\Vert\) and \(L_{yy} = 0\).
The algorithm (7)–(8) iterates for \(k \ge 0\)
where the calculation of the orthogonal projection on the set C is a simple quadratic program and
where, for \(i=1, ..., d\),
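As noted above, the orthogonal projection onto C amounts to a small quadratic program. The following sketch illustrates this for a generic polytope \(\{ y : Gy \leqq h \}\); the matrix G, the vector h and the use of SciPy's SLSQP solver are our own illustrative choices, not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

def project_polytope(z, G, h):
    """Orthogonal projection of z onto {y : G y <= h}, computed as the
    quadratic program  min_y 0.5 * ||y - z||^2  s.t.  G y <= h."""
    cons = {"type": "ineq", "fun": lambda y: h - G @ y, "jac": lambda y: -G}
    res = minimize(lambda y: 0.5 * np.sum((y - z) ** 2),
                   x0=np.zeros_like(z),
                   jac=lambda y: y - z,
                   constraints=[cons], method="SLSQP")
    return res.x

# Hypothetical small instance: project (2, 2) onto {y in R^2 : y >= 0, y_1 + y_2 <= 1}.
G = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
h = np.array([0.0, 0.0, 1.0])
p = project_polytope(np.array([2.0, 2.0]), G, h)
```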
By writing the first order optimality conditions and using Lagrange duality we obtain the following characterisation.
This means that for \(\nu = 0\) we obtain
whereas for \(\nu > 0\)
If \(A \in \mathbb {R}^{d \times n}\) has full row rank the inclusion
is equivalent to
For the experiments we choose dimensions \(d = 250\) and \(n = 350\). For easier validation of the solution \(x^{*}\) we ensure that the matrix \(A \in \mathbb {R}^{d \times n}\), with entries drawn from a uniform distribution on the interval \([-3, 3]\), has full row rank. The starting points \(x_{0} = x_{-1} \in \mathbb {R}^{d}\) and \(y_{0} \in \mathbb {R}^{n}\) have entries drawn from a uniform distribution on the interval \([-5, 5]\).
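The random problem instance described above can be generated as follows (a minimal sketch; the random seed and the redraw loop ensuring full row rank are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 250, 350  # dimensions used in the experiments

# Draw A uniformly from [-3, 3] and redraw until it has full row rank,
# as required for the solution characterisation above.
A = rng.uniform(-3.0, 3.0, size=(d, n))
while np.linalg.matrix_rank(A) < d:
    A = rng.uniform(-3.0, 3.0, size=(d, n))

# Starting points with entries uniform on [-5, 5]; note x_0 = x_{-1}.
x0 = rng.uniform(-5.0, 5.0, size=d)
x_prev = x0.copy()
y0 = rng.uniform(-5.0, 5.0, size=n)
```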
In the case \(\nu = 0\), i.e., the regulariser g being merely convex, we proved weak asymptotic convergence of the iterates to some saddle point \((x^{*}, y^{*})\) and convergence of the minimax gap at the ergodic sequences to zero like \(\mathcal {O} (\frac{1}{K})\) for any saddle point. The latter is illustrated in Fig. 1 for \((x^{*}, y^{*}) \in \mathbb {R}^{d} \times \mathbb {R}^{n}\) with \(x^{*} \leqq 0\) and \(y^{*} \in C\) with \(y^{*} \ne 0\) for a single random initialisation.
Let \((x^{*}, y^{*}) \in \mathbb {R}^{d} \times \mathbb {R}^{n}\) be a saddle point. In the case \(\nu > 0\), i.e., the regulariser g being \(\nu\)-strongly convex, we proved strong non-asymptotic convergence of the sequence \((y_{k})_{k \ge 0} \rightarrow y^{*}\) like \(\mathcal {O} (\frac{1}{K})\) and convergence of the minimax gap at the ergodic sequences to zero like \(\mathcal {O} (\frac{1}{K^{2}})\). The numerical behaviour of our method validating the theoretical claims for \(\nu > 0\) is highlighted in Fig. 2. The plots shown are for a single random initialisation and with the choice \(\nu = \frac{3}{10}\).
5.2 Multi kernel support vector machine
The second application to test our method in practice is to learn a combined kernel matrix for a multi kernel support vector machine (SVM). We have a set of labelled training data
where we set \(b = (b_{i})_{i = 1}^{n}\), and a set of unlabelled test data
We consider embeddings of the data according to a kernel function \(\kappa : \mathbb {R}^{m} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) with the corresponding symmetric and positive semidefinite kernel matrix
where \(\mathcal {K}_{ij} = \kappa (a_{i}, a_{j})\) for \(i, \, j = 1, \ldots , n, n + 1, \ldots , n+l\).
In the following e is a vector of appropriate size consisting of ones. According to [13] the problem of interest is
where \(\mathbb {K}\) is the model class of kernel matrices, \(c \in (0, +\infty )\), \(C \in (0, +\infty ]\) and \(\nu \in [0, +\infty )\) are model parameters and we define \(G(\mathcal {K}^{tr}) := {{\,\mathrm{diag}\,}}(b) \mathcal {K}^{tr} {{\,\mathrm{diag}\,}}(b)\).
The set \(\mathbb {K}\) is restricted to be the set of positive semidefinite matrices that can be written as a nonnegative linear combination of kernel matrices \(\mathcal {K}_{1}, \ldots , \mathcal {K}_{d}\), i.e.,
With this choice (51) becomes
where \(\eta = (\eta _{i})_{i=1}^{d}\) and \(r = (r_{i})_{i=1}^{d}\) with \(r_{i} = {{\,\mathrm{trace}\,}}(\mathcal {K}_{i})\) for \(i=1, ..., d\). Assume \((\eta ^{*}, \alpha ^{*}) \in \mathbb {R}^{d} \times \mathbb {R}^{n}\) to be a saddle point of (52) and write
Following the considerations of [11] we compute for \(a_{k} \in T_{l}\) with \(k \in \{ n+1, \ldots , n+l \}\),
with
for some \(j_{0} \in \{ 1, \ldots , n \}\) such that \(0< \alpha ^{*}_{j_{0}} < C\).
After writing \(x_{i} = \frac{r_{i} \eta _{i}}{c}\) for \(i=1, ..., d\) and augmenting the objective with an additional (strongly) convex penalisation term, we obtain
where \(\mu \ge 0\) and \(M_{i} := \frac{c}{r_{i}} G(\mathcal {K}_{i}^{tr})\) for \(i=1, ..., d\),
is the m-dimensional unit simplex and
is the intersection of a box and a hyperplane.
In the notation of (1) we have \(\Phi : \mathbb {R}^{d} \times \mathbb {R}^{n} \rightarrow \mathbb {R}\cup \{ + \infty \}\) defined by
and \(g: \mathbb {R}^{n} \rightarrow \mathbb {R}\cup \{ +\infty \}\) given by
We see that \(\Phi\) and g satisfy the assumptions considered for problem (1).
The algorithm (7)–(8) iterates as follows for \(k \ge 0\)
where
and
To determine the correct step sizes and momentum parameter, we need to find Lipschitz constants for \(\nabla _{y}\Phi\), i.e., \(L_{yx}\), \(L_{yy} \ge 0\) such that (2) holds. Recall that we require for all \((x,y), \, (x',y') \in {{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) \times {{\,\mathrm{dom}\,}}g\)
with \({{\,\mathrm{Pr}\,}}_{{{\mathcal {H}}}} ({{\,\mathrm{dom}\,}}\Phi ) = \Delta\) and \({{\,\mathrm{dom}\,}}g = Y\).
Let \((x,y), \, (x',y') \in \Delta \times Y\). Then
As \(x \in \Delta\), we have \(\left\Vert x \right\Vert _{1} = 1\) and since \(y' \in Y\) we get \(\left\Vert y' \right\Vert \le C \sqrt{n}\). Thus we obtain
with
For our experiments we use four different data sets from the “UCI Machine Learning Repository” [8]: the (original) Wisconsin breast cancer dataset [16] (699 total observations including 16 incomplete examples; 9 features), the Statlog heart disease data set (270 observations; 13 features), the Ionosphere data set (351 observations; 33 features) and the Connectionist Bench Sonar data set (208 observations; 60 features). All the data sets are normalised such that each feature column has zero mean and standard deviation equal to one.
Furthermore, we take \(d = 3\) given kernel functions, namely a polynomial kernel function \(k_{1}(a, a') = (1 + a^{T}a')^{2}\) of degree 2 for \(\mathcal {K}_{1}\), a Gaussian kernel function \(k_{2}(a, a') = \exp ( {-} \frac{1}{2}(a - a')^{T}(a - a')/\frac{1}{10})\) for \(\mathcal {K}_{2}\) and a linear kernel function \(k_{3} (a, a') = a^{T}a'\) for \(\mathcal {K}_{3}\). The resulting kernel matrices are normalised according to [13, Section 4.8], giving
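The three kernel matrices above can be computed as in the following sketch. The normalisation \(\mathcal {K}_{ij} / \sqrt{\mathcal {K}_{ii}\mathcal {K}_{jj}}\), which makes all diagonal entries equal to one, is our reading of [13, Section 4.8]:

```python
import numpy as np

def kernel_matrices(Adata):
    """Compute the polynomial, Gaussian and linear kernel matrices on the
    rows of Adata and normalise them to unit diagonal (assumed form of
    the normalisation in [13, Section 4.8])."""
    lin = Adata @ Adata.T                          # k_3: linear kernel a^T a'
    poly = (1.0 + lin) ** 2                        # k_1: polynomial kernel of degree 2
    sq = np.sum(Adata ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * lin  # squared pairwise distances
    gauss = np.exp(-0.5 * dist2 / 0.1)             # k_2: Gaussian kernel with scale 1/10
    kernels = []
    for K in (poly, gauss, lin):
        dK = np.sqrt(np.diag(K))
        kernels.append(K / np.outer(dK, dK))       # K_ij / sqrt(K_ii K_jj)
    return kernels
```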
The model parameter \(c > 0\) is chosen to be
and we set \(C = 1\).
On this application we test the three proposed versions of OGAProx. We refer to the version of OGAProx with constant parameters from Sect. 3.3.1 as OGAProx-C1, to the one with adaptive parameters from Sect. 3.3.2 as OGAProx-A and to the one from Sect. 4.3 giving linear convergence with constant parameters as OGAProx-C2. The results are compared with those obtained by APD1 and APD2 from [11]. In their experiments on multi kernel SVMs they showed superiority of their method compared to Mirror Prox by [19] in terms of accuracy, runtime and relative error. They also argued that APD obtains decent approximations of the solutions of (52) computed by interior point methods such as MOSEK [18], while taking about the same amount of runtime.
The main difference between APD and our method OGAProx is that the former employs a gradient step in the first component whereas the latter uses a purely proximal step. To be able to employ APD2 with adaptive parameters for \(\nu > 0\), the roles of x and y in (54) have to be switched, giving a different method than OGAProx-A. The runtime of both methods is nevertheless very similar, as both use the same number of gradient computations/storages and projections per iteration.
All algorithms are initialised with
Each data set is randomly partitioned into 80 % training and 20 % test set. The test set is used to judge the quality of the obtained model by predicting the labels via (53) and computing the resulting test set accuracy (TSA). Note that the TSA is not guaranteed to converge or increase at all by our theoretical considerations, which only state convergence of the iterates and in terms of function values. The reported TSA values are the average over 10 random partitions. Due to occasional rather dramatic fluctuations of the TSA we actually compute 12 runs, but remove the minimum and maximum values before calculating the mean.
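The trimmed averaging described above can be sketched as follows:

```python
import numpy as np

def trimmed_mean_tsa(tsa_runs):
    """Average the TSA over the runs after discarding the single minimum
    and maximum value, as described in the text for the 12 computed runs."""
    s = np.sort(np.asarray(tsa_runs, dtype=float))
    return s[1:-1].mean()

# Example: out of 12 runs, the outliers 0.50 and 0.99 are removed
# before averaging the remaining ten values.
```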
5.2.1 1-norm soft margin classifier
For \(\mu = \nu = 0\) the formulation (52) realises the so-called 1-norm soft margin classifier. In this case g is merely convex and we can only use the constant parameter choice from Sect. 3.3.1 with the name OGAProx-C1. We compare the results with those obtained by APD1 from [11].
In the case of 1-norm soft margin classifier the results reported in Table 1 paint a clear picture. OGAProx outperforms APD on three out of four data sets and ties on one data set, achieving maximum TSA values of 97.45 %, 82.78 %, 93.24 % and 85.95 % on Breast cancer, Heart disease, Ionosphere and Sonar, respectively.
5.2.2 2-norm soft margin classifier
For \(\mu = 0\) and \(\nu > 0\) from (52) we obtain the so-called 2-norm soft margin classifier with \(C = 1\). In this case g is \(\nu\)-strongly convex and we can use both parameter choices from Sect. 3.3.1 and the one from Sect. 3.3.2 giving OGAProx-C1 and OGAProx-A, respectively. This time we compare the results with those obtained by APD1 as well as APD2 from [11].
We see in Table 2 that the situation for the 2-norm soft margin classifier is more diverse than previously with the 1-norm soft margin classifier. Comparing the two constant-parameter methods, OGAProx-C1 and APD1, as well as the two adaptive methods, OGAProx-A and APD2, we see that in both cases OGAProx is better than APD on two of the four data sets and vice versa. Notice that, compared to the results of the 1-norm soft margin classifier with \(\nu = 0\), the two data sets with in general lower TSA, namely Heart disease and Sonar, seem to benefit from the regularising effect of \(\nu > 0\), while those with already very good results do not. In addition, note that the adaptive variant OGAProx-A improves on the result of OGAProx-C1 on three out of four data sets.
5.2.3 Regularised 2-norm soft margin classifier
For \(\mu > 0\) and \(\nu > 0\) from (52) we again obtain the so-called 2-norm soft margin classifier with \(C = 1\), this time, however, in a regularised version. Now not only g is strongly convex, but also \(\Phi (\, \cdot \,, y)\), and we can use all our parameter choices from Sects. 3.3.1, 3.3.2 and 4.3, yielding OGAProx-C1, OGAProx-A and OGAProx-C2, respectively. Once more we compare the results with those obtained by APD1 as well as APD2 from [11], pointing out that OGAProx-C2 has no APD counterpart harnessing the additional strong convexity of the problem.
We see in Table 3 that for the regularised 2-norm soft margin classifier the situation is similar to the version without additional regulariser. This time, for the constant-parameter methods OGAProx-C1 and APD1, OGAProx is better than APD on three data sets while APD is better on only one. Conversely, for the adaptive methods OGAProx-A and APD2, APD performs better than OGAProx on three data sets while OGAProx is better on only one. For the second version of OGAProx with constant parameter choice, exhibiting linear convergence in both iterates and function values, there is no APD counterpart. Comparing the results of OGAProx-C2 to those of OGAProx-C1, we see that the TSA values improve in general, with gains on three out of four data sets and one draw. On the Breast cancer data set OGAProx-C2 even delivers the maximum TSA over all considered methods.
5.3 Classification incorporating minimax group fairness
We want to classify labelled data \((a_{j}, b_{j})_{j=1}^{n} \subseteq \mathbb {R}^{d} \times \{ \pm 1 \}\), additionally taking into account so-called minimax group fairness [7, 17]. The data is divided into m groups \(G_{1}, ..., G_{m}\), such that for \(i \in [m] := \{ 1, ..., m \}\) we have \(G_{i} = (a_{i_{j}}, b_{i_{j}})_{j=1}^{n_{i}} \subseteq (a_{j}, b_{j})_{j=1}^{n}\) with \(n_{i} := \left|G_{i} \right|\) and \(i_{j} \in [n]\) for all \(i \in [m]\) and all \(j \in [n_{i}]\). Fairness is measured by worst-case outcomes across the considered groups. Hence we consider the following problem,
with
where \(h_{x}\) is a function parametrised by x, mapping features to predicted labels, and L is a loss function measuring the error between the predicted and true labels.
It is easy to see that (55) is equivalent to
where \(\Delta _{m} := \{ (v_{1}, ..., v_{m}) \in \mathbb {R}^{m} \ | \ \sum _{i=1}^{m} v_{i} = 1, \; v_{i} \ge 0 \text { for } i = 1, ..., m \}\) denotes the probability simplex in \(\mathbb {R}^{m}\). We will work with a linear (affine) predictor \(h_{x}: \mathbb {R}^{d} \rightarrow \mathbb {R}\) given by
with \(x \in \mathbb {R}^{d}\) and \(L: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) being the hinge loss, i.e.,
for r, \(s \in \mathbb {R}\).
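The resulting coupling can be sketched as follows, under the assumption that \(f_{i}\) is the average hinge loss over group \(i\) and that the predictor is linear, \(h_{x}(a) = \langle x, a \rangle\) (the bias term is omitted here for brevity):

```python
import numpy as np

def hinge(r, s):
    """Hinge loss L(r, s) = max(0, 1 - r*s)."""
    return np.maximum(0.0, 1.0 - r * s)

def group_losses(x, groups):
    """f_i(x): average hinge loss of the linear predictor over group i,
    where each group is a pair (A_i, b_i) of features and labels."""
    return np.array([hinge(A @ x, b).mean() for (A, b) in groups])

def Phi(x, y, groups):
    """Coupling Phi(x, y) = sum_i y_i * f_i(x); maximising over y in the
    probability simplex recovers the worst-case group loss in (55)."""
    return y @ group_losses(x, groups)
```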
Combining all of the above we get
with \(\Phi : \mathbb {R}^{d} \times \mathbb {R}^{m} \rightarrow \mathbb {R}\) defined by
and \(g: \mathbb {R}^{m} \rightarrow \mathbb {R}\cup \{ + \infty \}\) given by
The function g is proper, lower semicontinuous and convex (with modulus \(\nu = 0\)). Furthermore, we observe that \(\Phi (\cdot , y): \mathbb {R}^{d} \rightarrow \mathbb {R}\) is proper, convex and lower semicontinuous for all \(y \in {{\,\mathrm{dom}\,}}g = \Delta _{m}\) and for all \(x \in \Pr _{\mathbb {R}^{d}} ({{\,\mathrm{dom}\,}}\Phi ) = \mathbb {R}^{d}\) we have \({{\,\mathrm{dom}\,}}\Phi (x, \cdot ) = \mathbb {R}^{m}\) and \(\Phi (x, \cdot ): \mathbb {R}^{m} \rightarrow \mathbb {R}\) is concave and Fréchet differentiable. However, note that \(\Phi\) is not differentiable in its first component.
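Assuming g acts as the indicator function of \(\Delta _{m}\), the proximal step in y reduces to a Euclidean projection onto the probability simplex. The classical sort-and-threshold routine below is our own choice; the text does not prescribe a particular implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex Delta_m,
    i.e. argmin_{p in Delta_m} ||p - v||, via sorting and thresholding."""
    m = len(v)
    u = np.sort(v)[::-1]                 # sort in decreasing order
    css = np.cumsum(u)
    # largest index rho with u_rho + (1 - css_rho) / rho > 0 (1-indexed rho)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + lam, 0.0)
```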
Moreover the Lipschitz condition on the gradient is fulfilled as well. Indeed, for \((x,y), \, (x',y') \in \mathbb {R}^{d} \times \Delta _{m}\) we have
with
Additionally, with \(\tau > 0\) and \(y \in {{\,\mathrm{dom}\,}}g\), we have for \(x \in \mathbb {R}^{d}\)
By introducing slack variables for the pointwise maximum, we see that the above minimisation problem is equivalent to the following quadratic program
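A plausible form of this quadratic program, assuming the weighted group hinge structure and the linear predictor \(h_{x}(a) = \langle x, a \rangle\) sketched above (both assumptions for illustration, with slack variables \(\xi _{ij}\) and prox step size \(\tau > 0\)), is:

```latex
\min_{x \in \mathbb{R}^{d}, \; \xi \geqq 0} \;
  \sum_{i=1}^{m} \frac{y_{i}}{n_{i}} \sum_{j=1}^{n_{i}} \xi_{ij}
  + \frac{1}{2\tau} \left\Vert x - x_{k} \right\Vert^{2}
\quad \text{s.t.} \quad
  \xi_{ij} \ge 1 - b_{i_{j}} \langle x, a_{i_{j}} \rangle
  \quad \text{for all } i \in [m], \; j \in [n_{i}].
```

At a minimiser each slack attains \(\xi _{ij} = \max \{ 0, 1 - b_{i_{j}} \langle x, a_{i_{j}} \rangle \}\), recovering the pointwise maximum of the hinge loss.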
For our practical applications we consider the Statlog heart disease data set (270 observations; 13 features) from the “UCI Machine Learning Repository” [8] with two different groupings: one based on the patients’ sex, the other on the patients’ age. For “sex” we have two groups, female patients (Group S1) and male patients (Group S2), whereas for “age” we consider three groups: patients younger than 50 years (Group A1), patients at least 50 but younger than 60 years (Group A2), and patients 60 years of age or older (Group A3). The data set is randomly partitioned into 80 % training data and 20 % test data. The results in Tables 4 and 5 are the values of the achieved test set accuracy (TSA) averaged over 5 random partitions. For each considered group we state the intragroup TSA together with the overall TSA for the entire test set.
In each case we report the results obtained by the iterates of OGAProx when solving the minimax problem (56) with the considered groups (“with fairness”), as well as the results obtained without taking minimax group fairness into account (“without fairness”), i.e., when solving the problem for a single all-encompassing group \(G_{1} = (a_{j}, b_{j})_{j=1}^{n}\) with \(n_{1} = n\), which amounts to minimising the average loss over the whole population and leads to an “ordinary” minimisation problem.
We see in Tables 4 and 5 that taking into account the groups regarding “sex” and “age”, respectively, is beneficial for training the affine classifier. In both cases “with fairness” achieves the highest TSA for each group and at the same time the highest overall TSA as well.
Data availability
The data that support the findings of this study are available from the corresponding author upon request.
Change history
27 August 2022
Missing Open Access funding information has been added in the Funding Note.
References
Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York (2011)
Boţ, R.I., Böhm, A.: Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax problems (2020). arXiv:2007.13605
Böhm, A., Sedlmayer, M., Csetnek, E.R., Boţ, R.I.: Two steps at a time–taking GAN training in stride with Tseng’s method (2020). arXiv:2006.09033
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40, 120–145 (2011)
Daskalakis, C., Ilyas, A., Syrgkanis, V., Zeng, H.: Training GANs with optimism. In: International conference on learning representations (2018). https://openreview.net/forum?id=SJJySbbAZ
Daskalakis, C., Panageas, I.: The limit points of (optimistic) gradient descent in min-max optimization. Adv. Neural Inf. Process. Syst. 31, 9236–9246 (2018)
Diana, E., Gill, W., Kearns, M., Kenthapadi, K., Roth, A.: Minimax group fairness: algorithms and experiments (2020). arXiv:2011.03108
Dua, D., Graff, C.: UCI machine learning repository. School of Information and Computer Science, University of California (2019). http://archive.ics.uci.edu/ml
Gidel, G., Berard, H., Vignoud, G., Vincent, P., Lacoste-Julien, S.: A variational inequality perspective on generative adversarial networks. In: International conference on learning representations (2019). https://openreview.net/forum?id=r1laEnA5Ym
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014)
Hamedani, E.Y., Aybat, N.S.: A primal-dual algorithm with line search for general convex-concave saddle point problems. SIAM J. Optim. 31(2), 1299–1329 (2021)
Korpelevich, G.M.: The extragradient method for finding saddle points and other problems. Ekonomika i Matematicheskie Metody 12(4), 747–756 (1976)
Lanckriet, G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5, 27–72 (2004)
Liang, T., Stokes, J.: Interaction matters: a note on non-asymptotic local convergence of generative adversarial networks. In: Chaudhuri, K., Sugiyama, M. (eds.) The 22nd international conference on artificial intelligence and statistics, proceedings of machine learning research, vol. 89, pp. 907–915 (2019)
Malitsky, Y., Tam, M.K.: A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30(2), 1451–1472 (2020)
Mangasarian, O.L., Wolberg, W.H.: Cancer diagnosis via linear programming. SIAM News 23(5), 1–18 (1990)
Martinez, N., Bertran, M., Sapiro, G.: Minimax pareto fairness: a multi objective perspective. In: International conference on machine learning, pp. 6755–6764 (2020)
MOSEK ApS: The MOSEK optimization toolbox for MATLAB manual. Version 9.0 (2019). http://docs.mosek.com/9.0/toolbox/index.html
Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings. Bull. Am. Math. Soc. 73(4), 591–597 (1967)
Rockafellar, R.T.: Monotone operators associated with saddle-functions and minimax problems. In: Browder, F.E. (ed.) Nonlinear functional analysis, proceedings of symposia in pure mathematics, vol. 18, pp. 241–250 (1970)
Tseng, P.: Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim. 29(1), 119–138 (1991)
Von Neumann, J., Morgenstern, O.: Theory of games and economic behavior. Princeton University Press, Princeton (1944)
Zhang, G., Wang, Y., Lessard, L., Grosse, R.: Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization (2021). arXiv:2102.09468
Funding
Open access funding provided by Austrian Science Fund (FWF). The work of RIB and of ERC is supported by FWF (Austrian Science Fund), projects W 1260 and P 29809-N32, respectively. The work of MS is supported by FFG (Austrian Research Promotion Agency), project Smart operation of wind turbines under icing conditions.
Cite this article
Boţ, R.I., Csetnek, E.R. & Sedlmayer, M. An accelerated minimax algorithm for convex-concave saddle point problems with nonsmooth coupling function. Comput Optim Appl 86, 925–966 (2023). https://doi.org/10.1007/s10589-022-00378-8