An accelerated minimax algorithm for convex-concave saddle point problems with nonsmooth coupling function

In this work we aim to solve a convex-concave saddle point problem, where the convex-concave coupling function is smooth in one variable and nonsmooth in the other and not assumed to be linear in either. The problem is augmented by a nonsmooth regulariser in the smooth component. We propose and investigate a novel algorithm, named OGAProx, consisting of an optimistic gradient ascent step in the smooth variable coupled with a proximal step of the regulariser, alternated with a proximal step in the nonsmooth component of the coupling function. We consider the convex-concave, convex-strongly concave and strongly convex-strongly concave situations for the saddle point problem under investigation. For the iterates we obtain (weak) convergence, a convergence rate of order $\mathcal{O}(\frac{1}{K})$ and linear convergence of order $\mathcal{O}(\theta^K)$ with $\theta < 1$, respectively.
In terms of function values we obtain ergodic convergence rates of order $\mathcal{O}(\frac{1}{K})$, $\mathcal{O}(\frac{1}{K^2})$ and $\mathcal{O}(\theta^K)$ with $\theta < 1$, respectively. We validate our theoretical considerations on a nonsmooth-linear saddle point problem, the training of multi kernel support vector machines and a classification problem incorporating minimax group fairness.


Introduction
Saddle point (or minimax) problems have witnessed increased interest due to many relevant and challenging applications in machine learning, the most prominent being the training of Generative Adversarial Networks (GANs) [10]. Even though problems arising in practice are often not of this form, in the classical setting the minimax objective comprises a smooth convex-concave coupling function with Lipschitz continuous gradient and a (potentially nonsmooth) regulariser in each variable, leading to a convex-concave objective overall.
One method well established in practice due to its simplicity and computational efficiency is Gradient Descent Ascent (GDA), either in a simultaneous or in an alternating variant (for a recent comparison of the convergence behaviour of the two schemes we refer to [23]). However, naive application of GDA is known to lead to oscillatory behaviour or even divergence already in simple cases such as bilinear objectives. Most algorithms with convergence guarantees in the general convex-concave setting make use of the formulation of the first order optimality conditions as a monotone inclusion or variational inequality, treating both components in a symmetric fashion. Examples are the Extragradient method [12], whose application to minimax problems has been studied in [19] under the name Mirror Prox, and the Forward-Backward-Forward method (FBF) [22], applied to saddle point problems in [3]. Both algorithms have even been successfully applied to the training of GANs (see [9,3]) but, though being single-loop methods, suffer in practice from requiring two gradient evaluations per iteration. A possible way to avoid this is to reuse previous gradients. Doing so for FBF, as shown in [3], recovers the Forward-Reflected-Backward method [15], which was applied to saddle point problems under the name Optimistic Mirror Descent and to GAN training under the name Optimistic Gradient Descent Ascent [6,5,14].
The first method treating general coupling functions with an asymmetric scheme is the Accelerated Primal-Dual Algorithm (APD) of [11], involving an optimistic gradient ascent step in one component followed by a gradient descent step in the other. In the special case of a bilinear coupling function, APD recovers the Primal-Dual Hybrid Gradient method (PDHG) [4]. When the minimax objective is strongly convex-concave, acceleration of PDHG is obtained in [4]; the same is done for APD in [11], however only under the rather limiting assumption that the coupling function is linear in one component.
In this paper we introduce a novel algorithm, OGAProx, for solving a convex-concave saddle point problem where the convex-concave coupling function is smooth in one variable and nonsmooth in the other, augmented by a nonsmooth regulariser in the smooth component. OGAProx consists of an optimistic gradient ascent step in the smooth component of the coupling function combined with a proximal step of the regulariser, followed by a proximal step of the coupling function in the nonsmooth component. We will also be able to accelerate our method in the convex-strongly concave setting without any linearity assumption on the coupling function. Furthermore, we prove linear convergence if the problem is strongly convex-strongly concave, yielding results similar to those for PDHG [4] in the bilinear case.
So far, in most works nonsmoothness is introduced only via regularisers, as the coupling function is typically accessed through gradient evaluations. Recently there has been another development, although with the saddle point problem not being convex-concave, in which the differentiability assumption on the coupling function is weakened from both components to only one [2]. As the evaluation of the proximal mapping does not require differentiability, we too will assume the coupling function to be smooth in only one component.
The remainder of the paper is organised as follows. Next we introduce the precise problem formulation and the setting we work with, formulate the proposed algorithm OGAProx and state our contributions. This is followed by preliminaries in Section 2. Afterwards we discuss the properties of our algorithm in the convex-concave and convex-strongly concave settings and state the respective convergence results in Section 3. After that we investigate the convergence of the method under the additional assumption of strong convexity-strong concavity in Section 4. The paper is concluded by numerical experiments in Section 5, where we treat a simple nonsmooth-linear saddle point problem, the training of multi kernel support vector machines and a classification problem taking into account minimax group fairness.

Problem description
Consider the saddle point problem (1) associated with the saddle function Ψ(x, y) := Φ(x, y) − g(y), where H, G are real Hilbert spaces, Φ : H × G → R ∪ {+∞} is a coupling function with dom Φ := {(x, y) ∈ H × G | Φ(x, y) < +∞} ≠ ∅ and g : G → R ∪ {+∞} is a regulariser. Throughout the paper (unless otherwise specified) we make the following assumptions:
• g is proper, lower semicontinuous and convex with modulus ν ≥ 0, i.e. g − (ν/2)∥ · ∥² is convex (notice that we also allow and consider the situation ν = 0, in which case g is merely convex; otherwise g is strongly convex);
• for all y ∈ dom g, Φ( · , y) : H → R ∪ {+∞} is proper, convex and lower semicontinuous;
• there exist L_yx, L_yy ≥ 0 such that for all (x, y), (x', y') ∈ Pr_H(dom Φ) × dom g it holds that ∥∇_y Φ(x, y) − ∇_y Φ(x', y')∥ ≤ L_yx ∥x − x'∥ + L_yy ∥y − y'∥. (2)
By convention we set +∞ − (+∞) := +∞. We are interested in finding a saddle point of (1); for the remainder we assume that such a saddle point exists. The assumptions considered above ensure the required properties of Ψ at any saddle point (x*, y*) ∈ H × G. Finding a saddle point of (1) amounts to solving the necessary and sufficient first order optimality conditions, given by a pair of coupled inclusion problems.
Remark 1. In case Φ and g have full domain, Ψ is a convex-concave function with full domain and the set Pr_H(dom Φ) is obviously closed. However, in order to allow more flexibility and to cover a wider range of problems (see also the last section with numerical experiments), our investigations are carried out in the more general setting given by the assumptions described above. Furthermore, these assumptions allow us to stay in the rigorous setting of the theory of convex-concave saddle functions as described by Rockafellar in [21] (see Definition 3 and Proposition 4 below).
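For reference, in the notation above, problem (1) and the defining saddle point property take the standard forms:

```latex
\min_{x \in \mathcal{H}} \, \max_{y \in \mathcal{G}} \; \Psi(x,y) := \Phi(x,y) - g(y),
\qquad
\Psi(x^*, y) \,\le\, \Psi(x^*, y^*) \,\le\, \Psi(x, y^*)
\quad \text{for all } (x,y) \in \mathcal{H} \times \mathcal{G},
```

the second display being what it means for $(x^*, y^*)$ to be a saddle point of (1).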

Algorithm
The algorithm we investigate performs an optimistic gradient ascent step of Φ in the variable y followed by an evaluation of the proximal mapping of g, while it carries out a purely proximal step of Φ in x. We call this method the Optimistic Gradient Ascent - Proximal Point algorithm (OGAProx) in the following. For all k ≥ 0 the iterations are defined by (5)-(6), with the conventions x_{−1} := x_0 and y_{−1} := y_0 for starting points x_0 ∈ Pr_H(dom Φ) and y_0 ∈ dom g. The particular choices of the sequences (σ_k)_{k≥0}, (τ_k)_{k≥0} ⊆ R_{++} and (θ_k)_{k≥0} ⊆ (0, 1] will be specified later.
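A schematic sketch of one OGAProx iteration is given below. This is only an illustration of the verbal description above: the optimistic combination (1 + θ_k)∇_yΦ(x_k, y_k) − θ_k∇_yΦ(x_{k−1}, y_{k−1}) and the two proximal oracles are passed in as user-supplied callables, and all function names are ours, not the paper's.

```python
import numpy as np

def ogaprox(grad_y_phi, prox_tau_phi, prox_sigma_g, x0, y0, tau, sigma, theta, iters):
    """Schematic OGAProx loop (constant parameters for simplicity):
    optimistic gradient ascent + prox of g in y, purely proximal step of Phi in x."""
    x_prev, y_prev = x0.copy(), y0.copy()
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        # optimistic (extrapolated) gradient of Phi with respect to y
        v = (1 + theta) * grad_y_phi(x, y) - theta * grad_y_phi(x_prev, y_prev)
        # ascent step in y followed by the proximal mapping of the regulariser g
        y_next = prox_sigma_g(y + sigma * v, sigma)
        # purely proximal step of Phi(. , y_next) in x
        x_next = prox_tau_phi(x, y_next, tau)
        x_prev, y_prev = x, y
        x, y = x_next, y_next
    return x, y
```

On a toy strongly convex-strongly concave instance, e.g. Φ(x, y) = ⟨y, Ax⟩ + (µ/2)∥x∥² and g = (ν/2)∥ · ∥², both proximal oracles have closed forms and the iterates contract towards the unique saddle point at the origin.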

Contribution
Let us summarise the main results of this paper:
1. We introduce a novel algorithm to solve saddle point problems with a nonsmooth coupling function, which is not assumed to be linear in either component.

Preliminaries
We recall some basic notions of convex analysis and monotone operator theory (see for example [1]). The real Hilbert spaces H and G are endowed with inner products ⟨ · , · ⟩_H and ⟨ · , · ⟩_G, respectively. As it will be clear from the context which one is meant, we drop the index for ease of notation and write ⟨ · , · ⟩ for both; the norm induced by the respective inner product is denoted by ∥ · ∥. A set-valued operator A : H ⇒ H is said to be monotone if for all (x, u), (y, v) ∈ gra A := {(z, w) ∈ H × H | w ∈ Az} we have ⟨x − y, u − v⟩ ≥ 0. Furthermore, A is said to be maximal monotone if it is monotone and there exists no monotone operator B : H ⇒ H such that gra A ⊊ gra B. The graph of a maximal monotone operator A : H ⇒ H is sequentially closed in the strong × weak topology, which means that if (x_k, u_k)_{k≥0} is a sequence in gra A such that x_k → x and u_k ⇀ u as k → +∞, then (x, u) ∈ gra A. The notation u_k ⇀ u as k → +∞ denotes convergence of the sequence (u_k)_{k≥0} to u in the weak topology.
To show weak convergence of sequences in Hilbert spaces we use the following so-called Opial Lemma.
Lemma 2. (Opial Lemma [20]) Let C ⊆ H be a nonempty set and (x k ) k≥0 a sequence in H such that the following two conditions hold: (a) for every x ∈ C, lim k→+∞ x k − x exists; (b) every weak sequential cluster point of (x k ) k≥0 belongs to C.
Then (x k ) k≥0 converges weakly to an element in C.
In the following definition we adjust the term proper to the saddle point setting and refer to [21] for further considerations related to saddle functions.
Definition 3. A function Ψ : H × G → R ∪ {±∞} is called a saddle function if Ψ( · , y) is convex for all y ∈ G and Ψ(x, · ) is concave for all x ∈ H. A saddle function Ψ is called proper if there exists (x', y') ∈ H × G such that Ψ(x', y) < +∞ for all y ∈ G and −∞ < Ψ(x, y') for all x ∈ H.
We conclude the preliminary section with a useful result regarding the minimax objective from (1).
Proposition 4. The function Ψ : H × G → R ∪ {±∞} defined via (3) is a proper saddle function such that Ψ( · , y) is lower semicontinuous for each y ∈ G and Ψ(x, · ) is upper semicontinuous for each x ∈ H. Consequently, the associated operator is maximal monotone.
Proof. We choose (x', y') ∈ H × G and distinguish four cases. Firstly, we look at the case y' ∉ dom g. Furthermore, by assumption there exists y'' ∈ dom g ⊆ G such that g(y'') < +∞, and for all x ∈ H we obtain the corresponding estimates.

Convex-(strongly) concave setting

First we treat the case where the coupling function Φ is convex-concave and g is convex with modulus ν ≥ 0. For ν = 0 this corresponds to Ψ(x, y) = Φ(x, y) − g(y) being convex-concave, while for ν > 0 the saddle function Ψ is convex-strongly concave. We start by stating two assumptions on the step sizes of the algorithm which are needed in the convergence analysis. These are followed by a unified preparatory analysis for general ν ≥ 0 that forms the basis for proving convergence of the iterates as well as of the minimax gap. After that we introduce a choice of parameters satisfying the aforementioned assumptions. The section closes with convergence results for the convex-concave (ν = 0) and the convex-strongly concave (ν > 0) settings.
Assumption 1. We assume that the step sizes τ_k, σ_k and the momentum parameter θ_k satisfy conditions (7) and (8) for all k ≥ 0.

Preliminary considerations
In this subsection we make some preliminary considerations that play an important role when proving the convergence properties of the numerical scheme given by (5)-(6). For all k ≥ 0 we use the notations introduced below. We take an arbitrary (x, y) ∈ H × G and let k ≥ 0 be fixed. From (5) we derive, since g is convex with modulus ν, the inequality (11). From (6), the convexity of Φ( · , y) for y ∈ dom g yields (13). Combining (11) and (13) we obtain an estimate which, together with the concavity of Φ in the second variable and (9), gives (14). By using (2) we can evaluate the last term in the above expression as in (15), with α_k > 0 chosen such that (8) holds.
Writing (15) for y := y_{k+1} and combining the resulting inequality with (14), we derive the inequality (16). Now, let us define t_k for all k ≥ 0 as in (17) and notice that relation (7) from Assumption 1 is equivalent to (18), which will be used in telescoping arguments in the following.
Let K ≥ 1 and denote the sums T_K and the ergodic iterates as in (19). Multiplying both sides of (16) by t_k > 0 as defined in (17), followed by summing up the inequalities for k = 0, . . ., K − 1, gives (20). By Jensen's inequality we can pass to the ergodic sequences. Furthermore, using (18), we get a corresponding estimate for all k ≥ 0. Notice that by (8) in Assumption 1 there exists δ > 0 bounding the relevant quantities for all k ≥ 0. For the following recall that x_{−1} = x_0 and y_{−1} = y_0, which implies q_0 = 0. By using the above two inequalities in (20) and writing (15) for k = K, we obtain (22). By definition we have t_0 = 1 and by (8) the last term of the above inequality is nonpositive, hence the estimate (23) for the minimax gap function evaluated at the ergodic sequences holds. With these considerations at hand, specifically (16), (22) and (23), we will be able to obtain convergence statements for the two settings ν = 0 and ν > 0.

Fulfilment of step size assumptions
In this subsection we will investigate a particular choice of parameters to fulfil Assumption 1 which is suitable for both cases of ν = 0 and ν > 0.
Proposition 5. Define the sequences (τ_k)_{k≥0}, (σ_k)_{k≥0} and (θ_k)_{k≥0} by (24). Then these sequences fulfil (7) in Assumption 1 with equality, and (8) holds for δ > 0 given by (26). Furthermore, for (t_k)_{k≥0} defined as in (17) we have the closed form (27).
Proof. First, we show that the particular choice (24) fulfils (7) in Assumption 1 with equality; the required identities for all k ≥ 0 follow straightforwardly by definition. Next, we show that (8) in Assumption 1 holds for δ defined in (26) with the choices (24) and (25). The first inequality of (8) is equivalent to a condition which is clearly fulfilled. On the other hand, the second inequality of (8) is equivalent to a further condition; by definition of the step size parameters (24), the corresponding chain of inequalities holds for all k ≥ 0. Finally, using the definition of t_k and (24), we conclude the closed form (27) for all k ≥ 0.
Remark 6. The choice L_yy = 0 in (2), which was considered in [11] in the convex-strongly concave setting, corresponds to the case when the coupling function Φ is linear in y. We prove convergence also for positive L_yy, which makes our algorithm applicable to a much wider range of problems, as we will see in the section with the numerical experiments.
When the coupling function Φ : H × G → R is bilinear, that is Φ(x, y) = ⟨y, Ax⟩ for some nonzero continuous linear operator A : H → G, then we are in the setting of [4]. In this situation one can choose L_yy = 0 and L_yx = ∥A∥, and (26) yields a corresponding δ with c_α > ∥A∥. To guarantee δ > 0, we fix 0 < ε < 1 and choose the step sizes accordingly; the resulting condition heavily resembles the step size condition of [4, Algorithm 2]. Since prox_{γΦ( · ,y)}(x) = x − γA*y for all (x, y) ∈ H × G and all γ > 0, our OGAProx scheme becomes the primal-dual algorithm PDHG from [4].
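The reduction to PDHG rests on the closed-form proximal mapping of the linear function z ↦ ⟨y, Az⟩, namely x − γA*y. A small numerical check (all variable names ours) confirms that the gradient of the proximal objective vanishes at this point:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y = rng.standard_normal(3)
gamma = 0.7

# closed-form prox of z -> gamma * <y, A z> evaluated at x
z = x - gamma * A.T @ y

# first order optimality: the gradient of <y, A z> + ||z - x||^2 / (2*gamma)
# must vanish at the minimiser z (numerically zero up to floating point)
grad = A.T @ y + (z - x) / gamma
```

The gradient is zero by construction, since (z − x)/γ = −A*y exactly cancels the linear term.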

Convergence results
In this subsection we combine the preliminary considerations with the choice of parameters (24) from Proposition 5. We start with the case ν = 0 and constant step sizes, which gives weak convergence of the iterates to a saddle point (x*, y*) and convergence of the minimax gap evaluated at the ergodic iterates to zero like O(1/K). Afterwards we consider the case ν > 0, which leads to an accelerated version of the algorithm with improved convergence results. In this setting we obtain convergence of (y_k)_{k≥0} to y* like O(1/K) and convergence of the minimax gap evaluated at the ergodic iterates to zero like O(1/K²).

Convex-concave setting
For the following we assume that the function g is convex with modulus ν = 0, meaning it is merely convex. Using the results of the previous subsection we show that with the choice (24) all parameters are constant.
Proposition 7. Let c_α > L_yx ≥ 0 and τ, σ > 0 satisfy the corresponding step size condition. If ν = 0, then the sequences (τ_k)_{k≥0}, (σ_k)_{k≥0} and (θ_k)_{k≥0} as defined in Proposition 5 are constant; in particular we have (28).
Proof. As ν = 0, (24) gives the claimed constant values for all k ≥ 0.
Next we state and prove the convergence results in the convex-concave case.
Theorem 8. Let c_α > L_yx ≥ 0 and τ, σ > 0 be such that (29) holds. Then the sequence (x_k, y_k)_{k≥0} generated by OGAProx with the choice of constant parameters as in Proposition 7, namely (28), converges weakly to a saddle point (x*, y*) ∈ H × G of (1). Furthermore, let K ≥ 1 and denote the ergodic iterates x̄_K := (1/K) Σ_{k=0}^{K−1} x_{k+1} and ȳ_K := (1/K) Σ_{k=0}^{K−1} y_{k+1}. Then for all K ≥ 1 and any saddle point (x*, y*) ∈ H × G of (1), the minimax gap at (x̄_K, ȳ_K) is of order O(1/K).
Proof. First we show weak convergence of the sequence of iterates (x_k, y_k)_{k≥0} to some saddle point (x*, y*) ∈ H × G of (1). For this we use the Opial Lemma (see Lemma 2). Let k ≥ 0 and let (x*, y*) ∈ H × G be an arbitrary but fixed saddle point. From (16), together with the choice (28) of constant parameters, writing (15) with y = y*, and (7) in Assumption 1, we obtain the inequality involving a_k(x*, y*) as defined in (30). Furthermore, from (29) and (21) we deduce a summable bound. Telescoping this inequality and taking into account (31) gives (32) and (33) as well as the existence of the limit lim_{k→+∞} a_k(x*, y*) ∈ R.
From the definition of a_k(x*, y*) in (30), together with (32) and (33), we derive that the limit required in statement (a) of the Opial Lemma exists. Since this is true for an arbitrary saddle point (x*, y*) ∈ H × G, the first statement of the Opial Lemma holds. Next we show that all weak cluster points of (x_k, y_k)_{k≥0} are in fact saddle points of (1). Assume that (x_{k_n})_{n≥0} converges weakly to x* ∈ H and (y_{k_n})_{n≥0} converges weakly to y* ∈ G as n → +∞. From (12), (9) and (10) we obtain the inclusion (34), where we use that for all k ≥ 0 we have x_k ∈ Pr_H(dom Φ) and y_k ∈ dom g. The sequence on the left hand side of the inclusion (34) converges strongly to (0, 0) as n → +∞ (according to (32) and (33)). Notice that the operator from Proposition 4 is maximal monotone, hence its graph is sequentially closed with respect to the strong × weak topology. From here we deduce that (x*, y*) satisfies (4), from which we easily derive that it is a saddle point. This means that also the second statement of the Opial Lemma is fulfilled and we have weak convergence of (x_k, y_k)_{k≥0} to a saddle point (x*, y*).
The remaining part is to show the convergence rate of the minimax gap at the ergodic sequences. Let K ≥ 1 and let (x*, y*) ∈ H × G be an arbitrary but fixed saddle point. Writing (23) for (x*, y*) yields the desired estimate. Using (27) to get t_k = 1 for all k ≥ 0 in the above expressions, we finally derive the claimed O(1/K) bound for all K ≥ 1.

Convex-strongly concave setting
For the remainder of this section we assume that the function g is convex with modulus ν > 0, meaning it is ν-strongly convex.In this case the choice (24) leads to adaptive parameters and accelerated convergence.
To obtain statements regarding the (accelerated) convergence rates in the convex-strongly concave setting, we look at the behaviour of the sequences of step size parameters (τ_k)_{k≥0} and (σ_k)_{k≥0} as k → +∞. With the choice of adaptive parameters (35) we obtain bounds valid for all k ≥ 0 and for all k ≥ 1, respectively.
Proof. By (24) we conclude a recursion for all k ≥ 0 which, applied recursively, gives an explicit expression. We use it to show by induction that (36) holds for all k ≥ 0. For k = 0 the statement trivially holds, whereas for k = 1 we need to verify a quadratic inequality which is guaranteed to hold by our initial choice of σ_0 > 0. Now let k ≥ 1 and assume that (36) holds; the induction step then establishes the validity of (36) for all k ≥ 0. Now we can use inequality (36) to deduce the convergence behaviour of the sequences (τ_k)_{k≥0} and (σ_k)_{k≥0} as k → +∞: we get an estimate for all k ≥ 0 which, combined with a further bound, gives the claimed statement for all k ≥ 1. Now we are ready to prove the convergence results in the convex-strongly concave setting.
Let (x*, y*) ∈ H × G be a saddle point of (1). Then for the sequence (x_k, y_k)_{k≥0} generated by OGAProx with the choice of adaptive parameters (35), we have for all K ≥ 1 a rate of order O(1/K) for the distance of y_K to y*, with constant c_1 := 18/(ν²σ_0δ), where δ > 0 is defined in (26). Furthermore, for K ≥ 1 denote the ergodic iterates as before, where t_k = τ_k/τ_0 for all k ≥ 0 (see also (27)). Then for all K ≥ 2 the minimax gap at the ergodic iterates is of order O(1/K²), with constant c_2 := 12/(νσ_0).
Proof. Let K ≥ 1 and let (x*, y*) ∈ H × G be an arbitrary but fixed saddle point. First we prove the convergence rate of the sequence of iterates (y_k)_{k≥0}. Plugging the particular choice of parameters (35) into (22) for (x*, y*), we obtain an estimate, where we use (8) in Assumption 1 for the last inequality. Combining this with (36) we derive the claimed bound with c_1 := 18/(ν²σ_0δ). Next we show the convergence rate of the minimax gap at the ergodic sequences. Writing (23) for (x*, y*), we obtain a further estimate. Plugging the particular choice t_k = τ_k/τ_0 for all k ≥ 0 from (27) into the definition of T_K, together with (37), yields a lower bound on T_K. Combining this inequality with (38), we obtain the claimed bound for all K ≥ 2 with c_2 := 12/(νσ_0), which concludes the proof.

Strongly convex-strongly concave setting
For this section we assume that the function g is convex with modulus ν > 0, meaning it is ν-strongly convex. In addition to the assumptions made so far, we also assume that for all y ∈ dom g the function Φ( · , y) : H → R ∪ {+∞} is µ-strongly convex with modulus µ > 0. This means that the saddle function (x, y) → Ψ(x, y) is strongly convex-strongly concave. As in the previous section we state two step size assumptions that are needed for the convergence analysis. These are again followed by preparatory observations and a result guaranteeing the validity of the stated assumptions. The section closes with the formulation and proof of convergence results.
Assumption 2. We assume that the step sizes τ, σ and the momentum parameter θ are constant and satisfy (39) with (40). Furthermore, we assume that there exists α > 0 such that (41) holds with 1 − θσ(αL_yx + L_yy) > 0. (42)
Let K ≥ 1 and, as in (19), denote the ergodic iterates, with t_k > 0 defined as in (17). Multiplying both sides of (43) by t_k > 0, summing up the resulting inequalities for k = 0, . . ., K − 1, and taking into account Jensen's inequality for the convex minimax gap function, where in the second inequality we use (15), and omitting the last two terms, which are nonpositive by (41), we obtain for all K ≥ 1 an estimate which we will use to obtain our convergence results in the following.

Fulfilment of step size assumptions
In this subsection we will investigate a particular choice of parameters τ , σ and θ such that Assumption 2 holds.
Proof. If L_yx = L_yy = 0, then the conclusion follows immediately. Assume that L_yx + L_yy > 0. It is easy to verify that definition (45) yields 0 < θ < 1 and that (47) is equivalent to (39), where (40) is ensured by (46). Furthermore, plugging the specific form of the step sizes (47) into (41), the first inequality of (41) follows; note that it is guaranteed by (46). Similarly, the second inequality of (41) is equivalent to a quadratic inequality. Considering the nonnegative solution of the associated quadratic equation, the second inequality of (41) is also fulfilled; to see this, we notice that the resulting inequality holds if and only if a corresponding condition on the parameters is satisfied. For the remaining condition (42) to hold we need to ensure θ > (αL_yx + L_yy − ν)/(αL_yx + L_yy).
For this we observe that the required chain of inequalities is satisfied by (46).

Convergence results
Now we can combine the previous results and prove the convergence statements in the strongly convex-strongly concave setting.

Numerical experiments
In this section we treat three numerical applications of our method. The first one is of rather simple structure and has the purpose of highlighting the convergence rates obtained in the previous sections.
The second one concerns multi kernel support vector machines, validating OGAProx on a more practically relevant application, even though there are no theoretical guarantees for the "metric" reported there.
The third numerical application addresses a classification problem incorporating minimax group fairness, which reduces to solving a minimax problem with a nonsmooth coupling function.

Nonsmooth-linear problem
The first application we treat serves to showcase the convergence rates obtained in the previous sections and provides a simple proof of concept. We look at a nonsmooth-linear saddle point problem with ν ≥ 0 and A ∈ R^{d×n}, where [ · ]_+ denotes the component-wise positive part and C is a convex polytope. The regulariser g is proper, lower semicontinuous and convex with modulus ν ≥ 0, with dom g = C. Moreover, Φ has full domain, for all x ∈ R^d the function Φ(x, · ) is linear, and for all y ∈ dom g = C the function Φ( · , y) is convex and continuous. Furthermore, for all (x, y), (x', y') the condition (2) holds with L_yx = ∥A∥ and L_yy = 0. The algorithm (5)-(6) iterates for k ≥ 0 as displayed, where the calculation of the orthogonal projection onto the set C is a simple quadratic program and where the remaining components are given explicitly for i = 1, . . ., d. By writing the first order optimality conditions and using Lagrange duality we obtain a characterisation of the solutions; in particular, for ν = 0 we obtain the corresponding explicit solution.
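The polytope C above is problem-specific, so the projection step depends on the concrete instance. As an illustrative stand-in (not the paper's set C), the classic sort-and-threshold scheme for the Euclidean projection onto the unit simplex shows how such a polytope projection can be computed in closed form; the function name is ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex {y : y >= 0, sum(y) = 1}
    via the standard sort-and-threshold scheme."""
    n = len(v)
    u = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(u)
    # largest index rho with u_rho * rho > (cumulative sum - 1)
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)   # shift making the result sum to one
    return np.maximum(v - theta, 0.0)
```

For general polytopes of the kind appearing here, one would instead solve the small quadratic program mentioned in the text.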

Multi kernel support vector machine
The second application, serving to test our method in practice, is to learn a combined kernel matrix for a multi kernel support vector machine (SVM). We have a set of labelled training data with label vector b = (b_i)_{i=1}^n, and a set of unlabelled test data. We consider embeddings of the data according to a kernel function κ : R^m × R^m → R with the corresponding symmetric and positive semidefinite kernel matrix K, where K_ij = κ(a_i, a_j) for i, j = 1, . . ., n + l.
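As a concrete instance of such a kernel matrix, the following sketch builds a Gaussian (RBF) Gram matrix from the rows of a data matrix; the particular kernel choice and the function name are illustrative, not prescribed by the text:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2) for the rows of X;
    symmetric and positive semidefinite by construction."""
    sq = np.sum(X**2, axis=1)
    # squared pairwise distances, clipped at zero against rounding noise
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)
```

Any other symmetric positive semidefinite kernel (polynomial, linear, ...) fits the same role of the matrices K_1, . . ., K_d combined below.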
In the following, e denotes a vector of appropriate size consisting of ones. According to [13] the problem of interest is the minimax problem (49), where K is the model class of kernel matrices, c ∈ (0, +∞), C ∈ (0, +∞] and ν ∈ [0, +∞) are model parameters and we define G(K_tr) := diag(b) K_tr diag(b).
The set K is restricted to be the set of positive semidefinite matrices that can be written as a nonnegative linear combination of given kernel matrices K_1, . . ., K_d.

With this choice (49) becomes
where η = (η_i)_{i=1}^d and r = (r_i)_{i=1}^d with r_i = trace(K_i) for i = 1, . . ., d. Assume (η*, α*) ∈ R^d × R^n to be a saddle point of (50). Following the considerations of [11], we compute the predicted label (51) for a_k ∈ T_l with k ∈ {n + 1, . . ., n + l}, for some j_0 ∈ {1, . . ., n} such that 0 < α*_{j_0} < C. After writing x_i = r_i η_i / c for i = 1, . . ., d and augmenting the objective with an additional (strongly) convex penalisation term, we obtain problem (52), where µ ≥ 0, ∆ denotes the unit simplex and Y is the intersection of a box and a hyperplane.
In the notation of (1), the coupling function Φ : ∆ × R^n → R is built from the weighted quadratic forms x_i yᵀM_i y and the linear term yᵀe, and g : R^n → R ∪ {+∞} is the corresponding regulariser. We see that Φ and g satisfy the assumptions considered for problem (1).
The algorithm (5)-(6) iterates as follows for k ≥ 0, where the gradient ∇_yΦ(x, y) is composed of the terms x_i M_i y and e for (x, y) ∈ ∆ × R^n.
To determine the correct step sizes and momentum parameter, we need to find Lipschitz constants for ∇_yΦ, i.e., L_yx, L_yy ≥ 0 such that (2) holds. Recall that we require the bound for all (x, y), (x', y') ∈ Pr_H(dom Φ) × dom g, with Pr_H(dom Φ) = ∆ and dom g = Y.
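The bound (2) can be sanity-checked numerically for a gradient of the form appearing above. The constants below are crude illustrative choices consistent with the estimate derived next (∥x∥_1 ≤ √d ∥x∥ on the simplex, ∥y∥ ≤ C√n on the box), not the paper's exact ones, and the matrices M_i are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, C = 3, 5, 1.0
# symmetric positive semidefinite stand-ins for the matrices M_i
Ms = [B @ B.T for B in (rng.standard_normal((n, n)) for _ in range(d))]

def grad_y(x, y):
    # gradient of the coupling in y; the overall sign is immaterial for (2)
    return sum(xi * (M @ y) for xi, M in zip(x, Ms)) + np.ones(n)

L_max = max(np.linalg.norm(M, 2) for M in Ms)        # largest spectral norm
Lyy = L_max                                          # since x lies in the simplex
Lyx = L_max * C * np.sqrt(n) * np.sqrt(d)            # crude constant, see the text

ok = True
for _ in range(100):
    x1, x2 = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))  # simplex points
    y1, y2 = C * rng.random(n), C * rng.random(n)                  # box [0, C]^n
    lhs = np.linalg.norm(grad_y(x1, y1) - grad_y(x2, y2))
    rhs = Lyx * np.linalg.norm(x1 - x2) + Lyy * np.linalg.norm(y1 - y2)
    ok = ok and lhs <= rhs + 1e-9
```

Every sampled pair satisfies the inequality, matching the analytic derivation that follows.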
Let (x, y), (x', y') ∈ ∆ × Y. As x ∈ ∆ we have ∥x∥_1 = 1, and since y ∈ Y we get ∥y∥ ≤ C√n; thus we obtain the desired constants. For our experiments we use four different data sets from the "UCI Machine Learning Repository" [8], among them the (original) Wisconsin breast cancer dataset [16]. On this application we test the three proposed versions of OGAProx. We refer to the version of OGAProx with constant parameters from Section 3.3.1 as OGAProx-C1, to the one with adaptive parameters from Section 3.3.2 as OGAProx-A, and to the one from Section 4.3 giving linear convergence with constant parameters as OGAProx-C2. The results are compared with those obtained by APD1 and APD2 from [11]. In their experiments on multi kernel SVMs, the authors showed superiority of their method compared to Mirror Prox [19] in terms of accuracy, runtime and relative error. They also argued that with APD they are able to obtain decent approximations of the solutions of (50) computed by interior point methods such as MOSEK [18], taking about the same amount of runtime.
The main difference between APD and our method OGAProx is that the former employs a gradient step in the first component whereas the latter uses a purely proximal step. To be able to employ APD2 with adaptive parameters for ν > 0, the roles of x and y in (52) have to be switched, giving a different method than OGAProx-A. The runtime of both methods is nevertheless very similar, as both use the same number of gradient computations, storages and projections per iteration.
All algorithms are initialised with the same starting points. Each data set is randomly partitioned into 80 % training and 20 % test set. The test set is used to judge the quality of the obtained model by predicting the labels via (51) and computing the resulting test set accuracy (TSA). Note that the TSA is not guaranteed to converge or increase at all by our theoretical considerations, which only state convergence of the iterates and in terms of function values. The reported TSA values are the average over 10 random partitions. Due to occasionally occurring rather dramatic deflections of the TSA, we actually compute 12 runs, but remove the minimum and maximum values before calculating the mean.
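The trimmed averaging just described (run several partitions, drop the single minimum and maximum, average the rest) amounts to a trivial helper; the function name is ours:

```python
import numpy as np

def trimmed_mean(runs):
    """Average over runs after discarding the single smallest
    and single largest value (as done for the reported TSA)."""
    r = np.sort(np.asarray(runs, dtype=float))
    return r[1:-1].mean()
```

This guards the reported averages against the occasional dramatic outlier runs mentioned above.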

1-norm soft margin classifier
For µ = ν = 0 the formulation (50) realises the so-called 1-norm soft margin classifier. In this case g is merely convex and we can only use the constant parameter choice from Section 3.3.1, i.e. OGAProx-C1. We compare the results with those obtained by APD1 from [11].

2-norm soft margin classifier
For µ = 0 and ν > 0 we obtain from (50) the so-called 2-norm soft margin classifier with C = 1. In this case g is ν-strongly convex and we can use both the parameter choice from Section 3.3.1 and the one from Section 3.3.2, giving OGAProx-C1 and OGAProx-A, respectively. This time we compare the results with those obtained by APD1 as well as APD2 from [11]. We see in Table 2 that the situation for the 2-norm soft margin classifier is more diverse than previously with the 1-norm soft margin classifier. Comparing the two constant methods, OGAProx-C1 and APD1, with each other, as well as the two adaptive methods, OGAProx-A and APD2, we see that in both cases OGAProx is better than APD on two out of four data sets and vice versa. Notice that the two data sets with in general lower TSA, namely Heart disease and Sonar, seem to benefit from the regularising effect of ν > 0, while those with already very good results do not, compared to the results of the 1-norm soft margin classifier with ν = 0. In addition, note that the adaptive variant OGAProx-A improves on the result of OGAProx-C1 on three out of four data sets.

Regularised 2-norm soft margin classifier
For µ > 0 and ν > 0 we again obtain from (50) the so-called 2-norm soft margin classifier with C = 1, this time, however, in a regularised version. Now not only g is strongly convex, but also Φ(·, y), and we can use all our parameter choices from Section 3.3.1, Section 3.3.2 and Section 4.3, yielding OGAProx-C1, OGAProx-A and OGAProx-C2, respectively. Once more we compare the results with those obtained by APD1 as well as APD2 from [11], pointing out that OGAProx-C2 has no APD counterpart harnessing the additional strong convexity of the problem. We see in Table 3 that for the regularised 2-norm soft margin classifier the situation is similar to the version without additional regulariser. This time, for the constant methods OGAProx-C1 and APD1, OGAProx is better than APD on three data sets while APD is better on only one. For the adaptive methods OGAProx-A and APD2, on the contrary, it is the other way round: APD2 performs better than OGAProx-A on three data sets while OGAProx-A is better on only one. For the second version of OGAProx with constant parameter choice, exhibiting linear convergence in both iterates and function values, there is no APD counterpart. When we compare the results of OGAProx-C2 to those of OGAProx-C1, we see that the TSA values become better in general, with improvements on three out of four data sets and one draw. On the Breast cancer data set, OGAProx-C2 even delivers the maximum TSA over all considered methods.

Classification incorporating minimax group fairness
We want to classify labelled data (a_j, b_j), j = 1, …, n, in R^d × {±1}, additionally taking into account so-called minimax group fairness [17, 7]. The data is divided into m groups G_1, …, G_m, such that for i ∈ [m] := {1, …, m} we have G_i = (a_{i_j}, b_{i_j}), j = 1, …, n_i, with n_i := |G_i| and i_j ∈ [n] for all i ∈ [m] and all j ∈ [n_i]. Fairness is measured by worst-case outcomes across the considered groups. Hence we consider the following problem,

min_x max_{i ∈ [m]} f_i(x),    (53)

with f_i(x) := (1/n_i) Σ_{j=1}^{n_i} L(h_x(a_{i_j}), b_{i_j}), where h_x is a function parametrised by x, mapping features to predicted labels, and L is a loss function measuring the error between the predicted and the true labels.
It is easy to see that (53) is equivalent to the saddle point problem (54) stated below. For our practical applications we consider the Statlog heart disease data set (270 observations; 13 features) from the "UCI Machine Learning Repository" [8] and consider two different groupings: one consists of the sex of the patients, while the other regards the patients' age. For "sex" we have two groups, that is, female patients (Group S1) and male patients (Group S2), whereas for "age" we consider three groups, that is, patients that are younger than 50 years old (Group A1), patients that are younger than 60 but at least 50 years old (Group A2), and patients that are 60 years of age or older (Group A3). The data set is randomly partitioned into 80 % training data and 20 % test data. The results in Table 4 and Table 5 are the values of the achieved test set accuracy (TSA) averaged over 5 random partitions. For each considered group we state the intragroup TSA together with the overall TSA for the entire test set.
In each case we report the results obtained by the iterates of OGAProx when solving the minimax problem (54) taking into account the considered groups ("with fairness"), as well as the results obtained without taking minimax group fairness into account ("without fairness"), i.e., solving the problem for a single extensive group G_1 = (a_j, b_j), j = 1, …, n, with n_1 = n. The latter amounts to minimising the average loss over the whole population and leads to an "ordinary" minimisation problem.
We see in Table 4 and Table 5 that taking into account the groups regarding "sex" and "age", respectively, is beneficial for training the affine classifier. In both cases "with fairness" achieves the highest TSA for each group and also the highest overall TSA.

and by ∂f(x) := ∅ otherwise. If the function f is convex and Fréchet differentiable at x ∈ H, then ∂f(x) = {∇f(x)}. For the sum of a proper, convex and lower semicontinuous function f : H → R ∪ {+∞} and a convex and Fréchet differentiable function h : H → R we have ∂(f + h)(x) = ∂f(x) + ∇h(x) for all x ∈ H. The subdifferential of the indicator function δ_C of a nonempty closed convex set C ⊆ H, defined as δ_C(x) = 0 for x ∈ C and δ_C(x) = +∞ otherwise, is denoted by N_C := ∂δ_C and is called the normal cone to the set C. Let f : H → R ∪ {+∞} be proper, convex and lower semicontinuous. The proximal operator of f is defined by prox_f : H → H, prox_f(x) := arg min_{y ∈ H} { f(y) + (1/2)‖y − x‖² }. The proximal operator of the indicator function δ_C of a nonempty closed convex set C ⊆ H is the orthogonal projection P_C : H → C onto the set C.
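As a concrete instance of the last statement, here is a sketch of the orthogonal projection onto the probability simplex, i.e., the proximal operator of its indicator function, using the classical sort-based algorithm (our own illustrative implementation):

```python
import numpy as np

def project_simplex(v):
    """Orthogonal projection of v onto the probability simplex
    {x : sum(x) = 1, x >= 0}, i.e., prox of its indicator function."""
    u = np.sort(v)[::-1]                        # sort in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]  # largest index with positive gap
    theta = css[rho] / (rho + 1.0)              # shift making the result sum to 1
    return np.maximum(v - theta, 0.0)
```

A point already in the simplex is its own projection, as expected from prox_{δ_C} = P_C.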
(699 total observations including 16 incomplete examples; 9 features), the Statlog heart disease data set (270 observations; 13 features), the Ionosphere data set (351 observations; 33 features) and the Connectionist Bench Sonar data set (208 observations; 60 features). All the data sets are normalised such that each feature column has zero mean and standard deviation equal to one. Furthermore we take d = 3 given kernel functions, namely a polynomial kernel function k_1(a, a') = (1 + aᵀa')² of degree 2 for K_1, a Gaussian kernel function k_2(a, a') = exp(−(1/2)(a − a')ᵀ(a − a')/(1/10)) for K_2 and a linear kernel function k_3(a, a') = aᵀa' for K_3. The resulting kernel matrices are normalised according to [13, Section 4.8], giving r_i = trace(K_i) = n + l. The model parameter c > 0 is chosen to be c = Σ_{i=1}^d r_i = d(n + l), and we set C = 1.
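A sketch of the three kernel matrices in code; note that the trace normalisation below simply rescales each matrix so that its trace equals the number of samples, a simplified stand-in for the normalisation of [13, Section 4.8]:

```python
import numpy as np

def kernel_matrices(A):
    """Build the three kernel matrices on a data matrix A (rows = samples):
    polynomial of degree 2, Gaussian with bandwidth 1/10, and linear.
    Each matrix is rescaled so that its trace equals the sample count
    (simplified stand-in for the normalisation used in the experiments)."""
    G = A @ A.T                                   # Gram matrix a_i^T a_j
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2.0 * G      # squared pairwise distances
    Ks = [(1.0 + G) ** 2,                         # k1: polynomial, degree 2
          np.exp(-0.5 * D2 / 0.1),                # k2: Gaussian
          G]                                      # k3: linear
    return [K * (len(A) / np.trace(K)) for K in Ks]
```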

min_{x ∈ R^d} max_{y ∈ ∆_m} Σ_{i=1}^m y_i f_i(x), where ∆_m := {(v_1, …, v_m) ∈ R^m | Σ_{i=1}^m v_i = 1, v_i ≥ 0 for i = 1, …, m} denotes the probability simplex in R^m. We will work with a linear (affine) predictor h_x : R^d → R given by h_x(a) = aᵀx, with x ∈ R^d, and L : R × R → R being the hinge loss, i.e., L(r, s) = max{0, 1 − sr} for r, s ∈ R. Combining all of the above we get

min_{x ∈ R^d} max_{y ∈ R^m} Φ(x, y) − g(y),    (54)

with Φ : R^d × R^m → R defined by Φ(x, y) = Σ_{i=1}^m y_i (1/n_i) Σ_{j=1}^{n_i} max{0, 1 − b_{i_j} a_{i_j}ᵀ x}, and g : R^m → R ∪ {+∞} given by g(y) = δ_{∆_m}(y). The function g is proper, lower semicontinuous and convex (with modulus ν = 0). Furthermore we observe that Φ(·, y') : R^d → R is proper, convex and lower semicontinuous for all y' ∈ dom g = ∆_m, since Pr_{R^d}(dom Φ) is convex and closed, and for all x ∈ Pr_{R^d}(dom Φ) = R^d we have dom Φ(x, ·) = R^m. Secondly, if y' ∈ dom g, then g(y') ∈ R.
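The objects just defined are easy to compute, and one can check numerically that placing all mass of y on the worst group recovers the inner maximum of (53); the toy data below are our own:

```python
import numpy as np

def group_losses(x, groups):
    """f_i(x) = (1/n_i) * sum_j max(0, 1 - b_ij * a_ij^T x), one value per group."""
    return np.array([np.mean(np.maximum(0.0, 1.0 - b * (A @ x)))
                     for A, b in groups])

def phi(x, y, groups):
    """Coupling function Phi(x, y) = sum_i y_i * f_i(x)."""
    return float(y @ group_losses(x, groups))

# Hypothetical instance with m = 2 groups in R^2.
rng = np.random.default_rng(0)
groups = [(rng.normal(size=(5, 2)), rng.choice([-1.0, 1.0], size=5)),
          (rng.normal(size=(8, 2)), rng.choice([-1.0, 1.0], size=8))]
x = np.array([0.3, -0.7])
f = group_losses(x, groups)

# all mass on the group with the largest loss attains max_i f_i(x)
y_worst = np.eye(2)[np.argmax(f)]
```

Since Φ(x, ·) is linear on the simplex, its maximum is attained at a vertex, which is exactly the equivalence of (53) and (54).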

Table 1:
TSA of the 1-norm soft margin classifier (µ = 0, ν = 0, C = 1) trained with OGAProx-C1 and APD1, averaged over 10 random partitions.

In the case of the 1-norm soft margin classifier the results reported in Table 1 paint a clear picture. OGAProx outperforms APD on three out of four data sets and ties on one, achieving maximum TSA values of 97.45 %, 82.78 %, 93.24 % and 85.95 % on Breast cancer, Heart disease, Ionosphere and Sonar, respectively.

Table 4:
TSA of the affine classifier after k iterations of OGAProx for the groups according to "sex", averaged over 5 random partitions.

Additionally, with τ > 0 and y ∈ dom g, we have for x ∈ R^d

prox_{τΦ(·,y)}(x) = arg min_{u ∈ R^d} { τΦ(u, y) + (1/2)‖u − x‖² }.

By introducing slack variables for the pointwise maximum, we see that the above minimisation problem is equivalent to a quadratic program in the variables u ∈ R^d and r_{i_j} ∈ R, i ∈ [m], j ∈ [n_i]. Furthermore, Φ(x, ·) : R^m → R is concave and Fréchet differentiable. However, note that Φ is not differentiable in its first component. Moreover, the Lipschitz condition on the gradient is fulfilled as well. Indeed, for (x, y), (x', y') ∈ R^d × ∆_m we have ‖∇_y Φ(x, y) − ∇_y Φ(x', y')‖ ≤ L_yx ‖x − x'‖ + L_yy ‖y − y'‖.
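A sketch of the slack-variable reformulation of the prox subproblem, solved here with a generic SLSQP solver on a tiny instance of our own (the paper does not prescribe a particular QP solver; the data, τ and y below are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical instance: m = 2 groups in R^2, a dual point y in the simplex.
rng = np.random.default_rng(0)
A = [rng.normal(size=(3, 2)), rng.normal(size=(4, 2))]       # features per group
B = [np.array([1.0, -1.0, 1.0]), np.array([1.0, 1.0, -1.0, -1.0])]
n = [3, 4]
y = np.array([0.6, 0.4])
tau, x0, d = 0.5, np.zeros(2), 2

def prox_objective(u):
    # tau * Phi(u, y) + 0.5 * ||u - x0||^2 with hinge-loss coupling
    val = sum(y[i] * np.mean(np.maximum(0.0, 1.0 - B[i] * (A[i] @ u)))
              for i in range(2))
    return tau * val + 0.5 * np.sum((u - x0) ** 2)

# Slack-variable QP: variables z = (u, r); r_ij replaces max(0, 1 - b_ij a_ij^T u).
def qp_objective(z):
    u, r = z[:d], z[d:]
    return (tau * (y[0] / n[0] * r[:3].sum() + y[1] / n[1] * r[3:].sum())
            + 0.5 * np.sum((u - x0) ** 2))

constraints = [
    {"type": "ineq", "fun": lambda z: z[d:]},   # r >= 0
    {"type": "ineq",                            # r_ij >= 1 - b_ij a_ij^T u
     "fun": lambda z: np.concatenate([z[d:d + 3] - (1.0 - B[0] * (A[0] @ z[:d])),
                                      z[d + 3:] - (1.0 - B[1] * (A[1] @ z[:d]))])},
]

res = minimize(qp_objective, np.zeros(d + 7), method="SLSQP", constraints=constraints)
u_star = res.x[:d]   # approximate prox_{tau Phi(.,y)}(x0)
```

At the QP optimum the slacks are tight, so the minimiser of the QP also minimises the nonsmooth prox objective.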

Table 5:
TSA of the affine classifier after k iterations of OGAProx for the groups according to "age", averaged over 5 random partitions.