1 Introduction

Given two finite-dimensional real normed spaces \((\mathbb {X},\Vert \cdot \Vert _{\mathbb {X}})\) and \((\mathbb {U},\Vert \cdot \Vert _{\mathbb {U}})\), with dual spaces denoted by \((\mathbb {X}^*,\Vert \cdot \Vert _{\mathbb {X}^*})\) and \((\mathbb {U}^*,\Vert \cdot \Vert _{\mathbb {U}^*})\), respectively, let us consider the following convex optimization problem:

$$\begin{aligned} p^*:= {\min }_{x\in \mathbb {X}}\;\;\; [p(x):=f(\mathsf {A}x) + h(x)], \end{aligned}$$
(P)

where \(\mathsf {A}:\mathbb {X}\rightarrow \mathbb {U}\) is a (bounded) linear operator, \(f:\mathbb {U}\rightarrow \mathbb {R}\) is a convex differentiable function whose gradient is L-Lipschitz on \(\mathbb {U}\) (for some \(L>0\)), namely,

$$\begin{aligned} \Vert \nabla f(u) - \nabla f(v)\Vert _{\mathbb {U}^*}\le L\Vert u-v\Vert _{\mathbb {U}}, \quad \forall \,u,v\in \mathbb {U}, \end{aligned}$$
(1.1)

and \(h:\mathbb {X}\rightarrow \mathbb {R}\cup \{+\infty \}\) is a closed convex function with nonempty domain, denoted by \(\mathsf {dom}\,h:=\{x\in \mathbb {X}:h(x)<+\infty \}\). We assume that h is “simple”, in the sense that the “generalized” linear optimization sub-problem, namely

$$\begin{aligned} {\min }_{x\in \mathbb {X}}\;\;\; \left\langle {c},{x}\right\rangle + h(x), \end{aligned}$$
(GLO)

can be easily solved for any \(c\in \mathbb {X}^*\). As an example, if h is the indicator function of a polytope \(\mathcal {P}\), then (GLO) becomes a linear program, which admits an optimal solution at a vertex of \(\mathcal {P}\).
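For instance, when \(h=\iota _{\Delta _n}\) is the indicator of the probability simplex (a particular polytope), the optimal vertex can be read off from the smallest entry of c. Below is a minimal Python sketch of this oracle (the function name is ours, for illustration only):

```python
import numpy as np

def glo_simplex(c):
    """Solve min_x <c, x> + iota_{Delta_n}(x): a linear function over the
    simplex attains its minimum at a vertex, namely e_i with i = argmin_i c_i."""
    x = np.zeros(len(c))
    x[int(np.argmin(c))] = 1.0
    return x
```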

[Algorithm 1: the generalized Frank–Wolfe (GFW) method]

First-order methods that involve computing the gradients of f and solving sub-problems of the form (GLO) are referred to as the generalized Frank–Wolfe (GFW) method, which is shown in Algorithm 1. This method has been studied in several previous works, e.g., Bach [1], Nesterov [2], Ghadimi [3] and Pena [4] (all of which we will review shortly in Sect. 1.1). Indeed, the name of the GFW method comes from the fact that it can be regarded as a generalization of the Frank–Wolfe (FW) method [5], which dates back to the 1950s (see [6] and references therein). Specifically, if h is the indicator function of some “simple” convex compact set (under which (GLO) can be easily solved), then the GFW method specializes precisely to the FW method.
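For concreteness, the following Python sketch spells out one way to implement Algorithm 1 (the callables `grad_f`, `glo_oracle` and `step_size` are illustrative placeholders for the problem data; the update \(x^{t+1} = (1-\alpha _t)x^t + \alpha _t v^t\) is the one used throughout the analysis below):

```python
import numpy as np

def gfw(x0, A, grad_f, glo_oracle, step_size, T):
    """Sketch of Algorithm 1 (GFW). `glo_oracle(c)` is assumed to solve
    (GLO) with cost vector c, and `step_size(t)` returns alpha_t."""
    x = x0
    for t in range(T):
        g = grad_f(A @ x)          # gradient of f at A x^t
        v = glo_oracle(A.T @ g)    # Step 1: solve (GLO) with c = A* grad f(A x^t)
        alpha = step_size(t)
        x = (1.0 - alpha) * x + alpha * v   # convex-combination update
    return x
```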

1.1 Review of computational guarantees of the GFW method

When h is non-strongly-convex on \(\mathsf {dom}\,h\), the computational guarantees of the GFW method (i.e., Algorithm 1) are exactly the same as those of the classical FW methods (see e.g., [6]). Specifically, assume \(\mathsf {dom}\,h\) to be bounded, and define the diameter of \(\mathsf {A}(\mathsf {dom}\,h):= \{\mathsf {A}x: x\in \mathsf {dom}\,h\}\) as

$$\begin{aligned} \bar{D}:= {\sup }_{x,y\in \mathsf {dom}\,h} \;\Vert \mathsf {A}(x-y)\Vert _\mathbb {U}<+\infty . \end{aligned}$$

Additionally, for all \(t\ge 0\), let us define the primal optimality gap at \(x^t\) by

$$\begin{aligned} \delta (x^t):= p(x^t) - p^* \end{aligned}$$

and the FW-gap at \(x^t\) by

$$\begin{aligned} G(x^t):= \langle {\mathsf {A}^*\nabla f(\mathsf {A}x^t)},{x^t - v^t}\rangle + h(x^t) - h(v^t). \end{aligned}$$
(1.2)

In fact, it is easy to see that:

  1. (i)

    \(G(x^t)\ge \delta (x^t)\ge 0\) for all \(t\ge 0\) (cf. [7, Eq. (2.4)]),

  2. (ii)

    \(G(x^t)\) has the same value for any \(v^t\in \mathcal{V}(x^t)\) (cf. Step 1 in Algorithm 1), and hence the particular choice of \(v^t\in \mathcal {V}(x^t)\) in (1.2) does not matter.

By choosing the adaptive step-sizes \(\alpha _t=\min \{1,G(x^t)/(L\bar{D}^2)\}\) or the predetermined step-sizes \(\alpha _t = 2/(t+2)\) for \(t\ge 0\), one can show that (see e.g., [1, Sect. 4])

$$\begin{aligned} \delta (x^t)&\le \frac{2L\bar{D}^2}{t+1} \quad \text{ and }\quad \bar{G}_t:=\min _{0 \le i \le t} G(x^i) \le \frac{8L\bar{D}^2}{t+1},\quad \forall \,t\ge 0. \end{aligned}$$
(1.3)

Next, let us consider the case where h is \(\mu\)-strongly-convex on \(\mathsf {dom}\,h\) (for some \(\mu >0\)), namely

$$\begin{aligned}h(\lambda x + (1-\lambda ) y) \le \lambda h(x) + (1-\lambda ) h(y) - (\mu /2)\lambda (1-\lambda )\Vert x-y\Vert _\mathbb {X}^2, \quad \forall \,x,y\in \mathsf {dom}\,h, \;\; \forall \lambda \in [0,1]. \end{aligned}$$

In contrast to the well-studied non-strongly-convex case, the computational guarantees of the GFW method (i.e., Algorithm 1) in the strongly-convex case appear to be less studied. Nesterov [2, Sect. 5] showed that, if \(\mathsf {dom}\,h\) is bounded with diameter

$$\begin{aligned} D:= {\sup }_{x,y\in \mathsf {dom}\,h} \;\Vert x-y\Vert _\mathbb {X}<+\infty , \end{aligned}$$

and one chooses the predetermined step-sizes \(\alpha _t = \frac{6(t+1)}{(t+2)(2t+3)}\) for all \(t\ge 0\), then

$$\begin{aligned} \delta (x^t)\le \frac{27\kappa ^2\mu D^2}{(t+1)(2t+1)}, \quad \forall \,t\ge 0, \qquad \text{ where }\quad \kappa := \frac{L\Vert \mathsf {A}\Vert ^2}{\mu } \end{aligned}$$
(1.4)

denotes the condition number of (P), and \(\Vert \mathsf {A}\Vert : = {\sup }_{\Vert x\Vert _\mathbb {X}=1}\;\Vert \mathsf {A}x\Vert _\mathbb {U}\) denotes the operator norm of \(\mathsf {A}\). Later on, Ghadimi [3, Corollary 1(b)] showed that, without assuming the boundedness of \(\mathsf {dom}\,h\), if one chooses the constant step-sizes \(\alpha _t = 1/(1+4\kappa )\) for \(t\ge 0\), then the sequence of minimum FW-gaps \(\{\bar{G}_t\}_{t\ge 0}\) [cf. (1.3)] converges to zero linearly:

$$\begin{aligned} \bar{G}_t\le 4\delta (x^0) (1+4\kappa )\big (1-1/(2(1+4\kappa ))\big )^t, \quad \forall \, t\ge 1. \end{aligned}$$
(1.5)

In addition, Ghadimi [3, Corollary 2] showed that the convergence rate in (1.5) continues to hold (up to absolute constants) if one instead chooses the step-sizes \(\{\alpha _t\}_{t\ge 0}\) via a certain backtracking line-search procedure. More recently, Pena [4] showed that, by choosing the step-sizes \(\{\alpha _t\}_{t\ge 0}\) via exact line-search, one obtains the following simpler linear convergence result:

$$\begin{aligned} \widetilde{G}_t \le \widetilde{G}_0\big (1-(1/2)\min \{1,(2\kappa )^{-1}\}\big )^t, \quad \forall \,t\ge 0, \end{aligned}$$
(1.6)

where

$$\begin{aligned} \widetilde{G}_t := f(\mathsf {A}x^t) + h(x^t) + {\min }_{0\le i \le t}\; \big [f^*(\nabla f(\mathsf {A}x^i)) + h^*(-\mathsf {A}^*\nabla f(\mathsf {A}x^i))\big ], \quad \forall \,t\ge 0. \end{aligned}$$
(1.7)

In (1.7), \(f^*\) and \(h^*\) denote Fenchel conjugates of f and h, respectively, and \(\mathsf {A}^*:\mathbb {U}^*\rightarrow \mathbb {X}^*\) denotes the adjoint of \(\mathsf {A}\) (see Sect. 2 for details). In addition, Pena [4, Theorem 2] showed that (1.6) still holds (up to absolute constants) if the step-sizes \(\{\alpha _t\}_{t\ge 0}\) are chosen by backtracking line-search.

[Algorithm 2: the GFW method with dual averaging (GFWDA)]

1.2 Main contributions

In this work, we focus on the case where h is \(\mu\)-strongly convex (for some \(\mu >0\)), and propose a simple variant of the GFW method (cf. Algorithm 1) in Algorithm 2. Compared with Algorithm 1, Algorithm 2 simply adds an averaging step for the dual iterates \(\{y^t\}_{t\ge 0}\) (cf. Step 3), where the averaging weights are exactly the primal step-sizes \(\{\alpha _t\}_{t \ge 0}\). As a result, the primal iterates \(\{x^t\}_{t\ge 0}\) and the dual iterates \(\{y^t\}_{t\ge 0}\) are updated in a “symmetric” fashion. Despite the simplicity of this additional “dual averaging” step, as we will show in Sect. 3, it allows us to establish a simple and elegant contraction on the sequence of duality gaps evaluated at the primal-dual pairs \(\{(x^t,y^t)\}_{t\ge 0}\), by simply choosing the step-sizes \(\{\alpha _t\}_{t\ge 0}\) to be a constant (that only depends on the condition number \(\kappa\)). Compared with the previous results established for the GFW method in Algorithm 1 [2,3,4], our result shows that Algorithm 2 has the benefit of a simple choice of step-sizes that does not involve any form of line-search; this feature is particularly attractive when the condition number \(\kappa\) [cf. (1.4)] is explicitly known or can be easily estimated.
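A minimal Python sketch of Algorithm 2 is as follows (the callables are illustrative placeholders; note that, unlike Algorithm 1, \(v^t\) is computed from the dual iterate via (GLO) with \(c = \mathsf {A}^*y^t\), which is exactly how Step 1 is used in the proof of Theorem 3.1 below):

```python
import numpy as np

def gfw_dual_averaging(x0, y0, A, grad_f, glo_oracle, alpha, T):
    """Sketch of Algorithm 2: the primal update of Algorithm 1 plus a
    symmetric dual-averaging step with the same weight alpha."""
    x, y = x0, y0
    for _ in range(T):
        v = glo_oracle(A.T @ y)      # Step 1: solve (GLO) with c = A* y^t
        g = grad_f(A @ x)            # g^t = grad f(A x^t)
        x = (1.0 - alpha) * x + alpha * v   # primal update
        y = (1.0 - alpha) * y + alpha * g   # Step 3: dual averaging
    return x, y

# Theorem 3.1 suggests the constant step-size alpha = min(1/(2*kappa), 1).
```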

As the second main contribution of this work, we analyze the local convergence rate of the logistic fictitious play (LFP) algorithm [8], a classical stochastic game-theoretic algorithm that had previously only been shown to converge asymptotically. The key to our analysis is to observe that the deterministic version of LFP (D-LFP) is a special instance of Algorithm 2, and that LFP can be regarded as a certain “stochastic approximation” of D-LFP. As a result, by properly incorporating the “stochastic noise” of LFP into the analysis of D-LFP (which is precisely that of Algorithm 2), we obtain the local convergence rate of LFP. Our analysis shows that, with high probability, LFP converges locally at rate O(1/t) in terms of a certain expected duality gap. Our numerical results on the empirical behavior of LFP are in excellent agreement with this theory.

Notations. For any non-empty set \(\mathcal {X}\), we denote its relative interior by \(\mathsf {ri}\,\mathcal {X}\). In addition, let \(\iota _\mathcal {X}\) denote its indicator function (namely, \(\iota _\mathcal {X}(x) = 0\) if \(x\in \mathcal {X}\) and \(\iota _\mathcal {X}(x) = +\infty\) otherwise). For a matrix M and any \(p,q\in [1,+\infty ]\), we define its (p, q)-operator-norm as \(\Vert M\Vert _{p,q}=\sup _{\Vert z\Vert _p=1}\,\Vert Mz\Vert _q\).

2 Preliminaries

Let us provide some background on duality theory that will be useful in our analysis in Sect. 3. In the rest of this work, for notational brevity, we will omit the subscripts of norms; the meaning of \(\Vert \cdot \Vert\) and \(\Vert \cdot \Vert _*\) can be inferred from the context.

To begin with, let us first write down the (Fenchel) dual problem associated with (P):

$$\begin{aligned} -d^*:=-{\min }_{y\in \mathsf {dom}\,f^*} \;\; [d(y):=f^*(y) + h^*(-\mathsf {A}^*y)] = {\max }_{y\in \mathsf {dom}\,f^*} - d(y), \end{aligned}$$
(D)

where recall that \(\mathsf {A}^*:\mathbb {U}^*\rightarrow \mathbb {X}^*\) denotes the adjoint of \(\mathsf {A}\), and \(f^*:\mathbb {U}^*\rightarrow \mathbb {R}\cup \{+\infty \}\) and \(h^*:\mathbb {X}^*\rightarrow \mathbb {R}\) denote the Fenchel conjugates of f and h, respectively:

$$\begin{aligned} f^*(y)&:= {\sup }_{u\in \mathbb {U}}\; \langle {y},{u}\rangle - f(u), \;\;\;\;\;\quad \forall \,y\in \mathbb {U}^*,\end{aligned}$$
(2.1)
$$\begin{aligned} h^*(z)&: = {\sup }_{x\in \mathsf {dom}\,h}\; \langle {z},{x}\rangle - h(x), \;\;\;\;\forall \,z\in \mathbb {X}^*. \end{aligned}$$
(2.2)

From standard results (see e.g., [9]), we know that

  1. (i)

    The function \(f^*\) is (1/L)-strongly convex on \(\mathsf {dom}\,f^*\) in the following sense:

    $$\begin{aligned}f^*(y)\ge f^*(w) + \langle {g},{y-w}\rangle + (2L)^{-1} \Vert y-w\Vert ^2, \;\; \forall \,y\in \mathsf {dom}\,f^*, \;\forall \,w\in \mathsf {dom}\,\partial f^*, \;\forall \,g\in \partial f^*(w), \end{aligned}$$
    (2.3)

    where \(\mathsf {dom}\,\partial f^*:= \{y\in \mathsf {dom}\,f^*: \partial f^*(y)\ne \emptyset \}\) denotes the set of sub-differentiable points of \(f^*\).

  2. (ii)

    The function \(h^*\) is convex and differentiable on \(\mathbb {X}^*\) and \(\nabla h^*\) is \((1/\mu )\)-Lipschitz on \(\mathbb {X}^*\).

In addition, from [10, Theorem 3.51], we see that strong duality holds between (P) and (D), namely \(p^* = -d^*\). Next, let us define the duality gap \(\Delta :\mathsf {dom}\,h\times \mathsf {dom}\,f^*\rightarrow \mathbb {R}\) as

$$\begin{aligned} \Delta (x,y) := p(x) + d(y) = f(\mathsf {A}x) + h(x) + f^*(y) + h^*(-\mathsf {A}^*y), \quad \forall \,x\in \mathsf {dom}\,h, \;\; \forall \,y\in \mathsf {dom}\,f^*. \end{aligned}$$
(2.4)

Using standard results in Fenchel duality (see e.g., [1]), we see that for any \(x\in \mathsf {dom}\,h\),

$$\begin{aligned} \Delta (x,\nabla f(\mathsf {A}x)) = G(x)= \langle {\mathsf {A}^*\nabla f(\mathsf {A}x)},{x - v(x)}\rangle + h(x) - h(v(x)), \end{aligned}$$
(2.5)

where \(v(x):= \mathop {\mathrm {arg\,min}}_{x'\in \mathbb {X}}\; \langle {\nabla f(\mathsf {A}x)},{\mathsf {A}x'}\rangle + h(x')\) and \(G:\mathsf {dom}\,h\rightarrow \mathbb {R}\) is defined in (1.2). In words, for all \(x\in \mathsf {dom}\,h\), (2.5) means that the duality gap at \((x,\nabla f(\mathsf {A}x))\) is equal to the FW-gap at x.
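To verify (2.5), note that v(x) attains the supremum in (2.2) at \(z=-\mathsf {A}^*\nabla f(\mathsf {A}x)\), and that the Fenchel equality \(f(u)+f^*(\nabla f(u)) = \langle {\nabla f(u)},{u}\rangle\) holds for every \(u\in \mathbb {U}\) since f is differentiable. Hence

$$\begin{aligned} \Delta (x,\nabla f(\mathsf {A}x))&= f(\mathsf {A}x) + f^*(\nabla f(\mathsf {A}x)) + h(x) + h^*(-\mathsf {A}^*\nabla f(\mathsf {A}x)) \\&= \langle {\nabla f(\mathsf {A}x)},{\mathsf {A}x}\rangle + h(x) + \langle {-\mathsf {A}^*\nabla f(\mathsf {A}x)},{v(x)}\rangle - h(v(x)) \\&= \langle {\mathsf {A}^*\nabla f(\mathsf {A}x)},{x-v(x)}\rangle + h(x) - h(v(x)) = G(x). \end{aligned}$$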

3 Convergence rate of Algorithm 2

We derive the (global) convergence rate of Algorithm 2 in the following theorem.

Theorem 3.1

In Algorithm 2, if we choose \(\alpha _t = \min \{1/(2\kappa ),1\}\) for all \(t\ge 0\), then

$$\begin{aligned} \Delta (x^t,y^t)\le \rho (\kappa )^t \Delta (x^0,y^0), \quad \forall \,t\ge 0, \end{aligned}$$
(3.1)

where \(\kappa\) denotes the condition number of (P) [cf. (1.4)] and the linear rate function \(\rho :[0,+\infty )\rightarrow [0,1)\) is defined as

$$\begin{aligned} \rho (\kappa ):= {\left\{ \begin{array}{ll} \kappa , & \text{ if } \;\; 0\le \kappa \le 1/2\\ 1-{1}/(4\kappa ), & \text{ if } \;\; \kappa > 1/2 \end{array}\right. }. \end{aligned}$$
(3.2)

Proof

First, from the definition of \(v^t\) in Step 1, we see that

$$\begin{aligned} v^t = {\mathop {\mathrm {arg\,max}}\limits }_{x\in \mathbb {X}}\; \langle {-\mathsf {A}^* y^t},{x}\rangle - h(x), \end{aligned}$$

and hence from the definition of \(h^*\) in (2.2), we have

$$\begin{aligned} h^*(-\mathsf {A}^* y^t) = \langle {-\mathsf {A}^* y^t},{v^t}\rangle - h(v^t). \end{aligned}$$
(3.3)

Since both f and \(h^*\) have Lipschitz gradients on \(\mathbb {U}\) and \(\mathbb {X}^*\), respectively, we have

$$\begin{aligned} f(\mathsf {A}x^{t+1})&\le f(\mathsf {A}x^{t}) + \alpha _t\langle {\nabla f(\mathsf {A}x^{t})},{\mathsf {A}(v^t-x^t)}\rangle + (L\alpha _t^2/2) \Vert \mathsf {A}(v^t-x^t)\Vert ^2 \\&\le f(\mathsf {A}x^{t}) + \alpha _t\langle {g^t},{\mathsf {A}(v^t-x^t)}\rangle + (L\alpha _t^2\Vert \mathsf {A}\Vert ^2/2) \Vert v^t-x^t\Vert ^2, \end{aligned}$$
(3.4)
$$\begin{aligned} h^*(-\mathsf {A}^* y^{t+1})&\le h^*(-\mathsf {A}^* y^{t}) + \alpha _t\langle {\nabla h^*(-\mathsf {A}^* y^{t})},{\mathsf {A}^*(y^t-g^t)}\rangle + (\alpha _t^2/(2\mu )) \Vert \mathsf {A}^*(y^t-g^t)\Vert ^2 \\&\le h^*(-\mathsf {A}^* y^{t}) + \alpha _t\langle {v^t},{\mathsf {A}^*(y^t-g^t)}\rangle + (\alpha _t^2\Vert \mathsf {A}\Vert ^2/(2\mu )) \Vert y^t-g^t\Vert ^2. \end{aligned}$$
(3.5)

In addition, by the convexities of h and \(f^*\) on their respective domains, we have

$$\begin{aligned} h(x^{t+1})&\le (1-\alpha _t) h(x^t) + \alpha _t h(v^t), \end{aligned}$$
(3.6)
$$\begin{aligned} f^*(y^{t+1})&\le (1-\alpha _t) f^*(y^t) + \alpha _t\,f^*(g^t). \end{aligned}$$
(3.7)

Combining (3.4)–(3.7), we have

$$\begin{aligned} \Delta (x^{t+1},y^{t+1})&= f(\mathsf {A}x^{t+1}) + h(x^{t+1}) + h^*(-\mathsf {A}^* y^{t+1}) + f^*(y^{t+1}) \\&\le \Delta (x^{t},y^{t}) -\alpha _t\big \{\big [\langle {\mathsf {A}x^t},{g^t}\rangle - f^*(g^t)\big ] \\&\quad + \big [\langle {-\mathsf {A}^* y^t},{v^t}\rangle - h(v^t)\big ] + f^*(y^t)+ h(x^t) \big \} \\&\quad + \alpha _t^2\kappa \big \{(\mu /2)\Vert v^t-x^t\Vert ^2+(2L)^{-1}\Vert y^t-g^t\Vert ^2\big \}. \end{aligned}$$
(3.8)

Since

$$\begin{aligned} g^t = \nabla f(\mathsf {A}x^t) = {\mathop {\mathrm {arg\,max}}\limits }_{y\in \mathbb {U}^*}\; \langle {\mathsf {A}x^t},{y}\rangle - f^*(y), \end{aligned}$$
(3.9)

we see that \(g^t\in \mathsf {dom}\,\partial f^*\) and

$$\begin{aligned} f(\mathsf {A}x^t) = \langle {\mathsf {A}x^t},{g^t}\rangle - f^*(g^t). \end{aligned}$$
(3.10)

By substituting (3.3) and (3.10) into (3.8), we see that

$$\begin{aligned} \Delta (x^{t+1},y^{t+1}) \le (1-\alpha _t) \Delta (x^{t},y^{t}) + \alpha _t^2\kappa \big \{(\mu /2)\Vert v^t-x^t\Vert ^2+(2L)^{-1}\Vert y^t-g^t\Vert ^2\big \}. \end{aligned}$$
(3.11)

By the \(\mu\)-strong convexity of h on its domain, the definition of \(v^t\) in Step 1 and (3.3), we have

$$\begin{aligned} (\mu /2)\Vert x^t-v^t\Vert ^2&\le h(x^t)-h(v^t)+\langle {y^t},{\mathsf {A}(x^t-v^t)}\rangle \\&= h(x^t) + \langle {y^t},{\mathsf {A}x^t}\rangle + h^*(-\mathsf {A}^* y^t). \end{aligned}$$
(3.12)

In addition, using the (1/L)-strong convexity of \(f^*\) in the sense of (2.3), (3.9) and (3.10), we have

$$\begin{aligned} (2L)^{-1}\Vert y^t-g^t\Vert ^2&\le f^*(y^t)-f^*(g^t)+\langle {\mathsf {A}x^t},{g^t-y^t}\rangle \\&= f^*(y^t)-\langle {\mathsf {A}x^t},{y^t}\rangle + f(\mathsf {A}x^t). \end{aligned}$$
(3.13)

Substituting (3.12) and (3.13) into (3.11), we have

$$\begin{aligned} \Delta (x^{t+1},y^{t+1})\le (1-\alpha _t + \alpha _t^2\kappa )\Delta (x^{t},y^{t}). \end{aligned}$$
(3.14)

If we choose \(\alpha _t = \min \{1/(2\kappa ),1\}\), then \(1-\alpha _t + \alpha _t^2\kappa = \rho (\kappa )\) for all \(t\ge 0\), which together with (3.14) yields (3.1). \(\square\)

Remark 3.1

Note that in Theorem 3.1, the linear rate function \(\rho\) is continuous, concave and strictly increasing on \([0,+\infty )\). Hence, the smaller the condition number \(\kappa\), the better the linear rate. In the regime \(\kappa >1/2\), we have \(\rho (\kappa ) = 1-1/(4\kappa )\); therefore, to find a primal-dual pair \((x,y)\in \mathsf {dom}\,h\times \mathsf {dom}\,f^*\) such that \(\Delta (x,y)\le \varepsilon\), Algorithm 2 requires no more than

$$\begin{aligned} \left\lceil 4\kappa \ln \left( \frac{\Delta (x^0,y^0)}{\varepsilon }\right) \right\rceil \quad \text{ iterations }. \end{aligned}$$
(3.15)
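As a quick numerical illustration of (3.2) and (3.15), the following Python sketch evaluates the rate and the iteration bound (the values chosen, \(\kappa =2\), \(\Delta (x^0,y^0)=1\) and \(\varepsilon =10^{-6}\), are hypothetical):

```python
import math

def rho(kappa):
    """Linear rate function rho from (3.2)."""
    return kappa if kappa <= 0.5 else 1.0 - 1.0 / (4.0 * kappa)

def iters_to_eps(kappa, gap0, eps):
    """Iteration bound (3.15); valid in the regime kappa > 1/2."""
    assert kappa > 0.5
    return math.ceil(4.0 * kappa * math.log(gap0 / eps))

# For kappa = 2: rho(2) = 0.875, and reaching eps = 1e-6 from an initial
# gap of 1 takes at most ceil(8 * ln(1e6)) = 111 iterations.
print(rho(2.0), iters_to_eps(2.0, 1.0, 1e-6))
```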

4 Application to LFP

The LFP algorithm (a.k.a. stochastic fictitious play with best logit response), first introduced by Fudenberg and Kreps in 1993 [8], is a classical algorithm in game theory (see [11] and references therein). In this work we focus on the two-player zero-sum version, which is shown in Algorithm 3. Specifically, players I and II are given finite action spaces \([n]:=\{1,2,\ldots ,n\}\) and [m], respectively, and a payoff matrix \(A\in \mathbb {R}^{m\times n}\). At the beginning, players I and II choose their initial actions \(i_0\in [n]\) and \(j_0\in [m]\), respectively, and their initial “histories of actions” are denoted by \(x^0 := e_{i_0}\) and \(y^0 := e_{j_0}\), respectively (where \(e_i\) denotes the ith standard coordinate vector). At any time \(t\ge 0\), in order to choose the next action \(i_{t+1}\), player I first computes the best-logit-response distribution \(w^t\) based on player II’s history of actions \(y^t\):

$$\begin{aligned} w^t:= \mathsf {P}_\mathsf {x}(y^t):= {\mathop {\mathrm {arg\,min}}\limits }_{x\in \Delta _n} \; \langle {A^\top y^t},{x}\rangle + \eta h_\mathsf {x}(x), \end{aligned}$$
(4.1)

where

$$\begin{aligned} h_\mathsf {x}(x):= \textstyle \sum _{i=1}^n x_i\ln (x_i) \qquad (\text{ for } \,x\ge 0) \end{aligned}$$
(4.2)

is the (negative) entropic function defined on \(\mathbb {R}^n_+\), \(\Delta _n:= \{x\in \mathbb {R}_+^n: \sum _{i=1}^n x_i=1\}\) denotes the \((n-1)\)-dimensional probability simplex, and \(\eta >0\) is the regularization parameter. Then, based on \(w^t\), she randomly chooses \(i_{t+1}\) by sampling from the distribution \(w^t\), such that \(\Pr (i_{t+1} = i) = w^t_{i}\) for \(i\in [n]\). After obtaining \(i_{t+1}\), she updates her history of actions from \(x^t\) to \(x^{t+1}\) by a convex combination of \(x^t\) and \(e_{i_{t+1}}\):

$$\begin{aligned} x^{t+1} := (1-\alpha _t)x^t + \alpha _t e_{i_{t+1}}, \end{aligned}$$
(4.3)

where \(\alpha _t\in [0,1]\) can be interpreted as the “step-size” of player I, and is required to satisfy

$$\begin{aligned} \textstyle \sum _{t=0}^{+\infty } \,\alpha _t = +\infty \quad \text{ and }\quad \textstyle \sum _{t=0}^{+\infty } \,\alpha _t^2 < +\infty . \end{aligned}$$
(4.4)

For player II, the update of her history of actions from \(y^t\) to \(y^{t+1}\) is symmetric to that of player I. Specifically, based on player I’s history of actions \(x^t\), she computes her best-logit-response distribution \(s^t\) as

$$\begin{aligned} s^t:= \mathsf {P}_\mathsf {y}(x^t):={\mathop {\mathrm {arg\,max}}\limits }_{y\in \Delta _m} \; \langle {A x^t},{y}\rangle - \eta h_\mathsf {y}(y), \end{aligned}$$
(4.5)

where

$$\begin{aligned} h_\mathsf {y}(y):= \textstyle \sum _{j=1}^m y_j\ln (y_j) \qquad (\text{for} \,y\ge 0) \end{aligned}$$
(4.6)

is the (negative) entropic function defined on \(\mathbb {R}^m_+\). Then she samples her next action \(j_{t+1}\) from \(s^t\), and updates her history of actions from \(y^t\) to \(y^{t+1}\) as follows:

$$\begin{aligned} y^{t+1} := (1-\alpha _t)y^t + \alpha _t e_{j_{t+1}}. \end{aligned}$$
(4.7)
[Algorithm 3: logistic fictitious play (LFP)]
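Since the objectives in (4.1) and (4.5) are linear functions with entropic regularization over the simplex, both logit responses have closed-form softmax expressions; for \(s^t\) this is confirmed by (4.17) in Sect. 4.1, and the expression for \(w^t\) follows symmetrically. A minimal Python sketch of one round of Algorithm 3 under these closed forms (the helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def softmax(z):
    z = z - z.max()              # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def lfp_round(x, y, A, eta, alpha):
    """One round of Algorithm 3 (LFP)."""
    w = softmax(-(A.T @ y) / eta)        # player I's logit response (4.1)
    s = softmax((A @ x) / eta)           # player II's logit response (4.5)
    i = rng.choice(len(w), p=w)          # sample i_{t+1} ~ w^t
    j = rng.choice(len(s), p=s)          # sample j_{t+1} ~ s^t
    x_new = (1.0 - alpha) * x + alpha * np.eye(len(w))[i]   # update (4.3)
    y_new = (1.0 - alpha) * y + alpha * np.eye(len(s))[j]   # update (4.7)
    return x_new, y_new
```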

In the literature, the asymptotic convergence of LFP (i.e., Algorithm 3) has been well-studied. For example, Hofbauer and Sandholm [12] prove the following theorem:

Theorem 4.1

(Hofbauer and Sandholm  [12, Theorem 6.1(ii)]) Consider the fixed-point equation

$$\begin{aligned} \mathsf {P}_\mathsf {x}(y) = x, \quad \mathsf {P}_\mathsf {y}(x) = y, \end{aligned}$$
(4.10)

and its unique solution \((x^*,y^*)\in \mathsf {ri}\,\Delta _n\times \mathsf {ri}\,\Delta _m\). Then in Algorithm 3, for any initialization \(i_0\in [n]\) and \(j_0\in [m]\), and any step-sizes \(\{\alpha _t\}_{t\ge 0}\) satisfying (4.4), we have

$$\begin{aligned} \Pr \left( {\lim }_{t\rightarrow +\infty }\, x^t = x^* \;\; \text{ and }\;\; {\lim }_{t\rightarrow +\infty }\, y^t = y^*\right) = 1. \end{aligned}$$
(4.11)
[Algorithm 4: deterministic logistic fictitious play (D-LFP)]

However, in contrast to the well-understood asymptotic convergence of LFP, the convergence rate of LFP is largely unknown. The purpose of this section is to conduct a local convergence rate analysis of LFP, and to show that, with high probability, LFP converges locally at rate O(1/t), where the convergence is measured in terms of a certain expected duality gap. To that end, let us first consider D-LFP (i.e., the deterministic version of LFP), which is shown in Algorithm 4, and relate it to Algorithm 2.

4.1 Relating D-LFP to Algorithm 2

Let us observe a simple but important fact: D-LFP (i.e., Algorithm 4) is an instance of Algorithm 2 applied to the following instance of (P):

$$\begin{aligned} \min _{x\in \mathbb {R}^n}\; \underbrace{\eta \ln \big (\textstyle \sum _{j=1}^m \exp (a_j^\top x/\eta )\big )}_{:= f(\mathsf {A}x)} + \underbrace{\eta h_\mathsf {x}(x)+\iota _{\Delta _{n}}(x)}_{:=h(x)}, \end{aligned}$$
(P-LFP)

where \(a_j^\top\) denotes the jth row of A (for \(j\in [m]\)), the linear operator \(\mathsf {A}x:= Ax\) for \(x\in \mathbb {R}^n\), and

$$\begin{aligned} f(u):= \eta \ln (\textstyle \sum _{j=1}^m \exp (u_j/\eta )), \quad \forall \,u\in \mathbb {R}^m. \end{aligned}$$
(4.14)

As a result, the dual problem of (P-LFP) reads:

$$\begin{aligned} -\min _{y\in \mathbb {R}^m}\; \underbrace{\eta h_\mathsf {y}(y)+\iota _{\Delta _{m}}(y)}_{:=f^*(y)} + \underbrace{\eta \ln \big (\textstyle \sum _{i=1}^n \exp (-A_i^\top y/\eta )\big )}_{:= h^*(-A^\top y)} , \end{aligned}$$
(D-LFP)

where \(A_i\) denotes the ith column of A (for \(i\in [n]\)). Consequently, according to (2.4), the duality gap \(\Delta :\Delta _n\times \Delta _m\rightarrow \mathbb {R}\) has the following form: for any \((x,y)\in \Delta _n\times \Delta _m,\)

$$\begin{aligned} \Delta (x,y) = \eta \ln \big (\textstyle \sum _{j=1}^m \exp (a_j^\top x/\eta )\big ) + \eta h_\mathsf {x}(x) + \eta \ln \big (\textstyle \sum _{i=1}^n \exp (-A_i^\top y/\eta )\big ) + \eta h_\mathsf {y}(y). \end{aligned}$$
(4.15)

Now, in order to see that (P-LFP) is an instance of (P) (which in turn implies that (D-LFP) is an instance of (D)), it suffices to note that

  1. (i)

    The function f given in (4.14) is convex and differentiable on \(\mathbb {R}^m\), and \(\nabla f\) is \((1/\eta )\)-Lipschitz on \(\mathbb {R}^m\) with respect to \(\Vert \cdot \Vert _\infty\), i.e.,

    $$\begin{aligned} \Vert \nabla f(u) - \nabla f(u')\Vert _1\le (1/\eta )\Vert u-u'\Vert _{\infty }, \quad \forall \,u,u'\in \mathbb {R}^m. \end{aligned}$$
    (4.16)

    (To see this, note that for all \(u\in \mathbb {R}^m\), \(\Vert \nabla ^2 f(u)\Vert _{\infty ,1}:= {\sup }_{\Vert z\Vert _{\infty }=1}\; \Vert \nabla ^2 f(u)z\Vert _1\le 1/\eta .)\)

  2. (ii)

    The function \(h := \eta h_\mathsf {x}+ \iota _{\Delta _n}\) is \(\eta\)-strongly convex on \(\mathsf {dom}\,h = \Delta _n\) with respect to \(\Vert \cdot \Vert _1\), i.e.,

    $$\begin{aligned}h(\lambda x + (1-\lambda ) y) \le \lambda h(x) + (1-\lambda ) h(y) - (\eta /2)\lambda (1-\lambda )\Vert x-y\Vert _1^2, \quad \forall \,x,y\in \Delta _n, \; \forall \lambda \in [0,1]. \end{aligned}$$

    (For details, see e.g., [13, Lemma 3].)

In addition, to see that D-LFP (i.e., Algorithm 4) is an instance of Algorithm 2, we simply note that for all \(t\ge 0\): (i) \(w^t = v^t= \nabla h^*(-\mathsf {A}^* y^t)\) and (ii) from the definition of f in (4.14),

$$\begin{aligned} s_j^t = \frac{\exp (a_j^\top x^t/\eta )}{\sum _{l\in [m]}\exp (a_l^\top x^t/\eta )} = \nabla _j f(Ax^t) = g_j^t, \quad \forall \, j\in [m], \end{aligned}$$
(4.17)

and hence \(s^t = g^t= \nabla f(Ax^t)\). As a result, we see that D-LFP enjoys the same linear convergence rate (3.1) as Algorithm 2, if we choose the step-sizes \(\{\alpha _t\}_{t\ge 0}\) in the same way as in Theorem 3.1 with \(\kappa := \Vert A\Vert _{1,\infty }^2/\eta ^2\), which is the condition number of (P-LFP). (Recall that \(\Vert A\Vert _{1,\infty }\) denotes the \((1,\infty )\)-operator norm of A, and is given by \(\Vert A\Vert _{1,\infty }:= {\sup }_{\Vert z\Vert _{1}=1}\; \Vert Az\Vert _\infty = {\max }_{j\in [m],i\in [n]}\, \vert a_{j,i}\vert ,\) where \(a_{j,i}\) denotes the (j, i)th entry of A, for \(j\in [m]\) and \(i\in [n]\).)
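Putting the pieces together, here is a minimal Python sketch of D-LFP with the constant step-size of Theorem 3.1 (the uniform initialization and the helper names are our illustrative choices; the gap is (4.15) and should contract by at least the factor \(\rho (\kappa )\) per iteration, cf. (3.1)):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def duality_gap(x, y, A, eta):
    """Duality gap (4.15), with the convention 0 ln 0 = 0 in the entropies."""
    ent = lambda p: float(np.sum(p[p > 0] * np.log(p[p > 0])))
    return (eta * logsumexp((A @ x) / eta) + eta * ent(x)
            + eta * logsumexp(-(A.T @ y) / eta) + eta * ent(y))

def dlfp(A, eta, T):
    """D-LFP (Algorithm 4) with alpha = min(1/(2*kappa), 1)."""
    m, n = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m      # illustrative initialization
    kappa = np.abs(A).max() ** 2 / eta ** 2    # ||A||_{1,inf}^2 / eta^2
    alpha = min(1.0 / (2.0 * kappa), 1.0)
    gaps = [duality_gap(x, y, A, eta)]
    for _ in range(T):
        w = softmax(-(A.T @ y) / eta)          # v^t = w^t
        s = softmax((A @ x) / eta)             # g^t = s^t
        x = (1.0 - alpha) * x + alpha * w
        y = (1.0 - alpha) * y + alpha * s
        gaps.append(duality_gap(x, y, A, eta))
    return gaps
```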

4.2 Local convergence rate analysis of LFP

Now, let us analyze the local convergence rate of LFP (i.e., Algorithm 3) by regarding it as a certain “stochastic approximation” of D-LFP. Specifically, we can rewrite the iterations (4.8) and (4.9) in Algorithm 3 as

$$\begin{aligned} x^{t+1}&:= (1-\alpha _t)x^t + \alpha _t (w^t + \zeta _\mathsf {x}^t) = x^t + \alpha _t (w^t + \zeta _\mathsf {x}^t-x^t),&\quad \text{ where }\quad \zeta _\mathsf {x}^t:= e_{i_{t+1}} - w^t, \end{aligned}$$
(4.18)
$$\begin{aligned} y^{t+1}&:= (1-\alpha _t)y^t + \alpha _t (s^t + \zeta _\mathsf {y}^t) = y^t + \alpha _t (s^t + \zeta _\mathsf {y}^t - y^t),&\quad \text{ where }\quad \zeta _\mathsf {y}^t:= e_{j_{t+1}} - s^t. \end{aligned}$$
(4.19)

Note that \(\zeta _\mathsf {x}^t\) and \(\zeta _\mathsf {y}^t\) can be regarded as the stochastic errors resulting from the sampling steps \(i_{t+1}\sim w^t\) and \(j_{t+1}\sim s^t\), respectively. In fact, we can easily see that the sequences of errors \(\{\zeta _\mathsf {x}^t\}_{t\ge 0}\) and \(\{\zeta _\mathsf {y}^t\}_{t\ge 0}\) are martingale difference sequences. Formally, let us define a filtration \(\{\mathcal {F}_t\}_{t\ge 0}\) such that for all \(t\ge 0\), \(\mathcal {F}_t:= \sigma \big (\{(x^i,y^i)\}_{i=0}^t\big )\), namely the \(\sigma\)-field generated by the random variables \(\{(x^i,y^i)\}_{i=0}^t\). Then we have

$$\begin{aligned} \mathbb {E}_t[{\zeta _\mathsf {x}^t}]: =\mathbb {E}[{\zeta _\mathsf {x}^t}\,|\,\mathcal {F}_t] = 0 \quad \text{ and }\quad \mathbb {E}_t[{\zeta _\mathsf {y}^t}]: =\mathbb {E}[{\zeta _\mathsf {y}^t}\,|\,\mathcal {F}_t] = 0, \quad \forall \,t\ge 0. \end{aligned}$$
(4.20)

Next, let us note that since \((x^*,y^*)\in \mathsf {ri}\,\Delta _n\times \mathsf {ri}\,\Delta _m\) (cf. Theorem 4.1), there exist radii \(r_\mathsf {x},r_\mathsf {y}>0\) such that \(\mathcal {B}_{r_\mathsf {x}}(x^*)\times \mathcal {B}_{r_\mathsf {y}}(y^*)\subseteq \mathsf {ri}\,\Delta _n\times \mathsf {ri}\,\Delta _m\), where

$$\begin{aligned} \mathcal {B}_{r_\mathsf {x}}(x^*)&:= \{x\in \mathbb {R}^n:e^\top x=1,\;\Vert x-x^*\Vert _1\le r_\mathsf {x}\}, \\ \mathcal {B}_{r_\mathsf {y}}(y^*)&:= \{y\in \mathbb {R}^m:e^\top y=1,\;\Vert y-y^*\Vert _1\le r_\mathsf {y}\}. \end{aligned}$$

For notational convenience, let us write \(\mathcal {N}(x^*,y^*):= \mathcal {B}_{r_\mathsf {x}}(x^*)\times \mathcal {B}_{r_\mathsf {y}}(y^*)\), which is a compact neighborhood of \((x^*,y^*)\) that is bounded away from the relative boundary of \(\Delta _n\times \Delta _m\). This neighborhood will play an important role in our local convergence rate analysis of LFP. The advantage of this neighborhood can be seen from the following lemma.

Lemma 4.1

There exist finite constants \(L_\mathsf {x},L_\mathsf {y}\ge 0\) such that

$$\begin{aligned} h_\mathsf {x}(x')&\le h_\mathsf {x}(x) + \langle {\nabla h_\mathsf {x}(x)},{x'-x}\rangle + (L_\mathsf {x}/2)\Vert x'-x\Vert _1^2,\quad&\forall \,x,x'\in \mathcal {B}_{r_\mathsf {x}}(x^*), \end{aligned}$$
(4.21)
$$\begin{aligned} h_\mathsf {y}(y')&\le h_\mathsf {y}(y) + \langle {\nabla h_\mathsf {y}(y)},{y'-y}\rangle + (L_\mathsf {y}/2)\Vert y'-y\Vert _1^2,\quad&\forall \,y,y'\in \mathcal {B}_{r_\mathsf {y}}(y^*). \end{aligned}$$
(4.22)

Proof

Indeed, we can set \(L_\mathsf {x}:= {\max }_{x\in \mathcal {B}_{r_\mathsf {x}}(x^*)}\,\Vert \nabla ^2 h_\mathsf {x}(x)\Vert _{1,\infty },\) which is finite since \(\nabla ^2 h_\mathsf {x}\) is continuous on the compact set \(\mathcal {B}_{r_\mathsf {x}}(x^*)\). (Concretely, since \(\nabla ^2 h_\mathsf {x}(x) = \mathsf {diag}(1/x_1,\ldots ,1/x_n)\), we have \(\Vert \nabla ^2 h_\mathsf {x}(x)\Vert _{1,\infty } = \max _{i\in [n]} 1/x_i\), which is bounded on \(\mathcal {B}_{r_\mathsf {x}}(x^*)\) because this ball is bounded away from the relative boundary of \(\Delta _n\).) Similarly, we can set \(L_\mathsf {y}:= {\max }_{y\in \mathcal {B}_{r_\mathsf {y}}(y^*)}\,\Vert \nabla ^2 h_\mathsf {y}(y)\Vert _{1,\infty }<+\infty\). \(\square\)

In addition, let us make another important observation: with high probability, the sequence \(\{(x^t,y^t)\}_{t\ge 0}\) produced by LFP (i.e., Algorithm 3) will eventually lie inside \(\mathcal {N}(x^*,y^*)\), for any initial actions \(i_0\in [n]\) and \(j_0\in [m]\), and any step-sizes \(\{\alpha _t\}_{t\ge 0}\) satisfying the conditions in (4.4). In fact, this is a simple corollary of Theorem 4.1, which is stated as follows.

Corollary 4.1

Define the sequence of events \(\{\mathcal {A}_{T}\}_{T\ge 0}\) such that

$$\begin{aligned} \mathcal {A}_T:= \big \{\forall \,t\ge T,\;\; (x^t,y^t)\in \mathcal {N}(x^*,y^*)\big \}, \quad \forall \,T\ge 0. \end{aligned}$$
(4.23)

If the step-sizes \(\{\alpha _t\}_{t\ge 0}\) satisfy (4.4), then for any \(\delta \in (0,1)\), there exists \(T(\delta )<+\infty\) such that \(\Pr \big (\mathcal {A}_{T(\delta )}\big )\ge 1-\delta .\)

Proof

Indeed, from standard results (e.g., [14, Theorem 3.3]), we know that the almost sure convergence result in (4.11) is equivalent to \(\lim _{T\rightarrow \infty }\Pr (\mathcal {A}_T) = 1\). Therefore, for any \(\delta \in (0,1)\), there exists \(T(\delta )<+\infty\) such that for all \(T\ge T(\delta )\), \(\Pr (\mathcal {A}_T)\ge 1-\delta\). This completes the proof. \(\square\)

Lastly, our analysis requires the following technical lemma, whose proof follows from standard techniques (see e.g., [6, Sect. 3]). For completeness, we provide its proof in Appendix A.

Lemma 4.2

Let \(\{V_t\}_{t\ge 0}\) be a nonnegative sequence that satisfies the following recursion:

$$\begin{aligned} V_{t+1}\le (1-\alpha _t)V_t + \alpha _t^2 C,\quad \forall \,t\ge t_0 \qquad \text{ for } \text{ some } t_0\ge 0, \end{aligned}$$
(4.24)

where \(C\ge 0\) and \(\alpha _t\in [0,1]\) for all \(t\ge 0\). If we choose \(\alpha _t = {2}/({t+2})\) for all \(t\ge 0\), then we have

$$\begin{aligned} V_t&\le \frac{t_0(t_0+1)}{t(t+1)} V_{t_0} + \frac{4C(t-t_0)}{t(t+1)}, \qquad&\forall \,t\ge t_0+1. \end{aligned}$$
(4.25)
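A quick numerical sanity check of Lemma 4.2 can be sketched in Python as follows (the constants are illustrative; the code runs the recursion (4.24) with equality and asserts the bound (4.25)):

```python
def check_lemma(C=1.0, V0=5.0, t0=3, T=2000):
    """Iterate V_{t+1} = (1 - a_t) V_t + a_t^2 C with a_t = 2/(t+2) from t0,
    and verify the bound (4.25) for all t in (t0, T]."""
    V = V0
    vals = {t0: V0}
    for t in range(t0, T):
        a = 2.0 / (t + 2)
        V = (1.0 - a) * V + a * a * C
        vals[t + 1] = V
    for t in range(t0 + 1, T + 1):
        bound = (t0 * (t0 + 1) / (t * (t + 1))) * V0 + 4.0 * C * (t - t0) / (t * (t + 1))
        assert vals[t] <= bound + 1e-12, (t, vals[t], bound)
    print("(4.25) verified up to t =", T)

check_lemma()
```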

Equipped with the results above, we are ready to analyze the local convergence rate of LFP. Indeed, our analysis of LFP modifies the analysis of D-LFP (i.e., Algorithm 2), by properly handling the stochastic errors \(\zeta _\mathsf {x}^t\) and \(\zeta _\mathsf {y}^t\) that appear in (4.18) and (4.19), respectively. Before presenting our results, let us first observe that the duality gap \(\Delta (\cdot ,\cdot )\) in (4.15) is jointly continuous on \(\Delta _n\times \Delta _m\), and hence we can define its maximum on \(\Delta _n\times \Delta _m\) as

$$\begin{aligned} \Delta _{\max }:={\max }_{(x,y)\in \Delta _n\times \Delta _m} \; \Delta (x,y) <+\infty . \end{aligned}$$
(4.26)

Theorem 4.2

(Local convergence rate of LFP). In Algorithm 3, choose any initial actions \(i_0\in [n]\) and \(j_0\in [m]\), and \(\alpha _t = {2}/({t+2})\) for all \(t\ge 0\). Then for any \(\delta \in (0,1)\), there exists \(T(\delta )<+\infty\) such that \(\Pr (\mathcal {A}_{T(\delta )})\ge 1-\delta\) and

$$\begin{aligned} \mathbb {E}\big [\Delta \big (x^t,y^{t}\big )\big |\mathcal {A}_{T(\delta )}\big ]&\le \frac{T(\delta )(T(\delta )+1)}{t(t+1)} \Delta _{\max } + \frac{8\eta ( L_\mathsf {x}+L_\mathsf {y}+2\kappa )(t-T(\delta ))}{t(t+1)}, \;\; \;\;\forall\, t\ge T(\delta )+1, \end{aligned}$$
(4.27)

where recall that \(\kappa =\Vert A\Vert _{1,\infty }^2/\eta ^2\) and \(L_\mathsf {x}\) and \(L_\mathsf {y}\) satisfy (4.21) and (4.22), respectively.

Proof

Since the step-sizes \(\{\alpha _t\}_{t\ge 0}\) satisfy the conditions in (4.4), from Corollary 4.1, we see that there exists \(T(\delta )<+\infty\) such that \(\Pr (\mathcal {A}_{T(\delta )})\ge 1-\delta\). Now, by conditioning on the event \(\mathcal {A}_{T(\delta )}\), we see that \((x^t,y^t)\in \mathcal {N}(x^*,y^*)\) for all \(t\ge T(\delta )\). Thus, using (4.21), we have for all \(t\ge T(\delta )\),

$$\begin{aligned} h_\mathsf {x}(x^{t+1}) - h_\mathsf {x}(x^t)&\le \langle {\nabla h_\mathsf {x}(x^t)},{x^{t+1}-x^t}\rangle + ({L_\mathsf {x}}/{2})\Vert x^{t+1} - x^t\Vert _1^2 \\&= \alpha _t\langle {\nabla h_\mathsf {x}(x^t)},{e_{i_{t+1}} - x^t}\rangle + \alpha _t^2(L_\mathsf {x}/{2}) \Vert e_{i_{t+1}} - x^t\Vert _1^2 \\&\le \alpha _t\langle {\nabla h_\mathsf {x}(x^t)},{w^t - x^t}\rangle + \alpha _t\langle {\nabla h_\mathsf {x}(x^t)},{\zeta _\mathsf {x}^t}\rangle + 2L_\mathsf {x}\alpha _t^2 \end{aligned}$$
(4.28)
$$\begin{aligned}&\le \alpha _t(h_\mathsf {x}(w^t) - h_\mathsf {x}(x^t)) + \alpha _t\langle {\nabla h_\mathsf {x}(x^t)},{\zeta _\mathsf {x}^t}\rangle + 2L_\mathsf {x}\alpha _t^2, \end{aligned}$$
(4.29)

where (4.28) follows from the definition of \(\zeta _\mathsf {x}^t\) in (4.18) and \(\Vert e_{i_{t+1}} - x^t\Vert _1^2\le 2(\Vert e_{i_{t+1}}\Vert _1^2 + \Vert x^t\Vert _1^2) = 4\), and (4.29) follows from the convexity of \(h_\mathsf {x}\). Similarly, we have

$$\begin{aligned} h_\mathsf {y}(y^{t+1}) - h_\mathsf {y}(y^t)&\le \langle {\nabla h_\mathsf {y}(y^t)},{y^{t+1} - y^t}\rangle + (L_\mathsf {y}/2) \Vert y^{t+1} - y^t\Vert _1^2 \\&\le \alpha _t\langle {\nabla h_\mathsf {y}(y^t)},{s^t + \zeta _\mathsf {y}^t - y^t}\rangle + 2{L_\mathsf {y}}\alpha _t^2 \\&\le \alpha _t(h_\mathsf {y}(s^t) - h_\mathsf {y}(y^t)) + \alpha _t\langle {\nabla h_\mathsf {y}(y^t)},{\zeta _\mathsf {y}^t}\rangle + 2{L_\mathsf {y}}\alpha _t^2. \end{aligned}$$
(4.30)

In addition, from (P-LFP) and (D-LFP), we see that f and \(h^*\) are differentiable with \(\eta ^{-1}\)-Lipschitz gradients on \(\mathbb {R}^m\) and \(\mathbb {R}^n\), respectively, and hence

$$\begin{aligned} f(Ax^{t+1})-f(Ax^{t})&\le \alpha _t\langle {\nabla f(Ax^{t})},{A(e_{i_{t+1}}-x^t)}\rangle \\&\quad + \alpha _t^2(\eta ^{-1}/2) \Vert A\Vert _{1,\infty }^2\Vert e_{i_{t+1}}-x^t\Vert _1^2 \\&\le \alpha _t(\langle {s^t},{A(w^t-x^t)}\rangle +\langle {s^t},{A\zeta _\mathsf {x}^t}\rangle ) \\&\quad + \alpha _t^2(2\Vert A\Vert _{1,\infty }^2/\eta ), \end{aligned}$$
(4.31)
$$\begin{aligned} h^*(-A^\top y^{t+1})-h^*(-A^\top y^{t})&\le - \alpha _t\langle {\nabla h^*(-A^\top y^{t})},{A^\top (e_{j_{t+1}}-y^{t})}\rangle \\&\quad + \alpha _t^2(\eta ^{-1}/2) \Vert A\Vert _{1,\infty }^2\Vert e_{j_{t+1}}-y^t\Vert _1^2, \\&\le - \alpha _t(\langle {w^t},{A^\top (s^t-y^{t})}\rangle + \langle {w^t},{A^\top \zeta _\mathsf {y}^t}\rangle ) \\&\quad +\alpha _t^2(2\Vert A\Vert _{1,\infty }^2/\eta ), \end{aligned}$$
(4.32)

where we use \(s^t = \nabla f(Ax^{t})\) and \(w^t = \nabla h^*(-A^\top y^t)\) (cf. Sect. 4.1). Therefore, by combining (4.29)–(4.32) and using the definitions \(h := \eta h_\mathsf {x}+ \iota _{\Delta _n}\) and \(f^*:= \eta h_\mathsf {y}+ \iota _{\Delta _m}\) in (P-LFP) and (D-LFP), respectively, we have

$$\begin{aligned} \Delta (x^{t+1},y^{t+1}) - \Delta (x^{t},y^{t})&\le \alpha _t\big \{h(w^t)+ \langle {A w^t},{ y^t}\rangle -h(x^t) + f^*(s^t)-\langle {s^t},{A x^t}\rangle -f^*(y^t)\big \}\\&\quad + \alpha _t\big \{\langle {\nabla h(x^t)+ A^\top s^t},{\zeta _\mathsf {x}^t}\rangle +\langle {\nabla f^*(y^t)-A w^t},{\zeta _\mathsf {y}^t}\rangle \big \}\\&\quad + 2\alpha _t^2\eta \left\{ L_\mathsf {x}+L_\mathsf {y}+2\Vert A\Vert _{1,\infty }^2/\eta ^2\right\} . \end{aligned}$$
(4.33)

Since \(s^t = \nabla f(Ax^{t})\) and \(w^t = \nabla h^*(-A^\top y^t)\), we know that

$$\begin{aligned} f^*(s^t)-\langle {s^t},{A x^t}\rangle = -f(Ax^t) \quad \text{ and }\quad h(w^t)+ \langle {A w^t},{ y^t}\rangle = -h^*(-A^\top y^t). \end{aligned}$$
(4.34)

By combining (4.33), (4.34) and (4.20), we know that

$$\begin{aligned} \mathbb {E}_t[\Delta (x^{t+1},y^{t+1})]&\le (1-\alpha _t)\Delta (x^{t},y^{t}) + 2\alpha _t^2\eta ( L_\mathsf {x}+L_\mathsf {y}+2\kappa ), \quad \forall \,t\ge T(\delta ). \end{aligned}$$
(4.35)

Finally, by applying Lemma 4.2 to (4.35) and using the definition of \(\Delta _{\max }\) in (4.26), we complete the proof. \(\square\)

5 Preliminary experimental studies

Experimental setup We compare the numerical performance of several previously mentioned methods on the (P-LFP) problem. These methods include

  1. (i)

    GFW-N: The GFW method (i.e., Algorithm 1) with decreasing step-sizes in Nesterov [2, Sect. 5]. Specifically, \(\alpha _t = \frac{6(t+1)}{(t+2)(2t+3)}\) for \(t\ge 0\).

  2. (ii)

    GFW-G: The GFW method (i.e., Algorithm 1) with constant step-sizes in Ghadimi [3, Corollary 1(b)]. Specifically, \(\alpha _t = 1/(1+4\kappa )\) for \(t\ge 0\), where \(\kappa =\Vert A\Vert _{1,\infty }^2/\eta ^2\).

  3. (iii)

    GFWDA: Algorithm 2 (or equivalently, Algorithm 4) with constant step-sizes as in Theorem 3.1. Specifically, \(\alpha _t = \min \{1/(2\kappa ),1\}\) for \(t\ge 0\), where \(\kappa =\Vert A\Vert _{1,\infty }^2/\eta ^2\).

  4. (iv)

    LFP: Algorithm 3 with decreasing step-sizes as in Theorem 4.2. Specifically, \(\alpha _t = 2/(t+2)\) for \(t\ge 0\).

To generate the data matrix A, we choose the dimensions \(m = 100\) and \(n=200\), and generate each entry of A independently from the uniform distribution on the interval \([-8, 8]\). In addition, we choose \(\eta = 10\). For the specific instance of A used in our experiments, we have \(\Vert A\Vert _{1,\infty }\approx 8.0\) and hence \(\kappa =\Vert A\Vert _{1,\infty }^2/\eta ^2 \approx 0.64.\)
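The data generation can be sketched as follows (the random seed is our illustrative choice, so the printed values will only approximately match the instance used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(42)     # illustrative seed
m, n, eta = 100, 200, 10.0
A = rng.uniform(-8.0, 8.0, size=(m, n))
norm_A = np.abs(A).max()            # ||A||_{1,inf} = max_{j,i} |a_{j,i}|
kappa = norm_A ** 2 / eta ** 2      # condition number of (P-LFP)
print(norm_A, kappa)                # approximately 8.0 and 0.64
```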

Comparison criterion and starting points Note that each of the four methods above generates a certain sequence of duality gaps that converges to zero. Specifically, the sequence generated by GFW-N and GFW-G is \(\{\bar{G}_t\}_{t\ge 0}\) [cf. (1.3)], and the sequence generated by GFWDA and LFP is \(\{\Delta (x^t,y^t)\}_{t\ge 0}\) [cf. (2.4)]. Accordingly, we use the convergence speed of these duality gaps as the comparison criterion. For starting points, we choose \(x^0 = e_1\) for all four methods. In addition, for GFWDA, we choose \(y^0 = \nabla f(\mathsf {A}x^0)\), so that GFW-N, GFW-G and GFWDA have the same initial duality gap, namely \(\bar{G}_0 = G(x^0) = \Delta (x^0,\nabla f(\mathsf {A}x^0)) = \Delta (x^0,y^0)\). As for LFP, since we need to choose \(y^0 = e_{j_0}\) for some \(j_0\in [m]\) (cf. Algorithm 3), we let \(j_0 = \mathop {\mathrm {arg\,max}}\limits _{j\in [m]} \nabla _j f(\mathsf {A}x^0)\). Note that this choice of \(y^0\) results in a larger initial duality gap than the one given by \(y^0 = \nabla f(\mathsf {A}x^0)\); however, in our experiments, we observe that the difference is not significant.

Experimental results We plot the duality gaps generated by all four methods versus iterations in Fig. 1, in both log-linear and log-log scales. Since LFP is a stochastic algorithm, we run it 10 times (with the same starting points as described above) and plot the averaged duality-gap trajectories. From Fig. 1, we can make the following observations. First, GFWDA and LFP are the fastest and slowest among the four methods, respectively. In fact, GFWDA produces a duality gap of order \(10^{-14}\) in less than 15 iterations, while LFP hardly makes any progress during the first 30 iterations. Second, GFW-N converges at a sub-linear rate that is much faster than the \(O(1/t^2)\) rate derived from theory [cf. (1.4)]. This is probably because (P-LFP) possesses certain structural properties (other than smoothness and strong convexity) that are favorable to GFW-N. Third, although both GFWDA and GFW-G converge linearly, the linear rate of GFW-G is slower than that of GFWDA. This agrees with our theoretical analysis: the linear rate of GFW-G is \(1-1/(2(1+4\kappa ))\) [cf. (1.5)], which is closer to one than the linear rate of GFWDA, namely \(1-{1}/(4\kappa )\) (cf. Theorem 3.1).

Fig. 1 Comparison of the convergence speed of duality gaps generated by GFW-N, GFW-G, GFWDA and LFP in (a) log-linear scale and (b) log-log scale

Fig. 2 Log–log plot of the averaged duality-gap trajectories generated by LFP

Table 1 Slopes of the plot in Fig. 2 over iteration intervals of constant length in log-scale

Next, let us examine the local convergence rate of LFP. From the plot in Fig. 2, we can observe an O(1/t) convergence rate starting from around 100 iterations. For better illustration, in Table 1, we compute the slopes of this plot over iteration intervals of constant length in log-scale. From Table 1, we can clearly see that the magnitudes of the slopes are initially very small, corresponding to the slow initial convergence of the duality gaps; however, they gradually approach one, corresponding to the O(1/t) local convergence rate “predicted” by Theorem 4.2.
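The slopes reported in Table 1 can be computed from a duality-gap trajectory as follows (a Python sketch; the number of intervals is an illustrative parameter). A fitted slope approaching \(-1\) in log–log scale corresponds to the O(1/t) rate.

```python
import numpy as np

def loglog_slopes(gaps, num_intervals=8):
    """Least-squares slopes of log(gap) vs. log(t) over intervals of
    constant length in log-scale, as in Table 1."""
    t = np.arange(1, len(gaps) + 1)
    log_t, log_g = np.log(t), np.log(np.asarray(gaps))
    edges = np.linspace(log_t[0], log_t[-1], num_intervals + 1)
    slopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (log_t >= lo) & (log_t <= hi)
        if mask.sum() >= 2:                      # need two points to fit
            slopes.append(np.polyfit(log_t[mask], log_g[mask], 1)[0])
    return slopes
```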