Optimistic optimisation of composite objective with exponentiated update

Shao, Weijia; Sivrikaya, Fikret; Albayrak, Sahin

doi:10.1007/s10994-022-06229-1

Optimistic optimisation of composite objective with exponentiated update

Open access
Published: 22 August 2022

Volume 111, pages 4719–4764, (2022)
Cite this article

Download PDF

You have full access to this open access article

Machine Learning Aims and scope Submit manuscript

Optimistic optimisation of composite objective with exponentiated update

Download PDF

Weijia Shao¹,
Fikret Sivrikaya² &
Sahin Albayrak^1,2

2156 Accesses
2 Citations
1 Altmetric
Explore all metrics

This article has been updated

Abstract

This paper proposes a new family of algorithms for the online optimisation of composite objectives. The algorithms can be interpreted as the combination of the exponentiated gradient and p-norm algorithm. Combined with algorithmic ideas of adaptivity and optimism, the proposed algorithms achieve a sequence-dependent regret upper bound, matching the best-known bounds for sparse target decision variables. Furthermore, the algorithms have efficient implementations for popular composite objectives and constraints and can be converted to stochastic optimisation algorithms with the optimal accelerated rate for smooth objectives.

Adaptive Zeroth-Order Optimisation of Nonconvex Composite Objectives

A hybrid stochastic optimization framework for composite nonconvex optimization

Article 04 January 2021

Constrained composite optimization and augmented Lagrangian methods

Article Open access 08 February 2023

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Many machine learning problems involve minimising high dimensional composite objectives (Dhurandhar et al., 2018; Lu et al., 2014; Ribeiro et al., 2016; Xie et al., 2018). For example, in the task of explaining predictions of an image classifier (Dhurandhar et al., 2018; Ribeiro et al., 2016), we need to find a sufficiently small set of features explaining the prediction by solving the following constrained optimisation problem

$$\begin{aligned} \begin{aligned} \min _{x\in {\mathbb{R}}^{d}}\quad&l(x)+\lambda _{1}\Vert x\Vert _{1}+\frac{\lambda _{2}}{2}\Vert x\Vert _{2}^{2}\\ \ {\text{s.t.}} \quad&\vert x_{i} |\le c_{i} \quad {\text{for all }} i=1,\ldots , d, \\ \end{aligned} \end{aligned}$$

where l is a function relating to the classifier, $\lambda _{1}$ controls the sparsity of the feature set, $\lambda _{2}$ controls the complexity of the feature set, and $c_{1},\ldots ,c_{d}$ are the ranges of the features. For l with a complicated structure and large d, it is practical to solve the problem by optimising the first-order approximation of the objective function (Lan, 2020). However, the first-order methods can not attain optimal performance due to the non-smooth component $\lambda _{1}\Vert \cdot \Vert _{1}$. Furthermore, the purpose of introducing the $\ell _{1}$ regularisation is to ensure the sparsity of the decision variable. Applying the first-order algorithms directly on the subgradient of $\lambda _{1}\Vert \cdot \Vert _{1}$ does not lead to sparse updates (Duchi et al., 2010). We refer to the objective function consisting of a loss with a complicated structure and a simple (possibly non-smooth) convex regularisation term as a composite objective.

This paper focuses on the more general online convex optimisation (OCO), which can be considered as an iterative game between a player and an adversary. In each round t of the game, the player makes a decision $x_{t}\in {\mathcal{K}}$. Next, the adversary selects and reveals a convex loss $l_{t}$ to the player, who then suffers the composite loss $f_{t}(x)=l_{t}(x)+r_{t}(x)$, where $l_{t}:{\mathcal{K}}\rightarrow {\mathbb{R}}$ is a convex function revealed at each iteration and $r_{t}:{\mathbb{X}}\rightarrow {\mathbb{R}}_{\ge 0}$ is a known closed convex function. The target is to develop algorithms minimising the regret of not choosing the best decision $x\in {\mathcal{K}}$

$$\begin{aligned} {\mathcal{R}}_{1:T}=\sum _{t=1}^{T}f_{t}(x_{t})-\min _{x\in {\mathcal{K}}}\sum _{t=1}^{T}f_{t}(x). \end{aligned}$$

An online optimisation algorithm can be converted into a stochastic optimisation algorithm using the online-to-batch conversion technique (Cesa-Bianchi et al., 2004), which is our primary motivation. In addition to that, online optimisation also has many direct applications, such as recommender systems (Song et al., 2014) and time series prediction (Anava et al., 2013).

Given a sequence of subgradients $\{g_{t}\}$ of $\{l_{t}\}$, we are interested in the so-called adaptive algorithms ensuring regret bounds of the form ${\mathcal{O}}(\sqrt{\sum _{t=1}^{T}\Vert g_{t}\Vert _{*} ^{2}})$. The adaptive algorithms are worst-case optimal in the online setting (McMahan & Streeter, 2010) and can be converted into stochastic optimisation algorithms with optimal convergence rates (Cutkosky, 2019; Joulani et al., 2020; Kavis et al., 2019; Levy et al., 2018). The adaptive subgradient methods (AdaGrad) (Duchi et al., 2011) and their variants (Alacaoglu et al., 2020; Duchi et al., 2011; Orabona & Pál, 2018; Orabona et al., 2015) have become the most popular adaptive algorithms in recent years. They are often applied to estimating deep learning models and outperform standard optimisation algorithms when the gradient vectors are sparse. However, such property can not be expected in every problem. If the decision variables are in an $\ell _{1}$ ball and gradient vectors are dense, the Adagrad-style algorithms do not have an optimal theoretical guarantee due to the sub-linear regret dependence on the dimensionality.

The exponentiated gradient (EG) methods (Arora et al., 2012; Kivinen & Warmuth, 1997), which are designed for estimating weights in the positive orthant, enjoy the regret bound growing logarithmically with the dimensionality. The ${{\textit{EG}}^{\pm }}$ algorithm generalises this idea to negative weights (Kivinen & Warmuth, 1997; Warmuth, 2007). Given d dimensional problems with the maximum norm of the gradient bounded by G, the regret of ${{\textit{EG}}^{\pm }}$ is upper bounded by ${\mathcal{O}}(G\sqrt{T\ln d})$. As the performance of the ${{\textit{EG}}^{\pm }}$ algorithm depends strongly on the choice of hyperparameters, the p-norm algorithm (Gentile, 2003), which is less sensitive to the tuning of hyperparameters, is introduced to approach the logarithmic behaviour of ${{\textit{EG}}^{\pm }}$. Kakade et al. (2012) further extends the p-norm algorithm to learning with matrices. An adaptive version of the p-norm algorithm is analysed in Orabona et al. (2015), which has a regret upper bound proportional to $\Vert x\Vert _{p,*}^{2}\sqrt{\sum _{t=1}^{T}\Vert g_{t}\Vert _{p}^{2}}$ for a given sequence of gradients $\{g_{t}\}$. By choosing $p=2\ln d$, a regret upper bound ${\mathcal{O}}(\Vert x\Vert _{1}^{2}\sqrt{\ln d \sum _{t=1}^{T}\Vert g_{t}\Vert _{\infty }^{2}})$ can be achieved. However, tuning hyperparameters is still required to attain the optimal regret ${\mathcal{O}}(\Vert x\Vert _{1}\sqrt{\ln d \sum _{t=1}^{T}\Vert g_{t}\Vert _{\infty }^{2}})$.

Recently, Ghai et al. (2020) has introduced a hyperbolic regulariser for online mirror descent update (HU), which can be viewed as an interpolation between gradient descent and EG. It has a logarithmic behaviour as in EG and a stepsize that can be flexibly scheduled as gradient descent. However, many optimisation problems with sparse targets have an $\ell _{1}$ or nuclear regulariser in the objective function. Otherwise, the optimisation algorithm has to pick a decision variable from a compact decision set. Due to the hyperbolic regulariser, it is difficult to derive a closed-form solution for either case. Ghai et al. (2020) has proposed a workaround by tuning a temperature-like hyperparameter to normalise the decision variable at each iteration, which is equivalent to the ${{\textit{EG}}^{\pm }}$ algorithm and leads to a performance dependence on the tuning.

This paper proposes a family of algorithms for the online optimisation of composite objectives. The algorithms employ an entropy-like regulariser combined with algorithmic ideas of adaptivity and optimism. Equipped with the regulariser, the online mirror descent (OMD) and the follow-the-regulariser-leader (FTRL) algorithms update the absolute value of the scalar components of the decision variable in the same way as EG in the positive orthant. The directions of the decision variables are set in the same way as the p-norm algorithm. To derive the regret upper bound, we first show that the regulariser is strongly convex with respect to the $\ell _{1}$-norm over the $\ell _{1}$ ball. Then we analyse the algorithms in the comprehensive framework for optimistic algorithms with adaptive regularisers (Joulani et al., 2017). Given the radius of decision set D, sequences of gradients $\{g_{t}\}$ and hints $\{h_{t}\}$, the proposed algorithms achieve a regret upper bound in the form of ${\mathcal{O}}(D\sqrt{\ln d\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert ^{2}_{\infty }})$. With the techniques introduced in Ghai et al. (2020), a spectral analogue of the entropy-like regulariser can be found and proved to be strongly convex with respect to the nuclear norm over the nuclear ball, from which the best-known regret upper bound depending on $\sqrt{\ln (\min \{m,n\})}$ for problems in ${\mathbb{R}}^{m,n}$ follows.

Furthermore, the algorithms have closed-form solutions for the $\ell _{1}$ and nuclear regularised objective functions. For the $\ell _{2}$ and Frobenius regularised objectives, the update rules involve values of the principal branch of the Lambert function, which can be well approximated. We propose a sorting based procedure projecting the solution to the decision set for the $\ell _{1}$ or nuclear ball constrained problems. Finally, the proposed online algorithms can be converted into algorithms for stochastic optimisation with the technique introduced in Joulani et al. (2020). We show that the converted algorithms guarantee an optimal accelerated convergence rate for smooth objective functions. The convergence rate depends logarithmically on the dimensionality of the problem, which suggests its advantage compared to the accelerated AdaGrad-Style algorithms (Cutkosky, 2019; Joulani et al., 2020; Levy et al., 2018).

The rest of the paper is organised as follows. Section 2 reviews the existing work. Section 3 introduces the notation and preliminary concepts. Next, we present and analyse our algorithms in Sect. 4. In Sect. 5, we derive efficient implementations for some popular choices of composite objectives, constraints and stochastic optimisation. Section 6 demonstrates the empirical evaluations using both synthetic and real-world data. Finally, we conclude our work in Sect. 7.

2 Related work

Our primary motivation is to solve the optimisation problems with an elastic net regulariser in their objective function, which are highly involved in attacking (Cancela et al., 2021; Carlini & Wagner, 2017; Chen et al., 2018) and explaining (Dhurandhar et al., 2018; Ribeiro et al., 2016) deep neural networks. The proximal gradient method (PGD) (Nesterov, 2003) and its accelerated variants (Beck & Teboulle, 2009) are usually applied to solving the problem. However, these algorithms are not practical since they require prior knowledge about the smoothness of the objective function to ensure their convergence.

The AdaGrad-style algorithms (Alacaoglu et al., 2020; Duchi et al., 2011; Orabona & Pál, 2018; Orabona et al., 2015) have become popular in the machine learning community in recent years. Given the gradient vectors $g_{1},\ldots , g_{t}$ received at iteration t, the core idea of these algorithms is to set the stepsizes proportional to $\frac{1}{\sqrt{\sum _{s=1}^{t-1}\Vert g_{s}\Vert _{*} ^{2}}}$ to ensure a regret upper bounded by ${\mathcal{O}}(\sqrt{\sum _{t=1}^{T}\Vert g_{t}\Vert _{*} ^{2}})$ after T iterations. Online learning algorithms with this adaptive regret can be directly applied to the stochastic optimisation problems (Alacaoglu et al., 2020; Li & Orabona, 2019) or can be converted into a stochastic algorithm (Cesa-Bianchi & Gentile, 2008) with a convergence rate ${\mathcal{O}}(\frac{1}{\sqrt{T}})$. This rate can be further improved to ${\mathcal{O}}(\frac{1}{T^{2}})$ for unconstrained problems with smooth loss functions by applying the acceleration techniques (Cutkosky, 2019; Kavis et al., 2019; Levy et al., 2018). These acceleration techniques do not require prior knowledge about the smoothness of the loss function and a guarantee convergence rate of ${\mathcal{O}}(\frac{1}{\sqrt{T}})$ for non-smooth functions. Joulani et al. (2020) has proposed a simple approach to accelerate optimistic online optimisation algorithms with adaptive regret bound.

Given a d-dimensional problem, the algorithms mentioned above have a regret upper bound depending (sub-) linearly on d. We are interested in a logarithmic regret dependence on the dimensionality, which can be attained by the ${\textit{EG}}$ family algorithms (Arora et al., 2012; Kivinen & Warmuth, 1997; Warmuth, 2007) and their adaptive optimistic extension (Steinhardt & Liang, 2014). However, these algorithms work only for decision sets in the form of cross-polytopes and require prior knowledge about the radius of the decision set for general convex optimisation problems. The p-norm algorithm (Gentile, 2003; Kakade et al., 2012) does not have the limitation mentioned above; however, it still requires prior knowledge about the problem to attain optimal performance (Orabona et al., 2015). The HU algorithm (Ghai et al., 2020), which interpolates gradient descent and EG, can theoretically be applied to loss functions with elastic net regularisers and decision sets other than cross-polytopes. However, it is not practical due to the complex projection step.

Following the idea of HU, we propose more practical algorithms interpolating EG and the p-norm algorithm. The core of our algorithm is a symmetric logarithmic function. Orabona (2013) first introduced the idea of composing the single-dimensional symmetric logarithmic function and a norm to generalise EG to the infinite-dimensional space. It has become popular for parameter-free optimisation (Cutkosky & Boahen, 2016, 2017a, b; Kempka et al., 2019) since one can easily construct an adaptive regulariser with this composition (Cutkosky & Boahen, 2017a). In this paper, instead of using the composition, we apply the symmetric logarithmic function directly to each entry of a vector to construct a symmetric entropy-like function that is strongly convex with respect to the $\ell _{1}$ norm. We analyse MD and FTRL with the entropy-like function in the framework developed in Joulani et al. (2017). The analysis of the spectral analogue of the entropy-like function follows the idea proposed in Ghai et al. (2020).

3 Preliminary

The focus of this paper is OCO with the decision variable taken from a compact convex subset ${\mathcal{K}}\subseteq {\mathbb{X}}$ of finite dimensional vector space equipped with a norm $\Vert \cdot \Vert$. Given a sequence of vectors $\{v_{t}\}$, we use the compressed-sum notation $v_{1:t}= \sum _{s=1}^{t}v_{s}$ for simplicity. We denote by ${\mathbb{X}}_{*}$ the dual space with the dual norm $\Vert \cdot \Vert _{*}$. The bi-linear map combining vectors in ${\mathbb{X}}_{*}$ and ${\mathbb{X}}$ is denoted by

$$\begin{aligned} \langle \cdot ,\cdot \rangle :{\mathbb{X}}_{*}\times {\mathbb{X}}\rightarrow {\mathbb{R}}, (\theta , x)\mapsto \theta x. \end{aligned}$$

For ${\mathbb{X}}={\mathbb{R}}^{d}$, we denote by $\Vert \cdot \Vert _{1}$ the $\ell _{1}$ norm, the dual norm of which is the maximum norm denoted by $\Vert \cdot \Vert _{\infty }$. It is well known that the $\ell _{2}$ norm denoted by $\Vert \cdot \Vert _{2}$ is self-dual. In case ${\mathbb{X}}$ is the space of the matrices, for simplicity, we also use $\Vert \cdot \Vert _{1}$, $\Vert \cdot \Vert _{2}$ and $\Vert \cdot \Vert _{\infty }$ for the nuclear, Frobenius and spectral norm, respectively.

Let $\sigma :{\mathbb{R}}^{m,n}\rightarrow {\mathbb{R}}^{\min \{m,n\}}$ be the function mapping a matrix to its singular values. Define

$$\begin{aligned} {\text{diag}}:{\mathbb{R}}^{\min \{m,n\}}\rightarrow {\mathbb{R}}^{m,n}, x\mapsto X \end{aligned}$$

with

$$\begin{aligned} X_{ij}= {\left\{ \begin{array}{ll} x_{i}, &{}\quad {\text{if }} i=j \\ 0, &{} \quad {\text{otherwise}}. \\ \end{array}\right. } \end{aligned}$$

Clearly, the singular value decomposition (SVD) of a matrix X can be expressed as

$$\begin{aligned} X=U{\text{diag}}(\sigma (X))V^{\top }. \end{aligned}$$

Similarly, we write the eigendecomposition of a symmetric matrix X as

$$\begin{aligned} X=U{\text{diag}}(\lambda (X))U^{\top }, \end{aligned}$$

where we denote by $\lambda :{\mathbb{S}}^{d}\mapsto {\mathbb{R}}^{d}$ the function mapping a symmetric matrix to its spectrum.

Given a convex set ${\mathcal{K}}\subseteq {\mathbb{X}}$ and a convex function $f:{\mathcal{K}}\rightarrow {\mathbb{R}}$ defined on ${\mathcal{K}}$, we denote by $\partial f(y)=\{g\in {\mathbb{X}}_{*}|\forall y\in {\mathcal{K}}.f(x)-f(y)\ge \langle g,x-y\rangle \}$ the subgradient of f at y. We refer to $\triangledown f(y)$ any element in $\partial f(y)$. A function is $\eta$-strongly convex with respect to $\Vert \cdot \Vert$ over ${\mathcal{K}}$ if

$$\begin{aligned} f(x)-f(y)\ge \langle \triangledown f(y),x-y\rangle +\frac{\eta }{2}\Vert x-y\Vert ^{2} \end{aligned}$$

holds for all $x,y\in {\mathcal{K}}$ and $\triangledown f(y)\in \partial f(y)$.

4 Algorithms and analysis

In this section, we present and analyse our algorithms, which begins with a short review on EG and the p-norm algorithm for the case $f_{t}= l_{t}$. The EG algorithm can be considered as an instance of OMD, the update rules of which is given by

$$\begin{aligned} x_{t+1,i}\propto \exp \left( \ln (x_{t,i})-\frac{1}{\eta }g_{t,i}\right) , \end{aligned}$$

where $g_{t}\in \partial f_{t}(x_{t})$ is the subgradient, and $\eta > 0$ is the stepsize. Although the algorithm has the expected logarithmic dependence on the dimensionality, its update rule is applicable only to the decision variables on the standard simplex. For the problem with decision variables taken from an $\ell _{1}$ ball $\{x|\Vert x\Vert _{1}\le D\}$, one can apply the ${\textit{EG}}^{\pm }$ trick, i.e. use the vector $[\frac{D}{2}g_{t}^{\top },-\frac{D}{2}g_{t}^{\top }]^{\top }$ to update $[x_{t+1,+}^{\top },x_{t+1,-}^{\top }]^{\top }$ at iteration t and choose the decision variable $x_{t+1,+}-x_{t+1,-}$. However, if the decision set is implicitly given by a regularisation term, the parameter D has to be tuned. Since applying an overestimated D increases regret, while using an underestimated D decreases the freedom of the model, the algorithm is sensitive to tuning. For composite objectives, EG is not practical due to its update rule.

Compared to EG, the p-norm algorithm, the update rule of which is given by

$$\begin{aligned} \begin{aligned} y_{t+1,i}=\,&{\text{sgn}}(x_{t,i})\vert x_{t,i} |^{p-1}\Vert x_{t}\Vert _{p}^{\frac{2}{p-1}}-\frac{1}{\eta }g_{t,i}\\ x_{t+1,i}=\,&{\text{sgn}}(y_{t+1,i})\vert y_{t+1,i} |^{q-1}\Vert y_{t+1}\Vert _{q}^{\frac{2}{q-1}},\\ \end{aligned} \end{aligned}$$

is better applicable for unknown D. To combine the ideas of EG and the p-norm algorithm, we consider the following generalised entropy function

$$\begin{aligned} \phi : {\mathbb{R}}\rightarrow {\mathbb{R}}, x\mapsto \alpha (\vert x |+\beta )\ln \left( \frac{\vert x |}{\beta }+1\right) -\alpha \vert x |. \end{aligned}$$

(1)

In the next lemma, we show the twice differentiability and strict convexity of $\phi$, based on which a strongly convex potential function for OMD in a compact decision set can be constructed.

Lemma 1

$\phi$ is twice continuous differentiable and strictly convex with

1.
$\phi '(x)=\alpha \ln \left( \frac{\vert x |}{\beta }+1\right) {\text{sgn}}(x)$
2.
$\phi ''(x)=\frac{\alpha }{\vert x |+\beta }$.

Furthermore, the convex conjugate given by $\phi ^{*}:{\mathbb{R}}\rightarrow {\mathbb{R}}, \theta \mapsto \alpha \beta \exp \frac{\vert \theta |}{\alpha }-\beta \vert \theta |-\alpha \beta$ is also twice continuous differentiable with

1.
$\phi ^{*\prime }(\theta )=\left( \beta \exp \frac{\vert \theta |}{\alpha }-\beta \right) {\text{sgn}}(\theta )$
2.
$\phi ^{*\prime \prime }(\theta )=\frac{\beta }{\alpha } \exp \frac{\vert \theta |}{\alpha }.$

Since we can expand the natural logarithm as $\ln (\frac{\vert x |}{\beta }+1)=\frac{\vert x |}{\beta }-\frac{\vert x |^{2}}{2\beta ^{2}}+\frac{\vert x |^{3}}{3\beta ^{3}}-\cdots$, $\phi (x)$ can be intuitively considered as an interpolation between the absolute value and square. As observed in Fig. 1a, it is closer to the absolute value compared to the hyperbolic entropy introduced in Ghai et al. (2020). Moreover, running OMD with regulariser $x\mapsto \sum _{i=1}^{d}\phi (x_{i})$ yields an update rule

$$\begin{aligned} \begin{aligned} y_{t+1,i}=\,&{\text{sgn}}(x_{t,i})\ln \left( \frac{\vert x_{t,i} |}{\beta }+1\right) -\frac{1}{\alpha }g_{t,i}\\ x_{t+1,i}=\,&{\text{sgn}}(y_{t+1,i})(\beta \exp (\vert y_{t+1,i} |)-\beta ),\\ \end{aligned} \end{aligned}$$

which sets the signs of coordinates like the p-norm algorithm and updates the scale similarly to EG. As illustrated in Fig. 1b, the mirror map $\triangledown \phi ^{*}$ is close to the mirror map of EG, while the behavior of HU is more similar to the gradient descent update.

4.1 Algorithms in the Euclidean space

To obtain an adaptive and optimistic algorithm, we define the following time varying function

$$\begin{aligned} \phi _{t}:{\mathbb{R}}^{d}\rightarrow {\mathbb{R}}, x\mapsto \alpha _{t} \sum _{i=1}^{d}\left( (\vert x_{i} |+\beta )\ln \left( \frac{\vert x_{i} |}{\beta }+1\right) -\vert x_{i} |\right) , \end{aligned}$$

(2)

and apply it to the adaptive optimistic OMD (AO-OMD) given by

$$\begin{aligned} \begin{aligned} x_{t+1}&=\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}\langle g_{t}-h_{t}+h_{t+1},x\rangle +r_{t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,x_{t}) \end{aligned} \end{aligned}$$

(3)

for the sequence of subgradients $\{g_{t}\}$ and hints $\{h_{t}\}$. In a bounded domain, $\phi _{t}$ is strongly convex with respect to $\Vert \cdot \Vert _{1}$, which is shown in the next lemma.

Lemma 2

Let ${\mathcal{K}}\subseteq {\mathbb{R}}^{d}$ be convex and bounded such that $\Vert x\Vert _{1}\le D$ for all $x\in {\mathcal{K}}$. Then we have for all $x,y\in {\mathcal{K}}$

$$\begin{aligned} \phi _{t}(x)-\phi _{t}(y)\ge \triangledown \phi _{t}(y)^{\top }(x-y)+\frac{\alpha _{t}}{D+d\beta }\Vert x-y\Vert _{1}^{2}. \end{aligned}$$

With the property of the strong convexity, the regret of AO-OMD with regulariser (2) can be analysed in the framework of optimistic algorithm (Joulani et al., 2017) and is upper bounded by the following theorem.

Theorem 1

Let ${\mathcal{K}}\subseteq {\mathbb{R}}^{d}$ be a compact convex set. Assume that there is some $D>0$ such that $\Vert x\Vert _{1}\le D$ holds for all $x\in {\mathcal{K}}$. Let $\{x_{t}\}$ be the sequence generated by update rule (3) with regulariser (2). Setting $\beta =\frac{1}{d}$, $\eta =\sqrt{\frac{1}{\ln (D+1)+\ln d}}$, and $\alpha _{t}=\eta \sqrt{\sum _{s=1}^{t-1}\Vert g_{s}-h_{s}\Vert ^{2}_{\infty }}$, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le&r_{1}(x_{1})+c(d,D)\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}} \end{aligned} \end{aligned}$$

for some $c(d,D)\in {\mathcal{O}}(D\sqrt{\ln (D+1)+\ln d})$.

EG can also be considered as an instance of FTRL with a constant stepsize. The update rule of the adaptive optimistic FTRL (AO-FTRL) is given by

$$\begin{aligned} \begin{aligned} x_{t+1}&=\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}\langle g_{1:t}+h_{t+1},x\rangle +r_{1:t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,x_{1}). \end{aligned} \end{aligned}$$

(4)

The regret of AO-FTRL is upper bounded by the following theorem.

Theorem 2

Let ${\mathcal{K}}\subseteq {\mathbb{R}}^{d}$ be a compact convex set with $d>e$. Assume that there is some $D\ge 1$ such that $\Vert x\Vert _{1}\le D$ holds for all $x\in {\mathcal{K}}\subseteq {\mathbb{R}}^{d}$. Let $\{x_{t}\}$ be the sequence generated by updating rule (4) with regulariser (2) at iteration t. Setting $\beta =\frac{1}{d}$, $\eta =\sqrt{\frac{1}{\ln (D+1)+\ln d}}$ and $\alpha _{t}=\eta \sqrt{\sum _{s=1}^{t-1}\Vert g_{s}-h_{s}\Vert _{\infty }^{2}}$, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&c(d,D)\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}\\ \end{aligned} \end{aligned}$$

for some $c(d,D)\in {\mathcal{O}}(D\sqrt{\ln (D+1)+\ln d})$.

4.2 Spectral algorithms

We now consider the setting in which the decision variables are matrices taken from a compact convex set ${\mathcal{K}}\subseteq {\mathbb{R}}^{m,n}$. A direct attempt to solve this problem is to apply the updating rule (3) or (4) to the vectorised matrices. A regret bound of ${\mathcal{O}}(D\sqrt{T\ln (mn)})$ can be guaranteed if the $\ell _{1}$ norm of the vectorised matrices from ${\mathcal{K}}$ are bounded by D, which is not optimal. In many applications, elements in ${\mathcal{K}}$ are assumed to have bounded nuclear norm, for which the regulariser

$$\begin{aligned} \Phi _{t}=\phi _{t}\circ \sigma \end{aligned}$$

(5)

can be applied. The next theorem gives the strong convexity of $\Phi _{t}$ with respect to $\Vert \cdot \Vert _{1}$ over ${\mathcal{K}}$, which allows us to use $\{\Phi _{t}\}$ as the potential functions in OMD and FTRL.

Theorem 3

Let $\sigma :{\mathbb{R}}^{m,n}\rightarrow {\mathbb{R}}^{d}$ be the function mapping a matrix to its singular values. Then the function $\Phi _{t}=\phi _{t}\circ \sigma$ is $\frac{\alpha _{t}}{2(D+\min \{m,n\}\beta )}$-strongly convex with respect to the nuclear norm over the nuclear ball with radius D.

The proof of Theorem 3 follows the idea introduced in Ghai et al. (2020). Define the operator

$$\begin{aligned} S:{\mathbb{R}}^{m,n}\rightarrow {\mathbb{S}}^{m+n}, X\mapsto \begin{bmatrix} 0 &{}\quad X\\ X^{\top } &{}\quad 0 \end{bmatrix} \end{aligned}$$

The set ${\mathcal{X}}=\{S(X)|\in {\mathbb{R}}^{m,n}\}$ is a finite dimensional linear subspace of the space of symmetric matrices ${\mathbb{S}}^{m+n}$. Its dual space ${\mathcal{X}}_{*}$ determined by the Frobenius inner product can be represented by ${\mathcal{X}}$ itself. For any $S(X)\in {\mathcal{X}}$, the set of eigenvalues of S(X) consists of the singular values and the negative singular values of X. Since $\phi$ is even, we have $\sum _{i=1}^{d}\phi (\sigma _{i}(X))=\sum _{i=1}^{d}\phi (\lambda _{i}(X))$ for symmetric X. The next lemma shows that both $\Phi _{t}|_{\mathcal{X}}$ and $\Phi ^{*}_{t}|_{\mathcal{X}}$ are twice differentiable.

Lemma 3

Let $f:{\mathbb{R}}\rightarrow {\mathbb{R}}$ be twice continuously differentiable. Then the function given by

$$\begin{aligned} F:{\mathbb{S}}^{d}\rightarrow {\mathbb{R}}, X\mapsto \sum _{i=1}^{d} f(\lambda _{i}(X)) \end{aligned}$$

is twice differentiable. Furthermore, let $X\in {\mathbb{S}}^{d}$ be a symmetric matrix with eigenvalue decomposition

$$\begin{aligned}X=U{\text{diag}}(\lambda _{1}(X),\ldots ,\lambda _{d}(X))U^{\top }.\end{aligned}$$

Define the matrix of the divided difference $\Gamma (f,X)=[\gamma (f,X)_{ij}]$ with

$$\begin{aligned} \gamma (f,X)_{ij}={\left\{ \begin{array}{ll} \frac{f(\lambda _{i}(X))-f(\lambda _{j}(X))}{\lambda _{i}(X)-\lambda _{j}(X)},&{}\quad {\text{if }}\lambda _{i}(X)\ne \lambda _{j}(X)\\ f'(\lambda _{i}(X)), &{} \quad {\text{otherwise}} \end{array}\right. } \end{aligned}$$

Then for any $G,H\in {\mathbb{S}}^{d}$, we have

$$\begin{aligned} D^{2}F(X)(G,H)=\sum _{i.j}\gamma (f',X)_{ij}\tilde{g}_{ij}\tilde{h}_{ij},\\ \end{aligned}$$

where $\tilde{g}_{ij}$ and $\tilde{h}_{ij}$ are the elements of the i-th row and j-th column of the matrix $U^{\top } G U$ and $U^{\top } H U$, respectively.

Lemma 3 implies the unsurprising positive semidefiniteness of $D^{2}F(X)$ for convex f. Furthermore, the exact expression of the second differential allows us to show the local smoothness of $\Phi ^{*}_{t}$ using the local smoothness of $\phi ^{*}$. Together with Lemma 4, the locally strong convexity of $\Phi _{t}|_{\mathcal{X}}$ can be proved.

Lemma 4

Let $\Phi :{\mathbb{X}}\rightarrow {\mathbb{R}}$ be a closed convex function such that $\Phi ^{*}$ is twice differentiable at some $\theta \in {\mathbb{X}}_{*}$ with positive definite $D^{2}\Phi ^{*}(\theta )\in {\mathcal{L}}({\mathbb{X}}_{*},{\mathcal{L}}({\mathbb{X}}_{*},{\mathbb{R}}))$. Suppose that $D^{2}\Phi ^{*}(\theta )(v,v)\le \Vert v\Vert _{*} ^{2}$ holds for all $v\in {\mathbb{X}}_{*}$. Then we have $D^{2}\Phi (D\Phi ^{*}(\theta ))(x,x)\ge \Vert x\Vert ^{2}$ for all $x\in {\mathbb{X}}$.

Lemma 4 can be considered as a generalised version of the local duality of smoothness and convexity proved in Ghai et al. (2020). The required positive definiteness of $D^{2}\Phi ^{*}_{t}(\theta )$ is guaranteed by the exact expression of the second differential described in Lemma 3 and the fact $\phi ^{*\prime \prime }(\theta )> 0$ for all $\theta \in {\mathbb{R}}$. Finally, using the construction of ${\mathcal{X}}$, the locally strong convexity of $\Phi _{t}|_{\mathcal{X}}$ can be extended to $\Phi _{t}$. The complete proofs of Theorem 3 and the technical lemmata can be found in “Appendix 2.1”.

With the property of the strong convexity, the regret of applying (5) to AO-OMD and AO-FTRL can be upper bounded by the following theorems.

Theorem 4

Let ${\mathcal{K}}\subseteq {\mathbb{R}}^{m,n}$ be a compact convex set. Assume that there is some $D>0$ such that $\Vert x\Vert _{1}\le D$ holds for all $x\in {\mathcal{K}}$. Let $\{x_{t}\}$ be the sequence generated by update rule (3) with regulariser (5) at iteration t. Setting $\beta =\frac{1}{\min \{m,n\}}$, $\eta =\sqrt{\frac{1}{\ln (D+1)+\ln \min \{m,n\}}}$, and $\alpha _{t}=\eta \sqrt{\sum _{s=1}^{t-1}\Vert g_{s}-h_{s}\Vert ^{2}_{\infty }}$, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le&r_{1}(x_{1})+c(m,n,D)\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}} \end{aligned} \end{aligned}$$

with $c(m,n,D)\in {\mathcal{O}}(D\sqrt{\ln (D+1)+\ln \min \{m,n\}})$.

Theorem 5

Let ${\mathcal{K}}\subseteq {\mathbb{R}}^{\min \{m,n\}}$ be a compact convex set with $\min \{m,n\}>e$. Assume that there is some $D\ge 1$ such that $\Vert x\Vert _{1}\le D$ holds for all $x\in {\mathcal{K}}$. Let $\{x_{t}\}$ be the sequence generated by updating rule (4) with time varying regulariser (5). Setting $\beta =\frac{1}{\min \{m,n\}}$, $\eta =\sqrt{\frac{1}{\ln (D+1)+\ln \min \{m,n\}}}$ and $\alpha _{t}=\eta \sqrt{\sum _{s=1}^{t-1}\Vert g_{s}-h_{s}\Vert _{\infty }^{2}}$, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&c(m,n,D)\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}},\\ \end{aligned} \end{aligned}$$

with $c(m,n,D)\in {\mathcal{O}}(D\sqrt{\ln (D+1)+\ln \min \{m,n\}})$.

With regulariser (5), both AO-OMD and AO-FTRL guarantee a regret upper bound proportional to $\sqrt{\ln \min \{m,n\}}$, which is the best known dependence on the size of the matrices.

5 Derived algorithms

Given $z_{t+1}\in {\mathbb{X}}_{*}$ and a time varying closed convex function $R_{t+1}:{\mathcal{K}}\rightarrow {\mathbb{R}}$, we consider the following updating rule

$$\begin{aligned} \begin{aligned} y_{t+1}&=\triangledown \phi ^{*}_{t+1}(z_{t+1})\\ x_{t+1}&=\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}R_{t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,y_{t+1}). \end{aligned} \end{aligned}$$

(6)

It is easy to verify that (6) is equivalent to

$$\begin{aligned} \begin{aligned} x_{t+1}=\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}R_{t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,y_{t+1})\\ =\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}R_{t+1}(x)+\phi _{t+1}(x)-\langle \triangledown \phi _{t+1}(y_{t+1}),x\rangle \\ =\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}R_{t+1}(x)+\phi _{t+1}(x)-\langle z_{t+1},x\rangle \\ \end{aligned} \end{aligned}$$

Setting $z_{t+1}=\triangledown \phi _{t+1}(x_{t})-g_{t}+h_{t}-h_{t+1}$ and $R_{t+1}=r_{t+1}$, we obtain the AO-OMD update

$$\begin{aligned} \begin{aligned} x_{t+1}=\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}\langle g_{t}-h_{t}+h_{t+1},x\rangle -\langle \triangledown \phi _{t+1}(x_{t}),x\rangle +\phi _{t+1}(x)+r_{t+1}(x)\\ =\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}\langle g_{t}-h_{t}+h_{t+1},x\rangle +r_{t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,x_{t}).\\ \end{aligned} \end{aligned}$$

Setting $z_{t+1}=-\triangledown \phi _{t+1}(x_{1})+g_{1:t}+h_{t+1}$ and $R_{t+1}=r_{1:t+1}$, we obtain the AO-FTRL update

$$\begin{aligned} \begin{aligned} x_{t+1}=\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}\langle g_{1:t}-\theta _{1}+h_{t+1},x\rangle +\phi _{t+1}(x)+r_{1:t+1}(x).\\ \end{aligned} \end{aligned}$$

The rest of this section focuses on solving the second line of (6) for some popular choices of $r$ and ${\mathcal{K}}$.

5.1 Elastic net regularisation

We first consider the setting of ${\mathcal{K}}={\mathbb{R}}^{d}$ and $R_{t+1}(x)=\gamma _{1} \Vert x\Vert _{1}+\frac{\gamma _{2}}{2}\Vert x\Vert ^{2}_{2}$, which has countless applications in machine learning. It is easy to verify that the Bregman divergence associated with $\psi _{t+1}$ is given by

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\phi _{t+1}}(x,y)=\,&\alpha _{t+1}\sum _{i=1}^{d}\left( (\vert x_{i} |+\beta )\ln \left( \frac{\vert x_{i} |}{\beta }+1\right) -\vert x_{i} |\right. \\&\left. -\,({\text{sgn}}(y_{i})x_{i}+\beta )\ln \left( \frac{\vert y_{i} |}{\beta }+1\right) +\vert y_{i} |\right) . \end{aligned} \end{aligned}$$

The minimiser of

$$\begin{aligned} R_{t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,y_{t+1}) \end{aligned}$$

in ${\mathbb{R}}^{d}$ can be simply obtained by setting the subgradient to 0. For $\ln (\frac{\vert y_{i,t+1} |}{\beta }+1)\le \frac{\gamma _{1}}{\alpha _{t+1}}$, we set $x_{i,t+1}=0$. Otherwise, the 0 subgradient implies ${\text{sgn}}(x_{i,t+1})={\text{sgn}}(y_{i,t+1})$ and $\vert x_{i,t+1} |$ given by the root of

$$\begin{aligned} \begin{aligned} \ln \left( \frac{\vert y_{i,t+1} |}{\beta }+1\right) =\ln \left( \frac{\vert x_{i,t+1} |}{\beta }+1\right) +\frac{\gamma _{1}}{\alpha _{t+1}}+\frac{\gamma _{2}}{\alpha _{t+1}}\vert x_{i,t+1} | \end{aligned} \end{aligned}$$

for $i=1,\ldots , d$. For simplicity, we set $a=\beta$, $b=\frac{\gamma _{2}}{\alpha _{t+1}}$ and $c=\frac{\gamma _{1}}{\alpha _{t+1}}-\ln (\frac{\vert y_{i,t+1} |}{\beta }+1)$. It can be verified that $\vert x_{i,t+1} |$ is given by

$$\begin{aligned} \vert x_{i,t+1} |=\frac{1}{b}W_{0}(ab\exp (ab-c))-a, \end{aligned}$$

(7)

where $W_{0}$ is the principal branch of the Lambert function and can be well approximated. For $\gamma _{2}=0$, i.e. the $\ell _{1}$ regularised problem, $\vert x_{i,t+1} |$ has the closed form solution

$$\begin{aligned} \vert x_{i,t+1} |=\beta \exp \left( \ln \left( \frac{\vert y_{i,t+1} |}{\beta }+1\right) -\frac{\gamma _{1}}{\alpha _{t+1}}\right) -\beta . \end{aligned}$$

(8)

The implementation is described in Algorithm 1.

5.2 Nuclear and Frobenius regularisation

Similarly, we consider ${\mathcal{K}}={\mathbb{R}}^{m,n}$ with a regulariser $R_{t+1}(x)=\gamma _{1} \Vert x\Vert _{1}+\frac{\gamma _{2}}{2}\Vert x\Vert ^{2}_{2}$ mixed with the nuclear and Frobenius norm. The second line of update rule (6) can be implemented as follows

$$\begin{aligned} \begin{aligned} {\text{Compute SVD:}}\; y_{t+1}=\,&U_{t+1}{\text{diag}}(\tilde{y}_{t+1})V_{t+1}^{\top }\\ {\text{Apply Algorithm 1:}}\; \tilde{x}_{t+1}=\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathbb{R}}^{d}} R_{t+1}(x)+{\mathcal{B}}_{\phi _{t+1}}(x,\tilde{y}_{t+1})\\ {\text{Construct:}}\; x_{t+1}=\,&U_{t+1}{\text{diag}}(\tilde{x}_{t+1})V_{t+1}^{\top }. \end{aligned} \end{aligned}$$

(9)

Let $y_{t+1}$ and $\tilde{y}_{t+1}$ be as defined in (9). It is easy to verify

$$\begin{aligned} \begin{aligned}&\mathop {{\text{arg min}}}\limits _{x\in {\mathbb{R}}^{m,n}} R_{t+1}(x)+{\mathcal{B}}_{\Phi _{t+1}}(x,y_{t+1})\\ =\,&\mathop {{\text{arg min}}}\limits _{x\in {\mathbb{R}}^{m,n}} R_{t+1}(x)+\Phi _{t+1}(x)-\langle U_{t+1}{\text{diag}}(\triangledown \phi _{t+1}(\tilde{y}_{t+1}))V_{t+1}^{\top },x\rangle _{F}. \end{aligned} \end{aligned}$$

(10)

From the characterisation of subgradient, it follows

$$\begin{aligned} \begin{aligned} \triangledown R_{t+1}(x)=U {\text{diag}}(\gamma _{1} {\text{sgn}}(\sigma (x))+\gamma _{2} \sigma (x)) V^{\top }, \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} \triangledown \Phi _{t}(x)=U {\text{diag}}(\triangledown \phi _{t}(\sigma (x))) V^{\top }, \end{aligned} \end{aligned}$$

where $x=U{\text{diag}}(\sigma (x))V^{\top }$ is SVD of x. Similar to the case in ${\mathbb{R}}^{d}$, $\tilde{x}_{t+1}$ is the root of

$$\begin{aligned} \gamma _{1} {\text{sgn}}(\sigma (x))+\gamma _{2} \sigma (x)+\triangledown \phi _{t}(\sigma (x))=\triangledown \phi _{t}(\tilde{y}_{t+1}). \end{aligned}$$

The subgradient of the objective (10) at $x_{t+1}=U_{t+1}{\text{diag}}(\tilde{x}_{t+1})V_{t+1}^{\top }$ is clearly 0.

5.3 Projection onto the cross-polytope

Next, we consider the setting where $r_{t}$ is the zero function and ${\mathcal{K}}$ is the $\ell _{1}$ ball with radius D. Clearly, we simply set $x_{t+1}=y_{t+1}$ for $\Vert y_{t+1}\Vert _{1}\le D$. Otherwise, Algorithm 2 describes a sorting based procedure projecting $y_{t+1}$ onto the $\ell _{1}$ ball with time complexity ${\mathcal{O}}(d\log d)$. The correctness of the algorithm is shown in the next lemma.

Lemma 5

Let $y\in {\mathbb{R}}^{d}$ with $\Vert y\Vert _{1}>D$ and $x^{*}$ as returned by Algorithm 2, then we have

$$\begin{aligned} x^{*}\in \mathop {{\text{arg min}}}\limits _{x\in {\mathcal{K}}}{\mathcal{B}}_{\psi _{t+1}}(x,y). \end{aligned}$$

For the case that ${\mathcal{K}}\subseteq {\mathbb{R}}^{m,n}$ is the nuclear ball with radius D and $\Vert y_{t+1}\Vert _{1}> D$, we need to solve the problem

$$\begin{aligned} \min _{x\in {\mathcal{K}}} \Phi _{t+1}(x)-\langle U_{t+1}{\text{diag}}(\triangledown \phi _{t+1}(\tilde{y}_{t+1}))V_{t+1}^{\top },x\rangle _{F}, \end{aligned}$$

where the constant part of the Bregman divergence is removed. From the von Neumann’s trace inequality, the Frobenius inner product is upper bounded by

$$\begin{aligned} \langle U_{t+1}\triangledown \phi _{t+1}(\tilde{y}_{t+1})V_{t+1}^{\top },x\rangle _{F}\le \sigma (x)^{\top }\triangledown \phi _{t+1}(\tilde{y}_{t+1}). \end{aligned}$$

The equality holds when x and $U_{t+1}\triangledown \phi _{t+1}(\tilde{y}_{t+1})V_{t+1}^{\top }$ share a simultaneous SVD, i.e. the minimiser has an SVD of the form

$$\begin{aligned} x=U_{t+1}{\text{diag}}(\triangledown \sigma (x))V_{t+1}^{\top }. \end{aligned}$$

Thus the problem is reduced to

$$\begin{aligned} \begin{aligned} \min _{x\in {\mathbb{R}}^{\min \{m,n\}}}&\phi _{t+1}(x)-\triangledown \phi _{t+1}(\tilde{y}_{t+1})^{\top } x\\ {\text{s.t.}} \quad&\sum _{i=1}^{\min \{m,n\}} x_{i}\le D\\&x_{i}\ge 0 {\text{ for all }} i=1,\ldots ,\min \{m,n\}, \\ \end{aligned} \end{aligned}$$

which can be solved by Algorithm 2. Thus, the projection of update rule (6) can be implemented as follows

$$\begin{aligned} \begin{aligned} {\text{Compute SVD:}}\; y_{t+1}=\,&U_{t+1}{\text{diag}}(\tilde{y}_{t+1})V_{t+1}^{\top }\\ {\text{Apply Algorithm 2:}}\; \tilde{x}_{t+1}=\,&{\text{project}}(\tilde{y}_{t+1},D,\beta )\\ {\text{Construct:}}\; x_{t+1}=\,&U_{t+1}{\text{diag}}(\tilde{x}_{t+1})V_{t+1}^{\top }. \end{aligned} \end{aligned}$$

(11)

5.4 Stochastic acceleration

Finally, we consider the stochastic optimisation problem of the form

$$\begin{aligned} \min _{x\in {\mathcal{K}}} l(x)+r(x), \end{aligned}$$

where $l:{\mathbb{X}}\rightarrow {\mathbb{R}}$ and $r:{\mathcal{K}}\rightarrow {\mathbb{R}}_{\ge 0}$ are closed convex functions. In the stochastic setting, instead of having a direct access to $\triangledown l$, we query a stochastic gradient $g_{t}$ of l at $z_{t}$ in each iteration t with ${\mathbb{E}}[g_{t}|z_{t}]\in \partial l(z_{t})$. Algorithms with a regret bound of the form ${\mathcal{O}}(\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{*} ^{2}})$ can be easily converted into a stochastic optimisation algorithm by applying the update rule to the scaled stochastic gradient $a_{t}g_{t}$ and hint $a_{t+1}g_{t}$, which is described in Algorithm 3. Joulani et al. (2020) has shown the convergence of accelerating Adagrad for the problem in ${\mathbb{R}}^{d}$. We extend the result to any finite dimensional normed vector space in the following corollary.

Corollary 1

Let $({\mathbb{X}},\Vert \cdot \Vert )$ be a finite dimensional normed vector space and ${\mathcal{K}}\subseteq {\mathbb{X}}$ a compact convex set. Denote by $\mathcal{A}$ be some optimistic algorithm generating $x_{t}\in {\mathcal{K}}$ at iteration t. Denote by

$$\begin{aligned} \nu _{t}^{2}={\mathbb{E}}[\Vert g_{t}-\triangledown l_{t}(z_{t})\Vert _{*} ^{2}|z_{t}] \end{aligned}$$

the variance. If $\mathcal{A}$ has a regret upper bound in the form of

$$\begin{aligned} c_{1}+c_{2}\sqrt{\sum _{t=1}^{T}\Vert a_{t}(g_{t}-g_{t-1})\Vert _{*} ^{2}} \end{aligned}$$

then there is some $L>0$ such that the error incurred by Algorithm 3 is upper bounded by

$$\begin{aligned} \begin{aligned} {\mathbb{E}}[f(z_{T})-f(x)]\le \,&\frac{c_{1}+c_{2}\sqrt{8\sum _{t=1}^{T}a_{t}^{2}(\nu _{t}^{2}+L^{2})}}{a_{1:T}}.\\ \end{aligned} \end{aligned}$$

Furthermore, if l is M-smooth, then we have

$$\begin{aligned} \begin{aligned} {\mathbb{E}}[f(z_{T})-f(x)]\le \,&\frac{ c_{1}+c_{2}\sqrt{8\sum _{t=1}^{T}a_{t}^{2}\nu _{t}^{2}}+\sqrt{2}c_{2}L+2Mc_{2}^{2}}{a_{1:T}}.\\ \end{aligned} \end{aligned}$$

Setting $\alpha _{t}=t$, we obtain a convergence of ${\mathcal{O}}(\frac{c_{2}}{\sqrt{T}})$ in general case, and ${\mathcal{O}}(\frac{c_{2}}{T^{2}}+\frac{c_{2}\max _{t}\nu _{t}}{\sqrt{T}})$ for smooth loss function. Applying update rule (3) or (4) with regulariser (2) or (5) to Algorithm 3, the constant $c_{2}$ is proportional to $\sqrt{\ln d}$ and $\sqrt{\ln (\min \{m,n\})}$ for ${\mathbb{X}}={\mathbb{R}}^{d}$ and ${\mathbb{X}}={\mathbb{R}}^{m,n}$ respectively, while the accelerated AdaGrad has a linear dependence on the dimensionality (Joulani et al., 2020).

6 Experiments

This section shows the empirical evaluation of the developed algorithms. We carry out experiments on both synthetic and real-world data and demonstrate the performances of the OMD (Exp-MD) and FTRL (Exp-FTRL) based on the exponentiated update.

6.1 Online logistic regression

For a sanity check, we simulate an d-dimensional online logistic regression problem, in which the model parameter $w^{*}$ has a $99\%$ sparsity and the non-zero values are randomly drawn from the uniform distribution over $[-1,1]$. At each iteration t, we sample a random feature vector $x_{t}$ from a uniform distribution over $[-1,1]^{d}$ and generate a label $y_{t}\in \{-1,1\}$ using a logit model, i.e. ${\text{Pr}}[y_{t}=1]=(1+\exp (-w^{\top } x_{t}))^{-1}$. The goal is to minimise the cumulative regret

$$\begin{aligned} {\mathcal{R}}_{1:T}=\sum _{t=1}^{T}l_{t}(w_{t})-\sum _{t=1}^{T}l_{t}(w^{*}) \end{aligned}$$

with $l_{t}(w)=\ln (1+\exp (-y_{t}w^{\top } x_{t}))$. We choose $d=10$,000 and compare our algorithms with AdaGrad, AdaFTRL (Duchi et al., 2011) and HU (Ghai et al., 2020). For both AdaGrad and AdaFTRL, we set the i-th entry of the proximal matrix $H_{t}$ to $h_{ii}=10^{-6}+\sum _{s=1}^{t-1}g_{s,i}^{2}$ as their theory suggested (Duchi et al., 2011). The stepsize of HU is set to $\sqrt{\frac{1}{\sum _{s=1}^{t-1}\Vert g_{s}\Vert _{\infty }^{2}}}$ leading to an adaptive regret upper bound. All algorithms take decision variables from an $\ell _{1}$ ball $\{w\in {\mathbb{R}}^{d}|\Vert w\Vert _{1}\le D\}$, which is the ideal case for HU. We examine the performances of the algorithms with known, underestimated and overestimated $\Vert w^{*}\Vert _{1}$ by setting $D=\Vert w^{*}\Vert _{1}$, $D=\frac{1}{2}\Vert w^{*}\Vert _{1}$ and $D=2\Vert w^{*}\Vert _{1}$, respectively. For each choice of D, we simulate the online process of each algorithm for 10,000 iterations and repeat the experiments for 20 trials.

Figure 2 plots the curves of the average cumulative regret with the ranges of standard deviation as shaded regions. As can be observed, our algorithms have a clear and stable advantage over the AdaGrad-style algorithms and slightly outperform HU in the experiments with known $\Vert w^{*}\Vert _{1}$. As the combination of the entropy-like regulariser and FTRL can also be used for parameter-free optimisation (Cutkosky & Boahen, 2017a), overestimating $\Vert w^{*}\Vert _{1}$ does not have a tangible impact on the performance of Exp-FTRL, which leads to its clear advantage over the rest.

6.2 Online multitask learning

Next, we examine the performance of the developed spectral algorithms using a simulated online multi-task learning problem (Kakade et al., 2012), in which we need to solve k highly correlated d-dimensional online prediction problems simultaneously. The data are generated as follows. We first randomly draw two orthogonal matrices $U\in {\text{GL}}(d,{\mathbb{R}})$ and $V\in {\text{GL}}(k,{\mathbb{R}})$. Then we generate a k-dimensional vector $\sigma$ with r non-zero values randomly drawn from a uniform distribution over [0, 10] and construct a low rank parameter matrix $W^{*}=U{\text{diag}}(\sigma )V$. At each iteration t, k feature and label pairs $(x_{t,1},y_{t,1}),\ldots ,(x_{t,k},y_{t,k})$ are generated using k logit models with the i-th parameters taken from the i-th rows of W. The loss function is given by $l_{t}(W)=\sum _{i=1}^{k}\ln (1+\exp (-y_{t,i}w_{i}^{\top } x_{t,i}))$. We set $d=100$, $k=25$ and $r=5$, take the nuclear ball $\{W\in {\mathbb{R}}^{d,k}|\Vert W\Vert _{1}\le D\}$ as the decision set and run the experiment as in Sect. 6.1. The average and standard deviation of the results over 20 trials are shown in Fig. 3.

Similar to the online logistic regression, our algorithms have a clear advantage over AdaGrad and AdaFTRL and slightly outperform HU in all settings. While the regret of the AdaGrad-style algorithms spread over a wider range, our algorithms yield relatively stabler results. The superiority of Exp-FTRL for the overestimated $\Vert W^{*}\Vert _{1}$ can also be observed from Fig. 3c.

6.3 Optimisation for contrastive explanations

Generating the contrastive explanation of a machine learning model (Dhurandhar et al., 2018) is the most motivating application of this paper. Given a sample $x_{0}\in {\mathcal{X}}$ and machine learning model $f:{\mathcal{X}}\rightarrow {\mathbb{R}}^{K}$, the contrastive explanation consists of a set of pertinent positive (PP) features and a set of pertinent negative (PN) features, which can be found by solving the following optimisation problem (Dhurandhar et al., 2018)

$$\begin{aligned} \begin{aligned} \min _{x\in {\mathcal{W}}}\quad&l_{x_{0}}(x)+\lambda _{1}\Vert x\Vert _{1}+\frac{\lambda _{2}}{2}\Vert x\Vert _{2}^{2}.\\ \end{aligned} \end{aligned}$$

Let $\kappa \ge 0$ be a constant and define $k_{0}=\arg \max _{i}f(x_{0})_{i}$. The loss function for finding PP is given by

$$\begin{aligned} l_{x_{0}}(x)=\max \left\{ \max _{i\ne k_{0}}f(x)_{i}-f(x)_{k_{0}},-\kappa \right\} , \end{aligned}$$

which imposes a penalty on the features that do not justify the prediction. PN is the set of features altering the final classification and is modelled by the following loss function

$$\begin{aligned} l_{x_{0}}(x)=\max \left\{ f(x_{0}+x)_{k_{0}}-\max _{i\ne k_{0}}f(x_{0}+x)_{i},-\kappa \right\} . \end{aligned}$$

In the experiment, we first train a ResNet20 model (He et al., 2016) on the CIFAR-10 dataset (Krizhevsky, 2009), which attains a test accuracy of $91.49\%$. For each class of the images, we randomly pick 100 correctly classified images from the test dataset and generate PP and PN for them. For PP, we take the set of all feasible images as the decision set, while for PN, we take the set of tensors x, such that $x_{0}+x$ is a feasible image.

We first consider the white-box setting, in which we have the access to $\triangledown l_{x_{0}}$. Our goal is to demonstrate the performance of the accelerated AO-OMD and AO-FTRL based on the exponentiated update (AccAOExpMD and AccAOExpFTRL). In Dhurandhar et al. (2018), the fast iterative shrinkage-thresholding algorithm (FISTA) (Beck & Teboulle, 2009) is applied to finding the PP and PN. Therefore, we take FISTA as our baseline. In addition, our algorithms are also compared with the accelerated AO-OMD and AO-FTRL with AdaGrad-style stepsizes (AccAOMD and AccAOFTRL) (Joulani et al., 2020).

We pick $\lambda _{1}=\lambda _{2}=\frac{1}{2}$, which is the largest value from the set $\{2^{-i}|i\in {\mathbb{N}}\}$ allowing FISTA to attain a negative loss $l_{x_{0}}$ for 10 randomly selected images. All algorithms start from $x_{1}=0$. Figure 4 plots the convergence behaviour of the five algorithms, averaged over the 1000 images. In the experiment for PP, our algorithms are obviously better than the AdaGrad-style algorithms. Although FISTA converges faster at the first 100 iterations, it does not make further progress afterwards due to the tiny stepsize found by the backtracking rule. In the experiment for PN, all algorithms behave similarly. It is worth pointing out that the backtracking rule of FISTA requires multiple function evaluations, which are expensive for explaining deep neural networks.

Next, we consider the black-box setting, in which the gradient is estimated through the two-points estimation

$$\begin{aligned} \frac{1}{b}\sum _{i=1}^{b}\frac{\delta }{\mu }(f(x+\mu v_{i})-f(x))v_{i}, \end{aligned}$$

where $\delta$, $\mu$ are constants and $v_{i}$ is a random vector. Following Chen et al. (2019), we set $\delta =d$ and sample $v_{i}$ independently from the uniform distribution over the unit sphere for AdaGrad-style algorithms. Since the convergence of our algorithms depends on the variance of the gradient estimation in $({\mathbb{R}}^{d}, \Vert \cdot \Vert _{\infty })$, we set $\delta =1$ and sample $\nu _{i,1},\ldots ,\nu _{i,d}$ independently from Rademacher distribution according to Corollary 3 in Duchi et al. (2015). To ensure a small bias of the gradient estimation, we set $\mu =\frac{1}{\sqrt{dT}}$, which is the recommended value for non-convex and constrained optimisation in Chen et al. (2019). The performances of the algorithms are examined in the high and low variance settings with $b=1$ and $b=\sqrt{T}$, respectively. Since the problem is stochastic, FISTA, which searches for the stepsize at each iteration, is not practical. Thus, we remove it from the comparison.

Figure 5 plots the convergence behaviour of the algorithms in the high variance setting. Our algorithms outperform the AdaGrad-style algorithms for generating both PP and PN. Furthermore, the FTRL based algorithms have higher convergence rates than the MD based ones at the first few iterations, leading to overall better performance. The experimental results of the low variance setting are plotted in Fig. 6. Though AccAOExpFTRL yields the smallest objective value at the beginning of the experiments, it gets stuck in the local minimum around 0 and is outperformed by AccAOExpMD and AccAOFTRL at the later iterations. Overall, the algorithms based on the exponentiated update have an advantage over the AdaGrad-style algorithms for both high and low variance settings.

7 Conclusion

This paper proposes and analyses a family of online optimisation algorithms based on an entropy-like regulariser combined with the ideas of optimism and adaptivity. The proposed algorithms have adaptive regret bounds depending logarithmically on the dimensionality of the problem, can handle popular composite objectives and can be easily converted into stochastic optimisation algorithms with optimal accelerated convergence rates for smooth function. As a future research direction, we plan to analyse the convergence of the proposed algorithms together with variance reduction techniques for non-convex stochastic optimisation and analyse their empirical performance for training deep neural networks.

Availability of data and materials

The source code generating synthetic data, creating neural networks and model training are available on GitHub https://github.com/mrdexteritas/exp_grad. The CIFAR-10 data are collected from https://www.cs.toronto.edu/~kriz/cifar.html.

Code availability

The implementation of the experiments and all algorithms involved in the experiments are available on GitHub https://github.com/mrdexteritas/exp_grad.

Change history

19 November 2022
Missing Open Access funding information has been added in the Funding Note.

References

Alacaoglu, A., Malitsky, Y., Mertikopoulos, P., & Cevher, V. (2020). A new regret analysis for Adam-type algorithms. In International conference on machine learning (pp. 202–210).
Allen-Zhu, Z., & Orecchia, L. (2017). Linear coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in theoretical computer science conference (ITCS 2017).
Anava, O., Hazan, E., Mannor, S., & Shamir, O. (2013). Online learning for time series prediction. In Conference on learning theory (pp. 172–184).
Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1), 121–164.
Article MathSciNet MATH Google Scholar
Barbu, V., & Precupanu, T. (2012). Convexity and optimization in banach spaces. Berlin: Springer.
Book MATH Google Scholar
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Article MathSciNet MATH Google Scholar
Bhatia, R. (2013). Matrix analysis (Vol. 169). Berlin: Springer.
MATH Google Scholar
Cancela, B., Bolón-Canedo, V., & Alonso-Betanzos, A. (2021). A delayed elastic-net approach for performing adversarial attacks. In 2020 25th International conference on pattern recognition (ICPR) (pp. 378–384). https://doi.org/10.1109/ICPR48806.2021.9413170.
Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy (SP) (pp. 39–57).
Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), 2050–2057.
Article MathSciNet MATH Google Scholar
Cesa-Bianchi, N., & Gentile, C. (2008). Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1), 386–390.
Article MathSciNet MATH Google Scholar
Chen, P.-Y., Sharma, Y., Zhang, H., Yi, J., & Hsieh, C.-J. (2018). Ead: Elastic-net attacks to deep neural networks via adversarial examples. In Thirty-second AAAI conference on artificial intelligence.
Chen, X., Liu, S., Xu, K., Li, X., Lin, X., Hong, M., & Cox, D. (2019). Zo-adamm: Zeroth-order adaptive momentum method for black-box optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 32). Berlin: Curran Associates, Inc.
Google Scholar
Cutkosky, A. (2019). Anytime online-to-batch, optimism and acceleration. In International conference on machine learning (pp. 1446–1454).
Cutkosky, A., & Boahen, K. (2017a). Online learning without prior information. In Conference on learning theory (pp. 643–677).
Cutkosky, A., & Boahen, K. A. (2016). Online convex optimization with unconstrained domains and losses. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29). Berlin: Curran Associates, Inc.
Google Scholar
Cutkosky, A., & Boahen, K. A. (2017b). Stochastic and adversarial online learning without hyperparameters. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 30). Berlin: Curran Associates, Inc.
Google Scholar
Dhurandhar, A., Chen, P.-Y., Luss, R., Tu, C.-C., Ting, P., Shanmugam, K., & Das, P. (2018). Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems. (Vol. 31). Berlin: Curran Associates Inc.
Google Scholar
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
MathSciNet MATH Google Scholar
Duchi, J. C., Jordan, M. I., Wainwright, M. J., & Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5), 2788–2806.
Article MathSciNet MATH Google Scholar
Duchi, J. C., Shalev-Shwartz, S., Singer, Y., & Tewari, A. (2010). Composite objective mirror descent. In A. T. Kalai & M. Mohri (Eds.), COLT 2010—The 23rd conference on learning theory, Haifa, Israel, June 27–29, 2010 (pp. 14–26). Omnipress.
Gentile, C. (2003). The robustness of the p-norm algorithms. Machine Learning, 53(3), 265–299.
Article MATH Google Scholar
Ghai, U., Hazan, E., & Singer, Y. (2020). Exponentiated gradient meets gradient descent. In Algorithmic learning theory (pp. 386–407).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770-778). https://doi.org/10.1109/CVPR.2016.90.
Joulani, P., György, A., & Szepesvári, C. (2017). A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, and variational bounds. Journal of Machine Learning Research, 1, 40.
MathSciNet MATH Google Scholar
Joulani, P., Raj, A., Gyorgy, A., & Szepesvári, C. (2020). A simpler approach to accelerated optimization: Iterative averaging meets optimism. In International conference on machine learning (pp. 4984–4993).
Kakade, S. M., Shalev-Shwartz, S., & Tewari, A. (2012). Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13(1), 1865–1890.
MathSciNet MATH Google Scholar
Kavis, A., Levy, K. Y., Bach, F., & Cevher, V. (2019). Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 6260–6269). Berlin: Curran Associates Inc.
Google Scholar
Kempka, M., Kotlowski, W., & Warmuth, M. K. (2019). Adaptive scale-invariant online algorithms for learning linear models. In International conference on machine learning (pp. 3321–3330).
Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–63.
Article MathSciNet MATH Google Scholar
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master’s thesis, University of Tront.
Lan, G. (2020). First-order and stochastic optimization methods for machine learning. Berlin: Springer.
Book MATH Google Scholar
Levy, Y. K., Yurtsever, A., & Cevher, V. (2018). Online adaptive methods, universality and acceleration. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 6500–6509). Berlin: Curran Associates Inc.
Google Scholar
Lewis, A. S. (1995). The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2(1), 173–183.
MathSciNet MATH Google Scholar
Li, X., & Orabona, F. (2019). On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd international conference on artificial intelligence and statistics (pp. 983–992).
Lu, C., Lin, Z., & Yan, S. (2014). Smoothed low rank and sparse matrix recovery by iteratively reweighted least squares minimization. IEEE Transactions on Image Processing, 24(2), 646–654.
MathSciNet MATH Google Scholar
McMahan, H. B., & Streeter, M. J. (2010). Adaptive bound optimization for online convex optimization. In A. T. Kalai & M. Mohri (Eds.), COLT 2010—The 23rd conference on learning theory, Haifa, Israel, June 27–29, 2010 (pp. 244–256). Omnipress.
Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course (Vol. 87). Berlin: Springer.
MATH Google Scholar
Orabona, F. (2013). Dimension-free exponentiated gradient. In NIPS (pp. 1806–1814).
Orabona, F., Crammer, K., & Cesa-Bianchi, N. (2015). A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3), 411–435.
Article MathSciNet MATH Google Scholar
Orabona, F., & Pál, D. (2018). Scale-free online learning. Theoretical Computer Science, 716, 50–69.
Article MathSciNet MATH Google Scholar
Ribeiro, M. T., Singh, S., Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144).
Song, L., Tekin, C., & Van Der Schaar, M. (2014). Online learning in large-scale contextual recommender systems. IEEE Transactions on Services Computing, 9(3), 433–445.
Article Google Scholar
Steinhardt, J., & Liang, P. (2014). Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning (pp. 1593–1601).
Warmuth, M. K. (2007). Winnowing subspaces. In Proceedings of the 24th international conference on machine learning (pp. 999–1006).
Xie, C., Bijral, A., & Ferres, J. L. (2018). Nonstop: A nonstationary online prediction method for time series. IEEE Signal Processing Letters, 25(10), 1545–1549.
Article Google Scholar

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL. The research leading to these results received funding from the German Federal Ministry for Economic Affairs and Climate Action under Grant Agreement No. 01MK20002C.

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587, Berlin, Germany
Weijia Shao & Sahin Albayrak
GT-ARC Gemeinnützige GmbH, Ernst-Reuter-Platz 7, 10587, Berlin, Germany
Fikret Sivrikaya & Sahin Albayrak

Authors

Weijia Shao
View author publications
You can also search for this author in PubMed Google Scholar
Fikret Sivrikaya
View author publications
You can also search for this author in PubMed Google Scholar
Sahin Albayrak
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization: WS; Methodology: WS; Formal analysis and investigation: WS; Software: WS; Validation: WS, FS; Visualization: WS; Writing - original draft preparation: WS; Writing - review and editing: WS, FS; Funding acquisition: SA; Resources: SA; Supervision: FS, SA.

Corresponding author

Correspondence to Weijia Shao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest or competing interests.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Krzysztof Dembczynski and Emilie Devijver.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Missing proofs of section 3.1

1.1 Appendix 1.1: Proof of Lemma 1

Proof

It is straightforward that $\phi$ is differentiable at $x\ne 0$ with

$$\begin{aligned}\phi '(x)=\alpha \ln \left( \frac{\vert x |}{\beta }+1\right) {\text{sgn}}(x).\end{aligned}$$

For any $h\in {\mathbb{R}}$, we have

$$\begin{aligned} \begin{aligned} \phi (0+h)-\phi (0)=\,&\alpha (\vert h |+\beta )\ln \left( \frac{\vert h |}{\beta }+1\right) -\alpha \vert h |\\ \le \,&\alpha (\vert h |+\beta )\frac{\vert h |}{\beta }-\alpha \vert h |\\ =\,&\frac{\alpha }{\beta }h^{2}, \end{aligned} \end{aligned}$$

where the first inequality uses the fact $\ln x\le x-1$. Further more, we have

$$\begin{aligned} \begin{aligned} \phi (0+h)-\phi (0)=\,&\alpha (\vert h |+\beta )\ln \left( \frac{\vert h |}{\beta }+1\right) -\alpha \vert h |\\ \ge \,&\alpha (\vert h |+\beta )\left( \frac{\vert h |}{\vert h |+\beta }\right) -\alpha \vert h |\\ \ge \,&0, \end{aligned} \end{aligned}$$

where the first inequality uses the fact $\ln x\ge 1-\frac{1}{x}$. Thus, we have

$$\begin{aligned} 0\le \frac{\phi (0+h)-\phi (0)}{h}\le \frac{\alpha }{\beta }h \end{aligned}$$

for $h>0$ and

$$\begin{aligned} \frac{\alpha }{\beta }h\le \frac{\phi (0+h)-\phi (0)}{h}\le 0. \end{aligned}$$

for $h<0$, from which it follows $\lim _{h\rightarrow 0} \frac{\phi (0+h)-\phi (0)}{h}=0$. Similarly, we have for $x\ne 0$

$$\begin{aligned} \phi ''(x)=\frac{\alpha }{\vert x |+\beta }. \end{aligned}$$

Let $h\ne 0$. Then we have

$$\begin{aligned} \frac{\phi '(0+h)-\phi '(0)}{h}=\frac{\alpha \ln \left( \frac{\vert h |}{\beta }+1\right) {\text{sgn}}(h)}{h}=\frac{\alpha \ln \left( \frac{\vert h |}{\beta }+1\right) }{\vert h |}. \end{aligned}$$

From the inequalities of the logarithm, it follows

$$\begin{aligned} \frac{\alpha }{\vert h |+\beta }\le \frac{\phi '(0+h)-\phi '(0)}{h}\le \frac{\alpha }{\beta }. \end{aligned}$$

Thus, we obtain $\phi ''(0)=\frac{\alpha }{\beta }$. By the definition of the convex conjugate we have

$$\begin{aligned} \phi ^{*}(\theta )=\max _{x\in {\mathbb{R}}} \theta x-\phi (x), \end{aligned}$$

(12)

which is differentiable. The maximiser y satisfies

$$\begin{aligned} \ln (\frac{\vert y |}{\beta }+1){\text{sgn}}(y)=\theta . \end{aligned}$$

Since $\ln \left( \frac{\vert y |}{\beta }+1\right) \ge 0$ holds, we have ${\text{sgn}}(y)={\text{sgn}}(\theta )$ and

$$\begin{aligned} \vert y |=\beta {\exp \left( \frac{\vert \theta |}{\alpha }\right) -\beta }. \end{aligned}$$

Thus, we obtain the maximiser $y=\phi ^{*\prime }(\theta )$ by setting

$$\begin{aligned} y={\text{sgn}}(\theta )\left( \beta {\exp \left( \frac{\vert \theta |}{\alpha }\right) -\beta }\right) . \end{aligned}$$

(13)

Combining (12) and (13), we obtain

$$\begin{aligned} \phi ^{*}(\theta )=\alpha \beta \exp \frac{\vert \theta |}{\alpha }-\beta \vert \theta |-\alpha \beta . \end{aligned}$$

To prove that $\phi ^{*}$ is twice differentiable, it suffices to show that $\phi ^{*\prime }$ is differentiable at 0. For any $h\ne 0$, we have

$$\begin{aligned} \frac{\phi ^{*\prime }(0+h)-\phi ^{*\prime }(0)}{h}=\frac{{\text{sgn}}(h)\left( \beta \exp \left( \frac{\vert h |}{\alpha }\right) -\beta \right) }{h}. \end{aligned}$$

Applying the inequalities of the logarithm, we obtain

$$\begin{aligned} \frac{\beta }{\alpha }\le \frac{{\text{sgn}}(h)\left( \beta \exp \left( \frac{\vert h |}{\alpha }\right) -\beta \right) }{h}\le \frac{\beta }{\alpha }\exp \left( \frac{\vert h |}{\alpha }\right) , \end{aligned}$$

from which it follows $\phi ^{*}$ is twice differentiable at 0 and

$$\begin{aligned} \phi ^{*\prime \prime }(0)=\frac{\beta }{\alpha }. \end{aligned}$$

$\square$

1.2 Appendix 1.2: Proof of Lemma 2

Proof

Let $x\in {\mathcal{K}}$ be arbitrary. We have

$$\begin{aligned} \begin{aligned} v^{\top } \triangledown ^{2} \phi _{t}(x) v=\,&\alpha _{t}\sum _{i=1}^{d} \frac{v_{i}^{2}}{\vert x_{i} |+\beta }\\ =\,&\alpha _{t}\sum _{i=1}^{d} \frac{v_{i}^{2}}{\vert x_{i} |+\beta }\sum _{i=1}^{d}(\vert x_{i} |+\beta )\frac{1}{\sum _{i=1}^{d}(\vert x_{i} |+\beta )}\\ \ge \,&\frac{\alpha _{t}}{\sum _{i=1}^{d}(\vert x_{i} |+\beta )}\left( \sum _{i=1}^{d} \vert v_{i} |\right) ^{2}\\ \ge \,&\frac{\alpha _{t}}{D+d\beta }\left( \sum _{i=1}^{d} \vert v_{i} |\right) ^{2}\\ = \,&\frac{\alpha _{t}}{D+d\beta }\Vert v\Vert _{1}^{2} \end{aligned} \end{aligned}$$

for all $v\in {\mathbb{R}}^{d}$, where the first inequality follows from the Cauchy-Schwarz inequality. This leads clearly to the strong convexity for a twice differentiable function. $\square$

1.3 Appendix 1.3: Proof of Theorem 1

Proposition 1

Let ${\mathcal{K}}\subseteq {\mathbb{X}}$ be a convex set. Assume that $r_{t}:{\mathcal{K}}\rightarrow {\mathbb{R}}_{\ge 0}$ is closed convex function defined on ${\mathcal{K}}$ and $\psi _{t}:{\mathcal{K}}\mapsto {\mathbb{R}}$ is $\eta _{t}$-strongly convex w.r.t. $\Vert \cdot \Vert$ over ${\mathcal{K}}$. Then the sequence $\{x_{t}\}$ generated by (3) with regulariser $\{\psi _{t}\}$ guarantees

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&r_{1}(x_{1})+{\mathcal{B}}_{\phi _{1}}(x,x_{1})+\sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t}}(x,x_{t}))+\sum _{t=1}^{T}\frac{\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{2\eta _{t+1}}.\\ \end{aligned} \end{aligned}$$

Proof

From the optimality condition, it follows that for all $x\in {\mathcal{K}}$

$$\begin{aligned} \begin{aligned}&\langle g_{t}-h_{t}+h_{t+1}+\triangledown r_{t+1}(x_{t+1}),x_{t+1}-x\rangle \\ \le \,&\langle \triangledown \phi _{t+1}(x_{t})-\triangledown \phi _{t+1}(x_{t+1}),x-x_{t+1}\rangle \\ =\,&{\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t+1}}(x,x_{t+1})-{\mathcal{B}}_{\phi _{t+1}}(x_{t+1},x_{t}). \end{aligned} \end{aligned}$$

Then, we have

$$\begin{aligned} \begin{aligned}&\langle g_{t},x_{t}-x\rangle +r_{t+1}(x_{t+1})-r_{t+1}(x)\\ \le \,&\langle g_{t},x_{t}-x_{t+1}\rangle +\langle g_{t}-h_{t}+h_{t+1}+\triangledown r_{t+1}(x_{t+1}),x_{t+1}-x\rangle \\&+\, \langle h_{t}-h_{t+1},x_{t+1}-x\rangle \\ \le \,&\langle g_{t}-h_{t},x_{t}-x_{t+1}\rangle +\langle h_{t},x_{t}-x\rangle -\langle h_{t+1},x_{t+1}-x\rangle \\&+\, {\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t+1}}(x,x_{t+1})-{\mathcal{B}}_{\phi _{t+1}}(x_{t+1},x_{t})\\ \end{aligned} \end{aligned}$$

Adding up from 1 to T , we obtain

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}(\langle g_{t},x_{t}-x\rangle +r_{t+1}(x_{t+1})-r_{t+1}(x))\\ \le \,&\sum _{t=1}^{T}\langle g_{t}-h_{t},x_{t}-x_{t+1}\rangle +\sum _{t=1}^{T}(\langle h_{t},x_{t}-x\rangle -\langle h_{t+1},x_{t+1}-x\rangle )\\&+\, \sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t+1}}(x,x_{t+1})-{\mathcal{B}}_{\phi _{t+1}}(x_{t+1},x_{t}))\\ \le \,&\sum _{t=1}^{T}(\langle g_{t}-h_{t},x_{t}-x_{t+1}\rangle -{\mathcal{B}}_{\phi _{t+1}}(x_{t+1},x_{t}))\\&+\, \langle h_{1},x_{1}-x\rangle -\langle h_{T+1},x_{T+1}-x\rangle \\&+\, {\mathcal{B}}_{\phi _{1}}(x,x_{1})+\sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t}}(x,x_{t}))\\ \end{aligned} \end{aligned}$$

$h_{1}$, $h_{T+1}$ and $x_{T+1}$, which are artifacts of the analysis, can be set to 0. Then, we simply obtain

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}(\langle g_{t},x_{t}-x\rangle +r_{t}(x_{t})-r_{t}(x))\\ =\,&\sum _{t=1}^{T}(\langle g_{t},x_{t}-x\rangle +r_{t+1}(x_{t+1})-r_{t+1}(x))\\&+\,r_{1}(x_{1})-r_{1}(x)-r_{T+1}(x_{T+1})+r_{T+1}(x)\\ \le \,&\sum _{t=1}^{T}(\langle g_{t},x_{t}-x\rangle +r_{t+1}(x_{t+1})-r_{t+1}(x))+r_{1}(x_{1})-r_{1}(x)+r_{T+1}(x)\\ \le \,&r_{1}(x_{1})-r_{1}(x)+r_{T+1}(x)+\sum _{t=1}^{T}(\langle g_{t}-h_{t},x_{t}-x_{t+1}\rangle -{\mathcal{B}}_{\phi _{t+1}}(x_{t+1},x_{t}))\\&+\,{\mathcal{B}}_{\phi _{1}}(x,x_{1})+\sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t}}(x,x_{t}))\\ \end{aligned} \end{aligned}$$

Since $r_{T+1}$ is not involved in the regret, we assume without loss of generality $r_{1}=r_{T+1}$. From the $\eta _{t}$-strong convexity of $\phi _{t}$ we have

$$\begin{aligned} \begin{aligned}&\langle g_{t}-h_{t},x_{t}-x_{t+1}\rangle -{\mathcal{B}}_{\phi _{t+1}}(x_{t+1},x_{t})\\ \le \,&\langle g_{t}-h_{t},x_{t}-x_{t+1}\rangle -\frac{\eta _{t+1}}{2}\Vert x_{t}-x_{t+1}\Vert ^{2}\\ \le \,&\Vert g_{t}-h_{t}\Vert _{*} \Vert x_{t}-x_{t+1}\Vert -\frac{\eta _{t+1}}{2}\Vert x_{t}-x_{t+1}\Vert ^{2}\\ \le \,&\frac{\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{2\eta _{t+1}}+\frac{\eta _{t+1}}{2}\Vert x_{t}-x_{t+1}\Vert ^{2}-\frac{\eta _{t+1}}{2}\Vert x_{t}-x_{t+1}\Vert ^{2}\\ =\,&\frac{\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{2\eta _{t+1}}, \end{aligned} \end{aligned}$$

where the second inequality uses the definition of dual norm, the third inequality follows from the fact $ab\le \frac{a^{2}}{2}+\frac{b^{2}}{2}$. The claimed the result follows. $\square$

Proof of Theorem 1

Proposition 1 can be directly applied, and we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&\sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t}}(x,x_{t}))+\sum _{t=1}^{T}\frac{D+d\beta }{2\alpha _{t}}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}\\&+\, {\mathcal{B}}_{\phi _{1}}(x,x_{1})+r(x_{1}).\\ \end{aligned} \end{aligned}$$

(14)

Using Lemma 8, we bound the first term of (14)

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t}}(x,x_{t}))\\ \le \,&4D(\ln (D+1)+\ln d)\sum _{t=2}^{T}(\alpha _{t+1}-\alpha _{t})\\ \le \,&4D(\ln (D+1)+\ln d)\alpha _{T+1}\\ \le \,&4D(\ln (D+1)+\ln d)\eta \sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert ^{2}_{\infty }}. \end{aligned} \end{aligned}$$

Using Lemma 6, the second term of (14) can be bounded as

$$\begin{aligned} \begin{aligned} \sum _{t=1}^{T}\frac{(D+1)\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{4\alpha _{t}}\le \,&\frac{D+1}{2\eta }\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}\\ \end{aligned} \end{aligned}$$

The third term of (14) is simply 0 since we set $\alpha _{1}=0$. Setting $\eta =\sqrt{\frac{1}{\ln (D+1)+\ln d}}$ and combining the inequalities above, we obtain the claimed result. $\square$

1.4 Appendix 1.4: Proof of Theorem 2

Proposition 2

Let ${\mathcal{K}}\subseteq {\mathbb{X}}$ be a compact convex set such that $\Vert x\Vert \le D$ holds for all $x\in {\mathcal{K}}$, $r_{t}:{\mathcal{K}}\rightarrow {\mathbb{R}}_{\ge 0}$ and $\phi _{t}:{\mathcal{K}}\mapsto {\mathbb{R}}$ closed convex function defined on ${\mathcal{K}}$. Assume $\phi _{t}$ is $\eta _{t}$-strongly convex w.r.t. $\Vert \cdot \Vert$ over ${\mathcal{K}}$ and $\phi _{t}\le \phi _{t+1}$ for all $t=1,\ldots ,T$. Then the sequence $\{x_{t}\}$ generated by (4) with guarantees

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \phi _{T+1}(x)+\sum _{t=1}^{T}\frac{2D\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{\sqrt{16D^{2}\eta _{t} ^{2}+\Vert g_{t}-h_{t}\Vert _{*} ^{2}}}. \end{aligned} \end{aligned}$$

(15)

Proof of Proposition 2

First, define $\psi _{t}=r_{1:t}+\phi _{t}$. Then, we have

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}\psi _{t+1}^{*}(\theta _{t+1}-h_{t+1})-\psi ^{*}_{t}(\theta _{t}-h_{t})\\ =\,&\psi ^{*}_{T+1}(\theta _{T+1}-h_{T+1})-\psi ^{*}_{1}(\theta _{1}-h_{1})\\ \ge \,&\langle \theta _{T+1}-h_{T+1},x\rangle -\psi _{T+1}(x)-\psi ^{*}_{1}(\theta _{1}-h_{1})\\ \ge \,&\left\langle {-\sum _{t=1}^{T}g_{t}-h_{T+1}}{x}\right\rangle -\psi _{T+1}(x)-\psi ^{*}_{1}(\theta _{1}-h_{1})\\ \end{aligned} \end{aligned}$$

Setting the artifacts $h_{T+1}$ to 0, rearranging and adding $\sum _{t=1}^{T}\langle g_{t},w_{t}\rangle$ to both sides, we obtain

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}\langle g_{t},x_{t}-x\rangle \\ \le \,&\psi _{T+1}(x)+\psi ^{*}_{1}(\theta _{1}-h_{1})+\sum _{t=1}^{T}(\psi ^{*}_{t+1}(\theta _{t+1}-h_{t+1})-\psi ^{*}_{t}(\theta _{t}-h_{t})+\langle g_{t},x_{t}\rangle )\\ =\,&\psi _{T+1}(x)-\langle h_{1},x_{1}\rangle -r_{1}(x_{1})\\&+\,\sum _{t=1}^{T}(\psi ^{*}_{t+1}(\theta _{t+1}-h_{t+1})-\psi ^{*}_{t}(\theta _{t+1}))\\&+\,\sum _{t=1}^{T}(\psi ^{*}_{t}(\theta _{t+1})-\psi _{t}^{*}(\theta _{t}-h_{t})+\langle \theta _{t}-\theta _{t+1},\triangledown \psi _{t}^{*}(\theta _{t}-h_{t})\rangle )\\ \le \,&\psi _{T+1}(x)-\langle h_{1},x_{1}\rangle -r_{1}(x_{1})\\&+\,\sum _{t=1}^{T}(\psi ^{*}_{t+1}(\theta _{t+1}-h_{t+1})-\psi ^{*}_{t}(\theta _{t+1}))\\&+\,\sum _{t=1}^{T}(\psi ^{*}_{t}(\theta _{t+1})-\psi _{t}^{*}(\theta _{t}-h_{t})+\langle \theta _{t}-\theta _{t+1},\triangledown \psi _{t}^{*}(\theta _{t}-h_{t})\rangle ),\\ \end{aligned} \end{aligned}$$

From the definition of $\psi _{t}$, it follows

$$\begin{aligned} \psi _{T+1}(x)=\phi _{T+1}(x)+r_{1:T+1}(x)=\phi _{T+1}(x)+r_{1:T}(x), \end{aligned}$$

where we assumed $r_{T+1}\equiv 0$, since it is not involved in the regret. Furthermore, we have for $t\ge 1$

$$\begin{aligned} \begin{aligned}&\psi ^{*}_{t+1}(\theta _{t+1}-h_{t+1})-\psi ^{*}_{t}(\theta _{t+1})\\ \le \,&\langle \theta _{t+1}-h_{t+1},x_{t+1}\rangle -\psi _{t+1}(x_{t+1})-\langle \theta _{t+1},x_{t+1}\rangle +\psi _{t}(x_{t+1})\\ =\,&-\langle h_{t+1},x_{t+1}\rangle -\psi _{t+1}(x_{t+1})+\psi _{t}(x_{t+1})\\ =\,&-\langle h_{t+1},x_{t+1}\rangle -r_{1:t+1}(x_{t+1})+r_{1:t}(x_{t+1})-\phi _{t+1}(x_{t+1})+\phi _{t}(x_{t+1})\\ \le \,&-\langle h_{t+1},x_{t+1}\rangle -r_{t+1}(x_{t+1}),\\ \end{aligned} \end{aligned}$$

where the first inequality uses the definition of convex conjugate and the second inequality follows from the fact $\phi _{t+1}\le \phi _{t}$. Adding up from 1 to T, we obtain

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}(\psi ^{*}_{t+1}(\theta _{t+1}-h_{t+1})-\psi ^{*}_{t}(\theta _{t+1}))\\ \le \,&-\sum _{t=1}^{T}r_{t+1}(x_{t+1})-\sum _{t=1}^{T}\langle h_{t+1},x_{t+1}\rangle \\ =\,&r_{1}(x_{1})+\langle h_{1},x_{1}\rangle -r_{T+1}(x_{t+1})-\langle h_{T+1},x_{T+1}\rangle -\sum _{t=1}^{T}r_{t}(x_{t})-\sum _{t=1}^{T}\langle h_{t},x_{t}\rangle \\ =\,&r_{1}(x_{1})+\langle h_{1},x_{1}\rangle -\sum _{t=1}^{T}r_{t}(x_{t})-\sum _{t=1}^{T}\langle h_{t},x_{t}\rangle ,\\ \end{aligned} \end{aligned}$$

where we use $r_{T+1}\equiv 0$ and $h_{T+1}=0$. Combining the inequality above and rearranging, we have

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}(\langle g_{t},x_{t}-x\rangle +r_{t}(x_{t})-r_{t}(x))\\ \le \,&\phi _{T+1}(x)+\sum _{t=1}^{T}(\psi ^{*}_{t}(\theta _{t+1})-\psi _{t}^{*}(\theta _{t}-h_{t})+\langle \theta _{t}-h_{t}-\theta _{t+1},\triangledown \psi _{t}^{*}(\theta _{t}-h_{t})\rangle )\\ \le \,&\phi _{T+1}(x)+\sum _{t=1}^{T}{\mathcal{B}}_{\psi _{t}^{*}}(\theta _{t+1},\theta _{t}-h_{t}).\\ \end{aligned} \end{aligned}$$

(16)

Next, by the definition of the Bregman divergence, we have

$$\begin{aligned} \begin{aligned}&{\mathcal{B}}_{\psi _{t}^{*}}(\theta _{t+1},\theta _{t}-h_{t})\\ \le \,&\langle \theta _{t+1},\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle -\psi _{t}(\triangledown \psi _{t}^{*}(\theta _{t+1}))-\langle \theta _{t}-h_{t},x_{t}\rangle +\psi _{t}(x_{t})+\langle g_{t}-h_{t},x_{t}\rangle \\ =\,&\langle \theta _{t}-h_{t},\triangledown \psi _{t}^{*}(\theta _{t+1})-x_{t}\rangle -\psi _{t}(\triangledown \psi _{t}^{*}(\theta _{t+1}))+\psi _{t}(x_{t})+\langle g_{t}-h_{t},x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle \\ =\,&\langle g_{t}-h_{t},x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle -{\mathcal{B}}_{\psi _{t}}(\triangledown \psi _{t}^{*}(\theta _{t+1}),x_{t}).\\ \end{aligned} \end{aligned}$$

Since $\phi _{t}$ is $\eta _{t}$ strongly convex, we have

$$\begin{aligned} \begin{aligned}&\langle g_{t}-h_{t},x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle -{\mathcal{B}}_{\psi _{t}}(\triangledown \psi _{t}^{*}(\theta _{t+1}),x_{t})\\ \le \,&\frac{1}{2\eta _{t}}\Vert g_{t}-h_{t}\Vert _{*} ^{2}+\frac{\eta _{t}}{2}\Vert x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\Vert ^{2}-{\mathcal{B}}_{\psi _{t}}(\triangledown \psi _{t}^{*}(\theta _{t+1}),x_{t})\\ \le \,&\frac{1}{2\eta _{t}}\Vert g_{t}-h_{t}\Vert _{*} ^{2}\\ \end{aligned} \end{aligned}$$

(17)

We also have

$$\begin{aligned} \begin{aligned}&\langle g_{t}-h_{t},x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle -{\mathcal{B}}_{\psi _{t}}(\triangledown \psi _{t}^{*}(\theta _{t+1}),x_{t})\\ \le \,&\langle g_{t}-h_{t},x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle \\ \le \,&2D\Vert g_{t}-h_{t}\Vert _{*} .\\ \end{aligned} \end{aligned}$$

(18)

Putting (17) and (18) together, we have

$$\begin{aligned} \begin{aligned}&\langle g_{t}-h_{t},x_{t}-\triangledown \psi _{t}^{*}(\theta _{t+1})\rangle -{\mathcal{B}}_{\psi _{t}}(\triangledown \psi _{t}^{*}(\theta _{t+1}),x_{t})\\ \le \,&\min \left\{ \frac{1}{2\eta _{t}}\Vert g_{t}-h_{t}\Vert _{*} ^{2},2D\Vert g_{t}-h_{t}\Vert _{*} \right\} \\ \le \,&\frac{1}{\frac{2\eta _{t}}{\Vert g_{t}-h_{t}\Vert _{*} ^{2}}+\frac{1}{2D\Vert g_{t}-h_{t}\Vert _{*} }}\\ \le \,&\frac{2D\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{4D\eta _{t}+\Vert g_{t}-h_{t}\Vert _{*} }\\ \le \,&\frac{2D\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{\sqrt{16D^{2}\eta _{t} ^{2}+\Vert g_{t}-h_{t}\Vert _{*} ^{2}}}\\ \end{aligned} \end{aligned}$$

Combining the inequalities above, we obtain

$$\begin{aligned} {\mathcal{R}}_{1:T}\le \phi _{T+1}(x)+\sum _{t=1}^{T}\frac{2D\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{\sqrt{16D^{2}\eta _{t} ^{2}+\Vert g_{t}-h_{t}\Vert _{*} ^{2}}} \end{aligned}$$

$\square$

Proof of Theorem 2

We take the Bregman divergence ${\mathcal{B}}_{\phi _{t}}(x,x_{1})$ as the regulariser at iteration t. Since ${\mathcal{B}}_{\phi _{t}}(x,x_{1})$ is non-negative, increasing with t and $\frac{2\alpha _{t}}{D+\beta d}$ strongly-convex w.r.t. $\Vert \cdot \Vert _{1}$, Proposition 2 can be directly applied, and we get

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&{\mathcal{B}}_{\phi _{T+1}}(x,x_{1})+\sum _{t=1}^{T}\frac{2D\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}\alpha _{t}^{2}}{(D+\beta d)^{2}} +\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ =\,&{\mathcal{B}}_{\phi _{T+1}}(x,x_{1})+\frac{2D}{\eta }\sum _{t=1}^{T}\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}}{(D+\beta d)^{2}}\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +\frac{1}{\eta ^{2}}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ \end{aligned} \end{aligned}$$

Setting $\beta =\frac{1}{d}$ and $\eta =\frac{1}{\sqrt{\ln (D+1)+\ln d}}$, we have

$$\begin{aligned} \begin{aligned}&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}}{(D+\beta d)^{2}}\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +\frac{1}{\eta ^{2}}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ =\,&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}}{(D+1)^{2}}\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +(\ln (D+1)+\ln d)\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ \le&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ =\,&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\sum _{s=1}^{t}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty }}},\\ \end{aligned} \end{aligned}$$

where the inequality uses the assumptions $D\ge 1$ and $d>e$. Adding up from 1 to T, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&{\mathcal{B}}_{\phi _{T+1}}(x,x_{1})+2D\sqrt{\ln (D+1)+\ln d}\sum _{t=1}^{T}\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\sum _{s=1}^{t}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty }}}\\ \le \,&{\mathcal{B}}_{\phi _{T+1}}(x,x_{1})+4D\sqrt{\ln (D+1)+\ln d}\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}} \end{aligned} \end{aligned}$$

The first term can be bounded by Lemma 8

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\phi _{T+1}}(x,x_{1})\le \,&4D\sqrt{\ln (D+1)+\ln d}\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}\\ \end{aligned} \end{aligned}$$

Combining the inequality above, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&c(D,d)\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}},\\ \end{aligned} \end{aligned}$$

with $c(D,d)\in {\mathcal{O}}(D\sqrt{\ln (D+1)+\ln d})$, which is the claimed result. $\square$

Appendix 2: Missing Proofs of section 3.2

1.1 Appendix 2.1: Proof of Theorem 3

The Proof of Theorem 3 is based on the idea of Ghai et al. (2020). We first revise some technical lemmata.

Proof of Lemma 3

Define $\tilde{F}:{\mathbb{S}}^{d}\rightarrow {\mathbb{S}}^{d}, X\mapsto U{\text{diag}}(f(\lambda _{1}(X)),\ldots ,f(\lambda _{d}(X)))U^{\top }$. Apparently, we have $F(X)={\text{Tr}}\tilde{F}(X)$. From the Theorem V.3.3 in Bhatia (2013), it follows that $\tilde{F}$ is differentiable and

$$\begin{aligned} D\tilde{F}(X)(H)=U({\Gamma }(f,X)\odot U^{\top } HU)U^{\top }. \end{aligned}$$

Using the linearity of the trace and the chain rule, F is differentiable and the directional derivative at X in H is given by

$$\begin{aligned} \begin{aligned} D_{H}F(X)=\,&D{\text{Tr}}(\tilde{F}(X))\circ D\tilde{F}(X)(H)\\ =\,&{\text{Tr}}(D\tilde{F}(X)(H))\\ =\,&{\text{Tr}}(U(\tilde{\Gamma }(f,X)\odot U^{\top } HU)U^{\top })\\ =\,&{\text{Tr}}(\tilde{\Gamma }(f,X)\odot U^{\top } HU)\\ =\,&\sum _{i=1}^{d}f'(\lambda _{i}(X))\tilde{h}_{ii}\\ =\,&{\text{Tr}}( U{\text{diag}}(f'(\lambda _{1}(X)),\ldots , f'(\lambda _{d}(X)))U^{\top } H) \end{aligned} \end{aligned}$$

where $\tilde{h}_{ii}$ is the i-th element in the diagonal of the matrix $U^{\top } HU$. Next, define

$$\begin{aligned} \bar{F}:{\mathbb{S}}^{d}\rightarrow {\mathbb{S}}^{d}, X\mapsto U{\text{diag}}(f'(\lambda _{1}(X)),\ldots , f'(\lambda _{d}(X)))U^{\top }. \end{aligned}$$

And we have

$$\begin{aligned} DF(X)=H\mapsto {\text{Tr}}(\bar{F}(X)H) \end{aligned}$$

Applying Theorem V.3.3 in Bhatia (2013) again, we obtain the differentiability of $\bar{F}$ and

$$\begin{aligned} \begin{aligned} D\bar{F}(X)(G)=U(\Gamma (f',X)\odot U^{\top } GU)U^{\top }. \end{aligned} \end{aligned}$$

Note that $X\mapsto {\text{Tr}}(X(\cdot ))$ is a linear map between finite dimensional spaces. Thus F is twice differentiable. From the linearity of the trace operator and matrix multiplication, it follows that $D_{H}F(X)$ is differentiable. Applying the chain rule, we obtain

$$\begin{aligned} \begin{aligned} D^{2}F(X)(G,H)=\,&D_{G}(D_{H}F)(X)\\ =\,&D(D_{H} F)(X)(G)\\ =\,&{\text{Tr}}((D\bar{F}(X)(G))H)\\ =\,&{\text{Tr}}(U(\Gamma (f',X)\odot U^{\top } GU)U^{\top } H)\\ =\,&{\text{Tr}}((\Gamma (f',X)\odot U^{\top } GU) U^{\top } HU)\\ =\,&\sum _{i.j}\gamma (f',X)_{ij}\tilde{g}_{ij}\tilde{h}_{ij},\\ \end{aligned} \end{aligned}$$

which is the claimed result. $\square$

Proof of Lemma 4

Since $D^{2}\Phi ^{*}(\theta )\in {\mathcal{L}}({\mathbb{X}}_{*},{\mathcal{L}}({\mathbb{X}}_{*},{\mathbb{R}}))$ is positive definite and ${\mathbb{X}}$ is finite dimensional, the map

$$\begin{aligned} f_{\theta }:{\mathbb{X}}_{*}\rightarrow {\mathbb{X}}, v\mapsto D^{2}\Phi ^{*}(\theta )(v,\cdot ) \end{aligned}$$

is invertible. Furthermore, defining $\psi _{\theta }:{\mathbb{X}}_{*}\rightarrow {\mathbb{R}}, v\mapsto \frac{1}{2} D^{2}\Phi ^{*}(\theta )(v,v)$, we have

$$\begin{aligned} \begin{aligned} D\psi _{\theta }(v)=\,&\frac{1}{2}D^{2}\Phi ^{*}(\theta )(v,\cdot )+\frac{1}{2}D^{2}\Phi ^{*}(\theta )(\cdot ,v)\\ =\,&f_{\theta }(v). \end{aligned} \end{aligned}$$

Thus, we obtain the convex conjugate $\psi _{\theta }^{*}$

$$\begin{aligned} \begin{aligned} \psi _{\theta }^{*}(x)=\,&\sup _{v\in {\mathbb{X}}_{*}}\langle v,x\rangle -\psi _{\theta }(v)\\ =\,&\langle f^{-1}_{\theta }(x),x\rangle -\psi _{\theta }(f^{-1}_{\theta }(x))\\ =\,&\langle f^{-1}_{\theta }(x),x\rangle -\frac{1}{2}\langle f^{-1}_{\theta }(x),D^{2}\Phi ^{*}(\theta )(f^{-1}_{\theta }(x),\cdot )\rangle \\ =\,&\langle f^{-1}_{\theta }(x),x\rangle -\frac{1}{2}\langle f^{-1}_{\theta }(x),f_{\theta }(f^{-1}_{\theta }(x))\rangle \\ =\,&\frac{1}{2}\langle f^{-1}_{\theta }(x),x\rangle \\ \end{aligned} \end{aligned}$$

by setting $x=D\psi _{\theta }(v)$. Denote by $I:{\mathbb{X}}\rightarrow {\mathbb{X}}, x\mapsto x$ the identity function. From $D\Phi ^{*}=D\Phi ^{-1}$, it follows

$$\begin{aligned} \begin{aligned} I(x)=\,&D I(v)(x)\\ =\,&D(D\Phi ^{*}\circ D\Phi )(v)(x)\\ =\,&D^{2}\Phi ^{*}(D\Phi (v))\circ D^{2}\Phi (v)(x),\\ =\,&D^{2}\Phi ^{*}(\theta )\circ D^{2}\Phi (D\Phi ^{*}(\theta ))(x)\\ \end{aligned} \end{aligned}$$

for $\theta =D\Phi (v)$ and all $x\in {\mathbb{X}}$. Thus, we have $f_{\theta }^{-1}= D^{2}\Phi (D\Phi ^{*}(\theta ))$ and

$$\begin{aligned} \begin{aligned} \psi _{\theta }^{*}(x)=\,&\frac{1}{2}\langle f^{-1}_{\theta }(x),x\rangle \\ =\,&\frac{1}{2}D^{2}\Phi (D\Phi ^{*}(\theta ))(x,x). \end{aligned} \end{aligned}$$

Finally, since $\psi _{\theta }(v)\le \frac{1}{2}\Vert v\Vert _{*} ^{2}$ holds for all $v\in {\mathbb{X}}_{*}$, we can reverse the order by applying Proposition 2.19 in Barbu and Precupanu (2012) and obtain for all $x\in {\mathbb{X}}$

$$\begin{aligned} \frac{1}{2}D^{2}\Phi (D\Phi ^{*}(\theta ))(x,x)=\psi _{\theta }^{*}(x)\ge \frac{1}{2}\Vert x\Vert ^{2}, \end{aligned}$$

which is the claimed result. $\square$

Finally, we can prove Theorem 3.

Proof of Theorem 3

We start the proof by introducing the required definitions. Define the operator

$$\begin{aligned} S:{\mathbb{R}}^{m,n}\rightarrow {\mathbb{S}}^{m+n}, X\mapsto \begin{bmatrix} 0 &{}\quad X\\ X^{\top } &{}\quad 0 \end{bmatrix} \end{aligned}$$

The set ${\mathcal{X}}=\{S(X)|X\in {\mathbb{R}}^{m,n}\}$ is a finite dimensional linear subspace of the space of symmetric matrices ${\mathbb{S}}^{m+n}$, and thus $({\mathcal{X}},\Vert \cdot \Vert _{1})$ is a finite dimensional Banach space. Its dual space ${\mathcal{X}}_{*}$ determined by the Frobenius inner product can be represented by ${\mathcal{X}}$ itself. Denote by ${\mathbb{B}}(D)=\{X\in {\mathbb{R}}^{m,n}|\Vert X\Vert _{1}\le D\}$ the nuclear ball with radius D. Then the set ${\mathcal{K}}=\{S(X)|X\in {\mathbb{B}}(D)\}$ is a nuclear ball in ${\mathcal{X}}$ with radius 2D, since $\Vert S(X)\Vert _{1}=2\Vert X\Vert _{1}$ for all $X\in {\mathbb{R}}^{m,n}$.

Let $S(X)\in {\mathcal{K}}$ be arbitrary. Denote by $F_{t}=\Phi _{t}\vert _{{\mathcal{X}}}$ the restriction of $\Phi _{t}$ to ${\mathcal{X}}$. Next, we show the strong convexity of $F_{t}$ over ${\mathcal{K}}$. From the conjugacy formula of Theorem 2.4 in Lewis (1995) and Lemma 1, it follows

$$\begin{aligned} \begin{aligned} F_{t}^{*}(S(X))=\,&\phi _{t}^{*}\circ \sigma (S(X))=\phi _{t}^{*}\circ \lambda (S(X)),\\ \end{aligned} \end{aligned}$$

where the second equality follows from the fact that $\Phi ^{*}_{t}$ is absolutely symmetric. By Lemmas 1 and 3, $F_{t}^{*}$ is twice differentiable. Let $X\in {\mathcal{K}}$ be arbitrary and $\Theta =DF_{t}(X)\in {\mathcal{X}}_{*}$. For simplicity, we define

$$\begin{aligned} f_{t}:{\mathbb{R}}\rightarrow {\mathbb{R}}, x\mapsto \alpha _{t} \beta \exp \frac{\vert x |}{\alpha _{t}}-\beta \vert x |-\alpha _{t}\beta . \end{aligned}$$

Then, for all $H\in {\mathcal{X}}$,

$$\begin{aligned} D^{2}F_{t}^{*}(\Theta )(H,H)=\sum _{ij}\gamma (f_{t}',\Theta )_{ij}\tilde{h}_{ij}^{2}, \end{aligned}$$

where $\Gamma (f_{t}',\Theta )=[\gamma (f_{t}',\Theta )_{ij}]$ is the matrix of the second divided difference with

$$\begin{aligned} \gamma (f_{t}',\Theta )_{ij}={\left\{ \begin{array}{ll} \frac{f_{t}'(\lambda _{i}(\Theta ))-f_{t}'(\lambda _{j}(\Theta ))}{\lambda _{i}(\Theta )-\lambda _{j}(\Theta )},&{}\quad {\text{if }}\lambda _{i}(\Theta )\ne \lambda _{j}(\Theta )\\ f_{t}^{\prime \prime }(\lambda _{i}(\Theta )), &{}\quad {\text{otherwise}}. \end{array}\right. } \end{aligned}$$

$D^{2}F_{t}^{*}(\Theta )$ is clearly positive definite over ${\mathbb{S}}^{m+n}$, since $\gamma (f_{t}',\Theta )_{ij}>0$ for all i and j. Furthermore, from the mean value theorem and the convexity of $f_{t}^{\prime \prime }$, there is a $c_{ij}\in (0,1)$ such that

$$\begin{aligned} \begin{aligned} \frac{f_{t}^{\prime }(\lambda _{i}(\Theta ))-f_{t}^{\prime }(\lambda _{j}(\Theta ))}{\lambda _{i}(\Theta )-\lambda _{j}(\Theta )}\le \,&f_{t}^{\prime \prime }(c_{ij}\lambda _{i}(\Theta )+(1-c_{ij})\lambda _{j}(\Theta ))\\ \le \,&c_{ij}f_{t}^{\prime \prime }(\lambda _{i}(\Theta ))+(1-c_{ij})f_{t}^{\prime \prime }(\lambda _{j}(\Theta ))\\ \le \,&f_{t}^{\prime \prime }(\lambda _{i}(\Theta ))+f_{t}^{\prime \prime }(\lambda _{j}(\Theta ))\\ \end{aligned} \end{aligned}$$

holds for all $\lambda _{i}(\Theta )\ne \lambda _{j}(\Theta )$. Thus, we obtain

$$\begin{aligned} \begin{aligned} D^{2}F_{t}^{*}(\Theta )(H,H)=\,&\sum _{ij}\gamma (f_{t},\Theta )_{ij}\tilde{h}_{ij}^{2}\\ \le \,&\sum _{ij}(f_{t}^{\prime \prime }(\lambda _{i}(\Theta ))+f_{t}^{\prime \prime }(\lambda _{j}(\Theta )))\tilde{h}_{ij}^{2}\\ = \,&2\sum _{i=1}^{m+n}f_{t}^{\prime \prime }(\lambda _{i}(\Theta ))\sum _{j=1}^{m+m}\tilde{h}_{ij}^{2}\\ =\,&2{\text{Tr}}(UHU^{\top }{\text{diag}}(f_{t}^{\prime \prime }(\lambda _{1}(\Theta )),\ldots ,f_{t}^{\prime \prime }(\lambda _{m+n}(\Theta ))UHU^{\top })\\ =\,&2{\text{Tr}}(H^{2}{\text{diag}}(f_{t}^{\prime \prime }(\lambda _{1}(\Theta )),\ldots ,f_{t}^{\prime \prime }(\lambda _{m+n}(\Theta )))\\ \le&\, 2\sum _{i=1}^{2\min \{m,n\}}\sigma _{i}(H^{2})\sigma _{i}({\text{diag}}(f_{t}^{\prime \prime }(\lambda _{1}(\Theta )),\ldots ,f_{t}^{\prime \prime }(\lambda _{m+n}(\Theta ))) \end{aligned} \end{aligned}$$

(19)

where the last line uses von Neumann’s trace inequality and the fact that the rank of $H\in {\mathcal{X}}$ and $\Theta$ is at most $2\min \{m,n\}$. Since $H^{2}$ is positive semi-definite, $\sigma _{i}(H^{2})=\sigma _{i}(H)^{2}$ holds for all i. Furthermore, $f_{t}''(x)\ge 0$ holds for all $x\in {\mathbb{R}}$. Thus, the last line of (19) can be rewritten into

$$\begin{aligned} \begin{aligned} D^{2}F_{t}^{*}(\Theta )(H,H)\le \,&2\sum _{i=1}^{2\min \{m,n\}}\sigma _{i}(H)^{2}\sigma _{i}({\text{diag}}(f_{t}^{\prime \prime }(\lambda _{1}(\Theta )),\ldots ,f_{t}^{\prime \prime }(\lambda _{m+n}(\Theta )))\\ \le \,&2\Vert H\Vert _{\infty }^{2}\sum _{i=1}^{2\min \{m,n\}}\sigma _{i}({\text{diag}}(f_{t}^{\prime \prime }(\lambda _{1}(\Theta )),\ldots ,f_{t}^{\prime \prime }(\lambda _{m+n}(\Theta )))\\ \le \,&2\Vert H\Vert _{\infty }^{2}\sum _{i=1}^{2\min \{m,n\}}f_{t}^{\prime \prime }(\lambda _{i}(\Theta )). \end{aligned} \end{aligned}$$

(20)

Recall $\Theta =DF_{t}(S(X))$ for $S(X)\in {\mathcal{K}}$. Together with Lemma 1, we obtain

$$\begin{aligned} \begin{aligned} f_{t}^{\prime \prime }(\lambda _{i}(\Theta ))=\,&\frac{\beta }{\alpha _{t}} \exp \frac{\vert \lambda _{i}(\Theta ) |}{\alpha _{t}}\\ =\,&\frac{\beta }{\alpha _{t}} \exp \frac{\vert \alpha _{t} \ln \big (\frac{\vert \lambda _{i}(S(X)\big ) |}{\beta }+1) |}{\alpha _{t}}\\ =\,&\frac{\vert \lambda _{i}(S(X)) |+\beta }{\alpha _{t}}.\\ \end{aligned} \end{aligned}$$

By the construction of ${\mathcal{K}}$, it is clear that $\sum _{i=1}^{2\min \{m,n\}}\vert \lambda _{i}(S(X)) |\le 2D$. Thus, (20) can be simply further upper bounded by

$$\begin{aligned} \begin{aligned} D^{2}F_{t}^{*}(\Theta )(H,H)\le \,&2\Vert H\Vert _{\infty }^{2}\sum _{i=1}^{2\min \{m,n\}}\frac{\vert \lambda _{i}(S(X)) |+\beta }{\alpha _{t}}\\ \le \,&2\Vert H\Vert _{\infty }^{2}\frac{2D+2\min \{m,n\}\beta }{\alpha _{t}} \end{aligned} \end{aligned}$$

Finally, applying Lemma 4, we obtain

$$\begin{aligned} D^{2}F_{t}(S(X))(Y,Y)\ge \frac{\alpha _{t}}{4(D+\min \{m,n\}\beta )}\Vert Y\Vert _{1}^{2}, \end{aligned}$$

which implies the $\frac{\alpha _{t}}{4(D+\min \{m,n\}\beta )}$-strong convexity of $F_{t}$ over ${\mathcal{K}}$.

Finally, we prove the strongly convexity of $\Phi _{t}$ over $B(D)\in {\mathbb{R}}^{m+n}$. Let $X,Y\in B(D)$ be arbitrary matrices in the nuclear ball. The following inequality can be obtained

$$\begin{aligned} \begin{aligned}&2\Phi _{t}(X)-2\Phi _{t}(Y)\\ =\,&\Phi _{t}(S(X))-\Phi _{t}(S(Y))\\ \ge \,&\langle D\Phi _{t}(S(Y)),S(X)-S(Y)\rangle _{F}+\frac{\alpha _{t}}{8(D+\min \{m,n\}\beta )}\Vert S(X)-S(Y)\Vert _{1}^{2}\\ =\,&2\langle D\Phi _{t}(Y),X-Y\rangle _{F}+\frac{\alpha _{t}}{2(D+\min \{m,n\}\beta )}\Vert X-Y\Vert _{1}^{2},\\ \end{aligned} \end{aligned}$$

which implies the $\frac{\alpha _{t}}{2(D+\min \{m,n\}\beta )}$-strong convexity of $\Phi _{t}$ as desired. $\square$

1.2 Appendix 2.2: Proof of Theorem 4

Proof

The proof is almost identical to the proof of Theorem 1. From the strong convexity of $\Phi _{t}$ shown in Theorem 3 and the general upper bound in Proposition 1, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&r_{1}(x_{1})+{\mathcal{B}}_{\phi _{1}}(x,x_{1})+\sum _{t=1}^{T}({\mathcal{B}}_{\phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\phi _{t}}(x,x_{t}))+\sum _{t=1}^{T}\frac{\Vert g_{t}-h_{t}\Vert _{*} ^{2}}{2\eta _{t+1}}. \end{aligned} \end{aligned}$$

(21)

Using Lemma 8, we have

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}({\mathcal{B}}_{\Phi _{t+1}}(x,x_{t})-{\mathcal{B}}_{\Phi _{t}}(x,x_{t}))\\ \le \,&4D(\ln (D+1)+\ln \min \{m,n\})\sum _{t=2}^{T}(\alpha _{t+1}-\alpha _{t})\\ \le \,&4D(\ln (D+1)+\ln \min \{m,n\})\alpha _{T+1}\\ =\,&4D(\ln (D+1)+\ln \min \{m,n\}) \eta \sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert ^{2}_{\infty }}\\ =\,&4D\sqrt{\ln (D+1)+\ln \min \{m,n\}} \sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert ^{2}_{\infty }}\\ \end{aligned} \end{aligned}$$

Furthermore, from Lemma 6, it follows

$$\begin{aligned} \begin{aligned} \sum _{t=1}^{T}\frac{(D+1)\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{4\alpha _{t}}\le \,&\frac{D+1}{2}\sqrt{\ln (D+1)+\ln \min \{m,n\}}\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}\\ \end{aligned} \end{aligned}$$

The claimed result is obtained by combining the inequalities above. $\square$

1.3 Appendix 2.3: Proof of Theorem 5

Proof

Since ${\mathcal{B}}_{\Phi _{t}}(x,x_{1})$ is non-negative, increasing and $\frac{2\alpha _{t}}{D+\beta d}$ strongly-convex w.r.t. $\Vert \cdot \Vert _{1}$, Proposition 2 can be directly applied, and we get

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&{\mathcal{B}}_{\Phi _{t}}(x,x_{1})+\sum _{t=1}^{T}\frac{2D\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}\alpha _{t}^{2}}{(D+\beta d)^{2}} +\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ =\,&{\mathcal{B}}_{\Phi _{t}}(x,x_{1})+\frac{2D}{\eta }\sum _{t=1}^{T}\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}}{(D+\beta d)^{2}}\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +\frac{1}{\eta ^{2}}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ \end{aligned} \end{aligned}$$

Setting $\beta =\frac{1}{\min \{m,n\}}$ and $\eta =\frac{1}{\sqrt{\ln (D+1)+\ln \min \{m,n\}}}$, we have

$$\begin{aligned} \begin{aligned}&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}}{(D+\beta d)^{2}}\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +\frac{1}{\eta ^{2}}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ =\,&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\frac{64D^{2}}{(D+1)^{2}}\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +(\ln (D+1)+\ln d)\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ \le&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\sum _{s=1}^{t-1}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty } +\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}}\\ =\,&\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\sum _{s=1}^{t}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty }}},\\ \end{aligned} \end{aligned}$$

where the inequality uses the assumptions $D\ge 1$ and $\min \{m,n\}>e$. Adding up from 1 to T, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&{\mathcal{B}}_{\Phi _{t}}(x,x_{1})+2D\sqrt{\ln (D+1)+\ln \min \{m,n\}}\sum _{t=1}^{T}\frac{\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}{\sqrt{\sum _{s=1}^{t}\Vert g_{s}-h_{t}\Vert ^{2}_{\infty }}}\\ \le \,&{\mathcal{B}}_{\Phi _{t}}(x,x_{1})+4D\sqrt{\ln (D+1)+\ln \min \{m,n\}}\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}} \end{aligned} \end{aligned}$$

The first term can be bounded by Lemma 8

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\Phi _{T+1}}(x,x_{1})\le \,&4D(\ln (D+1)+\ln \min \{m,n\})\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}}\\ \end{aligned} \end{aligned}$$

Combining the inequalities above, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{R}}_{1:T}\le \,&c(D,m,n)\sqrt{\sum _{t=1}^{T}\Vert g_{t}-h_{t}\Vert _{\infty }^{2}},\\ \end{aligned} \end{aligned}$$

with $c(D,m,n)\in {\mathcal{O}}(D\sqrt{\ln (D+1)+\ln \min \{m,n\}})$, which is the claimed result. $\square$

Appendix 3: Missing Proofs of section 3.4

1.1 Appendix 3.1: Proof of Lemma 5

Proof of Lemma 5

Let $x^{*}$ be the minimiser of ${\mathcal{B}}_{\psi _{t+1}}(x,y_{t+1})$ in ${\mathcal{K}}$. Using the the fact $\ln a\ge 1-\frac{1}{a}$, we obtain

$$\begin{aligned} \ln \left( \frac{\vert x^{*}_{i} |}{\beta }+1\right) \ge \frac{\vert x^{*}_{i} |}{\vert x^{*}_{i}+\beta |} \end{aligned}$$

and

$$\begin{aligned} ((\vert x^{*}_{i} |+\beta )\ln \left( \frac{\vert x^{*}_{i} |}{\beta }+1\right) -\vert x^{*}_{i} |\ge 0. \end{aligned}$$

Thus, $y_{i}=0$ implies $x^{*}_{i}=0$. Furthermore ${\text{sgn}}(x^{*}_{i})={\text{sgn}}(y_{i})$ must hold for all i with $y_{i}\ne 0$, since otherwise we can always flip the sign of $x^{*}_{i}$ to obtain smaller objective value. So we assume without loss of generality that $y_{i}\ge 0$. We claim that $\sum _{i=1}^{d} x^{*}_{i}=D$ holds for the minimiser $x^{*}$. If it is not the case, there must be some i with $x^{*}_{i}< y_{i}$, and increasing $x^{*}_{i}$ by a small enough amount can decrease the objective function. Thus minimising the Bregman divergence can be rewritten into

$$\begin{aligned} \begin{aligned} \min _{x\in {\mathbb{R}}^{d}}\quad&\sum _{i=1}^{d}\left( (x_{i}+\beta )\ln \frac{x_{i}+\beta }{y_{i}+\beta }-x_{i}\right) \\ {\text{s.t.}} \quad&\sum _{i=1}^{d} x_{i}=D\\&x_{i}\ge 0 {\text{ for all }} i=1,\ldots , d. \\ \end{aligned} \end{aligned}$$

(22)

Using Lagrange multipliers for $x\in {\mathbb{R}}^{d}$, $\lambda \in {\mathbb{R}}$ and $\nu \in {\mathbb{R}}^{d}_+$

$$\begin{aligned} {\mathcal{L}}(x,\lambda ,\nu )=\sum _{i=1}^{d}\left( (x_{i}+\beta )\ln \frac{x_{i}+\beta }{y_{i}+\beta }-x_{i}\right) -\nu ^{\top } x-\lambda \left( D-\sum _{i=1}^{d} x_{i}\right) . \end{aligned}$$

Setting $\frac{\partial {\mathcal{L}}}{\partial x_{i}}=0$, we obtain

$$\begin{aligned} \ln \frac{x_{i}+\beta }{y_{i}+\beta }=\nu _{i}-\lambda . \end{aligned}$$

From the complementary slackness, we have $\nu _{i}=0$ for $x_{i}\ne 0$, which implies

$$\begin{aligned} x_{i}+\beta =\frac{1}{z}(y_{i}+\beta ), \end{aligned}$$

where $z=\exp (\lambda )$. Let $x^{*}$ be the minimiser and ${\mathcal{I}}=\{i:x^{*}_{i}>0\}$ the support of $x^{*}$. Then we have

$$\begin{aligned} D+\vert {\mathcal{I}} |\beta =\frac{1}{z}\left( \sum _{i\in {\mathcal{I}}}y_{i}+\vert {\mathcal{I}} |\beta \right) . \end{aligned}$$

Let p be a permutation of $\{1,\ldots ,d\}$ such that $y_{p(i)}\le y_{p(i+1)}$. Define

$$\begin{aligned} \theta (j)=y_{p(j)}(D+(d-j+1)\beta )+\beta D-\beta \sum _{i\ge j}y_{p(i)}. \end{aligned}$$

It follows from

$$\begin{aligned} \theta (j+1)-\theta (j)=(y_{p(j+1)}-y_{p(j)})(D+(d-j+1)\beta )\ge 0 \end{aligned}$$

that $\theta (j)$ is increasing in j. Let $\rho =\min \{i|\theta (i) >0\}$. For all $j<\rho$, p(j) is not in the support ${\mathcal{I}}$, since otherwise it would imply $x^{*}_{p(j)}\le 0$. Thus the minimisation problem (22) is equivalent to

$$\begin{aligned} \begin{aligned} \min _{x\in {\mathbb{R}}^{d}}\quad&\sum _{i=\rho }^{d}(x_{p(i)}+\beta )\ln \frac{x_{p(i)}+\beta }{y_{p(i)}+\beta }\\ {\text{s.t.}} \quad&\sum _{i=\rho }^{d} x_{p(i)}=D\\&x_{p(i)}> 0 {\text{ for all }} i=\rho ,\ldots , d. \\ \end{aligned} \end{aligned}$$

(23)

Define function $R:{\mathbb{R}}_{>0}\rightarrow {\mathbb{R}}, x\mapsto x\ln x$. It can be verified that R is convex. The objective function in (23) can be further rewritten into

$$\begin{aligned} \begin{aligned}&\sum _{i=\rho }^{d}(x_{p(i)}+\beta )\ln \frac{x_{p(i)}+\beta }{y_{p(i)}+\beta }\\ =\,&\sum _{i=\rho }^{d}(y_{p(i)}+\beta )R\left( \frac{x_{p(i)}+\beta }{y_{p(i)}+\beta }\right) \\ \ge \,&\frac{1}{\sum _{i=\rho }^{d}(y_{p(i)}+\beta )}R\left( \frac{\sum _{i=\rho }^{d}(x_{p(i)}+\beta )}{\sum _{i=\rho }^{d}(y_{p(i)}+\beta )}\right) \\ =\,&\frac{1}{\sum _{i=\rho }^{d}(y_{p(i)}+\beta )}R\left( \frac{D+(d-\rho +1)\beta }{\sum _{i=\rho }^{d}(y_{p(i)}+\beta )}\right) ,\\ \end{aligned} \end{aligned}$$

where the inequality follows from the Jensen’s inequality. The minimum is attained if and only if $\frac{x_{p(i)}+\beta }{y_{p(i)}+\beta }$ are equal for all i. This is only possible when $\sigma (i)$ is in the support ${\mathcal{I}}$ for all $i\ge \rho$. Thus we can set $z=\frac{\sum _{i=\rho }^{d}(\vert y_{p(i)} |+\beta )}{D+(d-\rho +1)\beta }$ and obtain $x^{*}_{i}=\max \{\frac{(\vert y_{i} |+\beta )-\beta }{z},0\}{\text{sgn}}(y_{i})$ for $i=1\ldots d$, which is the claimed result. $\square$

1.2 Appendix 3.2: Proof of Corollary 1

Proposition 3

Let $\{x_{t}\}$ be any sequences and $\{y_{t}\}$ be the sequence produced by $y_{t+1}=\frac{a_{t}}{a_{1:t}}x_{t}+(1-\frac{a_{t}}{a_{1:t}})y_{t}$. Choosing $a_{t}>0$ , we have, for all $x\in {\mathcal{W}}$

$$\begin{aligned} \begin{aligned} a_{1:T}{\mathbb{E}}[f(y_{T+1})-f(x)]\le \,&{\mathbb{E}}[{\mathcal{R}}_{1:T}]-\sum _{t=1}^{T}(a_{1:t-1}{\mathcal{B}}_{l}(y_{t},y_{t+1})),\\ \end{aligned} \end{aligned}$$

with ${\mathcal{R}}_{1:T}=\sum _{t=1}^{T}a_{t}(\langle g_{t},x_{t}-x\rangle +r(x_{t})-r(x))$.

Proof

It is interesting to see that the average scheme can be considered as an instance of the linear coupling introduced in Allen-Zhu and Orecchia (2017). For any sequence $\{x_{t}\}$, $\{y_{t}\}$ and $z_{t}=\frac{a_{t}}{a_{1:t}}x_{t}+(1-\frac{a_{t}}{a_{1:t}})y_{t}$, we start the proof by bounding $a_{t}(f(y_{t+1})-f(x))$ as follows

$$\begin{aligned} \begin{aligned}&a_{t}(l(y_{t+1})-l(x))\\ =\,&a_{t}(l(y_{t+1})-l(z_{t})+l(z_{t})-l(x))\\ =\,&a_{t}(l(y_{t+1})-l(z_{t})+\langle \triangledown l(z_{t}),z_{t}-x\rangle -{\mathcal{B}}_{l}(z_{t},x))\\ =\,&a_{t}(l(y_{t+1})-l(z_{t})+\langle \triangledown l(z_{t}),z_{t}-x_{t}\rangle +\langle \triangledown l(z_{t}),x_{t}-x\rangle -{\mathcal{B}}_{l}(z_{t},x))\\ \end{aligned} \end{aligned}$$

(24)

Denote by $\tau _{t}=\frac{a_{t}}{a_{1:t}}$ the weight. The first term of the the inequality above can be further bounded by

$$\begin{aligned} \begin{aligned}&a_{t}(l(y_{t+1})-l(z_{t})+\langle \triangledown l(z_{t}),z_{t}-x_{t}\rangle )\\ =\,&a_{t}(l(y_{t+1})-l(z_{t})+\frac{1-\tau _{t}}{\tau _{t}}\langle \triangledown l(z_{t}),y_{t}-z_{t}\rangle )\\ =\,&a_{t}(l(y_{t+1})-l(z_{t})+\left( \frac{1}{\tau _{t}}-1\right) (l(y_{t})-l(z_{t}))-\left( \frac{1}{\tau _{t}}-1\right) {\mathcal{B}}_{l}(y_{t},z_{t}))\\ =\,&a_{t}\left( \frac{1}{\tau _{t}}-1\right) (l(y_{t})-l(y_{t+1}))+\frac{a_{t}}{\tau _{t}}(l(y_{t+1})-l(z_{t}))-a_{1:t-1}{\mathcal{B}}_{l}(y_{t},z_{t}).\\ \end{aligned} \end{aligned}$$

(25)

Next, we have

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}a_{t}\left( \frac{1}{\tau _{t}}-1\right) (f(y_{t})-f(y_{t+1}))\\ =\,&\sum _{t=2}^{T}a_{1:t-1}(f(y_{t})-f(y_{t+1}))\\ =\,&\sum _{t=1}^{T-1}a_{t}f(y_{t+1})-a_{1:T-1}f(y_{T+1})\\ =\,&\sum _{t=1}^{T}a_{t}f(y_{t+1})-a_{1:T}f(y_{T+1})\\ =\,&\sum _{t=1}^{T}a_{t}(f(y_{t+1})-f(y_{T+1})) \end{aligned} \end{aligned}$$

(26)

Combining (24), (25) and (26), we have

$$\begin{aligned} \begin{aligned} a_{1:T}(f(y_{T+1})-f(x))=\,&\sum _{t=1}^{T}\frac{a_{t}}{\tau _{t}}(l(y_{t+1})-l(z_{t}))\\&+\, \sum _{t=1}^{T}\langle \triangledown l(z_{t}),x_{t}-x\rangle \\&-\, \sum _{t=1}^{T}(a_{1:t-1}{\mathcal{B}}_{l}(y_{t},z_{t})-a_{t}{\mathcal{B}}_{l}(z_{t},x)), \end{aligned} \end{aligned}$$

Simply setting $y_{t+1}:=z_{t}$ makes the first term above 0 and implies $z_{t}=\frac{\sum _{s=1}^{t}a_{s}x_{s}}{a_{1:t}}$. Furthermore it follows from the convexity of $r$

$$\begin{aligned} r(y_{T+1})=r\left( \frac{\sum _{s=1}^{T}a_{s}x_{s}}{a_{1:T}}\right) \le \sum _{t=1}^{T}\frac{a_{t}r(x_{t})}{a_{1:T}}. \end{aligned}$$

Combining the inequalities above and rearranging, we obtain

$$\begin{aligned} \begin{aligned} a_{1:T}(f(y_{T+1})-f(x))\le \,&\sum _{t=1}^{T}a_{t}(\langle \triangledown l(z_{t}),x_{t}-x\rangle +r(x_{t})-r(x))\\&-\sum _{t=1}^{T}(a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})+a_{t}{\mathcal{B}}_{l}(z_{t},x))\\ \le \,&\sum _{t=1}^{T}a_{t}(\langle \triangledown l(z_{t}),x_{t}-x\rangle +r(x_{t})-r(x))\\&-\sum _{t=1}^{T}(a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t}))\\ \end{aligned} \end{aligned}$$

Furthermore, we have

$$\begin{aligned} &{\mathbb{E}}\left[\sum _{t=1}^{T}\langle a_{t}\triangledown l(z_{t}),x_{t}-x\rangle \right]\\ =\,&{\mathbb{E}}\left[\sum _{t=1}^{T}\langle a_{t}g_{t},x_{t}-x\rangle \right]+{\mathbb{E}}\left[\sum _{t=1}^{T}\langle a_{t}(\triangledown l_{t}-g_{t}),x_{t}-x\rangle \right]\\ =\,&{\mathbb{E}}\left[\sum _{t=1}^{T}\langle a_{t}g_{t},x_{t}-x\rangle \right]+\sum _{t=1}^{T}{\mathbb{E}}\left[\langle a_{t}(\triangledown l_{t}-g_{t}),x_{t}-x\rangle \right]\\ =\,&{\mathbb{E}}\left[\sum _{t=1}^{T}\langle a_{t}g_{t},x_{t}-x\rangle \right]+\sum _{t=1}^{T}{\mathbb{E}}\left[{\mathbb{E}}\left[\langle a_{t}(\triangledown l_{t}-g_{t}),x_{t}-x\rangle |z_{t}\right]\right]\\ =\,&{\mathbb{E}}\left[\sum _{t=1}^{T}\langle a_{t}g_{t},x_{t}-x\rangle \right].\\ \end{aligned}$$

Finally, we we obtain

$$\begin{aligned} a_{1:T}{\mathbb{E}}[f(y_{T+1})-f(x)]\le \,&{\mathbb{E}}\left[\sum _{t=1}^{T}a_{t}(\langle g_{t},x_{t}-x\rangle +r(x_{t})-r(x))\right]\\&-\sum _{t=1}^{T}(a_{1:t-1}{\mathcal{B}}_{l}(y_{t},y_{t+1})),\\ \end{aligned}$$

which is the claimed result. $\square$

Proof of Corollary 1

First of all, we have

$$\begin{aligned} {\mathbb{E}}[{\mathcal{R}}_{1:T}]\le \,&c_{1}+c_{2}{\mathbb{E}}\left[\sqrt{\sum _{t=1}^{T}\Vert a_{t}(g_{t}-g_{t-1})\Vert _{*} ^{2}}\right]\\ \le \,&c_{1}+c_{2}\sqrt{\sum _{t=1}^{T}{\mathbb{E}}[\Vert a_{t}(g_{t}-g_{t-1})\Vert _{*} ^{2}]}\\ \le \,&c_{1}+c_{2}\sqrt{\sum _{t=1}^{T}{\mathbb{E}}[\Vert a_{t}(g_{t}-g_{t-1})\Vert _{*} ^{2}|z_{t}]}.\\ \end{aligned}$$

(27)

For all t, we have

$$\begin{aligned} \begin{aligned} {\mathbb{E}}[\Vert a_{t}(g_{t}-g_{t-1})\Vert _{*} ^{2}\vert z_{t}]\le \,&2a_{t}^{2} ({\mathbb{E}}[\Vert g_{t}-\triangledown l(z_{t})-g_{t-1}+\triangledown l(z_{t-1})\Vert _{*} ^{2}|z_{t}])\\&+\, 2a_{t}^{2} (\Vert \triangledown l(z_{t})-\triangledown l(z_{t-1})\Vert _{*} ^{2}). \end{aligned} \end{aligned}$$

(28)

Since $z_{t-1}$ is fixed when $z_{t}$ is given, the first term above can be bounded by

$$\begin{aligned} \begin{aligned}&2a_{t}^{2} ({\mathbb{E}}[\Vert g_{t}-\triangledown l(z_{t})-g_{t-1}+\triangledown l(z_{t-1})\Vert _{*} ^{2}|z_{t}])\\ \le \,&4a_{t}^{2} ({\mathbb{E}}[\Vert g_{t}-\triangledown l(z_{t})\Vert _{*} ^{2}|z_{t}]+{\mathbb{E}}[\Vert g_{t-1}-\triangledown l(z_{t-1})\Vert _{*} ^{2}|z_{t}])\\ \le \,&4a_{t}^{2} ({\mathbb{E}}[\Vert g_{t}-\triangledown l(z_{t})\Vert _{*} ^{2}|z_{t}]+{\mathbb{E}}[\Vert g_{t-1}-\triangledown l(z_{t-1})\Vert _{*} ^{2}|z_{t-1}])\\ \le \,&4a_{t}^{2} \big (\nu _{t}^{2}+\nu _{t-1}^{2}\big ).\\ \end{aligned} \end{aligned}$$

Since ${\mathcal{K}}$ is compact, there is some $L>0$ such that $\Vert \triangledown l(z)\Vert _{*} \le L$ for all $z\in {\mathbb{X}}$. Thus the second term of (28) can be bounded by

$$\begin{aligned} \begin{aligned} 2a_{t}^{2} \Vert \triangledown l(z_{t})-\triangledown l(z_{t-1})\Vert _{*} ^{2}\le 8a_{t}^{2}L^{2} \end{aligned} \end{aligned}$$

(29)

Combining (27), (28) and (29), we have

$$\begin{aligned} {\mathbb{E}}[{\mathcal{R}}_{1:T}]\le c1+c2\sqrt{8\sum _{t=1}^{T}a_{t}^{2}\big (\nu _{t}^{2}+L^{2}\big )}, \end{aligned}$$

and combining with Proposition 3, we obtain

$$\begin{aligned} \begin{aligned} {\mathbb{E}}[f(z_{T})-f(x)]\le \,&\frac{c1+c2\sqrt{8\sum _{t=1}^{T}a_{t}^{2}\big (\nu _{t}^{2}+L^{2}\big )}}{a_{1:T}}.\\ \end{aligned} \end{aligned}$$

If l is M-smooth, then for $t\ge 2$, we have

$$\begin{aligned} \begin{aligned} 2a_{t}^{2} \Vert \triangledown l(z_{t})-\triangledown l(z_{t-1})\Vert _{*} ^{2}\le \,&\frac{4Ma_{t}^{2}}{a_{1:t-1}}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t}).\\&8Ma_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t}).\\ \end{aligned} \end{aligned}$$

(30)

Using fact $2ab-a^{2}\le b^{2}$, we have

$$\begin{aligned} \begin{aligned}&2c_{2}\sqrt{2M\sum _{t=2}^{T}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})}-\sum _{t=2}^{T}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})\\ \le \,&2Mc_{2}^{2}. \end{aligned} \end{aligned}$$

(31)

Combining (27), (28) and (31), we have

$$\begin{aligned} \begin{aligned}&{\mathbb{E}}[{\mathcal{R}}_{1:T}]-\sum _{t=1}^{T}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})\\ \le \,&c_{1}+c_{2}\sqrt{\sum _{t=1}^{T}{\mathbb{E}}[\Vert a_{t}(g_{t}-g_{t-1})\Vert _{*} ^{2}|z_{t}]}-\sum _{t=1}^{T}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})\\ \le \,&c_{1}1+c_{2}\sqrt{8\sum _{t=1}^{T}a_{t}^{2}(\nu _{t}^{2})}\\&+ \, c_{2}\sqrt{\sum _{t=1}^{T}2a_{t}^{2} \Vert \triangledown l(z_{t})-\triangledown l(z_{t-1})\Vert _{*} ^{2}}-\sum _{t=1}^{T}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})\\ \le \,&c_{1}1+c_{2}\sqrt{8\sum _{t=1}^{T}a_{t}^{2}(\nu _{t}^{2})}+c_{2}\sqrt{2}\Vert \triangledown l(z_{1})\Vert _{*} \\&+ \, c_{2}\sqrt{\sum _{t=2}^{T}2a_{t}^{2} \Vert \triangledown l(z_{t})-\triangledown l(z_{t-1})\Vert _{*} ^{2}}-\sum _{t=2}^{T}a_{1:t-1}{\mathcal{B}}_{l}(z_{t-1},z_{t})\\ \le \,&c_{1}+c_{2}\sqrt{8\sum _{t=1}^{T}a_{t}^{2}(\nu _{t}^{2})}+\sqrt{2}c_{2}L+2Mc_{2}^{2},\\ \end{aligned} \end{aligned}$$

which implies

$$\begin{aligned} \begin{aligned} {\mathbb{E}}[f(z_{T})-f(x)]\le \,&\frac{ c_{1}+c_{2}\sqrt{8\sum _{t=1}^{T}a_{t}^{2}\nu _{t}^{2}}+\sqrt{2}c_{2}L+2Mc_{2}^{2}}{a_{1:T}}.\\ \end{aligned} \end{aligned}$$

$\square$

Appendix 4: Technical lemmata

Lemma 6

For positive values $a_{1},\ldots ,a_{n}$ the following holds:

1.
$$\begin{aligned} \sum _{i=1}^{n}\frac{a_{i}}{\sum _{k=1}^{i}a_{k}+1}\le \log \left( \sum _{i=1}^{n}a_{i}+1\right) \end{aligned}$$
2.
$$\begin{aligned} \sqrt{\sum _{i=1}^{n}a_{i}}\le \sum _{i=1}^{n}\frac{a_{i}}{\sqrt{\sum _{j=1}^ia_{j}^{2}}}\le 2\sqrt{\sum _{i=1}^{n}a_{i}}. \end{aligned}$$

Proof

The proof of (1) can be found in Lemma A.2 in Levy et al. (2018) For (2), we define $A_{0}=1$ and $A_{i}=\sum _{k=1}^{i}a_{i}+1$ for $i>0$. Then we have

$$\begin{aligned} \begin{aligned} \sum _{i=1}^{n}\frac{a_{i}}{\sum _{k=1}^{i}a_{k}+1}=\,&\sum _{i=1}^{n}\frac{A_{i}-A_{i-1}}{A_{i}}\\ =\,&\sum _{i=1}^{n}\left( 1-\frac{A_{i-1}}{A_{i}}\right) \\ \le \,&\sum _{i=1}^{n}\ln \frac{A_{i}}{A_{i-1}}\\ =\,&\ln A_{n}-\ln A_{0}\\ =\,&\ln \sum _{i=1}^{n}(a_{i}+1), \end{aligned} \end{aligned}$$

where the inequality follows from the concavity of $\log$. $\square$

Lemma 7

Let l be convex and M-smooth over ${\mathbb{X}}$, i.e.

$$\begin{aligned} l(x)\le l(y)+\langle \triangledown l(y),x-y\rangle +\frac{M}{2}\Vert x-y\Vert ^{2}. \end{aligned}$$

Then

$$\begin{aligned} \Vert \triangledown l(x)-\triangledown l(y)\Vert _{*} ^{2}\le 2M{\mathcal{B}}_{l}(x,y) \end{aligned}$$

holds for all $x,y\in {\mathbb{X}}$.

Proof

Let $x,y\in {\mathbb{X}}$ be arbitrary. Define $h:{\mathbb{X}}\rightarrow {\mathbb{R}}, z\mapsto l(z)-\langle \triangledown l(y),z\rangle$. Clearly, h is M-smooth and minimised at y. Thus we have

$$\begin{aligned} \begin{aligned} h(y)=\,&\min _{z\in {\mathbb{X}}}h(z)\\ \le \,&\min _{z\in {\mathbb{X}}}h(x)+\langle \triangledown h(x),z-x\rangle +\frac{M}{2}\Vert z-x\Vert ^{2}\\ \le \,&\min _{\gamma \ge 0}h(x)-\Vert \triangledown h(x)\Vert _{*} \gamma +\frac{M}{2}\gamma ^{2}\\ =\,&h(x)-\frac{1}{2M}\Vert \triangledown h(x)\Vert _{*} ^{2}, \end{aligned} \end{aligned}$$

where the first inequality uses the M-smoothness of h, and the second uses $\langle \triangledown h(x),z-x\rangle \ge -\Vert \triangledown h(x)\Vert _{*} \Vert z-x\Vert$, for which we choose z such that the equality holds. This implies

$$\begin{aligned} \begin{aligned} \frac{1}{2M}\Vert \triangledown l(x)-\triangledown l(y)\Vert _{*} ^{2}\le \,&l(x)-l(y)-\langle \triangledown l(y),x-y\rangle ={\mathcal{B}}_{l}(x,y), \end{aligned} \end{aligned}$$

and the desired result follows. $\square$

Lemma 8

Define $\psi :{\mathbb{R}}^{d}\rightarrow {\mathbb{R}}, x\mapsto \sum _{i=1}^{d}\phi (x_{i})$ for $\phi$ be as defined in (1). Assume $\Vert x\Vert _{1}\le D$ for all $x\in {\mathcal{K}}\subseteq {\mathbb{R}}^{d}$. Setting $\beta =\frac{1}{d}$, we obtain for all $x,y\in {\mathcal{K}}$

$$\begin{aligned} {\mathcal{B}}_{\psi }(x,y)\le 4D(\ln (D+1)+\ln d). \end{aligned}$$

Similarly, we define $\Psi :{\mathbb{R}}^{m,n}\rightarrow {\mathbb{R}}, x\mapsto \psi \circ \sigma (x)$. Assume $\Vert x\Vert _{1}\le D$ for all $x\in {\mathcal{K}}\subseteq {\mathbb{R}}^{m,n}$. Setting $\beta =\frac{1}{\min \{m,n\}}$, we obtain for all $x,y\in {\mathcal{K}}$

$$\begin{aligned} {\mathcal{B}}_{\Psi }(x,y)\le 4D(\ln (D+1)+\ln \min \{m,n\}). \end{aligned}$$

Proof

From the definition of the Bregman divergence it follows for all $x,y\in {\mathcal{K}}$

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\psi }(x,y)=\,&\psi (x)-\psi (y)-\langle \triangledown \psi (y),x-y\rangle \\ \le \,&\langle \triangledown \psi (x)-\triangledown \psi (y),x-y\rangle \\ \le \,&\Vert \triangledown \psi (x)-\triangledown \psi (y)\Vert _{\infty }\Vert x-y\Vert _{1}\\ \le \,&2D(\Vert \triangledown \psi (x)\Vert _{\infty }+\Vert \triangledown \psi (y)\Vert _{\infty }). \end{aligned} \end{aligned}$$

Using the closed form of $\Vert \triangledown \psi (x)\Vert _{\infty }$, we have for $x\in {\mathcal{K}}$

$$\begin{aligned} \begin{aligned} \Vert \triangledown \psi (x)\Vert _{\infty } =\,&\max _{i}\vert \ln \left( \frac{\vert x_{i} |}{\beta }+1\right) |\\ \le \,&\vert \ln (D+\beta ) |+\vert \ln \left( \frac{1}{\beta }\right) |\\ \le \,&\ln (D+1)+\ln d. \end{aligned} \end{aligned}$$

Combining the inequalities above and choosing $\beta = \frac{1}{d}$, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\psi }(x,y)=\,&4D(\ln (D+1)+\ln d). \end{aligned} \end{aligned}$$

Using the same argument, we have for all $x,y\in {\mathcal{K}}\subseteq {\mathbb{R}}^{m,n}$

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\Psi }(x,y)=\,&2D(\Vert \triangledown \Psi (x)\Vert _{\infty }+\Vert \triangledown \Psi (y)\Vert _{\infty }). \end{aligned} \end{aligned}$$

From the characterisation of subgradient, it follows for $x\in {\mathcal{K}}$

$$\begin{aligned} \begin{aligned} \Vert \triangledown \Psi (x)\Vert _{\infty } = \,&\Vert \triangledown \phi (\sigma (x))\Vert _{\infty }\\ \le \,&\ln (D+1)+\ln \frac{1}{\beta }. \end{aligned} \end{aligned}$$

Combine the inequalities above and choose $\beta = \frac{1}{\min \{m,n\}}$, we obtain

$$\begin{aligned} \begin{aligned} {\mathcal{B}}_{\Psi }(x,y)\le \,&4D(\ln (D+1)+\ln \min \{m,n\}). \end{aligned} \end{aligned}$$

$\square$

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Shao, W., Sivrikaya, F. & Albayrak, S. Optimistic optimisation of composite objective with exponentiated update. Mach Learn 111, 4719–4764 (2022). https://doi.org/10.1007/s10994-022-06229-1

Download citation

Received: 03 February 2022
Revised: 31 May 2022
Accepted: 22 July 2022
Published: 22 August 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10994-022-06229-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Optimistic optimisation of composite objective with exponentiated update

Abstract

Similar content being viewed by others

Adaptive Zeroth-Order Optimisation of Nonconvex Composite Objectives

A hybrid stochastic optimization framework for composite nonconvex optimization

Constrained composite optimization and augmented Lagrangian methods

1 Introduction

2 Related work

3 Preliminary

4 Algorithms and analysis

Lemma 1

4.1 Algorithms in the Euclidean space

Lemma 2

Theorem 1

Theorem 2

4.2 Spectral algorithms

Theorem 3

Lemma 3

Lemma 4

Theorem 4

Theorem 5

5 Derived algorithms

5.1 Elastic net regularisation

5.2 Nuclear and Frobenius regularisation

5.3 Projection onto the cross-polytope

Lemma 5

5.4 Stochastic acceleration

Corollary 1

6 Experiments

6.1 Online logistic regression

6.2 Online multitask learning

6.3 Optimisation for contrastive explanations

7 Conclusion

Availability of data and materials

Code availability

Change history

19 November 2022

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendices

Appendix 1: Missing proofs of section 3.1

1.1 Appendix 1.1: Proof of Lemma 1

Proof

1.2 Appendix 1.2: Proof of Lemma 2

Proof

1.3 Appendix 1.3: Proof of Theorem 1

Proposition 1

Proof

Proof of Theorem 1

1.4 Appendix 1.4: Proof of Theorem 2

Proposition 2

Proof of Proposition 2

Proof of Theorem 2

Appendix 2: Missing Proofs of section 3.2

1.1 Appendix 2.1: Proof of Theorem 3

Proof of Lemma 3

Proof of Lemma 4

Proof of Theorem 3

1.2 Appendix 2.2: Proof of Theorem 4

Proof

1.3 Appendix 2.3: Proof of Theorem 5

Proof

Appendix 3: Missing Proofs of section 3.4

1.1 Appendix 3.1: Proof of Lemma 5

Proof of Lemma 5

1.2 Appendix 3.2: Proof of Corollary 1

Proposition 3

Proof

Proof of Corollary 1