
Machine Learning, Volume 95, Issue 2, pp 183–223

Regret bounded by gradual variation for online convex optimization

  • Tianbao Yang
  • Mehrdad Mahdavi
  • Rong Jin
  • Shenghuo Zhu

Abstract

Recently, it has been shown that the regret of the Follow the Regularized Leader (FTRL) algorithm for online linear optimization can be bounded by the total variation of the cost vectors rather than the number of rounds. In this paper, we extend this result to general online convex optimization. In particular, this resolves an open problem that has been posed in a number of recent papers. We first analyze the limitations of the FTRL algorithm as proposed by Hazan and Kale (in Machine Learning 80(2–3), 165–188, 2010) when applied to online convex optimization, and extend the definition of variation to a gradual variation which is shown to be a lower bound of the total variation. We then present two novel algorithms that bound the regret by the gradual variation of cost functions. Unlike previous approaches that maintain a single sequence of solutions, the proposed algorithms maintain two sequences of solutions that make it possible to achieve a variation-based regret bound for online convex optimization.

To establish the main results, we discuss a lower bound for FTRL that maintains only one sequence of solutions, and a necessary condition on the smoothness of the cost functions for obtaining a gradual variation bound. We extend the main results three-fold: (i) we present a general method to obtain a gradual variation bound measured by a general norm; (ii) we extend the algorithms to a class of online non-smooth optimization with a gradual variation bound; and (iii) we develop a deterministic algorithm for online bandit optimization in the multi-point bandit setting.

Keywords

Online convex optimization · Regret bound · Gradual variation · Bandit

1 Introduction

We consider the general online convex optimization problem (Zinkevich 2003), which proceeds in trials. At each trial, the learner is asked to predict a decision vector \(\mathbf{x}_t\) that belongs to a bounded closed convex set \(\mathcal{P}\subseteq\mathbb{R}^{d}\); it then receives a cost function \(c_{t}(\cdot):\mathcal{P}\rightarrow \mathbb{R}\) and incurs a cost of \(c_t(\mathbf{x}_t)\) for the submitted solution. The goal of online convex optimization is to come up with a sequence of solutions \(\mathbf{x}_1,\ldots,\mathbf{x}_T\) that minimizes the regret, defined as the difference between the cost accumulated by the learner's decisions up to trial T and the cost of the best fixed decision in hindsight, i.e.
$$ \mbox{Regret}_T = \sum_{t=1}^T c_t(\mathbf{x}_t) -\min_{\mathbf{x}\in \mathcal{P}}\sum _{t=1}^T c_t(\mathbf{x}). $$
In the special case of linear cost functions, i.e. \(c_{t}(\mathbf{x}) = \mathbf{f}_{t}^{\top }\mathbf{x}\), the problem becomes online linear optimization. The goal is to design algorithms that predict, with a small regret, the solution \(\mathbf{x}_t\) at the tth trial given the (partial) knowledge of the past cost functions \(c_\tau(\cdot), \tau=1,\ldots,t-1\).
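To make the protocol concrete, the following minimal sketch (our illustration, not from the paper) computes the regret for the linear special case. We assume \(\mathcal{P}\) is the unit ball, so the best fixed decision in hindsight has a closed form; the function name and the zero-playing learner are hypothetical.

```python
import numpy as np

def regret_linear(decisions, cost_vectors):
    """Regret for online linear optimization over the unit ball.

    For linear costs f_1, ..., f_T and P = {x : ||x||_2 <= 1}, the best
    fixed decision in hindsight is x* = -F/||F||_2 with F = sum_t f_t,
    so the comparator term min_x sum_t f_t^T x equals -||F||_2.
    """
    F = np.sum(cost_vectors, axis=0)
    learner_cost = sum(float(f @ x) for f, x in zip(cost_vectors, decisions))
    best_fixed_cost = -np.linalg.norm(F)
    return learner_cost - best_fixed_cost

# A trivial learner that always plays 0 incurs zero cost, so its regret
# is exactly ||sum_t f_t||_2.
rng = np.random.default_rng(0)
fs = [rng.standard_normal(3) for _ in range(100)]
xs = [np.zeros(3) for _ in range(100)]
print(regret_linear(xs, fs))
```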

Over the past decade, many algorithms have been proposed for online convex optimization, especially for online linear optimization. In the first seminal paper on online convex optimization, Zinkevich (2003) proposed a gradient descent algorithm with a regret bound of \(O(\sqrt{T})\). When the cost functions are strongly convex, the regret bound of the online gradient descent algorithm is reduced to O(logT) with an appropriately chosen step size (Hazan et al. 2007). Another common methodology for online convex optimization, especially for online linear optimization, is based on the framework of Follow the Leader (FTL) (Kalai and Vempala 2005). FTL chooses \(\mathbf{x}_t\) by minimizing the cost incurred in all previous trials. Since the naive FTL algorithm fails to achieve sublinear regret in the worst case, many variants have been developed to fix the problem, including Follow the Perturbed Leader (FTPL) (Kalai and Vempala 2005), Follow the Regularized Leader (FTRL) (Abernethy et al. 2008), and Follow the Approximate Leader (FTAL) (Hazan et al. 2007). Other methodologies for online convex optimization introduce a potential function (or link function) to map solutions between the space of primal variables and the space of dual variables, and carry out a primal-dual update based on the potential function. The well-known Exponentiated Gradient (EG) algorithm (Kivinen and Warmuth 1995) and the multiplicative weights algorithm (Freund and Schapire 1995) belong to this category. We note that these different algorithms are closely related. For example, in online linear optimization, the potential-based primal-dual algorithm is equivalent to the FTRL algorithm (Hazan and Kale 2010).

Generally, most previous studies of online convex optimization bound the regret in terms of the number of trials T. However, it is expected that the regret should be low in an unchanging environment or when the cost functions are somehow correlated. Specifically, the tightest rate for the regret should depend on the variance of the sequence of cost functions rather than the number of rounds T. An algorithm with regret in terms of variation of the cost functions is expected to perform better when the cost sequence has low variation. Therefore, it is of great interest to derive a variation-based regret bound for online convex optimization in an adversarial setting.

Since it has been established that the regret of a natural algorithm in a stochastic setting, i.e. when the cost functions are generated by a stationary stochastic process, can be bounded by the total variation in the cost functions (Hazan and Kale 2010), devising an algorithm for the fully adversarial setting, i.e. when the cost functions are chosen completely adversarially, was posed as an open problem in Cesa-Bianchi et al. (2005). The question is whether it is possible to bound the regret of an online optimization algorithm by the variation of the observed cost functions. Recently, Hazan and Kale (2010) made substantial progress in this direction and proved a variation-based regret bound for online linear optimization by a tight analysis of the FTRL algorithm with an appropriately chosen step size. A similar regret bound is shown in the same paper for prediction from expert advice by slightly modifying the multiplicative weights algorithm. In this work, we take one step further in this research direction by developing algorithms for the general framework of online convex optimization with variation-based regret bounds.

A preliminary version of this paper appeared at the Conference on Learning Theory (COLT) (Chiang et al. 2012) as the result of a merge between two papers, as discussed in a commentary by Kale (2012). We highlight the differences between this paper and Chiang et al. (2012). We first motivate the definition of gradual variation by showing that the total variation bound in Hazan and Kale (2010) is not necessarily small when the cost functions change slowly, and that the gradual variation is smaller than the total variation. We then extend FTRL to achieve a regret bounded by the gradual variation in Sect. 2.1. The second algorithm in Sect. 2.2 is similar to the one that appeared in Chiang et al. (2012). Sections 3 and 4 contain generalizations of the second algorithm to non-smooth functions and a bandit setting, respectively, which appear in this paper for the first time.

In the remainder of this section, we first present the results from Hazan and Kale (2010) for online linear optimization and then discuss their potential limitations when applied to online convex optimization, which motivates our work.

1.1 Online linear optimization

Many decision problems can be cast as online linear optimization problems, such as prediction from expert advice (Cesa-Bianchi and Lugosi 2006) and the online shortest path problem (Takimoto and Warmuth 2003). Hazan and Kale (2010) proved the first variation-based regret bound for online linear optimization in an adversarial setting. Their algorithm is based on the FTRL framework; for completeness, it is shown in Algorithm 1. At each trial, the decision vector \(\mathbf{x}_t\) is obtained by solving the following optimization problem:
$$ \mathbf{x}_t=\arg\min_{\mathbf{x}\in \mathcal{P}} \sum _{\tau=1}^{t-1}\mathbf{f}_\tau^{\top} \mathbf{x}+ \frac {1}{2\eta}\|\mathbf{x}\|_2^2, $$
where \(\mathbf{f}_t\) is the cost vector received at trial t after predicting the decision \(\mathbf{x}_t\), and η is a step size. They bound the regret by the variation of the cost vectors, defined as
$$\begin{aligned} \mbox{VAR}_T=\sum_{t=1}^T\| \mathbf{f}_t-\mu\|_2^2, \end{aligned}$$
(1)
where \(\mu=(1/T)\sum_{t=1}^{T} \mathbf{f}_{t}\). By assuming \(\|\mathbf{f}_t\|_2\leq 1, \forall t\), and setting the step size to \(\eta=\min(2/\sqrt{\mathrm{VAR}_{T}}, 1/6)\), they showed that the regret of Algorithm 1 can be bounded by
$$ \sum_{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}_t -\min_{\mathbf{x}\in \mathcal{P}}\sum _{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}\leq \left \{ \begin{array}{l@{\quad}l} 15\sqrt{\mathrm{VAR}_T}& \mathrm{if}\ \sqrt{\mathrm{VAR}_T}\geq12,\\ 150& \mathrm{if}\ \sqrt{\mathrm{VAR}_T}\leq12. \end{array} \right . $$
(2)
From (2), we see that when the variation of the cost vectors is small (i.e. \(\sqrt{\mathrm{VAR}_T}\leq 12\)), the regret is a constant; otherwise it is bounded by the variation, \(O(\sqrt{\mathrm{VAR}_T})\). This result indicates that online linear optimization in the adversarial setting is as efficient as in the stationary stochastic setting.
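Under the same unit-ball assumption as before, the FTRL update above has a closed form: completing the square turns the objective into a Euclidean projection of \(-\eta\sum_{\tau<t}\mathbf{f}_\tau\) onto \(\mathcal{P}\). The following sketch is our illustration of this update, not the authors' code.

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def ftrl_linear(cost_vectors, eta):
    """FTRL for online linear optimization over the unit ball.

    argmin_{x in P} sum_tau f_tau^T x + ||x||^2 / (2*eta) equals, after
    completing the square, proj_P(-eta * sum_tau f_tau).
    """
    F = np.zeros_like(cost_vectors[0], dtype=float)
    decisions = []
    for f in cost_vectors:
        decisions.append(project_unit_ball(-eta * F))  # x_t uses f_1..f_{t-1}
        F += f                                         # f_t revealed afterwards
    return decisions

fs = [np.array([1.0, 0.0])] * 5      # an unchanging (zero-variation) sequence
xs = ftrl_linear(fs, eta=0.5)
print(xs[-1])                        # drifts to the boundary point -f/||f||
```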
Algorithm 1

Follow The Regularized Leader for Online Linear Optimization

1.2 Online convex optimization

Online convex optimization generalizes online linear optimization by replacing linear cost functions with non-linear convex cost functions. It has found applications in several domains, including portfolio management (Agarwal et al. 2006) and online classification (Kivinen et al. 2004). For example, in the online portfolio management problem, an investor wants to distribute his wealth over a set of stocks without knowing the market outcome in advance. If we let \(\mathbf{x}_t\) denote the distribution over the stocks and \(\mathbf{r}_t\) denote the price relative vector, i.e. \(\mathbf{r}_t[i]\) denotes the ratio of the closing price of stock i on day t to its closing price on day t−1, then a function of interest is the logarithmic growth ratio, i.e. \(\sum_{t=1}^{T} \log(\mathbf{x}_{t}^{\top }\mathbf{r}_{t})\), which is a concave function to be maximized. Similar to Hazan and Kale (2010), we aim to develop algorithms for online convex optimization with regret bounded by the variation in the cost functions.

Before presenting our algorithms, we first show that directly applying the FTRL algorithm to general online convex optimization may not achieve the desired result. To extend FTRL to online convex optimization, a straightforward approach is to use the first-order approximation of the convex cost function, i.e., \(c_t(\mathbf{x})\simeq c_t(\mathbf{x}_t)+\nabla c_t(\mathbf{x}_t)^{\top}(\mathbf{x}-\mathbf{x}_t)\), and replace the cost vector \(\mathbf{f}_t\) in Algorithm 1 with the gradient of the cost function \(c_t(\cdot)\) at \(\mathbf{x}_t\), i.e., \(\mathbf{f}_t=\nabla c_t(\mathbf{x}_t)\). The resulting algorithm is shown in Algorithm 2. Using the convexity of \(c_t(\cdot)\), we have
$$ \sum_{t=1}^T c_t(\mathbf{x}_t) -\min_{\mathbf{x}\in \mathcal{P}}\sum _{t=1}^T c_t(\mathbf{x})\leq\sum _{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}_t -\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}. $$
(3)
If we assume \(\|\nabla c_{t}(\mathbf{x})\|_{2}\leq1, \forall t, \forall \mathbf{x}\in \mathcal{P}\), we can apply Hazan and Kale’s variation-based bound in (2) to bound the regret in (3) by the variation
$$ \mathrm{VAR}_{T}=\sum_{t=1}^T \|\mathbf{f}_t -\mu\|_2^2 = \sum _{t=1}^T\Biggl \Vert \nabla c_t( \mathbf{x}_t) - \frac{1}{T}\sum_{\tau=1}^T \nabla c_\tau(\mathbf{x}_\tau )\Biggr \Vert _2^2. $$
(4)
To better understand \(\mathrm{VAR}_T\) in (4), we rewrite it as
$$\begin{aligned} \mbox{VAR}_T =&\sum_{t=1}^T \Biggl \Vert \nabla c_t(\mathbf{x}_t) - \frac{1}{T}\sum_{\tau=1}^T \nabla c_\tau(\mathbf{x}_\tau)\Biggr \Vert _2^2 = \frac{1}{2T} \sum_{t,\tau= 1}^T \bigl\|\nabla c_t(\mathbf{x}_t) - \nabla c_\tau(\mathbf{x}_\tau)\bigr\|_2^2 \\ \leq &\frac{1}{T} \sum_{t=1}^T \sum_{\tau=1}^T \bigl\|\nabla c_t(\mathbf{x}_t) - \nabla c_t(\mathbf{x}_\tau)\bigr\|_2^2 + \frac{1}{T} \sum_{t=1}^T \sum_{\tau=1}^T \bigl\|\nabla c_t(\mathbf{x}_\tau) - \nabla c_\tau(\mathbf{x}_\tau)\bigr\|_2^2 \\ = &\mbox{VAR}^1_T + \mbox{VAR}_T^2. \end{aligned}$$
(5)
We see that the variation \(\mathrm{VAR}_T\) is bounded by two parts: \(\mathrm{VAR}^1_T\) essentially measures the smoothness of the individual cost functions, while \(\mathrm{VAR}^2_T\) measures the variation in the gradients of the cost functions. Consider an easy setting where all cost functions are identical. In this case, \(\mathrm{VAR}^2_T\) vanishes and \(\mathrm{VAR}_T\) equals \(\mathrm{VAR}^1_T/2\), i.e.,
$$\begin{aligned} \mbox{VAR}_T =& \frac{1}{2T} \sum_{t,\tau= 1}^T \bigl\|\nabla c_t(\mathbf{x}_t) - \nabla c_\tau(\mathbf{x}_\tau)\bigr\|_2^2= \frac{1}{2T} \sum_{t,\tau= 1}^T \bigl\|\nabla c_t(\mathbf{x}_t) - \nabla c_t(\mathbf{x}_\tau)\bigr\|_2^2 \\ = &\frac{\mbox{VAR}_T^1}{2}. \end{aligned}$$
As a result, the regret of the FTRL algorithm for online convex optimization may still be bounded only by \(O(\sqrt{T})\), regardless of the smoothness of the cost functions.
Algorithm 2

Follow The Regularized Leader (FTRL) for Online Convex Optimization

To address this challenge, we develop two novel algorithms for online convex optimization that bound the regret by the variation of cost functions. In particular, we would like to bound the regret of online convex optimization by the variation of cost functions defined as follows
$$\begin{aligned} \mathrm{GV}_{T} = \sum_{t=1}^{T-1} \max\limits _{\mathbf{x}\in \mathcal{P}} \bigl\|\nabla c_{t+1}(\mathbf{x}) - \nabla c_{t}( \mathbf{x})\bigr\|_2^2. \end{aligned}$$
(6)
Note that the variation in (6) is defined in terms of the gradual difference between each cost function and its predecessor, while the variation in (1) (Hazan and Kale 2010) is defined in terms of the total difference between individual cost vectors and their mean. We therefore refer to the variation defined in (6) as gradual variation,1 and to the variation defined in (1) as total variation. It is straightforward to show that when \(c_{t}(\mathbf{x}) = \mathbf{f}_{t}^{\top} \mathbf{x}\), the gradual variation GV T defined in (6) is upper bounded by the total variation VAR T defined in (1) up to a constant factor:
$$ \sum_{t=1}^{T-1}\|\mathbf{f}_{t+1} - \mathbf{f}_t\|_2^2 \leq\sum_{t=1}^{T-1} \bigl(2\|\mathbf{f}_{t+1} - \mu \|_2^2 + 2\|\mathbf{f}_t - \mu\|_2^2\bigr) \leq4 \sum_{t=1}^{T}\|\mathbf{f}_t-\mu \|_2^2. $$
On the other hand, we cannot bound the total variation by the gradual variation up to a constant. This is verified by the following example: \(\mathbf{f}_1=\cdots=\mathbf{f}_{T/2}=\mathbf{f}\) and \(\mathbf{f}_{T/2+1}=\cdots=\mathbf{f}_T=\mathbf{g}\neq \mathbf{f}\). The total variation in (1) is given by
$$\mbox{VAR}_T= \sum_{t=1}^T\| \mathbf{f}_t -\mu\|_2^2 = \frac{T}{2}\biggl \Vert \mathbf{f}- \frac{\mathbf{f}+ \mathbf{g}}{2}\biggr \Vert _2^2 + \frac{T}{2}\biggl \Vert \mathbf{g} - \frac{\mathbf{f}+ \mathbf{g}}{2}\biggr \Vert _2^2 = \varOmega(T), $$
while the gradual variation defined in (6) is a constant given by
$$ \mathrm{GV}_T = \sum_{t=1}^{T-1} \|\mathbf{f}_{t+1} - \mathbf{f}_t\| _ 2^2 = \|\mathbf{f}- \mathbf{g}\|_2^2 = O(1). $$
Based on the above analysis, we claim that a regret bound in terms of the gradual variation is usually tighter than one in terms of the total variation. Before closing this section, we present the following lower bound for FTRL, whose proof can be found in Chiang et al. (2012).
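The two-block example above can be checked numerically. The script below (our illustration) evaluates both quantities for the hypothetical choice \(\mathbf{f}=(1,0)\), \(\mathbf{g}=(0,1)\) and T=1000, where \(\mathrm{VAR}_T = T\|\mathbf{f}-\mathbf{g}\|_2^2/4 = T/2\) while \(\mathrm{GV}_T = \|\mathbf{f}-\mathbf{g}\|_2^2 = 2\).

```python
import numpy as np

def total_variation(fs):
    """VAR_T in (1): squared distances of the cost vectors to their mean."""
    mu = np.mean(fs, axis=0)
    return sum(np.linalg.norm(f - mu) ** 2 for f in fs)

def gradual_variation(fs):
    """GV_T in (6) for linear costs: squared successive differences."""
    return sum(np.linalg.norm(fs[t + 1] - fs[t]) ** 2
               for t in range(len(fs) - 1))

T = 1000
f, g = np.array([1.0, 0.0]), np.array([0.0, 1.0])
fs = [f] * (T // 2) + [g] * (T // 2)
print(total_variation(fs))    # grows linearly in T: here T/2 = 500
print(gradual_variation(fs))  # constant: ||f - g||^2 = 2
```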

Theorem 1

The regret of FTRL is at least \(\varOmega(\min(\mathrm{GV}_{T}, \sqrt{T}))\).

The theorem motivates us to develop new algorithms for online convex optimization to achieve a gradual variation bound of \(O(\sqrt{\mathrm{GV}_{T}})\).

The remainder of the paper is organized as follows. We present in Sect. 2 the proposed algorithms and the main results. Section 3 generalizes the results to a special class of non-smooth functions composed of a smooth and a non-smooth component. Section 4 is devoted to extending the proposed algorithms to the setting where only partial feedback about the cost functions is available, i.e., online bandit convex optimization with a variation-based regret bound. Finally, in Sect. 5, we conclude this work and discuss a few open problems.

2 Algorithms and main results

Without loss of generality, we assume that the decision set \(\mathcal{P}\) is contained in the unit ball \(\mathcal{B}\), i.e., \(\mathcal{P}\subseteq\mathcal{B}\), and that \(0\in \mathcal{P}\) (Hazan and Kale 2010). We propose two algorithms for online convex optimization. The first algorithm is an improved FTRL and the second is based on the mirror prox method (Nemirovski 2005). One common feature shared by the two algorithms is that both maintain two sequences of solutions: decision vectors \(\mathbf{x}_{1:T}=(\mathbf{x}_1,\ldots,\mathbf{x}_T)\) and searching vectors \(\mathbf{z}_{1:T}=(\mathbf{z}_1,\ldots,\mathbf{z}_T)\) that facilitate the updates of the decision vectors. Both algorithms have almost the same regret bound, up to a constant factor. To facilitate the discussion, besides the variation of the cost functions defined in (6), we define another variation, named the extended gradual variation, as follows:
$$ \mathrm{EGV}_{T,2}(\mathbf{y}_{1:T})=\sum _{t=0}^{T-1} \bigl\|\nabla c_{t+1}(\mathbf{y}_t)-\nabla c_{t}(\mathbf{y}_t)\bigr\|_2^2 \leq \bigl\|\nabla c_1(\mathbf{y}_0)\bigr\|_2^2 + \mathrm{GV}_T, $$
(7)
where \(c_0(\mathbf{x})=0\), the sequence \((\mathbf{y}_0,\ldots,\mathbf{y}_T)\) is either \((\mathbf{z}_0,\ldots,\mathbf{z}_T)\) (as in the improved FTRL) or \((\mathbf{x}_0,\ldots,\mathbf{x}_T)\) (as in the prox method), and the subscript 2 indicates that the variation is defined with respect to the \(\ell_2\) norm. When all cost functions are identical, \(\mathrm{GV}_T\) becomes zero and the extended variation \(\mathrm{EGV}_{T,2}(\mathbf{y}_{1:T})\) reduces to \(\|\nabla c_{1}(\mathbf{y}_{0})\|_{2}^{2}\), a constant independent of the number of trials. In the sequel, we write \(\mathrm{EGV}_{T,2}\) for simplicity. In this study, we assume smooth cost functions with Lipschitz continuous gradients, i.e., there exists a constant L>0 such that
$$ \bigl\|\nabla c_t(\mathbf{x})-\nabla c_t(\mathbf{z}) \bigr\|_2\leq L\|\mathbf{x}-\mathbf{z}\|_2,\quad \forall \mathbf{x}, \mathbf{z}\in \mathcal{P}, \forall t. $$
(8)
Our results show that for online convex optimization with L-smooth cost functions, the regrets of the proposed algorithms can be bounded as follows
$$ \sum_{t=1}^Tc_t( \mathbf{x}_t) -\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x})\leq O (\sqrt{\mathrm{EGV}_{T,2}} ) + \mathrm{constant}. $$
(9)

Remark 1

We would like to emphasize that our assumption about the smoothness of the cost functions is necessary2 to achieve the variation-based bound stated in (9). To see this, consider the special case of \(c_1(\mathbf{x})=\cdots=c_T(\mathbf{x})=c(\mathbf{x})\). If the bound in (9) held for any sequence of convex functions, then in this special case where all the cost functions are identical, we would have
$$ \sum_{t=1}^T c(\mathbf{x}_t) \leq\min _{\mathbf{x}\in \mathcal{P}} \sum_{t=1}^T c(\mathbf{x}) + O(1), $$
implying that \(\widehat{\mathbf{x}}_{T} = (1/T)\sum_{t=1}^{T} \mathbf{x}_{t}\) approaches the optimal solution at a rate of O(1/T). This contradicts the lower complexity bound (i.e. \(\varOmega(1/\sqrt{T})\)) for any optimization method that uses only first-order information about the cost functions (Nesterov 2004, Theorem 3.2.1). This analysis indicates that the smoothness assumption is necessary to attain a variation-based regret bound for the general online convex optimization problem. We emphasize that this contradiction holds only when just the gradient information about the cost functions is provided to the learner; the learner may be able to achieve a variation-based bound using second-order information about the cost functions, which is not the focus of the present work.
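The averaging step in this argument can be spelled out. By the convexity of c and Jensen's inequality,

```latex
c(\widehat{\mathbf{x}}_T)
  = c\Biggl(\frac{1}{T}\sum_{t=1}^T \mathbf{x}_t\Biggr)
  \;\leq\; \frac{1}{T}\sum_{t=1}^T c(\mathbf{x}_t)
  \;\leq\; \min_{\mathbf{x}\in\mathcal{P}} c(\mathbf{x}) + \frac{O(1)}{T},
```

where the second inequality is the assumed regret bound divided by T.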

2.1 An improved FTRL algorithm for online convex optimization

The improved FTRL algorithm for online convex optimization is presented in Algorithm 3. Note that in step 6, the searching vectors \(\mathbf{z}_t\) are updated according to the FTRL scheme after receiving the cost function \(c_t(\cdot)\). To understand the updating procedure for the decision vector \(\mathbf{x}_t\) specified in step 4, we rewrite it as
$$ \mathbf{x}_t = \mathop{\arg\min}\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{ c_{t-1}( \mathbf{z}_{t-1})+ (\mathbf{x}- \mathbf{z}_{t-1})^{\top}\nabla c_{t-1}( \mathbf{z}_{t-1}) + \frac{L}{2\eta}\|\mathbf{x}- \mathbf{z}_{t-1} \|_2^2 \biggr\}. $$
(10)
Notice that
$$\begin{aligned} c_t(\mathbf{x}) \leq &c_t(\mathbf{z}_{t-1}) + ( \mathbf{x}-\mathbf{z}_{t-1})^{\top}\nabla c_{t}(\mathbf{z}_{t-1}) + \frac{L}{2}\|\mathbf{x}-\mathbf{z}_{t-1}\|_2^2 \\ \leq & c_t(\mathbf{z}_{t-1}) + (\mathbf{x}-\mathbf{z}_{t-1})^{\top} \nabla c_{t}(\mathbf{z}_{t-1}) + \frac {L}{2\eta}\|\mathbf{x}- \mathbf{z}_{t-1}\|_2^2, \end{aligned}$$
(11)
where the first inequality follows from the smoothness condition in (8) and the second from the fact that η≤1. The inequality (11) provides an upper bound on \(c_t(\mathbf{x})\) and can therefore be used as an approximation of \(c_t(\mathbf{x})\) for predicting \(\mathbf{x}_t\). However, since \(\nabla c_t(\mathbf{z}_{t-1})\) is unknown before the prediction, we use \(\nabla c_{t-1}(\mathbf{z}_{t-1})\) as a surrogate for \(\nabla c_t(\mathbf{z}_{t-1})\), leading to the updating rule in (10). It is this approximation that leads to the variation bound. The following theorem states the regret bound of Algorithm 3.
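The two updates of the improved FTRL algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: we assume \(\mathcal{P}\) is the unit ball, so both quadratic minimizations (step 4, and the FTRL objective in step 6) reduce to projections, and `grads(t, z)` is a hypothetical oracle returning \(\nabla c_t(\mathbf{z})\).

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def improved_ftrl(grads, T, d, eta, L):
    """Sketch of the improved FTRL updates over the unit ball.

    Step 4: x_t = proj_P(z_{t-1} - (eta/L) * grad c_{t-1}(z_{t-1})), the
    closed form of the gradient-mapping minimization in (10).
    Step 6: z_t = proj_P(-(eta/L) * sum_{tau<=t} grad c_tau(z_{tau-1})),
    the closed form of the FTRL objective when P is the unit ball.
    """
    z = np.zeros(d)        # z_0 = 0
    G = np.zeros(d)        # running sum of grad c_tau(z_{tau-1})
    decisions = []
    for t in range(1, T + 1):
        if t == 1:
            x = z.copy()   # x_1 = z_0
        else:
            x = project_unit_ball(z - (eta / L) * grads(t - 1, z))  # step 4
        decisions.append(x)
        G += grads(t, z)   # c_t revealed; gradient taken at z_{t-1}
        z = project_unit_ball(-(eta / L) * G)                       # step 6
    return decisions

# Example: identical quadratic costs c_t(x) = ||x - a||^2 / 2 (L = 1);
# the iterates settle at the common minimizer a.
a = np.array([0.5, 0.0])
xs = improved_ftrl(lambda t, z: z - a, T=10, d=2, eta=1.0, L=1.0)
print(xs[-1])
```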
Algorithm 3

Improved FTRL for Online Convex Optimization

Theorem 2

Let \(c_t(\cdot), t=1,\ldots,T\) be a sequence of convex functions with L-Lipschitz continuous gradients. By setting \(\eta= \min \{1, L/\sqrt{\mathrm{EGV}_{T,2}} \}\), we have the following regret bound for Algorithm 3
$$\sum_{t=1}^Tc_t(\mathbf{x}_t) -\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \leq\max (L, \sqrt{\mathrm{EGV}_{T,2}} ). $$

Remark 2

Compared with the variation bound in (5) for the FTRL algorithm, the smoothness parameter L plays the same role as \(\mathrm{VAR}^1_T\), which accounts for the smoothness of the cost functions, and the term \(\mathrm{EGV}_{T,2}\) plays the same role as \(\mathrm{VAR}^2_T\), which accounts for the variation in the cost functions. Compared to the FTRL algorithm, the key advantage of the improved FTRL algorithm is that the regret bound reduces to a constant when the cost functions change only a constant number of times over the horizon. Of course, the extended variation \(\mathrm{EGV}_{T,2}\) may not be known a priori for setting the optimal η; in that case we can apply the standard doubling trick (Cesa-Bianchi and Lugosi 2006) to obtain a bound that holds uniformly over time and is at most a factor of 8 from the bound obtained with the optimal choice of η. The details are provided in Appendix A.
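To illustrate the idea, here is a hedged sketch of the generic doubling trick: guess a budget V for the variation, set η from the guess, and restart with V doubled once the observed variation exceeds it. The restart rule, initial budget, and function name below are our illustration of the standard trick, not the exact procedure in Appendix A.

```python
import math

def doubling_trick_etas(egv_increments, L=1.0):
    """Return a (start_trial, eta) schedule for an unknown EGV_{T,2}.

    egv_increments[t-1] is the variation increment observed at trial t,
    i.e. ||grad c_t(y_{t-1}) - grad c_{t-1}(y_{t-1})||_2^2. Each epoch
    runs with eta = min(1, L / sqrt(V)); once the accumulated variation
    exceeds the budget V, the budget doubles and a new epoch starts.
    """
    V = max(L ** 2, 1.0)                  # initial guess for the budget
    observed = 0.0
    schedule = [(0, min(1.0, L / math.sqrt(V)))]
    for t, inc in enumerate(egv_increments, start=1):
        observed += inc
        while observed > V:               # budget exceeded: double, restart
            V *= 2.0
            schedule.append((t, min(1.0, L / math.sqrt(V))))
    return schedule

schedule = doubling_trick_etas([0.5] * 10)
print(schedule)   # eta shrinks by 1/sqrt(2) at each restart
```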

To prove Theorem 2, we first present the following lemma.

Lemma 1

Let \(c_t(\cdot), t=1,\ldots,T\) be a sequence of convex functions with L-Lipschitz continuous gradients. By running Algorithm 3 over T trials, we have
$$\begin{aligned} \sum_{t=1}^T c_t( \mathbf{x}_t) \leq &\min\limits _{\mathbf{x}\in \mathcal{P}} \Biggl[\frac{L}{2\eta}\|\mathbf{x}\|_2^2 + \sum_{t=1}^T c_t(\mathbf{z}_{t-1}) + (\mathbf{x}- \mathbf{z}_{t-1})^{\top}\nabla c_t(\mathbf{z}_{t-1}) \Biggr] \\ &{}+ \frac{\eta}{2L} \sum_{t=0}^{T-1} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}( \mathbf{z}_t)\bigr\|_2^2. \end{aligned}$$

With this lemma, we can easily prove Theorem 2 by exploiting the convexity of \(c_t(\mathbf{x})\).

Proof of Theorem 2

By using \(\|\mathbf{x}\|_{2}\leq1,\forall \mathbf{x}\in \mathcal{P}\subseteq\mathcal{B}\), and the convexity of c t (x), we have
$$\min\limits _{\mathbf{x}\in \mathcal{P}} \Biggl\{\frac{L}{2\eta}\|\mathbf{x}\|_2^2 + \sum_{t=1}^T c_t( \mathbf{z}_{t-1}) + (\mathbf{x}- \mathbf{z}_{t-1})^{\top}\nabla c_t( \mathbf{z}_{t-1}) \Biggr\}\leq\frac{L}{2\eta}+\min_{\mathbf{x}\in \mathcal{P}}\sum _{t=1}^Tc_t(\mathbf{x}). $$
Combining the above result with Lemma 1, we have
$$ \sum_{t=1}^Tc_t( \mathbf{x}_t)-\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^T c_t(\mathbf{x})\leq\frac {L}{2\eta} + \frac{\eta}{2L}{ \mathrm{EGV}}_{T,2}. $$
By choosing \(\eta=\min(1, L/\sqrt{\mathrm{EGV}_{T,2}})\), we have the regret bound claimed in Theorem 2. □
Lemma 1 is proved by induction. The key to the proof is that \(\mathbf{z}_t\) is the optimal solution to the strongly convex minimization problem in Lemma 1, i.e.,
$$ \mathbf{z}_t = \arg\min\limits _{\mathbf{x}\in \mathcal{P}} \Biggl[\frac{L}{2\eta}\|\mathbf{x}\|_2^2 + \sum_{\tau=1}^t c_\tau(\mathbf{z}_{\tau-1}) + (\mathbf{x}- \mathbf{z}_{\tau-1})^{\top} \nabla c_\tau(\mathbf{z}_{\tau-1}) \Biggr] $$

Proof of Lemma 1

We prove the inequality by induction. When T=1, we have \(\mathbf{x}_1=\mathbf{z}_0=0\) and
$$\begin{aligned} & \min\limits _{\mathbf{x}\in \mathcal{P}} \biggl[\frac{L}{2\eta}\|\mathbf{x}\|_2^2 +c_1(\mathbf{z}_0)+ (\mathbf{x}- \mathbf{z}_0)^{\top}\nabla c_1(\mathbf{z}_0) \biggr] + \frac{\eta}{2L} \bigl\|\nabla c_1(\mathbf{z}_0)\bigr\|_2^2 \\ &\quad {} \geq c_1(\mathbf{z}_0) + \frac{\eta}{2L} \bigl\|\nabla c_1(\mathbf{z}_0)\bigr\|_2^2 + \min\limits _{\mathbf{x}} \biggl\{\frac{L}{2\eta}\|\mathbf{x}\|_2^2 + (\mathbf{x}- \mathbf{z}_0)^{\top}\nabla c_1(\mathbf{z}_0) \biggr \} = c_1(\mathbf{z}_0) = c_1(\mathbf{x}_1), \end{aligned}$$
where the inequality follows by relaxing the minimization domain \(\mathbf{x}\in \mathcal{P}\) to the whole space. We assume the inequality holds for t and aim to prove it for t+1. To this end, we define
$$\begin{aligned} \psi_t(\mathbf{x}) =& \Biggl[\frac{L}{2\eta}\|\mathbf{x}\|_2^2 + \sum_{\tau=1}^{t} c_\tau( \mathbf{z}_{\tau-1}) + (\mathbf{x}- \mathbf{z}_{\tau-1})^{\top}\nabla c_\tau(\mathbf{z}_{\tau-1}) \Biggr] \\ &{}+ \frac{\eta}{2L} \sum_{\tau=0}^{t - 1} \bigl\|\nabla c_{\tau+1}(\mathbf{z}_\tau) - \nabla c_{\tau}( \mathbf{z}_\tau)\bigr\|_2^2. \end{aligned}$$
According to the updating procedure for \(\mathbf{z}_t\) in step 6, we have \(\mathbf{z}_{t}=\arg\min_{\mathbf{x}\in \mathcal{P}}\psi_{t}(\mathbf{x})\). Define \(\phi_{t} = \psi_{t}(\mathbf{z}_{t}) = \min_{\mathbf{x}\in \mathcal{P}}\psi_{t}(\mathbf{x})\). Since \(\psi_t(\mathbf{x})\) is an \((L/\eta)\)-strongly convex function, we have
$$\begin{aligned} \psi_{t+1}(\mathbf{x})-\psi_{t+1}(\mathbf{z}_{t}) \geq & \frac{L}{2\eta}\|\mathbf{x}-\mathbf{z}_{t}\|_2^2 + (\mathbf{x}- \mathbf{z}_t)^{\top} \nabla\psi_{t+1}(\mathbf{z}_t) \\ = &\frac{L}{2\eta}\|\mathbf{x}-\mathbf{z}_{t}\|_2^2 + (\mathbf{x}- \mathbf{z}_t)^{\top} \bigl(\nabla\psi _{t}( \mathbf{z}_t)+\nabla c_{t+1}(\mathbf{z}_t) \bigr). \end{aligned}$$
Setting \(\mathbf{x}=\mathbf{z}_{t+1}=\arg\min_{\mathbf{x}\in \mathcal{P}}\psi_{t+1}(\mathbf{x})\) in the above inequality results in
$$\begin{aligned} \psi_{t+1}(\mathbf{z}_{t+1}) - \psi_{t+1}(\mathbf{z}_t) =& \phi_{t+1} - \biggl(\phi_t + c_{t+1}( \mathbf{z}_t) + \frac{\eta}{2L} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_t(\mathbf{z}_t)\bigr\|_2^2\biggr) \\ \geq&\frac{L}{2\eta}\|\mathbf{z}_{t+1}-\mathbf{z}_{t} \|_2^2 + (\mathbf{z}_{t+1}-\mathbf{z}_t)^{\top} \bigl(\nabla\psi_{t}(\mathbf{z}_t)+\nabla c_{t+1}( \mathbf{z}_t) \bigr) \\ \geq&\frac{L}{2\eta}\|\mathbf{z}_{t+1}-\mathbf{z}_{t} \|_2^2 + (\mathbf{z}_{t+1}-\mathbf{z}_t)^{\top } \nabla c_{t+1}(\mathbf{z}_t), \end{aligned}$$
where the second inequality follows from the fact that \(\mathbf{z}_{t}=\arg\min_{\mathbf{x}\in \mathcal{P}}\psi_{t}(\mathbf{x})\), and therefore \((\mathbf{x}-\mathbf{z}_{t})^{\top}\nabla\psi_{t}(\mathbf{z}_{t})\geq 0, \forall \mathbf{x}\in \mathcal{P}\). Moving \(c_{t+1}(\mathbf{z}_t)\) in the above inequality to the right-hand side, we have
$$\begin{aligned} &\phi_{t+1} - \phi_t - \frac{\eta}{2L} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_t(\mathbf{z}_t) \bigr\|_2^2 \\ &\quad {}\geq\frac{L}{2\eta}\|\mathbf{z}_{t+1} -\mathbf{z}_t \|_2^2 + (\mathbf{z}_{t+1} - \mathbf{z}_{t})^{\top } \nabla c_{t+1}(\mathbf{z}_t)+ c_{t+1}(\mathbf{z}_t) \\ &\quad {} \geq \min\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{ \frac{L}{2\eta}\|\mathbf{x}-\mathbf{z}_t \|_2^2 + (\mathbf{x}- \mathbf{z}_{t})^{\top}\nabla c_{t+1}(\mathbf{z}_t)+ c_{t+1}(\mathbf{z}_t) \biggr\} \\ &\quad {} = \min\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{ \underbrace{\frac{L}{2\eta}\|\mathbf{x}\,{-}\, \mathbf{z}_t\|_2^2 + (\mathbf{x}\,{-}\, \mathbf{z}_{t})^{\top} \nabla c_{t}(\mathbf{z}_t)}_{\rho(\mathbf{x})} \,{+}\ c_{t+1}( \mathbf{z}_t)+ \underbrace{(\mathbf{x}\,{-}\, \mathbf{z}_{t})^{\top}\bigl(\nabla c_{t+1}(\mathbf{z}_t) \,{-}\, \nabla c_{t}(\mathbf{z}_t) \bigr)} _{r(\mathbf{x})} \biggr\}. \end{aligned}$$
(12)
To bound the right-hand side, note that \(\mathbf{x}_{t+1}\) is the minimizer of \(\rho(\mathbf{x})\) by step 4 in Algorithm 3, and \(\rho(\mathbf{x})\) is an \((L/\eta)\)-strongly convex function, so we have
$$ \rho(\mathbf{x})\geq\rho(\mathbf{x}_{t+1}) + \underbrace{(\mathbf{x}-\mathbf{x}_{t+1})^{\top} \nabla\rho (\mathbf{x}_{t+1})} _{\geq0}+ \frac{L}{2\eta}\|\mathbf{x}- \mathbf{x}_{t+1}\|_2^2\geq\rho (\mathbf{x}_{t+1}) + \frac{L}{2\eta}\|\mathbf{x}-\mathbf{x}_{t+1}\|_2^2. $$
Then we have
$$\begin{aligned} &\rho(\mathbf{x}) + c_{t+1}(\mathbf{z}_t) + r(\mathbf{x}) \\ &\quad {}\geq \rho( \mathbf{x}_{t+1}) + c_{t+1}(\mathbf{z}_t) + \frac{L}{2\eta}\|\mathbf{x}- \mathbf{x}_{t+1}\|_2^2+ r(\mathbf{x}) \\ &\quad {}= \underbrace{ \frac{L}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{z}_t \|^2_2 + (\mathbf{x}_{t+1} - \mathbf{z}_t)^{\top} \nabla c_t(\mathbf{z}_t)} _{\rho(\mathbf{x}_{t+1})}\,+\ c_{t+1}( \mathbf{z}_t) +\frac{L}{2\eta}\|\mathbf{x}-\mathbf{x}_{t+1}\|_2^2+ r(\mathbf{x}) \end{aligned}$$
Plugging the above inequality into (12), we have
$$\begin{aligned} &\phi_{t+1} - \phi_t - \frac{\eta}{2L} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_t(\mathbf{z}_t) \bigr\|_2^2 \\ &\quad {}\geq \frac{L}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{z}_t \|^2_2 + (\mathbf{x}_{t+1} - \mathbf{z}_t)^{\top} \nabla c_t(\mathbf{z}_t)+ c_{t+1}(\mathbf{z}_t) \\ &\qquad {}+ \min\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{ \frac{L}{2\eta}\|\mathbf{x}- \mathbf{x}_{t+1} \|_2^2 + (\mathbf{x}- \mathbf{z}_t)^{\top}\bigl(\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}(\mathbf{z}_t) \bigr) \biggr\} \end{aligned}$$
To continue the bounding, we proceed as follows
$$\begin{aligned} & \phi_{t+1} - \phi_t - \frac{\eta}{2L} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_t(\mathbf{z}_t) \bigr\|_2^2 \\ &\quad {} \geq \frac{L}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{z}_t \|^2_2 + (\mathbf{x}_{t+1} - \mathbf{z}_t)^{\top} \nabla c_t(\mathbf{z}_t)+ c_{t+1}(\mathbf{z}_t) \\ &\qquad {} + \min\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{ \frac{L}{2\eta}\|\mathbf{x}- \mathbf{x}_{t+1} \|_2^2 + (\mathbf{x}- \mathbf{z}_t)^{\top}\bigl(\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}(\mathbf{z}_t) \bigr) \biggr\} \\ &\quad {} = \frac{L}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{z}_t\|^2_2 +(\mathbf{x}_{t+1} - \mathbf{z}_t)^{\top }\nabla c_{t+1}( \mathbf{z}_t) + c_{t+1}(\mathbf{z}_t) \\ &\qquad {} + \min\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{ \frac{L}{2\eta}\|\mathbf{x}-\mathbf{x}_{t+1} \|_2^2 + (\mathbf{x}- \mathbf{x}_{t+1})^{\top}\bigl( \nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}( \mathbf{z}_t)\bigr) \biggr\} \\ &\quad {} \geq \frac{L}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{z}_t \|^2_2 +(\mathbf{x}_{t+1} - \mathbf{z}_t)^{\top } \nabla c_{t+1}(\mathbf{z}_t)+ c_{t+1}(\mathbf{z}_t) \\ &\qquad {} + \min\limits _{\mathbf{x}} \biggl\{ \frac{L}{2\eta}\|\mathbf{x}-\mathbf{x}_{t+1} \|_2^2 + (\mathbf{x}- \mathbf{x}_{t+1})^{\top}\bigl( \nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}( \mathbf{z}_t)\bigr) \biggr\} \\ &\quad {} = \frac{L}{2\eta}\|\mathbf{x}_{t+1} - \mathbf{z}_t\|^2_2 + (\mathbf{x}_{t+1} - \mathbf{z}_t)^{\top}\nabla c_{t+1}( \mathbf{z}_t) + c_{t+1}(\mathbf{z}_t) - \frac{\eta}{2L} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}( \mathbf{z}_t)\bigr\|_2^2 \\ &\quad {} \geq c_{t+1}(\mathbf{x}_{t+1}) - \frac{\eta}{2L} \bigl\|\nabla c_{t+1}(\mathbf{z}_t) - \nabla c_{t}(\mathbf{z}_t) \bigr\|_2^2, \end{aligned}$$
where the first equality follows by writing \((\mathbf{x}_{t+1}-\mathbf{z}_t)^{\top}\nabla c_t(\mathbf{z}_t)=(\mathbf{x}_{t+1}-\mathbf{z}_t)^{\top}\nabla c_{t+1}(\mathbf{z}_t)-(\mathbf{x}_{t+1}-\mathbf{z}_t)^{\top}(\nabla c_{t+1}(\mathbf{z}_t)-\nabla c_{t}(\mathbf{z}_t))\) and combining it with \((\mathbf{x}-\mathbf{z}_t)^{\top}(\nabla c_{t+1}(\mathbf{z}_t)-\nabla c_{t}(\mathbf{z}_t))\), and the last inequality follows from the smoothness condition on \(c_{t+1}(\mathbf{x})\). Since by induction \(\phi_{t} \geq\sum_{\tau=1}^{t} c_{\tau}(\mathbf{x}_{\tau})\), we have \(\phi_{t+1} \geq\sum_{\tau=1}^{t+1} c_{\tau}(\mathbf{x}_{\tau})\). □

2.2 A prox method for online convex optimization

In this subsection, we present a prox method for online convex optimization that shares the same order of regret bound as the improved FTRL algorithm. It is closely related to the prox method in Nemirovski (2005) in that it maintains two sequences of vectors x 1:T and z 1:T , where x t and z t are computed by gradient mappings using ∇c t−1(x t−1) and ∇c t (x t ), respectively, as
$$\begin{aligned} \mathbf{x}_t =&\arg\min_{\mathbf{x}\in \mathcal{P}}\frac{1}{2}\biggl \Vert \mathbf{x}- \biggl(\mathbf{z}_{t-1}-\frac{\eta }{L}\nabla c_{t-1}( \mathbf{x}_{t-1}) \biggr)\biggr \Vert _2^2 \\ \mathbf{z}_t =&\arg\min_{\mathbf{x}\in \mathcal{P}}\frac{1}{2}\biggl \Vert \mathbf{x}- \biggl(\mathbf{z}_{t-1}-\frac{\eta }{L}\nabla c_{t}( \mathbf{x}_t) \biggr)\biggr \Vert _2^2 \end{aligned}$$
The detailed steps are shown in Algorithm 4, where we use an equivalent form of the updates for x t and z t to ease the comparison with Algorithm 3. Algorithm 4 differs from Algorithm 3 in two respects: (i) in updating the searching points z t , Algorithm 3 updates z t by the FTRL scheme using all the gradients of the cost functions at \(\{\mathbf{z}_{\tau}\}_{\tau= 1}^{t - 1}\), while Algorithm 4 updates z t by a prox method using only the single gradient ∇c t (x t ); and (ii) in updating the decision vector x t , Algorithm 4 uses the gradient ∇c t−1(x t−1) instead of ∇c t−1(z t−1). The advantage of Algorithm 4 over Algorithm 3 is that it requires computing only one gradient ∇c t (x t ) for each cost function; in contrast, the improved FTRL algorithm in Algorithm 3 needs to compute the gradients of c t (x) at two searching points z t and z t−1. These differences make it easier to extend the prox method to the bandit setting, which is discussed in Sect. 5.
Algorithm 4

A Prox Method for Online Convex Optimization
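For concreteness, the two gradient-mapping updates above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the function names are ours, and we instantiate the projection onto the Euclidean unit ball for the domain \(\mathcal{P}\).

```python
import numpy as np

def project_ball(v, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def prox_method(grad, T, dim, eta, L, project=project_ball):
    """Sketch of the prox method: one gradient evaluation per round.

    grad(t, x) should return the gradient of the cost c_t at x."""
    z = np.zeros(dim)            # searching point z_0
    g_prev = np.zeros(dim)       # stands in for the gradient of c_0
    decisions = []
    for t in range(1, T + 1):
        # x_t: gradient mapping from z_{t-1} using the *previous* gradient
        x = project(z - (eta / L) * g_prev)
        decisions.append(x.copy())
        # z_t: gradient mapping from z_{t-1} using the *current* gradient at x_t
        g_prev = grad(t, x)
        z = project(z - (eta / L) * g_prev)
    return decisions
```

On a stationary sequence of linear costs (zero gradual variation), the iterates quickly settle at the fixed minimizer, illustrating why the regret depends on the variation rather than on T.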

The following theorem states the regret bound of the prox method for online convex optimization.

Theorem 3

Let c t (⋅),t=1,…,T be a sequence of convex functions with L-Lipschitz continuous gradients. By setting \(\eta= (1/2)\min \{1/\sqrt{2}, L/\sqrt{\mathrm{EGV}_{T,2}} \} \), we have the following regret bound for Algorithm  4
$$\sum_{t=1}^Tc_t( \mathbf{x}_t)-\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \leq2\max (\sqrt{2}L, \sqrt{\mathrm{EGV}_{T,2}}\, ). $$

We note that compared to Theorem 2, the regret bound in Theorem 3 is slightly worse by a factor of 2.

To prove Theorem 3, we need the following lemma, which is Lemma 3.1 of Nemirovski (2005) stated in our notation.

Lemma 2

(Nemirovski 2005, Lemma 3.1)

Let ω(z) be an α-strongly convex function with respect to the norm ∥⋅∥, whose dual norm is denoted by ∥⋅∥ ∗ , and let D(x,z)=ω(x)−(ω(z)+(xz) ω′(z)) be the Bregman distance induced by the function ω(x). Let Z be a convex compact set, and let UZ be convex and closed. Let zZ and γ>0. Consider the points
$$\begin{aligned} \mathbf{x} =& \arg\min_{\mathbf{u}\in U} \gamma \mathbf{u}^{\top}\xi+ D(\mathbf{u}, \mathbf{z}) , \end{aligned}$$
(13)
$$\begin{aligned} \mathbf{z}_+ =&\arg\min_{\mathbf{u}\in U} \gamma \mathbf{u}^{\top}\zeta+ D(\mathbf{u},\mathbf{z}), \end{aligned}$$
(14)
then for any uU, we have
$$ \gamma\zeta^{\top}(\mathbf{x}-\mathbf{u})\leq D(\mathbf{u},\mathbf{z}) - D(\mathbf{u}, \mathbf{z}_+) + \frac{\gamma ^2}{\alpha}\|\xi-\zeta\|_*^2 - \frac{\alpha}{2} \bigl[\|\mathbf{x}-\mathbf{z}\|^2 + \|\mathbf{x}-\mathbf{z}_+\|^2\bigr]. $$
(15)

To spare readers the complex notation of Nemirovski (2005), we present a detailed proof in Appendix A, which is an adaptation of the original proof to our notation.

Theorem 3 can be proved by using the above Lemma, because the updates of x t ,z t can be written equivalently as (13) and (14), respectively. The proof below starts from (15) and bounds the summation of each term over t=1,…,T, respectively.

Proof of Theorem 3

First, we note that the two updates in step 4 and step 6 of Algorithm 4 fit into Lemma 2 if we let \(U=Z=\mathcal{P}\), z=z t−1, x=x t , z +=z t , and \(\omega(\mathbf{x})=\frac{1}{2}\|\mathbf{x}\| _{2}^{2}\), which is a 1-strongly convex function with respect to ∥⋅∥2. Then \(D(\mathbf{u},\mathbf{z})=\frac{1}{2}\|\mathbf{u}-\mathbf{z}\|_{2}^{2}\). As a result, the two updates for x t ,z t in Algorithm 4 are exactly the updates in (13) and (14) with z=z t−1, γ=η/L, ξ=∇c t−1(x t−1), and ζ=∇c t (x t ). Substituting these into (15), we have the following inequality for any \(\mathbf{u}\in \mathcal{P}\),
$$\begin{aligned} \frac{\eta}{L} (\mathbf{x}_t - \mathbf{u})^{\top}\nabla c_t( \mathbf{x}_t) \le& \frac{1}{2} \bigl(\|\mathbf{u}-\mathbf{z}_{t-1} \|_2^2 - \|\mathbf{u}-\mathbf{z}_{t} \|_2^2 \bigr) \\ &{}+ \frac{\eta^2}{L^2} \bigl\|\nabla c_{t}( \mathbf{x}_t) - \nabla c_{t-1}(\mathbf{x}_{t-1}) \bigr\|_2^2 - \frac{1}{2} \bigl(\|\mathbf{x}_t - \mathbf{z}_{t-1}\|_2^2 + \|\mathbf{x}_t- \mathbf{z}_t\|_2^2 \bigr) \end{aligned}$$
Then we have
$$\begin{aligned} \frac{\eta}{L}\bigl(c_t(\mathbf{x}_t) - c_t(\mathbf{u}) \bigr) \leq& \frac{\eta}{L} (\mathbf{x}_t - \mathbf{u})^{\top}\nabla c_t(\mathbf{x}_t) \leq \frac{1}{2} \bigl(\|\mathbf{u}- \mathbf{z}_{t-1} \|_2^2 - \|\mathbf{u}-\mathbf{z}_{t} \|_2^2 \bigr) \\ &{}+ \frac{2\eta^2}{L^2} \bigl\|\nabla c_{t}(\mathbf{x}_{t-1}) - \nabla c_{t-1}(\mathbf{x}_{t-1})\bigr\|_2^2+ \frac{2\eta^2}{L^2} \bigl\| \nabla c_{t}(\mathbf{x}_t) - \nabla c_{t}(\mathbf{x}_{t-1})\bigr\|_2^2 \\ &{}- \frac{1}{2} \bigl(\|\mathbf{x}_t - \mathbf{z}_{t-1} \|_2^2+\|\mathbf{x}_t-\mathbf{z}_t \|_2^2 \bigr) \\ \leq & \frac{1}{2} \bigl(\|\mathbf{u}-\mathbf{z}_{t-1}\|_2^2 - \|\mathbf{u}-\mathbf{z}_{t}\|_2^2 \bigr)+ \frac{2\eta^2}{L^2} \bigl\|\nabla c_{t}(\mathbf{x}_{t-1}) - \nabla c_{t-1}(\mathbf{x}_{t-1})\bigr\|_2^2 \\ &{}+ 2\eta^2\|\mathbf{x}_t - \mathbf{x}_{t-1} \|_2^2 - \frac{1}{2} \bigl(\|\mathbf{x}_t - \mathbf{z}_{t-1}\|_2^2+\|\mathbf{x}_t- \mathbf{z}_t\|_2^2 \bigr), \end{aligned}$$
where the first inequality follows from the convexity of c t (x), the second from the preceding inequality together with ∥a+b∥2≤2∥a∥2+2∥b∥2, and the third from the smoothness of c t (x). By taking the summation over t=1,…,T with \(\mathbf{z}^{*}=\arg\min _{\mathbf{u}\in \mathcal{P}}\sum_{t=1}^{T}c_{t}(\mathbf{u})\), and dividing both sides by η/L, we have
$$\begin{aligned} \sum_{t=1}^Tc_t( \mathbf{x}_t) - \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \leq & \frac {L}{2\eta}+ \frac{2\eta}{L}\sum_{t=0}^{T-1} \bigl\|\nabla c_{t+1}(\mathbf{x}_{t}) - \nabla c_{t}( \mathbf{x}_{t})\bigr\|_2^2 \\ &{} +\sum_{t=1}^T2 \eta^2\|\mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 - \underbrace{\sum_{t=1}^T \frac{1}{2} \bigl(\|\mathbf{x}_t - \mathbf{z}_{t-1} \|_2^2+\|\mathbf{x}_t-\mathbf{z}_t \|_2^2 \bigr)} _{B_T} \end{aligned}$$
(16)
We can bound B T as follows:
$$\begin{aligned} B_T =& \frac{1}{2}\sum_{t=1}^T \|\mathbf{x}_t-\mathbf{z}_{t-1}\|_2^2 + \frac{1}{2}\sum_{t=2}^{T+1}\| \mathbf{x}_{t-1}-\mathbf{z}_{t-1}\|_2^2 \\ \geq& \frac{1}{2}\sum_{t=2}^T \bigl(\|\mathbf{x}_t-\mathbf{z}_{t-1}\|_2^2 + \|\mathbf{x}_{t-1}-\mathbf{z}_{t-1}\|_2^2 \bigr) \\ \geq&\frac{1}{4}\sum_{t=2}^T\| \mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2= \frac{1}{4}\sum_{t=1}^T\| \mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 \end{aligned}$$
(17)
where the last equality follows from the fact that x 1=x 0. Plugging the above bound into (16), we have
$$\begin{aligned} \sum_{t=1}^Tc_t( \mathbf{x}_t) - \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \leq& \frac {L}{2\eta}+ \frac{2\eta}{L}\sum_{t=0}^{T-1} \bigl\|\nabla c_{t+1}(\mathbf{x}_{t}) - \nabla c_{t}( \mathbf{x}_{t})\bigr\|_2^2 \\ &{} +\sum_{t=1}^T \biggl(2 \eta^2-\frac{1}{4} \biggr) \|\mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 \end{aligned}$$
(18)
We complete the proof by plugging the value of η. □

2.3 A general prox method and some special cases

In this subsection, we first present a general prox method to obtain a variation bound defined in a general norm. Then we discuss three special cases: online linear optimization, prediction with expert advice, and online strictly convex optimization. The omitted proofs in this subsection can be easily reproduced by mimicking the proof of Theorem 3, if necessary with the help of the previous analysis mentioned in the appropriate text.

2.3.1 A general prox method

The prox method, together with Lemma 2, provides an easy way to generalize the framework from the Euclidean norm to a general norm. To be precise, let ∥⋅∥ denote a general norm, ∥⋅∥ ∗ denote its dual norm, ω(z) be an α-strongly convex function with respect to the norm ∥⋅∥, and D(x,z)=ω(x)−(ω(z)+(xz) ω′(z)) be the Bregman distance induced by the function ω(x). Let c t (⋅),t=1,…,T be L-smooth functions with respect to the norm ∥⋅∥, i.e.,
$$ \bigl\|\nabla c_t(\mathbf{x}) - \nabla c_t(\mathbf{z})\bigr\|_* \leq L \|\mathbf{x}- \mathbf{z}\|. $$
Correspondingly, we define the extended gradual variation based on the general norm as follows:
$$ \mathrm{EGV}_T = \sum_{t=0}^{T-1} \bigl\|\nabla c_{t+1}(\mathbf{x}_t) - \nabla c_t(\mathbf{x}_t)\bigr\|_*^2. $$
(19)
Algorithm 5 gives the detailed steps for the general framework. We note that the key differences from Algorithm 4 are: z 0 is set to \(\arg\min_{\mathbf{z}\in \mathcal{P}}\omega(\mathbf{z})\), and the Euclidean distances in steps 4 and 6 are replaced by Bregman distances, i.e.,
$$\begin{aligned} &\displaystyle \mathbf{x}_t = \mathop{\arg\min}\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{\mathbf{x}^{\top}\nabla c_{t-1}(\mathbf{x}_{t-1})+ \frac{L}{\eta}D( \mathbf{x}, \mathbf{z}_{t-1}) \biggr\}, \\ &\displaystyle \mathbf{z}_{t} =\mathop{\arg\min}\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{\mathbf{x}^{\top}\nabla c_{t}(\mathbf{x}_{t})+ \frac{L}{\eta}D( \mathbf{x}, \mathbf{z}_{t-1}) \biggr\}. \end{aligned}$$
The following theorem states the variation-based regret bound for the general norm framework, where R measures the size of \(\mathcal{P}\), defined as
$$R = \sqrt{2\Bigl(\max_{\mathbf{x}\in \mathcal{P}} \omega(\mathbf{x}) - \min _{\mathbf{x}\in \mathcal{P}}\omega(\mathbf{x})\Bigr)}. $$
Algorithm 5

A General Prox Method for Online Convex Optimization

Theorem 4

Let c t (⋅),t=1,…,T be a sequence of convex functions with L-Lipschitz continuous gradients and ω(z) be an α-strongly convex function, both with respect to the norm ∥⋅∥, and let EGV T be defined in (19). By setting \(\eta= (1/2)\min \{\sqrt{\alpha}/\sqrt{2}, LR/\sqrt{\mathrm{EGV}_{T}} \}\), we have the following regret bound
$$\sum_{t=1}^Tc_t( \mathbf{x}_t)-\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \leq2R\max (\sqrt{2}LR/\sqrt{\alpha}, \sqrt{\mathrm{EGV}_T} ). $$

2.3.2 Online linear optimization

Here we consider online linear optimization and present the algorithm and the gradual variation bound for this setting as a special case of proposed algorithm. In particular, we are interested in bounding the regret by the gradual variation
$$\mathrm{EGV}^{f}_{T,2}=\sum_{t=0}^{T-1} \|\mathbf{f}_{t+1} - \mathbf{f}_t\|_2^2, $$
where f t ,t=1,…,T are the linear cost vectors and f 0=0. Since linear functions are smooth functions that satisfy the inequality in (8) for any L>0, we can apply Algorithm 4 to online linear optimization with any positive value of L. The regret bound of Algorithm 4 for online linear optimization is presented in the following corollary.
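The gradual variation \(\mathrm{EGV}^{f}_{T,2}\) above is a simple quantity to compute from the cost vectors; the following is a minimal sketch (the function name is ours):

```python
import numpy as np

def gradual_variation_l2(fs):
    """EGV^f_{T,2} = sum_{t=0}^{T-1} ||f_{t+1} - f_t||_2^2, with f_0 = 0.

    fs is the list [f_1, ..., f_T] of linear cost vectors."""
    prev = np.zeros_like(np.asarray(fs[0], dtype=float))
    egv = 0.0
    for f in fs:
        f = np.asarray(f, dtype=float)
        egv += float(np.sum((f - prev) ** 2))  # squared step-to-step change
        prev = f
    return egv
```

For a slowly changing cost sequence this quantity stays far below T, which is exactly the regime where the variation-based bound improves on the standard \(O(\sqrt{T})\) regret.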

Corollary 1

Let \(c_{t}(\mathbf{x}) = \mathbf{f}_{t}^{\top} \mathbf{x}, t=1,\ldots, T\) be a sequence of linear functions. By setting \(\eta= \displaystyle\sqrt{1/ (2{\mathrm{EGV}}^{f}_{T,2} )}\) and L=1 in Algorithm  4, then we have
$$\sum_{t=1}^T\mathbf{f}_t^{\top} \mathbf{x}_t - \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}\leq\sqrt{2\mathrm{EGV}^{f}_{T,2}}. $$

Remark 3

Note that the regret bound in Corollary 1 is stronger than the regret bound obtained in Hazan and Kale (2010) for online linear optimization due to the fact that the gradual variation is smaller than the total variation.

2.3.3 Prediction with expert advice

In the problem of prediction with expert advice, the decision vector x is a distribution over m experts, i.e., \(\mathbf{x}\in \mathcal{P}=\{\mathbf{x}\in\mathbb{R}^{m}_{+}: \sum_{i=1}^{m}x_{i}=1\}\). Let \(\mathbf{f}_{t}\in\mathbb{R}^{m}\) denote the costs of the m experts in trial t. Similar to Hazan and Kale (2010), we would like to bound the regret of prediction with expert advice by the gradual variation defined in the infinity norm, i.e.,
$$\mathrm{EGV}^{f}_{T, \infty}= \sum_{t=0}^{T-1} \|\mathbf{f}_{t+1} -\mathbf{f}_t\|_\infty^2. $$
Since it is a special online linear optimization problem, we can apply Algorithm 4 to obtain a regret bound as in Corollary 1, i.e.,
$$\sum_{t=1}^T\mathbf{f}_t^{\top} \mathbf{x}_t - \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}\leq\sqrt{2\mathrm{EGV}_{T,2}^{f}} \leq\sqrt{2m\mathrm{EGV}^{f}_{T,\infty}}. $$
However, the above regret bound scales badly with the number of experts. We can obtain a better regret bound of \(O(\sqrt{{\mathrm{EGV}}^{f}_{T, \infty}\ln m})\) by applying the general prox method in Algorithm 5 with \(\omega(\mathbf{x})= \sum_{i=1}^{m}x_{i}\ln x_{i}\) and \(D(\mathbf{x}, \mathbf{z})= \sum_{i=1}^{m}x_{i}\ln(x_{i}/z_{i})\). The two updates in Algorithm 5 become
$$\begin{aligned} x_t^i =& \frac{z^i_{t-1}\exp(-[\eta/L]f_{t-1}^i)}{\sum_{j=1}^mz^j_{t-1}\exp(-[\eta/L]f_{t-1}^j)},\quad i=1,\ldots, m \\ z_t^i = &\frac{z^i_{t-1}\exp(-[\eta/L]f_{t}^i)}{\sum_{j=1}^mz^j_{t-1}\exp(-[\eta/L]f_{t}^j)},\quad i=1,\ldots, m. \end{aligned}$$
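One round of these multiplicative updates can be sketched as follows (a minimal sketch with hypothetical names; the negative sign in the exponent reflects that costs are being minimized):

```python
import numpy as np

def expert_step(z_prev, f_prev, f_curr, eta_over_L):
    """One round of the entropy-prox updates for prediction with expert advice.

    z_prev: previous searching distribution over the m experts;
    f_prev, f_curr: previous and current cost vectors."""
    x = z_prev * np.exp(-eta_over_L * f_prev)   # decision x_t from f_{t-1}
    x /= x.sum()
    z = z_prev * np.exp(-eta_over_L * f_curr)   # searching point z_t from f_t
    z /= z.sum()
    return x, z
```

Note that x t is formed before f t is revealed (it reuses f t−1), while z t incorporates the current cost; both stay on the simplex by renormalization.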
The resulting regret bound is formally stated in the following corollary.

Corollary 2

Let \(c_{t}(\mathbf{x}) = \mathbf{f}_{t}^{\top} \mathbf{x}, t=1,\ldots, T\) be a sequence of linear functions in prediction with expert advice. By setting \(\eta= \displaystyle\sqrt{(\ln m)/\mathrm{EGV}^{f}_{T,\infty}}\), L=1, \(\omega (\mathbf{x})= \sum_{i=1}^{m}x_{i}\ln x_{i}\) and \(D(\mathbf{x}, \mathbf{z})= \sum_{i=1}^{m}x_{i}\ln (x_{i}/z_{i})\) in Algorithm  5, we have
$$\sum_{t=1}^T\mathbf{f}_t^{\top} \mathbf{x}_t - \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^T \mathbf{f}_t^{\top} \mathbf{x}\leq\sqrt{2\mathrm{EGV}^{f}_{T,\infty} \ln m}. $$

Remark 4

By noting the definition of \(\mathrm{EGV}^{f}_{T,\infty}\), the regret bound in Corollary 2 is \(O (\sqrt{\sum_{t=0}^{T-1}\max_{i} |f_{t+1}^{i} - f_{t}^{i}|^2\ln m} )\), which is similar to the regret bound obtained in Hazan and Kale (2010) for prediction with expert advice. However, the definitions of the variation are not exactly the same. In Hazan and Kale (2010), the authors bound the regret of prediction with expert advice by \(O (\sqrt{\ln m\max_{i} \sum_{t=1}^{T} |f_{t}^{i} - \mu_{t}^{i}|^{2}}+ \ln m )\), where the variation is the maximum total variation over all experts. To compare the two regret bounds, we first consider two extreme cases. When the costs of all experts are the same, the variation in Corollary 2 is a standard gradual variation, while the variation in Hazan and Kale (2010) is a standard total variation. According to the previous analysis, a gradual variation is smaller than a total variation; therefore the regret bound in Corollary 2 is better than that in Hazan and Kale (2010). In the other extreme case, when the costs of each expert are the same at all iterations, both regret bounds are constants. More generally, if we assume the maximum total variation is small (say a constant), then \(\sum_{t=0}^{T-1}|f_{t+1}^{i} - f^{i}_{t}|^2\) is also a constant for any i∈[m]. By the trivial bound \(\sum_{t=0}^{T-1}\max_{i} |f^{i}_{t+1} - f^{i}_{t}|^2\leq m \max_{i} \sum_{t=0}^{T-1}|f_{t+1}^{i} - f_{t}^{i}|^2\), the regret bound in Corollary 2 might be worse by up to a factor of \(\sqrt{m}\) than that in Hazan and Kale (2010).

Remark 5

It was shown in Chiang et al. (2012) that both the regret bounds in Corollary 1 and Corollary 2 are optimal, because they match the lower bounds for a special sequence of loss functions. In particular, for online linear optimization, if all loss functions but the first \(T_{k}=\sqrt{\mathrm{EGV}^{f}_{T,2}}\) are all-0 functions, then the known lower bound \(\varOmega(\sqrt{T_{k}})\) (Nesterov 2004) matches the upper bound in Corollary 1. Similarly, for prediction with expert advice, if all loss functions but the first \(T'_{k}=\sqrt{\mathrm{EGV}^{f}_{T,\infty}}\) are all-0 functions, then the known lower bound \(\varOmega(\sqrt{T'_{k}\ln m})\) (Cesa-Bianchi and Lugosi 2006) matches the upper bound in Corollary 2.

2.3.4 Online strictly convex optimization

In this subsection, we present an algorithm that achieves a logarithmic variation bound for online strictly convex optimization (Hazan et al. 2007). In particular, we assume the cost functions c t (x) are not only smooth but also strictly convex, as defined formally below.

Definition 1

For β>0, a function \(c(\mathbf{x}):\mathcal{P}\rightarrow \mathbb{R}\) is β-strictly convex if for any \(\mathbf{x}, \mathbf{z}\in \mathcal{P}\)
$$ c(\mathbf{x})\geq c(\mathbf{z}) + \nabla c(\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}) + \beta( \mathbf{x}-\mathbf{z})^{\top }\nabla c(\mathbf{z})\nabla c(\mathbf{z})^{\top} (\mathbf{x}-\mathbf{z}) $$
(20)
It is known that strictly convex functions so defined include strongly convex functions and exponentially concave functions as special cases, provided the gradient of the function is bounded. To see this, if c(x) is a β′-strongly convex function with a bounded gradient ∥∇c(x)∥2G, then
$$\begin{aligned} c(\mathbf{x}) \geq &c(\mathbf{z}) + \nabla c(\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}) + \beta' ( \mathbf{x}-\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}) \\ \geq & c(\mathbf{z}) + \nabla c(\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}) + \frac{\beta'}{G^2}(\mathbf{x}-\mathbf{z})^{\top}\nabla c(\mathbf{z})\nabla c(\mathbf{z})^{\top} (\mathbf{x}-\mathbf{z}), \end{aligned}$$
thus c(x) is (β′/G 2)-strictly convex. Similarly, if c(x) is exp-concave, i.e., there exists α>0 such that h(x)=exp(−αc(x)) is concave, then c(x) is β-strictly convex with β=(1/2)min(1/(4GD),α) (cf. Lemma 2 in Hazan et al. 2007), where D is the diameter of the domain.

Therefore, in addition to smoothness and strict convexity, we also assume all the cost functions have bounded gradients, i.e., ∥∇c t (x)∥2G. In Chiang et al. (2012), a logarithmic gradual variation regret bound has been proved. For completeness, we present the algorithm in the framework of the general prox method and summarize the results and analysis.

To derive a logarithmic gradual variation bound for online strictly convex optimization, we need to change the Euclidean distance function in Algorithm 4 to a generalized Euclidean distance function. Specifically, at trial t, we let \(H_{t}= I + \beta G^{2} I + \beta\sum_{\tau=0}^{t-1}\nabla c_{\tau}(\mathbf{x}_{\tau})\nabla c_{\tau}(\mathbf{x}_{\tau})^{\top}\) and use the generalized Euclidean distance \(D_{t}(\mathbf{x},\mathbf{z}) = \frac{1}{2}\|\mathbf{x}-\mathbf{z}\|^{2}_{H_{t}}=\frac{1}{2}(\mathbf{x}-\mathbf{z})^{\top}H_{t}(\mathbf{x}-\mathbf{z})\) in updating x t and z t , i.e.,
$$ \begin{aligned} \mathbf{x}_t &= \mathop{\arg \min}\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{\mathbf{x}^{\top}\nabla c_{t-1}( \mathbf{x}_{t-1})+ \frac{1}{2}\|\mathbf{x}-\mathbf{z}_{t-1}\|^2_{H_t} \biggr\} \\ \mathbf{z}_t &= \mathop{\arg\min}\limits _{\mathbf{x}\in \mathcal{P}} \biggl\{\mathbf{x}^{\top}\nabla c_{t}(\mathbf{x}_{t})+ \frac{1}{2}\|\mathbf{x}-\mathbf{z}_{t-1} \|^2_{H_t} \biggr\}, \end{aligned} $$
(21)
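For intuition, in the unconstrained case \(\mathcal{P}=\mathbb{R}^{d}\) the updates in (21) have a closed form. The following is a minimal sketch under that assumption (function names are ours; the constrained case would additionally require a projection in the ∥⋅∥ H norm):

```python
import numpy as np

def update_metric(H_prev, grad, beta):
    """Accumulate H_t = H_{t-1} + beta * grad grad^T (outer-product update)."""
    return H_prev + beta * np.outer(grad, grad)

def generalized_prox_step(z_prev, grad, H):
    """argmin_x  x^T grad + (1/2)||x - z_prev||_H^2  over R^d.

    The first-order condition grad + H (x - z_prev) = 0 gives the closed form."""
    return z_prev - np.linalg.solve(H, grad)
```

The matrix H t grows along the observed gradient directions, so step sizes shrink precisely in those directions; this is what yields the logarithmic dependence in Corollary 3.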
To prove the regret bound, we can prove a similar inequality as in Lemma 2 by applying \(\omega(\mathbf{x})= 1/2\|\mathbf{x}\|^{2}_{H_{t}}\), which is stated as follows
$$\begin{aligned} &\nabla c_t(\mathbf{x}_t)^{\top}(\mathbf{x}_t-\mathbf{z}) \\ &\quad {}\leq D_t(\mathbf{z},\mathbf{z}_{t-1}) - D_t(\mathbf{z}, \mathbf{z}_t) \\ &\qquad {} + \bigl\|\nabla c_t(\mathbf{x}_t)-\nabla c_{t-1}(\mathbf{x}_{t-1})\bigr\| _{H_t^{-1}}^2- \frac{1}{2} \bigl[\|\mathbf{x}_t-\mathbf{z}_{t-1}\|_{H_t}^2 + \|\mathbf{x}_t-\mathbf{z}_t\|_{H_t}^2 \bigr]. \end{aligned}$$
Then by applying inequality in (20) for strictly convex function, we obtain the following
$$\begin{aligned} c_t(\mathbf{x}_t)-c_t(\mathbf{z}) \leq& D_t(\mathbf{z},\mathbf{z}_{t-1}) - D_t(\mathbf{z}, \mathbf{z}_t) - \beta\|\mathbf{x}_t-\mathbf{z}\|^2_{h_t} \\ &{} + \bigl\|\nabla c_t(\mathbf{x}_t)-\nabla c_{t-1}(\mathbf{x}_{t-1})\bigr\| _{H_t^{-1}}^2- \frac{1}{2} \bigl[\|\mathbf{x}_t-\mathbf{z}_{t-1}\|_{H_t}^2 + \|\mathbf{x}_t-\mathbf{z}_t\|_{H_t}^2 \bigr], \end{aligned}$$
(22)
where h t =∇c t (x t )∇c t (x t ) . It remains to apply the analysis in Chiang et al. (2012) to obtain a logarithmic gradual variation bound, which is stated in the following corollary; its proof is deferred to Appendix C.

Corollary 3

Let c t (x),t=1,…,T be a sequence of β-strictly convex and L-smooth functions with gradients bounded by G. We assume 8dL 2≥1; otherwise we can set \(L=\sqrt{1/(8d)}\). An algorithm that adopts the updates in (21) has its regret bounded by
$$\sum_{t=1}^Tc_t( \mathbf{x}_t) - \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x})\leq\frac{1 + \beta G^2}{2} +\displaystyle\frac{8d}{\beta} \ln\max \bigl(16dL^2, \beta{\mathrm{EGV}}_{T,2}\bigr), $$
where \(\mathrm{EGV}_{T,2}=\sum_{t=0}^{T-1}\|\nabla c_{t+1}(\mathbf{x}_{t}) - \nabla c_{t}(\mathbf{x}_{t})\|_{2}^{2}\) and d is the dimension of \(\mathbf{x}\in \mathcal{P}\).

3 Online non-smooth optimization with gradual variation bound

All the results we have obtained so far rely on the assumption that the cost functions are smooth. Moreover, at the beginning of Sect. 2, we showed that for general non-smooth functions, when the only information available to the learner is first order information about the cost functions, it is impossible to obtain a regret bound by gradual variation. In this section, however, we show that a gradual variation bound is achievable for a special class of non-smooth functions composed of a smooth component and a non-smooth component.

We consider two categories for the non-smooth component. In the first category, we assume that the non-smooth component is a fixed function and is relatively simple, so that the composite gradient mapping can be solved without much computational overhead compared to the plain gradient mapping. A common example in this category is a non-smooth regularizer. For example, in addition to the basic domain \(\mathcal{P}\), one may enforce a sparsity constraint on the decision vector x, i.e., ∥x0k<d, which is important in feature selection. However, the sparsity constraint ∥x0k is a non-convex function, and is usually implemented by adding an 1 regularizer λ∥x1 to the objective function, where λ>0 is a regularization parameter. Therefore, at each iteration the cost function is given by c t (x)+λ∥x1. To prove a regret bound by gradual variation for this type of non-smooth optimization, we first present a simplified version of the general prox method and show that it has exactly the same regret bound as stated in Sect. 2.3.1, and then extend the algorithm to non-smooth optimization with a fixed non-smooth component.

In the second category, we assume that the non-smooth component can be written as an explicit maximization structure. In general, we consider a time-varying non-smooth component, present a primal-dual prox method, and prove a min-max regret bound by gradual variation. When the non-smooth components are equal across all trials, the usual regret is bounded by the min-max bound plus a variation in the non-smooth component. To see an application of min-max regret bound, we consider the problem of online classification with hinge loss and show that the number of mistakes can be bounded by a variation in sequential examples.

Before moving to the detailed analysis, it is worth mentioning that several works have proposed algorithms for optimizing the two types of non-smooth functions described above with an optimal convergence rate of O(1/T) (Nesterov 2005a, 2005b). Therefore, the existence of a regret bound by gradual variation for these two types of non-smooth optimization does not contradict the impossibility argument in Sect. 2.

3.1 Online non-smooth optimization with a fixed non-smooth component

3.1.1 A simplified version of Algorithm 5

In this subsection, we present a simplified version of Algorithm 5, which is the foundation for us to develop the algorithm for non-smooth optimization.

The key trick is to replace the domain constraint \(\mathbf{x}\in \mathcal{P}\) with a non-smooth function in the objective. Let \(\delta_{\mathcal{P}}(\mathbf{x})\) denote the indicator function of the domain \(\mathcal{P}\), i.e.,
$$ \delta_\mathcal{P}(\mathbf{x})=\left \{ \begin{array}{l@{\quad }l}0,&\mathbf{x}\in \mathcal{P}\\ \infty,&\mathrm{otherwise} \end{array} \right . $$
Then the proximal gradient mapping for updating x (step 4) in Algorithm 5 is equivalent to
$$ \mathbf{x}_t = \arg\min_{\mathbf{x}} \mathbf{x}^{\top}\nabla c_{t-1}(\mathbf{x}_{t-1}) + \frac{L}{\eta }D(\mathbf{x}, \mathbf{z}_{t-1}) + \delta_\mathcal{P}(\mathbf{x}). $$
By the first order optimality condition, there exists a sub-gradient \(\mathbf{v}_{t}\in\partial\delta_{\mathcal{P}}(\mathbf{x}_{t})\) such that
$$ \nabla c_{t-1}(\mathbf{x}_{t-1}) + \frac{L}{\eta} \bigl(\nabla\omega(\mathbf{x}_t) - \nabla \omega(\mathbf{z}_{t-1})\bigr) + \mathbf{v}_t = 0. $$
(23)
Thus, x t is equal to
$$ \mathbf{x}_t = \arg\min_{\mathbf{x}} \mathbf{x}^{\top}\bigl(\nabla c_{t-1}(\mathbf{x}_{t-1}) + \mathbf{v}_t\bigr)+ \frac {L}{\eta}D(\mathbf{x}, \mathbf{z}_{t-1}). $$
(24)
Then we can change the update for z t to
$$ \mathbf{z}_t = \arg\min _{\mathbf{x}} \mathbf{x}^{\top}\bigl(\nabla c_{t}( \mathbf{x}_{t}) + \mathbf{v}_t\bigr)+ \frac{L}{\eta}D(\mathbf{x}, \mathbf{z}_{t-1}). $$
(25)
The key difference of the above update from step 6 in Algorithm 5 is that we explicitly use the sub-gradient v t that satisfies the optimality condition for x t , instead of solving a domain-constrained optimization problem. The advantage of updating z t by (25) is that we can easily compute z t from the first order optimality condition, i.e.,
$$ \nabla c_{t}(\mathbf{x}_{t}) + \mathbf{v}_t + \frac{L}{\eta}\bigl(\nabla\omega(\mathbf{z}_t) - \nabla\omega( \mathbf{z}_{t-1})\bigr) = 0. $$
(26)
Note that Eq. (23) indicates v t =−∇c t−1(x t−1)−(L/η)(∇ω(x t )−∇ω(z t−1)). By plugging this into (26), we arrive at the following simplified update for z t ,
$$ \nabla\omega(\mathbf{z}_t) = \nabla\omega(\mathbf{x}_{t}) + \frac{\eta}{L}\bigl(\nabla c_{t-1}(\mathbf{x}_{t-1}) - \nabla c_t(\mathbf{x}_t)\bigr). $$
The simplified version of Algorithm 5 is presented in Algorithm 6.
Algorithm 6

A Simplified General Prox Method for Online Convex Optimization

Remark 6

We make three remarks for Algorithm 6. First, the searching point z t does not necessarily belong to the domain \(\mathcal{P}\), which is usually not a problem given that the decision point x t always lies in \(\mathcal{P}\). Nevertheless, the update can be followed by a projection step \(\mathbf{z}_{t}=\arg\min_{\mathbf{x}\in \mathcal{P}}D(\mathbf{x},\mathbf{z}'_{t})\) to ensure that the searching point also stays in the domain \(\mathcal{P}\), where we slightly abuse the notation \(\mathbf{z}_{t}'\) in \(\nabla\omega(\mathbf{z}_{t}') = \nabla\omega(\mathbf{x}_{t}) + \frac{\eta}{L}(\nabla c_{t-1}(\mathbf{x}_{t-1}) - \nabla c_{t}(\mathbf{x}_{t}))\).

Second, the update in step 6 can be implemented as in Cesa-Bianchi and Lugosi (2006, Chap. 11):
$$\mathbf{z}_t = \nabla\omega^*\biggl( \nabla\omega(\mathbf{x}_{t}) + \frac{\eta}{L}\bigl(\nabla c_{t-1}(\mathbf{x}_{t-1}) - \nabla c_t(\mathbf{x}_t)\bigr)\biggr), $$
where ω (⋅) is the Legendre-Fenchel conjugate of ω(⋅). For example, when \(\omega(\mathbf{x}) = 1/2\|\mathbf{x}\|_{2}^{2}\), \(\omega^{*}(\mathbf{x})= 1/2\|\mathbf{x}\|_{2}^{2}\) and the update for the searching point is given by
$$\mathbf{z}_t = \mathbf{x}_t + (\eta/L) \bigl(\nabla c_{t-1}( \mathbf{x}_{t-1})-\nabla c_t(\mathbf{x}_t)\bigr); $$
when ω(x)=∑ i x i lnx i , ω (x)=log[∑ i exp(x i )] and the update for the searching point can be computed by
$$[\mathbf{z}_t]_i \propto [\mathbf{x}_t]_i\exp \bigl((\eta/L)\bigl[ \nabla c_{t-1}(\mathbf{x}_{t-1})-\nabla c_t( \mathbf{x}_t)\bigr]_i \bigr), \quad \mathrm{s.t.}\quad\sum_i[ \mathbf{z}_t]_i=1. $$
Third, the key inequality in (15) for proving the regret bound still holds with ζ=∇c t (x t )+v t and ξ=∇c t−1(x t−1)+v t , by noting the equivalence between the pairs (24), (25) and (13), (14). Since the sub-gradient v t cancels in the difference ξζ, the inequality reads:
$$\begin{aligned} & \frac{\eta}{L}\bigl(\nabla c_t(\mathbf{x}_t) + \mathbf{v}_t\bigr)^{\top}(\mathbf{x}_t-\mathbf{x}) \\ &\quad {}\leq D(\mathbf{x},\mathbf{z}_{t-1}) - D(\mathbf{x}, \mathbf{z}_t) \\ &\qquad {}+ \frac{\eta^2}{\alpha L^2} \bigl\|\nabla c_t(\mathbf{x}_t)-\nabla c_{t-1}(\mathbf{x}_{t-1})\bigr\|_*^2 - \frac{\alpha}{2}\bigl[ \|\mathbf{x}_t-\mathbf{z}_{t-1}\|^2 + \|\mathbf{x}_t-\mathbf{z}_{t} \|^2\bigr],\quad \forall \mathbf{x}\in \mathcal{P}, \end{aligned}$$
where \(\mathbf{v}_{t}\in\partial\delta_{\mathcal{P}}(\mathbf{x}_{t})\). As a result, we can apply the same analysis as in the proof of Theorem 3 to obtain the same regret bound in Theorem 4 for Algorithm 6. Note that the above inequality remains valid even if we take a projection step after the update for \(\mathbf{z}'_{t}\) due to the generalized Pythagorean inequality \(D(\mathbf{x}, \mathbf{z}_{t})\leq D(\mathbf{x},\mathbf{z}'_{t}), \forall \mathbf{x}\in \mathcal{P}\) (Cesa-Bianchi and Lugosi 2006).
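The two instantiations of the step-6 update discussed in the remark above can be sketched as follows (a minimal sketch; function names are ours, and the entropy variant assumes x t lies on the simplex):

```python
import numpy as np

def z_update_euclidean(x_t, g_prev, g_curr, eta_over_L):
    """Step 6 with omega(x) = ||x||_2^2 / 2: the mirror map is the identity."""
    return x_t + eta_over_L * (g_prev - g_curr)

def z_update_entropy(x_t, g_prev, g_curr, eta_over_L):
    """Step 6 with omega(x) = sum_i x_i ln x_i on the simplex:
    a multiplicative update followed by renormalization."""
    z = x_t * np.exp(eta_over_L * (g_prev - g_curr))
    return z / z.sum()
```

In both cases the searching point moves away from x t only by the *difference* of consecutive gradients, which is why the resulting regret is controlled by the gradual variation.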

3.1.2 A gradual variation bound for online non-smooth optimization

In the spirit of Algorithm 6, we present an algorithm for online non-smooth optimization of functions \(c_{t}(\mathbf{x})=\widehat{c}_{t}(\mathbf{x})+ g(\mathbf{x})\) with a regret bound by the gradual variation \(\mathrm{EGV}_{T}= \sum_{t=0}^{T-1}\|\nabla c_{t+1}(\mathbf{x}_{t}) - \nabla c_{t}(\mathbf{x}_{t})\|_{*}^{2}\). The trick is to solve the composite gradient mapping:
$$ \mathbf{x}_t= \arg\min_{\mathbf{x}\in \mathcal{P}} \mathbf{x}^{\top}\nabla c_{t-1}(\mathbf{x}_{t-1}) + \frac {L}{\eta}D(\mathbf{x}, \mathbf{z}_{t-1}) + g(\mathbf{x}) $$
and update z t by
$$ \nabla\omega(\mathbf{z}_t) = \nabla\omega(\mathbf{x}_{t}) + \frac{\eta}{L}\bigl(\nabla c_{t-1}(\mathbf{x}_{t-1}) - \nabla c_t(\mathbf{x}_t)\bigr). $$
Algorithm 7 shows the detailed steps and Corollary 4 states the regret bound, which can be proved similarly.
Algorithm 7

A General Prox Method for Online Non-Smooth Optimization with a Fixed Non-Smooth Component

Corollary 4

Let \(c_{t}(\cdot)=\widehat{c}_{t}(\cdot) + g(\cdot), t=1, \ldots, T\) be a sequence of convex functions, where \(\widehat{c}_{t}(\cdot)\) are L-smooth w.r.t. ∥⋅∥ and g(⋅) is a non-smooth function, let ω(z) be an α-strongly convex function w.r.t. ∥⋅∥, and let EGV T be defined in (19). By setting \(\eta= (1/2)\min \{\sqrt{\alpha}/\sqrt{2}, LR/\sqrt{\mathrm{EGV}_{T}} \}\), we have the following regret bound
$$\sum_{t=1}^Tc_t( \mathbf{x}_t)-\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \leq2R\max (\sqrt{2}LR/\sqrt{\alpha}, \sqrt{\mathrm{EGV}_T} ). $$

3.2 Online non-smooth optimization with an explicit max structure

In the previous subsection, we assumed that the composite gradient mapping with the non-smooth component can be solved efficiently. Here, we replace this assumption with an explicit max structure of the non-smooth component.

In what follows, we present a primal-dual prox method for such non-smooth cost functions and prove its regret bound. We consider a general setting, where the non-smooth functions c t (x) have the following structure:
$$ c_t(\mathbf{x}) = \widehat{c}_t(\mathbf{x}) + \max _{\mathbf{u}\in\mathcal{Q}}\langle A_t\mathbf{x}, \mathbf{u}\rangle- \widehat{\phi}_t(\mathbf{u}), $$
(27)
where \(\widehat{c}_{t}(\mathbf{x})\) and \(\widehat{\phi}_{t}(\mathbf{u})\) are L 1-smooth and L 2-smooth functions, respectively, and \(A_{t}\in\mathbb{R}^{m\times d}\) is a matrix that characterizes the non-smooth component of c t (x), together with \(-\widehat{\phi}_{t}(\mathbf{u})\), via the maximization. Similarly, we define a dual cost function ϕ t (u) as
$$ \phi_t(\mathbf{u}) =- \widehat{\phi}_t(\mathbf{u}) + \min _{\mathbf{x}\in \mathcal{P}}\langle A_t\mathbf{x}, \mathbf{u}\rangle+ \widehat{c}_t(\mathbf{x}). $$
(28)
We refer to x as the primal variable and to u as the dual variable. To motivate the setup, let us consider online classification with hinge loss \(\ell_{t}(\mathbf{w})=\max(0,1-y_{t}\mathbf{w}^{\top} \mathbf{x}_{t})\), where we slightly abuse the notation (x t ,y t ) to denote the attribute and label pair received at trial t. It is straightforward to see that ℓ t (w) is a non-smooth function and can be cast into the form in (27) by
$$ \ell_t(\mathbf{w}) = \max_{\alpha\in[0, 1]}\alpha \bigl(1-y_t\mathbf{w}^{\top} \mathbf{x}_t\bigr) = \max _{\alpha\in[0,1]} -\alpha y_t\mathbf{x}_t^{\top} \mathbf{w} + \alpha. $$
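As a quick sanity check of this reformulation, note that the inner objective is linear in α, so the maximum over [0,1] is attained at an endpoint and recovers the hinge loss. A small numeric sketch (the weight vectors and examples below are made up for illustration):

```python
def hinge(w, x, y):
    """Hinge loss max(0, 1 - y * <w, x>)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def hinge_via_max(w, x, y, grid=1001):
    """max over alpha in [0, 1] (on a grid) of alpha * (1 - y * <w, x>)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max((a / (grid - 1)) * (1.0 - margin) for a in range(grid))

# One misclassified example (positive loss) and one confident one (zero loss).
cases = [([0.5, -0.2], [1.0, 2.0], +1), ([2.0, 0.0], [1.0, 0.0], +1)]
vals = [(hinge(w, x, y), hinge_via_max(w, x, y)) for w, x, y in cases]
```

The grid maximization agrees with the closed-form hinge loss in both regimes: α=1 when the margin is below 1, and α=0 otherwise.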
To present the algorithm and analyze its regret bound, we introduce some notation. Let \(F_{t}(\mathbf{x},\mathbf{u}) = \widehat{c}_{t}(\mathbf{x}) + \langle A_{t}\mathbf{x}, \mathbf{u}\rangle- \widehat{\phi}_{t}(\mathbf{u})\), let ω 1(x) be an α 1-strongly convex function defined on the primal variable x w.r.t. a norm ∥⋅∥ p , and let ω 2(u) be an α 2-strongly convex function defined on the dual variable u w.r.t. a norm ∥⋅∥ q . Correspondingly, let D 1(x,z) and D 2(u,v) denote the induced Bregman distances. We assume the domains \(\mathcal{P}\), \(\mathcal{Q}\) are bounded and the matrices A t have a bounded norm, i.e.,
$$ \begin{aligned} &\max_{\mathbf{x}\in \mathcal{P}}\|\mathbf{x}\|_p\leq R_1,\qquad\max_{\mathbf{u}\in \mathcal{Q}}\|\mathbf{u}\|_q\leq R_2 \\ &\max_{\mathbf{x}\in \mathcal{P}}\omega_1(\mathbf{x}) -\min_{\mathbf{x}\in \mathcal{P}} \omega_1(\mathbf{x})\leq M_1 \\ &\max_{\mathbf{u}\in \mathcal{Q}}\omega_2(\mathbf{u}) -\min_{\mathbf{u}\in \mathcal{Q}} \omega_2(\mathbf{u})\leq M_2 \\ &\|A_t\|_{p, q}=\max_{\|\mathbf{x}\|_p\leq1, \|\mathbf{u}\|_q\leq1} \mathbf{u}^{\top}A_t\mathbf{x}\leq \sigma. \end{aligned} $$
(29)
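For the common case p=q=2, the operator norm ∥A∥₂,₂ in (29) is the largest singular value of A. A hedged pure-Python sketch that estimates it by power iteration on AᵀA (this assumes a dominant singular value; the matrix and function names are illustrative):

```python
def matvec(A, v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def spectral_norm(A, iters=200):
    """Estimate ||A||_{2,2} = sigma_max(A) by power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        scale = sum(x * x for x in w) ** 0.5
        v = [x / scale for x in w]
    Av = matvec(A, v)
    return sum(x * x for x in Av) ** 0.5

# A diagonal matrix has singular values equal to the absolute diagonal entries.
sigma = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
```

In practice one would use a library SVD; the sketch is only meant to make the constant σ in (29) concrete.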
Let ∥⋅∥ p,∗ and ∥⋅∥ q,∗ denote the dual norms to ∥⋅∥ p and ∥⋅∥ q , respectively. To prove a variational regret bound, we define a gradual variation as follows:
$$\begin{aligned} \mathrm{EGV}_{T, p, q} =& \sum_{t=0}^{T-1} \bigl\|\nabla\widehat{c}_{t+1}(\mathbf{x}_t) - \nabla\widehat{c}_{t}(\mathbf{x}_t)\bigr\|^2_{p,*} + \bigl(R_1^2+R_2^2\bigr)\sum _{t=0}^{T-1}\|A_{t+1}-A_{t} \|_{p, q}^2 \\ &{}+ \sum_{t=0}^{T-1} \bigl\|\nabla\widehat{\phi}_{t+1}(\mathbf{u}_ t) - \nabla\widehat{\phi}_{t}( \mathbf{u}_ t)\bigr\|^2_{q,*}. \end{aligned}$$
(30)
Given the above notation, Algorithm 8 shows the detailed steps and Theorem 5 states a min-max bound.
Algorithm 8

General Prox Method for Online Non-smooth Optimization with an Explicit Max Structure

Theorem 5

Let \(c_{t}(\mathbf{x}) = \widehat{c}_{t}(\mathbf{x}) + \max_{\mathbf{u}\in\mathcal{Q}}\langle A_{t}\mathbf{x}, \mathbf{u}\rangle- \widehat{\phi}_{t}(\mathbf{u}), t=1,\ldots, T\) be a sequence of non-smooth functions. Assume \(\widehat{c}_{t}(\mathbf{x}), \widehat{\phi}_{t}(\mathbf{u})\) are L=max(L 1,L 2)-smooth functions and that the domains \(\mathcal{P}, \mathcal{Q}\) and the matrices A t satisfy the boundedness conditions in (29). Let ω 1(x) be an α 1-strongly convex function w.r.t. the norm ∥⋅∥ p , let ω 2(u) be an α 2-strongly convex function w.r.t. the norm ∥⋅∥ q , and let α=min(α 1,α 2). By setting \(\eta= \min (\sqrt{M_{1}+M_{2}}/(2\sqrt{\mathrm{EGV}_{T,p,q}}), \sqrt{\alpha}/(4\sqrt{\sigma^{2}+L^{2}}) )\) in Algorithm  8, we have
$$\begin{aligned} &\max_{\mathbf{u}\in \mathcal{Q}}\sum_{t=1}^TF_t( \mathbf{x}_t, \mathbf{u}) - \min_{\mathbf{x}\in \mathcal{P}}\sum _{t=1}^TF_t(\mathbf{x}, \mathbf{u}_ t) \\ &\quad {} \leq4\sqrt{M_1+M_2}\max \biggl(2\sqrt{ \frac{(M_1+M_2)(\sigma ^2+L^2)}{\alpha}}, \sqrt{\mathrm{EGV}_{T,p, q}} \biggr). \end{aligned}$$

To facilitate understanding, we break the proof into several lemmas. The following lemma is analogous to Lemma 2.

Lemma 3

Let \(\theta=(\mathbf{x}; \mathbf{u})\) denote the concatenated vector with a norm \(\|\theta\|=\sqrt{\|\mathbf{x}\| ^{2}_{p}+\|\mathbf{u}\|_{q}^{2}}\) and a dual norm \(\|\theta\|_{*}=\sqrt{\|\mathbf{x}\|^{2}_{p*}+\|\mathbf{u}\|^{2}_{q*}}\). Let ω(θ)=ω 1(x)+ω 2(u) and, for ζ=(z;v), let D(θ,ζ)=D 1(x,z)+D 2(u,v). Then
$$\begin{aligned} & \eta\left(\begin{array}{c} \nabla_\mathbf{x}F_t(\mathbf{x}_t, \mathbf{u}_ t)\\ -\nabla_\mathbf{u}F_t(\mathbf{x}_t,\mathbf{u}_t) \end{array}\right)^{\top} \left(\begin{array}{c} \mathbf{x}_t-\mathbf{x}\\ \mathbf{u}_ t- \mathbf{u}\end{array}\right) \\ &\quad {}\leq D(\theta, \zeta _{t-1}) - D(\theta, \zeta_t) \\ &\qquad {}+ \eta^2 \bigl(\bigl\|\nabla_\mathbf{x}F_t( \mathbf{x}_t,\mathbf{u}_ t) - \nabla_\mathbf{x}F_{t-1}(\mathbf{x}_{t-1}, \mathbf{u}_ {t-1})\bigr\|^2_{p,*} \bigr) \\ &\qquad {}+\eta^2 \bigl(\bigl\|\nabla_\mathbf{u}F_t( \mathbf{x}_t,\mathbf{u}_ t) - \nabla_\mathbf{u}F_{t-1}( \mathbf{x}_{t-1}, \mathbf{u}_ {t-1})\bigr\|^2_{q,*} \bigr) \\ &\qquad {}- \frac{\alpha}{2} \bigl(\|\mathbf{x}_t-\mathbf{z}_t \|^2_{p} +\|\mathbf{u}_ t-\mathbf{v}_t \|_q^2 + \|\mathbf{x}_t-\mathbf{z}_{t-1} \|_p^2 + \|\mathbf{u}_ t-\mathbf{v}_{t-1} \|_q^2 \bigr). \end{aligned}$$

Proof

The updates of (x t ,u t ) in Algorithm 8 can be seen as applying the updates in Lemma 2 with \(\theta_{t}=(\mathbf{x}_{t}; \mathbf{u}_{t})\) in place of x, \(\zeta_{t}=(\mathbf{z}_{t}; \mathbf{v}_{t})\) in place of z +, and \(\zeta_{t-1}=(\mathbf{z}_{t-1}; \mathbf{v}_{t-1})\) in place of z. Note that ω(θ) is an α=min(α 1,α 2)-strongly convex function with respect to the norm ∥θ∥. Then applying the results in Lemma 2 completes the proof. □

Applying the convexity of F t (x,u) in terms of x and the concavity of F t (x,u) in terms of u to the result in Lemma 3, we have
$$ \begin{aligned} &\eta \bigl( F_t( \mathbf{x}_t, \mathbf{u}) - F_t(\mathbf{x}, \mathbf{u}_ t) \bigr) \\ &\quad {}\leq D_1(\mathbf{x}, \mathbf{z}_{t-1}) -D_1(\mathbf{x},\mathbf{z}_t) + D_2(\mathbf{u}, \mathbf{v}_{t-1}) -D_2(\mathbf{u},\mathbf{v}_t) \\ &\qquad {}+ {\eta^2} \bigl\|\nabla\widehat{c}_t( \mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t-1}) + A_t^{\top} \mathbf{u}_ t - A_{t-1}^{\top} \mathbf{u}_ {t-1}\bigr\|_{p,*}^2 \\ &\qquad {}+ {\eta^2} \bigl\|\nabla\widehat{\phi}_t( \mathbf{u}_ t) - \nabla \widehat{\phi}_{t-1}(\mathbf{u}_ {t-1}) + A_t\mathbf{x}_t - A_{t-1}\mathbf{x}_{t-1} \bigr\|_{q,*}^2 \\ &\qquad {}- \frac{\alpha}{2} \bigl(\|\mathbf{x}_t-\mathbf{z}_{t-1} \|_p^2+\|\mathbf{x}_t-\mathbf{z}_{t} \|_p^2 \bigr)- \frac{\alpha}{2} \bigl(\| \mathbf{u}_ t-\mathbf{v}_{t-1}\|_q^2 + \|\mathbf{u}_t-\mathbf{v}_{t}\|_q^2 \bigr) \\ &\quad {}\leq D_1(\mathbf{x}, \mathbf{z}_{t-1}) -D_1(\mathbf{x}, \mathbf{z}_t) + D_2(\mathbf{u}, \mathbf{v}_{t-1}) -D_2(\mathbf{u},\mathbf{v}_t) \\ &\qquad {}+ 2{\eta^2} \bigl\|\nabla\widehat{c}_t( \mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t-1}) \bigr\|_{p,*} ^2+ 2\eta^2 \bigl\|A_t^{\top} \mathbf{u}_ t - A_{t-1}^{\top} \mathbf{u}_{t-1} \bigr\|_{p,*}^2 \\ &\qquad {}+2 {\eta^2} \bigl\|\nabla\widehat{\phi}_t( \mathbf{u}_ t) - \nabla \widehat{\phi}_{t-1}(\mathbf{u}_ {t-1}) \bigr\|_{q,*}^2 + 2\eta^2 \|A_t \mathbf{x}_t - A_{t-1}\mathbf{x}_{t-1} \|_{q,*}^2 \\ &\qquad {}- \frac{\alpha}{2} \bigl(\|\mathbf{x}_t-\mathbf{z}_{t-1} \|_p^2+\|\mathbf{x}_t-\mathbf{z}_{t} \|_p^2 \bigr)- \frac{\alpha}{2} \bigl(\| \mathbf{u}_ t-\mathbf{v}_{t-1}\|_q^2 + \|\mathbf{u}_t-\mathbf{v}_{t}\|_q^2 \bigr) \end{aligned} $$
(31)
The following lemma provides the tools needed to proceed with the bound.

Lemma 4

$$\begin{aligned} \bigl\|\nabla\widehat{c}_t(\mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t-1})\bigr\|_{p,*} ^2 \leq& 2 \bigl\|\nabla \widehat{c}_t(\mathbf{x}_t) - \nabla\widehat{c}_{t-1}( \mathbf{x}_{t})\bigr\| _{p,*} ^2 + 2L^2\| \mathbf{x}_t -\mathbf{x}_{t-1}\|_{p}^2 \\ \bigl\|\nabla\widehat{\phi}_t(\mathbf{u}_ t) - \nabla\widehat{\phi}_{t-1}(\mathbf{u}_ {t-1})\bigr\| _{q,*} ^2 \leq & 2 \bigl\|\nabla\widehat{\phi}_t(\mathbf{u}_ t) - \nabla\widehat{\phi} _{t-1}(\mathbf{u}_ {t})\bigr\|_{q,*} ^2 + 2L^2\|\mathbf{u}_ t -\mathbf{u}_ {t-1}\|_{q}^2 \\ \|A_t\mathbf{x}_t - A_{t-1}\mathbf{x}_{t-1} \|_{q,*}^2 \leq & 2R_1^2\|A_t - A_{t-1}\| _{p,q}^2 + 2\sigma^2 \| \mathbf{x}_t - \mathbf{x}_{t-1}\|_p^2 \\ \bigl\|A_t^{\top} \mathbf{u}_ t - A^{\top}_{t-1} \mathbf{u}_ {t-1}\bigr\|_{p,*}^2 \leq & 2R_2^2 \|A_t - A_{t-1}\|_{p,q}^2 + 2 \sigma^2 \|\mathbf{u}_ t - \mathbf{u}_ {t-1}\|_q^2 \end{aligned}$$

Proof

We prove the first and the third inequalities; the other two can be proved similarly.
$$\begin{aligned} & \bigl\|\nabla\widehat{c}_t(\mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t-1})\bigr\|_{p,*} ^2 \\ &\quad {}\leq2 \bigl\|\nabla\widehat{c}_t(\mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t})\bigr\|_{p,*} ^2 + 2 \bigl\|\nabla \widehat{c}_{t-1}(\mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t-1})\bigr\|_{p,*} ^2 \\ &\quad {}\leq2\bigl\|\nabla\widehat{c}_t(\mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t})\bigr\| _{p,*} ^2 + 2L^2\|\mathbf{x}_t-\mathbf{x}_{t-1}\|_{p} ^2 \end{aligned}$$
where we use the smoothness of \(\widehat{c}_{t}(\mathbf{x})\).
$$\begin{aligned} \|A_t\mathbf{x}_t - A_{t-1}\mathbf{x}_{t-1} \|_{q,*}^2 \leq& 2 \bigl\|(A_t - A_{t-1}) \mathbf{x}_t \bigr\|_{q,*}^2 + 2 \bigl\|A_{t-1}(\mathbf{x}_t - \mathbf{x}_{t-1})\bigr\|_{q,*}^2 \\ \leq & 2R_1^2\|A_t - A_{t-1} \|^2_{p,q} + 2\sigma^2 \|\mathbf{x}_t - \mathbf{x}_{t-1}\|_p^2 \end{aligned}$$
 □
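As a quick numeric sanity check of the first inequality in Lemma 4, consider illustrative 1-D quadratic costs ĉ_t(x)=½ a_t x², for which ∇ĉ_t(x)=a_t x and L=max_t a_t bounds the smoothness constant (the coefficients and grid points below are made up):

```python
# Verify: |grad c_t(x_t) - grad c_{t-1}(x_{t-1})|^2
#   <= 2 * |grad c_t(x_t) - grad c_{t-1}(x_t)|^2 + 2 * L^2 * |x_t - x_{t-1}|^2
a_prev, a_cur = 1.0, 1.5          # quadratic coefficients (illustrative)
L = max(a_prev, a_cur)            # smoothness constant for both costs
checks = []
for x_prev in (-1.0, 0.2, 0.8):
    for x_cur in (-0.3, 0.5):
        lhs = (a_cur * x_cur - a_prev * x_prev) ** 2
        rhs = (2 * (a_cur * x_cur - a_prev * x_cur) ** 2
               + 2 * L ** 2 * (x_cur - x_prev) ** 2)
        checks.append(lhs <= rhs + 1e-12)
```

The inequality is just the splitting ∥a+b∥² ≤ 2∥a∥²+2∥b∥² followed by smoothness of ĉ_{t−1}, so it holds for every pair of points.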

Lemma 5

For any \(\mathbf{x}\in \mathcal{P}\) and \(\mathbf{u}\in \mathcal{Q}\), we have
$$\begin{aligned} &\eta \bigl( F_t(\mathbf{x}_t, \mathbf{u}) - F_t(\mathbf{x}, \mathbf{u}_ t) \bigr) \\ &\quad {}\leq D_1(\mathbf{x}, \mathbf{z}_{t-1}) -D_1(\mathbf{x}, \mathbf{z}_t) + D_2(\mathbf{u}, \mathbf{v}_{t-1}) -D_2(\mathbf{u},\mathbf{v}_t) \\ &\qquad {}+4\eta^2 \bigl(\bigl\|\nabla\widehat{c}_t( \mathbf{x}_t) - \nabla \widehat{c}_{t-1}(\mathbf{x}_{t}) \bigr\|_{p,*} ^2 + \bigl\|\nabla\widehat{\phi}_t( \mathbf{u}_ t) - \nabla\widehat{\phi}_{t-1}(\mathbf{u}_ {t}) \bigr\|_{q,*}^2 \\ &\qquad {}+ \bigl(R_1^2+R_2^2 \bigr) \|A_t-A_{t-1}\|_{p,q}^2 \bigr) \\ &\qquad {}+4\eta^2\sigma^2 \|\mathbf{x}_t- \mathbf{x}_{t-1}\|_p^2 + 4\eta^2L^2 \|\mathbf{x}_t-\mathbf{x}_{t-1}\|_p^2 - \frac{\alpha}{2}\bigl(\|\mathbf{x}_t-\mathbf{z}_t\|_p^2+ \|\mathbf{x}_t-\mathbf{z}_{t-1}\|_p^2\bigr) \\ &\qquad {}+4\eta^2\sigma^2\|\mathbf{u}_ t- \mathbf{u}_ {t-1}\|_p^2 + 4\eta^2L^2 \|\mathbf{u}_t-\mathbf{u}_ {t-1}\|_q^2 - \frac{\alpha}{2}\bigl(\|\mathbf{u}_ t-\mathbf{v}_t\|_q^2+ \|\mathbf{u}_ t-\mathbf{v}_{t-1}\|_q^2\bigr). \end{aligned}$$

Proof

The lemma can be proved by combining the results in Lemma 4 and the inequality in (31).
$$\begin{aligned} &\eta \bigl( F_t(\mathbf{x}_t, \mathbf{u}) - F_t(\mathbf{x}, \mathbf{u}_ t) \bigr) \\ &\quad{}\leq D_1(\mathbf{x}, \mathbf{z}_{t-1}) -D_1(\mathbf{x},\mathbf{z}_t) + D_2(\mathbf{u}, \mathbf{v}_{t-1}) -D_2(\mathbf{u},\mathbf{v}_t) \\ &\qquad {}+ 2{\eta^2} \bigl\|\nabla\widehat{c}_t( \mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t-1}) \bigr\|_{p,*} ^2+ 2\eta^2 \bigl\|A_t^{\top} \mathbf{u}_ t - A_{t-1}^{\top} \mathbf{u}_{t-1} \bigr\|_{p,*}^2 \\ &\qquad {}+2 {\eta^2}\bigl\|\nabla\widehat{\phi}_t( \mathbf{u}_ t) - \nabla \widehat{\phi}_{t-1}(\mathbf{u}_ {t-1}) \bigr\|_{q,*}^2 + 2\eta^2\|A_t \mathbf{x}_t - A_{t-1}\mathbf{x}_{t-1}\|_{q,*}^2 \\ &\qquad {}- \frac{\alpha}{2} \bigl(\|\mathbf{x}_t-\mathbf{z}_{t-1} \|_p^2+\|\mathbf{x}_t-\mathbf{z}_{t} \|_p^2 \bigr)- \frac{\alpha}{2} \bigl(\| \mathbf{u}_ t-\mathbf{v}_{t-1}\|_q^2 + \|\mathbf{u}_t-\mathbf{v}_{t}\|_q^2 \bigr) \\ &\quad{}\leq D_1(\mathbf{x}, \mathbf{z}_{t-1}) -D_1(\mathbf{x}, \mathbf{z}_t) + D_2(\mathbf{u}, \mathbf{v}_{t-1}) -D_2(\mathbf{u}, \mathbf{v}_t) \\ &\qquad {}+ 4{\eta^2}\bigl\|\nabla\widehat{c}_t( \mathbf{x}_t) - \nabla\widehat{c}_{t-1}(\mathbf{x}_{t}) \bigr\|_{p,*} ^2+4{\eta^2}L^2\| \mathbf{x}_t-\mathbf{x}_{t-1}\|_p^2 + 4\eta ^2\sigma^2\|\mathbf{x}_t-\mathbf{x}_{t-1} \|_p^2 \\ &\qquad {}+ 4\eta^2R_1^2 \|A_t - A_{t-1}\|_{p,q}^2 + 4 \eta^2R_2^2\| A_t-A_{t-1} \|_{p,q}^2 \\ &\qquad {} +4 {\eta^2}\bigl\|\nabla\widehat{\phi}_t( \mathbf{u}_ t) - \nabla \widehat{\phi}_{t-1}(\mathbf{u}_ {t}) \bigr\|_{q,*}^2 + 4\eta^2L^2\| \mathbf{u}_ t - \mathbf{u}_ {t-1}\| _{q}^2+ 4 \eta^2\sigma^2\|\mathbf{u}_ t-\mathbf{u}_ {t-1} \|_q^2 \\ &\qquad {}- \frac{\alpha}{2} \bigl(\|\mathbf{x}_t-\mathbf{z}_{t-1} \|_p^2+\|\mathbf{x}_t-\mathbf{z}_{t} \|_p^2 \bigr)- \frac{\alpha}{2} \bigl(\| \mathbf{u}_ t-\mathbf{v}_{t-1}\|_q^2 + \|\mathbf{u}_t-\mathbf{v}_{t}\|_q^2 \bigr) \\ &\quad{}\leq D_1(\mathbf{x}, \mathbf{z}_{t-1}) -D_1(\mathbf{x}, \mathbf{z}_t) + D_2(\mathbf{u}, \mathbf{v}_{t-1}) -D_2(\mathbf{u}, \mathbf{v}_t) 
\\ &\qquad {}+4\eta^2 \bigl(\bigl\|\nabla\widehat{c}_t( \mathbf{x}_t) - \nabla \widehat{c}_{t-1}(\mathbf{x}_{t}) \bigr\|_{p,*} ^2 + \bigl\|\nabla\widehat{\phi}_t( \mathbf{u}_ t) - \nabla\widehat{\phi}_{t-1}(\mathbf{u}_ {t}) \bigr\|_{q,*}^2 \\ &\qquad {}+ \bigl(R_1^2+R_2^2 \bigr)\|A_t-A_{t-1}\|_{p,q}^2 \bigr) \\ &\qquad {}+4\eta^2\sigma^2 \|\mathbf{x}_t-\mathbf{x}_{t-1}\|_p^2 + 4\eta^2L^2 \|\mathbf{x}_t-\mathbf{x}_{t-1}\|_p^2 - \frac{\alpha}{2}\bigl(\|\mathbf{x}_t-\mathbf{z}_t\|_p^2+ \|\mathbf{x}_t-\mathbf{z}_{t-1}\|_p^2\bigr) \\ &\qquad{}+4\eta^2\sigma^2\|\mathbf{u}_ t- \mathbf{u}_ {t-1}\|_p^2 + 4\eta^2L^2 \|\mathbf{u}_t-\mathbf{u}_ {t-1}\|_q^2 - \frac{\alpha}{2}\bigl(\|\mathbf{u}_ t-\mathbf{v}_t\|_q^2+ \|\mathbf{u}_ t-\mathbf{v}_{t-1}\|_q^2\bigr). \end{aligned}$$
 □

Proof of Theorem 5

Summing the inequalities in Lemma 5 over t=1,…,T, applying the inequality in (17) twice, and using \(\eta \leq\sqrt{\alpha}/(4\sqrt{\sigma^{2}+L^{2}})\), we have
$$\begin{aligned} &\sum_{t=1}^TF_t( \mathbf{x}_t, \mathbf{u})-\sum_{t=1}^TF_t( \mathbf{x},\mathbf{u}_ t) \\ &\quad {} \leq4\eta{\mathrm{EGV}}_{T,p,q} + \frac{M_1+M_2}{\eta} \\ &\quad {}= 4\sqrt{M_1+M_2}\max \biggl(2\sqrt{ \frac{(M_1+M_2)(\sigma^2+L^2)}{\alpha }}, \sqrt{\mathrm{EGV}_{T,p, q}} \biggr). \end{aligned}$$
(32)
We complete the proof by taking \(\mathbf{x}^{*}=\arg\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^{T}F_{t}(\mathbf{x},\mathbf{u}_{t})\) and \(\mathbf{u}^{*}= \arg\max_{\mathbf{u}\in \mathcal{Q}}\sum_{t=1}^{T}F_{t}(\mathbf{x}_{t},\mathbf{u})\). □

As an immediate result of Theorem 5, the following corollary states a regret bound for non-smooth optimization with a fixed non-smooth component that can be written in a max structure, i.e., \(c_{t}(\mathbf{x}) =\widehat{c}_{t}(\mathbf{x}) + g(\mathbf{x})\) with \(g(\mathbf{x})=\max_{\mathbf{u}\in \mathcal{Q}}\langle A\mathbf{x}, \mathbf{u}\rangle- \widehat{\phi}(\mathbf{u})\).

Corollary 5

Let \(c_{t}(\mathbf{x}) = \widehat{c}_{t}(\mathbf{x}) + g(\mathbf{x}), t=1,\ldots, T\) be a sequence of non-smooth functions, where \(g(\mathbf{x})=\max_{\mathbf{u}\in\mathcal{Q}}\langle A\mathbf{x}, \mathbf{u}\rangle- \widehat{\phi}(\mathbf{u})\), and let the gradual variation EGV T be defined as in (19) w.r.t. the dual norm ∥⋅∥ p,∗. Assume the \(\widehat{c}_{t}(\mathbf{x})\) are L-smooth functions w.r.t. ∥⋅∥ p and that the domains \(\mathcal{P}, \mathcal{Q}\) and the matrix A satisfy the boundedness conditions in (29). If we set \(\eta=\min (\sqrt{M_{1}+M_{2}}/(2\sqrt{\mathrm{EGV}_{T}}), \sqrt{\alpha}/(4\sqrt{\sigma ^{2}+L^{2}}) )\) in Algorithm  8, then we have the following regret bound
$$\begin{aligned} &\sum_{t=1}^Tc_t( \mathbf{x}_t)- \min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Tc_t( \mathbf{x}) \\ &\quad{}\leq4\sqrt{M_1+M_2}\max \biggl(2\sqrt{ \frac{(M_1+M_2)(\sigma ^2+L^2)}{\alpha}}, \sqrt{\mathrm{EGV}_{T}} \biggr) + V (g, \mathbf{x}_{1:T}), \end{aligned}$$
where \(\widehat{\mathbf{x}}_{T}= \sum_{t=1}^{T}\mathbf{x}_{t}/T\) and \(V(g, \mathbf{x}_{1:T})=\sum_{t=1}^{T}|g(\mathbf{x}_{t})-g(\widehat{\mathbf{x}}_{T})|\) measures the variation in the non-smooth component.

Proof

In the case of a fixed non-smooth component, the gradual variation defined in (30) reduces to the one defined in (19) w.r.t. the dual norm ∥⋅∥ p,∗. By using the bound in (32) and noting that \(c_{t}(\mathbf{x})=\max_{\mathbf{u}\in \mathcal{Q}}F_{t}(\mathbf{x},\mathbf{u})\geq F_{t}(\mathbf{x}, \mathbf{u}_{t})\), we have
$$\begin{aligned} & \sum_{t=1}^T \bigl( \widehat{c}_t(\mathbf{x}_t) + \langle A\mathbf{x}_t, \mathbf{u}\rangle - \widehat{\phi}(\mathbf{u}) \bigr) \\ &\quad {} \leq\sum_{t=1}^T c_t(\mathbf{x}) + 4\sqrt{M_1+M_2}\max \biggl(2\sqrt{ \frac{(M_1+M_2)(\sigma ^2+L^2)}{\alpha}}, \sqrt{\mathrm{EGV}_{T}} \biggr). \end{aligned}$$
Therefore
$$\begin{aligned} &\sum_{t=1}^T \bigl(\widehat{c}_t(\mathbf{x}_t) + g(\,\widehat{\mathbf{x}}_T) \bigr) \\ &\quad {} \leq \sum _{t=1}^T c_t(\mathbf{x}) + 4\sqrt{M_1+M_2}\max \biggl(2\sqrt{ \frac{(M_1+M_2)(\sigma ^2+L^2)}{\alpha}}, \sqrt{\mathrm{EGV}_{T}} \biggr). \end{aligned}$$
We complete the proof by adding g(x t ) to \(\widehat{c}_{t}(\mathbf{x}_{t})\) to obtain c t (x t ), and moving the additional term \(\sum_{t=1}^{T}(g(\,\widehat{\mathbf{x}}_{T})-g(\mathbf{x}_{t}))\) to the right-hand side. □

Remark 7

Note that the regret bound in Corollary 5 has an additional term V(g,x 1:T ) compared to the regret bound in Corollary 4; this is the price paid for the reduced computational cost of not having to solve a composite gradient mapping.

To see an application of Theorem 5 to online non-smooth optimization with time-varying non-smooth components, let us consider the example of online classification with hinge loss. At each trial, upon receiving an example x t , we make a prediction based on the current model w t , i.e., \(\widehat{y}_{t}=\mathbf{w}_{t}^{\top} \mathbf{x}_{t}\); then we receive the true label of x t , denoted by y t ∈{+1,−1}. The goal is to minimize the total number of mistakes over the horizon, \(M_{T}=\sum_{t=1}^{T}{I}(\widehat{y}_{t}y_{t}\leq0)\). Here we are interested in a scenario in which the data sequence (x t ,y t ),t=1,…,T has a small gradual variation in terms of y t x t . To obtain such a gradual-variation mistake bound, we apply Algorithm 8 with a small change. At the beginning of each trial, we first make a prediction \(\widehat{y}_{t}=\mathbf{w}_{t}^{\top} \mathbf{x}_{t}\); if we make a mistake, i.e., \(I(\widehat{y}_{t}y_{t}\leq0)=1\), then we update the auxiliary primal-dual pair \((\mathbf{w}'_{t}, \beta_{t})\), analogous to (z t ,v t ) in Algorithm 8, and the primal-dual pair (w t+1,α t+1), analogous to (x t+1,u t+1) in Algorithm 8, given explicitly as follows:
$$\begin{aligned} \beta_{t} =& \prod_{[0, 1]} \bigl( \beta_{t-1} +\eta\bigl(1-y_{t}\mathbf{w}_{t}^{\top}\mathbf{x}_{t}\bigr) \bigr),\qquad \mathbf{w}'_{t} = \prod_{\|\mathbf{w}\|_2\leq R} \bigl(\mathbf{w}'_{t-1} + \eta\alpha _{t}y_{t} \mathbf{x}_{t}\bigr) \\ \alpha_{t+1} =& \prod_{[0, 1]} \bigl( \beta_t +\eta\bigl(1-y_t\mathbf{w}_{t}^{\top}\mathbf{x}_t\bigr) \bigr),\qquad \mathbf{w}_{t+1} = \prod_{\|\mathbf{w}\|_2\leq R}\bigl( \mathbf{w}'_t + \eta\alpha _ty_t \mathbf{x}_t\bigr). \end{aligned}$$
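The updates above can be sketched as follows, with Π_{[0,1]} implemented as clipping to the interval and Π_{∥w∥₂≤R} as projection onto the L2 ball (a minimal illustration; the data, step size, and function names are made up):

```python
def project_interval(b, lo=0.0, hi=1.0):
    """Projection onto [lo, hi] (clipping)."""
    return min(hi, max(lo, b))

def project_ball(w, R):
    """Projection onto the L2 ball of radius R."""
    norm = sum(v * v for v in w) ** 0.5
    return list(w) if norm <= R else [R * v / norm for v in w]

def mistake_step(w_t, w_aux, alpha_t, beta_prev, x_t, y_t, eta, R):
    """One primal-dual update after a mistake on (x_t, y_t)."""
    margin = y_t * sum(wi * xi for wi, xi in zip(w_t, x_t))
    beta_t = project_interval(beta_prev + eta * (1.0 - margin))
    w_aux_t = project_ball([v + eta * alpha_t * y_t * xi
                            for v, xi in zip(w_aux, x_t)], R)
    alpha_next = project_interval(beta_t + eta * (1.0 - margin))
    w_next = project_ball([v + eta * alpha_t * y_t * xi
                           for v, xi in zip(w_aux_t, x_t)], R)
    return w_next, alpha_next, w_aux_t, beta_t

w, w_aux, alpha, beta = [0.0, 0.0], [0.0, 0.0], 1.0, 0.0
w, alpha, w_aux, beta = mistake_step(w, w_aux, alpha, beta,
                                     x_t=[1.0, -0.5], y_t=+1, eta=0.3, R=1.0)
```

The projections keep the dual variables in [0,1] and the primal iterates in the ball of radius R, matching the constraints in the updates.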
Without loss of generality, we let (x t ,y t ),t=1,…,M T denote the examples that are predicted incorrectly. The F t function is written as \(F_{t}(\mathbf{w}, \alpha) = \alpha(1-y_{t}\mathbf{w}^{\top}\mathbf{x}_{t})\). Then for a total sequence of T examples, assuming ∥x t ∥ 2≤1 and \(\eta\leq{1}/(2\sqrt{2})\), we have the following bound
$$ \sum_{t=1}^{M_T}F_t( \mathbf{w}_{t}, \alpha) \leq\sum_{t=1}^{M_T} \ell\bigl(y_t\mathbf{w}^{\top} \mathbf{x}_t\bigr) + \eta\sum _{t=0}^{M_T-1}\bigl(R^2+1\bigr) \|y_{t+1}\mathbf{x}_{t+1} - y_t\mathbf{x}_t \|_2^2 + \frac{R^2 + \alpha^2}{2\eta}. $$
Since \(y_{t}\mathbf{w}_{t}^{\top} \mathbf{x}_{t}\) is nonpositive for the incorrectly predicted examples, setting α=1 in the above inequality gives
$$\begin{aligned} M_T \leq&\sum_{t=1}^{M_T}\ell \bigl(y_t\mathbf{w}^{\top} \mathbf{x}_t\bigr) + \eta\sum _{t=0}^{M_T-1}\bigl(R^2+1\bigr) \|y_{t+1}\mathbf{x}_{t+1} - y_t\mathbf{x}_t \|_2^2 + \frac{R^2 + 1}{2\eta} \\ \leq&\sum_{t=1}^{M_T}\ell \bigl(y_t\mathbf{w}^{\top} \mathbf{x}_t\bigr) + \sqrt{2} \bigl(R^2+1\bigr)\max(2, \sqrt{\mathrm{EGV}_{T,2}}). \end{aligned}$$
This yields a gradual-variation mistake bound, where \(\sqrt{\mathrm{EGV}_{T,2}}\) measures the gradual variation in the incorrectly predicted examples. To end the discussion, we note that a small gradual variation of y t x t may arise in time series classification. For instance, if x t represents some medical measurements of a person and y t indicates whether the person has a disease, then since health conditions usually change slowly, the gradual variation of y t x t is expected to be small. Similarly, if x t are sensor measurements of an equipment and y t indicates whether the equipment fails, we would also observe a small gradual variation of the sequence y t x t over a time period.

4 Variation bound for online bandit convex optimization

Online convex optimization becomes more challenging when the learner receives only partial feedback about the cost functions. One common scenario of partial feedback is that the learner only observes the cost c t (x t ) at the predicted point x t , without observing the entire cost function c t (x). This setup is usually referred to as the bandit setting, and the related online learning problem is called online bandit convex optimization.

Before describing our result for the bandit setting, we give a quick review of the literature on online bandit convex optimization. Flaxman et al. (2005) presented a modified gradient descent approach for online bandit convex optimization that attains an O(T 3/4) regret bound. The key idea of their algorithm is to compute a stochastic approximation of the gradient of the cost functions from a single-point evaluation. This regret bound was later improved to O(T 2/3) (Awerbuch and Kleinberg 2004; Dani and Hayes 2006) for online bandit linear optimization. More recently, Dani et al. (2007) proposed an algorithm for online bandit linear optimization with the optimal regret bound \(O(\mathit{poly}(d)\sqrt {T})\) based on a multi-armed bandit algorithm; its key disadvantage is that it is not computationally efficient. Abernethy et al. (2008) presented an efficient randomized algorithm with an optimal regret bound \(O(\mathit{poly}(d)\sqrt{T})\) that exploits the properties of self-concordant barrier regularization. For online bandit convex optimization, Agarwal et al. (2010) proposed optimal algorithms in a multi-point bandit setting, in which multiple points can be queried for the cost values. With multiple queries, they show that a modified online gradient descent algorithm gives an \(O(\sqrt{T})\) expected regret bound.

Recently, Hazan and Kale (2009) extended the FTRL algorithm to online bandit linear optimization and obtained a variation-based regret bound in the form of \(O(\mathit{poly}(d)\sqrt{\mathrm{VAR}_{T}\log(T)}+\mathit{poly}(d\log(T)))\), where VAR T is the total variation of the cost vectors. We continue this line of work by proposing algorithms for general online bandit convex optimization with a variation-based regret bound. We present a deterministic algorithm for online bandit convex optimization by extending Algorithm 4 to a multi-point bandit setting, and prove the variation-based regret bound, which is optimal when the variation is independent of the number of trials. In our bandit setting, we assume we are allowed to query d+1 points around the decision point x t . We pose the problem of further reducing the number of point evaluations to a constant number that is independent of the dimension as an open problem.

4.1 A deterministic algorithm for online bandit convex optimization

To develop a variation bound for online bandit convex optimization, we follow Agarwal et al. (2010) by considering the multi-point bandit setting, where at each trial the player is allowed to query the cost functions at multiple points. We propose a deterministic algorithm to compete against the completely adaptive adversary that can choose the cost function c t (x) with the knowledge of x 1,…,x t . To approximate the gradient ∇c t (x t ), we query the cost function to obtain the values c t (x t ) and c t (x t +δ e i ),i=1,…,d, where e i is the ith standard basis vector in \(\mathbb{R}^{d}\). Then we compute the estimate of the gradient ∇c t (x t ) by
$$ g_t(\mathbf{x}_t)= \frac{1}{\delta}\sum _{i=1}^d \bigl(c_t(\mathbf{x}_t+ \delta \mathbf{e}_i)- c_t(\mathbf{x}_t) \bigr)\mathbf{e}_i. $$
It can be shown (Agarwal et al. 2010) that, under the smoothness assumption in (8),
$$ \bigl\|g_t(\mathbf{x}_t)-\nabla c_t( \mathbf{x}_t)\bigr\|_2\leq\frac{\sqrt{d}L\delta}{2}. $$
(33)
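A sketch of the estimator and a numeric check of (33), using the illustrative smooth cost c(x)=∥x∥₂² (which has L=2); the function names and test point are made up:

```python
def grad_estimate(c, x, delta):
    """g_i = (c(x + delta * e_i) - c(x)) / delta, using d + 1 evaluations."""
    base = c(x)
    g = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += delta
        g.append((c(shifted) - base) / delta)
    return g

c = lambda x: sum(v * v for v in x)       # c(x) = ||x||_2^2, grad c(x) = 2x, L = 2
x, delta, L = [0.4, -0.7, 0.1], 0.01, 2.0
g = grad_estimate(c, x, delta)
err = sum((gi - 2 * xi) ** 2 for gi, xi in zip(g, x)) ** 0.5
bound = (len(x) ** 0.5) * L * delta / 2   # right-hand side of (33)
```

For a quadratic cost the forward-difference error is exactly δ per coordinate, so the bound in (33) is attained with equality in this example.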
To prove the regret bound, besides the smoothness assumption on the cost functions and the boundedness assumption \(\mathcal{P}\subseteq\mathcal{B}\) on the domain, we further assume that (i) there exists r≤1 such that \(r\mathcal{B}\subseteq \mathcal{P}\subseteq\mathcal{B}\), and (ii) the cost functions themselves are Lipschitz continuous, i.e., there exists a constant G such that
$$ \bigl|c_t(\mathbf{x})-c_t(\mathbf{z})\bigr|\leq G\|\mathbf{x}-\mathbf{z}\|_2,\quad \forall \mathbf{x}, \mathbf{z}\in \mathcal{P}, \forall t. $$
For our purpose, we define another gradual variation of cost functions by
$$ \mathrm{EGV}^{c}_T=\sum_{t=0}^{T-1} \max_{\mathbf{x}\in \mathcal{P}} \bigl|c_{t+1}(\mathbf{x})-c_t(\mathbf{x})\bigr|. $$
(34)
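The maximization over 𝒫 in (34) can be approximated on a finite grid for illustration. A sketch with the illustrative linear costs c_t(x)=a_t x on 𝒫=[0,1], where the per-step maximum |c_{t+1}(x)−c_t(x)| is exactly |a_{t+1}−a_t|, attained at x=1 (coefficients and grid resolution are made up):

```python
def egv_cost(costs, grid):
    """Approximate (34): sum over t of max_x |c_{t+1}(x) - c_t(x)| on a grid."""
    return sum(max(abs(c_next(x) - c_cur(x)) for x in grid)
               for c_cur, c_next in zip(costs, costs[1:]))

coeffs = [1.0, 1.2, 0.9, 1.1]                # slowly drifting cost coefficients
costs = [lambda x, a=a: a * x for a in coeffs]
grid = [i / 100.0 for i in range(101)]       # finite approximation of P = [0, 1]
egv = egv_cost(costs, grid)                  # exact value here: 0.2 + 0.3 + 0.2
```

Slowly drifting coefficients give a small gradual variation, which is exactly the regime where the bound in Theorem 6 improves on worst-case rates.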
Unlike the gradual variation defined in (7), which uses the gradients of the cost functions, the gradual variation in (34) is defined via the values of the cost functions; the reason is that in the bandit setting we only have access to point evaluations of the cost functions. The following theorem states the regret bound for Algorithm 9.
Algorithm 9

Deterministic Online Bandit Convex Optimization

Theorem 6

Let c t (⋅),t=1,…,T be a sequence of G-Lipschitz continuous convex functions, and their gradients are L-Lipschitz continuous. By setting
$$\displaystyle\delta= \sqrt{\frac{4d\max(\sqrt{2}G,\sqrt{{\mathrm{EGV}}^{c}_T})}{(\sqrt{d}L+G(1+1/r))T}}, $$
$$\displaystyle\eta= \frac{\delta}{4d}\min \biggl\{\frac{1}{\sqrt{2}}, \frac{G}{\sqrt{\mathrm{EGV}^{c}_T}} \biggr\} ,$$
and α=δ/r, we have the following regret bound for Algorithm  9
$$ \sum_{t=1}^Tc_t( \mathbf{x}_t)- \min\limits _{\mathbf{x}\in \mathcal{P}} \sum_{t=1}^T c_t(\mathbf{x}) \leq4\sqrt{\max \bigl(\sqrt{2}G, \sqrt{{\mathrm{EGV}}^{c}_T} \bigr)d (dL+G/r )T}. $$

Remark 8

Similar to the regret bound in Agarwal et al. (2010, Theorem 9), Algorithm 9 also gives the optimal regret bound \(O(\sqrt{T})\) when the variation is independent of the number of trials. Our regret bound has a better dependence on d (i.e., d) compared with the regret bound in Agarwal et al. (2010) (i.e., d 2).

Proof

Let \(h_{t}(\mathbf{x})=c_{t}(\mathbf{x})+(g_{t}(\mathbf{x}_{t})-\nabla c_{t}(\mathbf{x}_{t}))^{\top} \mathbf{x}\). It is easily seen that ∇h t (x t )=g t (x t ). By Lemma 2, we have for any \(\mathbf{z}\in(1-\alpha)\mathcal{P}\),
$$\begin{aligned} \frac{\eta}{G}\nabla h_t(\mathbf{x}_t)^{\top}( \mathbf{x}_t - \mathbf{z}) \leq& \frac{1}{2} \bigl(\|\mathbf{z}-\mathbf{z}_{t-1} \|_2^2 - \|\mathbf{z}-\mathbf{z}_{t}\|_2^2 \bigr) + \frac{\eta^2}{G^2} \bigl\|g_t(\mathbf{x}_t) - g_{t-1}(\mathbf{x}_{t-1})\bigr\|_2^2 \\ &{}- \frac{1}{2} \bigl(\|\mathbf{x}_t - \mathbf{z}_{t-1} \|_2^2+\|\mathbf{x}_t - \mathbf{z}_t \|_2^2 \bigr) \end{aligned}$$
Summing over t=1,…,T, we have
$$\begin{aligned} &\sum_{t=1}^T\frac{\eta}{G}\nabla h_t(\mathbf{x}_t)^{\top}(\mathbf{x}_t - \mathbf{z}) \\ &\quad {}\leq \frac {\|\mathbf{z}-\mathbf{z}_0\|_2^2}{2} + \sum_{t=1}^T \frac{\eta^2}{G^2} \bigl\|g_t(\mathbf{x}_t) - g_{t-1}( \mathbf{x}_{t-1})\bigr\|_2^2 \\ &\qquad {}- \sum_{t=1}^T \frac{1}{2} \bigl( \|\mathbf{x}_t - \mathbf{z}_{t-1}\|_2^2+\| \mathbf{x}_t - \mathbf{z}_t\| _2^2 \bigr) \\ &\quad {}\leq \frac{\|\mathbf{z}-\mathbf{z}_0\|_2^2}{2} + \sum_{t=1}^T \frac{\eta^2}{G^2} \bigl\| g_t(\mathbf{x}_t) - g_{t-1}( \mathbf{x}_{t-1})\bigr\|_2^2 - \sum _{t=1}^T\frac{1}{4}\|\mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 \\ &\quad {}\leq\frac{1}{2}+ \sum_{t=1}^T \frac{\eta^2}{G^2} \bigl\|g_t(\mathbf{x}_t) - g_{t-1}( \mathbf{x}_{t-1})\bigr\|_2^2 - \sum _{t=1}^T\frac{1}{4}\|\mathbf{x}_t- \mathbf{x}_{t-1}\|_2^2 \\ &\quad {}\leq\frac{1}{2}+ \sum_{t=1}^T \frac{2\eta^2}{G^2} \bigl\|g_t(\mathbf{x}_t) - g_{t}(\mathbf{x}_{t-1})\bigr\|_2^2 + \sum_{t=1}^T\frac{2\eta^2}{G^2} \bigl\|g_t(\mathbf{x}_{t-1}) - g_{t-1}(\mathbf{x}_{t-1})\bigr\|_2^2 \\ &\qquad {}- \sum_{t=1}^T\frac{1}{4}\| \mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 \\ &\quad {} \leq\frac{1}{2} + \frac{2\eta^2}{\delta^2G^2}\sum_{t=1}^T \Biggl \Vert \sum_{i=1}^d \bigl(c_t(\mathbf{x}_t+\delta \mathbf{e}_i)-c_t( \mathbf{x}_{t-1}+\delta \mathbf{e}_i)\bigr)\mathbf{e}_i- \bigl(c_t(\mathbf{x}_t)-c_t(\mathbf{x}_{t-1})\bigr) \mathbf{e}_i\Biggr \Vert _2^2 \\ &\quad {}+\frac{2\eta^2}{\delta^2G^2}\sum_{t=1}^T\Biggl \Vert \sum_{i=1}^d \bigl(c_t( \mathbf{x}_{t-1}+\delta \mathbf{e}_i)-c_{t-1}(\mathbf{x}_{t-1}+ \delta \mathbf{e}_i)\bigr)\mathbf{e}_i -\bigl(c_t(\mathbf{x}_{t-1})-c_{t-1}(\mathbf{x}_{t-1})\bigr)\mathbf{e}_i\Biggr \Vert _2^2 \\ &\quad {}- \sum_{t=1}^T\frac{1}{4}\| \mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2, \end{aligned}$$
where the second inequality follows from (17). Next, we bound the middle two terms on the right-hand side of the above inequality.
$$\begin{aligned} &\sum_{t=1}^T\Biggl \Vert \sum _{i=1}^d \bigl(c_t(\mathbf{x}_t+ \delta \mathbf{e}_i)-c_t(\mathbf{x}_{t-1}+\delta \mathbf{e}_i) \bigr)\mathbf{e}_i- \bigl(c_t(\mathbf{x}_t)-c_t( \mathbf{x}_{t-1})\bigr)\mathbf{e}_i\Biggr \Vert _2^2 \\ &\quad {}\leq\sum_{t=1}^T2d\sum _{i=1}^d \bigl(\bigl|c_t(\mathbf{x}_t+ \delta \mathbf{e}_i)-c_t(\mathbf{x}_{t-1}+\delta \mathbf{e}_i)\bigr|^2 + \bigl|c_t(\mathbf{x}_t)-c_t( \mathbf{x}_{t-1})\bigr|^2 \bigr) \\ &\quad {}\leq\sum_{t=1}^T4d^2 G^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2, \end{aligned}$$
and
$$\begin{aligned} &\sum_{t=1}^T\Biggl \Vert \sum _{i=1}^d \bigl(c_t(\mathbf{x}_{t-1}+ \delta \mathbf{e}_i)-c_{t-1}(\mathbf{x}_{t-1}+\delta \mathbf{e}_i)\bigr)\mathbf{e}_i -\bigl(c_t( \mathbf{x}_{t-1})-c_{t-1}(\mathbf{x}_{t-1})\bigr)\mathbf{e}_i\Biggr \Vert _2^2 \\ &\quad {}\leq\sum_{t=1}^T2d\sum _{i=1}^d \bigl(\bigl|c_t(\mathbf{x}_{t-1}+ \delta \mathbf{e}_i)-c_{t-1}(\mathbf{x}_{t-1}+\delta \mathbf{e}_i)\bigr|^2 + \bigl|c_t(\mathbf{x}_{t-1})-c_{t-1}( \mathbf{x}_{t-1})\bigr|^2 \bigr) \\ &\quad {}\leq\sum_{t=1}^T4d^2 \max _{\mathbf{x}\in \mathcal{P}} \bigl|c_t(\mathbf{x}) - c_{t-1}( \mathbf{x})\bigr|^2. \end{aligned}$$
Then we have
$$\begin{aligned} &\sum_{t=1}^T\frac{\eta}{G}\nabla h_t(\mathbf{x}_t)^{\top}(\mathbf{x}_t - \mathbf{z}) \\ &\quad {} \leq \frac {1}{2} + \frac{8d^2\eta^2}{\delta^2}\sum_{t=1}^T \|\mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 + \frac{8d^2\eta^2}{\delta^2G^2}\sum_{t=1}^T\max _{\mathbf{x}\in \mathcal{P}} \bigl|c_t(\mathbf{x}) - c_{t-1}(\mathbf{x})\bigr|^2 \\ &\qquad {}- \sum_{t=1}^T\frac{1}{4}\| \mathbf{x}_t-\mathbf{x}_{t-1}\|_2^2 \\ &\quad {}\leq \frac{1}{2} +\frac{8d^2\eta^2}{\delta^2G^2}\sum_{t=1}^T \max_{\mathbf{x}\in \mathcal{P}}\bigl|c_t(\mathbf{x}) - c_{t-1}( \mathbf{x})\bigr|^2 \end{aligned}$$
where the last inequality follows from \(\eta\leq\delta/(4\sqrt{2}d)\). Then, by using the convexity of h t (x) and dividing both sides by η/G, we have
$$ \sum_{t=1}^Th_t( \mathbf{x}_t) -\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^Th_t\bigl((1-\alpha)\mathbf{x}\bigr) \leq\frac {G}{2\eta} + \frac{8\eta d^2}{G\delta^2}\mathrm{EGV}^{c}_T \leq\frac {4d}{\delta}\max \Bigl(\sqrt{2}G, \sqrt{\mathrm{EGV}^{c}_T}\, \Bigr) $$
Following the proof of Theorem 8 in Agarwal et al. (2010), we have
$$\begin{aligned} &\sum_{t=1}^Tc_t(\mathbf{x}_t) -\sum _{t=1}^T c_t(\mathbf{x}) \\ &\quad {} \leq\sum _{t=1}^T h_t(\mathbf{x}_t) -\sum_{t=1}^T h_t(\mathbf{x}) + \sum _{t=1}^T \bigl(c_t(\mathbf{x}_t) - h_t(\mathbf{x}_t) - c_t(\mathbf{x}) + h_t(\mathbf{x})\bigr) \\ &\quad {}\leq\sum_{t=1}^T h_t( \mathbf{x}_t) -\sum_{t=1}^T h_t(\mathbf{x}) + \sum_{t=1}^T \bigl(g_t(\mathbf{x}_t)- \nabla c_t(\mathbf{x}_t) \bigr)^{\top}(\mathbf{x}-\mathbf{x}_t) \\ &\quad {}\leq\sum_{t=1}^T h_t( \mathbf{x}_t) -\sum_{t=1}^T h_t(\mathbf{x}) + \sqrt{d}L\delta T \end{aligned}$$
where the last inequality follows from the following facts:
$$\begin{aligned} &\bigl\|g_t(\mathbf{x}_t) - \nabla c_t(\mathbf{x}_t) \bigr\|_2\leq\frac{\sqrt{d}L\delta}{2} \\ & \|\mathbf{x}-\mathbf{x}_t\|_2\leq2 \end{aligned}$$
Then we have
$$ \sum_{t=1}^Tc_t( \mathbf{x}_t) -\min_{\mathbf{x}\in \mathcal{P}}\sum_{t=1}^T c_t\bigl((1-\alpha)\mathbf{x}\bigr)\leq \frac{4d}{\delta}\max \Bigl( \sqrt{2}G, \sqrt{\mathrm{EGV}^{c}_T} \,\Bigr) + \sqrt{d}L\delta T $$
By the Lipschitz continuity of \(c_t(\mathbf{x})\), we have
$$ \sum_{t=1}^Tc_t \bigl((1-\alpha)\mathbf{x}\bigr)\leq\sum_{t=1}^T c_t(\mathbf{x}) + G \alpha T $$
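This bound follows from the \(G\)-Lipschitz continuity of \(c_t\) together with the domain being contained in the unit ball (consistent with \(\|\mathbf{x}-\mathbf{x}_t\|_2\leq 2\) above): for each \(t\),

```latex
\[
c_t\bigl((1-\alpha)\mathbf{x}\bigr) - c_t(\mathbf{x})
\;\leq\; G\bigl\|(1-\alpha)\mathbf{x}-\mathbf{x}\bigr\|_2
\;=\; G\alpha\|\mathbf{x}\|_2
\;\leq\; G\alpha,
\]
```

and summing over the \(T\) rounds gives the \(G\alpha T\) term.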
Then we get
$$ \sum_{t=1}^Tc_t( \mathbf{x}_t)-\min\limits _{\mathbf{x}\in \mathcal{P}} \sum_{t=1}^T c_t(\mathbf{x}) \leq \frac{4d}{\delta}\max \Bigl(\sqrt{2}G, \sqrt {\mathrm{EVAR}^{c}_T} \,\Bigr)+ \delta\sqrt{d}LT + \alpha GT $$
Plugging the stated values of δ and α completes the proof. □

5 Conclusions and discussions

In this paper, we proposed two novel algorithms for online convex optimization that bound the regret by the gradual variation of the cost functions. The first algorithm is an improvement of the FTRL algorithm, and the second is based on the prox method. Both algorithms maintain two sequences of solution points, a sequence of decision points and a sequence of searching points, and share the same order of regret bound up to a constant. The prox method only requires keeping track of a single gradient of each cost function, while the improved FTRL algorithm needs to evaluate the gradient of each cost function at two points and maintain a sum of up-to-date gradients. We also extended the prox method to a general framework that yields a gradual variation bound with the variation defined by a general norm, and to a multi-point bandit setting. We discussed several special cases, including online linear optimization, prediction with expert advice, and online strictly convex optimization. We also developed a simplified prox method using a composite gradient mapping for non-smooth optimization with a fixed non-smooth component, and a primal-dual prox method for non-smooth optimization whose non-smooth component admits a max structure.
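As a concrete illustration of the two-sequence idea (a minimal Euclidean-norm sketch, not the authors' exact algorithm; the projection set, step size, and function names are illustrative assumptions), the prox method predicts with the previous round's gradient and then corrects the searching point with the newly observed one:

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto the ball {x : ||x||_2 <= radius}
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def prox_method(grad_oracles, eta=0.1, dim=2, radius=1.0):
    """Euclidean prox-method sketch with two sequences of points.

    grad_oracles: list of callables, grad_oracles[t](x) = gradient of c_t at x.
    Maintains a searching point z_t and a decision point x_t; only one
    gradient of each cost function is evaluated and kept per round.
    """
    z = np.zeros(dim)        # searching point z_0
    g_prev = np.zeros(dim)   # convention: gradient of c_0 is zero
    decisions = []
    for grad in grad_oracles:
        # decision point: an "optimistic" step using last round's gradient
        x = project_ball(z - eta * g_prev, radius)
        g_new = grad(x)
        # searching point: corrected step using the newly observed gradient
        z = project_ball(z - eta * g_new, radius)
        g_prev = g_new
        decisions.append(x)
    return decisions
```

When consecutive gradients change slowly, the optimistic step is nearly exact, which is the mechanism behind a gradual-variation regret bound.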

Finally, it was brought to our attention, as we were preparing the final version of the present work, that Chiang et al. (2013) published a paper at the Conference on Learning Theory in 2013 which extends the prox method to a two-point bandit setting and achieves, in expectation, a regret bound similar to that of the full-information setting, i.e., \(O (d^{2}\sqrt{\mathrm{EGV}_{T,2}\ln T} )\) for smooth convex cost functions and \(O(d^{2}\ln(\mathrm{EGV}_{T,2}+\ln T))\) for smooth and strongly convex cost functions, where \(\mathrm{EGV}_{T,2}\) is the gradual variation defined on the gradients of the cost functions. We would like to make a thought-provoking comparison between our regret bound and theirs for online bandit convex optimization with smooth cost functions. First, the gradual variation in our bandit setting is defined on the values of the cost functions, in contrast to theirs, which is defined on the gradients. Second, we query the cost function d times per round in contrast to their two queries; as a tradeoff, our regret bound has a better dependence on the dimension (i.e., O(d)) than theirs (i.e., O(d 2)). Third, our regret bound has an annoying factor of \(\sqrt{T}\) in comparison with \(\sqrt{\ln T}\) in theirs. Open problems are therefore how to achieve a dependence on d lower than d 2 in the two-point bandit setting, and how to remove the factor of \(\sqrt{T}\) while keeping a small dependence on d in our setting; studying the two different types of gradual variation for bandit settings is also left as future work.

Footnotes

  1. This is also termed as deviation in Chiang et al. (2012).

  2. Gradual variation bounds for a special class of non-smooth cost functions that are composed of a smooth component and a non-smooth component are discussed in Sect. 3.

  3. We simply set L=1 for online linear optimization and prediction with expert advice.

Acknowledgements

We thank the reviewers for their immensely helpful and thorough comments.

References

  1. Abernethy, J., Hazan, E., & Rakhlin, A. (2008). Competing in the dark: an efficient algorithm for bandit linear optimization. In Proceedings of the 21st annual conference on learning theory (pp. 263–274).
  2. Agarwal, A., Hazan, E., Kale, S., & Schapire, R. E. (2006). Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd international conference on machine learning (pp. 9–16).
  3. Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the 23rd annual conference on learning theory (pp. 28–40).
  4. Awerbuch, B., & Kleinberg, R. D. (2004). Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th ACM symposium on theory of computing (pp. 45–53).
  5. Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2005). Improved second-order bounds for prediction with expert advice. In Proceedings of the 18th annual conference on learning theory (Vol. 3559, pp. 217–232).
  6. Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. New York: Cambridge University Press.
  7. Chiang, C. K., Yang, T., Lee, C. J., Mahdavi, M., Lu, C. J., Jin, R., & Zhu, S. (2012). Online optimization with gradual variations. In COLT (pp. 6.1–6.20).
  8. Chiang, C. K., Lee, C. J., & Lu, C. J. (2013). Beating bandits in gradually evolving worlds. In COLT (pp. 210–227).
  9. Dani, V., & Hayes, T. P. (2006). Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In Proceedings of the 17th annual ACM-SIAM symposium on discrete algorithms (pp. 937–943).
  10. Dani, V., Hayes, T. P., & Kakade, S. (2007). The price of bandit information for online optimization. In Proceedings of the twenty-first annual conference on neural information processing systems.
  11. Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms (pp. 385–394).
  12. Freund, Y., & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the 2nd European conference on computational learning theory (pp. 23–37). London: Springer.
  13. Hazan, E., & Kale, S. (2009). Better algorithms for benign bandits. In Proceedings of the 20th annual ACM-SIAM symposium on discrete algorithms (pp. 38–47).
  14. Hazan, E., & Kale, S. (2010). Extracting certainty from uncertainty: regret bounded by variation in costs. Machine Learning, 80(2–3), 165–188.
  15. Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69, 169–192.
  16. Kalai, A., & Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71, 291–307.
  17. Kale, S. (2012). Commentary on “online optimization with gradual variations”. Journal of Machine Learning Research, 23, 6.21–6.24.
  18. Kivinen, J., & Warmuth, M. K. (1995). Additive versus exponentiated gradient updates for linear prediction. In Proceedings of the twenty-seventh annual ACM symposium on theory of computing (pp. 209–218). New York: ACM.
  19. Kivinen, J., Smola, A. J., & Williamson, R. C. (2004). Online learning with kernels. IEEE Transactions on Signal Processing, 52, 2165–2176.
  20. Nemirovski, A. (2005). Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15, 229–251.
  21. Nesterov, Y. (2004). Introductory lectures on convex optimization: a basic course (applied optimization) (1st ed.). Dordrecht: Springer.
  22. Nesterov, Y. (2005a). Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16, 235–249.
  23. Nesterov, Y. (2005b). Smooth minimization of non-smooth functions. Mathematical Programming, 103, 127–152.
  24. Takimoto, E., & Warmuth, M. K. (2003). Path kernels and multiplicative updates. Journal of Machine Learning Research, 4, 773–818.
  25. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (pp. 928–936).

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Tianbao Yang (1)
  • Mehrdad Mahdavi (2)
  • Rong Jin (2)
  • Shenghuo Zhu (1)

  1. NEC Laboratories America, Cupertino, USA
  2. Department of Computer Science and Engineering, Michigan State University, East Lansing, USA
