Optimal learning with Bernstein online aggregation
Abstract
We introduce a new recursive aggregation procedure called Bernstein Online Aggregation (BOA). Its exponential weights include a second order refinement. The procedure is optimal for the model selection aggregation problem in the bounded iid setting for the square loss: the excess of risk of its batch version achieves the fast rate of convergence \(\log (M)/n\) in deviation. The BOA procedure is the first online algorithm that satisfies this optimal fast rate. The second order refinement is required to achieve the optimality in deviation as the classical exponential weights cannot be optimal, see Audibert (Advances in neural information processing systems. MIT Press, Cambridge, MA, 2007). This refinement is settled thanks to a new stochastic conversion that estimates the cumulative predictive risk in any stochastic environment with observable second order terms. The observable second order term is shown to be sufficiently small to assert the fast rate in the iid setting when the loss is Lipschitz and strongly convex. We also introduce a multiple learning rates version of BOA. This fully adaptive BOA procedure is also optimal, up to a \(\log \log (n)\) factor.
Keywords
Exponential weighted averages, Learning theory, Individual sequences
1 Introduction and main results
Definition 1.1
Very few known procedures achieve the fast rate \(\log (M)/n\) in deviation, and none of them is derived from an online procedure. In this article, we introduce the Bernstein Online Aggregation (BOA) procedure, which we prove to be the first online aggregation procedure whose batch version, defined as \(\bar{f}=(n+1)^{-1}\sum _{t=0}^{n}\hat{f}_t\), is optimal. Before defining it properly, let us review the existing optimal procedures for Problem (MS).
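As an illustration of the batch version \(\bar{f}\), here is a minimal Python sketch. It assumes (this is not stated above) that the dictionary consists of fixed functions \(f_1,\ldots ,f_M\) and that each online predictor is a convex combination \(\hat{f}_t=\sum _j\pi _{j,t}f_j\), so that averaging the predictors amounts to averaging the weight vectors. The names batch_weights and batch_predict are hypothetical.

```python
def batch_weights(weight_history):
    """Average the online weight vectors pi_0, ..., pi_n.

    weight_history: list of n + 1 weight vectors, one per step.
    Assumption: fixed experts, so f_bar = sum_j avg_j * f_j.
    """
    steps = len(weight_history)  # equals n + 1
    n_experts = len(weight_history[0])
    return [sum(w[j] for w in weight_history) / steps for j in range(n_experts)]


def batch_predict(weight_history, expert_predictions):
    """Value of f_bar at one point, given the experts' predictions there."""
    avg = batch_weights(weight_history)
    return sum(a * p for a, p in zip(avg, expert_predictions))
```

Under this assumption the batch predictor costs no more to evaluate than a single aggregated predictor, since only the averaged weights need to be stored.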
The batch procedures in Audibert (2007), Lecué and Mendelson (2009), Lecué and Rigollet (2014) achieve the optimal rate in deviation. A priori, they face practical issues as they require a computational optimization technique to approximate the weights, which are defined as an optimum. A step further was taken in the context of quadratic loss with Gaussian noise in Dai et al. (2012), where an explicit iterative scheme is provided. We now explain why the existence of an online algorithm whose batch version achieves the fast rate of convergence in deviation remained an open question (see the conclusion of Audibert 2007) before our work. Optimal (for the regret) online aggregation procedures are exponential weights algorithms (EWAs), see Vovk (1990), Haussler et al. (1998). The batch versions of EWAs coincide with the Progressive Mixture Rules (PMRs). In the iid setting, the properties of the excess of risk of such procedures have been extensively studied in Catoni (2004). PMRs achieve the fast optimal rate \(\log (M)/n\) in expectation (this follows from taking the expectation of the optimal regret bound and applying Jensen’s inequality, see Catoni 2004, Juditsky et al. 2008). However, PMRs are suboptimal in deviation, i.e. the optimal rate cannot hold with high probability, see Audibert (2007), Dai et al. (2012). The reason is that optimality for the regret as defined in Haussler et al. (1998) does not coincide with the notion of optimality for the risk in deviation used in Definition 1.1.
Theorem 1.1
 1. \(\Delta M_{J,t}= \eta \ell _{J,t}\) as a function of J distributed conditionally as \(\pi _{t-1}\) on \(\{1,\ldots ,M\}\),
 2. \( M_{j,t}=\eta (R_t(\hat{f})-R_t(f_j)-\text {Err}_t(\hat{f})+\text {Err}_t(f_j))\) such that \(\Delta M_{j,t}=\eta (\mathbb {E}_{t-1}[\ell _{j,t}]-\ell _{j,t})\) where \(\mathbb {E}_{t-1}\) denotes the expectation of \((X_t,Y_t)\) conditionally on \(\mathcal D_{t-1}\), \(1\le j\le M\).

the “gradient trick” to bound the excess of the cumulative predictive risk in Problem (C),

the multiple learning rates for adapting the procedure and

the batch version of BOA to achieve the fast rate of convergence in Problem (MS).
The paper is organized as follows: We present the second order regret bounds for different versions of BOA in Sect. 3. The new stochastic conversion and the excess of cumulative predictive risk bounds in a stochastic environment are provided in Sect. 4. In the next section, we introduce some useful probabilistic preliminaries.
2 Preliminaries
As in Audibert (2009), the recursive argument for supermartingales will be at the core of the proofs developed in this paper. It will be used jointly with the variational form of the entropy to provide second order regret bounds.
2.1 The proof of the martingale inequality in Theorem 1.1
2.2 The variational form of the entropy
The relative entropy (or Kullback-Leibler divergence) \(\mathcal K(Q,P)=\mathbb {E}_Q[\log (dQ/dP)]\) is a pseudo-distance between any probability measures P and Q. Let us recall a basic property of the entropy: the variational formula, originally proved in full generality in Donsker and Varadhan (1975). We consider here a version well adapted to obtaining second order regret bounds:
Lemma 2.1
The fact that the Gibbs measure realizes the dual identity is at the core of the PAC-Bayesian approach. Exponential weights aggregation procedures arise naturally as they can be considered as Gibbs measures, see Catoni (2007).
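The role of the Gibbs measure can be checked numerically on a finite set. The sketch below uses the standard finite form of the Donsker and Varadhan formula, \(\log \mathbb {E}_P[e^{h}]=\sup _Q\{\mathbb {E}_Q[h]-\mathcal K(Q,P)\}\), which may differ in presentation from the version used in Lemma 2.1; the function names are hypothetical.

```python
import math


def kl(q, p):
    """Relative entropy K(Q, P) on a finite set (0 * log 0 taken as 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def gibbs(p, h):
    """Gibbs measure with density proportional to exp(h) with respect to P."""
    raw = [pi * math.exp(hi) for pi, hi in zip(p, h)]
    total = sum(raw)
    return [r / total for r in raw]


def dual_objective(q, p, h):
    """E_Q[h] - K(Q, P), the quantity maximized over Q."""
    return sum(qi * hi for qi, hi in zip(q, h)) - kl(q, p)


def log_laplace(p, h):
    """log E_P[exp(h)], the value of the supremum."""
    return math.log(sum(pi * math.exp(hi) for pi, hi in zip(p, h)))
```

For any prior p and any function h, dual_objective(gibbs(p, h), p, h) matches log_laplace(p, h) up to rounding, while any other choice of q gives a smaller value.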
3 Second order regret bounds for the BOA procedure
3.1 First regret bounds and link with the individual sequences framework
We work conditionally on \(\mathcal D_{n+1}\); this is the deterministic setting, similar to that of Gerchinovitz (2013), where \((X_t,Y_t)=(x_t,y_t)\) are provided recursively for \(1\le t\le n\). In that case, the cumulative loss \(\text {Err}_{n+1}(f)\) quantifies the prediction error of \(f=(f_0,f_1,f_2,\ldots )\). We first state a regret bound for nonconvex losses, and then move to the case of convex losses combined with the “gradient trick” as in the Appendix of Gaillard et al. (2014). Recall that \(\mathbb {E}_{\hat{\pi }}[\text {Err}_{n+1}( f_J)]=\sum _{t=1}^{n+1}\mathbb {E}_{\hat{\pi }_{t-1}}[\ell (Y_{t} ,f_{J,t}(X_{t}))]\).
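The “gradient trick” rests on the first-order convexity inequality: for a loss that is convex and differentiable in its second coordinate, \(\ell (y,\hat{y})-\ell (y,y_j)\le \ell '(y,\hat{y})(\hat{y}-y_j)\), so the excess loss is bounded by a linearized excess loss. The Python sketch below illustrates this with the square loss, chosen here only as an example; the function names are hypothetical.

```python
def square_loss(y, pred):
    """Square loss, convex in its second coordinate."""
    return (y - pred) ** 2


def square_loss_grad(y, pred):
    """Derivative of the square loss with respect to pred."""
    return 2.0 * (pred - y)


def linearized_excess(y, pred_hat, pred_j, grad=square_loss_grad):
    """Upper bound on loss(y, pred_hat) - loss(y, pred_j) given by convexity."""
    return grad(y, pred_hat) * (pred_hat - pred_j)
```

For the square loss the gap between the linearized and the true excess loss is exactly \((\hat{y}-y_j)^2\), which shows that the bound is tight only when the aggregate and the expert agree.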
Theorem 3.1
Proof
Theorem 3.2
Proof
Second order regret bounds similar to the one of Theorem 3.2 have been obtained in Gaillard et al. (2014), Luo and Schapire (2015), Koolen and Van Erven (2015) in the context of individual sequences. In this context, we consider that \(Y_t=y_t\) for a deterministic sequence \(y_0,\ldots ,y_n\) (\((X_t)\) is useless in this context), see Cesa-Bianchi and Lugosi (2006) for an extensive treatment of that setting. We have \(\mathcal D_t=\{y_0,\ldots ,y_t\}\), \(0\le t\le n\), and the online learners \(f_j=(y_{j,1},y_{j,2},y_{j,3},\ldots )\) of the dictionary are called the experts. The cumulative loss is \(\text {Err}_{n+1}(\hat{f})=\sum _{t=1}^{n+1}\ell (y_{t},\hat{y}_t)\) for any aggregative strategy \(\hat{y}_t=\hat{f}_{t-1}=\sum _{j=1}^M\pi _{j,t-1}y_{j,t}\) where \(\pi _{j,t-1}\) are measurable functions of the past \(\{y_0,\ldots ,y_{t-1}\}\). We will compare our second order regret bounds to the ones of other adaptive procedures from the individual sequences setting at the end of the next section.
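The aggregative strategy \(\hat{y}_t=\sum _j\pi _{j,t-1}y_{j,t}\) can be sketched in Python as follows. The weight update is an assumption, since the definition of the procedure is not reproduced in this excerpt: exponential weights with a second order correction, \(\pi _{j,t}\propto \pi _{j,t-1}\exp (-\eta \ell _{j,t}(1+\eta \ell _{j,t}))\), where \(\ell _{j,t}\) is taken to be the loss of expert j minus the loss averaged under the current weights. The square loss, the constant learning rate and the names boa_step and run_boa are also illustrative choices.

```python
import math


def boa_step(weights, losses, eta):
    """One exponential-weights step with a second order correction.

    Assumed form: pi_j <- pi_j * exp(-eta * l_j * (1 + eta * l_j)), normalized,
    with l_j = loss of expert j minus the weighted average loss.
    """
    mix = sum(w * l for w, l in zip(weights, losses))
    excess = [l - mix for l in losses]
    raw = [w * math.exp(-eta * e * (1.0 + eta * e)) for w, e in zip(weights, excess)]
    total = sum(raw)
    return [r / total for r in raw]


def run_boa(expert_preds, outcomes, eta, loss=lambda y, p: (y - p) ** 2):
    """Run the strategy on a sequence; returns cumulative loss and weight history.

    expert_preds: one list of M expert predictions per round.
    outcomes:     one observed value y_t per round.
    """
    m = len(expert_preds[0])
    weights = [1.0 / m] * m  # uniform initial weights pi_0
    cumulative = 0.0
    history = [weights]
    for preds, y in zip(expert_preds, outcomes):
        y_hat = sum(w * p for w, p in zip(weights, preds))  # uses pi_{t-1}
        cumulative += loss(y, y_hat)
        weights = boa_step(weights, [loss(y, p) for p in preds], eta)
        history.append(weights)
    return cumulative, history
```

Each round costs a number of operations proportional to M, so a run over n rounds costs a number of operations proportional to Mn.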
3.2 A new adaptive method for exponential weights
Theorem 3.3
Proof
The advantage of the adaptive BOA procedure compared with the procedures studied in Gaillard et al. (2014), Luo and Schapire (2015), Koolen and Van Erven (2015) is that it adapts to unknown ranges. The price to pay is an additional logarithmic term \(\log (E)+c\log (2)\) depending on the variability of the adaptive learning rates \(\eta _{j,t}\) through time. Such losses are avoidable in the case of one single adaptive learning rate \(\eta _{j,t}=\eta _{t}\), for all \(1\le j\le M\). Notice also that the relative entropy bound is only achieved in the case of one single adaptive learning rate as then \(\mathcal K(\pi ^{\prime },\pi _0)=\mathcal K(\pi ,\pi _0)\). This is a drawback of the multiple learning rates procedures compared with the single learning rate procedures of Luo and Schapire (2015), Koolen and Van Erven (2015), which achieve such relative entropy bounds. Whether those drawbacks of multiple learning rates procedures can be avoided is an open question.
4 Optimality of the BOA procedure in a stochastic environment
4.1 An empirical stochastic conversion
Theorem 4.1
Proof
We first note that for each \(1\le j\le M\) the sequence \((M_{j,t})_t\) with \(M_{j,t}=\eta (\mathbb {E}_{\hat{\pi }}[R_t(f_J)]-R_{t}(f_j)-(\mathbb {E}_{\hat{\pi }}[\text {Err}_{t}(f_J )]-\text {Err}_{t}(f_j)))\) is a martingale adapted to the filtration \((\mathcal D_t)\). Its difference is equal to \(\Delta M_{j,t}=\eta (\mathbb {E}_{t-1}[\ell _{j,t}]-\ell _{j,t})\). The proof then follows from the classical recursive argument for supermartingales applied to the exponential inequality of Theorem 1.1. However, as the learning rates \(\eta _{j,t}\) are not necessarily constant, we adapt the recursive argument as in the proof of Theorem 3.3.
4.2 Second order bounds on the excess of the cumulative predictive risk
Using the stochastic conversion of Theorem 4.1, we derive from the regret bounds of Sect. 3 second order bounds on the cumulative predictive risk of the BOA procedure. As an example, using the second order regret bound of Theorem 3.1 and the stochastic conversion of Theorem 4.1 we obtain
Theorem 4.2
Proof
We prove the result by integrating the result of Theorem 4.1 with respect to any deterministic \(\pi \) and noticing that, as the learning rates \(\eta _{j,t}=\eta \) are constant in the BOA procedure described in Fig. 1, \( \log (\eta _{j,1}/\eta _{j,n})=0.\)
The main advantage of the new stochastic conversion compared with the one of Gaillard et al. (2014) is that the empirical second order bound of the stochastic conversion is similar to the one of the regret bound. We can extend Theorem 3.3; as the boundedness condition (13) is no longer satisfied for any \(1\le t\le n+1\), Theorem 4.1 does not apply directly. We still have
Theorem 4.3
Proof
4.3 Optimal learning for problem (MS)
 (LIST) the loss function \(\ell \) is \(C_\ell \)-strongly convex and \(C_b\)-Lipschitz continuous in its second coordinate on a convex set \(\mathcal C\subset \mathbb {R}\).
We obtain the optimality of the BOA procedure for Problem (MS). The result extends easily (with different constants) to any online procedure achieving second order regret bounds on the linearized loss similar to those of BOA, such as the procedures described in Gaillard et al. (2014), Luo and Schapire (2015), Koolen and Van Erven (2015).
Theorem 4.4
Proof
The tuning parameter \(\eta \) can be considered as the inverse of the temperature \(\beta \) of the Q-aggregation procedure studied in Lecué and Rigollet (2014). In Q-aggregation, the tuning parameter \(\beta \) is required to be larger than \(60 C_b^2/C_\ell \), a condition similar to our restriction (15) on \(\eta \). The larger \(\eta \) is, subject to condition (15), the better the rate of convergence. The choice \(\eta ^*=(16(e-1)C_b^2/ C_\ell )^{-1} \) is optimal. The resulting BOA procedure is non-adaptive in the sense that it depends on the range \(C_b\) of the gradients, which may be unknown. By contrast, the multiple learning rates BOA procedure tunes the learning rates automatically. At the price of larger “constants” that grow as \(\log \log (n)\), we extend the preceding optimal rate of convergence to the adaptive BOA procedure:
Theorem 4.5
Proof
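To give an order of magnitude for the non-adaptive tuning discussed before Theorem 4.5, the short Python sketch below evaluates the learning rate \(\eta ^*\), read here as \(C_\ell /(16(e-1)C_b^2)\). The numerical constants in the usage note are hypothetical and chosen only for illustration, and optimal_eta is a hypothetical name.

```python
import math


def optimal_eta(c_b, c_l):
    """Learning rate eta* = C_l / (16 (e - 1) C_b^2).

    c_b: Lipschitz constant of the loss (range of the gradients).
    c_l: strong convexity constant of the loss.
    """
    return c_l / (16.0 * (math.e - 1.0) * c_b ** 2)
```

For instance, with the hypothetical values c_b = 1 and c_l = 2 the function returns about 0.073, and doubling c_b divides the result by four, which illustrates how strongly the tuning depends on the range of the gradients.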
The BOA procedure is explicitly computed with complexity O(Mn). This is a practical advantage compared with the batch procedures studied in Audibert (2007), Lecué and Mendelson (2009), Lecué and Rigollet (2014), which require a computational optimization technique. This issue has been solved in Dai et al. (2012) for the square loss using greedy iterative algorithms that approximate the Q-aggregation procedure.
Notes
Acknowledgments
I am grateful to two anonymous referees for their helpful comments. I would also like to thank Pierre Gaillard and Gilles Stoltz for valuable comments on a preliminary version.
References
 Abernethy, J., Agarwal, A., Bartlett, P. L., & Rakhlin, A. (2009). A stochastic view of optimal regret through minimax duality. In COLT.
 Agarwal, A., & Duchi, J. C. (2013). The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59, 573–587.
 Alquier, P., Li, X., & Wintenberger, O. (2013). Prediction of time series by statistical learning: General losses and fast rates. Dependence Modeling, 1, 65–93.
 Audibert, J.-Y., Munos, R., & Szepesvari, C. (2006). Use of variance estimation in the multi-armed bandit problem. In NIPS.
 Audibert, J.-Y. (2007). Progressive mixture rules are deviation suboptimal. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems (Vol. 20, pp. 41–48). Cambridge, MA: MIT Press.
 Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37, 1591–1646.
 Blum, A., & Mansour, Y. (2005). From external to internal regret. In Proceedings of the 18th annual conference on learning theory (New York) (pp. 621–636). Springer.
 Catoni, O. (2004). Statistical learning theory and stochastic optimization. Lecture notes in mathematics (Vol. 1851), Springer-Verlag, Berlin. Lecture notes from the 31st summer school on probability theory held in Saint-Flour, July 8–25, 2001. MR 2163920.
 Catoni, O. (2007). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Beachwood, OH: Institute of Mathematical Statistics.
 Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge, NY: Cambridge University Press.
 Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66, 321–352.
 Dai, D., Rigollet, P., Xia, L., & Zhang, T. (2012). Deviation optimal learning using greedy Q-aggregation. The Annals of Statistics, 40, 1878–1905.
 Donsker, M. D., & Varadhan, S. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28, 1–47.
 Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability, 3, 100–118.
 Gaillard, P., Stoltz, G., & Van Erven, T. (2014). A second-order bound with excess losses. In COLT. arXiv:1402.2044.
 Gerchinovitz, S. (2013). Sparsity regret bounds for individual sequences in online linear regression. JMLR, 14, 729–769.
 Haussler, D., Kivinen, J., & Warmuth, M. K. (1998). Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44, 1906–1925.
 Hazan, E., & Kale, S. (2010). Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80, 165–188.
 Juditsky, A., Rigollet, P., & Tsybakov, A. B. (2008). Learning by mirror averaging. The Annals of Statistics, 36, 2183–2206.
 Kakade, S. M., & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. In NIPS.
 Koolen, W., & Van Erven, T. (2015). Second-order quantile methods for experts and combinatorial games. In COLT (pp. 1155–1175).
 Lecué, G., & Mendelson, S. (2009). Aggregation via empirical risk minimization. PTRF, 145, 591–613.
 Lecué, G., & Rigollet, P. (2014). Optimal learning with Q-aggregation. The Annals of Statistics, 42(1), 211–224.
 Luo, H., & Schapire, R. E. (2015). Achieving all with no parameters: AdaNormalHedge. In Proceedings of the 28th conference on learning theory (pp. 1286–1304).
 Maurer, A., & Pontil, M. (2009). Empirical Bernstein bounds and sample variance penalization. In COLT.
 Mohri, M., & Rostamizadeh, A. (2010). Stability bounds for stationary \(\phi \)-mixing and \(\beta \)-mixing processes. JMLR, 4, 1–26.
 Nemirovski, A. (2000). Topics in nonparametric statistics. Lectures on probability theory and statistics (Saint-Flour, 1998), Lecture Notes in Math. (Vol. 1738, pp. 85–277). Berlin: Springer. MR 1775640.
 Rigollet, P. (2012). Kullback-Leibler aggregation and misspecified generalized linear models. The Annals of Statistics, 40(2), 639–665.
 Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT. Berlin, Heidelberg: Springer.
 Vovk, V. G. (1990). Aggregating strategies. In COLT.
 Zhang, T. (2005). Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th annual conference on learning theory. Berlin, Heidelberg: Springer.