
DCA for online prediction with expert advice


Abstract

We investigate DC (Difference of Convex functions) programming and DCA (DC Algorithm) for a class of online learning techniques, namely prediction with expert advice, in which the learner's prediction is based on the weighted average of the experts' predictions. The problem of predicting the experts' weights is formulated as a DC program, for which an online version of DCA is investigated. Two variants of online DCA-based schemes, an approximate one and a complete one, are designed, and their regrets are proved to be logarithmic and sublinear, respectively. The four proposed algorithms for online prediction with expert advice are furthermore applied to online binary classification. Experiments on various benchmark datasets demonstrate the performance of the proposed algorithms and their superiority over three standard algorithms for online prediction with expert advice: the well-known weighted majority algorithm and two online convex optimization algorithms.


Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

  2. http://www.ics.uci.edu/~mlearn/MLRepository.html.

References

  1. Alexander L, Das SR, Ives Z, Jagadish H, Monteleoni C (2017) Research challenges in financial data modeling and analysis. Big Data 5(3):177–188

  2. Angluin D (1988) Queries and concept learning. Mach Learn 2(4):319–342

  3. Azoury K, Warmuth MK (2001) Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach Learn 43(3):211–246

  4. Barzdin JM, Freivald RV (1972) On the prediction of general recursive functions. Sov Math Doklady 13:1224–1228

  5. Cesa-Bianchi N (1999) Analysis of two gradient-based algorithms for on-line regression. J Comput Syst Sci 59(3):392–411

  6. Cesa-Bianchi N, Freund Y, Haussler D, Helmbold DP, Schapire RE, Warmuth MK (1997) How to use expert advice. J ACM 44(3):427–485

  7. Cesa-Bianchi N, Lugosi G (2003) Potential-based algorithms in on-line prediction and game theory. Mach Learn 51(3):239–261

  8. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, New York

  9. Cesa-Bianchi N, Mansour Y, Stoltz G (2007) Improved second-order bounds for prediction with expert advice. Mach Learn 66(2):321–352

  10. Chung TH (1994) Approximate methods for sequential decision making using expert advice. In: Proceedings of the seventh annual conference on computational learning theory, COLT ’94, pp 183–189. ACM, New York, NY, USA

  11. Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712

  12. Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. In: Proceedings of the 23rd international conference on machine learning, ICML ’06, pp 201–208. New York, NY, USA

  13. Conover WJ (1999) Practical nonparametric statistics, 3rd edn. Wiley, Hoboken

  14. Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y (2006) Online passive-aggressive algorithms. J Mach Learn Res 7:551–585

  15. Dadkhahi H, Shanmugam K, Rios J, Das P, Hoffman SC, Loeffler TD, Sankaranarayanan S (2020) Combinatorial black-box optimization with expert advice. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1918–1927. Association for Computing Machinery, New York, NY, USA

  16. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  17. DeSantis A, Markowsky G, Wegman MN (1988) Learning probabilistic prediction functions. In: Proceedings of the first annual workshop on computational learning theory, COLT’88, pp. 312–328. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA

  18. Devaine M, Gaillard P, Goude Y, Stoltz G (2013) Forecasting electricity consumption by aggregating specialized experts. Mach Learn 90(2):231–260

  19. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139

  20. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  21. Friedman M (1940) A comparison of alternative tests of significance for the problem of \(m\) rankings. Ann Math Stat 11(1):86–92

  22. García S, Herrera F (2009) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694

  23. Gentile C (2002) A new approximate maximal margin classification algorithm. J Mach Learn Res 2:213–242

  24. Gentile C (2003) The robustness of the \(p\)-norm algorithms. Mach Learn 53(3):265–299

  25. Gollapudi S, Panigrahi D (2019) Online algorithms for rent-or-buy with expert advice. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp 2319–2327. PMLR, Long Beach, California, USA

  26. Gramacy RB, Warmuth MKK, Brandt SA, Ari I (2003) Adaptive caching by refetching. In: Becker S, Thrun S, Obermayer K (eds) Advances in neural information processing systems, vol 15. MIT Press, Cambridge, pp 1489–1496

  27. Grove AJ, Littlestone N, Schuurmans D (2001) General convergence results for linear discriminant updates. Mach Learn 43(3):173–210

  28. Hao S, Hu P, Zhao P, Hoi SCH, Miao C (2018) Online active learning with expert advice. ACM Trans Knowl Discov Data 12(5):1–22

  29. Haussler D, Kivinen J, Warmuth MK (1995) Tight worst-case loss bounds for predicting with expert advice. In: Vitányi P (ed) Computational learning theory, lecture notes in computer Science, vol 904. Springer, Berlin, pp 69–83

  30. Hazan E (2016) Introduction to online convex optimization. Found Trends Optim 2(3–4):157–325

  31. Ho VT, Le Thi HA, Bui DC (2016) Online DC optimization for online binary linear classification. In: Nguyen TN, Trawiński B, Fujita H, Hong TP (eds) Intelligent information and database systems: 8th Asian conference, ACIIDS 2016, proceedings, Part II. Springer, Berlin, pp 661–670

  32. Hoi SCH, Wang J, Zhao P (2014) LIBOL: a library for online learning algorithms. J Mach Learn Res 15(1):495–499

  33. Jamil W, Bouchachia A (2019) Model selection in online learning for times series forecasting. In: Lotfi A, Bouchachia H, Gegov A, Langensiepen C, McGinnity M (eds) Advances in computational intelligence systems. Springer, Cham, pp 83–95

  34. Kivinen J, Warmuth MK (1997) Exponentiated gradient versus gradient descent for linear predictors. Inf Comput 132(1):1–63

  35. Kivinen J, Warmuth MK (2001) Relative loss bounds for multidimensional regression problems. Mach Learn 45(3):301–329

  36. Kveton B, Yu JY, Theocharous G, Mannor S (2008) Online learning with expert advice and finite-horizon constraints. In: Proceedings of the twenty-third AAAI conference on artificial intelligence, AAAI 2008, pp 331–336. AAAI Press

  37. Le Thi HA (1994) Analyse numérique des algorithmes de l’optimisation d. C. Approches locale et globale. Codes et simulations numériques en grande dimension. Applications. Ph.D. thesis, University of Rouen, France

  38. Le Thi HA (2020) DC programming and DCA for supply chain and production management: state-of-the-art models and methods. Int J Prod Res 58(20):6078–6114

  39. Le Thi HA, Ho VT, Pham Dinh T (2019) A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning. J Glob Optim 73(2):279–310

  40. Le Thi HA, Le HM, Phan DN, Tran B (2020) Stochastic DCA for minimizing a large sum of DC functions with application to multi-class logistic regression. Neural Netw 132:220–231

  41. Le Thi HA, Moeini M, Pham Dinh T (2009) Portfolio selection under downside risk measures and cardinality constraints based on DC programming and DCA. Comput Manag Sci 6(4):459–475

  42. Le Thi HA, Pham Dinh T (2001) DC programming approach to the multidimensional scaling problem. In: Migdalas A, Pardalos PM, Värbrand P (eds) From local to global optimization. Springer, Boston, pp 231–276

  43. Le Thi HA, Pham Dinh T (2003) Large-scale molecular optimization from distance matrices by a D.C. optimization approach. SIAM J Optim 14(1):77–114

  44. Le Thi HA, Pham Dinh T (2005) The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann Oper Res 133(1–4):23–48

  45. Le Thi HA, Pham Dinh T (2014) DC programming in communication systems: challenging problems and methods. Vietnam J Comput Sci 1(1):15–28

  46. Le Thi HA, Pham Dinh T (2018) DC programming and DCA: thirty years of developments. Math Program Spec Issue DC Program Theory Algorithms Appl 169(1):5–68

  47. Li Y, Long P (2002) The relaxed online maximum margin algorithm. Mach Learn 46(1–3):361–387

  48. Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inf Comput 108(2):212–261

  49. Nayman N, Noy A, Ridnik T, Friedman I, Jin R, Zelnik-Manor L (2019) XNAS: neural architecture search with expert advice. In: Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, pp 1975–1985

  50. Novikoff AB (1963) On convergence proofs for perceptrons. In: Proceedings of the symposium on the mathematical theory of automata 12:615–622

  51. Ong CS, Le Thi HA (2013) Learning sparse classifiers with difference of convex functions algorithms. Optim Methods Softw 28(4):830–854

  52. Pereira DG, Afonso A, Medeiros FM (2014) Overview of Friedman’s test and post-hoc analysis. Commun Stat Simul Comput 44(10):2636–2653

  53. Pham Dinh T, Le HM, Le Thi HA, Lauer F (2014) A difference of convex functions algorithm for switched linear regression. IEEE Trans Autom Control 59(8):2277–2282

  54. Pham Dinh T, Le Thi HA (1997) Convex analysis approach to D.C. programming: theory, algorithm and applications. Acta Math Vietnam 22(1):289–355

  55. Pham Dinh T, Le Thi HA (1998) DC optimization algorithms for solving the trust region subproblem. SIAM J Optim 8(2):476–505

  56. Pham Dinh T, Le Thi HA (2014) Recent advances in DC programming and DCA. In: Nguyen NT, Le Thi HA (eds) Transactions on computational intelligence XIII, vol 8342. Springer, Berlin, pp 1–37

  57. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408

  58. Shalev-Shwartz S (2007) Online learning: theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem

  59. Shalev-Shwartz S (2012) Online learning and online convex optimization. Found Trends Mach Learn 4(2):107–194

  60. Shor NZ (1985) Minimization methods for non-differentiable functions, 1 edn. Springer Series in Computational Mathematics 3. Springer, Berlin

  61. Valadier M (1969) Sous-différentiels d’une borne supérieure et d’une somme continue de fonctions convexes. CR Acad. Sci. Paris Sér. AB 268:A39–A42

  62. Van Der Malsburg C (1986) Frank Rosenblatt: principles of neurodynamics: perceptrons and the theory of brain mechanisms. In: Palm G, Aertsen A (eds) Brain theory. Springer, Berlin, pp 245–248

  63. Vovk V (1990) Aggregating strategies. In: Proceedings of the third annual workshop on computational learning theory. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 371–386

  64. Vovk V (1998) A game of prediction with expert advice. J Comput Syst Sci 56(2):153–173

  65. Wang W, Carreira-Perpiñán MÁ (2013) Projection onto the probability simplex: an efficient algorithm with a simple proof, and an application. arxiv: 1309.1541

  66. Wilcoxon F (1945) Individual comparisons by ranking methods. Biometr Bull 1(6):80–83

  67. Wu P, Hoi SCH, Zhao P, Miao C, Liu Z (2016) Online multi-modal distance metric learning with application to image retrieval. IEEE Trans Knowl Data Eng 28(2):454–467

  68. Yaroshinsky R, El-Yaniv R, Seiden SS (2004) How to better use expert advice. Mach Learn 55(3):271–309

  69. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Fawcett T, Mishra N (eds) Proceedings of the 20th international conference on machine learning (ICML-03), pp 928–936

Acknowledgements

This research is funded by Foundation for Science and Technology Development of Ton Duc Thang University (FOSTECT), website: http://fostect.tdtu.edu.vn, under Grant FOSTECT.2017.BR.10.

Author information

Corresponding author

Correspondence to Hoai An Le Thi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: DC decomposition of \({f}^{(i)}_t\)

We present the following proposition to get a DC decomposition of \({f}^{(i)}_t\).

Proposition 3

Let \(a > 0\), \(b\), and \(c\) be three constants and let \(x\) be a given vector. The function

$$\begin{aligned} f(w) = \max \{0, \min \{a, b\langle w, x \rangle + c\}\}, \end{aligned}$$
(22)

is a DC function with DC components

$$\begin{aligned}g(w) = \max \{0, b\langle w, x \rangle + c\}~\text { and } ~ h(w) = \max \{0, b\langle w, x \rangle + c - a \}.\end{aligned}$$

Proof

We know from [54] that if \(f = g - h\) is a DC function, then the function

$$\begin{aligned} \max \{0, g - h\} = \max \{g,h\} - h \end{aligned}$$
(23)

is DC too. We see that the function

$$\begin{aligned}\min \{a, b\langle w, x \rangle + c\} = (b\langle w, x \rangle + c) - \max \{0,b\langle w, x \rangle + c - a \}\end{aligned}$$

is DC. Therefore, applying (23) for \(g = b\langle w, x \rangle + c\) and \(h = \max \{0,b\langle w, x \rangle + c - a \}\), we get immediately a DC decomposition of f, that is,

$$\begin{aligned}&f(w) = \max \{{b\langle w, x \rangle + c,\max \{0,b\langle w, x \rangle + c - a \}}\} \\&\quad - \max \{0,b\langle w, x \rangle + c - a \}. \end{aligned}$$

Moreover, since \(a > 0\), we have

$$\begin{aligned}\max \{{b\langle w, x \rangle + c,\max \{0,b\langle w, x \rangle + c - a \}}\} = \max \{0,b\langle w, x \rangle + c\}.\end{aligned}$$

This completes the proof. \(\square\)
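
To make this decomposition concrete, the following self-contained Python snippet (illustrative only; the constants a, b, c and the vector x below are arbitrary choices, not values from the paper) checks numerically that \(g - h\) coincides with f at random points.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c = 1.0, -2.0, 0.5        # arbitrary constants with a > 0
    x = rng.normal(size=5)          # a given vector

    def f(w):
        # f(w) = max{0, min{a, b<w, x> + c}}, as in (22)
        return max(0.0, min(a, b * np.dot(w, x) + c))

    def g(w):
        # first DC component: g(w) = max{0, b<w, x> + c}
        return max(0.0, b * np.dot(w, x) + c)

    def h(w):
        # second DC component: h(w) = max{0, b<w, x> + c - a}
        return max(0.0, b * np.dot(w, x) + c - a)

    for _ in range(1000):
        w = rng.normal(size=5)
        assert abs(f(w) - (g(w) - h(w))) < 1e-12
    print("f = g - h verified on random points")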

Appendix 2: Proof of Lemma 1

Proof

First, we observe that when \(\tilde{\mathtt {p}}_{t-1} = 0\) or \(\tau ^{(1)}_{t-1} > \langle w^{t-1}, \tilde{\mathtt {p}}_{t-1} \rangle\) (resp. \(\tau ^{(2)}_{t-1} < \langle w^{t-1}, \tilde{\mathtt {p}}_{t-1} \rangle\)), our algorithms do not update the weight, and thus, the result in Lemma 1 is straightforward. Hence, we need to consider only the cases where, at each step \(t \in \mathcal {M}\),

$$\begin{aligned} \tau ^{(1)}_{t-1} = \langle w^{t-1}, \tilde{\mathtt {p}}_{t-1} \rangle ~(\mathrm {resp. }~ \tau ^{(2)}_{t-1} = \langle w^{t-1}, \tilde{\mathtt {p}}_{t-1} \rangle ) ~\mathrm {and}\,~ \tilde{\mathtt {p}}_{t-1} \ne 0. \end{aligned}$$
(24)

Since \(u^*\) satisfies (18) and \(\tilde{\mathtt {p}}_t \ne 0\), Assumption 1 (i) is satisfied for the DC functions (8). Moreover, \(\Vert u^* - w^t \Vert _2 \ne 0\) for all t: if \(u^* = w^t\) at some step t, then \(\langle u^*,\tilde{\mathtt {p}}_t \rangle = \langle w^t,\tilde{\mathtt {p}}_t \rangle\), which contradicts (18).

Next, we verify Assumptions 1 (ii)–(iv) for the DC functions \(f^{(1)}_t\).

\(\bullet\) Let us define the function \(\overline{g}^{(1)}_t := {g}^{(1)}_t - \langle z^t, \cdot \rangle .\) From (24), we derive

$$\begin{aligned} \overline{g}^{(1)}_t(w^t)-\overline{g}^{(1)}_t(u^*) = \frac{\rho - \langle w^t, \tilde{\mathtt {p}}_t \rangle }{\rho - \tau ^{(1)}_{t}} = 1 > 0. \end{aligned}$$

Thus, there exists a positive number \(\alpha \le \min \limits _{t \in \mathcal {M}}{\frac{2 }{ \Vert u^*-w^t\Vert _2^2}}\) such that Assumption 1 (ii) is satisfied.

\(\bullet\) For any \(\beta \ge 0\), we have

$$\begin{aligned} h^{(1)}_t(u^*)-h^{(1)}_t(w^t) - \langle z^t,u^*-w^t \rangle = 0 \le \frac{\beta }{2} \Vert u^*-w^t \Vert _2^2. \end{aligned}$$

Thus, Assumption 1 (iii) is satisfied.

\(\bullet\) Assumption 1 (iv) is also verified with \(\gamma \le \min \limits _{t \in \mathcal {M}}{\frac{2(\langle u^*, \tilde{\mathtt {p}}_t \rangle -\rho )}{(\rho - \tau ^{(1)}_{t})\Vert u^*-w^t\Vert _2^2}}\) since

$$\begin{aligned}&{g}^{(1)}_t(u^*)-{g}^{(1)}_t(w^t) - \langle r^t, u^*-w^t \rangle = \frac{\langle u^*, \tilde{\mathtt {p}}_t \rangle -\rho }{\rho - \tau ^{(1)}_{t}}\\&\quad \ge \frac{\gamma }{2} \Vert u^*-w^t \Vert _2^2. \end{aligned}$$

Similarly, for the DC functions \(f_t^{(i)}\) (\(i=2,3\)), Assumptions 1 (i)–(iv) are satisfied if

$$\begin{aligned}\alpha\le & {} \min \limits _{t \in \mathcal {M}}{\frac{2}{\Vert u^*-w^t\Vert _2^2}}, \beta \ge 0, \\ \gamma\le & {} \min \limits _{t \in \mathcal {M}}{\frac{2}{\Vert u^*-w^t\Vert _2^2}\left( \dfrac{\langle u^*,\tilde{\mathtt {p}}_t \rangle -\rho }{\rho -\tau ^{(i)}_t} - \delta ^{(i)}\right) }.\end{aligned}$$

This completes the proof of Lemma 1. \(\square\)

Appendix 3: Proof of Theorem 1

Proof

First of all, we analyze the regret bound of ODCA-SG.

From the definition (17), we have

$$\begin{aligned}&\mathsf {Regret}^T_{\text {ODCA-SG}} = \sum \limits _{t=1}^T f_t(w^t) - \min \limits _{w \in \mathcal {S}} \sum \limits _{t=1}^T f_t(w)\nonumber \\&\quad \quad \le \sum \limits _{t=1}^T \left[ f_t(w^t) - \min \limits _{w \in \mathcal {S}} f_t(w) \right] . \end{aligned}$$
(25)

It follows from Assumption 1 (i) that

$$\begin{aligned}&f_t(w^t) - \min \limits _{w \in \mathcal {S}} f_t(w) = f_t(w^t) - f_t(u^*) \nonumber \\&\quad = [\overline{g}_t(w^t) - \overline{g}_t(u^*)] + [h_t(u^*)-h_t(w^t) - \langle z^t,u^*-w^t \rangle ], \end{aligned}$$
(26)

where \(\overline{g}_t := g_t - \langle z^t, \cdot \rangle\) is a convex function for \(t=1,\ldots ,T\).

From (25), (26) and Assumptions 1 (ii)–(iii) with the choice \(\beta = \alpha\), we obtain

$$\begin{aligned}&\mathsf {Regret}^T_{\text {ODCA-SG}} {\le } 2\sum \limits _{t=1}^T [\overline{g}_t(w^t) - \overline{g}_t(u^*)] \nonumber \\&\quad = 2\sum \limits _{t=1}^T [\overline{g}_t(w^{t,0}) - \overline{g}_t(u^*)] \end{aligned}$$
(27)
$$\begin{aligned}&\quad \le 2\sum \limits _{t=1}^T \left\{ \sum \limits _{k=0}^{K_t-2} [\overline{g}_t(w^{t,k}) - \overline{g}_t(w^{t,k+1})] + [ \overline{g}_t(w^{t,K_t-1})-\overline{g}_t(u^*)]\right\} \nonumber \\&\quad \le 2\sum \limits _{t=1}^T \left\{ \sum \limits _{k=0}^{K_t-2} \langle s^{t,k}, w^{t,k}-w^{t,k+1} \rangle + \langle s^{t,K_t-1}, w^{t,K_t-1} - u^* \rangle \right\} . \end{aligned}$$
(28)

The last inequality holds since \(s^{t,k} \in \partial \overline{g}_t(w^{t,k})\) for \(k=0,1,\ldots ,K_t-1\).

As in Theorem 3.1 of [30], we can derive from (13) the following upper bound on \(\langle s^{t,K_t-1}, w^{t,K_t-1} - u^* \rangle\):

$$\begin{aligned}&\langle s^{t,K_t-1}, w^{t,K_t-1} - u^* \rangle \nonumber \\&\quad \le \frac{ \Vert w^{t,K_t-1} - u^* \Vert _2^2 - \Vert w^{t,K_t} - u^* \Vert _2^2 }{2\eta _t} + \frac{\eta _t}{2} \Vert s^{t,K_t-1} \Vert _2^2. \end{aligned}$$
(29)

Combining (19) and the fact that

$$\begin{aligned}&\Vert w^{t,K_t-1} - u^* \Vert _2 \le \Vert w^{t,K_t-1} - w^{t,0} \Vert _2 + \Vert w^{t,0} - u^* \Vert _2 \\&\quad \le \sum _{k=0}^{K_t-2} \Vert w^{t,k+1} - w^{t,k} \Vert _2 + \Vert w^{t} - u^* \Vert _2\\&\quad \le \sum _{k=0}^{K_t-2} \eta _t \Vert s^{t,k}\Vert _2 + \Vert w^{t} - u^* \Vert _2 \\&\quad \le \eta _t (K-1) L + \Vert w^{t} - u^* \Vert _2, \end{aligned}$$

we obtain

$$\begin{aligned} \Vert w^{t,K_t-1} - u^* \Vert _2^2 \le \Vert w^{t} - u^* \Vert _2^2 + 3 \eta _t^2 (K-1)^2 L^2. \end{aligned}$$
(30)

This implies

$$\begin{aligned}&\langle s^{t,K_t-1}, w^{t,K_t-1} - u^* \rangle \nonumber \\&\quad \le \frac{\Vert w^{t} - u^* \Vert _2^2 - \Vert w^{t+1} - u^* \Vert _2^2 }{2\eta _t} + \frac{\eta _t}{2} (3K^2-6K+4) L^2. \end{aligned}$$
(31)

Similarly, we get

$$\begin{aligned}&\sum \limits _{k=0}^{K_t-2} \langle s^{t,k}, w^{t,k}-w^{t,k+1} \rangle \nonumber \\&\quad \le \sum \limits _{k=0}^{K_t-2} \left[ \frac{ \Vert w^{t,k} - w^{t,k+1} \Vert _2^2}{2\eta _t} + \frac{\eta _t}{2} \Vert s^{t,k} \Vert _2^2 \right] \nonumber \\&\quad \le \sum \limits _{k=0}^{K_t-2} \left[ \eta _t \Vert s^{t,k} \Vert _2^2 \right] \le \eta _t (K-1) L^2. \end{aligned}$$
(32)

We deduce from (28), (31), and (32) that

$$\begin{aligned}&\mathsf {Regret}^T_{\text {ODCA-SG}} \\&\quad \le 2 \sum _{t=1}^T \left[ \frac{\Vert w^{t} - u^* \Vert _2^2 - \Vert w^{t+1} - u^* \Vert _2^2 }{2\eta _t} + \frac{\eta _t}{2} (3K^2-4K+2) L^2 \right] \\&\quad \le 2 \sum _{t=1}^T \left[ \Vert w^{t} - u^* \Vert _2^2 \left( \frac{1}{2\eta _t} - \frac{1}{2\eta _{t-1}} \right) + \frac{\eta _t}{2} (3K^2-4K+2) L^2 \right] , \end{aligned}$$

where, by convention, \(\frac{1}{\eta _0}:=0\).

Let us define \(\eta _t := \dfrac{1}{\sqrt{t}\sqrt{3K^2-4K+2}}\) for all \(t=1,\ldots ,T\). We have

$$\begin{aligned} \mathsf {Regret}^T_{\text {ODCA-SG}}&\le 2\sqrt{3K^2-4K+2}\left[ \frac{L^2\sqrt{T}}{2} + \frac{L^2}{2} \left( 2\sqrt{T}\right) \right] \\&\le 3L^2\sqrt{T}\sqrt{3K^2-4K+2}. \end{aligned}$$

As for ODCA-SGk, since Assumption 1 (iv) is also satisfied, we can derive from (27) that

$$\begin{aligned}&\mathsf {Regret}^T_{\text {ODCA-SG}k} \le 2 \sum _{t=1}^T \left[ \langle s^t, w^t-u^* \rangle - \frac{\gamma }{2}\Vert u^*-w^t \Vert ^2 \right] \\&\quad \le 2 \sum _{t=1}^T \left[ \Vert w^{t} - u^* \Vert _2^2 \left( \frac{1}{2\eta _t} - \frac{1}{2\eta _{t-1}} - \frac{\gamma }{2}\right) + \frac{\eta _t}{2} L^2 \right] . \end{aligned}$$

Defining \(\eta _t := \frac{1}{\gamma t}\) for all \(t=1,\ldots ,T\), we obtain

$$\begin{aligned} \mathsf {Regret}^T_{\text {ODCA-SG}k} \le \frac{L^2\left( 1+\log (T)\right) }{\gamma }. \end{aligned}$$

This completes the proof of Theorem 1. \(\square\)
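
To illustrate the scheme analyzed above, here is a schematic Python sketch consistent with the quantities appearing in the proof: at each round, K projected subgradient steps are taken on the convex majorization \(\overline{g}_t\) with step size \(\eta_t\), and the two step-size schedules of the proof are given. The names odca_round, subgrad_gbar and project are ours (hypothetical), and the precise update rule is the one given by (13) in the paper, not reproduced here.

    import math

    def odca_round(w, subgrad_gbar, project, eta, K):
        # One online round, starting from w^{t,0} = w^t: K projected subgradient
        # steps on \bar g_t = g_t - <z^t, .>, where subgrad_gbar(w) returns some
        # s in the subdifferential of \bar g_t at w, and project maps a point
        # back onto the feasible set S (the probability simplex; see Appendix 5).
        for _ in range(K):
            s = subgrad_gbar(w)          # s^{t,k} in the subdifferential of \bar g_t at w^{t,k}
            w = project(w - eta * s)     # projected step of size eta_t
        return w                         # used as w^{t+1}

    def eta_sublinear(t, K):
        # eta_t = 1 / (sqrt(t) * sqrt(3K^2 - 4K + 2)): O(sqrt(T)) regret (ODCA-SG)
        return 1.0 / (math.sqrt(t) * math.sqrt(3 * K * K - 4 * K + 2))

    def eta_logarithmic(t, gamma):
        # eta_t = 1 / (gamma * t): O(log T) regret (ODCA-SGk)
        return 1.0 / (gamma * t)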

Appendix 4: Proof of Proposition 1

Proof

(i) From the condition (7) and Theorem 1, we have, for any \(w \in \mathcal {S}\),

$$|\mathcal {M}| \le \sum \limits _{t \in \mathcal {M}} f_t(w^t) \le \sum \limits _{t \in \mathcal {M}} f_t(w) + 3L^2\sqrt{|\mathcal {M}|}\sqrt{3K^2 -4K+2}.$$

Here, \(|\mathcal {M}|\) denotes the number of steps in \(\mathcal {M}\).

This implies that \(|\mathcal {M}| \le \overline{a} + \overline{c}\sqrt{|\mathcal {M}|}\), which completes the proof of (i).

(ii) From the definition of \(\gamma _{\mathrm {SG}}\), we derive that for any \(w \in \mathcal {S}\),

$$\begin{aligned} |\mathcal {M}| \le \sum _{t \in \mathcal {M}} f_t(w^t) \le \sum _{t \in \mathcal {M}} f_t(w) + \frac{L^2\left( 1+\log (|\mathcal {M}|)\right) }{\gamma _{\mathrm {SG}}}. \end{aligned}$$
(33)

This implies

$$\begin{aligned} |\mathcal {M}| \le \overline{a} + \overline{b}\left( 1+\log (|\mathcal {M}|)\right) . \end{aligned}$$

Consider the strictly convex function \(r : (0,+\infty ) \rightarrow \mathbb {R}\) defined by

$$\begin{aligned}r(x) = x - \overline{a} - \overline{b}\left( 1+\log (x)\right) .\end{aligned}$$

Since \(\lim \nolimits _{x \rightarrow 0^+}r(x) = \lim \nolimits _{x \rightarrow +\infty }r(x) = +\infty\) and \(r(\overline{b}) \le 0\), the equation \(r(x) = 0\) has two roots \(\overline{x}_1\), \(\overline{x}_2\) such that \(0 < \overline{x}_2 \le \overline{b} \le \overline{x}_1\). This completes the proof of (ii). \(\square\)
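
For illustration, the two roots \(\overline{x}_1, \overline{x}_2\) can be computed numerically, for instance by bisection. The following sketch uses arbitrary example values of \(\overline{a}\) and \(\overline{b}\) (not taken from the paper) and assumes \(r(\overline{b}) \le 0\), as in the proof.

    import math

    def mistake_bound_roots(a_bar, b_bar, tol=1e-10):
        # Roots of r(x) = x - a_bar - b_bar*(1 + log x) on (0, +inf).
        # r is strictly convex, r(x) -> +inf as x -> 0+ and as x -> +inf,
        # and r(b_bar) <= 0, so one root lies in (0, b_bar] and one in [b_bar, +inf).
        r = lambda x: x - a_bar - b_bar * (1.0 + math.log(x))

        def bisect(lo, hi):
            # assumes r(lo) and r(hi) have opposite signs
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if r(lo) * r(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            return 0.5 * (lo + hi)

        lo = b_bar
        while r(lo) <= 0.0:      # shrink towards 0 until r becomes positive
            lo *= 0.5
        hi = b_bar
        while r(hi) <= 0.0:      # grow until r becomes positive
            hi *= 2.0
        return bisect(lo, b_bar), bisect(b_bar, hi)

    x_small, x_large = mistake_bound_roots(a_bar=10.0, b_bar=5.0)
    # since r(|M|) <= 0, |M| lies between the two roots; in particular |M| <= x_large
    print(x_small, x_large)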

Appendix 5: Euclidean projection onto the probability simplex

[Algorithm figure: Euclidean projection onto the probability simplex; see [65]]
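
The algorithm figure is not reproduced here; as a substitute, the following NumPy sketch implements the Euclidean projection onto the probability simplex along the lines of [65] (sort, find the threshold index, shift and clip).

    import numpy as np

    def project_simplex(y):
        # Euclidean projection of y onto {w : w >= 0, sum(w) = 1}, following [65]
        y = np.asarray(y, dtype=float)
        u = np.sort(y)[::-1]                               # sort in non-increasing order
        css = np.cumsum(u)
        j = np.arange(1, y.size + 1)
        rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]   # last index with positive value
        lam = (1.0 - css[rho]) / (rho + 1.0)               # common shift
        return np.maximum(y + lam, 0.0)

    # example: the result is nonnegative and sums to 1
    w = project_simplex([0.4, 1.2, -0.3])
    print(w, w.sum())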

Appendix 6: Description of the experts

We describe here the five experts used in the numerical experiments. They are well-known online classification algorithms: the perceptron [50, 57, 62], the relaxed online maximum margin algorithm [47], the approximate maximal margin classification algorithm [23], the passive-aggressive learning algorithm [14], and the classic online gradient descent algorithm [69].

Note that, in this paper, we consider the outcome space \(\mathcal {Y} = \{0,1\}\) and the prediction label \(\tilde{p}_{i,t} = \mathbb {1}_{\langle u_i, x \rangle \ge 0}(x_t) \in \{0,1\}\), where \(u_i\) is the linear classifier of the ith expert. Therefore, in the descriptions below, the \(\{0,1\}\) labels are used instead of the \(\{-1,1\}\) labels commonly used in linear classification algorithms.

First, the perceptron algorithm is the earliest and simplest approach to online binary linear classification [57].

[Algorithm figure: the perceptron algorithm]
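
As a minimal illustration (using the standard \(\{-1,+1\}\) label convention; the \(\{0,1\}\) labels of this paper are obtained by mapping \(-1\) to 0), a single pass of the perceptron over a stream of examples can be sketched as follows.

    import numpy as np

    def perceptron(stream, dim):
        # stream: iterable of (x_t, y_t) with x_t a feature vector and y_t in {-1, +1}
        u = np.zeros(dim)
        for x_t, y_t in stream:
            if y_t * np.dot(u, x_t) <= 0:   # mistake (or zero margin): update
                u = u + y_t * x_t           # classical perceptron step
        return u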

Second, the relaxed online maximum margin algorithm [47] is an incremental algorithm for classification with a linear threshold function. It can be seen as a relaxed version of the algorithm that searches for the separating hyperplane maximizing the minimum distance to the previously seen, correctly classified instances.

[Algorithm figure: the relaxed online maximum margin algorithm]

Third, the approximate maximal margin classification algorithm [23] approximates the maximal margin hyperplane with respect to the \(\ell _p\)-norm (\(p \ge 2\)) for a set of linearly separable data. The algorithm proposed in [23] is called the Approximate Large Margin Algorithm.

[Algorithm figure: the approximate large margin algorithm]

Fourth, the passive-aggressive learning algorithm [14] computes the classifier from the analytical solution of a simple constrained optimization problem that minimizes the distance from the current classifier \(u^t\) to the half-space of vectors with zero hinge loss on the current sample.

[Algorithm figure: the passive-aggressive algorithm]
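
A minimal sketch of the basic passive-aggressive update of [14] (again with \(\{-1,+1\}\) labels) is given below: when the hinge loss is zero the classifier is left unchanged (passive); otherwise it is moved just enough to attain zero loss on the current sample (aggressive).

    import numpy as np

    def pa_update(u, x_t, y_t):
        # One passive-aggressive step on (x_t, y_t), with y_t in {-1, +1}
        loss = max(0.0, 1.0 - y_t * np.dot(u, x_t))     # hinge loss of the current classifier
        if loss > 0.0:
            tau = loss / np.dot(x_t, x_t)               # closed-form step length
            u = u + tau * y_t * x_t
        return u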

Finally, the classic online gradient descent algorithm [69] applies gradient descent steps to minimize the hinge-loss function.

[Algorithm figure: the online gradient descent algorithm]
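
Finally, here is a minimal sketch of one online (sub)gradient step on the hinge loss in the spirit of [69]; the step size eta is left as a parameter, and the projection onto a bounded feasible set used in [69] is omitted here.

    import numpy as np

    def ogd_hinge_update(u, x_t, y_t, eta):
        # One subgradient step on the hinge loss max(0, 1 - y<u, x>), with y_t in {-1, +1}
        if y_t * np.dot(u, x_t) < 1.0:      # nonzero loss: a subgradient is -y_t * x_t
            u = u + eta * y_t * x_t
        return u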

Cite this article

Le Thi, H.A., Ho, V.T. DCA for online prediction with expert advice. Neural Comput & Applic 33, 9521–9544 (2021). https://doi.org/10.1007/s00521-021-05709-0
