
Online AdaBoost-based methods for multiclass problems

Abstract

Boosting is a technique designed to transform a set of weak classifiers into a strong ensemble. To achieve this, the components are trained with different data samples and their hypotheses are aggregated to produce a better prediction. The use of boosting in online environments is comparatively recent; inspired by its success in offline settings, it is emerging to meet new demands. One of the challenges is to make the methods handle large amounts of information under computational constraints. This paper proposes two new online boosting methods: the first aims to perform a better weight distribution of the instances to closely match the behavior of AdaBoost.M1, whereas the second focuses on multiclass problems and is based on AdaBoost.M2. Theoretical arguments demonstrate their convergence and show that both methods retain the main features of their traditional counterparts. In addition, we performed experiments to compare the accuracy and memory usage of the proposed methods against other approaches using 20 well-known datasets. Results suggest that, in many different situations, the proposed algorithms maintain high accuracy, outperforming the other tested methods.

Notes

  1. Because it refers to the exact calculation, the symbol \(\approx \) is not used in this equation.

  2. Since OABM2 has two parameters (w and L) and three values were selected for each one, a total of 10 tests per dataset were performed, including the default values of the method.

  3. The complexities were defined taking into account the implementations made available by the authors or available in the MOA framework.

  4. This optimization technique is known as memoization.
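
For readers unfamiliar with the term, the following minimal Python sketch illustrates the general idea of memoization; it is a generic example (the function `fib` and the use of `lru_cache` are ours), not the implementation used in this paper.

```python
from functools import lru_cache

# Generic illustration of memoization: results of previous calls are cached,
# so repeated calls with the same argument are answered from the cache
# instead of being recomputed. (Illustrative only.)
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))  # fast, because intermediate results are reused
```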

References

  • Barros RSM, Santos SGTC (2018) A large-scale comparison of concept drift detectors. Inf Sci 451–452(C):348–370. https://doi.org/10.1016/j.ins.2018.04.014

  • Barros RSM, Santos SGTC, Gonçalves PM Jr (2016) A boosting-like online learning ensemble. In: International joint conference on neural networks (IJCNN), pp 1871–1878. https://doi.org/10.1109/IJCNN.2016.7727427

  • Barros RSM, Cabral DRL, Gonçalves PM Jr, Santos SGTC (2017) RDDM: reactive drift detection method. Expert Syst Appl 90(C):344–355

  • Beygelzimer A, Hazan E, Kale S, Luo H (2015a) Online gradient boosting. In: Proceedings of 28th international conference on neural information processing systems (NIPS’15). MIT Press, pp 2458–2466

  • Beygelzimer A, Kale S, Luo H (2015b) Optimal and adaptive algorithms for online boosting. In: Bach F, Blei D (eds) ICML, JMLR workshop and conference proceedings, vol 37, pp 2323–2331

  • Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of 15th ACM international conference on knowledge discovery and data mining, New York, pp 139–148

  • Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: massive online analysis. J Mach Learn Res 11:1601–1604

  • Brzezinski D, Stefanowski J (2013) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94

  • Chen S, Lin H, Lu C (2012) An online boosting algorithm with theoretical justifications. In: Proceedings of the 29th international conference on machine learning, ICML’12, pp 1873–1880

  • Cormen T, Leiserson C, Rivest R, Stein C (2009) Growth of functions. In: Introduction to algorithms, 3rd edn, chap 1. The MIT Press, pp 43–64

  • Dawid A (1984) Present position and potential developments: some personal views: statistical theory: the prequential approach. J R Stat Soc Ser A (Gen) 147(2):278–292. https://doi.org/10.2307/2981683

  • Demiriz A, Bennett KP, Shawe-Taylor J (2002) Linear programming boosting via column generation. Mach Learn 46(1–3):225–254. https://doi.org/10.1023/A:1012470815092

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121(2):256–285. https://doi.org/10.1006/inco.1995.1136

  • Freund Y (2001) An adaptive version of the boost by majority algorithm. Mach Learn 43(3):293–318. https://doi.org/10.1023/A:1010852229904

  • Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: International conference on machine learning, pp 148–156

  • Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139. https://doi.org/10.1006/jcss.1997.1504

  • Frías-Blanco I, Campo-Ávila J, Ramos-Jiménez G, Morales-Bueno R, Ortiz-Díaz A, Caballero-Mota Y (2015) Online and non-parametric drift detection methods based on Hoeffding's bounds. IEEE Trans Knowl Data Eng 27(3):810–823

  • Frías-Blanco I, Verdecia-Cabrera A, Ortiz-Díaz A, Carvalho A (2016) Fast adaptive stacking of ensembles. In: Proceedings of the 31st ACM symposium on applied computing (SAC’16), Pisa, Italy, pp 929–934

  • Gama J, Sebastião R, Rodrigues P (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346. https://doi.org/10.1007/s10994-012-5320-9

  • Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1–44:37. https://doi.org/10.1145/2523813

  • Grabner H, Bischof H (2006) On-line boosting and vision. In: Proceedings of 2006 IEEE conference on computer vision and pattern recognition, vol 1, Washington, DC, USA, CVPR ’06, pp 260–267. https://doi.org/10.1109/CVPR.2006.215

  • Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), New York, pp 97–106

  • Jung YH, Goetz J, Tewari A (2017) Online multiclass boosting. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates, Inc., Red Hook, pp 920–929

  • Krawczyk B, Minku LL, Gama J, Stefanowski J, Woźniak M (2017) Ensemble learning for data stream analysis: a survey. Inf Fusion 37:132–156. https://doi.org/10.1016/j.inffus.2017.02.004

  • Manapragada C, Webb GI, Salehi M (2018) Extremely fast decision tree. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’18, pp 1953–1962. https://doi.org/10.1145/3219819.3220005

  • Mukherjee I, Schapire RE (2013) A theory of multiclass boosting. J Mach Learn Res 14(1):437–497

  • Oza NC, Russell S (2001) Online bagging and boosting. In: Artificial intelligence and statistics, Morgan Kaufmann, pp 105–112

  • Pelossof R, Jones M, Vovsha I, Rudin C (2009) Online coordinate boosting. In: Computer vision workshops (ICCV Workshops). IEEE, pp 1354–1361. https://doi.org/10.1109/ICCVW.2009.5457454

  • Saffari A, Godec M, Pock T, Leistner C, Bischof H (2010) Online multi-class LPBoost. In: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR ’10, pp 3570–3577. https://doi.org/10.1109/CVPR.2010.5539937

  • Santos S, Gonçalves P Jr, Silva G, Barros RSM (2014) Speeding up recovery from concept drifts. In: Machine learning and knowledge discovery databases, LNCS, vol 8726. Springer, pp 179–194. https://doi.org/10.1007/978-3-662-44845-8_12

  • Schapire R, Freund Y (2012a) Attaining the best possible accuracy. In: Boosting: foundations and algorithms, chap 12. MIT Press, pp 377–413

  • Schapire R, Freund Y (2012b) Learning to rank. In: Boosting: foundations and algorithms, chap 11. MIT Press, pp 341–374

  • Schapire R, Freund Y (2012c) Loss minimization and generalizations of boosting. In: Boosting: foundations and algorithms, chap 7. MIT Press, pp 175–226

  • Schapire R, Freund Y (2012d) Multiclass classification problems. In: Boosting: foundations and algorithms, chap 10. MIT Press, pp 303–340

  • Schapire R, Freund Y (2012e) Using confidence-rated weak predictions. In: Boosting: foundations and algorithms, chap 9. MIT Press, pp 271–302

  • Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336. https://doi.org/10.1023/A:1007614523901

  • Servedio RA (2003) Smooth boosting and learning with malicious noise. J Mach Learn Res 4:633–648

Acknowledgements

Silas Santos is supported by a postgraduate grant from CNPq.

Author information

Corresponding author

Correspondence to Silas Garrido Teixeira de Carvalho Santos.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Proof for Theorem 3

Proof

Considering that

$$\begin{aligned} F(x,y) = \sum ^{T}_{t=1} \alpha _t \left( [[y = h_t(x)]] - [[y \ne h_t(x)]]\right) , \end{aligned}$$
(10)

where

$$\begin{aligned} \alpha _t = \frac{1}{2} \times ln \left( \frac{1-\varepsilon _t}{\varepsilon _t}\right) . \end{aligned}$$
(11)

Taking into account that

$$\begin{aligned} w_{t+1,i}&= \frac{w_{t,i} \times \beta ^{1-[[h_t(x_i) \ne y_i]]}_{t}}{Z_t}&\end{aligned}$$
(12)
$$\begin{aligned}&= \frac{w_{t,i} \times e^{-\alpha _t\left( [[y_i = h_t(x_i)]] - [[y_i \ne h_t(x_i)]]\right) }}{Z_t}, \end{aligned}$$
(13)

we can say that, after passing through all T classifiers, the final weight of instance i (\(w_{T+1,i}\)) will be given by

$$\begin{aligned} w_{T+1,i}&= w_{1,i} \times \frac{e^{-\alpha _1\left( [[y_i = h_1(x_i)]] - [[y_i \ne h_1(x_i)]]\right) }}{Z_1} \times \cdots \times \frac{e^{-\alpha _T\left( [[y_i = h_T(x_i)]] - [[y_i \ne h_T(x_i)]]\right) }}{Z_T}\nonumber \\&= \frac{w_{1,i} \times exp\left( -\sum ^{T}_{t=1}\alpha _{t}([[y_i = h_t(x_i)]] - [[y_i \ne h_t(x_i)]])\right) }{\prod ^{T}_{t=1}Z_t}. \end{aligned}$$
(14)

Combining Eq. (10) with Eq. (14), we obtain

$$\begin{aligned} w_{T+1,i} = \frac{w_{1,i} \times e^{-F(x_i,y_i)}}{\prod ^{T}_{t=1}Z_t}. \end{aligned}$$
(15)
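
As an informal aid to reading Eqs. (12)–(15), the per-round update and its cumulative effect can be sketched in code. This is only an illustrative snippet with our own variable names, assuming \(\beta_t = \varepsilon_t/(1-\varepsilon_t)\) as in AdaBoost.M1; it is not the authors' MOA implementation.

```python
import numpy as np

def update_weights(w, correct, eps):
    """One round of Eq. (12): w_{t+1,i} = w_{t,i} * beta_t^{1-[[h_t(x_i) != y_i]]} / Z_t.

    w       : current weights w_t over the instances seen so far
    correct : boolean array, True where h_t classified instance i correctly
    eps     : weighted error of h_t (assumed < 1/2, so beta_t < 1)
    """
    beta = eps / (1.0 - eps)                   # assumed AdaBoost.M1-style beta_t
    w_new = w * np.where(correct, beta, 1.0)   # exponent 1 if correct, 0 otherwise
    return w_new / w_new.sum()                 # normalisation by Z_t

# Applying the update for t = 1..T reproduces the cumulative form of Eqs. (14)-(15):
# w_{T+1,i} ends up proportional to w_{1,i} * exp(-F(x_i, y_i)).
```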

To begin the construction of Eq. (8), consider that, given an instance x, an arbitrary class y was not chosen as the response of the committee (\(h_f(x) \ne y\)). In this case, there is another class \(\ell \) that satisfies the following inequalities:

$$\begin{aligned} \sum ^{T}_{t=1}\alpha _t[[y = h_t(x)]] \le \sum ^{T}_{t=1}\alpha _t[[\ell = h_t(x)]] \le \sum ^{T}_{t=1}\alpha _t[[y \ne h_t(x)]]. \end{aligned}$$
(16)

It is important to note that using \(\alpha _t\) to weight the committee’s final response has the same effect as using \(\beta _t\), as done in OABM1. In addition, the latter inequality takes advantage of the fact that \(\alpha _t \ge 0\) since \(\varepsilon _t < 1/2\). Thus, \(h_f(x) \ne y\) implies that \(F(x,y) \le 0\). Therefore,

$$\begin{aligned}{}[[h_f(x) \ne y]] \le e^{-F(x,y)}. \end{aligned}$$
(17)

Using Eqs. (15) and (17), and taking into account that the training error (the left side of the inequality in Eq. (8)) is defined by the sum of the weights of the instances incorrectly classified by \(h_f\), we have:

$$\begin{aligned} Pr_{i \sim w_1}[h_f(x_i) \ne y_i]&= \sum ^{N}_{i=1}w_{1,i} \times [[h_f(x_i) \ne y_i]]\nonumber \\&\le \sum ^{N}_{i=1} w_{1,i} \times e^{-F(x_i,y_i)} \end{aligned}$$
(18)
$$\begin{aligned}&= \sum ^{N}_{i=1}w_{T+1,i} \times \prod ^{T}_{t=1}Z_t. \end{aligned}$$
(19)

Equation (18) uses the definition in (17), while Eq. (19) uses (15). However, since not all instances are known a priori in an online setting, the iteration over the N instances is performed incrementally. Therefore,

$$\begin{aligned} Pr_{i \sim w_1}[h_f(x_i) \ne y_i] \le \sum ^{N}_{i=1} \sum ^{i}_{j=1} \left( w_{T+1,j} \times \prod ^{T}_{t=1} Z_t \right) , \end{aligned}$$
(20)

where the iteration in j represents the incremental inclusion of new instances. Finally, to obtain (8), the value of \(Z_t\) in (20) needs to be defined. Rewriting Eq. (12) in full, in accordance with Line 14 of Algorithm 3, we obtain:

$$\begin{aligned} w_{t+1,i}&= \left( w_{t,i} \times \beta ^{1-[[h_t(x_i) \ne y_i]]}_{t} \right) \big / \left( \tfrac{\xi c_{t}}{cw_{t}} \times \beta _{t}^{1} + \tfrac{\xi w_{t}}{cw_{t}} \times \beta _{t}^{0} \right) \end{aligned}$$
(21)
$$\begin{aligned}&= \left( w_{t,i} \times e^{-\alpha _t\left( [[y_i = h_t(x_i)]] - [[y_i \ne h_t(x_i)]]\right) } \right) \big / \left( \tfrac{\xi c_{t}}{cw_{t}} \times e^{-\alpha _t} + \tfrac{\xi w_{t}}{cw_{t}} \times e^{\alpha _t} \right) . \end{aligned}$$
(22)

Given that \(\varepsilon _t = \xi w_{t}/cw_{t}\), the denominator of Eq. (22) can be expressed as:

$$\begin{aligned} Z_t&= (1-\varepsilon _t) \times e^{-\alpha _t} + \varepsilon _t \times e^{\alpha _t}\end{aligned}$$
(23)
$$\begin{aligned}&= (1/2 + \gamma _t) \times e^{-\alpha _t} + (1/2 - \gamma _t) \times e^{\alpha _t} \end{aligned}$$
(24)
$$\begin{aligned}&= \sqrt{1-4\gamma ^{2}_{t}}, \end{aligned}$$
(25)

where Eqs. (24) and (25) use the definitions of \(\gamma _t\) and \(\alpha _t\), respectively. Plugging the value of \(Z_t\) from Eq. (25) into (20) yields the main equation (8), which completes the proof. \(\square \)
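
As a quick numerical sanity check of the closed form in Eqs. (23)–(25) (not part of the original demonstration), one can verify that \((1-\varepsilon_t)e^{-\alpha_t} + \varepsilon_t e^{\alpha_t} = \sqrt{1-4\gamma_t^2}\) for any edge \(\gamma_t \in (0, 1/2)\), for example:

```python
import numpy as np

def z_exact(gamma):
    """Z_t = (1 - eps) * e^{-alpha} + eps * e^{alpha}, with eps = 1/2 - gamma (Eqs. 23-24)."""
    eps = 0.5 - gamma
    alpha = 0.5 * np.log((1 - eps) / eps)        # Eq. (11)
    return (1 - eps) * np.exp(-alpha) + eps * np.exp(alpha)

for g in (0.05, 0.20, 0.45):
    assert np.isclose(z_exact(g), np.sqrt(1 - 4 * g ** 2))   # Eq. (25)
```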

Appendix B. Proof for Theorem 4

Proof

Considering that

$$\begin{aligned} Pr(x,y) = \tfrac{1}{2} \left( 1 - h_t(x,y_i) + h_t(x,y)\right) \end{aligned}$$
(26)

and that

$$\begin{aligned} F(x,y) = \sum ^{T}_{t=1} \alpha _t \left( h_t(x,y_i) - h_t(x,y)\right) , \end{aligned}$$
(27)

where

$$\begin{aligned} \alpha _t = \frac{1}{2} \times ln \left( \frac{1-\varepsilon _t}{\varepsilon _t}\right) . \end{aligned}$$
(28)

Taking into account that

$$\begin{aligned} w_{t+1,i}&= \frac{w_{t,i} \times \beta ^{(1/2)(1 + h_t(x_i,y_i) - h_t(x_i,y))}_{t}}{Z_t} \end{aligned}$$
(29)
$$\begin{aligned}&\approx \frac{w_{t,i} \times e^{-\alpha _t\left( h_t(x_i,y_i) - h_t(x_i,y)\right) }}{Z_t}, \end{aligned}$$
(30)

we can say that, after passing through all T classifiers, the final weight of instance i (\(w_{T+1,i}\)) will be given by

$$\begin{aligned} w_{T+1,i}&= w_{1,i} \times \frac{e^{-\alpha _1\left( h_1(x_i,y_i) - h_1(x_i,y)\right) }}{Z_1} \times \cdots \times \frac{e^{-\alpha _T\left( h_T(x_i,y_i) - h_T(x_i,y)\right) }}{Z_T} \nonumber \\&= \frac{w_{1,i} \times exp\left( -\sum ^{T}_{t=1}\alpha _{t}(h_t(x_i,y_i) - h_t(x_i,y))\right) }{\prod ^{T}_{t=1}Z_t}. \end{aligned}$$
(31)

Combining Eq. (27) with Eq. (31), we obtain

$$\begin{aligned} w_{T+1,i} = \frac{w_{1,i} \times e^{-F(x_i,y_i)}}{\prod ^{T}_{t=1}Z_t}. \end{aligned}$$
(32)
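
A similar illustrative sketch can be given for the plausibility-based update in Eqs. (29)–(30); again the variable names (`plaus_true`, `plaus_y`) and the assumption \(\beta_t = \varepsilon_t/(1-\varepsilon_t)\) are ours, and this is not the OABM2 pseudo-code.

```python
import numpy as np

def update_weights_m2(w, plaus_true, plaus_y, eps):
    """One round of the AdaBoost.M2-style update in Eq. (29).

    w          : current weights over (instance, incorrect-label) pairs
    plaus_true : h_t(x_i, y_i), plausibility assigned to the correct label
    plaus_y    : h_t(x_i, y),   plausibility assigned to the incorrect label y
    eps        : pseudo-loss of h_t (assumed < 1/2)
    """
    beta = eps / (1.0 - eps)                                  # assumed beta_t
    w_new = w * beta ** (0.5 * (1.0 + plaus_true - plaus_y))  # exponent of Eq. (29)
    return w_new / w_new.sum()                                # normalisation by Z_t
```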

Now suppose that, given an instance x, an arbitrary class y was not chosen as the response of the committee (\(h_f(x) \ne y\)). Then

$$\begin{aligned} \sum ^{T}_{t=1}\alpha _t\left( h_t(x,y_i) - h_t(x,y)\right) \le \sum ^{T}_{t=1}\alpha _t\left( h_t(x,y_i) - h_t(x,h_f(x))\right) . \end{aligned}$$
(33)

Thus, \(h_f(x) \ne y\) implies that \(F(x,y) \le 0\). This follows from the fact that, according to Line 38 of OABM2 (Algorithm 4), the class chosen by the committee is the one with the highest plausibility among the classifiers, making \(F(x,y) \le F(x,h_f(x)) \le 0\). Therefore,

$$\begin{aligned} Pr(x,h_f(x)) \le e^{-F(x,y)}. \end{aligned}$$
(34)

Taking into account that the training error (the left side of the inequality in Eq. 9) is defined by the sum of the distribution weights multiplied, for each instance, by the error probability of \(h_f\) (Freund and Schapire 1997), we have:

$$\begin{aligned} Pr_{i \sim w_1}[h_f(x_i) \ne y_i]&= \sum ^{N}_{i=1}w_{1,i} \times Pr(x_i,h_f(x_i)). \end{aligned}$$
(35)

In addition, assuming that all \(k-1\) iterations performed by OABM2 are equally important, we can rewrite Eq. (35) as follows:

$$\begin{aligned} Pr_{i \sim w_1}[h_f(x_i) \ne y_i]&= \sum ^{N}_{i=1}w_{1,i} \times (k-1) \times Pr(x_i,h_f(x_i)). \end{aligned}$$
(36)

Then, substituting the definitions of Eqs. (32) and (34) into (36), we get:

$$\begin{aligned} Pr_{i \sim w_1}[h_f(x_i) \ne y_i]&\le \sum ^{N}_{i=1} w_{1,i} \times (k-1) \times e^{-F(x_i,y_i)} \end{aligned}$$
(37)
$$\begin{aligned}&= \sum ^{N}_{i=1}w_{T+1,i} \times (k-1) \times \prod ^{T}_{t=1}Z_t. \end{aligned}$$
(38)

Equation (37) uses the definition in (34), while Eq. (38) uses (32). However, since not all instances are known a priori in an online setting, the iteration over the N instances is performed incrementally. So,

$$\begin{aligned} Pr_{i \sim w_1}[h_f(x_i) \ne y_i] \le \sum ^{N}_{i=1} \sum ^{i}_{j=1} \left( w_{T+1,j} \times (k-1) \times \prod ^{T}_{t=1} Z_t \right) , \end{aligned}$$
(39)

where the iteration in j represents the incremental inclusion of new instances. Finally, to obtain (9), the value of \(Z_t\) in (39) needs to be defined. Rewriting Eq. (29) in full, in accordance with Line 35 of Algorithm 4, we obtain

$$\begin{aligned} w_{t+1,i}&\approx \left( w_{t,i} \times \beta ^{(1/2)(1 + h_t(x_i,y_i) - h_t(x_i,y))}_{t} \right) \Big / \left( \textstyle \sum _{i=1}^{N} w_{t,i} \times \beta ^{(1/2)(1 + h_t(x_i,y_i) - h_t(x_i,y))}_{t} \right) \end{aligned}$$
(40)
$$\begin{aligned}&\approx \left( w_{t,i} \times e^{-\alpha _t\left( h_t(x_i,y_i) - h_t(x_i,y)\right) } \right) \Big / \left( \textstyle \sum _{i=1}^{N} w_{t,i} \times e^{-\alpha _t\left( h_t(x_i,y_i) - h_t(x_i,y)\right) } \right) , \end{aligned}$$
(41)

with \(Z_t\) corresponding to the denominators of Eqs. (40) and (41). Since the hypotheses of OABM2 take values in the range \([-1,+1]\), \(Z_t\) can be rearranged as described in subsection 3.1 of Schapire and Singer (1999) and in subsection 9.2.3 of Schapire and Freund (2012e). For this purpose, assume that u and r are defined by:

$$\begin{aligned} u&= h_t(x,y_i) - h_t(x,y) \end{aligned}$$
(42)
$$\begin{aligned} r&= \sum _{i=1}^{N} w_{t,i} (h_t(x_i,y_i) - h_t(x_i,y)). \end{aligned}$$
(43)

Thus,

$$\begin{aligned} Z_t&= \sum _{i=1}^{N} w_{t,i} \times e^{-\alpha _t\left( h_t(x_i,y_i) - h_t(x_i,y)\right) } \nonumber \\&= \sum _{i=1}^{N} w_{t,i} \times e^{-\alpha _t u_i}\end{aligned}$$
(44)
$$\begin{aligned}&\approx \sum _{i=1}^{N} w_{t,i} \left[ \left( \frac{1+u_i}{2}\right) e^{-\alpha _t} + \left( \frac{1-u_i}{2}\right) e^{\alpha _t} \right] \nonumber \\&= \left( \frac{e^{\alpha _t}+e^{-\alpha _t}}{2} \right) - \left( \frac{e^{\alpha _t}-e^{-\alpha _t}}{2} \right) r. \end{aligned}$$
(45)

By the definitions of \(\alpha _t\) and \(\gamma _t\), and knowing that \(\varepsilon _t = (1-r)/2\) (Schapire and Singer 1999), it is possible to substitute them into Eq. (45) and obtain:

$$\begin{aligned} Z_t \approx \sqrt{1-4\gamma ^{2}_{t}}. \end{aligned}$$
(46)

Plugging the value of \(Z_t\) from Eq. (46) into (39) yields the main equation (9), which completes the proof. \(\square \)
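
Again as a numerical sanity check only (not part of the original demonstration), the convexity bound used to pass from Eq. (44) to Eq. (45), \(e^{-\alpha_t u} \le \frac{1+u}{2}e^{-\alpha_t} + \frac{1-u}{2}e^{\alpha_t}\) for \(u \in [-1,+1]\), can be verified as follows:

```python
import numpy as np

alpha = 0.5 * np.log((1 - 0.3) / 0.3)      # alpha_t for an arbitrary eps_t = 0.3
u = np.linspace(-1.0, 1.0, 201)            # differences h_t(x, y_i) - h_t(x, y), assumed in [-1, +1]

lhs = np.exp(-alpha * u)                                           # summands of Eq. (44)
rhs = (1 + u) / 2 * np.exp(-alpha) + (1 - u) / 2 * np.exp(alpha)   # summands of Eq. (45)
assert np.all(lhs <= rhs + 1e-12)          # convexity of exp(-alpha * u) in u
```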

Cite this article

Santos, S.G.T.d.C., de Barros, R.S.M. Online AdaBoost-based methods for multiclass problems. Artif Intell Rev 53, 1293–1322 (2020). https://doi.org/10.1007/s10462-019-09696-6
