Abstract
Boosting is a technique designed to transform a set of weak classifiers into a strong ensemble. To achieve this, the component classifiers are trained with different data samples and their hypotheses are aggregated to produce a better prediction. The use of boosting in online environments is comparatively recent, inspired by its success in offline settings and driven by the need to process significant amounts of information under computational constraints. This paper proposes two new online boosting methods: the first aims to perform a better weight distribution of the instances so as to closely match the behavior of AdaBoost.M1, whereas the second focuses on multiclass problems and is based on AdaBoost.M2. Theoretical arguments demonstrate the convergence of both methods and show that they retain the main features of their traditional counterparts. In addition, we performed experiments comparing the accuracy as well as the memory usage of the proposed methods against other approaches on 20 well-known datasets. The results suggest that, in many different situations, the proposed algorithms maintain high accuracy, outperforming the other tested methods.
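For context, the classic online boosting scheme of Oza and Russell (2001), which the methods proposed here build upon, approximates AdaBoost's instance reweighting by presenting each arriving instance to each base learner \(k \sim \text{Poisson}(\lambda)\) times, raising \(\lambda\) after mistakes and lowering it after correct predictions. A minimal sketch of that general idea follows; the class names, the toy base learner, and the weight cap are illustrative assumptions, not the authors' implementation:

```python
import math
import random


class CountingStump:
    """Toy incremental base learner: per-input class counts (illustrative only)."""

    def __init__(self):
        self.counts = {}

    def partial_fit(self, x, y):
        self.counts.setdefault(x, {})
        self.counts[x][y] = self.counts[x].get(y, 0) + 1

    def predict(self, x):
        seen = self.counts.get(x)
        return max(seen, key=seen.get) if seen else None


class OnlineBooster:
    """Oza & Russell-style online boosting: each instance trains each base
    learner k ~ Poisson(lambda) times; lambda is raised after mistakes and
    lowered after correct predictions, mimicking AdaBoost's reweighting."""

    def __init__(self, learners):
        self.learners = learners
        self.sc = [0.0] * len(learners)  # correctly classified weight mass
        self.sw = [0.0] * len(learners)  # misclassified weight mass

    @staticmethod
    def _poisson(lam):
        # Knuth's method; adequate for the small lambdas that occur here.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1

    def partial_fit(self, x, y):
        lam = 1.0
        for m, learner in enumerate(self.learners):
            for _ in range(self._poisson(lam)):
                learner.partial_fit(x, y)
            if learner.predict(x) == y:
                self.sc[m] += lam
                eps = self.sw[m] / (self.sc[m] + self.sw[m])
                lam *= 1.0 / (2.0 * (1.0 - eps))  # easy instance: shrink weight
            else:
                self.sw[m] += lam
                eps = self.sw[m] / (self.sc[m] + self.sw[m])
                lam *= 1.0 / (2.0 * eps)          # hard instance: boost weight
            lam = min(lam, 100.0)  # guard against runaway weights (our addition)

    def predict(self, x):
        # Weighted vote, with weights log((1 - eps) / eps) as in AdaBoost.M1.
        votes = {}
        for m, learner in enumerate(self.learners):
            total = self.sc[m] + self.sw[m]
            pred = learner.predict(x)
            if total == 0.0 or pred is None:
                continue
            eps = max(self.sw[m] / total, 1e-10)
            if eps >= 0.5:
                continue  # no better than chance: no vote
            votes[pred] = votes.get(pred, 0.0) + math.log((1.0 - eps) / eps)
        return max(votes, key=votes.get) if votes else None
```

On a stationary stream this sketch converges to the same weighted vote AdaBoost.M1 would produce in batch mode; the methods proposed in the paper refine precisely this weight distribution step.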
Notes
Because it refers to the exact calculation, the symbol \(\approx \) is not used in this equation.
Since OABM2 has two parameters (w and L) and three values were selected for each one, a total of 10 tests per dataset were performed, including the default values of the method.
The complexities were defined taking into account the implementations made available by the authors or available in the MOA framework.
This optimization technique is known as memoization.
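Memoization, as referred to in the note above, caches the results of expensive computations keyed by their inputs so that repeated calls become table lookups. A generic Python illustration (not the authors' implementation):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    # Without caching this recursion is exponential in n; with memoization
    # each value of n is computed once, so the call below runs in linear time.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(30))  # 832040
```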
References
Barros RSM, Santos SGTC (2018) A large-scale comparison of concept drift detectors. Inf Sci 451–452(C):348–370. https://doi.org/10.1016/j.ins.2018.04.014
Barros RSM, Santos SGTC, Gonçalves PM Jr (2016) A boosting-like online learning ensemble. In: International joint conference on neural networks (IJCNN), pp 1871–1878. https://doi.org/10.1109/IJCNN.2016.7727427
Barros RSM, Cabral DRL, Gonçalves PM Jr, Santos SGTC (2017) RDDM: reactive drift detection method. Expert Syst Appl 90(C):344–355
Beygelzimer A, Hazan E, Kale S, Luo H (2015a) Online gradient boosting. In: Proceedings of 28th international conference on neural information processing systems (NIPS’15). MIT Press, pp 2458–2466
Beygelzimer A, Kale S, Luo H (2015b) Optimal and adaptive algorithms for online boosting. In: Bach F, Blei D (eds) ICML, JMLR workshop and conference proceedings, vol 37, pp 2323–2331
Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of 15th ACM international conference on knowledge discovery and data mining, New York, pp 139–148
Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: massive online analysis. J Mach Learn Res 11:1601–1604
Brzezinski D, Stefanowski J (2013) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94
Chen S, Lin H, Lu C (2012) An online boosting algorithm with theoretical justifications. In: Proceedings of the 29th international conference on machine learning, ICML’12, pp 1873–1880
Cormen T, Leiserson C, Rivest R, Stein C (2009) Growth of functions. In: Introduction to algorithms, 3rd edn, chap 3. The MIT Press, pp 43–64
Dawid A (1984) Present position and potential developments: some personal views: statistical theory: the prequential approach. J R Stat Soc Ser A (Gen) 147(2):278–292. https://doi.org/10.2307/2981683
Demiriz A, Bennett KP, Shawe-Taylor J (2002) Linear programming boosting via column generation. Mach Learn 46(1–3):225–254. https://doi.org/10.1023/A:1012470815092
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121(2):256–285. https://doi.org/10.1006/inco.1995.1136
Freund Y (2001) An adaptive version of the boost by majority algorithm. Mach Learn 43(3):293–318. https://doi.org/10.1023/A:1010852229904
Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: International conference on machine learning, pp 148–156
Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139. https://doi.org/10.1006/jcss.1997.1504
Frías-Blanco I, Campo-Ávila J, Ramos-Jiménez G, Morales-Bueno R, Ortiz-Díaz A, Caballero-Mota Y (2015) Online and non-parametric drift detection methods based on Hoeffding's bounds. IEEE Trans Knowl Data Eng 27(3):810–823
Frías-Blanco I, Verdecia-Cabrera A, Ortiz-Díaz A, Carvalho A (2016) Fast adaptive stacking of ensembles. In: Proceedings of the 31st ACM symposium on applied computing (SAC’16), Pisa, Italy, pp 929–934
Gama J, Sebastião R, Rodrigues P (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346. https://doi.org/10.1007/s10994-012-5320-9
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1–44:37. https://doi.org/10.1145/2523813
Grabner H, Bischof H (2006) On-line boosting and vision. In: Proceedings of 2006 IEEE conference on computer vision and pattern recognition, vol 1, Washington, DC, USA, CVPR ’06, pp 260–267. https://doi.org/10.1109/CVPR.2006.215
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), New York, pp 97–106
Jung YH, Goetz J, Tewari A (2017) Online multiclass boosting. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates, Inc., Red Hook, pp 920–929
Krawczyk B, Minku LL, Gama J, Stefanowski J, Woźniak M (2017) Ensemble learning for data stream analysis: a survey. Inf Fusion 37:132–156. https://doi.org/10.1016/j.inffus.2017.02.004
Manapragada C, Webb GI, Salehi M (2018) Extremely fast decision tree. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’18, pp 1953–1962. https://doi.org/10.1145/3219819.3220005
Mukherjee I, Schapire RE (2013) A theory of multiclass boosting. J Mach Learn Res 14(1):437–497
Oza NC, Russell S (2001) Online bagging and boosting. In: Artificial intelligence and statistics, Morgan Kaufmann, pp 105–112
Pelossof R, Jones M, Vovsha I, Rudin C (2009) Online coordinate boosting. In: Computer vision workshops (ICCV Workshops). IEEE, pp 1354–1361. https://doi.org/10.1109/ICCVW.2009.5457454
Saffari A, Godec M, Pock T, Leistner C, Bischof H (2010) Online multi-class LPBoost. In: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR ’10, pp 3570–3577. https://doi.org/10.1109/CVPR.2010.5539937
Santos S, Gonçalves P Jr, Silva G, Barros RSM (2014) Speeding up recovery from concept drifts. In: Machine learning and knowledge discovery databases, LNCS, vol 8726. Springer, pp 179–194. https://doi.org/10.1007/978-3-662-44845-8_12
Schapire R, Freund Y (2012a) Attaining the best possible accuracy. In: Boosting: foundations and algorithms, chap 12. MIT Press, pp 377–413
Schapire R, Freund Y (2012b) Learning to rank. In: Boosting: foundations and algorithms, chap 11. MIT Press, pp 341–374
Schapire R, Freund Y (2012c) Loss minimization and generalizations of boosting. In: Boosting: foundations and algorithms, chap 7. MIT Press, pp 175–226
Schapire R, Freund Y (2012d) Multiclass classification problems. In: Boosting: foundations and algorithms, chap 10. MIT Press, pp 303–340
Schapire R, Freund Y (2012e) Using confidence-rated weak predictions. In: Boosting: foundations and algorithms, chap 9. MIT Press, pp 271–302
Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336. https://doi.org/10.1023/A:1007614523901
Servedio RA (2003) Smooth boosting and learning with malicious noise. J Mach Learn Res 4:633–648
Acknowledgements
Silas Santos is supported by a postgraduate grant from CNPq.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A. Proof for Theorem 3
Proof
Considering that
where
Taking into account that
we can say that, after passing through all T classifiers, the final distribution weight of instance i (\(w_{T+1,i}\)) will be defined by
Combining Eq. (10) with Eq. (14), we obtain
To begin the construction of Eq. (8), consider that, given an instance x, an arbitrary class y has not been chosen as the response of the committee (\(h_f(x) \ne y\)). In this case, there is another class (\(\ell \)) that satisfies the following inequalities:
It is important to note that using \(\alpha _t\) to weight the committee’s final response has the same effect as using \(\beta _t\) in OABM1. In addition, the latter inequality takes advantage of the fact that \(\alpha _t \ge 0\), since \(\varepsilon _t < 1/2\). Thus, \(h_f(x) \ne y\) implies that \(F(x,y) \le 0\). Therefore,
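Assuming \(\alpha _t\) takes the usual AdaBoost form (the displayed equations did not survive in this copy of the text, so the exact constant may differ), the sign claim is immediate:

\[ \varepsilon _t < \tfrac{1}{2} \;\Longrightarrow\; \frac{1-\varepsilon _t}{\varepsilon _t} > 1 \;\Longrightarrow\; \alpha _t = \frac{1}{2}\ln \frac{1-\varepsilon _t}{\varepsilon _t} > 0. \]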
Observing Eqs. (15) and (17), and taking into account that the training error (the left side of the inequality in Eq. 8) is defined by the sum of the weights of the instances incorrectly classified by \(h_f\), we have:
Eq. (18) uses the definition in (17), while Eq. (19) uses (15). However, since not all instances are known a priori in an online setting, the iteration over N is performed gradually. Therefore,
where the iteration over j represents the incremental inclusion of new instances. Finally, to obtain (8), the value of \(Z_t\) in (20) needs to be defined. Rewriting Eq. (12) more explicitly, in accordance with Line 14 of Algorithm 3, we obtain:
Given that \(\varepsilon _t = \xi w_{t}/cw_{t}\), the denominator of Eq. (22) can be expressed as:
where Eqs. (24) and (25) use the definitions of \(\gamma _t\) and \(\alpha _t\), respectively. Plugging the value of \(Z_t\) (from Eq. 25) into (20) yields the main equation (8), completing this demonstration. \(\square \)
Appendix B. Proof for Theorem 4
Proof
Considering that
and that
where
Taking into account that
we can say that, after passing through all T classifiers, the final distribution weight of instance i (\(w_{T+1,i}\)) will be defined by
Combining Eq. (27) with Eq. (31), we obtain
Suppose that, given an instance x, an arbitrary class y has not been chosen as the response of the committee (\(h_f(x) \ne y\)); then
Thus, \(h_f(x) \ne y\) implies that \(F(x,y) \le 0\). This can be justified as follows: according to Line 38 of OABM2 (Algorithm 4), the class chosen by the committee is the one with the highest plausibility among the classifiers, making \(F(x,y) \le F(x,h_f(x)) \le 0\). Therefore,
Taking into account that the training error (the left side of the inequality in Eq. 9) is defined by the sum of the distribution weights multiplied by the error probability of \(h_f\) for each instance (Freund and Schapire 1997), we have:
In addition, assuming that all \(k-1\) iterations that occur in OABM2 are equally important, we can redefine Eq. (35) as follows:
Then, replacing the definitions of Eqs. (32) and (34) in (36), we get:
Eq. (37) uses the definition in (34), while Eq. (38) uses (32). However, since not all instances are known a priori in an online setting, the iteration over N is performed gradually. So,
where the iteration over j represents the incremental inclusion of new instances. Finally, to obtain (9), the value of \(Z_t\) in (39) needs to be defined. Rewriting Eq. (29) more explicitly, in accordance with Line 35 of Algorithm 4, we obtain
with \(Z_t\) corresponding to the denominators of Eqs. (40) and (41). Since the hypotheses of OABM2 lie in the range \([-1,+1]\), \(Z_t\) can be rearranged in the same way as described in subsection 3.1 of Schapire and Singer (1999) and in subsection 9.2.3 of Schapire and Freund (2012e). For this purpose, assume that u and r are defined by:
Thus,
By the definitions of \(\alpha _t\) and \(\gamma _t\), and knowing that \(\varepsilon _t = (1-r)/2\) (Schapire and Singer 1999), it is possible to substitute them into Eq. (45) and obtain:
Plugging the value of \(Z_t\) (from Eq. 46) into (39), we get the main equation (9), completing this demonstration.\(\square \)
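For reference, the rearrangement invoked above (subsection 3.1 of Schapire and Singer 1999) can be sketched in generic notation; the paper's own u and r may be defined slightly differently. For hypothesis outputs with margins \(u_i \in [-1,+1]\) and a normalized distribution \(w_t\), convexity of the exponential gives

\[ Z_t = \sum _i w_{t,i}\, e^{-\alpha _t u_i} \;\le \; \sum _i w_{t,i}\left( \frac{1+u_i}{2}\, e^{-\alpha _t} + \frac{1-u_i}{2}\, e^{\alpha _t}\right) = \frac{1+r}{2}\, e^{-\alpha _t} + \frac{1-r}{2}\, e^{\alpha _t}, \]

where \(r = \sum _i w_{t,i} u_i\). Minimizing the right-hand side over \(\alpha _t\) yields \(\alpha _t = \frac{1}{2}\ln \frac{1+r}{1-r}\) and \(Z_t \le \sqrt{1-r^2}\), which with \(\varepsilon _t = (1-r)/2\) becomes \(Z_t \le 2\sqrt{\varepsilon _t(1-\varepsilon _t)}\).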
Cite this article
Santos, S.G.T.d.C., de Barros, R.S.M. Online AdaBoost-based methods for multiclass problems. Artif Intell Rev 53, 1293–1322 (2020). https://doi.org/10.1007/s10462-019-09696-6