Abstract
Boosting methods are known to exhibit noticeable overfitting on some datasets, while being immune to overfitting on others. In this paper we show that standard boosting algorithms are not well suited to tasks with overlapping classes. This inadequacy is likely to be the major source of boosting overfitting on real-world data. To verify our conclusion we use the fact that any task with overlapping classes can be reduced to a deterministic task with the same Bayesian separating surface. This reduction is achieved by removing "confusing samples" – samples that are misclassified by a "perfect" Bayesian classifier. We propose an algorithm for removing confusing samples and experimentally study the behavior of AdaBoost trained on the resulting data sets. Experiments confirm that removing confusing samples helps boosting to reduce the generalization error and to avoid overfitting on both synthetic and real-world data. The process of removing confusing samples also provides an accurate error prediction based on work with the training set alone.
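A minimal sketch of the idea described above, assuming Python with scikit-learn: since the "perfect" Bayesian classifier is unknown in practice, the sketch substitutes out-of-fold class-posterior estimates from a generic classifier (a random forest, chosen here purely as an illustrative stand-in, not as the authors' procedure) to flag confusing samples, removes them, and then trains standard AdaBoost on the cleaned set. The function name remove_confusing_samples and all parameter choices are hypothetical.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def remove_confusing_samples(X, y, posterior_estimator=None, cv=5):
    # Approximate the unknown Bayes decision rule with out-of-fold posterior
    # estimates; any reasonably calibrated classifier could be used here.
    if posterior_estimator is None:
        posterior_estimator = RandomForestClassifier(n_estimators=200, random_state=0)
    proba = cross_val_predict(posterior_estimator, X, y, cv=cv, method="predict_proba")
    classes = np.unique(y)                       # column order of predict_proba
    bayes_label = classes[np.argmax(proba, axis=1)]
    keep = bayes_label == y                      # keep samples the estimated Bayes rule classifies correctly
    return X[keep], y[keep]

# Usage on synthetic data with overlapping classes (flip_y adds label noise).
X, y = make_classification(n_samples=1000, n_informative=5, flip_y=0.2, random_state=0)
X_clean, y_clean = remove_confusing_samples(X, y)
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
booster.fit(X_clean, y_clean)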
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vezhnevets, A., Barinova, O. (2007). Avoiding Boosting Overfitting by Removing Confusing Samples. In: Kok, J.N., Koronacki, J., Mantaras, R.L.d., Matwin, S., Mladenič, D., Skowron, A. (eds) Machine Learning: ECML 2007. ECML 2007. Lecture Notes in Computer Science, vol 4701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74958-5_40
DOI: https://doi.org/10.1007/978-3-540-74958-5_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74957-8
Online ISBN: 978-3-540-74958-5