Abstract
We present a technique for learning via aggregation in supervised classification. The method improves classification performance regardless of which classifier is at its core. It exploits information hidden in subspaces by aggregating combinations of variables and is applicable to high-dimensional data sets. We provide algorithms that randomly divide the variables into smaller subsets and permute them before applying a classification method to each subset; the resulting predictions are then combined to determine class membership. Theoretical and simulation analyses consistently demonstrate the high accuracy of our classification methods, which prove significantly more effective than aggregating observations through sampling. Through extensive simulations, we evaluate the accuracy of various classification methods, and we apply our techniques to five real-world data sets to further illustrate their effectiveness.
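The procedure outlined in the abstract — randomly permute the variable indices, partition them into smaller subsets, run a base classifier on each subset, and combine the subset predictions by majority vote — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-centroid rule (`centroid_classify`) stands in for an arbitrary base classifier, and all function names and parameters here are illustrative.

```python
import random
from collections import Counter

def centroid_classify(train_X, train_y, test_x, cols):
    # Stand-in base classifier: nearest class centroid, restricted to
    # the variable subset given by `cols`.
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]

    def sq_dist(label):
        return sum((test_x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))

    return min(centroids, key=sq_dist)

def subspace_aggregate(train_X, train_y, test_x, n_subsets=3, seed=0):
    # 1. Randomly permute the variable indices.
    rng = random.Random(seed)
    idx = list(range(len(train_X[0])))
    rng.shuffle(idx)
    # 2. Partition the permuted indices into `n_subsets` smaller subsets.
    subsets = [idx[i::n_subsets] for i in range(n_subsets)]
    # 3. Classify on each subspace, then 4. combine by majority vote.
    votes = [centroid_classify(train_X, train_y, test_x, cols) for cols in subsets]
    return Counter(votes).most_common(1)[0][0]
```

An odd `n_subsets` avoids ties in the two-class majority vote; in practice the base classifier would be one of the methods compared in the paper (e.g. a tree-based learner) rather than a centroid rule.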
Notes
RF: used with the default parameters; see https://cran.r-project.org/web/packages/randomForest/index.html.
Boosting: used with the default parameters; see https://cran.r-project.org/web/packages/adabag/index.html.
XGBoost: used with max_depth = 4, eta = 0.5, nthread = 3, nrounds = 30, subsample = 0; see https://cran.r-project.org/web/packages/xgboost/index.html.
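For reference, the XGBoost configuration listed above can be written out as a parameter mapping. This is a sketch in Python rather than the R code used in the paper; the parameter names follow the standard xgboost interface, and the values are copied verbatim from the note.

```python
# XGBoost settings as reported in the note above.
xgb_params = {
    "max_depth": 4,   # maximum depth of each tree
    "eta": 0.5,       # learning rate (shrinkage)
    "nthread": 3,     # number of threads used during training
    "subsample": 0,   # row-subsampling setting as reported
}
nrounds = 30          # number of boosting rounds
```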
References
Alfaro E, Gámez M, García N (2013) adabag: an R package for classification with boosting and bagging. J Stat Softw 54(2):1–35
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Chicago
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794
Croux C, Joossens K, Lemmens A (2007) Trimmed bagging. Comput Stat Data Anal 52(1):362–368
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann Stat 28(2):337–407
Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, Altman RB (2001) Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci 98(24):13784–13789
Gorman RP, Sejnowski TJ (1988) Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw 1(1):75–89
Gul A, Perperoglou A, Khan Z, Mahmoud O, Miftahuddin M, Adler W, Lausen B (2016) Ensemble of a subset of kNN classifiers. Adv Data Anal Classif 1–14
Hall P, Marron JS, Neeman A (2005) Geometric representation of high dimension, low sample size data. J R Stat Soc Ser B (Statistical Methodology) 67(3):427–444
Hastie T, Tibshirani R, Friedman J (2021) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, Berlin
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
Hothorn T, Lausen B (2003) Double-bagging: combining classifiers by bootstrap aggregation. Pattern Recogn 36(6):1303–1309
Johnson B (2013) High resolution urban land cover classification using a competitive multi-scale object-based approach. Remote Sens Lett 4(2):131–140
Lee S, Cho S (2001) Smoothed bagging with kernel bandwidth selectors. Neural Process Lett 14(2):157–168
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml
Loh WY (2011) Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov 1(1):14–23
Soleymani M, Lee SMS (2014) Sequential combination of weighted and nonparametric bagging for classification. Biometrika 101(2):491–498
Ting KM, Wells JR, Tan SC, Teng SW, Webb GI (2011) Feature-subspace aggregating: ensembles for stable and unstable learners. Mach Learn 82:375–397
Venables WN, Ripley BD (2013) Modern applied statistics with S-PLUS. Springer Science & Business Media, Berlin
Zhu J, Zou H, Rosset S, Hastie T (2009) Multi-class adaboost. Stat Interface 2(3):349–360
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Cite this article
Amiri, S., Modarres, R. A subspace aggregating algorithm for accurate classification. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01476-3