Abstract
Ensemble methods can improve effectiveness in text categorization. Because ensemble approaches are computationally costly, there is a need to prune ensembles. In this work, we study ensemble pruning based on data partitioning. We use a rank-based pruning approach: base classifiers are ranked and pruned according to their accuracies on a separate validation set. We employ four data partitioning methods with four machine learning categorization algorithms. Our main aim is to examine ensemble pruning in text categorization. We conduct experiments on two text collections: Reuters-21578 and BilCat-TRT. We show that we can prune 90% of ensemble members with almost no decrease in accuracy. We also demonstrate that ensemble pruning can increase accuracy over traditional ensembling.
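The rank-based pruning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`accuracy`, `prune_by_rank`, `majority_vote`, `keyword_classifier`), the toy keyword classifiers, the toy validation documents, and the use of majority voting to combine the surviving members are all assumptions made for illustration. The abstract specifies only that base classifiers are ranked by accuracy on a separate validation set and that the lower-ranked ones are pruned.

```python
from collections import Counter


def accuracy(classifier, docs, labels):
    """Fraction of validation documents that the classifier labels correctly."""
    correct = sum(1 for doc, label in zip(docs, labels) if classifier(doc) == label)
    return correct / len(labels)


def prune_by_rank(classifiers, val_docs, val_labels, prune_ratio=0.9):
    """Rank classifiers by validation accuracy and keep the top (1 - prune_ratio) share.

    At least one classifier is always kept.
    """
    ranked = sorted(
        classifiers,
        key=lambda c: accuracy(c, val_docs, val_labels),
        reverse=True,
    )
    n_keep = max(1, round(len(ranked) * (1 - prune_ratio)))
    return ranked[:n_keep]


def majority_vote(classifiers, doc):
    """Combine member predictions by simple majority voting (an assumed combiner)."""
    votes = Counter(c(doc) for c in classifiers)
    return votes.most_common(1)[0][0]


def keyword_classifier(keyword):
    """Hypothetical toy base classifier: predicts 'sports' if the keyword occurs."""
    return lambda doc: "sports" if keyword in doc else "economy"


# Toy validation set (hypothetical data, not from Reuters-21578 or BilCat-TRT).
val_docs = [
    "goal scored in the match",
    "bank raises interest rates",
    "team wins the match",
    "stock market falls",
]
val_labels = ["sports", "economy", "sports", "economy"]

# In the paper's setting each member would be trained on one data partition;
# here the members are fixed toy rules so the sketch stays self-contained.
ensemble = [keyword_classifier(k) for k in ["match", "goal", "wins", "bank", "market"]]

pruned = prune_by_rank(ensemble, val_docs, val_labels, prune_ratio=0.8)
prediction = majority_vote(pruned, "a late goal decided the match")
```

With `prune_ratio=0.8` and five members, only the single best-ranked classifier survives. Ranking on a validation set that is separate from the training data, as the abstract describes, keeps the ranking from simply rewarding members that overfit their own training partitions.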
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Toraman, C., Can, F. (2011). Ensemble Pruning for Text Categorization Based on Data Partitioning. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_32
DOI: https://doi.org/10.1007/978-3-642-25631-8_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25630-1
Online ISBN: 978-3-642-25631-8
eBook Packages: Computer Science, Computer Science (R0)