Ensembles of Decision Trees for Imbalanced Data

  • Conference paper
Multiple Classifier Systems (MCS 2011)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 6713)

Included in the following conference series: International Workshop on Multiple Classifier Systems (MCS)

Abstract

Ensembles of decision trees are considered for imbalanced datasets. Conventional decision trees (C4.5) and trees designed for imbalanced data (CCPDT: Class Confidence Proportion Decision Tree) are used as base classifiers. Ensemble methods based on undersampling and oversampling, designed specifically for imbalanced data, are considered, as are conventional ensemble methods that are not specific to imbalanced data: Bagging, Random Subspaces, AdaBoost, Real AdaBoost, MultiBoost and Rotation Forest. The results show that the choice of ensemble method is much more important than the type of decision tree used as the base classifier. Rotation Forest is the ensemble method with the best results. Among the decision tree methods, CCPDT shows no advantage.
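
As a rough illustration of the kind of comparison the paper performs, the sketch below pits conventional ensembles (Bagging, AdaBoost) against undersampling-based ensembles on a synthetic imbalanced problem, scoring by AUC. This is a minimal sketch, not the authors' experimental setup: it assumes scikit-learn together with the third-party imbalanced-learn package, whose BalancedBaggingClassifier and RUSBoostClassifier stand in for the undersampling-based ensembles studied; Rotation Forest and CCPDT have no implementation in these libraries.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.ensemble import BalancedBaggingClassifier, RUSBoostClassifier

    # Synthetic two-class problem with a roughly 5% minority class.
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95, 0.05], random_state=1)

    ensembles = {
        "Bagging": BaggingClassifier(
            DecisionTreeClassifier(), n_estimators=50, random_state=1),
        "AdaBoost": AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=1), n_estimators=50,
            random_state=1),
        # Each tree is trained on a sample rebalanced by random undersampling.
        "BalancedBagging": BalancedBaggingClassifier(
            DecisionTreeClassifier(), n_estimators=50, random_state=1),
        # Boosting with random undersampling at each round (cf. RUSBoost [6]).
        "RUSBoost": RUSBoostClassifier(
            DecisionTreeClassifier(max_depth=1), n_estimators=50,
            random_state=1),
    }

    # AUC is the usual performance measure for imbalanced problems [26].
    for name, clf in ensembles.items():
        scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=5)
        print(f"{name:16s} AUC = {scores.mean():.3f} +/- {scores.std():.3f}")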

This work was supported by the Project TIN2008-03151 of the Spanish Ministry of Education and Science.


References

  1. He, H., Garcia, E.A.: Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21, 1263–1284 (2009)

  2. Cieslak, D., Chawla, N.: Learning decision trees for unbalanced data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008)

  3. Liu, W., Chawla, S., Cieslak, D.A., Chawla, N.V.: A Robust Decision Tree Algorithm for Imbalanced Data Sets. In: 10th SIAM International Conference on Data Mining, SDM 2010, pp. 766–777. SIAM, Philadelphia (2010)

  4. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)

  5. Opitz, D., Maclin, R.: Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research 11, 169–198 (1999)

  6. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 40, 185–197 (2010)

  7. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: SMOTEBoost: Improving prediction of the minority class in boosting. In: Knowledge Discovery in Databases: PKDD 2003. LNCS, vol. 2838, pp. 107–119. Springer, Heidelberg (2003)

  8. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 539–550 (2009)

  9. Hoens, T., Chawla, N.: Generating Diverse Ensembles to Counter the Problem of Class Imbalance. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 488–499. Springer, Heidelberg (2010)

  10. Flach, P.: The geometry of ROC space: understanding machine learning metrics through ROC isometrics. In: Proc. 20th International Conference on Machine Learning (ICML 2003), pp. 194–201. AAAI Press, Menlo Park (2003)

  11. Chawla, N., Cieslak, D., Hall, L., Joshi, A.: Automatically countering imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery 17, 225–252 (2008)

  12. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 832–844 (1998)

  13. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139 (1997)

  14. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Annals of Statistics 28, 337–407 (2000)

  15. Webb, G.I.: Multiboosting: A technique for combining boosting and wagging. Machine Learning 40, 159–196 (2000)

  16. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

  17. Rodríguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1619–1630 (2006)

  18. Kuncheva, L.I., Rodríguez, J.J.: An experimental study on rotation forest ensembles. In: Haindl, M., Kittler, J., Roli, F. (eds.) MCS 2007. LNCS, vol. 4472, pp. 459–468. Springer, Heidelberg (2007)

  19. Frank, A., Asuncion, A.: UCI machine learning repository (2010), http://archive.ics.uci.edu/ml

  20. Olszewski, R.T.: Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data. PhD thesis, Computer Science Department, Carnegie Mellon University (2001)

  21. Kuncheva, L.I., Hadjitodorov, S.T., Todorova, L.P.: Experimental comparison of cluster ensemble methods. In: FUSION 2006, Florence, Italy (2006)

  22. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University (2003)

  23. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explorations 11 (2009)

  24. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)

  25. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)

  26. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874 (2006)

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rodríguez, J.J., Díez-Pastor, J.F., García-Osorio, C. (2011). Ensembles of Decision Trees for Imbalanced Data. In: Sansone, C., Kittler, J., Roli, F. (eds) Multiple Classifier Systems. MCS 2011. Lecture Notes in Computer Science, vol 6713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21557-5_10

  • DOI: https://doi.org/10.1007/978-3-642-21557-5_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21556-8

  • Online ISBN: 978-3-642-21557-5
