Instance Selection for Class Imbalanced Problems by Means of Selecting Instances More than Once

  • Javier Pérez-Rodríguez
  • Aida de Haro-García
  • Nicolás García-Pedrajas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7023)


Although many more complex learning algorithms exist, k-nearest neighbor (k-NN) is still one of the most successful classifiers in real-world applications. One of the ways of scaling up the k-nearest neighbors classifier to deal with huge datasets is instance selection. Due to the constantly growing amount of data in almost any pattern recognition task, we need more efficient instance selection algorithms, which must achieve larger reductions while maintaining the accuracy of the selected subset.

However, most instance selection methods do not work well on class imbalanced problems: they tend to remove too many instances from the minority class. In this paper we present a way to improve instance selection for class imbalanced problems by allowing the algorithms to select an instance more than once. In this way, the fewer instances of the minority class can cover more portions of the space, and the same testing error as the standard approach can be obtained faster and with fewer instances. No other constraint is imposed on the instance selection method.
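The effect of selecting an instance more than once can be sketched for a k-NN classifier: an instance with multiplicity m fills up to m of the k neighbor slots, so a duplicated minority instance carries extra voting weight. The sketch below is illustrative only, under our own assumptions about the representation (a list of (features, label, count) triples); it is not the authors' implementation.

```python
from collections import Counter
import math

def knn_predict(selected, query, k=3):
    """k-NN over a multiset of selected instances.

    `selected` is a list of (features, label, count) triples; an
    instance selected twice has count=2 and fills two of the k
    neighbor slots, giving minority instances more influence.
    (Illustrative sketch, not the paper's actual algorithm.)
    """
    by_dist = sorted(selected, key=lambda t: math.dist(t[0], query))
    votes, slots = Counter(), k
    for feats, label, count in by_dist:
        take = min(count, slots)   # a duplicated instance casts several votes
        votes[label] += take
        slots -= take
        if slots == 0:
            break
    return votes.most_common(1)[0][0]

# Toy subset: one minority instance (class 1) selected twice.
subset = [((0.0, 0.0), 1, 2),   # minority, multiplicity 2
          ((1.0, 0.0), 0, 1),
          ((0.0, 1.0), 0, 1)]
print(knn_predict(subset, (0.1, 0.1), k=3))  # 1: minority wins 2 votes to 1
```

With all counts set to 1 the same query is outvoted 2-1 by the majority class, which is the behavior the multiplicity mechanism is meant to counteract.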

An extensive comparison using 40 datasets from the UCI Machine Learning Repository shows the usefulness of our approach compared with the established method of evolutionary instance selection. In the worst case, our method matches the error of standard instance selection while achieving a larger reduction and a shorter execution time.


Keywords: Minority Class · Instance Selection · Multiple Instance Learning · Imbalance Ratio · Class Imbalanced Problem





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Javier Pérez-Rodríguez (1)
  • Aida de Haro-García (1)
  • Nicolás García-Pedrajas (1)
  1. Department of Computing and Numerical Analysis, University of Córdoba, Spain
