SCIS: Combining Instance Selection Methods to Increase Their Effectiveness over a Wide Range of Domains

  • Yoel Caises
  • Antonio González
  • Enrique Leyva
  • Raúl Pérez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5788)


Instance selection is a feasible strategy for handling large databases in inductive learning. Several methods have been proposed in this area, but none of them consistently outperforms the others over a wide range of domains. In this paper we present a set of measures to characterize databases, together with a new algorithm that uses these measures and, depending on the data characteristics, applies the method or combination of methods expected to produce the best results. The approach was evaluated on 20 databases with six different learning paradigms, and the results were compared with those achieved by five well-known state-of-the-art methods.
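The general idea of characterizing a database first and then choosing a selection method can be illustrated with a minimal sketch. It is not the SCIS algorithm: the single measure used here (leave-one-out 3-NN error as a rough noise estimate), the `noise_threshold` value, and the choice between a Wilson-style editing step and a Hart-style condensation step are all assumptions made for illustration, as are the function names.

```python
import math
from collections import Counter

def knn_label(data, i, k=3):
    """Majority label among the k nearest neighbours of data[i], excluding itself."""
    xi = data[i][0]
    others = [inst for j, inst in enumerate(data) if j != i]
    others.sort(key=lambda inst: math.dist(xi, inst[0]))
    votes = Counter(label for _, label in others[:k])
    return votes.most_common(1)[0][0]

def noise_ratio(data, k=3):
    """Hypothetical measure: fraction of instances misclassified by their k neighbours."""
    wrong = sum(1 for i, (_, y) in enumerate(data) if knn_label(data, i, k) != y)
    return wrong / len(data)

def edit(data, k=3):
    """Wilson-style editing: drop instances that disagree with their k neighbours."""
    return [inst for i, inst in enumerate(data) if knn_label(data, i, k) == inst[1]]

def condense(data):
    """Hart-style condensation: keep only instances needed for 1-NN consistency."""
    if not data:
        return []
    store = [data[0]]
    changed = True
    while changed:
        changed = False
        for x, y in data:
            if (x, y) in store:
                continue
            nearest = min(store, key=lambda s: math.dist(s[0], x))
            if nearest[1] != y:  # misclassified by the current store: keep it
                store.append((x, y))
                changed = True
    return store

def select_instances(data, noise_threshold=0.1):
    """Assumed rule: edit before condensing only when the data looks noisy."""
    if noise_ratio(data) > noise_threshold:
        return condense(edit(data))
    return condense(data)

# Two well-separated classes plus one mislabelled point inside class "a".
data = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"), ((1.0, 1.0), "a"),
        ((10.0, 10.0), "b"), ((10.0, 11.0), "b"), ((11.0, 10.0), "b"),
        ((11.0, 11.0), "b"), ((0.5, 0.5), "b")]
selected = select_instances(data)
print(selected)
```

On this toy data the noisy point pushes the measure above the threshold, so editing runs first and condensation then keeps one representative per class. Condensing the raw data instead would retain the mislabelled point and several of its neighbours, which is the kind of difference a data-dependent choice of method is meant to exploit.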


Instance selection, data reduction, machine learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Yoel Caises (1)
  • Antonio González (2)
  • Enrique Leyva (1)
  • Raúl Pérez (2)

  1. Facultad de Informática y Matemática, Universidad de Holguín, Cuba
  2. Dpto. de Ciencias de la Computación e IA, ETSIIT, Universidad de Granada, Spain
