In Defense of Online Kmeans for Prototype Generation and Instance Reduction

  • Mauricio García-Limón
  • Hugo Jair Escalante
  • Alicia Morales-Reyes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10022)


The nearest neighbor rule is one of the most popular algorithms for data mining tasks, due in part to its simplicity and its theoretical and empirical properties. However, with the availability of large volumes of data, this algorithm suffers from two problems: the computational cost of classifying a new example, and the need to store the whole training set. To alleviate these problems, instance reduction algorithms are often used to obtain a condensed training set that reduces the computational burden and, in some cases, improves classification performance. Many instance reduction algorithms have been proposed so far, obtaining outstanding performance on mid-size data sets. However, applying the most competitive instance reduction algorithms becomes prohibitive when dealing with massive data volumes. For this reason, the development of large-scale instance reduction algorithms has become crucial in recent years. This paper elaborates on the use of a classic clustering algorithm, K-means, for tackling the instance reduction problem in big data. We show that this traditional algorithm outperforms most state-of-the-art instance reduction methods on mid-size data sets. In addition, the algorithm copes well with massive data sets and still obtains quite competitive performance. Therefore, the main contribution of this work is to show the validity of this often undervalued algorithm for a relevant task in a relevant scenario.
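The abstract does not spell out the procedure, so the following is only a minimal sketch of how online K-means could be used for prototype generation, under stated assumptions: centroids are learned separately for each class with the standard sequential update (step size 1/count), in a single pass over the data, and the resulting centroids replace the training set in a 1-nearest-neighbor classifier. The function names, the `k_per_class` parameter, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def online_kmeans_prototypes(X, y, k_per_class=3, seed=0):
    """Sketch: build k_per_class prototypes per class with sequential k-means.

    Assumption (not from the paper): one pass per class, 1/count step size.
    """
    rng = np.random.default_rng(seed)
    protos, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(k_per_class, len(Xc))
        # Initialise the centroids with k distinct random instances of the class.
        centroids = Xc[rng.choice(len(Xc), size=k, replace=False)].astype(float)
        counts = np.ones(k)
        # Visit each instance of the class once, in random order.
        for x in Xc[rng.permutation(len(Xc))]:
            j = np.argmin(((centroids - x) ** 2).sum(axis=1))
            counts[j] += 1
            # Move only the winning centroid toward x.
            centroids[j] += (x - centroids[j]) / counts[j]
        protos.append(centroids)
        labels.append(np.full(k, c))
    return np.vstack(protos), np.concatenate(labels)


def predict_1nn(P, p_labels, X):
    """Label each row of X with the class of its nearest prototype."""
    d = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    return p_labels[d.argmin(axis=1)]


# Hypothetical toy data: two well-separated Gaussian classes in 2-D.
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0.0, 1.0, (500, 2)),
               data_rng.normal(6.0, 1.0, (500, 2))])
y = np.repeat([0, 1], 500)

P, p_labels = online_kmeans_prototypes(X, y, k_per_class=3)
accuracy = (predict_1nn(P, p_labels, X) == y).mean()
print(f"kept {len(P)} prototypes out of {len(X)} instances, accuracy {accuracy:.3f}")
```

In this sketch each instance is seen once and only the prototypes are kept in memory, which is the property that makes an online variant attractive for large data sets. Classification cost then depends on the number of prototypes rather than on the size of the original training set.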





Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Mauricio García-Limón (1)
  • Hugo Jair Escalante (1), email author
  • Alicia Morales-Reyes (1)
  1. Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
