Generating Fixed-Size Training Sets for Large and Streaming Datasets

  • Stefanos Ougiaroglou
  • Georgios Arampatzis
  • Dimitris A. Dervos
  • Georgios Evangelidis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10509)


The k Nearest Neighbor (k-NN) classifier is popular and versatile, but it requires a relatively small training set in order to perform adequately, a prerequisite that cannot be satisfied with the large volumes of training data now available from streaming environments. Conventional data reduction techniques that select or generate training prototypes are also inappropriate in such environments. Dynamic RHC (dRHC) is a prototype generation algorithm that can update its condensing set when new training data arrives. However, after repeated updates, the size of the condensing set may become unpredictably large. This paper proposes dRHC2, a new variation of dRHC, which remedies this drawback. dRHC2 keeps the condensing set at a size the classifier can manage by ranking the prototypes and removing the least important ones. dRHC2 is tested on several datasets, and the experimental results show that it is more efficient and noise tolerant than dRHC while remaining comparable to it in accuracy.
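The idea of capping the condensing set by ranking prototypes and dropping the lowest-ranked ones can be illustrated with a minimal Python sketch. It is not the dRHC2 algorithm itself: the abstract does not state the ranking criterion or the update procedure, so everything below is an assumption made for illustration. The names `Prototype`, `update`, `prune`, and `classify` are hypothetical, the importance score is taken to be the number of items a prototype has absorbed, and the update step is a simplified per-item merge rather than the cluster-based update of dRHC.

```python
import math
from dataclasses import dataclass


@dataclass
class Prototype:
    vector: list
    label: str
    weight: int = 1  # assumed importance score: number of items merged so far


def update(condensing_set, x, label):
    """Merge (x, label) into the nearest prototype when the labels agree,
    otherwise store it as a new prototype (a deliberate simplification)."""
    nearest = min(condensing_set, key=lambda p: math.dist(p.vector, x), default=None)
    if nearest is not None and nearest.label == label:
        w = nearest.weight
        # Move the prototype to the weighted mean of itself and the new item.
        nearest.vector = [(w * a + b) / (w + 1) for a, b in zip(nearest.vector, x)]
        nearest.weight = w + 1
    else:
        condensing_set.append(Prototype(list(x), label))


def prune(condensing_set, max_size):
    """Rank prototypes by weight and keep only the max_size heaviest ones."""
    ranked = sorted(condensing_set, key=lambda p: p.weight, reverse=True)
    return ranked[:max_size]


def classify(condensing_set, x):
    """1-NN prediction against the condensing set."""
    return min(condensing_set, key=lambda p: math.dist(p.vector, x)).label


# Toy stream: two well-separated classes plus one mislabeled (noisy) item.
stream = [([0.0, 0.0], "a"), ([0.2, 0.1], "a"), ([5.0, 5.0], "b"),
          ([5.1, 4.9], "b"), ([0.1, 0.2], "b"),  # noisy item near class "a"
          ([4.9, 5.2], "b")]

condensing_set = []
for x, label in stream:
    update(condensing_set, x, label)
    condensing_set = prune(condensing_set, max_size=2)  # size never exceeds 2
```

In this toy run the noisy item becomes a weight-1 prototype and is the first to be pruned, which hints at why a ranked, fixed-size condensing set can be more noise tolerant. A pure weight ranking also penalises newly created prototypes, so a practical ranking would probably need to account for recency as well; that trade-off is outside what the abstract specifies.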


k-NN classification, Data reduction, Prototype generation, Data streams, Clustering



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Stefanos Ougiaroglou (1, 2), corresponding author
  • Georgios Arampatzis (1)
  • Dimitris A. Dervos (1)
  • Georgios Evangelidis (2)
  1. Department of Information Technology, Alexander TEI of Thessaloniki, Sindos, Greece
  2. Department of Applied Informatics, School of Information Sciences, University of Macedonia, Thessaloniki, Greece
