Knowledge and Information Systems, Volume 30, Issue 1, pp 113–133

Cluster-based instance selection for machine classification

Open Access
Regular Paper

Abstract

Instance selection in supervised machine learning, often referred to as data reduction, aims at deciding which instances from the training set should be retained for further use during the learning process. Instance selection can improve the capabilities and generalization properties of the learning model, shorten the learning process, or help in scaling up to large data sources. The paper proposes a cluster-based instance selection approach, with the learning process executed by a team of agents, and discusses its four variants. The basic assumption is that instance selection is carried out after the training data have been grouped into clusters. To validate the proposed approach and to investigate the influence of the clustering method on the quality of the classification, a computational experiment has been carried out.
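The cluster-then-select scheme outlined above can be illustrated with a minimal sketch. It assumes the simplest selection rule — grouping the training data with k-means and retaining, from each cluster, the instance nearest to the cluster centroid as a prototype; the agent-based selection variants developed in the paper are not reproduced here, and the function name select_prototypes is illustrative.

    # Minimal sketch of cluster-based instance selection (assumed variant:
    # k-means clustering followed by centroid-nearest prototype selection).
    import numpy as np
    from sklearn.cluster import KMeans

    def select_prototypes(X, n_clusters=10, random_state=0):
        """Return the index of one representative instance per cluster."""
        km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
        selected = []
        for c, centroid in enumerate(km.cluster_centers_):
            # Indices of the training instances assigned to cluster c.
            members = np.where(km.labels_ == c)[0]
            # Keep the member closest to the centroid as the prototype.
            dists = np.linalg.norm(X[members] - centroid, axis=1)
            selected.append(members[np.argmin(dists)])
        return np.array(selected)

    # Usage: train any classifier on the reduced set.
    # idx = select_prototypes(X_train, n_clusters=50)
    # X_red, y_red = X_train[idx], y_train[idx]

The reduced set shrinks the training data from n instances to one per cluster, so the number of clusters directly controls the reduction rate; the paper's experiment varies the clustering method to study its effect on classification quality.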

Keywords

Machine learning · Data mining · Instance selection · Multi-agent system

Notes

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.


Copyright information

© The Author(s) 2011

Authors and Affiliations

Department of Information Systems, Gdynia Maritime University, Gdynia, Poland