Abstract
Instance selection is an important data pre-processing problem in the data mining field. Its aim is to reduce the data size by filtering out noisy data, which may degrade mining performance, from a given dataset. Genetic algorithms have proven to be an effective instance selection approach for improving the performance of data mining algorithms. However, current approaches pursue only the simplest evolutionary process based on the simplest reasonable rules. In this paper, we introduce a novel instance selection algorithm, namely a genetic-based biological algorithm (GBA). GBA fits a “biological evolution” into the evolutionary process, where the most streamlined process also complies with reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Consequently, we can closely simulate natural evolution in an algorithm, so that the algorithm is both efficient and effective. Our experiments compare GBA with five state-of-the-art algorithms over 50 datasets from different domains in the UCI Machine Learning Repository. The experimental results demonstrate that GBA outperforms these baselines, providing the lowest classification error rate and the smallest storage requirement. Moreover, GBA is computationally efficient, requiring only slightly more computation than GA.
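As background, GA-based instance selection typically encodes each candidate subset as a binary chromosome over the training instances and evolves the population toward subsets that preserve classification accuracy. The sketch below is a generic illustration of that scheme, not GBA itself; the fitness function (1-NN accuracy using only selected instances), operators, and parameter values are all assumptions for illustration.

```python
import random

def evaluate(chromosome, data, labels):
    """Fitness of an instance subset: 1-NN accuracy over the full set,
    using only the selected instances as references (illustrative only)."""
    selected = [i for i, bit in enumerate(chromosome) if bit]
    if not selected:
        return 0.0
    correct = 0
    for i, x in enumerate(data):
        # nearest selected instance, excluding the query itself when possible
        candidates = [j for j in selected if j != i] or selected
        nearest = min(candidates,
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(data[j], x)))
        correct += labels[nearest] == labels[i]
    return correct / len(data)

def ga_instance_selection(data, labels, pop_size=20, generations=30,
                          r_c=0.9, r_m=0.02, seed=0):
    """Plain GA over binary chromosomes: elitism, one-point crossover
    with rate r_c, bit-flip mutation with rate r_m."""
    rng = random.Random(seed)
    n = len(data)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: evaluate(c, data, labels), reverse=True)
        next_pop = scored[:2]  # keep the two best unchanged
        while len(next_pop) < pop_size:
            p1, p2 = rng.sample(scored[:pop_size // 2], 2)
            child = p1[:]
            if rng.random() < r_c:            # one-point crossover
                cut = rng.randrange(1, n)
                child = p1[:cut] + p2[cut:]
            child = [b ^ (rng.random() < r_m) for b in child]  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda c: evaluate(c, data, labels))
```

The returned chromosome marks which instances to retain; storage reduction is simply the fraction of zero bits.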
[Figures 1–6: see the original article; Fig. 6 shows the pseudocode of GBA.]
Notes
The experimental environment is as follows: CPU: Intel(R) Core(TM) i7-3770 @ 3.40 GHz, RAM: 32 GB, OS: Windows 7 64-bit, Code: Matlab R2012a.
Additional information
Communicated by V. Loia.
Appendix: The schema theorems corresponding to GA and GBA
The original model of GA is
\[
m(H,t+1) \;\ge\; m(H,t)\,\frac{f(H)}{\bar{f}}\left[1 - r_c\,\frac{\delta(H)}{l-1} - o(H)\,r_m\right],
\]
where \(H\) represents the schema, \(t\) is the generation, \(m(H,t)\) is the number of strings belonging to schema \(H\) at generation \(t\), \(f(H)\) is the observed fitness, \(\bar{f}\) is the average fitness of the population, \(r_c\) is the crossover rate, \(\delta(H)\) is the defining length, \(l\) is the length of the code, \(r_m\) is the mutation rate, and \(o(H)\) is the order of the schema.
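Holland's schema theorem bounds the expected count of a schema in the next generation by \(m(H,t)\,(f(H)/\bar{f})\,[1 - r_c\,\delta(H)/(l-1) - o(H)\,r_m]\). As a quick numeric illustration of this bound, the parameter values below are purely hypothetical:

```python
def schema_bound(m, f_H, f_avg, r_c, delta_H, l, o_H, r_m):
    """Lower bound on the expected count of schema H in the next generation
    (Holland's schema theorem; crossover and mutation treated as disruptive)."""
    survival = 1.0 - r_c * delta_H / (l - 1) - o_H * r_m
    return m * (f_H / f_avg) * survival

# hypothetical schema: 5 copies, above-average fitness, short defining length
bound = schema_bound(m=5, f_H=1.2, f_avg=1.0, r_c=0.9,
                     delta_H=2, l=20, o_H=3, r_m=0.01)  # ≈ 5.25
```

A short, low-order, above-average schema is thus expected to gain copies, which is the usual reading of the theorem.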
The modified model of GBA is
where \(H\) represents the schema, \(t\) is the generation, \(m(H,t)\) is the number of strings belonging to schema \(H\) at generation \(t\), \(Nf(H)\) is the nonlinear fitness function, \(r_c\) is the crossover rate, \(\delta(H)\) is the defining length, \(l\) is the length of the code, \(r_m\) is the mutation rate, \(o(H)\) is the order of the schema, \(r_{mg}\) is the great migration rate, \(MGT(m(H,t),t)\) is the trigger of the great migration, and \(GK(H)\) is the genetic king protection mechanism, which retains a good schema \(H\).
The definition of the nonlinear fitness function \(Nf(H)\) is
It increases or reduces the fitness strength depending on the threshold.
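The paper's exact formula for \(Nf(H)\) is not reproduced here; the sketch below shows one plausible threshold-based form that amplifies fitness above the threshold and suppresses it below. The multiplicative shape and the scaling factor `alpha` are assumptions, not the authors' definition.

```python
def nonlinear_fitness(f_H, threshold=1.0, alpha=1.5):
    """Illustrative nonlinear fitness Nf(H): strengthen schemata whose
    observed fitness reaches the threshold, weaken the rest.
    (Form and alpha are assumptions, not the paper's exact definition.)"""
    if f_H >= threshold:
        return f_H * alpha   # amplify promising schemata
    return f_H / alpha       # suppress weak schemata
```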
In addition, \(GK(H)\) represents the genetic king protection mechanism, and the definition of \(GK\) is
If \(Nf(H)\ge Threshold\) then \(GK(H)=1\) else \(GK(H)=0\).
If the schema \(H\) is good enough, it is retained by the genetic king protection mechanism.
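Since \(GK(H)\) is simply a 0/1 indicator on the nonlinear fitness, it can be sketched directly; the threshold value here is an assumed placeholder.

```python
def gk(nf_H, threshold=1.0):
    """Genetic-king protection indicator: a schema whose nonlinear fitness
    Nf(H) reaches the threshold is protected (1), otherwise not (0)."""
    return 1 if nf_H >= threshold else 0
```

In a full run, protected individuals would pass to the next generation unchanged, much like elitism.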
The definition of great migration \(MGT(m(H,t),t)\) is
If the best fitness value is stable then \(MGT(m(H,t),t)=1\), else \(MGT(m(H,t),t)=0\).
If the best fitness value is stable, then apply the great migration, a strong mutation (cf. Fig. 6 for the pseudocode of GBA).
Cite this article
Chen, ZY., Tsai, CF., Eberle, W. et al. Instance selection by genetic-based biological algorithm. Soft Comput 19, 1269–1282 (2015). https://doi.org/10.1007/s00500-014-1339-0