
Instance selection by genetic-based biological algorithm

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

Instance selection is an important research problem of data pre-processing in the data mining field. The aim of instance selection is to reduce the data size by filtering out noisy data, which may degrade the mining performance, from a given dataset. Genetic algorithms have proven to be an effective instance selection approach for improving the performance of data mining algorithms. However, current approaches only pursue the simplest evolutionary process based on the most reasonable and simplest rules. In this paper, we introduce a novel instance selection algorithm, namely a genetic-based biological algorithm (GBA). GBA fits a “biological evolution” into the evolutionary process, where the most streamlined process also complies with the reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Consequently, the algorithm closely simulates natural evolution, such that it is both efficient and effective. Our experiments compare GBA with five state-of-the-art algorithms over 50 different domain datasets from the UCI Machine Learning Repository. The experimental results demonstrate that GBA outperforms these baselines, providing the lowest classification error rate and the smallest storage requirement. Moreover, GBA is computationally efficient, requiring only slightly more computation than GA.



Notes

  1. http://archive.ics.uci.edu/ml/.

  2. The experimental environment is as follows: CPU: Intel(R) Core(TM) i7-3770 @ 3.40 GHz, RAM: 32 GB, OS: Windows 7 64-bit, Code: Matlab R2012a.


Author information


Corresponding author

Correspondence to Chih-Fong Tsai.

Additional information

Communicated by V. Loia.

Appendix: The schema theorems corresponding to GA and GBA

The original model of GA is

$$\begin{aligned} m(H,t+1) = m(H,t)\times \frac{f(H)}{\overline{f}}\times \left[ 1-r_c \frac{\delta (H)}{l-1}-o(H)r_m \right] , \end{aligned}$$

where \(H\) represents the schema, \(t\) is the generation, \(m(H,t)\) is the number of strings belonging to schema \(H\) at generation \(t\), \(f(H)\) is the observed fitness of \(H\), \(\overline{f}\) is the average fitness of the population, \(r_c\) is the crossover rate, \(\delta (H)\) is the defining length, \(l\) is the length of the code, \(r_m\) is the mutation rate, and \(o(H)\) is the order of the schema.
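The GA model above can be transcribed directly. The following minimal sketch (in Python rather than the Matlab used for the paper's experiments; all argument values are illustrative, not taken from the paper) computes the expected schema count at the next generation:

```python
def expected_schema_count(m_Ht, f_H, f_avg, r_c, delta_H, l, o_H, r_m):
    """Expected number of strings matching schema H at generation t+1,
    following the GA schema model above (symbols as in the text)."""
    # Survival factor after crossover and mutation disruption
    disruption = 1 - r_c * delta_H / (l - 1) - o_H * r_m
    # Reproduction proportional to relative fitness f(H) / f-bar
    return m_Ht * (f_H / f_avg) * disruption
```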

The modified model of GBA is

$$\begin{aligned}&m(H,t+1)=m(H,t)\times \frac{Nf(H)}{\overline{Nf}}\\&\quad \times \left[ 1-r_c \frac{\delta (H)}{l-1}-o(H)r_m -o(H)r_{mg} \times MGT(m(H,t),t) \right] \\&\quad +\,GK(H), \end{aligned}$$

where \(H\) represents the schema, \(t\) is the generation, \(m(H,t)\) is the number of strings belonging to schema \(H\) at generation \(t\), \(Nf(H)\) is the nonlinear fitness function, \(r_c\) is the crossover rate, \(\delta (H)\) is the defining length, \(l\) is the length of the code, \(r_m\) is the mutation rate, \(o(H)\) is the order of the schema, \(r_{mg}\) is the great migration rate, \(MGT(m(H,t),t)\) is the trigger of the great migration, and \(GK(H)\) is the genetic king protection mechanism, which retains a good schema \(H\).
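GBA's modification adds the great-migration loss term and the \(GK(H)\) bonus to the same expression. A sketch under the same illustrative assumptions (Python, values not from the paper):

```python
def expected_schema_count_gba(m_Ht, nf_H, nf_avg, r_c, delta_H, l,
                              o_H, r_m, r_mg, mgt, gk_H):
    """Expected schema count under the modified GBA model: the GA
    disruption terms plus the great-migration loss (active when
    mgt == 1) and the genetic-king protection bonus gk_H."""
    disruption = (1 - r_c * delta_H / (l - 1)
                  - o_H * r_m
                  - o_H * r_mg * mgt)
    return m_Ht * (nf_H / nf_avg) * disruption + gk_H
```

With `mgt = 0` and `gk_H = 0` the expression reduces exactly to the GA model.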

The nonlinear fitness function \(Nf(H)\) is defined as

$$\begin{aligned} Nf(H)=\frac{\tanh \left( \frac{f(H)-0.5}{\sigma ^2}\right) +1}{2}. \end{aligned}$$

It increases or reduces the fitness strength depending on whether \(f(H)\) lies above or below the threshold.
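The squashing can be written directly with the standard hyperbolic tangent. A minimal sketch (Python; the value of \(\sigma ^2\) is an illustrative choice, not taken from the paper):

```python
import math

def nonlinear_fitness(f_H, sigma2=0.05):
    """Nf(H): tanh squashing of the raw fitness f(H) around the
    midpoint 0.5, rescaled into (0, 1). sigma2 controls sharpness."""
    return (math.tanh((f_H - 0.5) / sigma2) + 1) / 2
```

Fitness values above 0.5 are pushed toward 1 and values below 0.5 toward 0, which is the amplify/suppress behaviour described in the text.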

In addition, \(GK(H)\) represents the genetic king protection mechanism, defined as follows:

If \(Nf(H)\ge Threshold\), then \(GK(H)=1\); otherwise \(GK(H)=0\).

If the schema \(H\) is good enough, it is retained by the genetic king protection mechanism.
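The protection rule is a one-line threshold test; a minimal sketch (Python, threshold value illustrative):

```python
def genetic_king(nf_H, threshold=0.9):
    """GK(H): returns 1 (schema H is protected and retained) when its
    nonlinear fitness reaches the threshold, otherwise 0."""
    return 1 if nf_H >= threshold else 0
```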

The great migration trigger \(MGT(m(H,t),t)\) is defined as follows:

If the best fitness value is stable, then \(MGT(m(H,t),t)=1\); otherwise \(MGT(m(H,t),t)=0\).

If the best fitness value is stable, the great migration (a strong mutation) is applied (cf. Fig. 6 for the pseudocode of GBA).
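The text states only that the trigger fires when the best fitness is "stable"; one common way to operationalise this is a stagnation window, sketched below (the `window` and `tol` parameters are illustrative assumptions, not from the paper):

```python
def migration_trigger(best_fitness_history, window=5, tol=1e-6):
    """MGT: returns 1 when the best fitness has not changed by more
    than tol over the last `window` generations (i.e. it is stable),
    otherwise 0."""
    if len(best_fitness_history) < window:
        return 0
    recent = best_fitness_history[-window:]
    return 1 if max(recent) - min(recent) <= tol else 0
```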


Cite this article

Chen, ZY., Tsai, CF., Eberle, W. et al. Instance selection by genetic-based biological algorithm. Soft Comput 19, 1269–1282 (2015). https://doi.org/10.1007/s00500-014-1339-0
