
Instance selection by genetic-based biological algorithm

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

Instance selection is an important research problem of data pre-processing in the data mining field. The aim of instance selection is to reduce the data size by filtering out noisy data, which may degrade the mining performance, from a given dataset. Genetic algorithms have proven to be an effective instance selection approach for improving the performance of data mining algorithms. However, current approaches only pursue the simplest evolutionary process based on the most reasonable and simplest rules. In this paper, we introduce a novel instance selection algorithm, namely a genetic-based biological algorithm (GBA). GBA fits a “biological evolution” into the evolutionary process, where the most streamlined process also complies with the reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Consequently, the algorithm closely simulates natural evolution, such that it is both efficient and effective. Our experiments compare GBA with five state-of-the-art algorithms over 50 different domain datasets from the UCI Machine Learning Repository. The experimental results demonstrate that GBA outperforms these baselines, providing the lowest classification error rate and the smallest storage requirement. Moreover, GBA is computationally efficient, requiring only slightly more computation than GA.



Notes

  1. http://archive.ics.uci.edu/ml/.

  2. The experimental environment is as follows: CPU: Intel(R) Core(TM) i7-3770 @ 3.40 GHz, RAM: 32 GB, OS: Windows 7 64-bit, Code: Matlab R2012a.


Author information


Corresponding author

Correspondence to Chih-Fong Tsai.

Additional information

Communicated by V. Loia.

Appendix: The schema theorems corresponding to GA and GBA

The original model of GA is

$$\begin{aligned} m(H,t+1) = m(H,t)\times \frac{f(H)}{\overline{f}}\times \left[ 1-r_c \frac{\delta (H)}{l-1}-o(H)r_m \right] , \end{aligned}$$

where \(H\) represents the schema, \(t\) is the generation, \(m(H,t)\) is the number of strings belonging to schema \(H\) at generation \(t\), \(f(H)\) is the observed fitness of \(H\), \(\overline{f}\) is the average fitness of the population, \(r_c\) is the crossover rate, \(\delta (H)\) is the defining length, \(l\) is the length of the code, \(r_m\) is the mutation rate, and \(o(H)\) is the order of the schema.
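The GA model above can be transcribed directly. The following minimal sketch (in Python rather than the Matlab used for the paper's experiments; all argument values are illustrative, not taken from the paper) computes the expected schema count at the next generation:

```python
def expected_schema_count(m_Ht, f_H, f_avg, r_c, delta_H, l, o_H, r_m):
    """Expected number of strings matching schema H at generation t+1,
    following the GA schema model above (symbols as in the text)."""
    # Survival factor after crossover and mutation disruption
    disruption = 1 - r_c * delta_H / (l - 1) - o_H * r_m
    # Reproduction proportional to relative fitness f(H) / f-bar
    return m_Ht * (f_H / f_avg) * disruption
```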

The modified model of GBA is

$$\begin{aligned}&m(H,t+1)=m(H,t)\times \frac{Nf(H)}{\overline{Nf}}\\&\quad \times \left[ 1-r_c \frac{\delta (H)}{l-1}-o(H)r_m -o(H)r_{mg} \times MGT(m(H,t),t) \right] \\&\quad +\,GK(H), \end{aligned}$$

where \(H\) represents the schema, \(t\) is the generation, \(m(H,t)\) is the number of strings belonging to schema \(H\) at generation \(t\), \(Nf(H)\) is the nonlinear fitness function, \(r_c\) is the crossover rate, \(\delta (H)\) is the defining length, \(l\) is the length of the code, \(r_m\) is the mutation rate, \(o(H)\) is the order of the schema, \(r_{mg}\) is the great migration rate, \(MGT(m(H,t),t)\) is the trigger of the great migration, and \(GK(H)\) is the genetic king protection mechanism, which retains a good schema \(H\).
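GBA's modification adds the great-migration loss term and the \(GK(H)\) bonus to the same expression. A sketch under the same illustrative assumptions (Python, values not from the paper):

```python
def expected_schema_count_gba(m_Ht, nf_H, nf_avg, r_c, delta_H, l,
                              o_H, r_m, r_mg, mgt, gk_H):
    """Expected schema count under the modified GBA model: the GA
    disruption terms plus the great-migration loss (active when
    mgt == 1) and the genetic-king protection bonus gk_H."""
    disruption = (1 - r_c * delta_H / (l - 1)
                  - o_H * r_m
                  - o_H * r_mg * mgt)
    return m_Ht * (nf_H / nf_avg) * disruption + gk_H
```

With `mgt = 0` and `gk_H = 0` the expression reduces exactly to the GA model.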

The nonlinear fitness function \(Nf(H)\) is defined as

$$\begin{aligned} Nf(H)=\frac{\tanh \left( \frac{f(H)-0.5}{\sigma ^2}\right) +1}{2}. \end{aligned}$$

It increases or reduces the fitness strength depending on whether \(f(H)\) lies above or below the threshold.
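The squashing can be written directly with the standard hyperbolic tangent. A minimal sketch (Python; the value of \(\sigma ^2\) is an illustrative choice, not taken from the paper):

```python
import math

def nonlinear_fitness(f_H, sigma2=0.05):
    """Nf(H): tanh squashing of the raw fitness f(H) around the
    midpoint 0.5, rescaled into (0, 1). sigma2 controls sharpness."""
    return (math.tanh((f_H - 0.5) / sigma2) + 1) / 2
```

Fitness values above 0.5 are pushed toward 1 and values below 0.5 toward 0, which is the amplify/suppress behaviour described in the text.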

In addition, \(GK(H)\) represents the genetic king protection mechanism, defined as follows:

If \(Nf(H)\ge Threshold\), then \(GK(H)=1\); otherwise \(GK(H)=0\).

If the schema \(H\) is good enough, it is retained by the genetic king protection mechanism.
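The protection rule is a one-line threshold test; a minimal sketch (Python, threshold value illustrative):

```python
def genetic_king(nf_H, threshold=0.9):
    """GK(H): returns 1 (schema H is protected and retained) when its
    nonlinear fitness reaches the threshold, otherwise 0."""
    return 1 if nf_H >= threshold else 0
```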

The great migration trigger \(MGT(m(H,t),t)\) is defined as follows:

If the best fitness value is stable, then \(MGT(m(H,t),t)=1\); otherwise \(MGT(m(H,t),t)=0\).

If the best fitness value is stable, the great migration (a strong mutation) is applied (cf. Fig. 6 for the pseudocode of GBA).
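The text states only that the trigger fires when the best fitness is "stable"; one common way to operationalise this is a stagnation window, sketched below (the `window` and `tol` parameters are illustrative assumptions, not from the paper):

```python
def migration_trigger(best_fitness_history, window=5, tol=1e-6):
    """MGT: returns 1 when the best fitness has not changed by more
    than tol over the last `window` generations (i.e. it is stable),
    otherwise 0."""
    if len(best_fitness_history) < window:
        return 0
    recent = best_fitness_history[-window:]
    return 1 if max(recent) - min(recent) <= tol else 0
```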


Cite this article

Chen, ZY., Tsai, CF., Eberle, W. et al. Instance selection by genetic-based biological algorithm. Soft Comput 19, 1269–1282 (2015). https://doi.org/10.1007/s00500-014-1339-0
