Abstract
DNA microarrays are being used to characterize the genetic expression of several illnesses, such as cancer. There has been interest in developing automated methods to classify the data generated by those microarrays. The problem is complex because only a few samples are available to train the classifiers, while each sample may contain several thousand features. One possibility is to select a reduced set of features (genes). In this work we propose a wrapper method that is a modified version of the Inertial Geometric Particle Swarm Optimization. We name it MIGPSO. We compare MIGPSO with other approaches. The results are promising: MIGPSO obtained an increase in accuracy of about 4 %, and the number of genes selected is also competitive.
1 Introduction
Genetic data can be used to diagnose, treat, prevent and cure many illnesses. Among the new technologies developed in this area, DNA microarrays have grown quickly in recent years. A DNA microarray is a platform that enables the execution of several thousand experiments simultaneously. The microarray is a matrix of nucleic acid probes (genes) arranged on a solid surface. The probes also contain certain fluorescent nucleotides. When tissue from a subject is put in contact with the microarray, some of the genes from the microarray may bind with genetic material from the tissue. This process is called hybridization. When that happens, there is a fluorescent expression that can be measured. Such information is stored as a vector \({\varvec{x}} = [x_1, x_2, \ldots , x_m]\) where each \(x_i \in {\varvec{x}}\) represents the level of expression of a certain gene. For illnesses such as cancer, the vector usually has several thousand entries. The samples from microarrays are labeled using other techniques such as biopsy. Given the vector and the label, a classifier can be used to learn the patterns behind that information.
Training a classifier with a small number of samples and a large number of features is a challenge. A further complication arises when the datasets are imbalanced. One way to deal with those complications is to retain only the genes that are really relevant to characterize the pattern for a given disease. This is called feature reduction.
In this paper, we propose an improved binary geometric particle swarm optimization to perform feature reduction in DNA microarray datasets. The method we propose is compared with other alternatives reported in the literature. Our method achieves an increase in accuracy of about 4 % over the best previously reported result.
The rest of the paper is organized as follows: Sect. 2 presents an overview of previous works, Sect. 3 describes Particle Swarm Optimization, Sect. 4 introduces our proposed method, Sect. 5 describes the experiments and results and finally, Sect. 6 presents the conclusions and future work.
2 Previous Work
The methods to perform feature selection can be classified as filter methods, wrapper methods and embedded methods [1, 2]. In the filter approach, simple statistics are computed to rank features and select the best ones. Examples of such methods are information gain, Euclidean distance, t-test and Chi-square [3]. Filters are fast, can be applied to high-dimensional datasets, and can be used independently of the classifier. Their major disadvantages are: (a) they consider each feature independently, ignoring possible correlations; (b) they ignore the interaction with the classifier [4].
In the wrapper approach, the feature selection is performed using classifiers. A feature subset is chosen, and the classifier is trained and evaluated. The performance of the classifier is associated with the subset used. The process is repeated until a “good” subset is found [1]. In general, this approach provides a better classification accuracy, since the features are selected according to their contribution to the accuracy of the classifier. The disadvantage of the wrapper approach is its computational cost when combined with classifiers such as neural networks and support vector machines.
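The wrapper loop described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a leave-one-out 1-nearest-neighbour classifier stands in for whatever classifier is wrapped, and random subset generation stands in for the search strategy (PSO, ACO, GA and so on).

```python
import random

def loo_accuracy_1nn(X, y, subset):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `subset` features."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue  # the held-out sample cannot be its own neighbour
            d = sum((xi[k] - xj[k]) ** 2 for k in subset)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)

def wrapper_search(X, y, subset_size, n_trials, seed=0):
    """Propose feature subsets, score each with the classifier, keep the best.

    Random proposals are an assumption made for brevity; a real wrapper
    would use a guided search instead.
    """
    rng = random.Random(seed)
    m = len(X[0])
    best_subset, best_acc = None, -1.0
    for _ in range(n_trials):
        subset = sorted(rng.sample(range(m), subset_size))
        acc = loo_accuracy_1nn(X, y, subset)  # classifier performance scores the subset
        if acc > best_acc:
            best_subset, best_acc = subset, acc
    return best_subset, best_acc
```

Every candidate subset costs one full train-and-evaluate cycle, which is why the approach becomes expensive with slower classifiers.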
The authors of [5] report a hybrid method with three steps. First, they use unconditional univariate mixture modeling; then they rank features using information gain; in the third phase they apply a Markov blanket filter. This method works well to eliminate redundant features. The experiments included three classifiers: k-NN, a Gaussian generative model and logistic regression. Guyon et al. [6] present a complete analysis of gene selection for cancer identification. In their work, they show that the selected features are more important than the classifier, and that ranking coefficients for features can be used as classifier weights. Their proposal consists of a Support Vector Machine (SVM) with Recursive Feature Elimination (RFE) as gene selector. The method is tested on several public databases. In [7], a hybrid model is introduced. This model consists of two steps. The first one is a filter to rank the genes. The second one evaluates the subsets with a wrapper. The experiments included four public cancer datasets. In [8], Ant Colony Optimization (ACO) is used for gene selection. The experiments included data from prostate tumors and lung carcinomas. The paper is not clear on how the subsets are evaluated, but it shows some results using SVM and a Multi-Layer Perceptron (MLP). In [9], a wrapper approach is used, combining ACO with SVM. Each ant constructs a subset of features that is evaluated using an SVM. The classifier accuracy is used to update the pheromone. The experiments include five datasets.
Regarding Particle Swarm Optimization (PSO), in [10] the authors propose a method named IBPSO, which stands for Improved Binary Particle Swarm Optimization. The information on which genes will be considered at any given time is provided through a binary set, where 1 means that the corresponding gene will be used and 0 otherwise. The evaluation of subsets is given by the accuracy obtained by a 1-NN (Nearest Neighbor). The experiments included eleven datasets. In nine of those cases the technique outperformed other approaches. In [11], a fairly complex process is introduced. The authors proposed an Integer-Coded Genetic Algorithm (ICGA) to perform gene selection. Then, a PSO driven Extreme Learning Machine (ELM) algorithm is used to deal with problems related to sparse/imbalanced data. Vieira [12], on the other hand, presents a modified binary PSO (MBPSO) for feature selection with the simultaneous optimization of SVM kernel parameter setting. The algorithm is applied to the analysis of mortality in septic patients.
In [13], the authors propose a two-step process. In the first step, a filter is applied to retain the genes with minimum redundancy and maximum relevance (MRMR). The second step uses a genetic algorithm to identify the highly discriminating genes. The experiments were run on five different datasets. In a recent work, Chen et al. [14] combine a PSO algorithm with the C4.5 classifier. The important genes were proposed using PSO, and then C4.5 was employed as a fitness function.
3 Particle Swarm Optimization
Particle Swarm Optimization (PSO) is an optimization technique that was originally developed by Kennedy and Eberhart [15] based on studies of social models of animals or insects. Since then, it has gained popularity, and many applications have been reported [16]. Particle swarm optimization is similar to a genetic algorithm in the sense that both work with a population of possible solutions.
In PSO, the potential solutions (particles) are encoded by means of a position vector \({\varvec{x}}\). Each particle also has a velocity vector \({\varvec{v}}\) that is used to guide the exploration of the search space. The velocity of each particle is modified iteratively by its personal best position \({\varvec{p}}\) (i.e., the position giving the best fitness value so far), and \({\varvec{g}}\), the best position of the entire swarm. In this way, each particle combines its own knowledge (local information) with the collective knowledge (global information). At each time step t, the velocity is updated and the particle is moved to a new position. The new position \({\varvec{x}}(t+1)\) is given by Eq. 1, where \({\varvec{x}}(t)\) is the previous position and \({\varvec{v}}(t+1)\) the new velocity.
\[ {\varvec{x}}(t+1) = {\varvec{x}}(t) + {\varvec{v}}(t+1) \qquad (1) \]
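The new velocity is computed by Eq. 2:
\[ {\varvec{v}}(t+1) = {\varvec{v}}(t) + U(0,\phi _1)\,({\varvec{p}}(t) - {\varvec{x}}(t)) + U(0,\phi _2)\,({\varvec{g}}(t) - {\varvec{x}}(t)) \qquad (2) \]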
where U(a, b) is a uniformly distributed number in [a, b]. \(\phi _1\) and \(\phi _2\) are often called acceleration coefficients. They determine the significance of \({\varvec{p}}(t)\) and \({\varvec{g}}(t)\), respectively [17]. The process of updating the position and velocity of the particles continues until a certain stopping criterion is met [18]. In the original version of the algorithm, the velocity was kept within a range \([-V_{max}, +V_{max} ]\).
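As an illustration, one update of a single particle in continuous PSO can be sketched as below. The sketch assumes the usual form of the update, in which the velocity is pulled toward \({\varvec{p}}\) and \({\varvec{g}}\) by random factors drawn from \(U(0,\phi _1)\) and \(U(0,\phi _2)\). Drawing a fresh random factor for every component is one common variant and is an assumption here, as is the function name.

```python
import random

def pso_step(x, v, p, g, phi1, phi2, v_max, rng=random):
    """One velocity-and-position update for a single particle.

    x, v : current position and velocity (lists of floats)
    p, g : personal best and global best positions
    """
    new_v = []
    for xi, vi, pi, gi in zip(x, v, p, g):
        # Previous velocity plus random pulls toward p and g.
        vel = (vi
               + rng.uniform(0.0, phi1) * (pi - xi)
               + rng.uniform(0.0, phi2) * (gi - xi))
        # Keep the velocity inside [-v_max, +v_max].
        new_v.append(max(-v_max, min(v_max, vel)))
    # The new position is the old position plus the new velocity.
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v
```

With both acceleration coefficients set to zero the particle simply drifts along its clamped velocity, which is a quick way to check the update in isolation.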
3.1 Inertial Geometric PSO
The original PSO algorithm has been extended in several ways. For instance, in [19], the authors propose a modification of Eq. 2 that introduces a parameter \(\omega \), called inertia weight. This parameter reduces the importance of \(V_{max}\) and is interpreted as the fluidity of the medium in which particles are moving [18]. Thus, the equation to update the velocity of particles is given by Eq. 3:
\[ {\varvec{v}}(t+1) = \omega \,{\varvec{v}}(t) + U(0,\phi _1)\,({\varvec{p}}(t) - {\varvec{x}}(t)) + U(0,\phi _2)\,({\varvec{g}}(t) - {\varvec{x}}(t)) \qquad (3) \]
PSO was conceived as a method for the optimization of continuous nonlinear functions [15]. However, Moraglio et al. [20] proposed a very interesting way of using PSO for Euclidean, Manhattan and Hamming spaces. This is relevant to us since, in the particular case of DNA microarray feature selection, we search a combinatorial space. By defining a geometric framework, the authors generalize the operation of PSO to any search space endowed with a distance. This approach is called Geometric PSO (GPSO). In a later work, the same group extends GPSO to include the inertia weight [21], giving rise to Inertial GPSO (IGPSO). Algorithm 1 shows the pseudocode of IGPSO [21]. The definitions of the convex combination and the extension ray recombination are included in the same reference. Note that, unlike PSO, GPSO and IGPSO do not explicitly compute the velocity of the particles. They only compute the position at different times.
4 Methodology
We first tried to use the original IGPSO algorithm on DNA microarray data. However, preliminary experiments showed that the method did not perform well on our problem. A further analysis of the causes allowed us to propose a modification that improves the behavior of the method. Here we describe our methodology. It consists of four steps: (1) encoding of DNA microarray feature selection, (2) initialization of the population, (3) definition of a fitness function and (4) adaptation of IGPSO to update the particles.
4.1 Encoding of DNA Microarray Data
A solution for the gene selection problem in DNA microarray data basically consists of a list of genes that are used to train a classifier. A common way to encode a solution i is by using a binary vector \({\varvec{x}}_i = [x_{i1} , x_{i2} , \ldots , x_{im}] \) where each bit \(x_{ij}\) takes a binary value. If gene j is selected in solution i, \(x_{ij}\leftarrow 1\), otherwise \(x_{ij}\leftarrow 0\). m represents the total number of genes. Such particles can be represented as points in a Hamming space.
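A small sketch of this encoding, together with the Hamming distance that measures how far apart two particles are. The function names and the 0-based gene indices are choices made for the example only.

```python
def encode(selected_genes, m):
    """Binary vector of length m with a 1 at every selected gene index (0-based)."""
    bits = [0] * m
    for j in selected_genes:
        bits[j] = 1
    return bits

def decode(x):
    """Indices of the genes selected by the binary vector x."""
    return [j for j, bit in enumerate(x) if bit == 1]

def hamming_distance(a, b):
    """Number of positions in which two equal-length binary vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))
```

For instance, with m = 5 the selections {0, 3} and {3, 4} are encoded as [1, 0, 0, 1, 0] and [0, 0, 0, 1, 1], which are at Hamming distance 2.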
4.2 Population Initialization
The population is divided into four groups. The first one contains 10 % of the total population. Particles in this set are initialized with N genes chosen at random. The second group contains 20 % of the population. Those particles are initialized with 2N genes selected. The third group contains 30 % of the population and its elements are initialized with 3N genes selected. The rest of the particles (40 %) are initialized with m/2 genes selected. Following the advice of [22], we define \(N \leftarrow 4\).
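The initialization scheme can be sketched as follows. The function names are hypothetical, and the rounding of group sizes and the use of integer division for m/2 are assumptions, since the text does not say how non-integer sizes are handled.

```python
import random

def random_particle(m, n_selected, rng):
    """Binary vector of length m with exactly n_selected ones at random positions."""
    bits = [0] * m
    for j in rng.sample(range(m), min(n_selected, m)):
        bits[j] = 1
    return bits

def init_population(pop_size, m, n_base=4, seed=None):
    """10 % of particles with N genes, 20 % with 2N, 30 % with 3N, the rest with m/2."""
    rng = random.Random(seed)
    # (fraction of the population, number of genes selected)
    groups = [(0.10, n_base), (0.20, 2 * n_base), (0.30, 3 * n_base)]
    population = []
    for fraction, n_genes in groups:
        for _ in range(int(round(fraction * pop_size))):
            population.append(random_particle(m, n_genes, rng))
    # Remaining particles (about 40 %) start with half of the genes selected.
    while len(population) < pop_size:
        population.append(random_particle(m, m // 2, rng))
    return population
```

With 20 particles and m = 100 this yields 2, 4, 6 and 8 particles selecting 4, 8, 12 and 50 genes, respectively.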
4.3 Fitness Evaluation
Since the method that we propose is a wrapper, we need to train and test a classifier with each combination of genes given by PSO. The accuracy of the classifier is used as the fitness function. In this work, we report results for a multi-layer perceptron (MLP), although different classifiers can be used. Equation 4 shows how to compute the fitness.
where \(\alpha , \beta \) are numbers in the interval [0, 1], such that \(\alpha = 1-\beta \), acc is the accuracy given by the classifier, m is the number of genes in the microarray, and r is the reduced number of genes selected. \(\alpha \) and \(\beta \) represent the importance given to accuracy and number of selected genes, respectively.
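A minimal sketch of a fitness of this kind, assuming the common weighted form \(\alpha \cdot acc + \beta \cdot (m-r)/m\). This form is an assumption that is consistent with the definitions above; it is not necessarily the exact expression of Eq. 4. The function name is hypothetical.

```python
def fitness(acc, r, m, alpha):
    """Weighted fitness rewarding accuracy and small gene subsets (assumed form).

    acc   : classifier accuracy in [0, 1]
    r     : number of genes selected by the particle
    m     : total number of genes in the microarray
    alpha : weight of the accuracy; beta = 1 - alpha weights the reduction
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    # (m - r) / m is the fraction of genes discarded, so fewer genes score higher.
    return alpha * acc + beta * (m - r) / m
```

Under this form, two subsets with equal accuracy are ranked by size, and \(\alpha = 1\) reduces the fitness to the plain accuracy.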
4.4 Particle Position Updating
Our proposal is based on IGPSO [21], with two improvements. We noticed that IGPSO sometimes got stuck in local optima. To avoid that, we propose a threshold value T, along with a counter k. k is increased whenever the global best \({\varvec{g}}\) does not change during an iteration. If the counter reaches T, the global best is reset to all zeros. This induces further exploration of the space, allowing the swarm to escape from the local optimum.
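The stagnation counter can be sketched as below. Two details are assumptions because the text does not specify them: the counter is set back to zero whenever the global best improves, and after a reset the stored fitness is set to minus infinity so that any particle can become the new global best. The function name is hypothetical.

```python
def update_global_best(g, g_fit, candidate, cand_fit, k, T):
    """Update the global best after one iteration and track stagnation.

    candidate, cand_fit : best particle of the current iteration and its fitness
    k                   : iterations since the global best last changed
    T                   : stagnation threshold
    Returns the new (g, g_fit, k).
    """
    if cand_fit > g_fit:
        # The global best changed: adopt it and restart the counter (assumption).
        return list(candidate), cand_fit, 0
    k += 1
    if k >= T:
        # Stagnation: reset the global best to the all-zeros vector.
        return [0] * len(g), float("-inf"), 0
    return g, g_fit, k
```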
The position update is performed in steps 8 and 9 of Algorithm 1. Step 8 consists in a 3-parental uniform crossover for binary strings. In our work, we take that process as defined in [20]. It is important to mention that step 8 includes the use of Eq. 3, which requires three parameters: \(\omega , \phi _1\) and \(\phi _2\).
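A 3-parental uniform crossover for binary strings can be sketched as follows: each bit of the offspring is copied from one of the three parents, chosen with probability equal to that parent's weight. How the weights are derived from \(\omega , \phi _1\) and \(\phi _2\) is defined in [20, 21] and is not reproduced here; the sketch simply assumes three non-negative weights that sum to 1. The function name is hypothetical.

```python
import random

def three_parent_uniform_crossover(x, g, p, weights, rng=random):
    """Weighted uniform crossover of three binary strings.

    x, g, p : current position, global best and personal best
    weights : (w_x, w_g, w_p), assumed non-negative and summing to 1
    """
    w_x, w_g, _w_p = weights
    child = []
    for xi, gi, pi in zip(x, g, p):
        u = rng.random()  # uniform in [0, 1)
        if u < w_x:
            child.append(xi)
        elif u < w_x + w_g:
            child.append(gi)
        else:
            child.append(pi)
    return child
```

Wherever the three parents agree the offspring inherits that bit, so the result always lies in the Hamming convex hull of the parents.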
Step 9 in Algorithm 1 is performed by means of an extension ray in Hamming spaces [21]. The extension ray is defined as \(C \in ER(A,B)\) iff \(C \in [A,B]\) or \(B \in [A,C]\), where A, B and C are points in the space.
For feature selection, the effect of the operations in steps 8 and 9 as originally defined is that the particles select fewer genes at each iteration. Thus, the distance between A and B becomes very small compared with the maximum distance m. Moreover, the positions in which A and B are equal are mostly positions occupied by 0’s, so the bits that get flipped are mostly flipped to 1’s. This means that more genes are selected, which is the opposite of what we are looking for. In our exploratory experiments, this not only increased the time required to compute the fitness of each particle, but also decreased the accuracy.
Thus, we propose a modification to the operator ER, which we call MER. The idea is the following. Once the probability P is computed, a vector \(R=[r_1, r_2, \ldots , r_n]\) of n random numbers in (0, 1) is generated and we count how many \(r_i \le P\). That amount (q) represents the number of bits that are equal in A and B and that will therefore change from 1 to 0, or from 0 to 1, in C. To prevent the operator from choosing only bits in which A and B are 0, only q / 2 bits where A and B are 0 are randomly chosen to be changed to 1, and similarly, q / 2 bits are selected from the ones in which A and B are 1 to be changed to 0. Algorithm 2 describes these steps in detail.
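The balanced flipping in MER can be sketched as follows. Several details are assumptions made for the sake of a runnable example: the probability P is taken as an input, n is taken to be the number of positions in which A and B agree, C keeps the bits of B in every position that is not flipped, q / 2 is rounded down, and if a group has fewer than q / 2 positions all of them are flipped. The function name is hypothetical.

```python
import random

def modified_extension_ray(a, b, prob, rng=random):
    """Balanced extension-ray recombination for binary strings (MER sketch).

    a, b : binary lists of equal length
    prob : flip probability P, assumed to be computed beforehand
    """
    c = list(b)  # start from B; positions where A and B differ are left as in B
    both_zero = [i for i, (ai, bi) in enumerate(zip(a, b)) if ai == bi == 0]
    both_one = [i for i, (ai, bi) in enumerate(zip(a, b)) if ai == bi == 1]
    n = len(both_zero) + len(both_one)
    # q: how many of n random draws fall at or below P.
    q = sum(rng.random() <= prob for _ in range(n))
    half = q // 2
    # Flip q/2 shared zeros to 1 and q/2 shared ones to 0.
    for i in rng.sample(both_zero, min(half, len(both_zero))):
        c[i] = 1
    for i in rng.sample(both_one, min(half, len(both_one))):
        c[i] = 0
    return c
```

Because the same number of bits is taken from each group whenever both groups are large enough, the operator no longer drifts toward selecting more genes when most shared bits are 0.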
5 Experiments and Results
5.1 Experimental Setup
All our experiments were run on an HP ProLiant G8 server with 2 Intel processors of 8 cores each and 256 GB of RAM. The algorithms are evaluated on two public datasets. One is the colon cancer dataset, which contains the expression levels of 2000 genes in 22 normal colon tissues and 40 tumor tissues [23]. The other is the prostate dataset (GEMS), which contains the expression levels of 10,509 genes in 52 normal prostate tissues and 50 tumor tissues [24].
5.2 Parameter Tuning
The algorithm requires the determination of the parameters \(\omega , \phi _1\) and \(\phi _2\). A design of experiments was performed considering three factors (the parameters), with three levels for the first factor and six levels for the other two. A factorial experiment required a total of 47 different combinations of values, with 10 repetitions for each combination. For the colon dataset, the best values were \(\omega = 0.3, \phi _1= 2.2, \phi _2=2.2\). For the prostate dataset, they were \(\omega = 0.3, \phi _1= 1.4, \phi _2=1.4\).
5.3 Comparison with Other Methods
In this section we compare the results of MIGPSO against other methods reported in the literature. Table 1 shows time, accuracy and number of selected genes. As can be observed, MIGPSO performs better on all three. It improves the accuracy by 4 % on average, while the time is reduced to almost half and the number of selected genes to almost one third.
In Table 2, we compare MIGPSO with other wrapper methods on the colon dataset. The classifiers considered are Naive Bayes (NB), C4.5, and Support Vector Machines (SVM). Six different feature selection methods were considered [7]. The best wrapper reported for this dataset is SVM + Hybrid PSO/GA, with 92 % accuracy. MIGPSO with an MLP reaches an accuracy of 96 %, an increase of about 4 percentage points. It is also worth noting that MIGPSO requires fewer genes: only 6, compared with 18 in the other case. Relative to the 2000 genes in the dataset, that is a reduction of more than two orders of magnitude.
Table 3 shows a similar comparison for the prostate dataset. The classifiers considered are SVM, Self-Organizing Map (SOM), back propagation neural network (BPNN), Naïve Bayes, Decision Trees (CART), Artificial Immune Recognition System (AIRS), and PSO with C4.5 Decision Tree (PSODT). All those experiments use PSO for feature selection. The best method reported for this dataset is PSODT, with 94.31 % accuracy. Our MIGPSO with an MLP reaches 98.04 % accuracy, almost 4 percentage points above the best reported.
6 Conclusions and Future Work
New technologies such as DNA microarrays are generating information that can help to detect, treat and prevent illnesses in a better way. However, the generated data presents challenges for current computational methods. The difference between the number of attributes and the number of samples can be of several orders of magnitude. Moreover, most microarray datasets are imbalanced: there are many more samples from healthy subjects than from diseased ones. The main concerns for automated processing and classification of new data are, on the one hand, the reduction of the attributes analyzed and, on the other hand, the use of robust classifiers. Our experiments suggest that a good attribute selection is worthwhile when classifying. Having fewer attributes reduces the processing time, and may also increase the correct classification of new data. In this work we show that a modification of a geometric version of PSO allows a Multi-Layer Perceptron to classify with high accuracy.
There are of course some issues worth exploring in future work, such as: (1) running more experiments with other classifiers in the wrapper, and (2) running experiments with more datasets, particularly with multi-class problems.
References
1. Koller, D., Sahami, M.: Toward optimal feature selection. Technical report, Stanford InfoLab, Stanford University (1996)
2. Kohavi, R., John, G.H.: Wrappers for feature subset selection. J. Artif. Intel. 97(1), 273–324 (1997)
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
4. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. J. Bioinf. 23(19), 2507–2517 (2007)
5. Xing, E.P., Jordan, M.I., Karp, R.M., et al.: Feature selection for high-dimensional genomic microarray data. In: ICML, vol. 1, pp. 601–608. Citeseer (2001)
6. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. J. Mach. Learn. 46(1–3), 389–422 (2002)
7. Ruiz, R., Riquelme, J.C., Aguilar-Ruiz, J.S.: Incremental wrapper-based gene selection from microarray data for cancer classification. J. Patt. Recog. 39(12), 2383–2392 (2006)
8. Chiang, Y.M., Chiang, H.M., Lin, S.Y.: The application of ant colony optimization for gene selection in microarray-based cancer classification. Int. Conf. Mach. Learn. Cybern. 7, 4001–4006 (2008)
9. Yu, H., Gu, G., Liu, H., Shen, J., Zhao, J.: A modified ant colony optimization algorithm for tumor marker gene selection. Genom. Proteom. Bioinf. 7(4), 200–208 (2009)
10. Chuang, L.Y., Chang, H.W., Tu, C.J., Yang, C.H.: Improved binary PSO for feature selection using gene expression data. Comp. Biol. Chem. 32(1), 29–38 (2008)
11. Saraswathi, S., Sundaram, S., Sundararajan, N., Zimmermann, M., Nilsen-Hamilton, M.: ICGA-PSO-ELM approach for accurate multiclass cancer classification resulting in reduced gene sets in which genes encoding secreted proteins are highly represented. T. Comput. Biol. Bioinf. 8(2), 452–463 (2011)
12. Vieira, S.M., Mendonça, L.F., Farinha, G.J., Sousa, J.M.: Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients. J. Appl. Soft Comput. 13(8), 3494–3504 (2013)
13. El Akadi, A., Amine, A., El Ouardighi, A., Aboutajdine, D.: A two-stage gene selection scheme utilizing MRMR filter and GA wrapper. Know. Inf. Syst. 26(3), 487–500 (2011)
14. Chen, K.H., Wang, K.J., Tsai, M.L., Wang, K.M., Adrian, A.M., Cheng, W.C., Yang, T.S., Teng, N.C., Tan, K.P., Chang, K.S.: Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinf. 15(1), 49 (2014)
15. Kennedy, J., Eberhart, R.: Particle swarm optimization. IEEE Int. Conf. Neural Networks 4, 1942–1948 (1995)
16. García-Gonzalo, E., Fernández-Martínez, J.: A brief historical review of particle swarm optimization (PSO). J. Bioinf. Intel. Cont. 1(1), 3–16 (2012)
17. Yang, X.S.: Engineering Optimization: An Introduction With Metaheuristic Applications. Wiley, New York (2010)
18. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intell. 1(1), 33–57 (2007)
19. Shi, Y., Eberhart, R.: A modified particle swarm optimizer. In: IEEE Conference on Evolutionary Computation, pp. 69–73 (1998)
20. Moraglio, A., Di Chio, C., Togelius, J., Poli, R.: Geometric particle swarm optimization. J. Artif. Evol. Appl. 11, 247–250 (2008)
21. Moraglio, A., Togelius, J.: Inertial geometric particle swarm optimization. In: IEEE Conference on Evolutionary Computation, pp. 1973–1980 (2009)
22. Alba, E., García-Nieto, J., Jourdan, L., Talbi, E.G.: Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. In: IEEE Conference on Evolutionary Computation, pp. 284–290 (2007)
23. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96(12), 6745–6750 (1999)
24. Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2), 203–209 (2002)
Acknowledgments
The authors thank Tecnológico de Monterrey, Campus Guadalajara, as well as IPN-CIC under project SIP 20151187, and CONACYT under project 155014, for the financial support to carry out this research.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Garibay, C., Sanchez-Ante, G., Falcon-Morales, L.E., Sossa, H. (2015). Modified Binary Inertial Particle Swarm Optimization for Gene Selection in DNA Microarray Data. In: Carrasco-Ochoa, J., Martínez-Trinidad, J., Sossa-Azuela, J., Olvera López, J., Famili, F. (eds) Pattern Recognition. MCPR 2015. Lecture Notes in Computer Science(), vol 9116. Springer, Cham. https://doi.org/10.1007/978-3-319-19264-2_26
DOI: https://doi.org/10.1007/978-3-319-19264-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19263-5
Online ISBN: 978-3-319-19264-2
eBook Packages: Computer Science, Computer Science (R0)