
1 Introduction

Genetic data can be used to diagnose, treat, prevent and cure many illnesses. Among the new technologies developed in this area, one that has grown rapidly in recent years is the DNA microarray. A DNA microarray is a platform that enables the execution of several thousand experiments simultaneously. The microarray is a matrix of nucleic acid probes (genes) arranged on a solid surface. The probes also contain certain fluorescent nucleotides. When tissue from a subject is put in contact with the microarray, some of the genes on the microarray may bind with genetic material from the tissue. This process is called hybridization. When that happens, there is a fluorescent expression that can be measured. This information is stored as a vector \({\varvec{x}} = [x_1, x_2, \ldots , x_m]\), where each \(x_i \in {\varvec{x}}\) represents the expression level of a certain gene. For illnesses such as cancer, the vector usually has several thousand components. The samples from microarrays are labeled using other techniques such as biopsy. Given the vectors and their labels, a classifier can be trained to learn the patterns behind that information.

Training a classifier with a small number of samples and a large number of features is a challenge. A further complication arises when the datasets are imbalanced. One way to deal with these complications is to retain only those genes that are truly relevant for characterizing the pattern of a given disease. This is called feature reduction.

In this paper, we propose an improved binary geometric particle swarm optimization method to perform feature reduction in DNA microarray datasets. The method is compared with other alternatives reported in the literature, and allows an increase in accuracy of about 3 % over the best previously reported result.

The rest of the paper is organized as follows: Sect. 2 presents an overview of previous work, Sect. 3 describes Particle Swarm Optimization, Sect. 4 introduces our proposed method, Sect. 5 describes the experiments and results, and finally Sect. 6 presents the conclusions and future work.

2 Previous Work

The methods to perform feature selection can be classified as filter methods, wrapper methods and embedded methods [1, 2]. In the filter approach, simple statistics are computed to rank features and select the best ones. Examples of such methods are information gain, Euclidean distance, t-test and Chi-square [3]. Filters are fast, can be applied to high-dimensional datasets, and can be used independently from the classifier. Their major disadvantages are: (a) they consider each feature independently, ignoring possible correlations; (b) they ignore the interaction with the classifier [4].

In the wrapper approach, feature selection is performed using classifiers. A feature subset is chosen, and the classifier is trained and evaluated. The performance of the classifier is associated with the subset used. The process is repeated until a “good” subset is found [1]. In general, this approach provides better classification accuracy, since the features are selected according to their contribution to the classification accuracy of the classifier. The disadvantage of the wrapper approach is its computational cost when combined with classifiers such as neural networks and support vector machines.

The authors of [5] report a hybrid method that consists of three steps: first, they use unconditional univariate mixture modeling; then they rank features using information gain; and in the third phase they apply a Markov blanket filter. This method works well for eliminating redundant features. The experiments included three classifiers: k-NN, a Gaussian generative model and logistic regression. Guyon et al. [6] present a complete analysis of gene selection for cancer identification. In their work, they show that the selected features are more important than the classifier, and that feature ranking coefficients can be used as classifier weights. Their proposal consists of a Support Vector Machine (SVM) with Recursive Feature Elimination (RFE) as the gene selector. The method is tested on several public databases. In [7], a hybrid model is introduced that consists of two steps: the first is a filter to rank the genes; the second evaluates the subsets with a wrapper. The experiments included four public cancer datasets. In [8], Ant Colony Optimization (ACO) is used for gene selection. The experiments included information from prostate tumors and lung carcinomas. The paper is not clear on how the subsets are evaluated, but the authors show some results using SVM and a Multi-Layer Perceptron (MLP). In [9], a wrapper approach is used, combining ACO with SVM. Each ant constructs a subset of features that is evaluated using an SVM, and the classifier accuracy is used to update the pheromone. The experiments include five datasets.

Regarding Particle Swarm Optimization (PSO), in [10] the authors propose a method named IBPSO, which stands for Improved Binary Particle Swarm Optimization. The information on which genes are considered at any given time is provided through a binary set, where 1 means that the corresponding gene is used and 0 otherwise. Subsets are evaluated by the accuracy obtained with a 1-NN (Nearest Neighbor) classifier. The experiments included eleven datasets; in nine of them the technique outperformed other approaches. In [11], a fairly complex process is introduced: the authors propose an Integer-Coded Genetic Algorithm (ICGA) to perform gene selection, and then a PSO-driven Extreme Learning Machine (ELM) is used to deal with problems related to sparse/imbalanced data. Vieira [12], on the other hand, presents a modified binary PSO (MBPSO) for feature selection with simultaneous optimization of the SVM kernel parameters. The algorithm is applied to the analysis of mortality in septic patients.

In [13], the authors propose a two-step process. In the first step, a filter is applied to retain the genes with minimum redundancy and maximum relevance (MRMR). The second step uses a genetic algorithm to identify the most discriminating genes. The experiments were run on five different datasets. In a recent work, Chen et al. [14] combine a PSO algorithm with the C4.5 classifier: candidate genes are proposed using PSO, and C4.5 is employed as the fitness function.

3 Particle Swarm Optimization

Particle Swarm Optimization (PSO) is an optimization technique originally developed by Kennedy and Eberhart [15], based on studies of social models of animals and insects. Since then, it has gained popularity, and many applications have been reported [16]. PSO is similar to a genetic algorithm in the sense that both work with a population of possible solutions.

In PSO, the potential solutions (particles) are encoded by means of a position vector \({\varvec{x}}\). Each particle also has a velocity vector \({\varvec{v}}\) that is used to guide the exploration of the search space. The velocity of each particle is modified iteratively according to its personal best position \({\varvec{p}}\) (i.e., the position giving the best fitness value so far) and \({\varvec{g}}\), the best position found by the entire swarm. In this way, each particle combines its own knowledge (local information) with the collective knowledge (global information). At each time step t, the velocity is updated and the particle is moved to a new position. The new position \({\varvec{x}}(t+1)\) is given by Eq. 1, where \({\varvec{x}}(t)\) is the previous position and \({\varvec{v}}(t+1)\) the new velocity.

$$\begin{aligned} {\varvec{x}}(t+1) = {\varvec{x}}(t)+{\varvec{v}}(t+1) \end{aligned}$$
(1)

The new velocity is computed by Eq. 2:

$$\begin{aligned} {\varvec{v}}(t+1) = {\varvec{v}}(t)+U(0,\phi _1)({\varvec{p}}(t)-{\varvec{x}}(t)) + U(0,\phi _2)({\varvec{g}}(t)-{\varvec{x}}(t)) \end{aligned}$$
(2)

where \(U(a,b)\) is a uniformly distributed random number in \([a,b]\). \(\phi _1\) and \(\phi _2\) are often called acceleration coefficients; they determine the significance of \({\varvec{p}}(t)\) and \({\varvec{g}}(t)\), respectively [17]. The process of updating the position and velocity of the particles continues until a certain stopping criterion is met [18]. In the original version of the algorithm, the velocity was kept within a range \([-V_{max}, +V_{max}]\).

3.1 Inertial Geometric PSO

The original PSO algorithm has been extended in several ways. For instance, in [19], the authors propose a modification of Eq. 2 that introduces a parameter \(\omega \), called the inertia weight. This parameter reduces the importance of \(V_{max}\) and is interpreted as the fluidity of the medium in which the particles move [18]. Thus, the equation to update the velocity of the particles becomes Eq. 3:

$$\begin{aligned} {\varvec{v}}(t+1) = \omega {\varvec{v}}(t)+U(0,\phi _1)({\varvec{p}}(t)-{\varvec{x}}(t)) + U(0,\phi _2)({\varvec{g}}(t)-{\varvec{x}}(t)) \end{aligned}$$
(3)
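To make the update concrete, the following minimal Python sketch implements Eqs. 1 and 3 for a single particle; setting \(\omega = 1\) recovers the original update of Eq. 2. Drawing the random factors component-wise, as done here, is a common convention rather than something prescribed by the paper.

import numpy as np

def pso_step(x, v, p, g, omega, phi1, phi2, rng=None):
    """One PSO update for a single particle (Eqs. 1 and 3)."""
    rng = rng or np.random.default_rng()
    u1 = rng.uniform(0.0, phi1, size=x.shape)  # U(0, phi_1), component-wise
    u2 = rng.uniform(0.0, phi2, size=x.shape)  # U(0, phi_2), component-wise
    v_new = omega * v + u1 * (p - x) + u2 * (g - x)  # Eq. 3
    x_new = x + v_new                                # Eq. 1
    return x_new, v_new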

PSO was conceived as a method for the optimization of continuous nonlinear functions [15]. However, Moraglio et al. [20] proposed a very interesting way of using PSO in Euclidean, Manhattan and Hamming spaces. This is relevant to us since, in the particular case of DNA microarray feature selection, we search a combinatorial space. By defining a geometric framework, the authors generalize the operation of PSO to any search space endowed with a distance. This approach is called Geometric PSO (GPSO). In later work, the same group extended GPSO to include the inertia weight [21], giving rise to Inertial GPSO (IGPSO). Algorithm 1 shows the pseudocode of IGPSO [21]. The definitions of the convex combination and the extension ray recombination are included in the same reference. Note that, unlike PSO, GPSO and IGPSO do not explicitly compute the velocity of the particles; they only compute the positions at different times.

[Algorithm 1: pseudocode of IGPSO]
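As an illustration of the convex combination used in Algorithm 1, the following sketch performs a three-parental uniform crossover on binary strings: each bit of the offspring is inherited from the current position, the personal best, or the global best, with probabilities proportional to three non-negative weights. The mapping from \(\omega \), \(\phi _1\) and \(\phi _2\) to these weights follows [21] and is assumed to be done by the caller; this is a sketch of the operator, not the authors' exact implementation.

import numpy as np

def convex_combination_3(x, p, g, w_x, w_p, w_g, rng=None):
    # Each bit is copied from x, p or g with probability proportional
    # to its weight, mirroring the GPSO convex combination of [20].
    rng = rng or np.random.default_rng()
    w = np.array([w_x, w_p, w_g], dtype=float)
    w = w / w.sum()                      # normalize to a probability vector
    parents = np.stack([x, p, g])        # shape (3, m)
    idx = rng.choice(3, size=x.size, p=w)
    return parents[idx, np.arange(x.size)]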

4 Methodology

We initially tried to use the original IGPSO algorithm on DNA microarray data. However, in preliminary experiments we found that the method did not perform well on our problem. A further analysis of the causes allowed us to propose a modification that improves the behavior of the method. Here we describe our methodology, which consists of four steps: (1) encoding of the DNA microarray feature selection problem, (2) initialization of the population, (3) definition of a fitness function, and (4) adaptation of IGPSO to update the particles.

4.1 Encoding of DNA Microarray Data

A solution to the gene selection problem in DNA microarray data basically consists of a list of genes that are used to train a classifier. A common way to encode a solution i is with a binary vector \({\varvec{x}}_i = [x_{i1} , x_{i2} , \ldots , x_{im}]\), where each bit \(x_{ij}\) takes a binary value: if gene j is selected in solution i, \(x_{ij}\leftarrow 1\); otherwise \(x_{ij}\leftarrow 0\). Here m represents the total number of genes. Such particles can be represented in a Hamming space.
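For illustration, the snippet below shows this encoding for a hypothetical microarray with \(m = 2000\) genes; the selected indices are arbitrary.

import numpy as np

m = 2000
x_i = np.zeros(m, dtype=np.uint8)   # 0 = gene j not selected
x_i[[5, 42, 117, 1500]] = 1         # 1 = gene j selected

# Given an expression matrix X of shape (n_samples, m), the classifier
# is trained only on the selected columns:
# X_reduced = X[:, x_i == 1]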

4.2 Population Initialization

The population is divided into four groups. The first contains 10 % of the total population; particles in this group are initialized with N genes chosen at random. The second group contains 20 % of the population; those particles are initialized with 2N genes selected. The third group contains 30 % of the population, and its elements are initialized with 3N genes selected. The rest of the particles (40 %) are initialized with \(m/2\) of the genes selected. Following the advice of [22], we define \(N \leftarrow 4\).
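A sketch of this initialization scheme is given below; how fractional group sizes are rounded is our assumption, since the paper does not specify it.

import numpy as np

def init_population(pop_size, m, N=4, rng=None):
    # Groups of 10 %, 20 % and 30 % of the swarm start with N, 2N and 3N
    # genes selected; the remaining ~40 % start with m/2 genes selected.
    rng = rng or np.random.default_rng()
    sizes = [int(0.1 * pop_size), int(0.2 * pop_size), int(0.3 * pop_size)]
    counts = [N, 2 * N, 3 * N]
    population = []
    for group_size, n_genes in zip(sizes, counts):
        for _ in range(group_size):
            x = np.zeros(m, dtype=np.uint8)
            x[rng.choice(m, size=n_genes, replace=False)] = 1
            population.append(x)
    while len(population) < pop_size:
        x = np.zeros(m, dtype=np.uint8)
        x[rng.choice(m, size=m // 2, replace=False)] = 1
        population.append(x)
    return np.array(population)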

4.3 Fitness Evaluation

Since the method we propose is a wrapper, we need to train and test a classifier with each combination of genes given by PSO. The accuracy of the classifier is the main component of the fitness function. In this work, we report results for a multi-layer perceptron (MLP), although other classifiers can be used. Equation 4 shows how the fitness is computed.

$$\begin{aligned} f({\varvec{x}}_i) = \alpha (acc) + \beta \left( \frac{m-r}{m-1} \right) \end{aligned}$$
(4)

where \(\alpha , \beta \) are numbers in the interval [0, 1], such that \(\alpha = 1-\beta \), acc is the accuracy given by the classifier, m is the number of genes in the microarray, and r is the reduced number of genes selected. \(\alpha \) and \(\beta \) represent the importance given to accuracy and number of selected genes, respectively.
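The sketch below is one possible implementation of Eq. 4 with an MLP as the wrapped classifier. The 3-fold cross-validation and the value \(\alpha = 0.9\) are illustrative assumptions; the paper does not report the exact evaluation protocol or weights used.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def fitness(x_i, X, y, alpha=0.9):
    # Eq. 4: weighted sum of classifier accuracy and gene reduction.
    m = x_i.size
    r = int(x_i.sum())
    if r == 0:
        return 0.0   # an empty subset cannot be evaluated
    acc = cross_val_score(MLPClassifier(max_iter=500),
                          X[:, x_i == 1], y, cv=3).mean()
    beta = 1.0 - alpha
    return alpha * acc + beta * (m - r) / (m - 1)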

4.4 Particle Position Updating

Our proposal, which we call MIGPSO, is based on IGPSO [21] with two improvements. We noticed that IGPSO sometimes got stuck in local optima. To avoid this, we propose a threshold value T, along with a counter k. The counter k is increased whenever an iteration leaves the global best \({\varvec{g}}\) unchanged. If the counter reaches T, the global best is reset to all zeros. This induces exploration of the space, allowing the swarm to escape from local optima.
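The following sketch illustrates the escape mechanism; the threshold \(T = 10\) is an illustrative value, as the paper does not report the one used.

import numpy as np

class StagnationReset:
    # Reset the global best to all zeros after T stagnant iterations.
    def __init__(self, T=10):
        self.T = T
        self.k = 0                      # stagnation counter
        self.best_fitness = -np.inf

    def update(self, g, g_fitness):
        if g_fitness > self.best_fitness:
            self.best_fitness = g_fitness
            self.k = 0                  # progress was made: reset counter
        else:
            self.k += 1
            if self.k >= self.T:
                g = np.zeros_like(g)    # reset g to force exploration
                self.k = 0
        return g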

The position update is performed in steps 8 and 9 of Algorithm 1. Step 8 consists of a 3-parental uniform crossover for binary strings; we take that process as defined in [20]. It is important to mention that step 8 involves Eq. 3, which requires three parameters: \(\omega \), \(\phi _1\) and \(\phi _2\).

Step 9 of Algorithm 1 is performed by means of an extension ray in Hamming space [21]. The extension ray is defined as \(C \in ER(A,B)\) iff \(C \in [A,B]\) or \(B \in [A,C]\), where \(A\), \(B\) and \(C\) are points in the space.
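In a Hamming space, segment membership can be tested directly with distances: \(X \in [A,B]\) iff \(H(A,X)+H(X,B)=H(A,B)\). The small sketch below encodes the definition of ER under that standard characterization.

import numpy as np

def hamming(u, v):
    return int(np.sum(u != v))

def in_segment(A, X, B):
    # X lies in the Hamming segment [A, B] iff distances are additive.
    return hamming(A, X) + hamming(X, B) == hamming(A, B)

def in_extension_ray(A, B, C):
    # C in ER(A, B) iff C in [A, B] or B in [A, C].
    return in_segment(A, C, B) or in_segment(A, B, C)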

For feature selection, the effect of the operations in steps 8 and 9, as originally defined, is that at each iteration the particles select fewer genes. Thus, the distance between A and B becomes very small compared with the maximum distance m. Moreover, the positions in which A and B are equal are mostly positions occupied by 0's, and it is precisely those bits that the extension ray tends to flip to 1's. This means that more genes become selected, which is the opposite of what we are looking for. In our exploratory experiments, this not only increased the time required to compute the fitness of each particle, but also decreased the accuracy.

Thus, we propose a modification of the operator ER, which we call MER. The idea is the following. Once the probability P is computed, a vector \(R=[r_1, r_2, \ldots , r_n]\) of n random numbers in (0, 1) is generated, and we count how many satisfy \(r_i \le P\). That amount, q, is the number of bits that are equal in A and B and that will therefore change from 1 to 0, or from 0 to 1, in C. To prevent the operator from choosing only those bits in which A and B are 0, only q/2 bits where A and B are 0 are randomly chosen to be changed to 1 and, similarly, q/2 bits are selected from those in which A and B are 1 to be changed to 0. Algorithm 2 describes these steps in detail.

[Algorithm 2: pseudocode of the MER operator]
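A sketch of MER, as we understand it from the description above, follows; the probability P is computed as in the original extension ray operator of [21] and is passed in as an argument.

import numpy as np

def mer(A, B, P, rng=None):
    # Modified extension ray: flip q bits that agree in A and B, splitting
    # the flips evenly between 0 -> 1 and 1 -> 0 so the operator does not
    # systematically increase the number of selected genes.
    rng = rng or np.random.default_rng()
    C = B.copy()                         # bits where A != B are kept from B
    equal = (A == B)
    n = int(equal.sum())
    q = int((rng.random(n) <= P).sum())  # how many equal bits to flip
    zeros = np.flatnonzero(equal & (A == 0))
    ones = np.flatnonzero(equal & (A == 1))
    flip_to_one = rng.permutation(zeros)[: q // 2]
    flip_to_zero = rng.permutation(ones)[: q // 2]
    C[flip_to_one] = 1
    C[flip_to_zero] = 0
    return C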

5 Experiments and Results

5.1 Experimental Setup

All our experiments were run on an HP ProLiant G8 server with two Intel processors (8 cores each) and 256 GB of RAM. The algorithms are evaluated on two public datasets. One is the colon cancer dataset, which contains the expression levels of 2000 genes in 22 normal colon tissues and 40 tumor tissues [23]. The other is the prostate dataset (GEMS), which contains the expression levels of 10,509 genes in 52 normal prostate tissues and 50 tumor tissues [24].

5.2 Parameter Tuning

The algorithm requires setting the parameters \(\omega , \phi _1\) and \(\phi _2\). A design of experiments was performed considering three factors (the parameters), with three levels for the first factor and six levels for each of the other two. The experiment comprised a total of 47 different combinations of values, with 10 repetitions for each combination. For the colon dataset, the best values were \(\omega = 0.3, \phi _1= 2.2, \phi _2=2.2\); for prostate cancer, \(\omega = 0.3, \phi _1= 1.4, \phi _2=1.4\).

5.3 Comparison with Other Methods

In this section we compare the results of MIGPSO against other methods reported in the literature. Table 1 shows time, accuracy and number of selected genes. As can be observed, MIGPSO performs better on all three criteria: it improves accuracy by an average of 4 %, while the running time is reduced to almost half and the number of selected genes to almost one third.

In Table 2, we show a comparison on the colon dataset with other wrapper methods. The classifiers considered are Naive Bayes (NB), C4.5, and Support Vector Machines (SVM). Six different methods for feature selection were considered [7]. The best wrapper previously reported for this dataset is SVM + Hybrid PSO/GA, with 92 % accuracy. MIGPSO with an MLP reaches an accuracy of 96 %, an increase of almost 4 % on average. It is also worth noting that MIGPSO requires fewer genes: only 6, compared with 18 in the other case. Relative to the total number of genes, that is a reduction of close to three orders of magnitude.

Table 1. Comparison between the original IGPSO and MIGPSO. For prostate, IGPSO was allowed to run for 13 hours and did not produce an answer.
Table 2. Comparison with several classifiers and feature selection methods for colon cancer dataset.
Table 3. Comparison for prostate cancer dataset. The first seven results are taken from [14]. The number of genes for some of the methods was not available.

Table 3 shows a similar comparison for the prostate dataset. The classifiers considered are SVM, Self-Organizing Map (SOM), back-propagation neural network (BPNN), Naïve Bayes, decision trees (CART), Artificial Immune Recognition System (AIRS), and PSO with a C4.5 decision tree (PSODT). All these experiments use PSO for feature selection. The best method previously reported for this dataset is PSODT, with 94.31 % accuracy. Our MIGPSO with an MLP reaches 98.04 % accuracy, almost 4 % above the best reported result.

6 Conclusions and Future Work

New technologies such as DNA microarrays are generating information that can be relevant for detecting, treating and preventing illnesses in a better way. However, the generated data presents challenges for current computational methods. The difference between the number of attributes and the number of samples can be of several orders of magnitude. Moreover, most microarray datasets are imbalanced: there are many more samples from healthy than from diseased subjects. The main concerns for the automated processing and classification of new data are, on the one hand, the reduction of the number of attributes analyzed and, on the other, the use of robust classifiers. Our experiments suggest that performing a good attribute selection pays off when classifying: having fewer attributes reduces the processing time, and may also improve the correct classification of new data. In this work we showed that a modification of a geometric version of PSO allows a Multi-Layer Perceptron to classify with high accuracy.

There are, of course, some issues worth exploring in future work, such as: (1) running more experiments with other classifiers in the wrapper, and (2) running experiments with more datasets, particularly multi-class problems.