Feature selection using differential evolution for microarray data classification

Microarray datasets are very high-dimensional and contain noise and redundancy. Their main difficulty is that the number of features far exceeds the number of samples (that is, there are more columns than rows), which adversely affects algorithm performance. A robust technique is therefore required to extract precise information from them. Microarray datasets play a critical role in detecting various diseases, including cancers and tumors, and this is where feature selection (FS) comes into play. FS has recently gained significant importance as a data preparation method, particularly for high-dimensional data. It is preferable to solve classification problems with fewer features while maintaining high accuracy, since not all features are needed to achieve this goal. The primary objective of feature selection is to identify the optimal subset of features. In this work, we employ the Differential Evolution (DE) algorithm, a population-based stochastic search approach that is widely used in scientific and technical domains to solve optimization problems in continuous spaces. We combine DE with three classification algorithms: Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). Our analysis compares the accuracy achieved by each model on each dataset, as well as the fitness error of each model. The results indicate that classification with feature selection outperformed classification without it.


Introduction
The enormous dimensionality of microarray [1,2] data has always made it challenging to classify. The development of microarray technology has made it possible for researchers to examine more than 1000 genes in a single experiment. The quantitative information in microarray data is valuable despite its size, but because of that size, analysis suffers from algorithmic instability and a large computational cost. Further problems include small sample counts, inadequate validation, and a high number of feature genes affected by noise and outliers in gene expression. Microarray technology is frequently used to measure and monitor gene expression levels, which are subsequently used to diagnose and treat a range of disorders. Microarrays, which assist in resolving many biological challenges, can yield significant data. Although the number of training and testing samples is small (fewer than 100 cases), the raw datasets contain between 6000 and 60,000 features. Since not all genes are informative, it is crucial to extract the informative ones, which makes feature selection essential. If researchers want to make sense of their data and find the genes that have important biological links to the classification [3][4][5] or clustering problem, the feature selection [6][7][8][9][10][11][12][13][14][15][16][17] approach is essential in these high-dimensional settings. Feature selection is required because not all of the features in a dataset are suitable for making predictions; it purges the dataset of unnecessary and redundant features. Microarray datasets can be evaluated using a variety of machine learning [18][19][20][21][22][23][24][25][26][27][28] techniques. Feature selection for microarray data is therefore our ultimate goal.
The primary objective is to use feature selection to identify the most appropriate number of features while maximizing classification accuracy, which can be characterised as a multi-objective combinatorial optimization problem. Eliminating irrelevant and superfluous features from the original datasets can, in some circumstances, enhance classification accuracy; in other words, the feature set can shrink while accuracy increases. Feature selection may considerably improve the accuracy of the final classifier and is also crucial in determining the appropriate subset of predictive features. A doctor, for instance, may base their assessment of the seriousness of a certain ailment on a classification made using the predetermined parameters. Focusing on feature selection thus improves prediction accuracy.
The search data generated during evolution has attracted substantial interest in evolutionary algorithms, and several data mining approaches have been developed to derive relevant information from this data to direct the evolutionary search. The differential evolution (DE) technique of Rainer Storn and Kenneth Price is used here as the foundation for improving feature selection. DE [29], a type of evolutionary algorithm, has been employed as a successful global optimizer. It is a reliable population-based stochastic method that excels in continuous spaces and has been applied successfully in several engineering fields, including communication, mechanical engineering, pattern detection, and water resource management. Feature selection is a search technique used to specify the feature subset, and it works in conjunction with an optimization algorithm that minimizes the error caused by selecting unnecessary features. Our contribution is to use DE within a feature selection approach to seek the best feature subset. We use the machine learning classification algorithms Logistic Regression (LR) [30], Decision Tree (DT) [31,32], and Random Forest (RF) [33], and compare fitness values and accuracy to determine which algorithm performs best.
The remainder of the paper is organized as follows: Section 2 contains the literature survey, Section 3 the methodology, Section 4 the result analysis, and Section 5 the conclusion.

Literature survey
Dash, R. et al. [34] use a wrapper strategy called population-based differential evolution gene selection (DEGS) to select features. Differential evolution is applied to the microarray dataset to assign ideal rankings to each gene by utilizing a probability distribution factor, with classification error as the fitness function. Pareto optimization is used with the objective functions of Information Gain and Signal-to-Noise Ratio. The selected genes are then verified using classifiers such as artificial neural network (ANN), K-nearest neighbour (KNN), Naïve Bayesian classifier (NB), and support vector machine (SVM). The technique was tested on datasets such as leukemia, DLBCL, prostate, and colon cancer, and the results showed that the SVM classifier performed better than the other classifiers.
Chuang LY, et al. [35] used DE to predict operons. An operon is a collection of genes that share the same promoter and are transcribed into the same mRNA. Two forms of operons that are helpful in prediction for Escherichia coli are the lac operon and the trp operon. The proposed technique yielded accuracy scores of 0.954, 0.954, 0.907, and 0.954 for the genomes of S. aureus, M. TB, B. subtilis, and P. aeruginosa PA01, respectively. Their strategy used only three variables, fewer than previous operon prediction techniques, yet it still achieved a balance between sensitivity and specificity. Since the prediction accuracy compares favourably with that of prior methodologies, the suggested method may be used to address related prediction problems. The authors plan future research that uses a broader variety of biological characteristics to predict operons and provides accompanying prediction data, in order to better understand the impact of additional variables on operon prediction.
Tasoulis, et al. [36] used EA and FNN to identify informative gene subsets that correctly describe each sample. The goal is often to reduce the original gene pool from 5000-10,000 or more genes to 50-100 genes. To choose these predictive genes and carry out dimension reduction, a number of gene selection techniques based on statistical analysis have been developed, including principal component analysis (PCA), information gain theory, and t-statistics. The authors worked on the colon and prostate datasets. It is clear that choosing which features to use is challenging and has a large impact on classification accuracy in general. Accuracy on training data is frequently very good, but it is seldom reproduced on testing data.
Zaho et al. [37] utilized three filter methods, namely Information gain, Fisher score, and T-statistics, to prevent overfitting by generating a simple ranking of features. The researchers gathered four datasets, namely GSE1456, GSE2034, GSE2990, and GSE4922, and used SVM with DE to classify the datasets using the feature subsets. The study found that Fisher score had the highest accuracy on the GSE2034 dataset when using a subset of the top 512 features, while the Information gain and T-statistics filters had lower accuracy rates of 76.4% and 78.8%, respectively, when applied to the same dataset with subsets of the top 64 and 128 features. Furthermore, Fisher score provided excellent results on the GSE2990 dataset, with an accuracy rate of 80.3%.
Vijaya et al. [38] proposed a hybrid technique that combines statistical analysis and evolutionary computation to select candidate genes from vast gene expression data with limited samples. The technique involves adjusting the scaling factor to dynamically evolve a mutant member that produces superior classification accuracy, thereby enhancing the convergence rate of the DE algorithm. Notably, the approach achieved classification results of 88.65%, 89.61%, and 95.13% for datasets related to DLBCL outcome, prostate outcome, and colon malignancy, respectively.
Dixit, A, et al. (2020) [39] used a blood cancer image dataset containing 231 images for image analysis. DE-SVM achieved a classification accuracy of 98.55%, compared with 86.96% for traditional SVM, and DE-NB achieved 100% accuracy, compared with 95.6% for standard NB.
Since DE has been used in many fields, it is up to researchers to apply and test it in different areas of work. Here we apply it to microarray datasets, which are high-dimensional and are used in the prediction of cancers and tumours. The research gap is that few researchers have worked on microarray datasets, because they take a huge amount of time to produce results. Selecting a minimal set of optimal features must be the ultimate goal, as it helps increase the accuracy of the model.

Methodology
In this section, we will cover machine learning (ML) algorithms, the differential evolution algorithm, and feature selection with the use of the differential evolution algorithm.

Machine learning algorithms
Decision Tree is a supervised learning algorithm that is applied to both classification and regression problems. It repeatedly splits the data according to conditions on the features, and it is among the most commonly used algorithms for classification. It has a tree-like structure in which internal nodes represent dataset attributes, branches represent decision rules, and leaf nodes represent outcomes.
RF is a supervised machine learning technique widely used for classification and prediction. It builds decision trees on a variety of samples and bases its decision on the majority vote of the trees for classification and on their average for regression. One of RF's key advantages is its ability to handle datasets with both continuous variables (as in regression) and categorical variables (as in classification). It performs better on classification problems.
Logistic regression is a supervised learning method commonly used for classification and predictive analytics. Using a number of independent variables, it estimates the probability of an event, such as pass or fail, or vote or no vote. Because its output is a probability bounded between 0 and 1, it is used to predict a categorical dependent variable, typically a binary (0/1 or Yes/No) event.

Differential evolution algorithm
DE is a population-based heuristic approach developed by Storn and Price in the 1990s to find the global optimum in continuous search spaces. DE maintains a population of candidate solutions, generates new candidates by combining existing ones through simple equations, and retains the candidate with the best fitness for the optimization task. Like GA, it has several parameters, yet it is fairly simple to implement. The performance of DE is determined by how the trial vector is generated and by its control parameters: population size (N), crossover rate (CR), and scaling factor (F). Its four stages are initialization, mutation, crossover, and selection.
However, selecting an appropriate strategy and the related parameters is difficult, since the ideal settings may differ between problems and even between stages of the evolution on the same problem. After initialization, DE is carried out in three key steps: (1) Mutation: a search operation that creates a mutant vector. (2) Crossover: a probabilistic operation that combines solutions in search of a better one. (3) Selection: an operation that narrows the search to a promising region.
In the initialization phase, a population of N individuals is generated randomly as $P = [X_1, X_2, X_3, \ldots, X_N]^T$. Each individual, or target vector (also called a genome), has dimension D and is defined as $X_i = [x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D}]$, where $i = 1, 2, 3, \ldots, N$ indexes the individuals in the population. The value of each individual $X_i$ lies in the range $[X_{min}, X_{max}]$.
The search space is bounded by the maximum and minimum values $X_{max} = [x_{max,1}, x_{max,2}, \ldots, x_{max,D}]$ and $X_{min} = [x_{min,1}, x_{min,2}, \ldots, x_{min,D}]$, respectively. At generation 1, feature d of the i-th individual is generated according to Eq. 1, where rand() produces a uniform random number in [0, 1] and $x_{max,d}$ and $x_{min,d}$ are the upper and lower bounds of feature d.

$$x^{1}_{i,d} = x_{min,d} + \text{rand}() \cdot (x_{max,d} - x_{min,d}) \tag{1}$$
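As an illustration of Eq. 1, the random initialization might be sketched in Python as follows. The function name `init_population` and the representation of the bounds as per-dimension lists are assumptions made for this example; they are not taken from the original implementation.

```python
import random

def init_population(n_individuals, lower, upper, rng=random):
    """Draw each component uniformly between its lower and upper bound (Eq. 1)."""
    return [
        [lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
        for _ in range(n_individuals)
    ]

# Example: 10 individuals in a 5-dimensional space bounded by [0, 1].
population = init_population(10, lower=[0.0] * 5, upper=[1.0] * 5)
```

Each row of `population` is one target vector $X_i$; passing a seeded `random.Random` instance as `rng` makes the draw reproducible.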
In the mutation phase, a mutant vector $V^{t}_{i} = [v^{t}_{i,1}, v^{t}_{i,2}, v^{t}_{i,3}, \ldots, v^{t}_{i,D}]$ is created at generation t as expressed in Eq. 2. Three random vectors $X^{t}_{i,r1}$, $X^{t}_{i,r2}$, and $X^{t}_{i,r3}$ are picked from the present population, where r1, r2, and r3 are three randomly chosen integers within the range [1, N]. These indices are drawn at random for each mutant vector. The mutant vector is calculated by taking the difference of two of these three vectors, scaling it by a factor F, and adding the scaled difference to the third vector.

$$V^{t}_{i} = X^{t}_{i,r1} + F \cdot (X^{t}_{i,r2} - X^{t}_{i,r3}) \tag{2}$$
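The mutation step of Eq. 2 can be sketched as below. The text only says that three random vectors are picked; this sketch additionally assumes that the three donors are mutually distinct and differ from the target index `i`, as in the common DE/rand/1 scheme, which requires a population of at least four. The function name `mutate` is hypothetical.

```python
import random

def mutate(population, i, F, rng=random):
    """Return the mutant vector for target i: base + F * (difference)."""
    # Assumption: donors are distinct and exclude the target itself.
    candidates = [k for k in range(len(population)) if k != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    base, a, b = population[r1], population[r2], population[r3]
    # Scale the difference of two donors by F and add it to the third.
    return [x + F * (y - z) for x, y, z in zip(base, a, b)]

population = [[0.1, 0.2], [0.4, 0.9], [0.7, 0.3], [0.5, 0.5]]
mutant = mutate(population, i=0, F=0.8)
```

With F = 0.8, as used later in the experimental setup, the difference vector is slightly shortened before it is added to the base vector.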
After the mutation operation, crossover is performed on the target vector and the mutant vector. In the crossover phase, components of the mutant vector $V^{t}_{i}$ are exchanged with those of the target vector $X^{t}_{i}$ to obtain the trial vector $Y^{t}_{i} = [y^{t}_{i,1}, y^{t}_{i,2}, y^{t}_{i,3}, \ldots, y^{t}_{i,D}]$. Crossover can be done in two ways: exponential and binomial. Exponential crossover generates a random integer P within the range [1, D]; the trial vector takes components 1 to P from the target vector and components P+1 to D from the mutant vector, as expressed by Eq. 3.

$$y^{t}_{i,j} = \begin{cases} x^{t}_{i,j} & \text{for } j \in [1, P] \\ v^{t}_{i,j} & \text{otherwise} \end{cases} \tag{3}$$

Binomial crossover is instead based on a random number generated for each component. The trial vector takes a component from the mutant vector if and only if the crossover rate (CR) is smaller than or equal to the random number generated; otherwise it keeps the component of the target vector, as expressed in Eq. 4. Here, rand() generates a random number between 0 and 1, and $j_{rand}$ is a randomly selected index within the range [1, D].

$$y^{t}_{i,j} = \begin{cases} x^{t}_{i,j} & \text{if } \text{rand}() \le CR \text{ or } j \ne j_{rand} \\ v^{t}_{i,j} & \text{otherwise} \end{cases} \tag{4}$$

In the selection phase, either the target vector or the trial vector is carried to the next iteration, based on their fitness. If the fitness of the trial vector is better than that of the target vector, the trial vector replaces the target vector in the next iteration, as expressed by Eq. 5 (written here for a minimization problem, with $f$ denoting the fitness function). Whether the minimum or maximum fitness value is preferred depends on whether the problem is a minimization or a maximization problem, respectively.

$$X^{t+1}_{i} = \begin{cases} Y^{t}_{i} & \text{if } f(Y^{t}_{i}) < f(X^{t}_{i}) \\ X^{t}_{i} & \text{otherwise} \end{cases} \tag{5}$$
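The crossover and selection steps can be sketched as follows. One caveat is important: for binomial crossover, this sketch uses the conventional Storn and Price form, in which a component is taken from the mutant when rand() ≤ CR or at one forced index. That direction is an assumption and should be checked against Eq. 4 before reuse. The function names are hypothetical, and selection assumes a minimization problem.

```python
import random

def binomial_crossover(target, mutant, CR, rng=random):
    """Mix target and mutant component-wise into a trial vector."""
    # Assumption: mutant component when rand() <= CR or at the forced
    # index j_rand, so at least one component comes from the mutant.
    j_rand = rng.randrange(len(target))
    return [
        v if (rng.random() <= CR or j == j_rand) else x
        for j, (x, v) in enumerate(zip(target, mutant))
    ]

def exponential_crossover(target, mutant, rng=random):
    """Components 1..P from the target, P+1..D from the mutant (Eq. 3)."""
    P = rng.randint(1, len(target))  # random cut point in [1, D]
    return target[:P] + mutant[P:]

def select(target, trial, fitness):
    """Keep the trial only if it improves the (minimized) fitness."""
    return trial if fitness(trial) < fitness(target) else target
```

In this conventional form, a high crossover rate such as CR = 0.9 means that most components of the trial vector come from the mutant.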
All the aforementioned steps of DE are executed for a certain number of iterations to obtain the optimal result. The number of iterations depends on the complexity of the problem.

Algorithm of DE
1. Initialize the population randomly with N individuals as $P = [X_1, X_2, X_3, \ldots, X_N]^T$, where each individual can be defined as $X_i = [x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D}]$ with $X_i \in [X_{min}, X_{max}]$, and set the scaling factor (F) and crossover rate (CR).
2. Calculate the fitness of the target vector
3. t = 1
4. while (t ≤ max_itr) do
5. for i = 1 to N do
6. Compute donor vector

Feature selection using differential evolution algorithm
FSDE is an evolutionary technique that starts from an initial solution and searches for the best subset of features until the stopping criterion is satisfied. Initially, a population of N individuals $P = [X_1, X_2, X_3, \ldots, X_N]^T$ is generated randomly; each individual has dimension D and is defined as $X_i = [x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D}]$, where $i = 1, 2, 3, \ldots, N$ indexes the individuals in the population. At generation 1, feature j of the i-th individual is generated using Eq. 6, where rand() produces uniform random numbers between 0 and 1 and $x_{max,j}$ and $x_{min,j}$ are the upper and lower bounds of feature j.

$$x^{1}_{i,j} = x_{min,j} + \text{rand}() \cdot (x_{max,j} - x_{min,j}) \tag{6}$$
To signify the acceptance or rejection of features, the generated populations are converted to 1 or 0, respectively. Then, using the selected features, we compute each individual's fitness.
In the next step, the mutant vector $V^{t}_{i} = [v^{t}_{i,1}, v^{t}_{i,2}, v^{t}_{i,3}, \ldots, v^{t}_{i,D}]$ is computed at generation t as expressed in Eq. 7. Here, three random vectors $X^{t}_{i,r1}$, $X^{t}_{i,r2}$, and $X^{t}_{i,r3}$ are chosen from the present population, where r1, r2, and r3 are randomly selected integers within the range [1, N]. The difference between two of these three vectors is scaled by a factor F, and the scaled difference is added to the third vector to create the mutant vector.

$$V^{t}_{i} = X^{t}_{i,r1} + F \cdot (X^{t}_{i,r2} - X^{t}_{i,r3}) \tag{7}$$
Each component of the mutant vector is limited to the range $v^{t}_{i,j} \in [V_{min}, V_{max}]$. If a component $v^{t}_{i,j}$ exceeds this range, the maximum or minimum value is used for that component instead, as presented in Eq. 8.
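A minimal sketch of this bounding step is given below. It assumes that Eq. 8 amounts to clipping each component to the nearest bound, which is the usual reading of such a limit; the function name `clamp_vector` is hypothetical.

```python
def clamp_vector(vector, v_min, v_max):
    """Clip every component of a mutant vector into [v_min, v_max]."""
    return [min(max(v, v_min), v_max) for v in vector]

clamped = clamp_vector([-0.3, 0.4, 1.7], v_min=0.0, v_max=1.0)
# -> [0.0, 0.4, 1.0]
```

Components already inside the range are returned unchanged, so the step only affects mutants that the scaled difference has pushed outside the search space.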
In the crossover stage, the trial vector $Y^{t}_{i} = [y^{t}_{i,1}, y^{t}_{i,2}, y^{t}_{i,3}, \ldots, y^{t}_{i,D}]$ is generated from the components of the mutant vector ($V^{t}_{i,j}$) and the target vector ($X^{t}_{i,j}$). Binomial crossover is performed based on random number generation: each component of the trial vector is taken from either the mutant vector or the target vector, depending on whether a random number is less than or equal to the crossover rate (CR), as expressed in Eq. 9. Here, rand() is a uniform random number between 0 and 1, and $j_{rand}$ is a randomly selected integer index within the range [1, D]. The crossover rate is a user-set parameter between 0 and 1 that determines how many parameters are derived from the mutant vector.
The probability of selecting or eliminating each feature of each individual of the trial vector is computed using a transfer function. As given in Eq. 10, the transfer function maps each continuous component to a probability that is used to obtain its binary equivalent.

$$P(y^{t}_{i,j}) = \frac{1}{1 + e^{-10\,(y^{t}_{i,j} - 0.5)}} \tag{10}$$

A component is accepted when $P(y^{t}_{i,j})$ is greater than or equal to a random number uniformly distributed between 0 and 1, as specified in Eq. 11.
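The transfer function and the acceptance rule can be sketched together as follows. The sketch assumes that an accepted component is encoded as 1 (feature kept) and a rejected one as 0, in line with the 1/0 encoding described earlier; the names `transfer` and `binarize` are hypothetical.

```python
import math
import random

def transfer(y):
    """S-shaped transfer function of Eq. 10, centred at y = 0.5."""
    return 1.0 / (1.0 + math.exp(-10.0 * (y - 0.5)))

def binarize(trial, rng=random):
    """Set a feature to 1 when its probability is >= a uniform draw, else 0."""
    return [1 if transfer(y) >= rng.random() else 0 for y in trial]

# Components are assumed to lie in a bounded range such as [0, 1].
mask = binarize([0.05, 0.5, 0.95])
```

Because the slope factor is 10, components near 0 are almost always rejected and components near 1 are almost always kept, while a component at 0.5 is kept about half the time.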
The fitness function calculates a fitness value for each candidate solution based on its ability to classify the data accurately. It is defined in Eq. 12, where N is the total number of instances in the testing data, Ac1[k] is the predicted class label for the k-th instance, and TrueClass[k] is the true (actual) class label of the k-th instance in the testing data. I is an indicator function that returns 1 if its argument is true (i.e., if the predicted class label is not equal to the true class label) and 0 otherwise. fit[i] stores the fitness values, and Y is the trial vector.
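A small sketch of this fitness computation is shown below. It assumes that the fitness is the misclassification rate, that is, the indicator sum divided by N; the normalisation is an assumption, since only the symbols of Eq. 12 are described in the text. The helper `select_columns`, which applies a binary mask to the data before a classifier is trained, is also hypothetical.

```python
def classification_error(predicted, true_labels):
    """Fraction of instances whose predicted label differs from the true one."""
    if len(predicted) != len(true_labels):
        raise ValueError("label lists must have the same length")
    # Indicator sum: count 1 for every mismatch, 0 otherwise.
    mismatches = sum(1 for p, t in zip(predicted, true_labels) if p != t)
    return mismatches / len(true_labels)

def select_columns(rows, mask):
    """Keep only the features whose mask bit is 1."""
    return [[x for x, keep in zip(row, mask) if keep] for row in rows]

error = classification_error([1, 0, 1, 1], [1, 1, 1, 0])  # 2 of 4 wrong -> 0.5
```

In a wrapper setting, the selected columns would be passed to DT, RF, or LR, and the error of the resulting predictions on the testing data would serve as the fitness to minimize.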
The fitness of each newly generated trial vector ($Y^{t}_{i}$) is computed and compared with that of the corresponding target vector ($X^{t}_{i}$), and the fitness values determine how individuals are updated.
The algorithm is shown below. The entire process is repeated for a fixed number of iterations (200 in our case) or until the stopping criteria are reached, to obtain the best set of features. The number of individuals was fixed at 10. Finally, the classification algorithm receives the selected set of features for performance assessment. Figure 1 shows a visual representation of the FSDE algorithm.

Algorithm of FS-DE
1. Initialize the population randomly with N individuals as $P = [X_1, X_2, X_3, \ldots, X_N]^T$, where each individual can be defined as $X_i = [x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D}]$ with $X_i \in [X_{min}, X_{max}]$, and set the scaling factor (F) and crossover rate (CR).
2. Compute the fitness of the target vector
3. t = 1
4. while (t ≤ max_itr) do
5. for i = 1 to N do
6. Compute donor vector $V_i = [v_{i,1}, v_{i,2}, v_{i,3}, \ldots, v_{i,D}]$ for the corresponding target vector $X^{t}_{i}$ by using the mutation operation
7. Limit the individuals of the mutant vector
8. Compute a trial vector $Y^{t}_{i} = [y^{t}_{i,1}, y^{t}_{i,2}, y^{t}_{i,3}, \ldots, y^{t}_{i,D}]$ for the corresponding target vector $X^{t}_{i}$ through the exponential or binomial crossover process (Eq. 9):

$$y^{t}_{i,j} = \begin{cases} x^{t}_{i,j} & \text{if } \text{rand}() \le CR \text{ or } j \ne j_{rand} \\ v^{t}_{i,j} & \text{otherwise} \end{cases} \tag{9}$$

9. Use the transfer function $P(y^{t}_{i,j}) = \frac{1}{1 + e^{-10\,(y^{t}_{i,j} - 0.5)}}$ (Eq. 10) for conversion of continuous features to binary features.
10. Assess the fitness of trial vector $Y^{t}_{i}$ and modify the target vector for the next generation
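The steps listed above can be combined into a compact Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the search space is taken to be [0, 1] in every dimension, the three donors are assumed distinct and different from the target, binomial crossover uses the conventional direction (mutant component when rand() ≤ CR or at one forced index), and the fitness is minimized. The toy objective `toy_error` is purely hypothetical and stands in for the test error of a classifier trained on the selected features.

```python
import math
import random

def fsde(fitness, dim, n_pop=10, F=0.8, CR=0.9, max_itr=200, seed=0):
    """Minimize fitness(mask) over binary feature masks with a DE search."""
    rng = random.Random(seed)

    def binarize(vec):
        # Transfer function (Eq. 10) followed by a uniform-draw threshold.
        return [1 if 1 / (1 + math.exp(-10 * (y - 0.5))) >= rng.random() else 0
                for y in vec]

    # Steps 1-2: random population in [0, 1] and its fitness.
    pop = [[rng.random() for _ in range(dim)] for _ in range(n_pop)]
    masks = [binarize(x) for x in pop]
    fits = [fitness(m) for m in masks]

    for _ in range(max_itr):                      # step 4
        for i in range(n_pop):                    # step 5
            # Step 6: mutation with three distinct donors other than i (assumed).
            r1, r2, r3 = rng.sample([k for k in range(n_pop) if k != i], 3)
            mutant = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j])
                      for j in range(dim)]
            # Step 7: limit the mutant to the search bounds.
            mutant = [min(max(v, 0.0), 1.0) for v in mutant]
            # Step 8: binomial crossover (conventional direction, assumed).
            j_rand = rng.randrange(dim)
            trial = [mutant[j] if (rng.random() <= CR or j == j_rand) else pop[i][j]
                     for j in range(dim)]
            # Steps 9-10: binarize, evaluate, and keep the better individual.
            trial_mask = binarize(trial)
            trial_fit = fitness(trial_mask)
            if trial_fit < fits[i]:
                pop[i], masks[i], fits[i] = trial, trial_mask, trial_fit

    best = min(range(n_pop), key=fits.__getitem__)
    return masks[best], fits[best]

# Hypothetical toy objective: only the first 3 of 12 features are informative.
informative = [1, 1, 1] + [0] * 9

def toy_error(mask):
    return sum(m != t for m, t in zip(mask, informative)) / len(mask)

best_mask, best_error = fsde(toy_error, dim=12, max_itr=50)
```

The default arguments mirror the reported settings (population size 10, F = 0.8, CR = 0.9, 200 iterations). Storing the binary mask and fitness alongside each continuous vector avoids re-drawing the stochastic binarization when an individual is compared again in a later generation.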

Result analysis
This section describes the datasets and how they were gathered, the experimental setup, and the experimental results.

Dataset
Microarray datasets may be found on a number of websites, including Kaggle, UCI, Mendeley Data, the Global Health Observatory Data Repository, and many others. We collected the datasets from the Mendeley Data website [40]: lymphoma, lung, breast, ovary, and CNS (central nervous system). These are very large datasets, with feature counts that typically range from 6000 to 60,000. Information about the datasets is provided in Table 1. High-dimensional features and a restricted number of samples are common characteristics of microarray data, which makes its classification a challenging task for machine learning researchers. Such data are useful for making diagnoses or identifying certain cancer or tumour types. Because the datasets contain a large number of features relative to the number of instances, overfitting is a concern. Overfitting occurs when a model learns to perform exceptionally well on the training data but struggles to generalize to new, unseen data because it has effectively memorized noise and patterns specific to the training set. A large number of features relative to the number of instances exacerbates this risk.
The large number of features in each dataset likely reflects the complexity and diversity of the data being analyzed. Each feature represents a characteristic or attribute of the data that may contribute to the classification or prediction task. In fields such as genomics, image analysis, or high-dimensional data analysis, it is common to deal with datasets that have thousands or even millions of features. These features could be measurements, gene expressions, image pixels, or any other relevant information that might contain valuable patterns. When a dataset has high dimensionality (many features) and a relatively small number of instances, the risk of overfitting is higher, because the model may find spurious correlations or noise in the data, leading to overly complex models that do not generalize well to new data. In Table 1, the "CNS", "Lymphoma", "Lung", "Breast", and "MLL" datasets all exhibit this characteristic, which could make them susceptible to overfitting if not addressed properly.
To mitigate overfitting in datasets with a large number of features relative to instances, feature selection is used to identify and retain only the most relevant and informative features. Removing irrelevant or redundant features can reduce the model's complexity and decrease the risk of overfitting.

Experimental setup
The experiment was carried out in Python using VS Code. The system comprised an AMD Ryzen 7 5000 series processor with 16 GB of RAM and AMD Radeon graphics. The population size was set to 10, the crossover rate (CR) to 0.9, the scaling factor (F) to 0.8, and the maximum number of iterations to 200.

Experimental results
The FSDE algorithm was used to select an optimal feature set. Details of the datasets are shown in Table 1. We used different medical datasets to test the DE algorithm: CNS, Lymphoma, Lung, Breast, and MLL. The classification algorithms used are DT, RF, and LR. The results make clear that the accuracy obtained using the DE algorithm is higher than the accuracy obtained without it.
Although the datasets have more features than instances, applying the DE algorithm reduced the number of features by almost 50%. The accuracy of the algorithms is shown in Figs. 2, 4, 6, 8, and 10, and the error plots are shown in Figs. 3, 5, 7, 9, and 11. Table 2 provides a detailed overview of the impact of feature selection (FS) on the performance of the machine learning algorithms across the datasets. Each row in the table represents a specific dataset, and within each dataset three machine learning algorithms are considered: Decision Tree (DT), Random Forest (RF), and Logistic Regression (LR). The "No. of features selected" column indicates the number of features chosen after applying feature selection. Feature selection is a crucial step aimed at reducing the number of features, and the table also reports the classification accuracy of each algorithm after feature selection has been applied to the dataset. This metric reflects the impact of feature selection on the model's predictive performance. In many cases, feature selection proves beneficial, improving classification accuracy compared with models trained without it. This is particularly evident in the "CNS" and "Lung" datasets, where all three algorithms exhibit substantial accuracy gains after feature selection. Different machine learning algorithms respond differently to feature selection. Logistic Regression benefits from feature selection in most datasets, achieving the highest accuracy gains. Random Forest also improves in accuracy in some cases but remains relatively stable in others. Decision Trees, on the other hand, exhibit mixed results, with some datasets showing substantial gains and others showing marginal or no improvement.

Conclusion
The chosen mutation approach and the values of its related parameters determine how well DE works. Depending on the nature of the problem and the available computational resources, different optimization problems require different mutation procedures with different parameter values. During different phases of the evolution, several mutation strategies with various parameter values could be more beneficial for the problem at hand than a single mutation strategy with a specific parameter setting, as in standard DE. Three algorithms, LR, RF, and DT, were employed. We used DE to choose the features and compared accuracy with and without FS. The findings shed light on the relationship between feature selection, machine learning algorithms, and classification accuracy across diverse datasets. The key takeaway is that the efficacy of feature selection is influenced by both the specific dataset under consideration and the choice of machine learning algorithm. Feature selection proves to be a valuable tool for enhancing classification accuracy in many instances, with significant performance gains observed in certain datasets. The sensitivity of different machine learning algorithms to feature selection further underscores the need for careful consideration when selecting the most suitable approach for a given dataset. Logistic Regression benefits from feature selection across multiple datasets, while Random Forest exhibits variable responses and Decision Trees show mixed outcomes. The demerit of FSDE is that its parameters need to be changed depending on the problem. FSDE can also be used in other fields, such as health care and medical data, image analysis, and financial and economic data. It is important to note that while DE can be applied to a variety of datasets, the choice of feature selection method should always align with the specific characteristics and objectives of the analysis. Additionally, the effectiveness of DE-based feature selection depends on parameter tuning, such as population size, mutation strategy, and stopping criteria, which may need to be adjusted to achieve the best results for a given dataset and task.