Optimizing Tree-Based Contrast Subspace Mining Using Genetic Algorithm

Mining contrast subspaces is the task of finding the subspaces in which a given query object is most similar to a target class but dissimilar to a non-target class in a multidimensional data set. Recently, a tree-based contrast subspace mining method has been introduced to find contrast subspaces in numerical data sets effectively. However, the contrast subspace search of the tree-based method may become trapped in local optima within the search space. This paper proposes a tree-based method that incorporates a genetic algorithm to optimize the contrast subspace search by identifying globally optimal contrast subspaces. The experimental results showed that the proposed method performed well in several cases compared to the variations of the tree-based method.


Introduction
Given a multidimensional data set comprising target and non-target classes, contrast subspace mining finds the contrast subspace of a query object. A contrast subspace of a query object is a subspace, or subset of features, in which the query object is most similar to the target class but dissimilar to the non-target class. A query object can be any object whose contrast subspace needs to be investigated. The identified contrast subspace gives insight into the query object with regard to the target class and the non-target class. Mining contrast subspaces has many important applications in fields such as disease diagnosis and fraud detection. For example, in disease diagnosis, a medical doctor may want to know the symptoms that make a patient most likely to belong to a target class of disease as opposed to another class of disease. The identified symptoms can help the doctor make an accurate diagnosis and then provide appropriate treatment to the patient. Similarly, in credit card fraud detection, an analyst may want to know the features that make a credit card transaction more similar to the fraud cases than to the normal cases. Those features can provide information about the case for further investigation.
A tree-based contrast subspace mining method has been introduced to identify the contrast subspace of a query object in a two-class multidimensional numerical data set [1,2]. The tree-based method uses a tree-based likelihood contrast scoring function to estimate the likelihood contrast score of subspaces with respect to a given query object, that is, the degree to which the query object is more similar to the target class than to the non-target class in a subspace. The tree-based method finds a subset of relevant features with high likelihood contrast scores and searches for highly scored contrast subspaces among the relevant features. To estimate the tree-based likelihood contrast score of a subspace, the subspace is recursively partitioned into two groups of data objects in which the target objects and non-target objects are well separated with respect to the query object, until the group contains only a single class or meets the minimum number of objects threshold. Accordingly, the tree-based likelihood contrast score of a subspace is the ratio of the probability of target objects to the probability of non-target objects in the group that contains the query object. Recently, a genetic algorithm-based method has been proposed to optimize the parameter setting of the tree-based method, which further improves the accuracy of the method. However, the genetic algorithm has not been used to optimize the contrast subspace search of the tree-based method. The tree-based contrast subspace mining method searches for the contrast subspaces of a query object within a fixed small set of relevant features. This makes the contrast subspace search more likely to be trapped in a local optimum within the search space, which may reduce the accuracy of the method in identifying the contrast subspace of the query object. Genetic algorithms have been widely applied in optimization research to find near-optimal solutions to problems [3][4][5][6][7].
In this paper, we propose a genetic tree-based method that incorporates a genetic algorithm to optimize the contrast subspace search of the tree-based method. That is, a population of candidate subsets of relevant features undergoes a series of evolution steps in which the tree-based likelihood contrast scores of the subspaces obtained from the subsets of features are maximized. Accordingly, the subspace search can be performed over a wide relevant feature space to find a globally optimal contrast subspace.
The organization of this paper is as follows. The second section presents the literature review. The third section describes the framework of the genetic tree-based contrast subspace mining method. This is followed by a section that presents the experimental design and analysis for evaluating the effectiveness of the genetic tree-based method in finding relevant contrast subspaces of a query object. The last section concludes the paper and outlines future work.

Related Works
To the best of our knowledge, only a few contrast subspace mining methods have been proposed in the literature.
CSMiner (Contrast Subspace Miner), which employs a density-based likelihood contrast scoring function, has been proposed to identify the contrast subspace of a query object in a numerical data set [8]. The density-based likelihood contrast scoring function estimates the likelihood contrast score of a subspace with respect to a query object as the ratio of the probability density of target objects to the probability density of non-target objects. A contrast subspace of a query object should have a high density-based likelihood contrast score. CSMiner searches the set of subspaces in a depth-first manner and prunes subspaces from the search space based on the upper bound of the probability density of target objects. However, it is inefficient for the large search spaces that arise from high dimensionality (i.e., a large number of features) of the data.
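The density ratio described above can be illustrated with a minimal sketch. The text does not specify the density estimator, so the Gaussian kernel, the fixed `bandwidth`, the `eps` guard against a zero denominator, and the function names are all assumptions made for illustration, not the CSMiner implementation.

```python
import math

def kernel_density(q, objects, subspace, bandwidth=1.0):
    """Gaussian-kernel density estimate at q, using only the features in `subspace`.

    Assumption: a product Gaussian kernel with one shared bandwidth.
    `subspace` is a list of 0-based column indices.
    """
    d = len(subspace)
    norm = len(objects) * (math.sqrt(2.0 * math.pi) * bandwidth) ** d
    total = 0.0
    for o in objects:
        sq_dist = sum((q[f] - o[f]) ** 2 for f in subspace)
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm

def density_contrast(q, targets, non_targets, subspace, bandwidth=1.0, eps=1e-12):
    """Ratio of target density to non-target density at q in the given subspace.

    Assumption: eps only keeps the ratio finite when the non-target density is 0.
    """
    num = kernel_density(q, targets, subspace, bandwidth)
    den = kernel_density(q, non_targets, subspace, bandwidth)
    return num / (den + eps)

# Toy usage: q lies among the target objects on feature 0.
q = [0.0, 0.0]
targets = [[0.0, 0.0], [0.1, 0.0]]
non_targets = [[5.0, 5.0], [6.0, 5.0]]
ratio = density_contrast(q, targets, non_targets, subspace=[0])
```

Because the estimate depends on distances accumulated over all features of the subspace, the raw density values shrink as features are added, which is the dimensionality effect discussed for the density-based methods.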
CSMiner-BPR (Contrast Subspace Miner-Bounding Pruning Refining) has been proposed to address the efficiency issue of CSMiner [9]. It searches the space of subspaces and prunes subspaces based on the upper bound of the probability density of target objects and the lower bound of the probability density of non-target objects within their neighborhood. This accelerates the mining process by saving the computation time for objects outside the neighborhood. Nevertheless, the density-based likelihood contrast scoring function involves a pairwise distance measure, which causes the score to decrease as the dimensionality of the subspace increases. It therefore requires an adjustment for the dimensionality of subspaces, which may affect the performance of contrast subspace mining.
TB-CSMiner (Tree-Based Contrast Subspace Miner) has been introduced, which employs a tree-based likelihood contrast scoring function. It uses the divide-and-conquer concept of the decision tree method, which is not affected by the dimensionality of the subspace [1]. For a subspace, the tree-based likelihood contrast scoring function attempts to group the query object with the target objects while separating it from the non-target objects. The ratio of target objects to non-target objects in the group is then computed. A high tree-based likelihood contrast score signifies that the subspace is a contrast subspace of the query object. TB-CSMiner avoids brute-force search by searching for subspaces in a space consisting only of relevant features. The effectiveness of TB-CSMiner is heavily dependent on its predefined parameter values. Hence, it is crucial to optimize the parameter setting to improve the performance of the method in identifying accurate contrast subspaces for a query object.
TB-CSMiner with optimized parameter values has been proposed, which uses a genetic algorithm to optimize the parameters for the particular data set at hand [2]. It generates an initial population of different sets of parameter values. The fitness of each set of parameter values is then assessed based on the accuracy of TB-CSMiner when it uses those parameter values to find contrast subspaces of the given query object. A subset of the sets of parameter values with high accuracy is selected and reproduced via crossover and mutation operations to generate a new population iteratively. At the end, the most accurate set of parameter values is returned as the best parameter setting for the TB-CSMiner method. That work differs from ours in that it focuses only on optimizing the parameter setting of the TB-CSMiner method, so its genetic algorithm is designed specifically to find the best parameter setting for TB-CSMiner. Another factor that might affect the effectiveness of the TB-CSMiner method is its subspace search strategy. TB-CSMiner searches for a potential contrast subspace within a fixed small set of relevant features. This makes the method more likely to return a locally optimal contrast subspace for the given query object.

Genetic Tree-Based Contrast Subspace Mining Method
A genetic algorithm is an evolutionary algorithm inspired by Darwinian natural selection and a computational model of the biological process of evolution [3][4][5][6][7]. Genetic algorithms are well known for finding good global solutions to various optimization problems. A genetic algorithm searches for the best possible solution in a pool of candidate solutions by examining the solutions with a fitness function. Multiple fitter solutions are kept and undergo evolution to generate new candidate solutions over several generations. This makes it more likely that a globally optimal solution is found for a problem in an acceptable time, although it is not guaranteed. Applying a genetic algorithm in the tree-based contrast subspace mining method enables the examination of a wider range of potential subspaces derived from the given full-dimensional data rather than from a fixed small subset of features. Hence, the genetic tree-based contrast subspace mining method employs a genetic algorithm to optimize the subspace search strategy in order to identify a globally optimal contrast subspace for the given query object in two-class multidimensional numerical data. Figure 1 illustrates the framework of the genetic tree-based contrast subspace mining method.
Given a two-class multidimensional numerical data set, a target class, and a query object, the genetic tree-based contrast subspace mining process begins by designing chromosomes that represent different subsets of l features. After that, an initial population of chromosomes is generated. The fitness of each chromosome in the population is evaluated with the tree-based likelihood contrast scoring function. Based on the fitness scores, several chromosomes in the population are selected into a new population by using the roulette wheel selection method. Then, the chromosomes in the new population are reproduced, first via the crossover operation and then via the mutation operation, to generate new chromosomes. The sequence of fitness evaluation, selection, crossover, and mutation is repeated until the maximum number of iterations µ is reached. Lastly, the h subspaces with the highest tree-based likelihood contrast scores are identified as the most relevant contrast subspaces of the query object. The following subsections describe the main stages in greater detail.
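The overall loop described in this paragraph can be sketched as follows. This is a compact illustration under stated assumptions, not the authors' implementation: the function name `run_ga`, the pluggable `fitness_fn`, the pairing of consecutive chromosomes for crossover, and the toy fitness function are all hypothetical, and fitness values are assumed to be strictly positive so that roulette wheel sampling is well defined.

```python
import random

def run_ga(fitness_fn, num_features, l, p, pc, pm, mu, h, seed=None):
    """Evolve subsets of l features for mu iterations and return the h best seen."""
    rng = random.Random(seed)
    features = list(range(1, num_features + 1))  # 1-based labels f1..fd
    population = [rng.sample(features, l) for _ in range(p)]
    scored = {}  # best fitness observed for each chromosome
    for _ in range(mu):
        # Fitness evaluation
        fitness = [fitness_fn(c) for c in population]
        for c, s in zip(population, fitness):
            key = tuple(c)
            scored[key] = max(s, scored.get(key, s))
        # Roulette wheel selection (fitness-proportionate sampling with replacement)
        population = [c[:] for c in rng.choices(population, weights=fitness, k=p)]
        # One-point crossover; assumption: consecutive chromosomes form the pairs
        for i in range(0, p - 1, 2):
            if rng.random() < pc:
                point = rng.randint(1, l - 1)
                a, b = population[i], population[i + 1]
                population[i] = a[:point] + b[point:]
                population[i + 1] = b[:point] + a[point:]
        # Mutation; duplicates inside a chromosome are not repaired in this sketch
        for c in population:
            for g in range(l):
                if rng.random() < pm:
                    c[g] = rng.choice(features)
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:h]

# Toy fitness (hypothetical): reward chromosomes that contain features f1..f3.
def toy_fitness(chromosome):
    return 1.0 + sum(1 for f in chromosome if f <= 3)

best = run_ga(toy_fitness, num_features=8, l=3, p=10, pc=0.8, pm=0.1, mu=20, h=2, seed=1)
```

In the method itself, the fitness of a chromosome would come from the tree-based likelihood contrast scoring function rather than from a toy function.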

Chromosome Representation
The representation of chromosomes is designed to correspond to different subsets of features from the full feature set in the data set.

Initial Population
An initial population consisting of p random chromosomes is generated. Each random chromosome represents a subset of features picked randomly from the collection of possible subsets of l features that can be derived from the full feature set of the data set, where l is less than the dimensionality of the full feature set.
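A minimal sketch of this step is shown below. It assumes that a chromosome is stored as a list of l distinct feature indices (1-based, so index 2 stands for f2); the function name `init_population` and the `seed` argument are hypothetical.

```python
import random

def init_population(num_features, l, p, seed=None):
    """Return p random chromosomes, each a list of l distinct feature indices."""
    if not 0 < l < num_features:
        raise ValueError("l must be positive and smaller than the number of features")
    rng = random.Random(seed)
    features = range(1, num_features + 1)
    # random.sample draws without replacement, so each chromosome is a valid subset
    return [rng.sample(features, l) for _ in range(p)]

population = init_population(num_features=9, l=5, p=4, seed=0)
```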

Fitness Evaluation
At this stage, the fitness of each chromosome is evaluated by assessing the contrast subspace obtained from the underlying subset of features with the tree-based likelihood contrast scoring function. Here, the tree-based likelihood contrast scores of t random subspaces, which are searched from the subset of features with respect to the query object, are estimated, where t > 1 is the number of random subspaces. The highest-scoring random subspace is then taken as the contrast subspace attained from the subset of features.
Specifically, the tree-based likelihood contrast score of a subspace is estimated as follows. Given a two-class d-dimensional numerical data set O comprising target objects O+ belonging to target class C+ and non-target objects O− belonging to non-target class C−, and a query object q, for a subset of features S_ub, the tree-based likelihood contrast scoring function constructs a half binary tree from the S_ub space. The tree construction starts by selecting a random feature f from S_ub; the value a of f that has the highest information gain is used to define the splitting criterion f ≤ a and f > a. Then, the splitting criterion is used to split the data objects into a left node containing the objects whose f value is at most a, and a right node containing the objects whose f value is greater than a. This process is performed recursively until the node contains only target objects or only non-target objects, or meets the minimum number of objects threshold MinObjs. The node at the bottom of the tree is known as the leaf node X_leaf. Lastly, the features involved in the tree construction constitute a subspace S derived from S_ub. The tree-based likelihood contrast score of S is measured as follows:

\[ \text{score}(S) = \frac{\mathrm{freq}(C^{+}, X_{leaf}) \,/\, \lVert O^{+} \rVert}{\mathrm{freq}(C^{-}, X_{leaf}) \,/\, \lVert O^{-} \rVert + \varepsilon} \]

where freq(C+, X_leaf) is the number of target objects in the leaf node, ||O+|| denotes the number of target objects in the data set, freq(C−, X_leaf) denotes the number of non-target objects in the leaf node, ||O−|| is the number of non-target objects in the data set, and ε is a small constant value. A high tree-based likelihood contrast score of a subspace indicates that the query object is more similar to the target class than to the non-target class in that subspace. The highest-scoring random subspace is then taken as the best contrast subspace for the query object that can be identified from the chromosome.
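The scoring procedure can be sketched as below. This is an illustration under several assumptions rather than the authors' code: only the branch containing q is expanded (one reading of "half binary tree"), candidate thresholds are the observed feature values, growth stops when a drawn feature cannot split the node, and ε is added to the non-target term of the ratio to keep it finite. Labels are 1 for target and 0 for non-target, features are 0-based column indices, and all function names are hypothetical.

```python
import math
import random

def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(x * math.log2(x) for x in (p, 1.0 - p) if x > 0)

def best_split(rows, labels, f):
    """Threshold a on feature f with the highest information gain (None if f is constant)."""
    base, best_gain, best_a = entropy(labels), -1.0, None
    for a in sorted({r[f] for r in rows})[:-1]:  # skip the max so the right node is never empty
        left = [y for r, y in zip(rows, labels) if r[f] <= a]
        right = [y for r, y in zip(rows, labels) if r[f] > a]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_gain, best_a = gain, a
    return best_a

def contrast_score(rows, labels, q, sub, min_objs=2, eps=1e-3, rng=random):
    """Return (score, subspace) for one random tree grown along the path of q."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    subspace = set()
    # Stop when the node is pure or holds at most min_objs objects
    while len(rows) > min_objs and 0 < sum(labels) < len(labels):
        f = rng.choice(sub)
        a = best_split(rows, labels, f)
        if a is None:  # assumption: stop if the drawn feature cannot split the node
            break
        subspace.add(f)
        go_left = q[f] <= a
        kept = [(r, y) for r, y in zip(rows, labels) if (r[f] <= a) == go_left]
        rows, labels = [r for r, _ in kept], [y for _, y in kept]
    leaf_pos = sum(labels)
    leaf_neg = len(labels) - leaf_pos
    score = (leaf_pos / n_pos) / (leaf_neg / n_neg + eps)
    return score, sorted(subspace)

def chromosome_fitness(rows, labels, q, chromosome, t=10, **kwargs):
    """Best (score, subspace) over t random trees grown from one chromosome."""
    trials = (contrast_score(rows, labels, q, chromosome, **kwargs) for _ in range(t))
    return max(trials, key=lambda result: result[0])
```

With a single feature that separates the classes perfectly, a query object on the target side ends in a pure target leaf, so its score is roughly 1/ε, while a query object on the non-target side scores 0.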

Selection
During the selection stage, a subset of chromosomes is selected from the current population using the roulette wheel selection method. Those chromosomes are then reproduced through the crossover and mutation operations to form a new population. The roulette wheel selection method first estimates the selection probability of each chromosome, that is, the proportion of the chromosome's fitness to the total fitness of the population, and subsequently the cumulative probability u_i after including the ith chromosome [10]. After that, a random number r is picked within the range 0 to 1. The ith chromosome is selected only if u_{i-1} < r ≤ u_i. This selection process continues until the new population consists of p chromosomes.
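The selection rule can be sketched as follows, assuming non-negative fitness values with a positive total. The function name `roulette_wheel_select` and the use of a binary search to locate the interval are illustrative choices.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def roulette_wheel_select(population, fitness, rng=random):
    """Select len(population) chromosomes with probability proportional to fitness."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    # cumulative[i] is u_i, the cumulative selection probability up to chromosome i
    cumulative = list(accumulate(f / total for f in fitness))
    selected = []
    for _ in range(len(population)):
        r = rng.random()
        # bisect_left finds the smallest i with u_i >= r, i.e. u_{i-1} < r <= u_i
        i = bisect_left(cumulative, r)
        i = min(i, len(population) - 1)  # guard against rounding in the last u_i
        selected.append(population[i])
    return selected
```

A chromosome may be selected more than once, which is how fitter feature subsets come to dominate the new population.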

Crossover
The commonly used one-point crossover operation, with a probability of crossover pc, is performed on the chromosomes in the new population to produce new chromosomes [11]. One-point crossover begins by choosing two parent chromosomes randomly from the newly generated population, followed by a random number r within the range 0 to 1. If r < pc, it randomly chooses a crossover point from 1 to (total genes − 1). The fragments of the parent chromosomes after the crossover point are interchanged to produce two new chromosomes. These chromosomes replace the parent chromosomes in the new population. The crossover operation is performed iteratively for the remaining parent chromosomes in the new population. An example of the one-point crossover operation on two parent chromosomes with crossover point 3 is illustrated in Fig. 3. The first parent chromosome represents the subset of features {f1, f2, f3, f4, f5} and the second parent chromosome represents the subset of features {f1, f2, f6, f7, f8}. After the crossover operation is performed at crossover point 3, the fragments of parent chromosomes 1 and 2 after the crossover point are exchanged. This creates offspring 1, which holds the subset of features {f1, f2, f3, f7, f8}, and offspring 2, which carries the subset of features {f1, f2, f6, f4, f5}.
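The operation and the worked example can be reproduced with a short sketch. Chromosomes are assumed to be lists of feature indices; the function names `crossover_at` and `one_point_crossover` are hypothetical.

```python
import random

def crossover_at(parent1, parent2, point):
    """Swap the gene fragments that follow the given crossover point."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def one_point_crossover(parent1, parent2, pc, rng=random):
    """Apply one-point crossover with probability pc, else copy the parents."""
    if rng.random() < pc:
        point = rng.randint(1, len(parent1) - 1)  # from 1 to total genes - 1
        return crossover_at(parent1, parent2, point)
    return parent1[:], parent2[:]

# The example from the text: parents {f1,f2,f3,f4,f5} and {f1,f2,f6,f7,f8}, point 3.
offspring1, offspring2 = crossover_at([1, 2, 3, 4, 5], [1, 2, 6, 7, 8], 3)
```

Note that exchanging fragments can place the same feature twice in one offspring when the parents overlap; this sketch does not repair such duplicates.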

Mutation
At this stage, the mutation operation, with a probability of mutation pm, is performed on the chromosomes in the new population [12]. The mutation operation starts with the first gene of a parent chromosome and chooses a random number r within the range 0 to 1. If r < pm, it mutates the gene by changing its value to the index position of another, randomly chosen feature. This process is repeated for the remaining genes of the chromosome. The parent chromosome with the mutated genes becomes a new chromosome that represents a new subset of features. The mutation operation is performed repeatedly for all remaining parent chromosomes in the new population. Figure 4 shows an example of the mutation operation on a parent chromosome representing the subset of features {f1, f2, f3, f4, f5}. In this example, the second gene of the parent chromosome is mutated, which changes the value of the gene from 2 to 6. This produces an offspring that carries the new subset of features {f1, f6, f3, f4, f5}.
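A gene-by-gene mutation of this kind can be sketched as follows. The text only says that a mutated gene takes the index of another random feature; drawing the replacement from features not already in the chromosome, so that the result stays a valid subset, is an assumption of this sketch, as is the function name `mutate`.

```python
import random

def mutate(chromosome, num_features, pm, rng=random):
    """Return a copy of the chromosome in which each gene is mutated with probability pm."""
    child = chromosome[:]
    for i in range(len(child)):
        if rng.random() < pm:
            # Assumption: replace the gene with a feature not yet in the chromosome
            unused = [f for f in range(1, num_features + 1) if f not in child]
            if unused:
                child[i] = rng.choice(unused)
    return child

# With pm = 0 nothing changes; with pm = 1 every gene is replaced.
unchanged = mutate([1, 2, 3, 4, 5], num_features=8, pm=0.0)
```

In the example from the text, mutating only the second gene of {f1, f2, f3, f4, f5} from 2 to 6 corresponds to the list [1, 6, 3, 4, 5].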

Experimental Setup and Analysis
An experiment is carried out to evaluate the accuracy of the genetic tree-based contrast subspace mining method by comparing it with the TB-CSMiner method (i.e., without a genetic algorithm) and the TB-CSMiner method with optimized parameter setting in finding contrast subspaces of a query object. The experiment is conducted on six real-world multidimensional numerical data sets from the UCI machine learning repository, namely the Breast Cancer Wisconsin (BCW), Wine, Pima Indian Diabetes (PID), Glass Identification (Glass), Climate Model Simulation Crashes (CMSC), and Waveform (Wave) data sets [13]. Table 1 tabulates the details of the data sets. Since no ground truth contrast subspace is provided for the real-world two-class multidimensional numerical data sets, the accuracy of the method is assessed based on the classification accuracy on the data set projected onto the contrast subspace, as suggested in [1,2]. For the genetic tree-based method, the experiment uses a parameter setting that has often been found to perform well in optimization problems [14,15]. The parameter setting of the genetic tree-based method is shown in Table 2.
In addition, it uses the best minimum number of objects MinObjs, small constant value ε, and number of relevant features l for each data set, which were identified in the previous work, as shown in Table 3 [2]. However, a smaller number of random subspaces, t = 10, is used to accelerate the contrast subspace mining process. The genetic tree-based method is implemented in MATLAB 9.2, and the classification accuracy evaluation is implemented in Java.
The procedure of this experiment is as follows. For each data set, all objects are taken as query objects in turn. The class of the query object is assigned as the target class. For a query object and a target class, the genetic tree-based method is run on the data set. Here, only the one contrast subspace with the highest tree-based likelihood contrast score is considered. This process is repeated for the remaining query objects. After the contrast subspaces of all query objects have been identified, the classification accuracy of each contrast subspace with respect to its query object is assessed. For the contrast subspace of a query object, the data set is first projected onto the contrast subspace. Then, the projected data set is fed into several classifiers in WEKA, namely J48 (decision tree), NB (naive Bayes), SVM (support vector machine), and RF (random forest), to perform classification based on 20-fold cross-validation. Lastly, the classification accuracies on the contrast subspaces of all query objects are averaged for each of the classifiers.
Meanwhile, the default parameter settings suggested in the previous works are used for both the tree-based method and the tree-based method with optimized parameter setting [1,2]. Table 4 presents the average classification accuracy (in percent) on the BCW, PID, Wine, Glass, CMSC, and Wave data sets for the classifiers J48, NB, SVM, and RF.
Based on the results, the genetic tree-based method identified contrast subspaces that attained higher classification accuracy than the tree-based method with optimized parameter setting (OPS) for NB and SVM on the BCW data set. The respective classification accuracies are 99.44% and 96.59%. The genetic tree-based method also identified contrast subspaces with higher classification accuracy, 96.05%, for SVM on the Wine data set. Likewise, the genetic tree-based method produced contrast subspaces with higher classification accuracy than the tree-based method with OPS for J48 and RF on the Glass data set, obtaining 85.18% and 87.33% for J48 and RF, respectively. In addition, it found contrast subspaces that achieved a higher accuracy of 97.77% for J48 on the Wave data set.
Overall, the genetic tree-based method demonstrated good results in only a few cases. This is mainly because the parameter setting of the genetic algorithm, which includes the population size, the probability of crossover, and the probability of mutation, is not optimized for mining contrast subspaces.

Conclusion
The proposed genetic tree-based contrast subspace mining method employs a genetic algorithm to optimize the process of searching for contrast subspaces of a given query object in a two-class multidimensional numerical data set. For a query object, a sequence of populations of subspaces is generated from an initial population of random subspaces. Over the generations, the tree-based likelihood contrast scores of the subspaces in a population with respect to the query object are assessed. Highly scored subspaces, as potential contrast subspaces of the query object, are passed on from one population to the subsequent population. This preserves the best subspaces identified so far and thus steers the search toward optimal contrast subspaces for the query object. At the end, the highest-scoring subspaces are taken as the best contrast subspaces of the given query object. The empirical studies showed that the genetic tree-based method performed well in some cases compared to both benchmark tree-based methods in finding contrast subspaces of query objects on multidimensional numerical data sets. The parameter setting of the genetic algorithm may affect the effectiveness of the genetic tree-based method; however, that parameter setting has not been optimized for identifying the contrast subspace of a query object. Future work will aim to optimize the parameter setting of the genetic algorithm to further improve the performance of the genetic tree-based contrast subspace mining method.

Author Contributions
FS and RA have contributed to the conception, design, analysis, and writing of this manuscript. All authors read and approved the final manuscript.

Funding
No funding was received for conducting this study.

Availability of Data and Materials
The datasets analysed during the current study are available from the corresponding author on reasonable request.

Conflict of Interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethics Approval
Not applicable.

Consent for Publication
Publisher has the author's permission to publish the content of this article.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.