Active Mining Discriminative Gene Sets
Searching for good discriminative gene sets (DGSs) in microarray data is important for many problems, such as precise cancer diagnosis, correct treatment selection, and drug discovery. Small, good DGSs can help researchers eliminate “irrelevant” genes and focus on “critical” genes that may be used as biomarkers or that are related to the development of cancers. In addition, small DGSs do not impose demanding requirements on classifiers, e.g., high-speed CPUs or large amounts of memory. Furthermore, if DGSs are used as diagnostic measures in the future, small DGSs will simplify the test and therefore reduce its cost. Here, we propose an algorithm for finding DGSs, which we call active mining discriminative gene sets (AM-DGS). The search scheme of AM-DGS is as follows. A gene with a large t-statistic is assigned as the seed, i.e., the first feature of the DGS. We then classify the samples in the data set using a support vector machine (SVM). Next, we add to the DGS the gene with the greatest power to correct the misclassified samples, that is, the gene with the largest t-statistic evaluated on only the misclassified samples. We keep adding genes to the DGS according to the SVM’s misclassified samples until no error remains or overfitting occurs. We tested the proposed method on the well-known leukemia data set. On this data set, our method obtained two 2-gene DGSs that achieved 94.1% testing accuracy and a 4-gene DGS that achieved 97.1% testing accuracy. This result shows that our method obtained better accuracy with much smaller DGSs than three widely used methods, i.e., t-statistics, F-statistics, and SVM-based recursive feature elimination (SVM-RFE).
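The search scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`t_statistic`, `best_gene`, `nearest_centroid_predict`, `am_dgs`) are hypothetical, class labels are assumed to be binary (0 and 1), a simple nearest-centroid classifier stands in for the SVM so that the example needs only the standard library, the Welch form of the t-statistic is assumed, and a `max_genes` cap stands in for the overfitting check, which the abstract does not specify.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Absolute Welch t-statistic between two groups of expression values."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # assumption: too few samples to estimate a variance
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def best_gene(X, y, rows, candidates):
    """Return (gene, score) for the candidate gene with the largest
    t-statistic, computed only over the sample indices in `rows`."""
    best, best_score = None, -1.0
    for g in sorted(candidates):  # sorted for deterministic tie-breaking
        a = [X[i][g] for i in rows if y[i] == 0]
        b = [X[i][g] for i in rows if y[i] == 1]
        score = t_statistic(a, b)
        if score > best_score:
            best, best_score = g, score
    return best, best_score


def nearest_centroid_predict(X_train, y_train, X_eval, genes):
    """Stand-in classifier (NOT an SVM): assign each sample to the class
    whose centroid is closest, using only the selected genes."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[g] for r in rows) for g in genes]
    preds = []
    for x in X_eval:
        dist = {
            label: sum((x[g] - c) ** 2 for g, c in zip(genes, cen))
            for label, cen in centroids.items()
        }
        preds.append(min(dist, key=dist.get))
    return preds


def am_dgs(X, y, max_genes=10, predict=nearest_centroid_predict):
    """Greedy seed-and-correct gene selection.

    X is a list of samples (each a list of expression values) and y the
    matching list of 0/1 labels. Returns the selected gene indices.
    """
    all_rows = range(len(X))
    remaining = set(range(len(X[0])))

    # Seed: the gene with the largest t-statistic over all samples.
    seed, _ = best_gene(X, y, all_rows, remaining)
    selected = [seed]
    remaining.discard(seed)

    while len(selected) < max_genes and remaining:
        preds = predict(X, y, X, selected)
        wrong = [i for i in all_rows if preds[i] != y[i]]
        if not wrong:
            break  # no training errors left
        # Next gene: largest t-statistic on the misclassified samples only.
        gene, score = best_gene(X, y, wrong, remaining)
        if score == 0.0:
            break  # assumption: stop when no candidate separates the errors
        selected.append(gene)
        remaining.discard(gene)
    return selected
```

Calling `am_dgs(X, y)` returns the gene indices in the order they were added, seed first. Any classifier with the same call signature as `nearest_centroid_predict` can be passed through the `predict` parameter, which is where an SVM would be substituted to follow the description more closely.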
Keywords: Support Vector Machine, Acute Lymphoblastic Leukemia, Testing Accuracy, Correction Score, Sequential Minimum Optimization