Active Mining Discriminative Gene Sets

  • Feng Chu
  • Lipo Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4029)


Searching for good discriminative gene sets (DGSs) in microarray data is important for many problems, such as precise cancer diagnosis, correct treatment selection, and drug discovery. Small, good DGSs help researchers eliminate “irrelevant” genes and focus on “critical” genes that may be used as biomarkers or that are related to the development of cancers. In addition, small DGSs do not impose demanding requirements on classifiers, such as high-speed CPUs or large memory. Furthermore, if DGSs are used as diagnostic measures in the future, small DGSs will simplify the test and therefore reduce its cost. Here, we propose an algorithm for searching for DGSs, which we call active mining discriminative gene sets (AM-DGS). The search scheme of AM-DGS is as follows. A gene with a large t-statistic is assigned as the seed, i.e., the first feature of the DGS. We then classify the samples in the data set using a support vector machine (SVM). Next, we add to the DGS the gene with the greatest power to correct the misclassified samples, that is, the gene with the largest t-statistic evaluated on only the misclassified samples. We keep adding genes to the DGS according to the SVM’s misclassified samples until no error remains or overfitting occurs. We tested the proposed method on the well-known leukemia data set. On this data set, our method obtained two 2-gene DGSs that achieved 94.1% testing accuracy and a 4-gene DGS that achieved 97.1% testing accuracy. This result shows that our method obtained better accuracy with much smaller DGSs than three widely used methods, i.e., t-statistics, F-statistics, and SVM-based recursive feature elimination (SVM-RFE).
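
The search scheme described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under several assumptions that do not come from the paper: the t-statistic is taken to be the absolute Welch two-sample statistic, the seed is simply the top-ranked gene, a nearest-centroid rule stands in for the SVM so that the example needs only the Python standard library, and a fixed cap (max_genes) stands in for the overfitting check. All function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(values_a, values_b):
    """Absolute Welch two-sample t-statistic (assumed form, not from the paper)."""
    if len(values_a) < 2 or len(values_b) < 2:
        return 0.0  # too few samples in one class to estimate a variance
    se = sqrt(variance(values_a) / len(values_a) + variance(values_b) / len(values_b))
    return abs(mean(values_a) - mean(values_b)) / se if se > 0 else 0.0


def rank_gene(X, y, sample_idx, excluded):
    """Gene with the largest t-statistic over the given samples, or None."""
    best_gene, best_t = None, 0.0
    for g in range(len(X[0])):
        if g in excluded:
            continue
        a = [X[i][g] for i in sample_idx if y[i] == 0]
        b = [X[i][g] for i in sample_idx if y[i] == 1]
        t = t_statistic(a, b)
        if t > best_t:
            best_gene, best_t = g, t
    return best_gene


def nearest_centroid_predict(X_train, y_train, X_test, genes):
    """Stand-in classifier (NOT an SVM): assign each sample to the closer class centroid."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [mean(r[g] for r in rows) for g in genes]
    preds = []
    for x in X_test:
        dist = {lab: sum((x[g] - c) ** 2 for g, c in zip(genes, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


def am_dgs_sketch(X, y, predict=nearest_centroid_predict, max_genes=10):
    """Greedy loop: seed gene, then add genes ranked on misclassified samples only."""
    all_idx = list(range(len(X)))
    genes = [rank_gene(X, y, all_idx, set())]  # seed: top-ranked gene (assumption)
    while len(genes) < max_genes:  # cap is a crude stand-in for an overfitting check
        preds = predict(X, y, X, genes)
        wrong = [i for i in all_idx if preds[i] != y[i]]
        if not wrong:
            break  # no training error remains
        next_gene = rank_gene(X, y, wrong, set(genes))
        if next_gene is None:
            break  # statistic undefined on the misclassified samples
        genes.append(next_gene)
    return genes


# Toy data (hypothetical): 12 samples x 3 genes, first six samples are class 0.
gene0 = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 4.0, 4.2, 3.8, 4.1, 2.0, 1.8]
gene1 = [2.0, 2.0, 2.0, 2.0, 0.0, 0.2, 2.0, 2.0, 2.0, 2.0, 4.0, 3.8]
gene2 = [1.0, 3.0, 2.0, 4.0, 2.5, 1.5, 3.0, 1.0, 4.0, 2.0, 1.5, 2.5]
X = [list(row) for row in zip(gene0, gene1, gene2)]
y = [0] * 6 + [1] * 6

selected = am_dgs_sketch(X, y)
print("selected gene indices:", selected)
```

In this toy example the seed gene alone leaves some samples misclassified, and the second gene is chosen because it separates exactly those samples. To follow the abstract more closely, the predict argument could be replaced by a function that trains and applies an SVM, and the cap could be replaced by a check on held-out accuracy.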


Support Vector Machine; Acute Lymphoblastic Leukemia; Testing Accuracy; Correction Score; Sequential Minimum Optimization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Feng Chu (1, 2)
  • Lipo Wang (1, 2)
  1. College of Information Engineering, Xiangtan University, Xiangtan, Hunan, China
  2. School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
