Combined Gene Selection Methods for Microarray Data Analysis

  • Hong Hu
  • Jiuyong Li
  • Hua Wang
  • Grant Daggard
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4251)


In recent years, the rapid development of DNA Microarray technology has made it possible for scientists to monitor the expression levels of thousands of genes in a single experiment. As a new technology, Microarray data presents fresh challenges because it contains a large number of genes (tens of thousands) but only a small number of samples (hundreds). Both filter and wrapper gene selection methods aim to select the most informative genes from this mass of data in order to reduce the size of the expression database. Gene selection methods are used in both the data preprocessing and the classification stages. We have conducted experiments in which different existing gene selection methods preprocess Microarray data for classification by the benchmark algorithms SVMs and C4.5. The study suggests that combining filter and wrapper methods generally improves the accuracy of gene expression Microarray data classification. The study also indicates that not all filter gene selection methods help improve classification performance. The experimental results show that, among the tested gene selection methods, Correlation Coefficient is the best for improving classification accuracy with both the SVMs and C4.5 classification algorithms.
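As a rough illustration of a filter step of the kind described above, the sketch below ranks genes by the absolute Pearson correlation between each gene's expression values and a binary class label, then keeps the top-ranked genes. The abstract does not define the Correlation Coefficient measure, so the Pearson form, the function names `pearson` and `rank_genes`, and the toy data are all assumptions made for illustration and not the authors' implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene carries no information about the class
    return cov / math.sqrt(var_x * var_y)


def rank_genes(samples, labels, top_k):
    """Return the indices of the top_k genes by |correlation| with the labels.

    samples: list of rows, one row per sample, one column per gene.
    labels:  list of numeric class labels (here 0/1), one per sample.
    """
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        column = [row[g] for row in samples]
        scores.append((abs(pearson(column, labels)), g))
    # Highest score first; ties broken by gene index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:top_k]]


# Hypothetical toy data: 4 samples, 3 genes, 2 classes.
samples = [
    [1.0, 5.0, 3.0],
    [2.0, 5.0, 1.0],
    [8.0, 5.0, 2.0],
    [9.0, 5.0, 4.0],
]
labels = [0, 0, 1, 1]
selected = rank_genes(samples, labels, top_k=2)
```

In a combined scheme, the genes kept by such a filter would then be passed to a wrapper method or directly to a classifier such as an SVM or C4.5; that second stage is not shown here.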


Keywords: Support Vector Machine; Microarray Data; Gene Selection; Microarray Data Analysis; Informative Gene





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hong Hu
    • 1
  • Jiuyong Li
    • 1
  • Hua Wang
    • 1
  • Grant Daggard
    • 2
  1. Department of Mathematics and Computing, University of Southern Queensland, Australia
  2. Department of Biological and Physical Sciences, University of Southern Queensland, Australia
