Feature Elimination Approach Based on Random Forest for Cancer Diagnosis

  • Ha-Nam Nguyen
  • Trung-Nghia Vu
  • Syng-Yup Ohn
  • Young-Mee Park
  • Mi Young Han
  • Chul Woo Kim
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4293)


The performance of a learning task is very sensitive to the characteristics of the training data. There are several ways to improve learning performance, including standardization, normalization, signal enhancement, and linear or non-linear space embedding methods. Among these, determining the relevant and informative features is one of the key steps in the data analysis process: it helps to improve performance, reduce the generation of data, and understand the characteristics of the data. Researchers have developed various methods to extract the set of relevant features, but no single method prevails. Random Forest, an ensemble classifier built from a set of tree classifiers, yields good classification performance. Taking advantage of Random Forest and using the wrapper approach first introduced by Kohavi et al., we propose a new algorithm to find the optimal subset of features. Random Forest is used to obtain feature ranking values, and these values are used to decide which features are eliminated in each iteration of the algorithm. We conducted experiments with two public datasets: colon cancer and leukemia. The experimental results on real-world data showed that the proposed method achieves a higher prediction rate than a baseline method for certain data sets, and shows comparable and sometimes better performance than widely used feature selection methods.
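The elimination loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the forest configuration, the evaluation protocol, or the number of features dropped per iteration. Here a toy forest of bagged depth-1 trees stands in for a full Random Forest, importance is the accumulated Gini decrease, out-of-bag accuracy serves as the wrapper score, and one feature is dropped per round. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def best_split(X, y, f):
    """Best threshold on feature f as (Gini decrease, threshold); (0.0, None) if none helps."""
    parent, best = gini(y), (0.0, None)
    values = sorted({row[f] for row in X})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [c for row, c in zip(X, y) if row[f] <= t]
        right = [c for row, c in zip(X, y) if row[f] > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best[0]:
            best = (parent - child, t)
    return best


def stump_forest(X, y, features, n_trees=50, rng=None):
    """Toy stand-in for a Random Forest: bagged depth-1 trees.

    Returns (importance per feature, out-of-bag accuracy).
    """
    if rng is None:
        rng = random.Random(0)
    n = len(X)
    importance = {f: 0.0 for f in features}
    votes = [Counter() for _ in range(n)]
    m = max(1, int(len(features) ** 0.5))  # features tried per tree
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        splits = [(best_split(Xb, yb, f), f) for f in rng.sample(features, m)]
        (gain, t), f = max(splits, key=lambda s: s[0][0])
        if t is None:  # no useful split in this bootstrap sample
            continue
        importance[f] += gain
        left = Counter(c for row, c in zip(Xb, yb) if row[f] <= t).most_common(1)[0][0]
        right = Counter(c for row, c in zip(Xb, yb) if row[f] > t).most_common(1)[0][0]
        for i in set(range(n)) - set(idx):  # out-of-bag rows vote
            votes[i][left if X[i][f] <= t else right] += 1
    voted = [i for i in range(n) if votes[i]]
    hits = sum(votes[i].most_common(1)[0][0] == y[i] for i in voted)
    return importance, (hits / len(voted) if voted else 0.0)


def eliminate_features(X, y, drop_per_round=1, seed=0):
    """Backward elimination driven by the forest's feature ranking."""
    rng = random.Random(seed)
    features = list(range(len(X[0])))
    best_subset, best_acc = list(features), -1.0
    while features:
        importance, acc = stump_forest(X, y, features, rng=rng)
        if acc >= best_acc:  # on ties, prefer the smaller subset
            best_subset, best_acc = list(features), acc
        ranked = sorted(features, key=lambda f: importance[f])
        dropped = set(ranked[:drop_per_round])  # remove the lowest-ranked features
        features = [f for f in features if f not in dropped]
    return best_subset, best_acc


# Toy data (invented for illustration): feature 0 separates the classes,
# features 1 and 2 are uninformative.
y_toy = [i // 20 for i in range(40)]
X_toy = [[y_toy[i] + 0.01 * (i % 20), i % 2, (i // 2) % 2] for i in range(40)]
subset, accuracy = eliminate_features(X_toy, y_toy)
print(subset, accuracy)
```

Scoring each candidate subset with the same learner that produces the ranking is what makes this a wrapper method rather than a filter. On real microarray data, with thousands of genes, one would more plausibly use a full Random Forest and drop a fraction of the features per round, rather than one at a time.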


Feature Selection, Random Forest, Feature Subset, Feature Selection Method, Gini Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence, 273–324 (1997)
  2. Blum, A.L., Langley, P.: Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 245–271 (1997)
  3. Breiman, L.: Random Forests. Machine Learning 45, 5–32 (2001)
  4. Torkkola, K., Venkatesan, S., Liu, H.: Sensor Selection for Maneuver Classification. In: Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems, pp. 636–641 (2004)
  5. Wu, Y., Zhang, A.: Feature Selection for Classifying High-Dimensional Numerical Data. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 251–258 (2004)
  6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, Chichester (2001)
  7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman and Hall, New York (1984)
  8. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, J.P., Mesirov, J., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, 531–537 (1999)
  9. Fröhlich, H., Chapelle, O., Schölkopf, B.: Feature Selection for Support Vector Machines by Means of Genetic Algorithms. In: 15th IEEE International Conference on Tools with Artificial Intelligence, p. 142 (2003)
  10. Chen, X.-w.: Gene Selection for Cancer Classification Using Bootstrapped Genetic Algorithms and Support Vector Machines. In: IEEE Computer Society Bioinformatics Conference, p. 504 (2003)
  11. Zhang, H., Yu, C.-Y., Singer, B.: Cell and Tumor Classification Using Gene Expression Data: Construction of Forests. Proceedings of the National Academy of Sciences of the United States of America 100, 4168–4172 (2003)
  12. Doak, J.: An Evaluation of Feature Selection Methods and Their Application to Computer Security. Technical Report CSE-92-18, Department of Computer Science and Engineering, University of California (1992)
  13. Das, S.: Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection. In: Proceedings of the 18th ICML (2001)
  14. Ng, A.Y.: On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples. In: Proceedings of the Fifteenth International Conference on Machine Learning (1998)
  15. Xing, E., Jordan, M., Karp, R.: Feature Selection for High-Dimensional Genomic Microarray Data. In: Proceedings of the 18th ICML (2001)
  16. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Proceedings of the International Conference on Extending Database Technology, pp. 18–32 (1996)
  17. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proceedings of the National Academy of Sciences of the United States of America 96, 6745–6750 (1999)
  18. Nguyen, H.-N., Ohn, S.-Y., Park, J., Park, K.-S.: Combined Kernel Function Approach in SVM for Diagnosis of Cancer. In: Proceedings of the First International Conference on Natural Computation (2005)
  19. Su, T., Basu, M., Toure, A.: Multi-Domain Gating Network for Classification of Cancer Cells Using Gene Expression Data. In: Proceedings of the International Joint Conference on Neural Networks, pp. 286–289 (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ha-Nam Nguyen (1)
  • Trung-Nghia Vu (1)
  • Syng-Yup Ohn (1)
  • Young-Mee Park (2)
  • Mi Young Han (3)
  • Chul Woo Kim (4)

  1. Dept. of Computer and Information Engineering, Hankuk Aviation University, Seoul, Korea
  2. Dept. of Cell Stress Biology, Roswell Park Cancer Institute, SUNY Buffalo, USA
  3. Bioinfra Inc., Seoul, Korea
  4. Dept. of Pathology, Tumor Immunity Medical Research Center, Seoul National University College of Medicine, Seoul, Korea
