Gene Selection for Cancer Classification using Support Vector Machines

Abstract

DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to determine whether cancer tissues have gene expression signatures that distinguish them from normal tissues or from other types of cancer tissues.

In this paper, we address the problem of selecting a small subset of genes from broad patterns of gene expression data recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection that uses Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer.
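The idea of recursive feature elimination with a linear SVM can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ranking criterion is the squared weight of each gene in a linear SVM, removes one gene per round, and trains the SVM with a simple Pegasos-style subgradient solver chosen only to keep the example dependency-free. The function names (`train_linear_svm`, `svm_rfe`), the hyperparameters, and the toy expression matrix are all hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Fit a linear soft-margin SVM (no bias term) by Pegasos-style
    stochastic subgradient descent. Labels in y must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink the weights (step on the L2 penalty; eta * lam == 1 / t).
            w = [(1.0 - 1.0 / t) * wj for wj in w]
            # Hinge-loss subgradient step when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def svm_rfe(X, y):
    """Rank feature (gene) indices from most to least useful.

    Each round retrains the SVM on the surviving genes and eliminates
    the gene with the smallest squared weight (assumed criterion)."""
    surviving = list(range(len(X[0])))
    eliminated = []
    while len(surviving) > 1:
        X_sub = [[row[j] for j in surviving] for row in X]
        w = train_linear_svm(X_sub, y)
        worst = min(range(len(surviving)), key=lambda k: w[k] ** 2)
        eliminated.append(surviving.pop(worst))
    # The last survivor is the best gene; earlier eliminations rank lower.
    return surviving + eliminated[::-1]


# Hypothetical toy data: rows are tissue samples, columns are genes.
# Only column 0 separates the two classes; the others are small noise.
X = [
    [2.0, 0.1, 0.3, -0.2],
    [1.5, -0.2, -0.1, 0.4],
    [2.5, 0.0, 0.2, 0.1],
    [1.8, 0.3, -0.3, -0.1],
    [-2.0, 0.2, 0.1, 0.3],
    [-1.5, -0.1, -0.2, -0.4],
    [-2.5, 0.1, 0.4, 0.0],
    [-1.7, -0.3, 0.0, 0.2],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

ranking = svm_rfe(X, y)
print("genes ranked best to worst:", ranking)
```

On this toy matrix the informative column 0 is ranked first. Because the SVM is retrained after every elimination, the weights of the remaining genes are re-estimated each round, which is what distinguishes this procedure from ranking every gene once by an individual score.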

In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia, our method discovered 2 genes that yield zero leave-one-out error, whereas the baseline method needs 64 genes to reach its best result (one leave-one-out error). In the colon cancer database, using only 4 genes, our method is 98% accurate, while the baseline method is only 86% accurate.
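The leave-one-out error quoted above is the number of samples misclassified when each sample is held out in turn and the classifier is trained on the remaining ones. The sketch below shows that counting procedure for a fixed gene subset. It is an illustration under stated assumptions: a nearest-centroid classifier stands in for the SVM to keep the code short, the gene subset is assumed to be chosen beforehand (the abstract does not say whether selection is repeated inside each fold), and the function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label (-1 or +1) whose class centroid is closest to x.
    Stand-in classifier for illustration; not the classifier of the paper."""
    best_label, best_dist = None, None
    for label in (-1, 1):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_errors(X, y, genes):
    """Count misclassified samples when each sample is held out once,
    using only the gene columns listed in `genes`."""
    Xs = [[row[j] for j in genes] for row in X]
    errors = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, Xs[i]) != y[i]:
            errors += 1
    return errors


# Hypothetical toy data: column 0 is informative, column 1 is noise.
X = [
    [2.0, 0.1], [1.5, -0.2], [2.5, 0.0], [1.8, 0.3],
    [-2.0, 0.2], [-1.5, -0.1], [-2.5, 0.1], [-1.7, -0.3],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

errors_informative = leave_one_out_errors(X, y, genes=[0])
errors_noise = leave_one_out_errors(X, y, genes=[1])
accuracy_informative = 1.0 - errors_informative / len(X)
print(errors_informative, errors_noise, accuracy_informative)
```

Comparing the error count across candidate subsets of different sizes is how a statement such as "2 genes yield zero leave-one-out error" can be checked; accuracy is one minus the error count divided by the number of samples.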

Guyon, I., Weston, J., Barnhill, S. et al. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46, 389–422 (2002). https://doi.org/10.1023/A:1012487302797

Keywords

  • diagnosis
  • diagnostic tests
  • drug discovery
  • RNA expression
  • genomics
  • gene selection
  • DNA micro-array
  • proteomics
  • cancer classification
  • feature selection
  • support vector machines
  • recursive feature elimination