Machine Learning Techniques and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles

  • Conference paper
Data Mining for Biomedical Applications (BioDM 2006)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3916)

Abstract

The recently developed Serial Analysis of Gene Expression (SAGE) technology makes it possible to quantify the expression levels of tens of thousands of genes simultaneously in a population of cells. SAGE has an advantage over microarrays: it can monitor both known and unknown genes, whereas microarrays can measure only known genes. Because cancers may be due to unknown genes, SAGE gene expression profiling is a better basis for cancer classification. Although a wide range of methods has been applied to traditional microarray-based cancer classification, relatively few studies have addressed SAGE-based cancer classification. In this study we evaluate popular machine learning methods (SVM, Naive Bayes, Nearest Neighbor, C4.5 and RIPPER) for classifying cancers from SAGE data. To deal with the high dimensionality of the data, we propose using Chi-square for tag/gene selection. Both binary and multicategory classification are investigated. The experiments are based on two human SAGE datasets: brain and breast. The results show that SVM and Naive Bayes are the top-performing SAGE classifiers and that Chi-square based gene selection can improve the performance of all five classifiers investigated.
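
The abstract does not spell out how the Chi-square scores are computed, so the following is only a minimal sketch of Chi-square tag selection under stated assumptions: each SAGE library is a mapping from tag to count, a tag is treated as "present" when its count reaches a threshold (this binarization is an assumption, not the authors' procedure), and the tag names, the toy libraries, and the helper names `chi_square_score` and `select_top_tags` are hypothetical.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Chi-square statistic of one discrete feature against the class labels."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    value_totals = Counter(feature_values)
    class_totals = Counter(labels)
    score = 0.0
    for value, value_total in value_totals.items():
        for cls, class_total in class_totals.items():
            # Expected count if the feature and the class were independent.
            expected = value_total * class_total / n
            score += (observed[(value, cls)] - expected) ** 2 / expected
    return score


def select_top_tags(libraries, labels, k, present_threshold=1):
    """Return the k tags with the highest Chi-square score.

    libraries: list of dicts mapping tag -> count (one dict per SAGE library).
    present_threshold: assumed cutoff for calling a tag "present".
    """
    tags = sorted({tag for lib in libraries for tag in lib})
    scores = {}
    for tag in tags:
        present = [lib.get(tag, 0) >= present_threshold for lib in libraries]
        scores[tag] = chi_square_score(present, labels)
    return sorted(tags, key=lambda t: scores[t], reverse=True)[:k]


# Invented toy data for illustration only (not from the paper's datasets).
libraries = [
    {"TAG_A": 12, "TAG_B": 3},
    {"TAG_A": 8, "TAG_C": 1},
    {"TAG_B": 5, "TAG_C": 2},
    {"TAG_B": 1},
]
labels = ["cancer", "cancer", "normal", "normal"]

top_tags = select_top_tags(libraries, labels, k=2)
print(top_tags)
```

The selected tags would then be the only features passed to a classifier such as SVM or Naive Bayes. In a real evaluation the selection should be done inside each cross-validation fold, on the training libraries only, to avoid leaking label information into the test set.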

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jin, X., Xu, A., Bie, R., Guo, P. (2006). Machine Learning Techniques and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles. In: Li, J., Yang, Q., Tan, AH. (eds) Data Mining for Biomedical Applications. BioDM 2006. Lecture Notes in Computer Science, vol 3916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11691730_11

  • DOI: https://doi.org/10.1007/11691730_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33104-9

  • Online ISBN: 978-3-540-33105-6

  • eBook Packages: Computer Science, Computer Science (R0)
