k-Top Scoring Pair Algorithm for feature selection in SVM with applications to microarray data classification

Abstract

Top Scoring Pair (TSP) and its ensemble counterpart, k-Top Scoring Pair (k-TSP), were recently introduced as competitive options for solving classification problems of microarray data. However, support vector machine (SVM) which was compared with these approaches is not equipped with feature or variable selection mechanism while TSP itself is a kind of variable selection algorithm. Moreover, an ensemble of SVMs should also be considered as a possible competitor to k-TSP. In this work, we conducted a fair comparison between TSP and SVM-recursive feature elimination (SVM-RFE) as the feature selection method for SVM. We also compared k-TSP with two ensemble methods using SVM as their base classifier. Results on ten public domain microarray data indicated that TSP family classifiers serve as good feature selection schemes which may be combined effectively with other classification methods.

This is a preview of subscription content, access via your institution.

Notes

  1. 1.

    http://leo.ugr.es/elvira/DBCRepository/index.html. Accessed 12 Jun 2008

  2. 2.

    http://faculty.vassar.edu/lowry/kappa.htm. Accessed 11 Mar 2009

References

  1. Alizadeh AA, Eisen MB, Davis EE et al (2000) Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature 403(6769):503–511

    Article  Google Scholar 

  2. Alon U, Barkai N, Notterman DA et al (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96(12):6745–6750

    Article  Google Scholar 

  3. Beer DG, Kardia SL, Huang CC et al (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 8(8):816–824

    Google Scholar 

  4. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

    MATH  MathSciNet  Google Scholar 

  5. Buciu I, Kotropoulos C, Pitas I (2006) Demonstrating the stability of support vector machines for classification. Signal Process 86(9):2364–2380

    Article  Google Scholar 

  6. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167

    Article  Google Scholar 

  7. Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46

    Article  Google Scholar 

  8. Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(2):185–205

    Article  MathSciNet  Google Scholar 

  9. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139

    MATH  Article  MathSciNet  Google Scholar 

  10. Geman D, d’Avignon C, Naiman D, Winslow R (2004) Classifying gene expression profiles from pairwise mrna comparisons. Stat Appl Genet Mol Biol 3(1):19

    MathSciNet  Google Scholar 

  11. Golub TR, Slonim DK, Tamayo P et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537

    Article  Google Scholar 

  12. Gordon GJ, Jensen RV, li Hsiao L et al (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62:4963–4967

    Google Scholar 

  13. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422

    MATH  Article  Google Scholar 

  14. Joachims T (1999) Making large-scale support vector machine learning practical. In: Schölkopf B, Burges CJC, Smola AJ (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge, pp 169–184

    Google Scholar 

  15. Kim HC, Pang S, Je HM, Kim D, Bang SY (2003) Constructing support vector machine ensemble. Pattern Recognit 36(12):2757–2767

    MATH  Article  Google Scholar 

  16. Lai C, Reinders M, Veer LV, Wessels L (2006) A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinformatics 7(1), http://dx.doi.org/10.1186/1471-2105-7-235

  17. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

    Article  Google Scholar 

  18. Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges CJC, Smola AJ (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge, pp 185–208

    Google Scholar 

  19. Pomeroy SL, Tamayo P, Gaasenbeek M et al (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870):436–442

    Article  Google Scholar 

  20. Rosenwald A, Wright G, Chan WC et al (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-b-cell lymphoma. N Engl J Med 346(25):1937–1947

    Article  Google Scholar 

  21. Shipp MA, Ross KN, Tamayo P et al (2002) Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8(1):68–74

    Article  Google Scholar 

  22. Singh D, Febbo PG, Ross K et al (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2):203–209

    Article  Google Scholar 

  23. Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D (2005) Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21(20):3896–3904

    Article  Google Scholar 

  24. Vapnik VN (1998) Statistical Learning Theory. Wiley-Interscience

  25. Wigle DA, Jurisica I, Radulovich N et al (2002) Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res 62:3005–3008

    Google Scholar 

  26. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann

Download references

Acknowledgements

The authors would like to appreciate anonymous reviewers for their valuable comments that improved the presentation of this paper. The work of S. Kim was supported by the Special Research Grant of Sogang University 200811028.01.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Saejoon Kim.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Yoon, S., Kim, S. k-Top Scoring Pair Algorithm for feature selection in SVM with applications to microarray data classification. Soft Comput 14, 151–159 (2010). https://doi.org/10.1007/s00500-009-0437-x

Download citation

Keywords

  • Top Scoring Pair
  • k-TSP
  • SVM
  • SVM-RFE
  • Ensemble methods