Feature Selection and Machine Learning with Mass Spectrometry Data

  • Protocol
Bioinformatics Methods in Clinical Research

Part of the book series: Methods in Molecular Biology (MIMB, volume 593)

Abstract

Mass spectrometry has long been used in biochemical research, but its potential for discovering proteomic biomarkers from protein mass spectra has attracted intense interest only in the last few years. Despite this potential, identifying meaningful proteomic features from mass spectra requires careful evaluation. Extracting such features, and discriminating samples based on them, therefore remain open areas of research in which several groups are actively working to improve the process. In this chapter, we review major contributions to feature selection and classification of proteomic mass spectra acquired with MALDI-TOF and SELDI-TOF technology.


Copyright information

© 2010 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Datta, S., Pihur, V. (2010). Feature Selection and Machine Learning with Mass Spectrometry Data. In: Matthiesen, R. (ed.) Bioinformatics Methods in Clinical Research. Methods in Molecular Biology, vol 593. Humana Press. https://doi.org/10.1007/978-1-60327-194-3_11

  • DOI: https://doi.org/10.1007/978-1-60327-194-3_11

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-60327-193-6

  • Online ISBN: 978-1-60327-194-3

  • eBook Packages: Springer Protocols
