Feature Selection for Classification of SELDI-TOF-MS Proteomic Profiles

Abstract

Background: Proteomic peptide profiling is an emerging technology that is expected to enable early detection, enhance diagnosis and more clearly define the prognosis of many diseases. Although previous research has illustrated the ability of proteomic data to discriminate between cases and controls, significantly less attention has been paid to the feature selection strategies that enable such predictive models to be learned. Feature selection, in addition to classification, plays an important role in the successful identification of proteomic biomarker panels.

Methods: We present a new, efficient, multivariate feature selection strategy that extracts useful feature panels directly from the high-throughput spectra. The strategy takes advantage of the characteristics of surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-TOF-MS) profiles and enhances widely used univariate feature selection strategies with a heuristic based on multivariate de-correlation filtering. We analyse and compare two versions of the method: one in which all feature pairs must adhere to a maximum allowed correlation (MAC) threshold, and another in which the feature panel is built greedily by deciding among best univariate features at different MAC levels.
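The first variant described above, in which every pair of selected features must respect the MAC threshold, can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring function (the magnitude of a Welch-style t-statistic), the function names `univariate_scores` and `mac_filter`, and the use of absolute Pearson correlation are all assumptions made for illustration. The greedy variant that chooses among MAC levels is not covered here.

```python
import numpy as np


def univariate_scores(X, y):
    """Score each feature by the magnitude of a Welch-style t-statistic.

    Assumed scoring choice; any univariate ranking criterion could be used.
    X is (n_samples, n_features); y holds 1 for cases and 0 for controls.
    """
    cases, controls = X[y == 1], X[y == 0]
    mean_diff = cases.mean(axis=0) - controls.mean(axis=0)
    std_err = np.sqrt(cases.var(axis=0, ddof=1) / len(cases)
                      + controls.var(axis=0, ddof=1) / len(controls))
    return np.abs(mean_diff) / (std_err + 1e-12)


def mac_filter(X, y, k, mac):
    """Pick up to k features in order of univariate score, skipping any
    candidate whose absolute correlation with an already selected feature
    exceeds the maximum allowed correlation (mac)."""
    order = np.argsort(-univariate_scores(X, y))  # best score first
    # Standardise columns so that Pearson correlation is a scaled dot product.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    n_samples = X.shape[0]
    selected = []
    for j in order:
        if selected:
            corr = np.abs(Z[:, selected].T @ Z[:, j]) / n_samples
            if corr.max() > mac:
                continue  # too correlated with the current panel
        selected.append(int(j))
        if len(selected) == k:
            break
    return selected
```

Calling `mac_filter(X, y, k=10, mac=0.6)` on a matrix of intensity readings would return the column indices of a panel of at most ten features; the values of `k` and `mac` here are placeholders, not settings taken from the study.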

Results: The feature selection strategies were analysed and compared experimentally on a pancreatic cancer dataset of 57 cancers and 59 controls from the University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, USA. The analysis was conducted in both whole-profile and peak-only modes. The results clearly show the benefit of the new strategy over univariate feature selection methods in terms of improved classification performance.

Conclusion: Understanding the characteristics of the spectra allows us to better assess the relative importance of potential features in the diagnosis of cancer. Incorporation of these characteristics into feature selection strategies often leads to a more efficient data analysis as well as improved classification performance.

Notes

  1. The pancreatic data were provided courtesy of Herbert J. Zeh III, David C. Whitcomb and William L. Bigbee.

  2. The cube-root transform for SELDI-TOF-MS profiles was suggested to us by Jeffrey Morris (personal communication) from the University of Texas MD Anderson Cancer Center, Houston, Texas, USA.

  3. To learn the SVM model, we use an iterative optimisation algorithm described by Mangasarian and Musicant.[50]

  4. An alternative approach is to split the profiles into two groups, case and control, and average them separately. This would eliminate the chance of peak cancellation. However, a peak alignment procedure is then necessary to merge the two sets of peak positions.

References

  1. Petricoin EF, Ardekani AM, Hitt BA, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002; 359: 572–7

  2. Wright Jr GW, Cazares LH, Leung SM, et al. Proteinchip(R) surface enhanced laser desorption/ionization (SELDI) mass spectrometry: a novel protein biochip technology for detection of prostate cancer biomarkers in complex protein mixtures. Prostate Cancer Prostatic Dis 1999; 2(5/6): 264–76

  3. Adam BL, Vlahou A, Semmes OJ, et al. Proteomic approaches to biomarker discovery in prostate and bladder cancers. Proteomics 2001; 1: 1264–70

  4. Zhu W, Wang X, Ma Y, et al. Detection of cancer-specific markers amid massive mass spectral data. Proc Natl Acad Sci U S A 2003; 100: 14666–71

  5. Jones MB, Krutzsch H, Shu H, et al. Proteomic analysis and identification of new biomarkers and therapeutic targets for invasive ovarian cancer. Proteomics 2002; 2: 76–84

  6. Li J, Zhang Z, Rosenzweig J, et al. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin Chem 2002; 48: 1296–304

  7. Watkins B, Szaro R, Ball S, et al. Detection of early stage cancer by serum protein analysis. Am Lab 2001; 6: 32–6

  8. Wadsworth JT, Somers K, Stack B, et al. Identification of patients with head and neck cancer using serum protein profiles. Arch Otolaryngol Head Neck Surg 2004 Jan; 130: 98–104

  9. Poon TC, Yip TT, Chan AT, et al. Comprehensive proteomic profiling identifies serum proteomic signatures for detection of hepatocellular carcinoma and its subtypes. Clin Chem 2003 May; 49(5): 752–60

  10. Zhukov TA, Johanson RA, Cantor AB, et al. Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer 2003; 40: 267–79

  11. Xiao XY, Tang Y, Wei XP, et al. A preliminary analysis of non-small cell lung cancer biomarkers in serum. Biomed Environ Sci 2003; 16: 140–8

  12. Kozak KR, Amneus MW, Pusey SM, et al. Identification of biomarkers for ovarian cancer using strong anion-exchange ProteinChips: potential use in diagnosis and prognosis. Proc Natl Acad Sci U S A 2003 Oct; 100(21): 12343–8

  13. Adam BL, Qu Y, Davis JW, et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res 2002; 62: 3609–14

  14. Petricoin E, Ornstein DK. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst 2002; 94(20): 1576–8

  15. Qu Y, Adam BL, Yasui Y, et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin Chem 2002; 48: 1835–43

  16. Qu Y, Adam B, Thornquist M, et al. Data reduction using a discrete wavelet transform in discriminant analysis of very high dimensionality data. Biometrics 2003; 59: 143–51

  17. Yasui Y, Pepe M, Thompson ML, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 2003; 4: 449–63

  18. Carpenter M, Melath S, Zhang S, et al. Statistical processing and analysis of proteomic and genomic data [online]. Available from URL: http://www.pharmasug.org/2003/BestPapers/spl06.pdf [Accessed 2004 Jan]

  19. Sidransky D, Irizarry R, Califano JA, et al. Serum protein MALDI profiling to distinguish upper aerodigestive tract cancer patients from control subjects. J Natl Cancer Inst 2003; 95: 1711–7

  20. Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs (NJ): Prentice-Hall, 1988

  21. Jolliffe IT. Principal component analysis. New York: Springer-Verlag, 1986

  22. Jutten C, Herault J. Blind separation of sources 1: an adaptive algorithm based on neuromimetic architecture. Signal Process 1991; 24(1): 1–10

  23. Lee TW. Independent component analysis: theory and applications. Boston (MA): Kluwer Academic Publishers, 1998

  24. Grizzle WE, Adam BL, Bigbee WL, et al. Serum protein expression profiling for cancer detection: validation of a SELDI-based approach for prostate cancer. Dis Markers 2004; 19: 185–95

  25. Baggerly KA, Morris JS, Coombes KR. Reproducibility of SELDI-TOF protein patterns in serum: comparing data sets from different experiments. Bioinformatics 2004; 20(5): 777–85

  26. Durbin BP, Hardin JS, Hawkins DM, et al. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002; 18: 105–10

  27. Sankoff D, Kruskal J. Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Reading (MA): Addison-Wesley, 1983

  28. Sakoe H, Chiba S. Dynamic programming optimization for spoken word recognition. IEEE Trans Acoust 1978 Feb; 26: 43–9

  29. Eilers PHC. Parametric time warping. Anal Chem 2004; 76: 404–11

  30. Ramsay JO, Li X. Curve registration. J R Stat Soc Ser B 1998; 60: 351–63

  31. Semmes OJ, Feng Z, Adam BL, et al. Evaluation of serum protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of prostate cancer: I. Assessment of platform reproducibility. Clin Chem 2005 Jan; 51(1): 102–12

  32. Grizzle WE, Meleth S, Eltoum IA, et al. Novel approaches to smoothing and comparing SELDI TOF spectra. Cancer Inform 2004; 1(1): 78–85

  33. Semmes OJ, Feng Z, Adam B-L, et al. Evaluation of serum protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of prostate cancer: I. Assessment of platform reproducibility. Clin Chem 2005; 51: 102–12

  34. Breiman L, Friedman JH, Olshen RA, et al. Classification and regression trees. Belmont (CA): Wadsworth, 1984

  35. Quinlan JR. C4.5: programs for machine learning. San Francisco (CA): Morgan Kaufmann, 1993

  36. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer, 2001

  37. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998; 2: 121–67

  38. Vapnik VN. The nature of statistical learning theory. New York: Springer-Verlag, 1995

  39. Schölkopf B, Smola A. Learning with kernels. Boston (MA): MIT Press, 2002

  40. Bayes T. An essay towards solving a problem in the doctrine of chances. Philos Trans R Soc Lond 1763; 53: 370–418

  41. Russell S, Norvig P. Artificial intelligence: a modern approach. Englewood Cliffs (NJ): Prentice Hall, 2002

  42. Ball G, Mian S, Holding F, et al. An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumors and rapid identification of potential biomarkers. Bioinformatics 2002; 18(3): 395–404

  43. Haykin S. Neural networks. New York: Macmillan, 1994

  44. Bishop C. Neural networks for pattern recognition. Oxford: Oxford University Press, 1995

  45. Lyons-Weiler J, Pelikan R, Zeh III HJ, et al. Assessing the statistical significance of the achieved classification error of classifiers constructed using serum peptide profiles and a prescription for random resampling repeated studies for massive high-throughput genomic and proteomic studies. Cancer Inform 2005; 1(1): 53–77

  46. Good P. Permutation tests: a practical guide to resampling methods for testing hypotheses. New York: Springer-Verlag, 1994

  47. Kendall MG. The treatment of ties in ranking problems. Biometrika 1945; 33: 239–51

  48. Golland P, Fischl B. Permutation tests for classification: towards statistical significance in image-based studies. The 18th International Conference on Information Processing in Medical Imaging. New York: Springer-Verlag, 2003; 2732: 330–41. Lecture Notes in Computer Science

  49. Duda RO, Hart PE, Stork DG. Pattern classification. 2nd ed. New York: John Wiley and Sons, 2000

  50. Mangasarian OL, Musicant DR. Lagrangian support vector machines. J Mach Learn Res 2001; 1: 161–77

  51. Fisher R. The use of multiple measurements in taxonomic problems. Ann Eugen 1936; 7: 179–88

  52. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001; 17(6): 509–19

  53. Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic curve. Diagn Radiol 1982; 143(1): 29–36

  54. Cover TM, Thomas JA. Elements of information theory. New York: Wiley-Interscience, 1991

  55. Bonnlander BV, Weigend AS. Selecting input variables using mutual information and nonparametric density estimation. International Symposium on Artificial Neural Networks (ISANN); Tainan, Taiwan; 1994 Dec 15–17

  56. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995; 57: 289–300

  57. Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001; 98(9): 5116–21

  58. Student. The probable error of a mean. Biometrika 1908; 6: 1–25

  59. Kohavi R, John GH. The wrapper approach. In: Liu H, Motoda H, editors. Feature selection for knowledge discovery in databases. New York: Springer-Verlag, 1998

  60. Coombes KR, Tsavachidis S, Morris JS, et al. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Houston (TX): University of Texas, 2004. MD Anderson Biostatistics Technical Report no.: UTMDABTR-001-04

  61. Diamandis EP. Point: proteomic patterns in biological fluids: do they represent the future of cancer diagnostics? Clin Chem 2003; 49: 1272–5

  62. Petricoin III E, Liotta LA. Counterpoint: the vision for a new diagnostic paradigm. Clin Chem 2003; 49: 1276–8

Acknowledgements

This work was supported in part by the Early Detection Research Network grant no. UO1 CA84968 to William L. Bigbee, and Lung SPORE (Specialized Programs of Research Excellence) grant no. P50 CA90440 to Jill M. Siegfried (which supported some members of our team).

The authors have no conflicts of interest that are directly relevant to the content of this article.

Author information

Corresponding author

Correspondence to Dr Milos Hauskrecht.

About this article

Cite this article

Hauskrecht, M., Pelikan, R., Malehorn, D.E. et al. Feature Selection for Classification of SELDI-TOF-MS Proteomic Profiles. Appl-Bioinformatics 4, 227–246 (2005). https://doi.org/10.2165/00822942-200504040-00003

Keywords

  • Feature Selection
  • Feature Selection Method
  • Test Error
  • Proteomic Profile
  • Intensity Reading