Background: Proteomic peptide profiling is an emerging technology that is widely expected to enable early detection, enhance diagnosis and more clearly define the prognosis of many diseases. Although previous research has illustrated the ability of proteomic data to discriminate between cases and controls, significantly less attention has been paid to the feature selection strategies that enable such predictive models to be learned. Feature selection, in addition to classification, plays an important role in the successful identification of proteomic biomarker panels.
Methods: We present a new, efficient, multivariate feature selection strategy that extracts useful feature panels directly from the high-throughput spectra. The strategy takes advantage of the characteristics of surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-TOF-MS) profiles and enhances widely used univariate feature selection strategies with a heuristic based on multivariate de-correlation filtering. We analyse and compare two versions of the method: one in which all feature pairs must adhere to a maximum allowed correlation (MAC) threshold, and another in which the feature panel is built greedily by deciding among best univariate features at different MAC levels.
Results: The feature selection strategies were analysed and compared experimentally on a pancreatic cancer dataset with 57 cancers and 59 controls from the University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, USA. The analysis was conducted in both the whole-profile and peak-only modes. The results clearly show the benefit of the new strategy over univariate feature selection methods in terms of improved classification performance.
Conclusion: Understanding the characteristics of the spectra allows us to better assess the relative importance of potential features in the diagnosis of cancer. Incorporation of these characteristics into feature selection strategies often leads to a more efficient data analysis as well as improved classification performance.
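As an illustration of the first variant described under Methods, in which every pair of selected features must respect the MAC threshold, the following is a minimal sketch rather than the authors' implementation. It assumes that each feature is given as a column of intensity readings, that a univariate score (higher is better) has already been computed for each feature, and that correlation is measured by the absolute Pearson coefficient. The names `pearson`, `mac_filter`, `columns` and `scores` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mac_filter(columns: Sequence[Sequence[float]],
               scores: Sequence[float],
               mac: float,
               k: int) -> List[int]:
    """Select up to k feature indices, best univariate score first.

    A candidate is skipped if its absolute correlation with any feature
    already selected exceeds the maximum allowed correlation (mac).
    """
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    chosen: List[int] = []
    for j in order:
        if len(chosen) == k:
            break
        if all(abs(pearson(columns[j], columns[i])) <= mac for i in chosen):
            chosen.append(j)
    return chosen


# Toy data (made up for illustration): features 0 and 1 are nearly collinear.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
]
scores = [0.9, 0.8, 0.5]  # assumed univariate scores, higher is better

strict = mac_filter(columns, scores, mac=0.8, k=2)   # feature 1 is filtered out
relaxed = mac_filter(columns, scores, mac=1.0, k=2)  # no de-correlation applied
```

With `mac=1.0` the procedure reduces to plain univariate ranking, whereas a lower threshold replaces the redundant second-ranked feature with a less correlated one.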
The pancreatic data were provided courtesy of Herbert J. Zeh III, David C. Whitcomb and William L. Bigbee.
The cube-root transform for SELDI-TOF-MS profiles was suggested to us by Jeffrey Morris (personal communication) from the University of Texas MD Anderson Cancer Center, Houston, Texas, USA.
To learn the SVM model, we use an iterative optimisation algorithm described by Mangasarian and Musicant.
An alternative approach is to split profiles into two groups, case and control, and average them separately. This would eliminate a chance of peak cancellation. However, in this case a peak alignment procedure is necessary to merge two sets of peak positions.
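The group-wise alternative mentioned above can be sketched as follows. This is an assumption-laden illustration, not the procedure used in the article: peaks are taken to be strict local maxima of each group's mean profile, and the "alignment" is reduced to merging peak positions that lie within a fixed index tolerance of each other. The names `mean_profile`, `local_maxima`, `merge_peaks` and the parameter `tol` are hypothetical.

```python
from typing import List, Sequence


def mean_profile(profiles: Sequence[Sequence[float]]) -> List[float]:
    """Point-wise average of equal-length intensity profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def local_maxima(profile: Sequence[float]) -> List[int]:
    """Indices that are strictly higher than both neighbours."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] < profile[i] > profile[i + 1]]


def merge_peaks(a: Sequence[int], b: Sequence[int], tol: int) -> List[int]:
    """Union of two peak lists; a position within tol of the last kept one is dropped."""
    merged: List[int] = []
    for p in sorted(set(a) | set(b)):
        if merged and p - merged[-1] <= tol:
            continue  # treat as the same peak as the previously kept position
        merged.append(p)
    return merged


# Toy profiles (made up for illustration).
cases = [[0, 1, 5, 1, 0, 0, 0, 0],
         [0, 1, 7, 1, 0, 0, 0, 0]]
controls = [[0, 0, 1, 4, 1, 0, 3, 0],
            [0, 0, 1, 6, 1, 0, 5, 0]]

case_peaks = local_maxima(mean_profile(cases))        # peaks of the case average
control_peaks = local_maxima(mean_profile(controls))  # peaks of the control average
all_peaks = merge_peaks(case_peaks, control_peaks, tol=1)
```

A real alignment step would work in m/z units and would typically allow the tolerance to grow with mass; the fixed index tolerance here only shows why the two peak sets cannot simply be concatenated.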
Petricoin EF, Ardekani AM, Hitt BA, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002; 359: 572–7
Wright Jr GW, Cazares LH, Leung SM, et al. ProteinChip® surface enhanced laser desorption/ionization (SELDI) mass spectrometry: a novel protein biochip technology for detection of prostate cancer biomarkers in complex protein mixtures. Prostate Cancer Prostatic Dis 1999; 2(5/6): 264–76
Adam BL, Vlahou A, Semmes OJ, et al. Proteomic approaches to biomarker discovery in prostate and bladder cancers. Proteomics 2001; 1: 1264–70
Zhu W, Wang X, Ma Y, et al. Detection of cancer-specific markers amid massive mass spectral data. Proc Natl Acad Sci U S A 2003; 100: 14666–71
Jones MB, Krutzsch H, Shu H, et al. Proteomic analysis and identification of new biomarkers and therapeutic targets for invasive ovarian cancer. Proteomics 2002; 2: 76–84
Li J, Zhang Z, Rosenzweig J, et al. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin Chem 2002; 48: 1296–304
Watkins B, Szaro R, Ball S, et al. Detection of early stage cancer by serum protein analysis. Am Lab 2001; 6: 32–6
Wadsworth JT, Somers K, Stack B, et al. Identification of patients with head and neck cancer using serum protein profiles. Arch Otolaryngol Head Neck Surg 2004 Jan; 130: 98–104
Poon TC, Yip TT, Chan AT, et al. Comprehensive proteomic profiling identifies serum proteomic signatures for detection of hepatocellular carcinoma and its subtypes. Clin Chem 2003 May; 49(5): 752–60
Zhukov TA, Johanson RA, Cantor AB, et al. Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer 2003; 40: 267–79
Xiao XY, Tang Y, Wei XP, et al. A preliminary analysis of non-small cell lung cancer biomarkers in serum. Biomed Environ Sci 2003; 16: 140–8
Kozak KR, Amneus MW, Pusey SM, et al. Identification of biomarkers for ovarian cancer using strong anion-exchange ProteinChips: potential use in diagnosis and prognosis. Proc Natl Acad Sci U S A 2003 Oct; 100(21): 12343–8
Adam BL, Qu Y, Davis JW, et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res 2002; 62: 3609–14
Petricoin E, Ornstein DK. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst 2002; 94(20): 1576–8
Qu Y, Adam BL, Yasui Y, et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin Chem 2002; 48: 1835–43
Qu Y, Adam B, Thornquist M, et al. Data reduction using a discrete wavelet transform in discriminant analysis of very high dimensionality data. Biometrics 2003; 59: 143–51
Yasui Y, Pepe M, Thompson ML, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 2003; 4: 449–63
Carpenter M, Meleth S, Zhang S, et al. Statistical processing and analysis of proteomic and genomic data [online]. Available from URL: http://www.pharmasug.org/2003/BestPapers/spl06.pdf [Accessed 2004 Jan]
Sidransky D, Irizarry R, Califano JA, et al. Serum protein MALDI profiling to distinguish upper aerodigestive tract cancer patients from control subjects. J Natl Cancer Inst 2003; 95: 1711–7
Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs (NJ): Prentice-Hall, 1988
Jolliffe IT. Principal component analysis. New York: Springer-Verlag, 1986
Jutten C, Herault J. Blind separation of sources 1: an adaptive algorithm based on neuromimetic architecture. Signal Process 1991; 24(1): 1–10
Lee TW. Independent component analysis: theory and applications. Boston (MA): Kluwer Academic Publishers, 1998
Grizzle WE, Adam BL, Bigbee WL, et al. Serum protein expression profiling for cancer detection: validation of a SELDI-based approach for prostate cancer. Dis Markers 2004; 19: 185–95
Baggerly KA, Morris JS, Coombes KR. Reproducibility of SELDI-TOF protein patterns in serum: comparing data sets from different experiments. Bioinformatics 2004; 20(5): 777–85
Durbin BP, Hardin JS, Hawkins DM, et al. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002; 18: 105–10
Sankoff D, Kruskal J. Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Reading (MA): Addison-Wesley, 1983
Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process 1978 Feb; 26: 43–9
Eilers PHC. Parametric time warping. Anal Chem 2004; 76: 404–11
Ramsay JO, Li X. Curve registration. J R Stat Soc Ser B 1998; 60: 351–63
Semmes OJ, Feng Z, Adam BL, et al. Evaluation of serum protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of prostate cancer: I. Assessment of platform reproducibility. Clin Chem 2005 Jan; 51(1): 102–12
Grizzle WE, Meleth S, Eltoum IA, et al. Novel approaches to smoothing and comparing SELDI TOF spectra. Cancer Inform 2004; 1(1): 78–85
Breiman L, Friedman JH, Olshen RA, et al. Classification and regression trees. Belmont (CA): Wadsworth, 1984
Quinlan JR. C4.5: programs for machine learning. San Francisco (CA): Morgan Kaufmann, 1993
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer, 2001
Burges C. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998; 2: 121–67
Vapnik VN. The nature of statistical learning theory. New York: Springer-Verlag, 1995
Schölkopf B, Smola A. Learning with kernels. Cambridge (MA): MIT Press, 2002
Bayes T. An essay towards solving a problem in the doctrine of chances. Philos Trans R Soc Lond 1763; 53: 370–418
Russell S, Norvig P. Artificial intelligence: a modern approach. Englewood Cliffs (NJ): Prentice Hall, 2002
Ball G, Mian S, Holding F, et al. An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumors and rapid identification of potential biomarkers. Bioinformatics 2002; 18(3): 395–404
Haykin S. Neural networks. New York: Macmillan, 1994
Bishop C. Neural networks for pattern recognition. Oxford: Oxford University Press, 1995
Lyons-Weiler J, Pelikan R, Zeh III HJ, et al. Assessing the statistical significance of the achieved classification error of classifiers constructed using serum peptide profiles and a prescription for random resampling repeated studies for massive high-throughput genomic and proteomic studies. Cancer Inform 2005; 1(1): 53–77
Good P. Permutation tests: a practical guide to resampling methods for testing hypotheses. New York: Springer-Verlag, 1994
Kendall MG. The treatment of ties in ranking problems. Biometrika 1945; 33: 239–51
Golland P, Fischl B. Permutation tests for classification: towards statistical significance in image-based studies. The 18th International Conference on Information Processing in Medical Imaging. New York: Springer-Verlag, 2003; 2732: 330–41. Lecture Notes in Computer Science
Duda RO, Hart PE, Stork DG. Pattern classification. 2nd ed. New York: John Wiley and Sons, 2000
Mangasarian OL, Musicant DR. Lagrangian support vector machines. J Mach Learn Res 2001; 1: 161–77
Fisher R. The use of multiple measurements in taxonomic problems. Ann Eugen 1936; 7: 179–88
Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001; 17(6): 509–19
Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic curve. Radiology 1982; 143(1): 29–36
Cover TM, Thomas JA. Elements of information theory. New York: Wiley-Interscience, 1991
Bonnlander BV, Weigend AS. Selecting input variables using mutual information and nonparametric density estimation. International Symposium on Artificial Neural Networks (ISANN); Tainan, Taiwan; 1994 Dec 15–17
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995; 57: 289–300
Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001; 98(9): 5116–21
Student. The probable error of a mean. Biometrika 1908; 6: 1–25
Kohavi R, John GH. The wrapper approach. In: Liu H, Motoda H, editors. Feature selection for knowledge discovery in databases. New York: Springer-Verlag, 1998
Coombes KR, Tsavachidis S, Morris JS, et al. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Houston (TX): University of Texas, 2004. MD Anderson Biostatistics Technical Report no.: UTMDABTR-001-04
Diamandis EP. Point: proteomic patterns in biological fluids: do they represent the future of cancer diagnostics? Clin Chem 2003; 49: 1272–5
Petricoin III E, Liotta LA. Counterpoint: the vision for a new diagnostic paradigm. Clin Chem 2003; 49: 1276–8
This work was supported in part by the Early Detection Research Network grant no. UO1 CA84968 to William L. Bigbee, and Lung SPORE (Specialized Programs of Research Excellence) grant no. P50 CA90440 to Jill M. Siegfried (which supported some members of our team).
The authors have no conflicts of interest that are directly relevant to the content of this article.
Hauskrecht, M., Pelikan, R., Malehorn, D.E. et al. Feature Selection for Classification of SELDI-TOF-MS Proteomic Profiles. Appl-Bioinformatics 4, 227–246 (2005). https://doi.org/10.2165/00822942-200504040-00003
- Feature Selection
- Feature Selection Method
- Test Error
- Proteomic Profile
- Intensity Reading