A Clustering Based Hybrid System for Mass Spectrometry Data Analysis

  • Pengyi Yang
  • Zili Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5265)


Recently, much attention has been given to mass spectrometry (MS) based disease classification, diagnosis, and protein-based biomarker identification. As with microarray-based investigations, the proteomic data generated by such high-throughput experiments often have a high feature-to-sample ratio. Moreover, biological information and patterns are mixed with noise, redundancy, and outliers. The development of algorithms and procedures for analyzing and interpreting such data is therefore of paramount importance. In this paper, we propose a hybrid system for analyzing such high-dimensional data. The proposed method uses a feature extraction and selection procedure based on the k-means clustering algorithm to bridge filter selection and wrapper selection methods. Potentially informative mass/charge (m/z) markers selected by filters are passed to the k-means clustering algorithm for correlation and redundancy reduction, and a multi-objective genetic algorithm selector is then employed to identify discriminative m/z markers among those generated by k-means clustering. Experimental results indicate that the proposed method is suitable for m/z biomarker selection and MS-based sample classification.
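
The first two stages described above (filter ranking followed by k-means redundancy reduction) can be illustrated with a minimal sketch. It is not the authors' implementation. The following are assumptions made for illustration only: the filter is information gain on a median split, each cluster is represented by its highest-scoring member, and the names `information_gain`, `kmeans`, and `select_representatives` are hypothetical. The multi-objective genetic algorithm wrapper stage is omitted.

```python
import math
import random


def information_gain(values, labels):
    """Information gain of splitting one feature at its median (assumed filter)."""
    def entropy(ys):
        n = len(ys)
        if n == 0:
            return 0.0
        return -sum((ys.count(c) / n) * math.log2(ys.count(c) / n) for c in set(ys))

    threshold = sorted(values)[len(values) // 2]
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_representatives(X, y, n_filtered, k):
    """X is a list of samples (rows) over m/z features (columns).

    Keeps the n_filtered best features by the filter score, clusters their
    intensity profiles across samples, and returns one feature index per
    cluster (the highest-scoring member; an assumed selection rule).
    """
    columns = [list(col) for col in zip(*X)]
    scores = [information_gain(col, y) for col in columns]
    top = sorted(range(len(columns)), key=lambda j: -scores[j])[:n_filtered]
    assign = kmeans([columns[j] for j in top], k)
    best = {}
    for j, c in zip(top, assign):
        if c not in best or scores[j] > scores[best[c]]:
            best[c] = j
    return sorted(best.values())
```

Calling `select_representatives(X, y, n_filtered, k)` returns at most `k` feature indices, which in a full pipeline would form the candidate pool passed to a wrapper selector. Clustering feature profiles before the wrapper stage keeps near-duplicate m/z markers from occupying several slots in that pool.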


Keywords: Feature Selection, Information Gain, Feature Subset, Subset Evaluation, Base Feature Selection



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Pengyi Yang (1)
  • Zili Zhang (1, 2)
  1. Intelligent Software and Software Engineering Laboratory, Faculty of Computer and Information Science, Southwest University, Chongqing, China
  2. School of Engineering and Information Technology, Deakin University, Geelong, Australia
