Feature Subset Selection for Protein Subcellular Localization Prediction

  • Qing-Bin Gao
  • Zheng-Zhi Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4115)


Most existing methods for protein subcellular localization prediction rely on a large number of features that are considered potentially useful for determining where a protein resides in the cell. However, predictors with many input variables usually suffer from the curse of dimensionality and carry a risk of overfitting. Using only those features that are relevant to subcellular localization might improve prediction performance and might also yield biologically useful knowledge. In this paper, we present a feature-ranking-based feature subset selection approach for predicting protein subcellular localization with support vector machines (SVMs). Experimental results show that the method improves prediction performance when the selected feature subsets are used. It is anticipated that the proposed method will be a powerful tool for large-scale annotation of biological data.
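The general idea of ranking features and then choosing a subset by validation performance can be illustrated with a minimal sketch. The abstract does not state which ranking criterion the authors use, so the sketch assumes a simple F-score (between-class over within-class variance). To stay within the Python standard library, a nearest-centroid classifier stands in for the SVM. All function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean

def f_score(values, labels):
    """Between-class over within-class variance for a single feature (assumed criterion)."""
    overall = mean(values)
    between, within = 0.0, 0.0
    for c in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == c]
        m = mean(group)
        between += len(group) * (m - overall) ** 2
        within += sum((v - m) ** 2 for v in group)
    return between / within if within > 0 else float("inf")

def rank_features(X, y):
    """Return feature indices ordered from highest to lowest F-score."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def centroid_accuracy(X_train, y_train, X_val, y_val, features):
    """Validation accuracy of a nearest-centroid classifier on the given features.
    This is a dependency-free stand-in for the SVM, not the paper's classifier."""
    centroids = {}
    for c in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == c]
        centroids[c] = [mean(r[j] for r in rows) for j in features]
    correct = 0
    for x, lab in zip(X_val, y_val):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(features, centroids[c])),
        )
        correct += pred == lab
    return correct / len(y_val)

def select_subset(X_train, y_train, X_val, y_val):
    """Evaluate the top-k ranked features for every k and keep the best prefix."""
    ranking = rank_features(X_train, y_train)
    best_k, best_acc = 1, -1.0
    for k in range(1, len(ranking) + 1):
        acc = centroid_accuracy(X_train, y_train, X_val, y_val, ranking[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranking, ranking[:best_k], best_acc

def make_data(rng, n):
    """Synthetic two-class data: features 0 and 1 are informative, the other 8 are noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = [rng.gauss(3.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
        X.append(informative + noise)
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(rng, 200)
X_val, y_val = make_data(rng, 100)
ranking, selected, accuracy = select_subset(X_train, y_train, X_val, y_val)
print("ranking:", ranking)
print("selected subset:", selected, "validation accuracy:", accuracy)
```

In practice the stand-in classifier would be replaced by an SVM and the held-out split by cross-validation. The ranking-then-prefix-search structure is only one common way to realize feature-ranking-based subset selection.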


Keywords: Support Vector Machine; Feature Selection; Prediction Performance; Location Accuracy; Total Accuracy





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Qing-Bin Gao (1)
  • Zheng-Zhi Wang (1)
  1. Institute of Automation, National University of Defense Technology, Changsha, People’s Republic of China
