Frequent Subsequence-Based Protein Localization

  • Osmar R. Zaïane
  • Yang Wang
  • Randy Goebel
  • Gregory Taylor
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3916)


Extracellular plant proteins are involved in numerous processes including nutrient acquisition, communication with other soil organisms, protection from pathogens, and resistance to disease and toxic metals. Because these proteins are strategically positioned to play a role in resistance to environmental stress, biologists are interested in proteomic tools for analyzing extracellular proteins. In this paper, we present three methods that use frequent subsequences of amino acids: one based on support vector machines (SVM), one based on boosting, and FSP, a new frequent subsequence pattern method. We test our methods on a plant dataset, and the experimental results show that they perform better than existing approaches based on amino acid composition.
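
As a rough illustration of the idea of frequent subsequences as features, the following minimal sketch mines contiguous amino acid substrings that occur in at least a given fraction of sequences and turns them into binary features that a classifier such as an SVM or a boosting learner could consume. It is not the method of the paper: the restriction to contiguous substrings, the function names, the length bounds, the support threshold, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def frequent_subsequences(sequences, min_support, min_len=2, max_len=4):
    """Return {substring: support} for contiguous substrings whose support
    (fraction of sequences containing them) is at least min_support.

    Hypothetical helper: contiguous substrings and the length bounds are
    assumptions, not the definition used in the paper.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        # Count each substring at most once per sequence.
        counts.update(seen)
    threshold = min_support * len(sequences)
    return {s: c / len(sequences) for s, c in counts.items() if c >= threshold}


def to_feature_vector(seq, patterns):
    """Binary features: 1 if the pattern occurs in the sequence, else 0."""
    return [1 if p in seq else 0 for p in patterns]


# Toy sequences, invented for illustration only.
toy = ["MKTLLA", "MKTAVL", "GGMKTL"]
patterns = sorted(frequent_subsequences(toy, min_support=1.0))
print(patterns)                               # ['KT', 'MK', 'MKT']
print(to_feature_vector("AAMKAA", patterns))  # [0, 1, 0]
```

In practice the support threshold would be tuned per class, and the resulting feature vectors would be passed to whichever learner is being evaluated.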


Support Vector Machine; Subcellular Localization; Amino Acid Composition; Outer Membrane Protein; Extracellular Protein
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Osmar R. Zaïane
    • 1
  • Yang Wang
    • 1
  • Randy Goebel
    • 1
  • Gregory Taylor
    • 1
  1. University of Alberta, Edmonton, Alberta, Canada
