Protein Subcellular Localization Prediction Using Artificial Intelligence Technology

  • Rajesh Nair
  • Burkhard Rost
Part of the Methods in Molecular Biology book series (MIMB, volume 484)


Proteins perform many important tasks in living organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. The many aspects of the role of any particular protein are collectively referred to as its “function.” One aspect of protein function that has been the target of intensive research by computational biologists is subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function. Aberrant subcellular localization of proteins can result in several diseases, including kidney stones, cancer, and Alzheimer’s disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has resulted in significant improvements in our ability to predict the subcellular localization of a protein. Prediction accuracy has risen steadily over the years, in large part due to the application of AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up the annotation of proteins for which some functional information is available in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals extremely accurately. Ab initio methods, which predict subcellular localization for any protein using only its native amino acid sequence and features predicted from that sequence, have shown the most remarkable improvements: their prediction accuracy has increased by over 30% in the past decade. The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.

Key Words

Protein subcellular localization prediction; sorting signals; neural networks; support vector machines; hidden Markov models; amino acid composition; text analysis





Copyright information

© Humana Press, Totowa, NJ 2008

Authors and Affiliations

  • Rajesh Nair (1)
  • Burkhard Rost (1)
  1. CUBIC, Department of Biochemistry and Molecular Biophysics and Center for Computational Biology and Bioinformatics, Columbia University, New York, USA
