Abstract
A complexity-based approach is proposed to predict the subcellular location of proteins. Instead of extracting features from protein sequences, as previous methods do, our approach is based on a complexity decomposition of symbol sequences. First, the distance between each pair of protein sequences is evaluated by the conditional complexity of one sequence given the other. The subcellular location of a protein is then determined using the k-nearest neighbor algorithm. On three widely used data sets created by Reinhardt and Hubbard, Park and Kanehisa, and Gardy et al., our approach shows an improvement in prediction accuracy over methods based on the amino acid composition and on Markov models of protein sequences.
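The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the Lempel–Ziv (1976) complexity as the complexity measure and a normalized distance in the style of Otu and Sayood (2003), since the abstract does not state the exact formula. The function names `lz_complexity`, `lz_distance` and `knn_predict`, and the toy sequences and labels, are hypothetical.

```python
from collections import Counter


def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the current component while it can be copied from the
        # text seen so far (the copy source may overlap the component).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s, q):
    """Assumed distance: complexity added by one sequence given the other,
    normalized by the larger of the two individual complexities."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    return max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq) / max(cs, cq)


def knn_predict(query, training, k=1):
    """Majority vote among the k training sequences closest to the query.

    training is a list of (sequence, location_label) pairs.
    """
    scored = sorted(
        ((lz_distance(query, seq), label) for seq, label in training),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data only; real use would load an annotated protein data set.
    toy_training = [
        ("AAAAAAAAAAAAAAAA", "location_x"),
        ("ACDEFGHIKLMNPQRS", "location_y"),
    ]
    print(knn_predict("AAAAAAAAAAAA", toy_training, k=1))
```

Because the distance is computed directly from the raw sequences, no feature vector has to be designed; the cost is that every query must be compared against every training sequence.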
References
Andrade MA, O’Donoghue SI, Rost B (1998) Adaptation of protein surfaces to subcellular location. J Mol Biol 276:517–525
Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver JL (1999) Compositional complexity of DNA sequence models. Comput Phys Commun 121(1):136–138
Bezdek JC, Hall LO, Clarke LP (1993) Review of MR image segmentation techniques using pattern recognition. Med Phys 20:1033–1048
Boyd D, Schierle C, Beckwith J (1998) How many membrane proteins are there? Protein Sci 7:201–205
Cedano J, Aloy P, Pérez-Pons JA, Querol E (1997) Relation between amino acid composition and cellular location of proteins. J Mol Biol 266:594–600
Chou KC (2001) Prediction of protein cellular attributes using pseudo amino acid composition. Proteins Struct Funct Genet 43:246–255
Chou KC, Cai YD (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J Biol Chem 277:45765–45769
Chou KC, Cai YD (2003) A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochem Biophys Res Commun 311:743–747
Chou KC, Cai YD (2004) Prediction of protein subcellular locations by GO-FunD-PseAA predictor. Biochem Biophys Res Commun 320:1236–1239
Chou KC, Elrod DW (1999) Protein subcellular location prediction. Protein Eng 12:107–118
Chou KC, Shen HB (2007) Review: recent progresses in protein subcellular location prediction. Anal Biochem 370:1–16
Chou KC, Shen HB (2008) Cell-PLoc: a package of web-servers for predicting subcellular localization of proteins in various organisms. Nat Protoc 3:153–162
Chou KC, Zhang CT (1995) Review: prediction of protein structural classes. Crit Rev Biochem Mol Biol 30:275–349
Diao Y, Ma D, Wen Z, Yin J, Xiang J, Li M (2008) Using pseudo amino acid composition to predict transmembrane regions in protein: cellular automata and Lempel–Ziv complexity. Amino Acids 34(1):111–117
Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley, New York
Emanuelsson O, Nielsen H, Brunak S, Von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300:1005–1016
Feng ZP, Zhang CT (2002) A graphic representation of protein sequence and predicting the subcellular localizations of prokaryotic proteins. Int J Biochem Cell Biol 34:298–307
Gao QB, Wang ZZ, Yan C, Du YH (2005) Prediction of protein subcellular location using a combined feature of sequence. FEBS Lett 579:3444–3448
Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FS (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res 31:3613–3617
Guo J, Lin YL, Sun ZR (2005) A novel method for protein subcellular localization: combining residue-couple model and SVM. Proc APBC 2005:117–129
Hua SJ, Sun ZR (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17:721–728
Huang Y, Li YD (2004) Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics 20:21–28
Lempel A, Ziv J (1976) On the complexity of finite sequences. IEEE T Inform Theory 22:75–81
Leszczynski K, Cosby S, Bissett R, Provost D, Boyko S, Loose S, Mvilongo E (1999) Application of a fuzzy pattern classifier to decision making in portal verification of radiotherapy. Phys Med Biol 44:253–269
Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R (2004) Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 20:547–556
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–451
Murphy RF, Boland MV, Velliste M (2000) Towards a systematics for protein subcellular location: quantitative description of protein localization patterns and automated analysis of fluorescence microscope images. Proc Int Conf Intell Syst Mol Biol 8:251–259
Nakai K (2000) Protein sorting signals and prediction of subcellular localization. Adv Protein Chem 54:277–344
Nakai K, Kanehisa M (1991) Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins 11:95–110
Nakashima H, Nishikawa K (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol 238:54–61
Nielsen H, Engelbrecht J, Brunak S, Von Heijne G (1997) A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Sys 8:581–599
Nielsen H, Brunak S, Von Heijne G (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng 12:3–9
Orlov YL, Potapov VN (2004) Complexity: an internet resource for analysis of DNA sequence complexity. Nucleic Acids Res 32:628–633
Otu HH, Sayood K (2003) A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19:2122–2130
Park KJ, Kanehisa M (2003) Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19:1656–1663
Reinhardt A, Hubbard T (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res 26:2230–2236
Sadovsky MG (2003) The method to compare nucleotide sequences based on minimum entropy principle. Bull Math Biol 65:309–322
Troyanskaya OG, Arbell O, Koren Y, Landau GM, Bolshoy A (2002) Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity. Bioinformatics 18(5):679–688
Wang J, Zheng X (2008) Comparison of protein secondary structures based on backbone dihedral angles. J Theor Biol 250:382–387
Xiao X, Shao S, Ding Y, Huang Z, Huang Y, Chou KC (2005) Using complexity measure factor to predict protein subcellular location. Amino Acids 28:57–61
Xie D, Li A, Wang M, Fan Z, Feng H (2005) LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res 33:105–110
Yu CS, Lin CJ, Hwang JK (2004) Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci 13:1402–1406
Yuan Z (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett 451:23–26
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE T Inform Theory 23:337–343
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE T Inform Theory 24:530–536
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China.
Cite this article
Zheng, X., Liu, T. & Wang, J. A complexity-based method for predicting protein subcellular location. Amino Acids 37, 427–433 (2009). https://doi.org/10.1007/s00726-008-0172-0
DOI: https://doi.org/10.1007/s00726-008-0172-0