Molecular Diversity, Volume 14, Issue 3, pp 551–558

Predicting subcellular location of proteins using integrated-algorithm method

Full-Length Paper


A protein's subcellular location, which indicates where the protein resides in a cell, is one of its important characteristics. Correctly assigning proteins to their subcellular locations would greatly help the prediction of protein function, genome annotation, and drug design. Yet, despite great technical advances in the past decades, experimentally determining protein subcellular locations on a high-throughput scale remains time-consuming and laborious. Hence, four integrated-algorithm methods were developed in this article to enable such high-throughput prediction. Two data sets taken from the literature (Chou and Elrod, Protein Eng 12:107–118, 1999), consisting of 2,391 and 2,598 proteins, were used as the training set and the test set, respectively. Amino acid composition was used to represent the protein sequences. Jackknife cross-validation was used to evaluate performance on the training set. The best integrated-algorithm predictor was constructed by integrating 10 algorithms in Weka (a software tool for data mining tasks), based on the mRMR (Minimum Redundancy Maximum Relevance) method. It achieves correct rates of 77.83% and 80.56% for the training set and test set, respectively, which is better than each of the 60 algorithms collected in Weka. The prediction software is available upon request.
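
The three ingredients named in the abstract (amino acid composition, jackknife testing, and combining several classifiers) can be illustrated with a minimal Python sketch. This is not the authors' Weka pipeline. The 1-nearest-neighbour base classifier and the majority-vote combiner are assumptions chosen for brevity, since the abstract does not say which algorithms were integrated or how their outputs were merged. All function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of residue frequencies."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_neighbor_predict(train_vectors, train_labels, query):
    """Assumed stand-in classifier: label of the closest training vector."""
    def squared_distance(vector):
        return sum((x - y) ** 2 for x, y in zip(vector, query))

    best = min(range(len(train_vectors)),
               key=lambda i: squared_distance(train_vectors[i]))
    return train_labels[best]


def jackknife_accuracy(vectors, labels, predict=nearest_neighbor_predict):
    """Leave-one-out test: predict each protein from all the others."""
    correct = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict(rest_vectors, rest_labels, vectors[i]) == labels[i]:
            correct += 1
    return correct / len(vectors)


def majority_vote(predictions):
    """Assumed combiner: most frequent label; ties go to the first seen."""
    return Counter(predictions).most_common(1)[0][0]
```

In this sketch each protein sequence is first mapped to its composition vector, `jackknife_accuracy` then estimates the correct rate on the training set, and `majority_vote` merges the labels that several base classifiers assign to the same protein.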


Keywords

mRMR (Minimum redundancy maximum relevance); Subcellular localization; Amino acid composition; Integrated-algorithm method; Weka



Supplementary material

11030_2009_9182_MOESM1_ESM.txt (779 kb)
ESM 1 (TXT 778 kb)
11030_2009_9182_MOESM2_ESM.doc (54 kb)
ESM (DOC 53.5 kb)
11030_2009_9182_MOESM3_ESM.doc (398 kb)
ESM (DOC 398 kb)
11030_2009_9182_MOESM4_ESM.doc (432 kb)
ESM (DOC 432 kb)


References

  1. Chou KC, Elrod DW (1999) Protein subcellular location prediction. Protein Eng 12:107–118
  2. Eisenhaber F, Bork P (1998) Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol 8:169–170
  3. Hua S, Sun Z (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17:721–728
  4. Yuan Z (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett 451:23–26
  5. Reinhardt A, Hubbard T (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res 26:2230–2236
  6. Frank E, Witten IH (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco
  7. Gewehr JE, Szugat M, Zimmer R (2007) BioWeka—extending the Weka framework for bioinformatics. Bioinformatics 23:651–653
  8. Gonzalez-Diaz H, Aguero-Chapin G, Varona J, Molina R, Delogu G, Santana L, Uriarte E, Podda G (2007) 2D-RNA-coupling numbers: a new computational chemistry approach to link secondary structure topology with biological function. J Comput Chem 28:1049–1056
  9. Munteanu CR, Gonzalez-Diaz H, Magalhaes AL (2008) Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. J Theor Biol 254:476–482
  10. Peng HC, Long FH, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Analysis Mach Intell 27:1226–1238
  11. Cai YD, Chou KC (2006) Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J Theor Biol 238:395–400
  12. Won HH, Kim MJ, Kim S, Kim JW (2008) EnsemPro: an ensemble approach to predicting transcription start sites in human genomic DNA sequences. Genomics 91:259–266
  13. Cedano J, Aloy P, Perez-Pons JA, Querol E (1997) Relation between amino acid composition and cellular location of proteins. J Mol Biol 266:594–600

Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Institute of System Biology, Shanghai University, Shanghai, China
  2. Centre for Computational Systems Biology, Fudan University, Shanghai, China
  3. Department of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
  4. Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai, China
