Molecular Diversity, Volume 14, Issue 4, pp 667–671

Prediction of subcellular location of mycobacterial protein using feature selection techniques

  • Hao Lin
  • Hui Ding
  • Feng-Biao Guo
  • Jian Huang
Full-Length Paper


Mycobacterium tuberculosis is the primary pathogen causing tuberculosis, one of the most prevalent infectious diseases. The subcellular location of mycobacterial proteins can provide essential clues for protein function research and drug discovery. It is therefore highly desirable to develop a computational method for fast and reliable prediction of the subcellular location of mycobacterial proteins. In this study, we developed a support vector machine (SVM) based method for this task. A total of 444 non-redundant mycobacterial proteins were used to train and test the proposed model with jackknife cross validation. Using the traditional pseudo amino acid composition (PseAAC) as input parameters, an overall accuracy of 83.3% was achieved. A feature selection technique was then developed to find an optimal number of PseAAC components, which improved the overall accuracy from 83.3% to 87.2%. In addition, the reduced amino acids in the N-terminal and non-N-terminal regions of the proteins were combined in the models to further improve the prediction success rate. As a result, a maximum overall accuracy of 91.2% was achieved, with an average accuracy of 79.7%. The proposed model provides useful information for further experimental research. The prediction model can be accessed free of charge at


Keywords: Protein subcellular localization; Pseudo amino acid composition; Feature selection; Mycobacterium tuberculosis; Reduced amino acids
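
The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the classifier is a simple nearest-centroid stand-in rather than an SVM, the features are plain 20-dimensional amino acid compositions rather than PseAAC, and all function names are hypothetical. It assumes that "overall accuracy" means the fraction of all proteins predicted correctly and that "average accuracy" means the mean of the per-location accuracies, which is the usual reading but is not spelled out in the abstract.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (NOT the paper's SVM): pick the closest class mean."""
    sums, sizes = {}, Counter(train_y)
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v

    def dist(label):
        # Squared Euclidean distance from the query to the class centroid.
        return sum((s / sizes[label] - q) ** 2 for s, q in zip(sums[label], query))

    return min(sorted(sums), key=dist)


def jackknife(features, labels, classify=nearest_centroid):
    """Leave each sample out once, train on the rest, and predict it."""
    preds = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        preds.append(classify(train_x, train_y, features[i]))
    return preds


def overall_and_average_accuracy(labels, preds):
    """Overall = correct / total; average = mean of per-class accuracies."""
    correct, totals = Counter(), Counter(labels)
    for y, p in zip(labels, preds):
        correct[y] += int(y == p)
    overall = sum(correct.values()) / len(labels)
    average = sum(correct[c] / totals[c] for c in totals) / len(totals)
    return overall, average
```

Under these assumptions, a dataset dominated by one location can score a high overall accuracy while the average accuracy stays lower, because small classes count equally in the average. This would be consistent with the gap between 91.2% and 79.7% reported above.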




Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
