
Prediction of subcellular location of mycobacterial protein using feature selection techniques

  • Full-Length Paper
  • Molecular Diversity

Abstract

Mycobacterium tuberculosis is the primary pathogen causing tuberculosis, one of the most prevalent infectious diseases. The subcellular location of mycobacterial proteins can provide essential clues for protein function research and drug discovery. It is therefore highly desirable to develop a computational method for fast and reliable prediction of the subcellular location of mycobacterial proteins. In this study, we developed a support vector machine (SVM) based method for this task. A total of 444 non-redundant mycobacterial proteins were used to train and test the proposed model with jackknife cross-validation. Using the traditional pseudo amino acid composition (PseAAC) as input parameters, an overall accuracy of 83.3% was achieved. Moreover, a feature selection technique was developed to find an optimal number of PseAAC components, which improved the overall accuracy from 83.3% to 87.2%. In addition, the reduced amino acids in the N-terminal and non-N-terminal regions of the proteins were combined in the models to further improve the prediction success rate. As a result, a maximum overall accuracy of 91.2% was achieved, with an average accuracy of 79.7%. The proposed model provides useful information for further experimental research. The prediction model can be accessed free of charge at http://cobi.uestc.edu.cn/cobi/people/hlin/webserver.
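The abstract reports both an overall accuracy and a lower average accuracy under jackknife cross-validation. The following minimal sketch illustrates how the two figures can diverge when classes are imbalanced. It is not the authors' implementation: a 1-nearest-neighbour rule stands in for the SVM, the two-dimensional feature vectors and class labels are invented toy data rather than PseAAC values, and the function names are hypothetical. "Average accuracy" is assumed here to mean the unweighted mean of the per-class accuracies.

```python
from collections import defaultdict
from math import dist


def jackknife_predictions(features, labels):
    """Leave-one-out (jackknife) test: predict each sample from all the others.

    A 1-nearest-neighbour rule is used as a stand-in classifier (assumption);
    the paper itself uses an SVM.
    """
    predictions = []
    for i, query in enumerate(features):
        best_label, best_distance = None, float("inf")
        for j, (other, label) in enumerate(zip(features, labels)):
            if i == j:
                continue  # the held-out sample is excluded from training
            d = dist(query, other)
            if d < best_distance:
                best_label, best_distance = label, d
        predictions.append(best_label)
    return predictions


def overall_and_average_accuracy(labels, predictions):
    """Return (overall accuracy, unweighted mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(labels, predictions):
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    overall = sum(correct.values()) / len(labels)
    average = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, average


# Toy, imbalanced data set (hypothetical values, not from the paper).
features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
    (1.0, 1.0), (0.3, 0.2),
]
labels = ["cytoplasmic"] * 6 + ["secreted"] * 2

predictions = jackknife_predictions(features, labels)
overall, average = overall_and_average_accuracy(labels, predictions)
print(f"overall={overall:.3f} average={average:.3f}")
# overall=0.875 average=0.750
```

In this toy case the large class is predicted perfectly while one of the two samples in the small class is missed, so the overall accuracy (7 of 8 correct) exceeds the per-class average. A similar effect may explain the gap between the 91.2% and 79.7% figures, although the abstract does not give the class sizes.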

Author information

Corresponding author

Correspondence to Hao Lin.

About this article

Cite this article

Lin, H., Ding, H., Guo, FB. et al. Prediction of subcellular location of mycobacterial protein using feature selection techniques. Mol Divers 14, 667–671 (2010). https://doi.org/10.1007/s11030-009-9205-1

  • DOI: https://doi.org/10.1007/s11030-009-9205-1
