Mycobacterium tuberculosis is the primary pathogen causing tuberculosis, which is one of the most prevalent infectious diseases. The subcellular location of mycobacterial proteins can provide essential clues for proteins function research and drug discovery. Therefore, it is highly desirable to develop a computational method for fast and reliable prediction of subcellular location of mycobacterial proteins. In this study, we developed a support vector machine (SVM) based method to predict subcellular location of mycobacterial proteins. A total of 444 non-redundant mycobacterial proteins were used to train and test proposed model by using jackknife cross validation. By selecting traditional pseudo amino acid composition (PseAAC) as parameters, the overall accuracy of 83.3% was achieved. Moreover, a feature selection technique was developed to find out an optimal amount of PseAAC for improving predictive performance. The optimal amount of PseAAC improved overall accuracy from 83.3 to 87.2%. In addition, the reduced amino acids in N-terminus and non N-terminus of proteins were combined in models for further improving predictive successful rate. As a result, the maximum overall accuracy of 91.2% was achieved with average accuracy of 79.7%. The proposed model provides highly useful information for further experimental research. The prediction model can be accessed free of charge at http://cobi.uestc.edu.cn/cobi/people/hlin/webserver.
Shen HB, Chou KC (2007) Virus-PLoc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells. Biopolymers 85: 233–240. doi:10.1002/bip.20640CrossRefPubMedGoogle Scholar
González-Díaz H, Pérez-Bello A, Uriarte E (2005) Stochastic molecular descriptors for polymers. 3. Markov electrostatic moments as polymer 2D-folding descriptors: RNA-QSAR for mycobacterial promoters. Polymer 46: 6461–6473. doi:10.1016/j.polymer.2005.04.104CrossRefGoogle Scholar
González-Díaz H, Pérez-Bello A, Cruz-Monteagudo M, González-Díaz Y, Santana L, Uriarte E (2007) Chemometrics for QSAR with low sequence homology: mycobacterial promoter sequences recognition with 2D-RNA entropies. Chemom Intell Lab Syst 85: 20–26. doi:10.1016/j.chemolab.2006.03.005CrossRefGoogle Scholar
Perez-Bello A, Munteanu CR, Ubeira FM, De Magalhães AL, Uriarte E, González-Díaz H (2009) Alignment-free prediction of mycobacterial DNA promoters based on pseudo-folding lattice network or star-graph topological indices. J Theor Biol 256: 458–466. doi:10.1016/j.jtbi.2008.09.035CrossRefPubMedGoogle Scholar
Rashid M, Saha S, Raghava GPS (2007) Support vector machine-based method for predicting subcellular localization of mycobacterial proteins using evolutional information and motifs. BMC Bioinformatics 8: 337. doi:10.1186/1471-2105-8-337CrossRefPubMedGoogle Scholar
Pánek J, Eidhammer I, Aasland R (2005) A new method for identification of protein (Sub)families in a set of proteins based on hydropathy distribution in proteins. Proteins 58: 923–934. doi:10.1002/prot.20356CrossRefPubMedGoogle Scholar
Agüero-Chapin G, González-Díaz H, Molina R, Varona-Santos J, Uriarte E, González-Díaz Y (2006) Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett 580: 723–730. doi:10.1016/j.febslet.2005.12.072CrossRefPubMedGoogle Scholar