Abstract
Machine learning (ML) has been extensively applied to develop models and to understand high-throughput data of biological processes. However, new ML models, trained with novel experimental results, are required to build regularly for more precise predictions. ML methods can build models from numeric data, whereas biological data are generally textual (DNA, protein sequences) or images and needs feature calculation algorithms to generate quantitative features. Programming skills along with domain knowledge are required to develop these algorithms. Therefore, the process of knowledge discovery through ML is decelerated due to lack of generic tools to construct features and to build models directly from the data. Hence, we developed a schema that calculates about 5,000 features, selects relevant features and develops protein classifiers from the training data. To demonstrate the general applicability and robustness of our method, fungal adhesins and nuclear receptor proteins were used for building classifiers which outperformed existing classifiers when tested on independent data. Next, we built a classifier for mitochondrial proteins of Plasmodium falciparum which causes human malaria because the latest corresponding classifiers are not publically accessible. Our classifier attained 98.18 % accuracy and 0.95 Matthews correlation coefficient by fivefold cross-validation and outperformed existing classifiers on independent test set. We implemented this schema as user-friendly and open source application Pro-Gyan (http://code.google.com/p/pro-gyan/), to build and share executable classifiers without programming knowledge.
Abbreviations
- ML:
-
Machine learning
- MP:
-
Mitochondrial proteins
- MCC:
-
Matthews correlation coefficient
- ANN:
-
Artificial Neural Network
- SVM:
-
Support vector machine
- FCA:
-
Feature calculation algorithms
- FSM:
-
Feature selection methods
- FCBF:
-
Fast correlation-based feature
- PF:
-
Plasmodium falciparum
- nrPfM165:
-
Non-redundant training set
- nrPfM205:
-
Non-redundant test set
- API:
-
Application programming interface
- AAC:
-
Amino acid composition
- AAPC:
-
Amino acid pair composition
- AC:
-
Auto-correlation
- CDTd:
-
Composition, transition and distribution descriptors
- PseAAC_T1:
-
Pseudo amino acid composition Type1
- PseAAC_T2:
-
Pseudo amino acid composition Type2
- SOD:
-
Sequence-order-coupling descriptors
- CD:
-
Charge distribution
- IDP:
-
Intrinsically disordered proteins
- FWF:
-
FoldIndex window-based features
- MAE:
-
Mean absolute error
- TF:
-
Top features
- JRE:
-
Java runtime environment
- GUI:
-
Graphics user interface
- PSSM:
-
Position-specific scoring matrix
- FK-NN:
-
Fuzzy K nearest neighbor
- NR:
-
Nuclear receptor
- ROC:
-
Receiver operating characteristic
- Pgc:
-
Pro-Gyan classifier
References
Arvey A, Agius P, Noble WS, Leslie C (2012) Sequence and chromatin determinants of cell-type—specific transcription factor binding. Genome Res 22(9):1723–1734
Atkinson GC, Kuzmenko A, Kamenski P, Vysokikh MY, Lakunina V, Tankov S, Smirnova E, Soosaar A, Tenson T, Hauryliuk V (2012) Evolutionary and genetic analyses of mitochondrial translation initiation factors identify the missing mitochondrial IF3 in S. cerevisiae. Nucleic Acids Res 40(13):6122–6134
Bánfai B, Jia H, Khatun J, Wood E, Risk B, Gundling WE, Kundaje A, Gunawardena HP, Yu Y, Xie L, Krajewski K, Strahl BD, Chen X, Bickel P, Giddings MC, Brown JB, Lipovich L (2012) Long noncoding RNAs are rarely translated in two human cell lines. Genome Res 22(9):1646–1657
Bender A, van Dooren GG, Ralph SA, McFadden GI, Schneider G (2003) Properties and prediction of mitochondrial transit peptides from Plasmodium falciparum. Mol Biochem Parasitol 132(2):59–66
Bum Ju L, Keun Ho R (2008) Feature extraction from protein sequences and classification of enzyme function. In: International conference on biomedical engineering and informatics, 2008. BMEI 2008, 27–30 May 2008, pp 138–142
Cai CZ, Han LY, Ji ZL, Chen YZ (2004) Enzyme family classification by support vector machines. Proteins: Struct, Funct, Bioinf 55(1):66–76
Cao DS, Xu QS, Liang YZ (2013) Propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics 29(7):960–962
Chen YW, Lin CJ (2006) Combining SVMs with various feature selection strategies. In: Guyon I, Nikravesh M, Gunn S, Zadeh L (eds) Feature extraction, vol 207., Studies in fuzziness and soft computingSpringer, Berlin, pp 315–324
Chen YL, Li QZ, Zhang LQ (2012) Using increment of diversity to predict mitochondrial proteins of malaria parasite: integrating pseudo-amino acid composition and structural alphabet. Amino Acids 42(4):1309–1316
Chih-Chung C, Chih-Jen L (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27. doi:10.1145/1961189.1961199
Chou KC (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21(1):10–19
Chou KC, Cai YD (2005) Prediction of membrane protein types by incorporating amphipathic effects. J Chem Inf Model 45(2):407–413. doi:10.1021/ci049686v10.1021/ci049686v
Dunker AK, Silman I, Uversky VN, Sussman JL (2008) Function and structure of inherently disordered proteins. Curr Opin Struct Biol 18(6):756–764
Emanuelsson O, Nielsen H, S Brunak, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300(4):1005–1016
Emanuelsson O, von Heijne G, Schneider G (2001) Analysis and prediction of mitochondrial targeting peptides. Methods Cell Biol 65:175–187
Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A (2005) Protein identification and analysis tools on the ExPASy server. In: Walker JM (ed) The proteomics protocols handbook. Humana press Inc., New York, pp 571–607
Guda C, Fahy E, Subramaniam S (2004) MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics 20(11):1785–1794
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Hammen PK, Weiner H (1998) Mitochondrial leader sequences: structural similarities and sequence differences. J Exp Zool 282(1–2):280–283
Han LY, Cai CZ, Lo SL, Chung MCM, Chen YZ (2004) Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA 10(3):355–368
Horton P, Park K-J, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K (2007) WoLF PSORT: protein localization predictor. Nucleic Acids Res 35(suppl 2):W585–W587
Jia C, Liu T, Chang AK, Zhai Y (2011) Prediction of mitochondrial proteins of malaria parasite using bi-profile Bayes feature extraction. Biochimie 93(4):778–782
Kumar M, Verma R, Raghava GPS (2006) Prediction of mitochondrial proteins using support vector machine and hidden Markov model. J Biol Chem 281(9):5357–5363
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ (2006) PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 34(suppl 2):W32–W37
Li ZC, Zhou XB, Lin YR, Zou XY (2008) Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 35(3):581–590
Muggleton SH (2006) 2020 Computing: exceeding human limits. Nature 440(7083):409–410
Murray CJL, Rosenfeld LC, Lim SS, Andrews KG, Foreman KJ, Haring D, Fullman N, Naghavi M, Lozano R, Lopez AD (2012) Global malaria mortality between 1980 and 2010: a systematic analysis. Lancet 379(9814):413–431
Oehring SC, Woodcroft BJ, Moes S, Wetzel J, Dietz O, Pulfer A, Dekiwadia C, Maeser P, Flueck C, Witmer K (2012) Organellar proteomics reveals hundreds of novel nuclear proteins in the malaria parasite Plasmodium falciparum. Genome Biol 13(11):R108
Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg EH, Man O, Beckmann JS, Silman I, Sussman JL (2005) FoldIndex©: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21(16):3435–3438
Quinlan JR (1993) C4 5: programs for machine learning. Morgan Kaufmann, Burlington, Massachusetts, United States
Ramana J, Gupta D (2010) Faapred: a SVM-based prediction method for fungal adhesins and adhesin-like proteins. PLoS ONE 5(3):e9695
Saeys Y, Inza I, Larrańaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Shamim MTA, Anwaruddin M, Nagarajaram HA (2007) Support vector machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs. Bioinformatics 23(24):3320–3327
Shen H-B, Chou K-C (2008) PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 373(2):386–388
Singh GP, Dash D (2008) How expression level influences the disorderness of proteins, vol 371. Elsevier, Amsterdam
Smialowski P, Frishman D, Kramer S (2010) Pitfalls of supervised feature selection. Bioinformatics 26(3):440–443. doi:10.1093/bioinformatics/btp621
Uversky VN, Gillespie JR, Fink AL (2000) Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins 41(3):415–427
Vapnik V (1999) The nature of statistical learning theory, 2nd edn. Springer, Heidelberg
Verma R, Varshney G, Raghava GPS (2010) Prediction of mitochondrial proteins of malaria parasite using split amino acid composition and PSSM profile. Amino Acids 39(1):101–110
Wang P, Xiao X, Chou K-C (2011) NR-2L: a two-level predictor for identifying nuclear receptor subfamilies based on sequence-derived features. PLoS ONE 6(8):e23505
Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Machine learning-international workshop then conference, 2003, p 856
Acknowledgments
RDR acknowledges HP and CSIR Genesis project (BSC-0121) for funding. DD acknowledges CSIR Genesis project (BSC-0121) for funding through IGIB. The authors also like to thank Ritwick Pal for providing the image of the main window of Pro-Gyan.
Conflict of interest
The authors declare that they have no conflict of interest.
Author information
Authors and Affiliations
Corresponding author
Additional information
Work performed at: CSIR-Institute of Genomics and Integrative Biology.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Das Roy, R., Dash, D. Selection of relevant features from amino acids enables development of robust classifiers. Amino Acids 46, 1343–1351 (2014). https://doi.org/10.1007/s00726-014-1697-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00726-014-1697-z