Selection of relevant features from amino acids enables development of robust classifiers

Amino Acids

Abstract

Machine learning (ML) has been extensively applied to develop models and to understand high-throughput data of biological processes. However, new ML models, trained on novel experimental results, must be built regularly to obtain more precise predictions. ML methods build models from numeric data, whereas biological data are generally textual (DNA, protein sequences) or images and need feature calculation algorithms to generate quantitative features. Developing these algorithms requires programming skills along with domain knowledge. The process of knowledge discovery through ML is therefore slowed by the lack of generic tools to construct features and to build models directly from the data. Hence, we developed a schema that calculates about 5,000 features, selects relevant features and develops protein classifiers from the training data. To demonstrate the general applicability and robustness of our method, we built classifiers for fungal adhesins and nuclear receptor proteins, which outperformed existing classifiers when tested on independent data. Next, we built a classifier for mitochondrial proteins of Plasmodium falciparum, which causes human malaria, because the latest classifiers for these proteins are not publicly accessible. Our classifier attained 98.18% accuracy and a Matthews correlation coefficient of 0.95 by fivefold cross-validation and outperformed existing classifiers on an independent test set. We implemented this schema as a user-friendly, open-source application, Pro-Gyan (http://code.google.com/p/pro-gyan/), to build and share executable classifiers without programming knowledge.
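To illustrate what turning a textual protein sequence into numeric features can look like, the following is a minimal sketch of amino acid composition (AAC), the simplest of the feature types listed in the abbreviations. It is not the Pro-Gyan implementation: the function name `amino_acid_composition` and the toy sequence are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in `sequence` (20 values).

    Hypothetical helper for illustration; non-standard characters are ignored.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One feature per residue type, in the fixed order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence (an assumption, not data from the article).
features = amino_acid_composition("MKVLAAGIVK")
```

Each protein then becomes a fixed-length numeric vector regardless of its sequence length, which is the form that ML methods such as SVMs expect as input.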

Abbreviations

  • ML: Machine learning
  • MP: Mitochondrial proteins
  • MCC: Matthews correlation coefficient
  • ANN: Artificial neural network
  • SVM: Support vector machine
  • FCA: Feature calculation algorithms
  • FSM: Feature selection methods
  • FCBF: Fast correlation-based filter
  • PF: Plasmodium falciparum
  • nrPfM165: Non-redundant training set
  • nrPfM205: Non-redundant test set
  • API: Application programming interface
  • AAC: Amino acid composition
  • AAPC: Amino acid pair composition
  • AC: Auto-correlation
  • CDTd: Composition, transition and distribution descriptors
  • PseAAC_T1: Pseudo amino acid composition Type 1
  • PseAAC_T2: Pseudo amino acid composition Type 2
  • SOD: Sequence-order-coupling descriptors
  • CD: Charge distribution
  • IDP: Intrinsically disordered proteins
  • FWF: FoldIndex window-based features
  • MAE: Mean absolute error
  • TF: Top features
  • JRE: Java runtime environment
  • GUI: Graphical user interface
  • PSSM: Position-specific scoring matrix
  • FK-NN: Fuzzy K nearest neighbor
  • NR: Nuclear receptor
  • ROC: Receiver operating characteristic
  • Pgc: Pro-Gyan classifier
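
For reference, the Matthews correlation coefficient (MCC) reported in the abstract is conventionally computed from the confusion-matrix counts of a binary classifier. The symbols TP, TN, FP and FN (true positives, true negatives, false positives, false negatives) are standard notation assumed here, not taken from this page:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```

MCC ranges from -1 to +1: a value of +1 indicates perfect prediction, and 0 indicates performance no better than chance. Unlike accuracy, it remains informative when the positive and negative classes differ greatly in size.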


Acknowledgments

RDR acknowledges HP and the CSIR Genesis project (BSC-0121) for funding. DD acknowledges the CSIR Genesis project (BSC-0121) for funding through IGIB. The authors would also like to thank Ritwick Pal for providing the image of the main window of Pro-Gyan.

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Corresponding author

Correspondence to Debasis Dash.

Additional information

Work performed at: CSIR-Institute of Genomics and Integrative Biology.

About this article

Cite this article

Das Roy, R., Dash, D. Selection of relevant features from amino acids enables development of robust classifiers. Amino Acids 46, 1343–1351 (2014). https://doi.org/10.1007/s00726-014-1697-z

  • DOI: https://doi.org/10.1007/s00726-014-1697-z
