Skip to main content

iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples

Abstract

Meiotic recombination is vital for maintaining the sequence diversity in human genome. Meiosis and recombination are considered the essential phases of cell division. In meiosis, the genome is divided into equal parts for sexual reproduction whereas in recombination, the diverse genomes are combined to form new combination of genetic variations. Recombination process does not occur randomly across the genomes, it targets specific areas called recombination “hotspots” and “coldspots”. Owing to huge exploration of polygenetic sequences in data banks, it is impossible to recognize the sequences through conventional methods. Looking at the significance of recombination spots, it is indispensable to develop an accurate, fast, robust, and high-throughput automated computational model. In this model, the numerical descriptors are extracted using two sequence representation schemes namely: dinucleotide composition and trinucleotide composition. The performances of seven classification algorithms were investigated. Finally, the predicted outcomes of individual classifiers are fused to form ensemble classification, which is formed through majority voting and genetic algorithm (GA). The performance of GA-based ensemble model is quite promising compared to individual classifiers and majority voting-based ensemble model. iRSpot-GAEnsC has achieved 84.46 % accuracy. The empirical results revealed that the performance of iRSpot-GAEnsC is not only higher than the examined algorithms but also better than existing methods in the literature developed so far. It is anticipated that the proposed model might be helpful for research community, academia and for drug discovery.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

References

  • Ahmad S, Kabir M, Hayat M (2015) Identification of heat shock protein families and J-protein types by incorporating dipeptide composition into Chou’s general pseAAC. Comput Methods Programs Biomed. doi:10.1016/j.cmpb.2015.07.005

    PubMed  Google Scholar 

  • Akbar S, Ahmad A, Hayat M (2014) Identification of fingerprint using discrete wavelet transform in conjunction with support vector machine. IJCSI 11(Print):1694–0814

  • ALAllaf ONA (2012) Cascade-forward vs. function fitting neural network for improving image quality and learning time in image compression system. In: Proceedings of the world congress on engineering, pp 4–6

  • Beigi MM, Behjati M, Mohabatkar H (2011) Prediction of metalloproteinase family based on the concept of Chou’s pseudo amino acid composition using a machine learning approach. J Struct Funct Genomics 12:191–197

    Article  Google Scholar 

  • Boulesteix A, Bender A, Bermejo JL, Strobl C (2012) Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations. Brief Bioinform 13:292–304

    Article  PubMed  Google Scholar 

  • Breiman L (2001) Random forests. Machine Learning 45:5–32

    Article  Google Scholar 

  • Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. TIST 2:27

    Article  Google Scholar 

  • Chen W, Lin H, Feng PM, Ding C, Zuo YC, Chou KC (2012) iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS One 7:e47843

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  • Chen W, Feng PM, Lin H, Chou KC (2013) iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res:gks1450

  • Chen W, Feng PM, Deng EZ, Lin H, Chou KC (2014a) iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal Biochem 462:76–83

    Article  PubMed  CAS  Google Scholar 

  • Chen W, Feng PM, Lin H, Chou KC (2014b) iSS-PseDNC: identifying Splicing Sites Using Pseudo Dinucleotide Composition. BioMed Res Int 2014:12

    Google Scholar 

  • Chen W, Lei TY, Jin DC, Lin H, Chou KC (2014c) PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal Biochem 456:53–60

    Article  PubMed  CAS  Google Scholar 

  • Chen W, Zhang X, Brooker J, Lin H, Zhang L, Chou K (2014d) PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics:btu602

  • Chen W, Lin H, Chou KC (2015) Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol BioSyst

  • Cherian M, Sathiyan SP (2012) Neural Network based ACC for Optimized safety and comfort. Int J Comp Appl 42

  • Chou KC (2001a) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: structure. Function, and Bioinformatics 43:246–255

    Article  CAS  Google Scholar 

  • Chou KC (2001b) Using subsite coupling to predict signal peptides. Protein Eng 14:75–79

    Article  PubMed  CAS  Google Scholar 

  • Chou KC (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21:10–19

    Article  PubMed  CAS  Google Scholar 

  • Chou KC (2009) Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr Proteomics 6:262–274

    Article  CAS  Google Scholar 

  • Chou K-C (2011) Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 273:236–247

    Article  PubMed  CAS  Google Scholar 

  • Chou KC (2013) Some remarks on predicting multi-label attributes in molecular biosystems. Mol BioSyst 9:1092–1100

    Article  PubMed  CAS  Google Scholar 

  • Chou KC (2015) Impacts of bioinformatics to medicinal chemistry. Med Chem 11:218–234

    Article  PubMed  CAS  Google Scholar 

  • Chou KC, Shen HB (2007a) Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. J Proteome Res 6:1728–1734

    Article  PubMed  CAS  Google Scholar 

  • Chou KC, Shen HB (2007b) Recent progress in protein subcellular location prediction. Anal Biochem 370:1–16

    Article  PubMed  CAS  Google Scholar 

  • Chou KC, Shen HB (2007c) Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. Biochem Biophys Res Commun 357:633–640

    Article  PubMed  CAS  Google Scholar 

  • Chou KC, Wu ZC, Xiao X (2011) iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One 6:e18258

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  • Chou KC, Wu ZC, Xiao X (2012) iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol BioSyst 8:629–641

    Article  PubMed  CAS  Google Scholar 

  • Dehzangi A, Heffernan R, Sharma A, Lyons J, Paliwal K, Sattar A (2015) Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou’s general PseAAC. J Theor Biol 364:284–294

    Article  PubMed  CAS  Google Scholar 

  • Ding H, Luo L, Lin H (2009) Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition. Protein Pept Lett 16:351–355

    Article  PubMed  CAS  Google Scholar 

  • Ding C, Yuan LF, Guo SH, Lin H, Chen W (2012) Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. J Proteomics 77:321–328

    Article  PubMed  CAS  Google Scholar 

  • Ding H, Deng EZ, Yuan LF, Liu L, Lin H, Chen W, Chou KC (2014) iCTX-Type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed research international 2014

  • Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley

  • Ebina T, Toh H, Kuroda Y (2011) DROP: an SVM domain linker predictor trained with optimal features selected by random forest. Bioinformatics 27:487–494

    Article  PubMed  CAS  Google Scholar 

  • Esmaeili M, Mohabatkar H, Mohsenzadeh S (2010) Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses. J Theor Biol 263:203–209

    Article  PubMed  CAS  Google Scholar 

  • Fang Y, Guo Y, Feng Y, Li M (2008) Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features. Amino Acids 34:103–109

    Article  PubMed  CAS  Google Scholar 

  • Feng PM, Chen W, Lin H, Chou KC (2013) iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal Biochem 442:118–125

    Article  PubMed  CAS  Google Scholar 

  • Georgiou V, Pavlidis N, Parsopoulos K, Alevizos PD, Vrahatis M (2004) Optimizing the performance of probabilistic neural networks in a bioinformatics task. In: Proceedings of the EUNITE 2004 Conference, pp 34–40

  • Georgiou D, Karakasidis TE, Nieto J, Torres A (2009) Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition. J Theor Biol 257:17–26

    Article  PubMed  CAS  Google Scholar 

  • Gu Q, Ding YS, Zhang TL (2010) Prediction of G-protein-coupled receptor classes in low homology using Chou’s pseudo amino acid composition with approximate entropy and hydrophobicity patterns. Protein Pept Lett 17:559–567

    Article  PubMed  CAS  Google Scholar 

  • Guo J, Rao N, Liu G, Yang Y, Wang G (2011) Predicting protein folding rates using the concept of Chou’s pseudo amino acid composition. J Comput Chem 32:1612–1617

    Article  PubMed  CAS  Google Scholar 

  • Guo SH, Deng EZ, Xu LQ, Ding H, Lin H, Chen W, Chou KC (2014) iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics:btu083

  • Han J, Kamber M (2006) Data Mining, Southeast, Asia edn. Concepts and Techniques, Morgan kaufmann

    Google Scholar 

  • Hayat M, Khan A (2011) Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition. J Theor Biol 271:10–17

    Article  PubMed  CAS  Google Scholar 

  • Hayat M, Khan A (2012a) Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou’s PseAAC. Protein Pept Lett 19:411–421

    Article  PubMed  CAS  Google Scholar 

  • Hayat M, Khan A (2012b) Mem-PHybrid: hybrid features-based prediction system for classifying membrane protein types. Anal Biochem 424:35–44

    Article  PubMed  CAS  Google Scholar 

  • Hayat M, Tahir M (2015) PSOFuzzySVM-TMH: identification of transmembrane helix segments using ensemble feature space by incorporated fuzzy support vector machine. Mol BioSyst

  • Hayat M, Khan A, Yeasin M (2012) Prediction of membrane proteins using split amino acid and ensemble classification. Amino Acids 42:2447–2460

    Article  PubMed  CAS  Google Scholar 

  • He X, Han K, Hu J, Yan H, Yang JY, Shen HB, Yu DJ (2015) TargetFreeze: Identifying Antifreeze Proteins via a Combination of Weights using Sequence Evolutionary Information and Pseudo Amino Acid Composition. J Membrane Biol:1–10

  • Jia J, Liu Z, Xiao X, Liu B, Chou KC (2015) iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J Theor Biol 377:47–56

    Article  PubMed  CAS  Google Scholar 

  • Jiang P, Wu H, Wang W, Ma W, Sun X, Lu Z (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 35:W339–W344

    Article  PubMed  PubMed Central  Google Scholar 

  • Keeney S (2008) Spo11 and the formation of DNA double-strand breaks in meiosis. In: Recombination and meiosis. Springer, pp 81–123

  • Khan A (2012) Identifying GPCRs and their types with Chou’s pseudo amino acid composition: an approach from multi-scale energy representation and position specific scoring matrix. Protein Pept Lett 19:890–903

    Article  PubMed  Google Scholar 

  • Khan A, Majid A, Hayat M (2011) CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 35:218–229

    Article  PubMed  CAS  Google Scholar 

  • Khan ZU, Hayat M, Khan MA (2015) Discrimination of acidic and alkaline enzyme using Chou’s pseudo amino acid composition in conjunction with probabilistic neural network model. J Theor Biol 365:197–203

    Article  PubMed  CAS  Google Scholar 

  • Kumar KK, Pugalenthi G, Suganthan P (2009) DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest. J Biomol Struct Dyn 26:679–686

    Article  PubMed  CAS  Google Scholar 

  • Li WC, Deng EZ, Ding H, Chen W, Lin H (2015) iORI-PseKNC: a predictor for identifying origin of replication with pseudo k-tuple nucleotide composition. Chemometr Intell Lab Syst 141:100–106

    Article  CAS  Google Scholar 

  • Lin H, Ding H, Guo FB, Zhang AY, Huang J (2008) Predicting subcellular localization of mycobacterial proteins by using Chou’s pseudo amino acid composition. Protein Pept Lett 15:739–744

    Article  PubMed  CAS  Google Scholar 

  • Lin H, Wang H, Ding H, Chen YL, Li QZ (2009a) Prediction of subcellular localization of apoptosis protein using Chou’s pseudo amino acid composition. Acta Biotheor 57:321–330

    Article  PubMed  Google Scholar 

  • Lin WZ, Xiao X, Chou KC (2009b) GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis. Protein Engineering Design and Selection:gzp057

  • Lin WZ, Fang JA, Xiao X, Chou KC (2012) Predicting secretory proteins of malaria parasite by incorporating sequence evolution information into pseudo amino acid composition via grey system model

  • Lin H, Chen W, Yuan LF, Li ZQ, Ding H (2013) Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor 61:259–268

    Article  PubMed  Google Scholar 

  • Liu G, Liu J, Cui X, Cai L (2012) Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. J Theor Biol 293:49–54

    Article  PubMed  CAS  Google Scholar 

  • Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, Chou KC (2014) iDNA-Prot| dis: Identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition

  • Liu B, Fang L, Liu F, Wang X, Chen J, Chou KC (2015a) Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS One 10:e0121501

    Article  PubMed  PubMed Central  Google Scholar 

  • Liu B, Liu F, Fang L, Wang X, Chou KC (2015b) repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics 31:1307–1309

    Article  PubMed  Google Scholar 

  • Liu B, Fang L, Liu F, Wang X, Chou KC (2015b) iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach. J Biomol Struct and Dynamics:1–13

  • Liu Z, Xiao X, Qiu WR, Chou KC (2015d) iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal Biochem 474:69–77

    Article  PubMed  CAS  Google Scholar 

  • Liu B, Liu F, Fang L, Wang X, Chou KC (2015d) repRNA: a web server for generating various feature vectors of RNA sequences. Mole Genet Genomics:1–9

  • Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC (2015e) Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research:gkv458

  • Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H (2014) Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian Naïve Bayes. PLoS One 9:e86703

    Article  PubMed  PubMed Central  Google Scholar 

  • Lu J, Huang G, Li HP, Feng KY, Chen L, Zheng MY, Cai YD (2014) Prediction of cancer drugs by chemical–chemical interactions. PLoS One 9

  • Mandal M, Mukhopadhyay A, Maulik U (2015) Prediction of protein subcellular localization by incorporating multiobjective PSO-based feature subset selection into the general form of Chou’s PseAAC. Med Biol Eng Comput 53:331–344

    Article  PubMed  Google Scholar 

  • Mohabatkar H (2010) Prediction of cyclin proteins using Chou’s pseudo amino acid composition. Protein Pept Lett 17:1207–1214

    Article  PubMed  CAS  Google Scholar 

  • Mohabatkar H, Mohammad Beigi M, Esmaeili A (2011) Prediction of GABA receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine. J Theor Biol 281:18–23

    Article  PubMed  CAS  Google Scholar 

  • Mohabatkar H, Mohammad Beigi M, Abdolahi K, Mohsenzadeh S (2013) Prediction of allergenic proteins by means of the concept of Chou’s pseudo amino acid composition and a machine learning approach. Med Chem 9:133–137

    Article  PubMed  CAS  Google Scholar 

  • Nanni L, Lumini A, Gupta D, Garg A (2012) Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou’s pseudo amino acid composition and on evolutionary information. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 9:467–475

    Article  Google Scholar 

  • Qiu JD, Huang JH, Liang RP, Lu XQ (2009) Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: an approach from discrete wavelet transform. Anal Biochem 390:68–73

    Article  PubMed  CAS  Google Scholar 

  • Qiu WR, Xiao X, Chou KC (2014a) iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci 15:1746–1766

    Article  PubMed  PubMed Central  Google Scholar 

  • Qiu WR, Xiao X, Lin WZ, Chou KC (2014b) iMethyl-PseAAC: Identification of protein methylation sites via a pseudo amino acid composition approach. BioMed Res Int 2014

  • Qiu WR, Xiao X, Lin WZ, Chou KC (2014c) iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J Biomol Struct Dynamics:1–12

  • Sahu SS, Panda G (2010) A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction. Comput Biol Chem 34:320–327

    Article  PubMed  CAS  Google Scholar 

  • Specht DF (1990) Probabilistic neural networks. Neural networks 3:109–118

    Article  Google Scholar 

  • Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, van Hijum SA (2013) Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle? Brief Bioinform 14:315

    Article  PubMed  PubMed Central  Google Scholar 

  • Vapnik V (2000) The nature of statistical learning theory. Springer

  • Xiao X, Wang P, Chou KC (2009) GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 30:1414–1423

    Article  PubMed  CAS  Google Scholar 

  • Xiao X, Wang P, Chou KC (2011) Quat-2L: a web-server for predicting protein quaternary structural attributes. Mol Diversity 15:149–155

    Article  CAS  Google Scholar 

  • Xiao X, Min JL, Wang P, Chou KC (2013a) iGPCR-Drug: a web server for predicting interaction between GPCRs and drugs in cellular networking. PLoS One 8:e72234

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  • Xiao X, Wang P, Lin WZ, Jia JH, Chou KC (2013b) iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 436:168–177

    Article  PubMed  CAS  Google Scholar 

  • Xiao X, Hui MJ, Liu Z, Qiu WR (2015) iCataly-PseAAC: Identification of enzymes catalytic sites using sequence evolution information with grey model GM (2, 1). The J Memb Biol:1–9

  • Xu Y, Ding J, Wu LY, Chou KC (2013a) iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE 8:e55844

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  • Xu Y, Shao XJ, Wu LY, Deng NY, Chou KC (2013b) iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1:e171

    Article  PubMed  PubMed Central  Google Scholar 

  • Xu R, Zhou J, Liu B, He Y, Zou Q, Wang X, Chou KC (2014a) Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. Journal of Biomolecular Structure and Dynamics:1–11

  • Xu Y, Wen X, Wen LS, Wu LY, Deng NY, Chou KC (2014b) iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition

  • Yuan LF, Ding C, Guo SH, Ding H, Chen W, Lin H (2013) Prediction of the types of ion channel-targeted conotoxins based on radial basis function network. Toxicol In Vitro 27:852–856

    Article  PubMed  CAS  Google Scholar 

  • Zhang YN, Yu DJ, Li SS, Fan YX, Huang Y, Shen HB (2012) Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features. BMC Bioinform 13:118

    Article  CAS  Google Scholar 

  • Zhou XB, Chen C, Li ZC, Zou XY (2007) Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol 248:546–551

    Article  PubMed  CAS  Google Scholar 

  • Zou D, He Z, He J, Xia Y (2011) Supersecondary structure prediction using Chou’s pseudo amino acid composition. J Comput Chem 32:271–278

    Article  PubMed  CAS  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Maqsood Hayat.

Ethics declarations

Conflict of interest

Authors declare that they have no conflict of interests.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors. References

Additional information

Communicated by S. Hohmann.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 1350 kb)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kabir, M., Hayat, M. iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples. Mol Genet Genomics 291, 285–296 (2016). https://doi.org/10.1007/s00438-015-1108-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00438-015-1108-5

Keywords

  • TNC
  • DNC
  • DNA
  • SVM
  • PNN