i6mA-VC: A Multi-Classifier Voting Method for the Computational Identification of DNA N6-methyladenine Sites

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

DNA N6-methyladenine (6mA) is an essential epigenetic modification that plays an important role in genetic regulation. Efficient and accurate prediction of 6mA sites therefore benefits research in genetics. Biochemical experimental methods are time-consuming and laborious. Most established machine learning methods are built on a single dataset, and although some of them have achieved cross-species prediction, their results are not satisfactory. We therefore designed a novel statistical model, i6mA-VC, to improve the prediction accuracy for 6mA sites. On the one hand, kmer and binary encoding are applied to extract features, and a gradient boosting decision tree (GBDT) embedded method is used as the feature selection strategy. On the other hand, DNA sequences are represented as vectors through ring-function-hydrogen-chemical properties (RFHCP) feature extraction, with ExtraTree as the feature selection strategy. After the two optimal feature sets are fused, a voting classifier based on GBDT, light gradient boosting machine (LightGBM) and multilayer perceptron classifier (MLPC) is constructed for final classification and prediction. With five-fold cross-validation, the accuracies on the rice and M. musculus datasets are 0.888 and 0.967, respectively. With the cross-species dataset used as an independent testing dataset, the accuracy reaches 0.848. These experiments demonstrate that the proposed predictor is convincing and applicable. The i6mA-VC predictor offers an effective way to recognize N6-methyladenine sites and should help geneticists further study gene expression and DNA modification. In addition, an accessible web server for i6mA-VC is available at http://www.zhanglab.site/.
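
To make the encodings named in the abstract concrete, the following is a minimal Python sketch of kmer frequencies, binary (one-hot) encoding, a per-nucleotide chemical property encoding, and a simple soft vote over classifier outputs. It is an illustration under stated assumptions, not the authors' implementation. The (ring, functional group, hydrogen bond) codes follow a convention commonly used for nucleotide chemical properties, and the paper's exact RFHCP definition may differ. The function names are hypothetical, the abstract does not say whether the vote is hard or soft, and the GBDT and ExtraTree feature selection steps are omitted.

```python
from itertools import product

ALPHABET = "ACGT"

# Binary (one-hot) code for each nucleotide.
BINARY = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
          "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

# Assumed chemical property code (ring, functional group, hydrogen bond):
# purine = 1, amino group = 1, weak hydrogen bonding = 1.
CHEMICAL = {"A": (1, 1, 1), "C": (0, 1, 0),
            "G": (1, 0, 0), "T": (0, 0, 1)}


def kmer_frequencies(seq, k=2):
    """Return the normalized frequency of every k-mer in lexicographic order."""
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(windows):
        sub = seq[i:i + k]
        if sub in counts:  # skip windows containing non-ACGT symbols
            counts[sub] += 1
    return [counts[m] / windows for m in kmers]


def encode_per_position(seq, table):
    """Concatenate the per-nucleotide code from `table` along the sequence."""
    return [value for base in seq for value in table[base]]


def soft_vote(probabilities):
    """Average positive-class probabilities; predict 1 if the mean is >= 0.5."""
    mean = sum(probabilities) / len(probabilities)
    return mean, int(mean >= 0.5)


# Fuse the three feature groups for one toy sequence.
seq = "ACGTA"
features = (kmer_frequencies(seq, k=2)
            + encode_per_position(seq, BINARY)
            + encode_per_position(seq, CHEMICAL))

# Combine hypothetical positive-class probabilities from three classifiers.
mean_probability, label = soft_vote([0.9, 0.6, 0.3])
```

In practice the fused vector would pass through feature selection before classification, and the three probabilities would come from fitted GBDT, LightGBM and MLPC models rather than fixed numbers.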

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 11601407), the Natural Science Basic Research Program of Shaanxi (No. 2021JM-115), and the Fundamental Research Funds for the Central Universities (No. JB210715).

Author information

Corresponding author

Correspondence to Shengli Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Cite this article

Xue, T., Zhang, S. & Qiao, H. i6mA-VC: A Multi-Classifier Voting Method for the Computational Identification of DNA N6-methyladenine Sites. Interdiscip Sci Comput Life Sci 13, 413–425 (2021). https://doi.org/10.1007/s12539-021-00429-4

  • DOI: https://doi.org/10.1007/s12539-021-00429-4
