
Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model

Plant Molecular Biology

Abstract

Key message

This study used k-mer embeddings as effective features to identify DNA N6-methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination, and selection.

Abstract

Identification of DNA N6-methyladenine sites has been a very active topic in computational biology because suitable methods for identifying them accurately are lacking, especially in plants. Substantial results were obtained with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from these language models for use in machine learning prediction models. Skip-gram neural networks formed the basis of the language models, and ensemble tree-based algorithms served as the machine learning algorithms for the prediction models. We trained the prediction model on the Rosaceae genome dataset and performed a comprehensive test on three plant genome datasets. The proposed method shows promising performance, with AUC approaching the ideal value on the Rosaceae dataset (0.99), a high score on the Rice dataset (0.95), and improved performance on the Rice dataset, while enjoying an elegant yet efficient feature extraction process.
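To make the "DNA as text" idea concrete, the following minimal Python sketch shows how a sequence could be split into k-mer "words" and how the (center, context) pairs that a Skip-gram model learns from could be enumerated. It is an illustration only, not the authors' implementation: the function names `kmer_tokens` and `skipgram_pairs`, the choice of overlapping k-mers with stride 1, and the values of `k` and `window` are all assumptions made here for the example.

```python
from typing import List, Tuple


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Assumption: overlapping k-mers; the abstract does not state the stride.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Enumerate (center, context) pairs within a symmetric window.

    A Skip-gram model is trained to predict each context token from its
    center token; `window` here is a hypothetical setting.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sequence, chosen only for illustration.
tokens = kmer_tokens("ACGTAC", k=3)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

In a full pipeline, pairs like these would be used to train an embedding for each k-mer, and the resulting vectors would then serve as input features for a tree-based ensemble classifier.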


Data availability

The data is published and specified in the paper.

Code availability

Source code is provided at https://github.com/khucnam/Deep_Emb_6mA.


Funding

This work was partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. under Grant No. MOST 109-2811-E-155-505 and No. MOST 109-2221-E-155-045.

Author information

Contributions

Conceptualization, TTDN, TVN, NQKL, and YYO; methodology, TTDN and NQKL; formal analysis, TTDN; writing—original draft preparation, TTDN; writing—review and editing, TTDN, TVN, NQKL, and YYO; supervision, YYO; funding acquisition, YYO. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Yu-Yen Ou.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Consent to participate

Not applicable.

Consent for publication

All authors have read and agreed to the published version of the manuscript.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 593 kb)

Cite this article

Nguyen, T.T.D., Trinh, V.N., Le, N.Q.K. et al. Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model. Plant Mol Biol 107, 533–542 (2021). https://doi.org/10.1007/s11103-021-01204-1

  • DOI: https://doi.org/10.1007/s11103-021-01204-1
