Abstract
Motivation: Recognizing and studying DNA patterns is crucial for improving our understanding of disease, cell function, and gene regulation. Motifs determine which transcription factors can bind to a DNA region, so identifying them helps unravel gene expression. Advances in deep learning and high-throughput sequencing have made it possible to revisit motif discovery with greater accuracy and performance.
Methodology: This paper proposes XDeMo (Transformer-based Deep Motifs), a novel deep learning framework for DNA motif mining using Transformer models. It also introduces a hybrid encoding scheme, called ‘blended’ encoding, designed specifically for Transformer models trained on DNA sequences.
Results: The proposed Transformer-based framework, augmented by blended encoding, outperforms many state-of-the-art deep learning models on many baseline performance metrics when trained on the standard datasets. The models predicted motifs robustly, with high discriminative power, precision, recall, and F1 score.
Conclusion: The model’s ability to capture intricate sequence patterns and long-range dependencies led to the discovery of biologically meaningful motifs, which were verified against known transcription factor binding motif databases. This shows that the framework can be used effectively to find DNA motifs and thereby aid downstream analyses for biomedical and biotechnological applications.
Significance
XDeMo has practical implications for gene regulation research, genomics tool development, molecular biology, and diagnostic applications. It offers a foundation for further advances in genomic analysis, with the potential to accelerate discoveries in gene regulation and the development of novel therapeutic strategies.
Data availability
All ChIP-Seq datasets used in this study were downloaded from the ENCODE (Encyclopedia of DNA Elements) database and are freely available at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/. The preprocessing steps applied to these datasets are detailed in the Methods section.
Author information
Contributions
Rajashree Chaurasia and Udayan Ghose conceptualized the model architecture and methodology. Rajashree Chaurasia carried out the literature survey; collected, preprocessed, and analyzed the data; built, trained, and evaluated the models; and wrote the manuscript. Udayan Ghose supervised and reviewed the manuscript preparation.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest. No funding was received for conducting this study.
About this article
Cite this article
Chaurasia, R., Ghose, U. XDeMo: a novel deep learning framework for DNA motif mining using transformer models. Netw Model Anal Health Inform Bioinforma 13, 25 (2024). https://doi.org/10.1007/s13721-024-00463-4
DOI: https://doi.org/10.1007/s13721-024-00463-4