XDeMo: a novel deep learning framework for DNA motif mining using transformer models

Chaurasia, Rajashree; Ghose, Udayan

doi:10.1007/s13721-024-00463-4

Original Article
Published: 13 May 2024

Volume 13, article number 25, (2024)
Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

Motivation: Recognizing and studying DNA sequence patterns is crucial for improving our understanding of disease, cell function, and gene regulation. Motifs determine where on the DNA a transcription factor protein may bind, and identifying them leads to a better understanding of gene expression. Advances in deep learning and high-throughput sequencing have made it possible to revisit motif discovery with greater accuracy and performance.

Methodology: This paper proposes XDeMo (Transformer-based Deep Motifs), a novel deep learning framework for DNA motif mining using Transformer models. It also introduces a hybrid encoding scheme, called ‘blended’ encoding, designed specifically for deep learning transformer models trained on DNA sequences.

Results: The proposed transformer-based framework for DNA motif discovery, augmented by blended encoding, outperforms many state-of-the-art deep learning models on many baseline performance metrics when trained on the standard datasets. The models demonstrated robust performance in predicting motifs, with high discriminative power, precision, recall, and F1 score.

Conclusion: The model’s ability to capture intricate sequence patterns and long-range dependencies led to the discovery of biologically meaningful motifs that were verified against known transcription factor binding motif databases. This shows that the framework can be used effectively to find DNA motifs and thereby aid downstream analyses for biomedical and biotechnological applications.
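The precision, recall, and F1 score named in the Results follow their standard definitions for a binary task such as bound versus unbound sequences. The sketch below is only an illustration of those definitions: the function names and the label vectors are hypothetical and are not taken from the paper, and the numbers are not results from XDeMo.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; each value falls back to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = sequence contains a binding site, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high, which is why it is commonly reported alongside them.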

Significance

XDeMo’s practical implications span the realms of gene regulation research, genomics tool development, molecular biology, and diagnostic applications. It offers a robust foundation for further advancements in genomic analysis, with the potential to accelerate discoveries in gene regulation and the development of novel therapeutic strategies.

Data availability

All ChIP-Seq datasets used in this study were downloaded from the ENCODE (Encyclopedia of DNA Elements) database and are freely available at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/. The preprocessing steps performed on these datasets are detailed in the Methods section.
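As a rough illustration of working with such data, the sketch below parses one record in the ENCODE narrowPeak layout (tab-separated, ten columns) and derives a fixed-width window around the peak summit. It assumes the downloaded files use that layout. The 101 bp window, the `Peak` class, the helper names, and the example record are all hypothetical, and this is not the preprocessing described in the paper's Methods section.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Peak:
    chrom: str
    start: int          # 0-based start of the peak region
    end: int            # end of the peak region (exclusive)
    signal: float       # signalValue column
    summit_offset: int  # summit position relative to start, or -1 if not called


def parse_narrowpeak_line(line: str) -> Peak:
    """Parse one tab-separated narrowPeak record (10 columns expected)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return Peak(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        signal=float(fields[6]),
        summit_offset=int(fields[9]),
    )


def centered_window(peak: Peak, width: int = 101) -> Tuple[int, int]:
    """Return (start, end) of a fixed-width window centred on the summit.

    Falls back to the region midpoint when no summit was called (-1).
    The width of 101 bp is an assumption made for this example only.
    """
    if peak.summit_offset >= 0:
        center = peak.start + peak.summit_offset
    else:
        center = (peak.start + peak.end) // 2
    win_start = max(0, center - width // 2)
    return win_start, win_start + width


# Made-up record for illustration; not taken from any ENCODE file.
example = "chr1\t1000\t1400\t.\t850\t.\t25.3\t-1\t4.2\t150"
peak = parse_narrowpeak_line(example)
window = centered_window(peak)
print(peak.chrom, window)
```

Extracting the actual nucleotide sequence for each window would additionally require the hg19 reference genome, which is outside the scope of this sketch.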

Author information

Authors and Affiliations

Department of Computer Engineering, Directorate of Training and Technical Education (Govt. of NCT of Delhi), Guru Nanak Dev DSEU Rohini Campus, Delhi, India
Rajashree Chaurasia
University School of Information, Communication & Technology, Guru Gobind Singh Indraprastha University, Delhi, India
Rajashree Chaurasia & Udayan Ghose

Contributions

Rajashree Chaurasia and Udayan Ghose conceptualized the model architecture and methodology. Rajashree Chaurasia carried out the literature survey; data collection, preprocessing, and analysis; and model construction, training, and evaluation. She also wrote the manuscript. Udayan Ghose supervised and reviewed the manuscript preparation.

Corresponding author

Correspondence to Rajashree Chaurasia.

Ethics declarations

Conflicts of interest

The authors declare no conflicts of interest. No funding was received for conducting this study.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chaurasia, R., Ghose, U. XDeMo: a novel deep learning framework for DNA motif mining using transformer models. Netw Model Anal Health Inform Bioinforma 13, 25 (2024). https://doi.org/10.1007/s13721-024-00463-4

Received: 13 October 2023
Revised: 26 April 2024
Accepted: 26 April 2024
Published: 13 May 2024
DOI: https://doi.org/10.1007/s13721-024-00463-4
