Deciphering the regulatory syntax of genomic DNA with deep learning

Abstract

An organism’s genome contains many sequence regions that perform diverse functions. Examples of such regions include genes, promoters, enhancers, and binding sites for regulatory proteins and RNAs. One of biology’s most important open problems is how to take a genome sequence and predict which regions within it perform different functions. In recent years, deep learning has enabled dramatic advances across many fields by modeling complex relationships between entities. Several deep learning models have also proven successful in predicting the biological function of a portion of DNA from its sequence, revealing new insights into the complex rules underlying genome regulation and opening new possibilities in disease modeling and synthetic biology.

Author information

Corresponding author

Correspondence to Avantika Lal.

Ethics declarations

Disclosures

AL is an employee of Insitro, South San Francisco, USA. Insitro had no involvement in the preparation or submission of this article.

Additional information

Communicated by Kundan Sengupta.

Corresponding editor: Kundan Sengupta

About this article

Cite this article

Lal, A. Deciphering the regulatory syntax of genomic DNA with deep learning. J Biosci 47, 47 (2022). https://doi.org/10.1007/s12038-022-00291-6

  • DOI: https://doi.org/10.1007/s12038-022-00291-6

Keywords

  • Computational biology
  • Deep learning
  • Genome regulation
  • Neural networks