Combining Sequence and Epigenomic Data to Predict Transcription Factor Binding Sites Using Deep Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10847)

Abstract

Knowing the transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and the downstream cellular functions. Convolutional neural networks (CNNs) have outperformed conventional methods in predicting TFBSs from the primary DNA sequence. In addition to DNA sequence, histone modifications and chromatin accessibility are important factors influencing TFBS activity, and both have recently been explored for TFBS prediction. However, current methods rarely integrate histone modifications and chromatin accessibility with sequence in a single CNN framework. To this end, we developed a general CNN model that integrates these data to predict TFBSs. We systematically benchmarked a series of architecture variants by changing the width and depth of the network, and explored the effect of sample length by varying the flanking regions. We evaluated the three types of data and their combinations on 256 ChIP-seq experiments and compared the model with competing machine learning methods. We find that the contributions of the three data types are complementary. Moreover, the integrative CNN framework significantly outperforms traditional machine learning methods.
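The idea of feeding sequence and epigenomic signals to one CNN can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`one_hot`, `build_input`, `conv1d_relu_maxpool`), the channel layout (4 sequence channels, then histone tracks, then one accessibility track), and the toy data are all assumptions made for illustration, and the single hand-written filter stands in for a trained convolutional layer.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases such as N stay all-zero."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

def build_input(seq, histone, accessibility):
    """Stack sequence, histone, and accessibility channels into (length, 4 + H + 1).

    Assumed layout: histone is (length, H) per-position signal, accessibility is (length,).
    """
    histone = np.asarray(histone, dtype=np.float32)
    accessibility = np.asarray(accessibility, dtype=np.float32).reshape(-1, 1)
    if not (len(seq) == histone.shape[0] == accessibility.shape[0]):
        raise ValueError("all inputs must cover the same number of positions")
    return np.concatenate([one_hot(seq), histone, accessibility], axis=1)

def conv1d_relu_maxpool(x, kernel):
    """Slide one filter of shape (width, channels) along x, apply ReLU, then global max pooling."""
    width = kernel.shape[0]
    scores = [np.sum(x[i:i + width] * kernel) for i in range(x.shape[0] - width + 1)]
    return float(np.max(np.maximum(scores, 0.0)))

# Toy example (hypothetical data): 8 bp window, 2 histone marks, 1 accessibility track.
seq = "ACGTNACG"
histone = np.zeros((8, 2))
accessibility = np.ones(8)
x = build_input(seq, histone, accessibility)
feature = conv1d_relu_maxpool(x, np.ones((3, 7), dtype=np.float32))
```

Stacking the data types as channels lets one filter respond jointly to a sequence motif and its local chromatin context; dropping channels from `build_input` gives the single-data-type variants that such a comparison would need.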

Keywords

Bioinformatics; Machine learning; Transcription factors binding sites; Convolutional neural networks; DNA accessibility; Histone modification

Notes

Acknowledgement

Fang Jing thanks the National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, CAS, for its support during his visit. The work was supported by the National Natural Science Foundation of China [No. 61473232 and 91430111 to SWZ; No. 61621003 and 11661141019 to SZ]; the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) [No. XDB13040600]; the Key Research Program of the Chinese Academy of Sciences [No. KFZD-SW-219]; and the CAS Frontier Science Research Key Project for Top Young Scientist [No. QYZDB-SSW-SYS008].

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Key Laboratory of Information Fusion Technology of Ministry of Education, College of Automation, Northwestern Polytechnical University, Xi’an, China
  2. NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  3. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China