A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)

Abstract

Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each dataset independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types, and it leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared towards multi-sample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq datasets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. [1] developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq datasets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization based estimation structure hinders its applicability with large numbers of loci and samples. We address this limitation by developing a MAP-based Asymptotic Derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence makes it practical to explore multiple initialization schemes and allows flexibility in tuning. Comparisons with MBASIC indicate that this speed comes at only a minor loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq datasets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
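To give a sense of what a "K-means-like" algorithm obtained from small-variance asymptotics can look like, the sketch below implements generic DP-means clustering with squared Euclidean distance. It is an illustration only and not the MAD-Bayes MBASIC algorithm, which additionally infers the state-space matrix. The function name `dp_means`, the penalty parameter `lam`, and the choice of distance are assumptions made for this example.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Generic DP-means sketch (hypothetical helper, not the paper's method).

    X   : (n, d) array of points.
    lam : penalty; a point farther than lam (in squared Euclidean
          distance) from every center opens a new cluster.
    """
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0, keepdims=True)   # start with a single cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        old = labels.copy()
        # Assignment step: nearest center, or a new cluster if too far.
        for i, x in enumerate(X):
            d2 = ((centers - x) ** 2).sum(axis=1)
            k = int(d2.argmin())
            if d2[k] > lam:
                centers = np.vstack([centers, x])
                k = len(centers) - 1
            labels[i] = k
        # Update step: drop empty clusters, relabel, recompute the means.
        used, labels = np.unique(labels, return_inverse=True)
        centers = np.vstack([X[labels == j].mean(axis=0)
                             for j in range(len(used))])
        if np.array_equal(labels, old):        # assignments are stable
            break
    return labels, centers
```

Unlike standard K-means, the number of clusters is not fixed in advance: it is governed by `lam`, with larger values yielding fewer clusters. Because each pass is cheap, rerunning such a procedure under several initializations and penalty values is inexpensive.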

Keywords

Small-variance asymptotics; MAD-Bayes; Unified state-space inference and clustering; ChIP-seq

References

  1. Zuo, C., Hewitt, K.J., Bresnick, E.H., Keleş, S.: A hierarchical framework for state-space matrix inference and clustering. Ann. Appl. Stat. (Revised)
  2. The ENCODE project consortium: an integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)
  3. Roadmap epigenomics consortium: integrative analysis of 111 reference human epigenomes. Nature 518(7539), 317–330 (2015)
  4. Bardet, A.F., He, Q., Zeitlinger, J., Stark, A.: A computational pipeline for comparative ChIP-seq analyses. Nat. Protoc. 7(1), 45–61 (2012)
  5. Bao, Y., Vinciotti, V., Wit, E., ’t Hoen, P.A.C.: Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data. BMC Bioinform. 14(1), 169 (2013)
  6. Zeng, X., Sanalkumar, R., Bresnick, E.H., Li, H., Chang, Q., Keleş, S.: jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biol. 14, R38 (2013). Highly accessed. An R package for joint analysis of multiple ChIP-seq datasets. Available in Bioconductor http://bioconductor.org/packages/2.12/bioc/html/jmosaics.html
  7. Kuan, P.F., Chung, D., Pan, G., Thomson, J., Stewart, R., Keleş, S.: A statistical framework for the analysis of ChIP-Seq data. J. Am. Stat. Assoc. 106, 891–903 (2011). Software available on Galaxy http://toolshed.g2.bx.psu.edu/ and also on Bioconductor http://bioconductor.org/packages/2.8/bioc/html/mosaics.html
  8. Bao, Y., Vinciotti, V., Wit, E., ’t Hoen, P.: Joint modeling of ChIP-seq data via a Markov random field model. Biostatistics 15(2), 296–310 (2014)
  9. Chen, K.B., Hardison, R., Zhang, Y.: dCaP: detecting differential binding events in multiple conditions and proteins. BMC Genomics 15(9), 1–14 (2014)
  10. Ernst, J., Kellis, M.: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28(8), 817–825 (2010)
  11. Hoffman, M.M., Buske, O.J., Wang, J., Weng, Z., Bilmes, J.A., Noble, W.S.: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012)
  12. Song, J., Chen, K.C.: Spectacle: fast chromatin state annotation using spectral learning. Genome Biol. 16(1), 33 (2015)
  13. Sohn, K.A., Ho, J.W.K., Djordjevic, D., Jeong, H.H., Park, P.J., Kim, J.H.: hiHMM: Bayesian non-parametric joint inference of chromatin state maps. Bioinformatics, btv117 (2015)
  14. Liang, K., Keleş, S.: Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics 28(1), 121–122 (2012). Available in Bioconductor (http://www.bioconductor.org/packages/2.12/bioc/html/DBChIP.html)
  15. Mahony, S., Edwards, M.D., Mazzoni, E.O., Sherwood, R.I., Kakumanu, A., Morrison, C.A., Wichterle, H., Gifford, D.K.: An integrated model of multiple-condition ChIP-Seq data reveals predeterminants of Cdx2 binding. PLoS Comput. Biol. 10(3), e1003501 (2014)
  16. Song, Q., Smith, A.D.: Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics 27, 870–1 (2011)
  17. Ferguson, J.P., Cho, J.H., Zhao, H.: A new approach for the joint analysis of multiple ChIP-seq libraries with application to histone modification. Stat. Appl. Genet. Mol. Biol. 11(3), Article 1 (2012)
  18. Taslim, C., Huang, T., Lin, S.: DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models. Bioinformatics 27(11), 1569–70 (2011)
  19. Ji, H., Li, X., Wang, Q.F., Ning, Y.: Differential principal component analysis of ChIP-seq. Proc. Nat. Acad. Sci. U.S.A. 110(17), 6789–6794 (2013)
  20. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Met. 39, 1–38 (1977)
  21. Zuo, C., Keleş, S.: A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics 30(6), 853–860 (2014)
  22. Broderick, T., Kulis, B., Jordan, M.: MAD-Bayes: MAP-based asymptotic derivations from Bayes. In: Proceedings of the 30th International Conference on Machine Learning (2013)
  23. Blackwell, D., MacQueen, J.B.: Ferguson distributions via Polya urn schemes. Ann. Stat. 1(2), 353–355 (1973)
  24. Aldous, D.J.: Exchangeability and related topics. In: Hennequin, P.L. (ed.) École d’Été de Probabilités de Saint-Flour XIII, vol. 1117, pp. 1–198. Springer, Heidelberg (1983)
  25. Hewitt, K.J., Kim, D.H., Devadas, P., Prathibha, R., Zuo, C., Sanalkumar, R., Johnson, K.D., Kang, Y.A., Kim, J.S., Dewey, C.N., Keleş, S., Bresnick, E.: Hematopoietic signaling mechanism revealed from a stem/progenitor cell cistrome. Mol. Cell 59(1), 62–74 (2015)
  26. Johnson, K.D., Hsu, A., Ryu, M.J., Boyer, M.E., Keleş, S., Zhang, J., Lee, Y., Holland, S.M., Bresnick, E.H.: Cis-element mutation in a GATA-2-dependent immunodeficiency syndrome governs hematopoiesis and vascular integrity. J. Clin. Inv. 10(122), 3692–3704 (2012)
  27. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
  28. Wei, Y., Li, X., Wang, Q.F., Ji, H.: iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets. BMC Genomics 13, 681 (2012)
  29. Gerstein, M.B., Kundaje, A., Hariharan, M., Landt, S.G., Yan, K.K., Cheng, C., Mu, X.J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A.P., Cayting, P., Charos, A., Chen, D.Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O’Geen, H., Ouyang, Z., Partridge, E.C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T.E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K.Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P.J., Myers, R.M., Weissman, S.M., Snyder, M.: Architecture of the human regulatory network derived from ENCODE data. Nature 489(7414), 91–100 (2012)
  30. Wei, Y., Tenzen, T., Ji, H.: Joint analysis of differential gene expression in multiple studies using correlation motifs. Biostatistics 16(1), 31–46 (2015)
  31. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
  32. Tan, P.N., Steinbach, M., Kumar, V.: Cluster analysis: basic concepts and algorithms. In: Introduction to Data Mining, chap. 8 (2005)
  33. Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E., Bickel, P., Brown, J.B., Cayting, P., et al.: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22(9), 1813–1831 (2012)
  34. Banerjee, A.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, USA
