A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets

Part of the Lecture Notes in Computer Science book series (LNBI, volume 9649)

Abstract

Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each dataset independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared towards multi-sample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq datasets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, [1] developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq datasets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization based estimation structure hinders its applicability with large numbers of loci and samples. We address this limitation by developing a MAP-based Asymptotic Derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparisons with MBASIC indicate that this speed comes at a relatively small loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq datasets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.

Keywords

  • Small-variance asymptotics
  • MAD-Bayes
  • Unified state-space inference and clustering
  • ChIP-seq

References

  1. Zuo, C., Hewitt, K.J., Bresnick, E.H., Keleş, S.: A hierarchical framework for state-space matrix inference and clustering. Ann. Appl. Stat. (Revised)

  2. The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)

  3. Roadmap Epigenomics Consortium: Integrative analysis of 111 reference human epigenomes. Nature 518(7539), 317–330 (2015)

  4. Bardet, A.F., He, Q., Zeitlinger, J., Stark, A.: A computational pipeline for comparative ChIP-seq analyses. Nat. Protoc. 7(1), 45–61 (2012)

  5. Bao, Y., Vinciotti, V., Wit, E., ’t Hoen, P.A.C.: Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data. BMC Bioinform. 14(1), 169 (2013)

  6. Zeng, X., Sanalkumar, R., Bresnick, E.H., Li, H., Chang, Q., Keleş, S.: jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biol. 14, R38 (2013). An R package for joint analysis of multiple ChIP-seq datasets. Available in Bioconductor http://bioconductor.org/packages/2.12/bioc/html/jmosaics.html

  7. Kuan, P.F., Chung, D., Pan, G., Thomson, J., Stewart, R., Keleş, S.: A statistical framework for the analysis of ChIP-Seq data. J. Am. Stat. Assoc. 106, 891–903 (2011). Software available on Galaxy http://toolshed.g2.bx.psu.edu/ and also on Bioconductor http://bioconductor.org/packages/2.8/bioc/html/mosaics.html

  8. Bao, Y., Vinciotti, V., Wit, E., ’t Hoen, P.: Joint modeling of ChIP-seq data via a Markov random field model. Biostatistics 15(2), 296–310 (2014)

  9. Chen, K.B., Hardison, R., Zhang, Y.: dCaP: detecting differential binding events in multiple conditions and proteins. BMC Genomics 15(9), 1–14 (2014)

  10. Ernst, J., Kellis, M.: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28(8), 817–825 (2010)

  11. Hoffman, M.M., Buske, O.J., Wang, J., Weng, Z., Bilmes, J.A., Noble, W.S.: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012)

  12. Song, J., Chen, K.C.: Spectacle: fast chromatin state annotation using spectral learning. Genome Biol. 16(1), 33 (2015)

  13. Sohn, K.A., Ho, J.W.K., Djordjevic, D., Jeong, H.H., Park, P.J., Kim, J.H.: hiHMM: Bayesian non-parametric joint inference of chromatin state maps. Bioinformatics, btv117 (2015)

  14. Liang, K., Keleş, S.: Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics 28(1), 121–122 (2012). Available in Bioconductor (http://www.bioconductor.org/packages/2.12/bioc/html/DBChIP.html)

  15. Mahony, S., Edwards, M.D., Mazzoni, E.O., Sherwood, R.I., Kakumanu, A., Morrison, C.A., Wichterle, H., Gifford, D.K.: An integrated model of multiple-condition ChIP-Seq data reveals predeterminants of Cdx2 binding. PLoS Comput. Biol. 10(3), e1003501 (2014)

  16. Song, Q., Smith, A.D.: Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics 27, 870–1 (2011)

  17. Ferguson, J.P., Cho, J.H., Zhao, H.: A new approach for the joint analysis of multiple ChIP-seq libraries with application to histone modification. Stat. Appl. Genet. Mol. Biol. 11(3), Article 1 (2012)

  18. Taslim, C., Huang, T., Lin, S.: DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models. Bioinformatics 27(11), 1569–70 (2011)

  19. Ji, H., Li, X., Wang, Q.F., Ning, Y.: Differential principal component analysis of ChIP-seq. Proc. Nat. Acad. Sci. U.S.A. 110(17), 6789–6794 (2013)

  20. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Met. 39, 1–38 (1977)

  21. Zuo, C., Keleş, S.: A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics 30(6), 853–860 (2014)

  22. Broderick, T., Kulis, B., Jordan, M.: MAD-Bayes: MAP-based asymptotic derivations from Bayes. In: Proceedings of the 30th International Conference on Machine Learning (2013)

  23. Blackwell, D., MacQueen, J.B.: Ferguson distributions via Polya urn schemes. Ann. Stat. 1(2), 353–355 (1973)

  24. Aldous, D.J.: Exchangeability and related topics. In: Hennequin, P.L. (ed.) École d’Été de Probabilités de Saint-Flour XIII, vol. 1117, pp. 1–198. Springer, Heidelberg (1983)

  25. Hewitt, K.J., Kim, D.H., Devadas, P., Prathibha, R., Zuo, C., Sanalkumar, R., Johnson, K.D., Kang, Y.A., Kim, J.S., Dewey, C.N., Keleş, S., Bresnick, E.: Hematopoietic signaling mechanism revealed from a stem/progenitor cell cistrome. Mol. Cell 59(1), 62–74 (2015)

  26. Johnson, K.D., Hsu, A., Ryu, M.J., Boyer, M.E., Keleş, S., Zhang, J., Lee, Y., Holland, S.M., Bresnick, E.H.: Cis-element mutation in a GATA-2-dependent immunodeficiency syndrome governs hematopoiesis and vascular integrity. J. Clin. Invest. 122(10), 3692–3704 (2012)

  27. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  28. Wei, Y., Li, X., Wang, Q.F., Ji, H.: iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets. BMC Genomics 13, 681 (2012)

  29. Gerstein, M.B., Kundaje, A., Hariharan, M., Landt, S.G., Yan, K.K., Cheng, C., Mu, X.J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A.P., Cayting, P., Charos, A., Chen, D.Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O’Geen, H., Ouyang, Z., Partridge, E.C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T.E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K.Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P.J., Myers, R.M., Weissman, S.M., Snyder, M.: Architecture of the human regulatory network derived from ENCODE data. Nature 489(7414), 91–100 (2012)

  30. Wei, Y., Tenzen, T., Ji, H.: Joint analysis of differential gene expression in multiple studies using correlation motifs. Biostatistics 16(1), 31–46 (2015)

  31. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

  32. Tan, P.N., Steinbach, M., Kumar, V.: Cluster analysis: basic concepts and algorithms. In: Introduction to Data Mining, chap. 8 (2005)

  33. Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E., Bickel, P., Brown, J.B., Cayting, P., et al.: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22(9), 1813–1831 (2012)

  34. Banerjee, A.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)

Author information

Corresponding author

Correspondence to Sündüz Keleş.

Appendix

Fig. 4. A graphical interpretation of the conjugacy between \(\lambda _r\) and J. We use the K-means initialization to compute surrogate values of L(J) for a large collection of \(J \ge 1\). A \(\lambda _r\) value that yields J clusters in the global solution must satisfy \(\sup _{J'>J}\frac{L(J)-L(J')}{J'-J}\le \lambda _r\le \inf _{J'<J}\frac{L(J')-L(J)}{J-J'}\). When \(\lambda _r\) satisfies this condition, a line with slope \(-\lambda _r\) passing through \((J, L(J))\) on the graph is tangent to the trace of all L(J) values. Because the surrogate L(J) values can make the curve connecting them non-convex, such a \(\lambda _r\) may not exist for some J; a convex approximation to the trace of L(J) ensures that a \(\lambda _r\) exists for each J. A simpler approach is to order the L(J) from largest to smallest and require \(L(J) - L(J+1) \le \lambda _r \le L(J-1)-L(J)\). Algorithm 2 essentially applies this idea to select the \(\lambda _r\) values: each J corresponds to a \(\lambda _r\) of value \([L(J-1)-L(J+1)]/2\), the midpoint of this interval, that satisfies the conjugacy inequality. The algorithm essentially tries to identify the range of \(\lambda _r\) that yields up to \(\sqrt{I}\) clusters.
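The simpler ordering rule in the Fig. 4 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Algorithm 2: the function name `lambda_candidates` is hypothetical, the loss values are made up, and the surrogate L(J) values are assumed to be supplied as a list indexed from J = 1.

```python
def lambda_candidates(loss_by_j):
    """Map each interior J to (lambda_r, satisfies_bounds).

    loss_by_j[k] is assumed to hold the surrogate L(J) for J = k + 1.
    The name and interface are hypothetical, chosen for illustration.
    """
    # Order the L(J) values from largest to smallest, as in the caption,
    # so that L(J) is non-increasing in J.
    ordered = sorted(loss_by_j, reverse=True)
    candidates = {}
    # Interior J only: both L(J-1) and L(J+1) must exist.
    for idx in range(1, len(ordered) - 1):
        j = idx + 1
        lower = ordered[idx] - ordered[idx + 1]   # L(J) - L(J+1)
        upper = ordered[idx - 1] - ordered[idx]   # L(J-1) - L(J)
        # Midpoint of [lower, upper], equal to [L(J-1) - L(J+1)] / 2.
        lam = (ordered[idx - 1] - ordered[idx + 1]) / 2.0
        candidates[j] = (lam, lower <= lam <= upper)
    return candidates


if __name__ == "__main__":
    # Made-up surrogate losses for J = 1..5 with a convex trace.
    for j, (lam, ok) in lambda_candidates([100.0, 60.0, 40.0, 30.0, 25.0]).items():
        print(f"J={j}: lambda_r={lam}, within bounds: {ok}")
```

When the trace of L(J) is convex, every midpoint falls inside its interval. When it is not, the interval for some J is empty and the flag is `False`, which is the case the caption's convex approximation is meant to handle.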

Fig. 5. Comparison of clustering accuracy, measured by the adjusted Rand index, with singleton loci excluded.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zuo, C., Chen, K., Keleş, S. (2016). A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. In: Singh, M. (ed.) Research in Computational Molecular Biology. RECOMB 2016. Lecture Notes in Computer Science, vol 9649. Springer, Cham. https://doi.org/10.1007/978-3-319-31957-5_2

  • DOI: https://doi.org/10.1007/978-3-319-31957-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31956-8

  • Online ISBN: 978-3-319-31957-5

  • eBook Packages: Computer Science, Computer Science (R0)