A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets
- Cite this paper as:
- Zuo C., Chen K., Keleş S. (2016) A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. In: Singh M. (eds) Research in Computational Molecular Biology. RECOMB 2016. Lecture Notes in Computer Science, vol 9649. Springer, Cham
Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each dataset independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared towards multi-sample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq datasets for query loci (e.g., thousands of genomic loci with a specific binding site). A recently developed hierarchical framework for state-space matrix inference and clustering, named MBASIC, enables joint analysis of user-specified loci across multiple ChIP-seq datasets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization based estimation structure hinders its applicability with large numbers of loci and samples. We address this limitation by developing a MAP-based Asymptotic Derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparisons with MBASIC indicate that this speed comes at a relatively small loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq datasets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
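To give a feel for what a "K-means-like optimization algorithm" obtained from small-variance asymptotics can look like, the sketch below implements DP-means, a well-known algorithm of this family. It is not the MAD-Bayes MBASIC algorithm, which also infers the state-space matrix; it only illustrates the general pattern of hard assignments, mean updates, and a penalty that decides when a new cluster is opened. The names `dp_means`, `sq_dist`, and `penalty` are illustrative assumptions and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_means(
    points: Sequence[Sequence[float]],
    penalty: float,
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Generic DP-means sketch (hypothetical helper, not the paper's method).

    A point whose squared distance to every existing center exceeds
    `penalty` opens a new cluster centered on itself.
    """
    n, dim = len(points), len(points[0])
    # Start with a single cluster at the global mean.
    centers = [[sum(p[d] for p in points) / n for d in range(dim)]]
    labels = [0] * n

    for _ in range(max_iter):
        changed = False
        # Assignment step: nearest center, or a new cluster if too far.
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > penalty:
                centers.append(list(p))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True

        # Update step: recompute means and drop empty clusters.
        new_centers: List[List[float]] = []
        remap = {}
        for k in range(len(centers)):
            members = [points[i] for i in range(n) if labels[i] == k]
            if members:
                remap[k] = len(new_centers)
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
        labels = [remap[k] for k in labels]
        centers = new_centers

        if not changed:
            break
    return labels, centers
```

Because each pass is a cheap hard-assignment sweep, rerunning such an algorithm from several initializations or with several values of `penalty` is inexpensive, which is the kind of flexibility the abstract attributes to the K-means-like formulation.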