Abstract
A position weight matrix (PWM) is widely accepted as a probabilistic representation of protein-DNA binding specificity. Previous studies have shown that, for factors that bind to divergent binding sites, mixtures of multiple PWMs improve performance. We propose a consensus scaffolded mixture PWM (CSM) model that improves the modeling of cis-regulatory elements by allowing overlapping components represented by a set of PWMs, each of which corresponds to a binding pattern and is scaffolded by a degenerate consensus. In addition, we propose a learning algorithm consisting of an initial structure learning stage based on frequent pattern mining and a refining stage based on the expectation maximization (EM) algorithm. We assess the merits of CSM using three independent criteria. In a case study of the transcription factor Leu3, the derived CSM models agree with conventional mixtures but show a better fit according to the Fermi-Dirac distribution. Analysis of the human-mouse conservation of predicted binding sites of 83 JASPAR transcription factors (TFs) shows that the CSM is as good as or better than the simple mixture, the context-specific independence (CSI) mixture, and the single PWM model in 83%, 84%, and 75% of the cases, respectively. Five-fold cross validation on 46 TRANSFAC datasets shows that the CSM model generalizes better than other mixture models.
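To make the mixture idea concrete, the following is a minimal sketch of a plain mixture of PWMs refined by EM on pre-aligned, equal-length sites. It is not the CSM model: it has no consensus scaffolding, no overlapping components, and no frequent pattern mining stage. The function names (`hamming`, `fit_pwm_mixture`), the farthest-point initialization, and the pseudocount value are assumptions chosen for illustration only.

```python
import math

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of mismatching positions between two equal-length sites."""
    return sum(x != y for x, y in zip(a, b))


def fit_pwm_mixture(sites, k=2, iters=30, pseudocount=0.5):
    """Fit a k-component mixture of PWMs to aligned sites with EM.

    Illustrative sketch only: a simple mixture, not the CSM model.
    Returns (weights, pwms, resp), where pwms[c][j][base] is the
    probability of `base` at position j in component c, and resp[i][c]
    is the posterior probability that site i came from component c.
    """
    n, width = len(sites), len(sites[0])

    # Deterministic farthest-point seeding (an assumption of this sketch):
    # start from the first site, then add the site farthest from all seeds.
    seeds = [sites[0]]
    while len(seeds) < k:
        seeds.append(max(sites, key=lambda s: min(hamming(s, t) for t in seeds)))

    # Soft initial responsibilities that decay with distance to each seed.
    resp = []
    for s in sites:
        row = [math.exp(-hamming(s, t)) for t in seeds]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iters):
        # M-step: mixing weights and per-position base frequencies,
        # each site contributing in proportion to its responsibility.
        weights = [max(sum(resp[i][c] for i in range(n)) / n, 1e-12)
                   for c in range(k)]
        pwms = []
        for c in range(k):
            pwm = []
            for j in range(width):
                counts = {b: pseudocount for b in ALPHABET}
                for i, s in enumerate(sites):
                    counts[s[j]] += resp[i][c]
                total = sum(counts.values())
                pwm.append({b: counts[b] / total for b in ALPHABET})
            pwms.append(pwm)

        # E-step: posterior of each component given each site,
        # computed in log space and normalized for numerical stability.
        for i, s in enumerate(sites):
            logp = [math.log(weights[c])
                    + sum(math.log(pwms[c][j][s[j]]) for j in range(width))
                    for c in range(k)]
            top = max(logp)
            post = [math.exp(lp - top) for lp in logp]
            total = sum(post)
            resp[i] = [p / total for p in post]

    return weights, pwms, resp
```

Calling `fit_pwm_mixture(sites, k=2)` on a list of aligned site strings over `ACGT` returns the mixing weights, one PWM per component, and the per-site responsibilities. In the setting the abstract describes, a structure learning stage would supply the initial components in place of the ad hoc seeding used here.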
Cite this article
Jiang, H., Zhao, Y., Chen, W. et al. Improving cis-regulatory elements modeling by consensus scaffolded mixture models. Sci. China Inf. Sci. 56, 1–11 (2013). https://doi.org/10.1007/s11432-011-4374-9