Abstract
Many computer programs for predicting transcription factor binding sites have been developed over the last decades. These programs differ in the objectives they pursue and in the sources of information they take into account. Programs based on statistical approaches also differ at a more elementary level: in the statistical models used for individual binding sites and flanking sequences, and in the learning principles employed to estimate the model parameters. In our experience, both the models and the learning principles should be chosen with great care, depending on the specific task at hand, but many existing programs do not allow the user to choose them freely. Hence, we developed Jstacs, an object-oriented Java framework for sequence analysis that allows the user to combine different statistical models and different learning principles in a modular manner with little effort. In this chapter, we explain how Jstacs can be used for the recognition of transcription factor binding sites.
Notes
- 1.
- 2. Note that the parameters \(\boldsymbol\theta\) contain the parameters for each class, e.g., \(\boldsymbol\theta_{\textrm{fg}}\), \(\boldsymbol\theta_{\textrm{bg}}\), and the class probabilities.
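As a minimal sketch of this decomposition for the two-class case, one may write the following. The symbols \(\pi_c\) for the class probabilities, \(x\) for a sequence, and \(c\) for a class are notation assumed here for illustration and are not taken from the chapter; the second line is simply Bayes' rule, showing where the class probabilities enter.

\[
\boldsymbol\theta = \left(\boldsymbol\theta_{\textrm{fg}},\ \boldsymbol\theta_{\textrm{bg}},\ \pi_{\textrm{fg}},\ \pi_{\textrm{bg}}\right), \qquad \pi_{\textrm{fg}} + \pi_{\textrm{bg}} = 1
\]

\[
P(c \mid x, \boldsymbol\theta) = \frac{\pi_c \, P(x \mid c, \boldsymbol\theta_c)}{\sum_{c' \in \{\textrm{fg},\, \textrm{bg}\}} \pi_{c'} \, P(x \mid c', \boldsymbol\theta_{c'})}
\]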
Copyright information
© 2010 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Posch, S., Grau, J., Gohr, A., Keilwagen, J., Grosse, I. (2010). Probabilistic Approaches to Transcription Factor Binding Site Prediction. In: Ladunga, I. (eds) Computational Biology of Transcription Factor Binding. Methods in Molecular Biology, vol 674. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-854-6_7
DOI: https://doi.org/10.1007/978-1-60761-854-6_7
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60761-853-9
Online ISBN: 978-1-60761-854-6
eBook Packages: Springer Protocols