Abstract
The Gibbs sampler is a tool for de novo motif discovery. Suppose we have a set of sequences, each containing a regulatory motif at a different position, but we do not know what the motif looks like or where it is located within each sequence. The Gibbs sampler will find such a motif if it is well represented in these sequences. For example, given a set of yeast intron sequences, each containing a branchpoint site (BPS) somewhere, but with no knowledge of what the BPS looks like or where it lies along the intron, the Gibbs sampler will find these BPSs. Another scenario involves the discovery of protein-binding sites (e.g., transcription factor binding sites) given a set of sequences from ChIP-Seq. Each of these sequences has a short segment with affinity to a protein, but we do not know what the segment looks like or where it is located within the sequence. The Gibbs sampler excels at discovering such protein-binding sites. This chapter opens the black box of the Gibbs sampler and numerically illustrates each of its computational steps, including the site sampler (which assumes that each input sequence harbors a signal motif) and the motif sampler (which is used when some sequences may contain multiple signal motifs and some none).
Postscript
The Gibbs sampler for motif discovery illustrates the magic of a random process guided by a selection process. We start with a random set of motifs and apply a selection process that favors certain motifs over others, based on the criterion that the chosen motifs should contribute to a site-specific frequency distribution with a larger F, as defined in Eq. (4.1). The starting set of random motifs, shaped by the selection process, eventually converges to a final set of motifs with a strong nonrandom pattern. Sometimes the sequences contain two or more different types of signal motifs. If we run the Gibbs sampler with different starting sets of random motifs, it may then converge to different sets of highly informative motifs. Thus, the same random process coupled with the same selection process may generate quite different outcomes.
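The interplay of random starts and selection described above can be sketched in a few lines of Python. This is a minimal site-sampler sketch under stated assumptions, not the chapter's implementation: Eq. (4.1) is not reproduced here, so the function `score` uses a log-likelihood-ratio measure as an assumed stand-in for F. The function names, the pseudocount of 0.5, the iteration count, and the toy sequences are all hypothetical choices made for illustration.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: input sequences contain only these four letters

def background(seqs):
    """Background base frequencies over all sequences (add-one smoothing)."""
    total = sum(len(s) for s in seqs)
    return {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in ALPHABET}

def count_matrix(seqs, starts, width, skip=None, pseudo=0.5):
    """Site-specific base counts from the current motif windows.
    The sequence at index `skip` is left out; `pseudo` is an assumed pseudocount."""
    counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
    for i, (s, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][s[a + j]] += 1
    return counts

def score(seqs, starts, width, bg):
    """Assumed stand-in for F: sum of count * log(site frequency / background)."""
    f = 0.0
    for col in count_matrix(seqs, starts, width):
        n = sum(col.values())
        for b in ALPHABET:
            f += col[b] * math.log((col[b] / n) / bg[b])
    return f

def site_sampler(seqs, width, n_iter=200, seed=0):
    """Site sampler: exactly one motif window per sequence."""
    rng = random.Random(seed)
    bg = background(seqs)
    # Random process: start from randomly placed motif windows.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best_f, best_starts = score(seqs, starts, width, bg), list(starts)
    for _ in range(n_iter):
        for i, s in enumerate(seqs):
            # Build the site-specific frequencies without sequence i ...
            counts = count_matrix(seqs, starts, width, skip=i)
            totals = [sum(col.values()) for col in counts]
            # ... then weight every candidate window in sequence i by its odds ratio.
            weights = []
            for a in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    base = s[a + j]
                    w *= (counts[j][base] / totals[j]) / bg[base]
                weights.append(w)
            # Selection process: sample the new position in proportion to its weight.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        f = score(seqs, starts, width, bg)
        if f > best_f:
            best_f, best_starts = f, list(starts)
    return best_starts, best_f

# Hypothetical toy data: five made-up sequences, each with the word TACTAAC planted.
toy = [
    "GTATGTTCATTACTAACGGATCCTTAG",
    "GTACGTAATACTAACTTGCATTTCAG",
    "GTATGATTTCGGTACTAACAATTTAG",
    "GTAAGTCCTACTAACGCGTTATACAG",
    "GTATGTGGCATGCTACTAACTTTTAG",
]
starts, f = site_sampler(toy, 7, seed=1)
print([s[a:a + 7] for s, a in zip(toy, starts)], round(f, 3))
```

Calling `site_sampler` with different `seed` values changes only the random starting windows and the sampling draws; as the text notes, such runs need not end at the same set of motifs.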
My Christian friends often assert that Darwinian evolutionary theory is all wrong because random collision of molecules cannot generate highly structured patterns. Random collision of molecules is indeed limited in its potential to generate highly structured patterns. However, Darwinian evolution is not a random process. In fact, Darwin’s most significant contribution to biology is the substantiation of a particular force that he named natural selection. A random process guided by this force can work miracles in generating biodiversity of all colors and shades. It was when Darwin visualized the miracles generated by this ubiquitous force that he proclaimed that “There is grandeur in this view of life.”
This force has been with us, from time immemorial, and continues to shape all forms of life, including the life of those who deny its existence.
Copyright information
© 2018 Springer Science+Business Media LLC
About this chapter
Cite this chapter
Xia, X. (2018). Gibbs sampler. In: Bioinformatics and the Cell. Springer, Cham. https://doi.org/10.1007/978-3-319-90684-3_4
DOI: https://doi.org/10.1007/978-3-319-90684-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-90682-9
Online ISBN: 978-3-319-90684-3
eBook Packages: Biomedical and Life Sciences (R0)