Regulatory Motif Identification in Biological Sequences: An Overview of Computational Methodologies

Advances in Enzyme Biotechnology

Abstract

Transcription factor binding sites (TFBS), also called motifs, are short, recurring patterns in DNA sequences that are presumed to have a biological function. Identifying motifs in the promoter regions of genes is an important and challenging problem, particularly in eukaryotic genomes. This chapter presents an overview of motif identification methods. Computational methods for motif identification are classified as enumerative, probabilistic, phylogeny-based, and machine learning methods. The chapter also presents the standard evaluation scheme for assessing prediction accuracy.
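As a rough illustration of the enumerative class named above, the following minimal Python sketch exhaustively tests every candidate pattern of a fixed length against a set of sequences. It is not taken from the chapter: the function names, the mismatch threshold `d`, and the toy promoter strings are all assumptions chosen for illustration, and the brute-force search is only practical for short patterns.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(seq, pattern, d):
    """True if some window of seq matches pattern with at most d mismatches."""
    l = len(pattern)
    return any(
        hamming(seq[i:i + l], pattern) <= d
        for i in range(len(seq) - l + 1)
    )


def enumerate_motifs(seqs, l, d):
    """Brute-force enumeration: try all 4**l patterns of length l and keep
    those that occur (with at most d mismatches) in every sequence."""
    hits = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        if all(occurs_within(s, pattern, d) for s in seqs):
            hits.append(pattern)
    return hits


# Hypothetical toy "promoter" sequences, invented for this example.
promoters = ["TTGACATAAC", "CCTGACAGGT", "AATTGACAGC"]
print(enumerate_motifs(promoters, 5, 0))
```

With `d = 0` the search reduces to finding exact substrings shared by all sequences; raising `d` admits degenerate occurrences at the cost of more candidate patterns, which is one reason exhaustive enumeration scales poorly with pattern length.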

Author information

Corresponding author

Correspondence to Shripal Vijayvargiya.

Copyright information

© 2013 Springer India

About this chapter

Cite this chapter

Vijayvargiya, S., Shukla, P. (2013). Regulatory Motif Identification in Biological Sequences: An Overview of Computational Methodologies. In: Shukla, P., Pletschke, B. (eds) Advances in Enzyme Biotechnology. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1094-8_8
