The Identification of Cis-Regulatory Sequence Motifs in Gene Promoters Based on SNP Information

  • Paula Korkuć
  • Dirk Walther
Part of the Methods in Molecular Biology book series (MIMB, volume 1482)


Conservation of particular molecular sequence motifs throughout evolution is a strong indicator of their functional relevance, as selective pressure likely prevented the accumulation of mutations. Known as “phylogenetic footprinting”, this rationale has been exploited to identify novel functional motifs from sequence alignments of diverse species, in particular transcription factor binding site motifs in aligned promoter sequences of orthologous genes. With the rapid advances in sequencing technologies, whole-genome sequence information is accumulating not only across different species, but increasingly for variants of the same species that exhibit relatively little sequence variability, present primarily as single nucleotide polymorphisms (SNPs). Here, we lay out the basic strategy for identifying functional cis-regulatory motifs in gene promoter regions based on SNP information.
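The rationale above can be illustrated with a minimal sketch. It is not the procedure of the chapter, whose details are not given in this abstract. It only assumes that each promoter is available as a reference sequence together with the set of positions at which SNPs were observed across variants of the same species. All names (kmer_snp_rates, background_rate, conserved_kmers) and the toy data are hypothetical, and the comparison against the background SNP rate stands in for a proper statistical test.

```python
from collections import defaultdict


def kmer_snp_rates(promoters, snp_positions, k=6):
    """Count SNP-carrying bases per k-mer over all promoters.

    promoters:     dict gene_id -> reference promoter sequence (assumed input)
    snp_positions: dict gene_id -> set of 0-based SNP positions in that sequence
    Returns dict kmer -> (snp_bases, total_bases).
    """
    counts = defaultdict(lambda: [0, 0])
    for gene, seq in promoters.items():
        snps = snp_positions.get(gene, set())
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            # Number of positions inside this occurrence that carry a SNP.
            hits = sum(1 for pos in range(start, start + k) if pos in snps)
            counts[kmer][0] += hits
            counts[kmer][1] += k
    return {kmer: (hit, total) for kmer, (hit, total) in counts.items()}


def background_rate(promoters, snp_positions):
    """Overall fraction of promoter bases that carry a SNP."""
    total = sum(len(seq) for seq in promoters.values())
    snps = sum(len(snp_positions.get(gene, set())) for gene in promoters)
    return snps / total if total else 0.0


def conserved_kmers(promoters, snp_positions, k=6, min_occurrences=2):
    """List k-mers whose SNP rate lies below the background rate.

    A real analysis would replace this simple threshold with a
    significance test and a multiple-testing correction.
    """
    bg = background_rate(promoters, snp_positions)
    result = []
    for kmer, (hit, total) in kmer_snp_rates(promoters, snp_positions, k).items():
        occurrences = total // k
        if occurrences >= min_occurrences and hit / total < bg:
            result.append((kmer, hit / total))
    return sorted(result, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_promoters = {"g1": "ACGTGACCC", "g2": "TTACGTGAA"}
    toy_snps = {"g1": {7}, "g2": {0}}
    print(conserved_kmers(toy_promoters, toy_snps, k=6))
```

In this sketch, a k-mer counts as a candidate only if it occurs several times and its occurrences carry fewer SNPs than promoter sequence on average, which mirrors the idea that functional motifs are shielded from mutation by selective pressure.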

Key words

Phylogenetic footprinting; Transcription factor binding sites; Single nucleotide polymorphism; Gene promoter; Conservation; Gene expression



Abbreviations

TFBS  Transcription factor binding site
TSS  Transcription start site
SNP  Single nucleotide polymorphism



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Max Planck Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
