Comparative Genomics-Based Orthologous Promoter Analysis Using the DoOP Database and the DoOPSearch Web Tool

Barta, Endre

doi:10.1007/978-1-59745-514-5_20


Part of the book series: Methods in Molecular Biology™ (MIMB, volume 395)

Summary

Bioinformatic and experimental methods for analyzing promoter regions have been available for a long time. Finding transcription factor binding sites (TFBSs) by either method, however, still faces a number of problems. For example, because transcription factors bind ambiguously, the number of false positives and false negatives in these sequence analyses can be unexpectedly high. We can assume that evolutionarily conserved motifs or regions in the promoters of homologous genes function as TFBSs. Thus, a comparative genomic approach can provide a partial solution to the problem outlined above.

This chapter describes the application of the DoOP database and the DoOPSearch web tools to such a comparative genomic analysis. Orthologous promoter sequences and conserved motifs can be extracted from the DoOP database for further analysis. The web-based tools of the DoOPSearch webpage can be used for searching and comparing conserved motifs. Using these tools, it is possible to compare short sequences with conserved motifs, to map conserved motifs onto a longer promoter region, or to find sequence patterns in different sets of promoter sequences.


1 Introduction

The first and best-known promoter database, the Eukaryotic Promoter Database (EPD) (1), consists of promoter sequences extracted from the regions upstream and downstream of transcription start sites (TSSs) determined either experimentally or in silico. Although EPD describes promoters very precisely, the number of entries is still very limited. In the genomic era, when full genomes are sequenced rapidly, a growing amount of sequence data is available for in silico analysis. Comparative genomics using bioinformatic methods is a means to extract and compare the promoter regions of homologous genes to see which regions or smaller motifs are evolutionarily conserved. Studying these conserved sequences (determining, comparing, clustering, and so on) is currently one of the most challenging tasks of comparative genomics (2,3).

1.1 Promoter Databases

There are several promoter databases available for searching and retrieving promoter sequences. Link collections on the internet, such as http://apollo11.isto.unibo.it/Databases.htm, http://databases.biomedcentral.com/, or the NAR database collection (http://www.oxfordjournals.org/nar/database/c), list those databases. Many of these databases, however, are not designed for comparative genomic analysis or contain promoter sequences from only a limited number of species. In this respect, the most comprehensive orthologous promoter collection is the DoOP database (4).

1.1.1 The DoOP Promoter Database

The two DoOP databases are based on the annotation of two well-known species, Homo sapiens and Arabidopsis thaliana. To build these databases, the annotated first exon (or, in some cases, the first two exons) was used to find the first exons of homologous genes. The 5’ upstream regions of the homologous genes are then used as orthologous promoter sequences. In most cases, this method gives reliable results, but it still has its drawbacks:

1.
The method depends heavily on the annotation of the model organism. If the annotation is wrong (for example, there is an additional exon in vivo before the first annotated exon), then the extracted promoter sequence might not contain the real promoter. It is very likely, though, that the annotation of genes in model organisms will become more and more precise.
2.
In most cases, a promoter region in the DoOP database is not simply the 5’ upstream region relative to the TSS; it also contains the 5’ untranslated region. It must be mentioned, however, that the positions of known TSSs are annotated if available.
3.
The yield of the method is relatively low. Only about 50% of human genes give an orthologous promoter from a nonprimate species. It is possible, however, to use homologous gene annotations from other sources such as ENSEMBL (5) to determine the position of orthologous promoters and, thus, to increase the number of usable promoter clusters in the database.

1.2 Transcription Factor Binding Sites and Conserved Motifs Collections

Several transcription factor binding site databases are available on the internet, either for searching, like TRANSFAC (6), or for downloading, like JASPAR (7). Their data come mostly from manual curation of experimental data. Most recently, the first collections of evolutionarily conserved motifs in promoter regions have become available. Xie et al. (8) used a statistical method to find conserved motifs in the promoter regions of homologous human, dog, mouse, and rat genes. The cisRED (9) and CORG (10) databases employ the ENSEMBL annotation to find and analyze promoter regions. The consensus motif sequences of the DoOP database (4) are generated by extracting the conserved parts of the DIALIGN promoter alignments. There are a number of websites, such as TRED (11), where uploaded sequences can be searched and analyzed for different motifs. At the time of writing, however, DoOPSearch is the only website where it is possible to search a database of conserved motifs with a user-supplied sequence.

1.2.1 The DoOPSearch Website

The DoOPSearch web tools were designed to find similarities between short or long query sequences and short DNA sequences or motifs, conserved or not, from the promoter regions of genes. The search for similar motifs is performed in two steps. In the first step, both the query and all motif consensus sequences are split into overlapping pieces of a given length (the “wordsize”). These segments are then compared one by one with the program MOFEXT (MOtiF sEarch and eXTension) using a scoring matrix. In the second step, if the calculated score of a pair of segments is above a given limit (the cutoff), MOFEXT tries to extend the alignment using the original query and motif sequences. The result is the best and longest alignment between the user-supplied query sequence and the consensus motif in the database. The DoOPSearch website also offers a simple pattern search in all promoter sequences using the FUZZNUC program from the EMBOSS (12) package.
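The two-step search can be illustrated with a short Python sketch. This is not the MOFEXT source code: the function names, the identity scoring (1 for identical bases, 0 otherwise, as a stand-in for a real scoring matrix), and the ungapped extension that stops at the first mismatch are all assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def words(seq: str, size: int) -> List[Tuple[int, str]]:
    """Return every overlapping window of length `size` with its start offset."""
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


def score(a: str, b: str) -> int:
    """Toy score: number of identical positions (stand-in for a scoring matrix)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def seed_and_extend(query: str, motif: str, wordsize: int = 6,
                    cutoff: float = 0.9) -> Optional[Tuple[int, int, int, int]]:
    """Step 1: compare all word pairs; step 2: extend pairs that pass the cutoff.

    Returns (query_start, motif_start, length, matches) for the longest
    ungapped alignment found, or None if no word pair passes the cutoff.
    """
    best = None
    for qi, qword in words(query, wordsize):
        for mi, mword in words(motif, wordsize):
            if score(qword, mword) < cutoff * wordsize:
                continue  # this word pair does not reach the cutoff
            # Extend to the left while the neighbouring bases are identical.
            left = 0
            while (qi - left - 1 >= 0 and mi - left - 1 >= 0
                   and query[qi - left - 1] == motif[mi - left - 1]):
                left += 1
            # Extend to the right in the same way.
            right = 0
            while (qi + wordsize + right < len(query)
                   and mi + wordsize + right < len(motif)
                   and query[qi + wordsize + right] == motif[mi + wordsize + right]):
                right += 1
            qs, ms = qi - left, mi - left
            length = left + wordsize + right
            matches = score(query[qs:qs + length], motif[ms:ms + length])
            if best is None or (length, matches) > (best[2], best[3]):
                best = (qs, ms, length, matches)
    return best


# Made-up sequences sharing the 8-base stretch TGACGTCA.
print(seed_and_extend("TTGACGTCATT", "GGTGACGTCAGG"))
```

With a word size of 6 and a cutoff of 0.9, only identical words seed an alignment in this toy scoring; lowering the cutoff or raising the word size lets words with mismatches act as seeds.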

1.3 Methods Described in This Chapter (see Note 1)

Using DoOP and DoOPSearch it is possible to:

1.
Search and retrieve orthologous promoter sequences from the DoOP database.
2.
Find and retrieve conserved motifs from the DoOP database.
3.
Search the conserved motifs consensus list of DoOP database for similar motifs.
4.
Search the promoter sequences available in the DoOP database for similar patterns.

The retrieved data can be (1) a set of promoter sequences, which can be analyzed further with different bioinformatic tools, (2) a list of genes that contain conserved motifs similar to the query sequence, and (3) a list of genes that contain patterns similar to the query sequence in their promoter regions. Besides these data, the websites also provide a starting point for further analysis, because they contain links and cross-references to other databases such as ENSEMBL, GOA, or EPD.

2 Materials

1.
Hardware: any type of computer with graphical display and internet connection.
2.
Software: a web browser with JavaScript enabled.

3 Methods

3.1 Using the DoOP Database

From the DoOP database one can retrieve the promoter region (see Note 2) of a given gene or genes and their orthologs. Once downloaded, these sequences can also be used in any other type of bioinformatics analysis, such as primer design or sequence analysis.

3.1.1 Selecting Promoter Sequences From the Database

1.
Open the DoOP homepage (http://doop.abc.hu) in the web browser.
2.
Select the desired taxonomic category (see Note 3) and click on the “use this database” button.
3.
On the search page, fill out one of the fields to select one or more genes:
a.
  Enter the Cluster ID into the first field (if known from a previous search).
b.
  Type a gene ID into the second field. This is a unique short name for the gene. For chordates, this is the HGNC name (http://www.gene.ucl.ac.uk/nomenclature/index.html).
c.
  Choose from a list of human ENSEMBL (ENSG…) or Arabidopsis (At…) IDs.
d.
  Type a keyword into the fourth field to search the short descriptions of genes.
e.
  Choose a species from a list. This option is useful for getting a promoter sequence from a gene of a rare species, or for seeing all promoter sequences of a given species in the DoOP database.
f.
  Use a Gene Ontology (GO) term or category to select one or more genes. Here, the user may either type an exact GO term or GO ID (GO:…) directly or, after typing a keyword, go to a separate page and choose among the available GO terms that contain that keyword.
g.
  Use the final option to find a promoter by sequence similarity (BLAT) search. Enter or upload a promoter or cDNA sequence, preferably from human or Arabidopsis. As in all the above cases, click the appropriate search button.
4.
Click the Search button beside the chosen field to see the results on the TableView page.

3.1.2 Download Promoter Sequences

After searching for the desired gene(s), on the TableView page:

1.
Select one, several, or all genes.
2.
Choose whether to download only the promoter sequences of the given model organism (H. sapiens for chordates or A. thaliana for plants) or the promoter sequences from all available species.
3.
Choose one or more promoter lengths (500, 1000, and 3000 bp at the time of writing) to download (see Note 4).
4.
Click the Download button to go to the download page.
5.
Download the files by clicking on them one by one (see Note 5).

Or, to download only the sequence(s) of one cluster:

1.
Click the cluster ID (8…) of the desired gene.
2.
On the ClusterView page, find the Files box and either click the Sequences link and then copy and paste the sequences, or right-click the link and save the target (see Note 6).
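Once downloaded, a multiple FASTA file can be read with a few lines of Python before further analysis. The sketch below is a generic FASTA reader, not a DoOP-specific tool. The function names and the example records are hypothetical, and each header is returned unparsed, because the exact header layout should be checked in the downloaded file.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    parts: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)  # sequences may span several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


def sequence_lengths(records: Dict[str, str]) -> Dict[str, int]:
    """Length of every sequence, e.g. to see which orthologs are truncated."""
    return {name: len(seq) for name, seq in records.items()}


# Placeholder records; real headers and sequences come from the download.
example = """>example_record_1
ACGTAC
GT
>example_record_2
TTTT
""".splitlines()

records = read_fasta(example)
print(sequence_lengths(records))
```

For a saved file, the same function can be called on an open file handle, for example `read_fasta(open("cluster.fasta"))`, where the file name is again only a placeholder.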

3.1.3 Getting Conserved Motifs

1.
Navigate to the ClusterView page of the desired gene using the method previously outlined.
2.
At the bottom of the page is a graphical representation of the promoter sequences of the cluster. Click the box of the chosen motif to go to the MotifView page.
3.
Here, the user can:
a.
  Copy and paste the motif sequences or the consensus.
b.
  View the position-specific weight matrix (PSWM) of the motif, if available, by clicking the PSWM button, and then save it by copying and pasting (see Note 7).
c.
  View the sequence logo of the motif, if available, and then save it by copying and pasting (see Note 7).
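If only the aligned motif sequences are copied, a simple weight matrix can be recomputed from them. The following Python sketch shows one common way to do this (base counts per column, then log-odds against a uniform background with a pseudocount). It is not necessarily how the PSWMs shown by DoOP are calculated, and the function names, the pseudocount, the background frequency, and the example sequences are assumptions.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def count_matrix(seqs: List[str]) -> List[Dict[str, int]]:
    """Count A, C, G and T in every column of equally long motif sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all motif sequences must have the same length")
    columns = []
    for pos in range(length):
        column = {base: 0 for base in BASES}
        for s in seqs:
            base = s[pos].upper()
            if base in column:  # gaps and ambiguity codes are ignored
                column[base] += 1
        columns.append(column)
    return columns


def weight_matrix(seqs: List[str], pseudocount: float = 1.0,
                  background: float = 0.25) -> List[Dict[str, float]]:
    """Log2-odds weight of each base per column (pseudocount must be > 0)."""
    weights = []
    for column in count_matrix(seqs):
        total = sum(column.values()) + 4 * pseudocount
        weights.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in BASES
        })
    return weights


# Made-up aligned motif instances, for illustration only.
motif_instances = ["ACGT", "ACGA", "ACGT", "TCGT"]
for position, column in enumerate(weight_matrix(motif_instances), start=1):
    print(position, {base: round(w, 2) for base, w in column.items()})
```

Positive weights mark bases that are more frequent in a column than expected from the background; negative weights mark bases that are rarer.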

3.2 Searching for Similar Conserved Motifs Using the DoOPSearch Web-Based Tool

One can search the consensus sequences of conserved motifs in the DoOP database. The query can be a shorter sequence (for example, a transcription factor binding site), a longer sequence (for example, an experimentally proven promoter region), or any consensus sequence that is already in the DoOP database. This is a sequence similarity search, so choosing appropriate parameters is very important (see Note 8).

3.2.1 MOFEXT Search With an Annotated Motif From the DoOP Database

1.
Select a conserved motif of a given gene from the DoOP database following the previously outlined method starting from the DoOP database homepage (http://doop.abc.hu).
2.
On the MotifView page either:
a.
  Click the “Run default search with consensus” button to perform an automatic search with default parameters (see Note 8 and continue with Subheading 3.2.3.).
b.
  Click the “Go to search page with this consensus” button to paste the consensus sequence of the chosen motif into the appropriate search field of the DoOPSearch website (see Note 9 and continue with Subheading 3.2.2.4.).

3.2.2 MOFEXT Search With a Sequence Pattern

1.
Open the DoOPSearch homepage (http://doops.abc.hu).
2.
Select the desired taxonomic category (see Note 3).
3.
On the search page, type or paste the sequence pattern into the search field (see Note 9).
4.
Change the parameters to refine the search (see Note 8).
5.
Enter an e-mail address to receive a link to the results by e-mail (see Note 10).
6.
Click on the submit button to send the job. Continue with Subheading 3.2.3.

3.2.3 Analyzing the MOFEXT Search Result

After the search described above completes, the user is taken to the TableView page. By default, the hits on this page are sorted by their extended score (i.e., the second score calculated by the MOFEXT program). On this page, it is possible to see the gene clusters (orthologous promoters that belong to one gene) from which the hits (conserved motifs from the motif lists) originate, to see the alignments between the query and the hits, or to apply several filtering functions to refine the result list.

1.
Click the Cluster ID (in the first column) to see, on the ClusterView page, the motif (the hit) highlighted in the picture, together with the information from the DoOP database about the given gene and its promoter region.
2.
Click the alignment link (in the last column) to see the alignment between the query sequence and the given motif (the hit).
3.
To filter or sort the results (see Note 11):
a.
  Select one of the available filtering options (such as score, extended score, starting position on the query, or length of the hit) in the pulldown menu.
b.
  Or type a GO ID, or type a GO term keyword and, on the next page, click the appropriate GO category to return to the previous page with the selected GO ID pasted into the GO ID field.
Then click the submit button.
4.
If there is a picture at the top of the page (i.e., the query is longer than 20 bp), click a position in the graph to list only the hits that are present at that position.
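The same kind of refinement can be reproduced offline on a saved hit list. The Python sketch below assumes a hypothetical record layout: the field names are not taken from the DoOPSearch output, and the example values are made up. It keeps only hits above chosen thresholds and sorts them by extended score, as the TableView page does by default.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One search hit; all field names are hypothetical."""
    cluster_id: str
    score: float
    extended_score: float
    query_start: int
    length: int


def refine(hits: List[Hit],
           min_extended_score: Optional[float] = None,
           min_length: Optional[int] = None) -> List[Hit]:
    """Filter hits by the given thresholds, then sort by extended score (highest first)."""
    kept = [
        h for h in hits
        if (min_extended_score is None or h.extended_score >= min_extended_score)
        and (min_length is None or h.length >= min_length)
    ]
    return sorted(kept, key=lambda h: h.extended_score, reverse=True)


# Made-up hits from a deliberately loose first search.
loose_hits = [
    Hit("cluster_a", 5.0, 12.0, 3, 10),
    Hit("cluster_b", 6.0, 7.5, 0, 6),
    Hit("cluster_c", 5.5, 15.0, 8, 12),
]
for hit in refine(loose_hits, min_extended_score=10.0):
    print(hit.cluster_id, hit.extended_score)
```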

3.3 FUZZNUC Searching of Whole Promoter Sequences With the User’s Pattern

1.
Open the DoOPSearch homepage (http://doops.abc.hu).
2.
Select the desired taxonomic category (see Note 3).
3.
Type the query pattern in the pattern field of the FUZZNUC box (see Note 12).
4.
Either use the default parameters or change the promoter set, the number of mismatches, or the option to search the complement sequence.
5.
Click the submit button to see the result in the next page (TableView page).
6.
Click the Cluster ID (in the first column) to see, on the ClusterView page, the position of the given hit in the picture of the promoter sequences, together with the information from the DoOP database about the given gene and its promoter region.
7.
Click the Seq ID (in the second column) to get the sequence of the given promoter in FASTA format.
8.
To filter or sort the results (see Note 11):
a.
  Select one of the available filtering options (such as Cluster ID or starting position on the hit) in the pulldown menu.
b.
  Or type a GO ID, or type a GO term keyword and, on the next page, click the appropriate GO category to return to the previous page with the selected GO ID pasted into the GO ID field.
c.
  Or type a mismatch value to show only hits with fewer mismatches.
d.
  Or select a strand to show only hits on that strand.

Then click the submit button.
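The kind of search performed in this section (a pattern with ambiguity codes, a tolerated number of mismatches, and an optional search of the complementary strand) can be sketched in Python as follows. This is an independent illustration, not FUZZNUC itself, and it does not implement the FUZZNUC pattern syntax. The function names and the example sequence are assumptions; the ambiguity codes are the standard IUPAC nucleotide codes.

```python
from typing import List, Tuple

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern: str) -> str:
    """Reverse complement of a sequence or IUPAC pattern."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def find_pattern(pattern: str, sequence: str,
                 max_mismatches: int = 0) -> List[Tuple[int, str, int]]:
    """Return (start, matched text, mismatches) for every acceptable window."""
    pattern, sequence = pattern.upper(), sequence.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(
            1 for code, base in zip(pattern, window)
            if base not in IUPAC.get(code, "")
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


def search_both_strands(pattern: str, sequence: str,
                        max_mismatches: int = 0) -> List[Tuple[str, int, str, int]]:
    """Search the pattern and its reverse complement.

    Hits on the '-' strand are reported with forward-strand coordinates.
    """
    forward = [("+",) + hit for hit in find_pattern(pattern, sequence, max_mismatches)]
    reverse = [("-",) + hit for hit in
               find_pattern(reverse_complement(pattern), sequence, max_mismatches)]
    return forward + reverse


# Made-up promoter fragment; W stands for A or T.
print(search_both_strands("TATAWA", "GGTATAAAGG", max_mismatches=1))
```

Starting with a higher mismatch value and then keeping only the hits with few mismatches mirrors the loose-then-refine strategy of Note 11.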

4 Notes

1.
Both the DoOP database and the DoOPSearch web tool are under constant development. It is therefore possible that the look and content of one or more webpages will change or that new features will be implemented. However, the main methods described in this chapter will most likely remain available in the same way as now.
2.
The term “promoter region” in this context means the genomic sequence upstream of either the known and annotated translational start point (AUG codon) of a gene or the beginning of the first totally untranslated exon, if one exists. In the first case, the whole 5’ untranslated region is included in the given promoter sequence, whereas in the second case only the strict promoter region is included. The reason for this difference is technical (for details see ref. 4).
3.
At the time of writing, only two categories are available for searching: the chordates (based on the human annotation) and the plants (based on the A. thaliana annotation). Other databases, such as yeast (based on the annotation of Saccharomyces cerevisiae) and insect (based on the annotation of Drosophila melanogaster), will also be available in the near future.
4.
It is common for the clusters of longer promoters to contain fewer orthologs. The reason is that if the available promoter sequence of a gene from a given species is shorter than, for example, 700 bp (most single-read genomic sequences fall into this category), it will not be included in the 1000 bp promoter cluster.
5.
The sequences are available for downloading in multiple FASTA format. The FASTA headers contain the unique accession number of the sequence, the type of the gene (1–4, 5n, and 6n, see ref. 4), and the name of the species. If more than one cluster is available to download, then the sequences are also available in tarred and gzipped format.
6.
In the “Files” box there are options to view and download the DIALIGN-aligned sequences in either multiple FASTA or DIALIGN format.
7.
PSWMs and sequence logos are available for the short, high-quality motifs.
8.
In general, a quick search can be performed using the default parameters, which is sufficient to get an idea of the results to be expected. If this preliminary result is promising, then it is worth trying to obtain data using the other available parameters.

There are four main parameters that can affect the result significantly:
a.
  The word size.
b.
  The cutoff score.
c.
  The scoring matrix used in the search.
d.
  The motif list used as the database for the search.
The word size can be as low as 6 bp. In this case, a 90% cutoff value will return only hits with a perfect match, but with a longer word size the same cutoff will allow one or more mismatches or ambiguities.

The cutoff score affects the number of hits obtained in the first step. Sometimes it is a good strategy, especially for longer query sequences, to use a lower cutoff value and then filter the result using the extended score.

There are two matrices available at the time of writing (it is expected that this number will increase). It is important to note that the EDNAFULL matrix from the EMBOSS program package does not distinguish between lowercase and capital letters, although they have different meanings in the DoOP consensus sequences (for an explanation see http://doop.abc.hu/details.html).

It is worth considering which motif lists to use in a search. For example, using motifs from low-complexity clusters (i.e., where the consensus comes from closely related species such as primates) will result in a search of almost the entire promoter region, not only the conserved motifs. It is also meaningless to use motifs from the 1000 or 3000 bp promoter sequences if the query is a core promoter element.
9.
Here, one can enter not only letters for nucleic acid bases (ACGT or acgt) but also capital letters for consensus bases (for example, R for purines).
10.
Under certain circumstances (such as a long query sequence, a big difference between the length of the query and the word size, a low cutoff value, or a large number of motif lists in use), the search tends to take a long time (in extreme cases as long as 1 h). In these cases, it is safer and more convenient to get a link pointing to the result instead of keeping the browser window open while waiting for the job to finish.
11.
It is a good strategy, for both the MOFEXT and the FUZZNUC search, to try a first run with rather loose options (i.e., allow a lower cutoff or a higher mismatch value) and then refine the result using one of the filtering options.
12.
It is possible to enter either a consensus sequence, as in the MOFEXT search (see Note 9), or a FUZZNUC-style pattern. For details see the FUZZNUC documentation (http://emboss.sourceforge.net/apps/fuzznuc.html).
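The relationship between word size and cutoff described in Note 8 can be made concrete with a small calculation. The Python sketch below assumes the simplest possible scoring (1 for a matching position, 0 for a mismatch) and a cutoff expressed as a percentage of the maximum word score. Real scoring matrices such as EDNAFULL use other values, so the numbers are illustrative only; the function name is hypothetical.

```python
def mismatches_allowed(word_size: int, cutoff_percent: int) -> int:
    """Mismatches a word may contain and still reach the cutoff.

    Assumes each matching position scores 1 and each mismatch scores 0.
    """
    # Smallest whole number of matches that reaches the cutoff
    # (ceiling division on integers avoids floating-point rounding).
    needed = -(-word_size * cutoff_percent // 100)
    return word_size - needed


for size in (6, 10, 20):
    print(size, mismatches_allowed(size, 90))
```

Under these assumptions a 6 bp word at a 90% cutoff needs 5.4 matching positions, which rounds up to all 6, so no mismatch is tolerated, whereas a 10 bp word needs 9 and a 20 bp word needs 18.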

References

1. Schmid, C. D., Perier, R., Praz, V., and Bucher, P. (2006) EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 34, D82–D85.
2. Papatsenko, D. and Levine, M. (2005) Computational identification of regulatory DNAs underlying animal development. Nat. Methods 2, 529–534.
3. Prakash, A. and Tompa, M. (2005) Discovery of regulatory elements in vertebrates through comparative genomics. Nat. Biotechnol. 23, 1249–1256.
4. Barta, E., Sebestyén, E., Pálfy, T. B., Tóth, G., Ortutay, C. P., and Patthy, L. (2005) DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants. Nucleic Acids Res. 33, D86–D90.
5. Birney, E., Andrews, D., Caccamo, M., et al. (2006) Ensembl 2006. Nucleic Acids Res. 34, D556–D561.
6. Matys, V., Kel-Margoulis, O. V., Fricke, E., et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110.
7. Vlieghe, D., Sandelin, A., De Bleser, P. J., et al. (2006) A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 34, D95–D97.
8. Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., et al. (2005) Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature 434, 338–345.
9. Robertson, G., Bilenky, M., Lin, K., et al. (2006) cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Res. 34, D68–D73.
10. Dieterich, C., Grossmann, S., Tanzer, A., et al. (2005) Comparative promoter region analysis powered by CORG. BMC Genomics 6, 24.
11. Zhao, F., Xuan, Z., Liu, L., and Zhang, M. Q. (2005) TRED: a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res. 33, D103–D107.
12. Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277.


Acknowledgments

I am grateful to Dr. Ferenc Marincs for critical reading of the manuscript and for his helpful suggestions.

Author information

Authors and Affiliations

Agricultural Biotechnology Center, Bioinformatics Group, Hungary
Endre Barta


Editor information

Editors and Affiliations

Bioinformatics Program and Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor, MI
Nicholas H. Bergman


About this protocol

Cite this protocol

Barta, E. (2007). Comparative Genomics-Based Orthologous Promoter Analysis Using the DoOP Database and the DoOPSearch Web Tool. In: Bergman, N.H. (eds) Comparative Genomics. Methods in Molecular Biology™, vol 395. Humana Press. https://doi.org/10.1007/978-1-59745-514-5_20


DOI: https://doi.org/10.1007/978-1-59745-514-5_20
Publisher Name: Humana Press
Print ISBN: 978-1-58829-693-1
Online ISBN: 978-1-59745-514-5
eBook Packages: Springer Protocols
