## Abstract

Genomics began around 1990 with large-scale sequencing of the human genome and the genomes of many model organisms; the rapid accumulation of vast genomic data poses a great challenge: how to decipher such massive molecular information. Like bioinformatics in general, genome informatics is data driven; many computational tools can quickly become obsolete when new technologies and data types become available. A student who wants to work in this fascinating new field must keep this in mind and be able to adapt quickly, to “shoot the moving targets” with “just-in-time ammunition.”

## 3.1 Overview: Genome Informatics

The new paradigm, now emerging, is that all the genes will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical (Gilbert 1991).

[…] \((P \mid G, E)\), where \(P\) is the phenotype (or traits), \(G\) is the genotype (or alleles), and \(E\) is the environment. Before this can be studied systematically, a “parts list” has to be completed: the various functional (both structural and regulatory) elements in the genome have to be identified and characterized. Here is a partial list of common questions in genome informatics:

1. Where are the genes?
2. What are the regulatory regions?
3. What is the exon/intron organization of each gene?
4. How many different functional RNA transcripts can each gene produce? When and where are they expressed?
5. What are the gene products? Structure and function?
6. How do different genes and gene products interact?
7. How are they controlled and regulated?
8. How has the genome evolved? What are the relationships among different species?
9. What are the mutations, polymorphisms, and selection?

In the following, we will describe a few typical problems in detail.

## 3.2 Finding Protein-Coding Genes

[…] 5′-end, the introns are spliced out and the exons are ligated, a poly(A) tail is synthesized at the 3′-end, and the mature mRNA is transported from the nucleus into the cytoplasm, where it is translated into protein. A typical vertebrate protein-coding gene structure is depicted in Fig. 3.1, together with its mRNA transcript. It contains six exons, including three coding exons (in black). Given a genomic DNA sequence, finding a protein-coding gene consists of (a) identifying the gene boundaries and (b) delineating the exon-intron organization. Computational prediction of gene boundaries and noncoding exons is extremely difficult (see the next section); most predictions have focused on coding regions (CDS in Fig. 3.1). Experimental methods include cDNA/EST/CAGE-tag sequencing, exon trapping, and tiling microarrays. Since a gene may be expressed only in certain cell types and under specific conditions, not every transcript can be identified experimentally.

Two common strategies have been used in ab initio gene prediction algorithms: (a) detecting individual exon candidates and connecting them by, e.g., dynamic programming (DP) and (b) segmenting the DNA sequence into exon/intron/splice-site states with (standard or generalized) hidden Markov models (HMMs). Basic ab initio gene prediction algorithms have not changed much in the last 10 years (see the review in [93]); high accuracy can be achieved by integrating evolutionary conservation and cDNA information and by combining multiple prediction algorithms [40].

### 3.2.1 How to Identify a Coding Exon?

The simplest approach is to treat this as a discrimination problem. We are given labeled training samples \((X_i, Y_i)\), \(i = 1, \ldots, N\). When \(Y = 1\), the corresponding sample \(X\) is a true exon sample, i.e., \(X\) = 3′ss-CDS-5′ss, where the 3′ss may contain a 50-nt 3′ splice-site sequence ending with AG, the 5′ss may contain a 10-nt 5′ splice-site sequence, and the coding sequence (CDS) has no STOP codon in at least one of the three reading frames. When \(Y = 0\), \(X\) is a matching pseudo-exon. One can train any standard classifier (LDA, QDA, SVM, etc., see [34]) as the predictor by feature selection and cost minimization (e.g., minimizing classification errors with cross-validation). Many discriminant methods have been developed along these lines; the key is to choose good discriminant feature variables and appropriate pseudo-exon samples. Good discriminant feature variables often include the 5′ss score, the 3′ss score (including the branch-site score), in-frame hexamer coding potential scores, and exon size. HEXON [80] and the method of [92] are exon finders based on LDA and QDA, respectively.

For the first coding exon, one would use the Translational Initiation Site (TIS or Kozak) score in place of the 3′ss score; for the last coding exon, the exon must end with a STOP codon instead of a 5′ss.
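
The structural constraints on an internal coding exon described above (an AG immediately upstream and at least one reading frame without a STOP codon) can be checked with a few lines of code before any classifier is applied. The following is a minimal sketch, not the procedure of any of the cited programs: the function names are hypothetical, and the requirement of a GT dinucleotide at the start of the 5′ss is the canonical splice-site rule assumed here rather than something stated in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_frames(cds: str) -> list[int]:
    """Return the reading frames (0, 1, 2) of `cds` with no in-frame STOP codon."""
    cds = cds.upper()
    frames = []
    for frame in range(3):
        # Codons of this frame; a trailing partial codon is ignored.
        codons = (cds[i:i + 3] for i in range(frame, len(cds) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames

def is_candidate_internal_exon(genomic: str, start: int, end: int) -> bool:
    """Check genomic[start:end] (0-based, end exclusive) as an internal coding exon.

    Assumptions (illustrative only): the intron upstream ends with AG, the
    intron downstream begins with GT, and at least one frame is open.
    """
    genomic = genomic.upper()
    has_acceptor = start >= 2 and genomic[start - 2:start] == "AG"
    has_donor = genomic[end:end + 2] == "GT"
    return has_acceptor and has_donor and bool(open_frames(genomic[start:end]))
```

Candidates that pass such a filter would then be scored with the discriminant features listed above (splice-site scores, hexamer coding potential, exon size).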

### 3.2.2 How to Identify a Gene with Multiple Exons?

[…] 5′ss, intron, 3′ss, STOP), and one needs to train the HMM to obtain its parameters: the emission probabilities \(P(s \mid q)\) of emitting a base \(s\) in a state \(q\) and the transition probabilities \(T(q_1 \mid q_2)\) from a state \(q_1\) to a state \(q_2\). For a given partition (assignment of labels, or parse) \(Z\), the joint probability is simply given by the product

\[P(Z, S) = P(s_1 \mid q_1)\, T(q_1 \mid q_2)\, P(s_2 \mid q_2) \cdots T(q_{N-1} \mid q_N)\, P(s_N \mid q_N)\]

for a sequence of length \(N\). A recursive algorithm (similar to DP) called the Viterbi algorithm may be used to find the most probable parse [71] corresponding to the optimal transcript (exon/intron) prediction. The advantage of HMMs is that one can easily add more states (such as intergenic region, promoter, UTRs, poly(A), and frame- or strand-dependent exons/introns) as well as flexible transitions between the states (to allow partial transcripts, intronless genes, or even multiple genes).
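
The Viterbi recursion can be sketched for a toy two-state model. Everything below is illustrative: the two states, the parameter values (a G+C-rich “exon” state and an A+T-rich “intron” state), and the explicit initial-state distribution `start` are assumptions made for this example and are not taken from any trained gene finder. Logarithms are used to avoid numerical underflow on long sequences.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Most probable state path (parse) of `seq` under an HMM.

    start[q]      initial probability of state q (an assumption of this sketch)
    trans[q1][q2] probability of moving from state q1 to state q2
    emit[q][s]    probability of emitting base s in state q
    Returns (path, log-probability of the path).
    """
    # best[q]: log-probability of the best parse of the prefix ending in state q
    best = {q: math.log(start[q]) + math.log(emit[q][seq[0]]) for q in states}
    back = []  # back[t][q]: best predecessor of state q at position t + 1
    for s in seq[1:]:
        prev, best, pointers = best, {}, {}
        for q in states:
            p = max(states, key=lambda r: prev[r] + math.log(trans[r][q]))
            pointers[q] = p
            best[q] = prev[p] + math.log(trans[p][q]) + math.log(emit[q][s])
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda q: best[q])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], best[last]

# Toy parameters (hypothetical values chosen only for illustration).
TOY_STATES = ("exon", "intron")
TOY_START = {"exon": 0.5, "intron": 0.5}
TOY_TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
             "intron": {"exon": 0.1, "intron": 0.9}}
TOY_EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
            "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path, logp = viterbi("GCGCGCATATAT", TOY_STATES, TOY_START, TOY_TRANS, TOY_EMIT)
```

For this toy input the parse labels the G+C-rich half as exon and the A+T-rich half as intron. A realistic gene finder would add the further states mentioned above and would estimate all parameters from annotated training data.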

HMMgene [49] is based on an HMM that can be optimized easily for maximum prediction accuracy. Genie introduced the generalized HMM (GHMM, [50]) and used neural networks as individual sensors for splice signals as well as for coding content. Genscan [20], fgenesh [75], and TigrScan [56] are also based on GHMMs, which allow exon-specific length distributions, whereas the intrinsic length distribution of a standard HMM state is geometric (i.e., decaying exponentially). Augustus [83] added a new intron submodel. These generative models, however, often ignore complex statistical dependences. CRAIG [13], a recent discriminative method for ab initio gene prediction, appears promising; it is based on a conditional random field (CRF) model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs.
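
To see why the length distribution of a standard HMM state is geometric, consider a state with self-transition probability \(p\) (a symbol introduced here only for illustration). The state is occupied for exactly \(\ell\) consecutive positions when it returns to itself \(\ell - 1\) times and then leaves, so

\[\Pr(\text{length} = \ell) = p^{\ell - 1}(1 - p), \qquad \ell = 1, 2, \ldots,\]

with mean length \(1/(1 - p)\). This distribution always has its mode at \(\ell = 1\) and decays exponentially, which is why a GHMM with an explicit, exon-specific length distribution can be a better fit when observed exon lengths do not follow this shape.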

The future challenges are to predict alternatively spliced exons and short intronless genes.

## 3.3 Identifying Promoters

[…] 5′RACE, CAGE-tag sequencing, PolII (or PIC, or H3K4me3) ChIP-chip or ChIP-Seq, etc.) (e.g., [76]) has produced genome-wide mapping of mammalian core-promoter/TSS data for a few cell lines. These genomic methods are not as accurate as the traditional methods, such as nuclease S1 protection or primer extension assays (Fig. 3.3), but the latter methods cannot be scaled up for genome-wide studies.

At the molecular level, promoter activation and transcription initiation (the beginning of 5′-end pre-mRNA synthesis by PolII) is a complex process (in addition to Cell vol. 10, 2002, interested readers are strongly encouraged to read the book “Transcriptional Regulation in Eukaryotes: Concepts, Strategies, and Techniques” by Carey and Smale 2008 and Gross and Oelgeschlager 2006). After chromatin remodeling, the key step is the binding of the pre-initiation complex (PIC) to the core-promoter (100 bp around the TSS); initiation is mainly regulated by transcription factors bound in the proximal promoter (1 kb upstream) and in the first intron region. Several core-promoter elements have been identified (Fig. 3.3), but each element is short and degenerate and not every element occurs in a given core-promoter, so the combinatorial regulatory code within core-promoters remains elusive [85]. The predictive value of these elements has also been very limited, despite some weak statistical correlations among certain subsets of the elements that were uncovered recently. Further biochemical characterization of core-promoter binding factors under various functional conditions is necessary before a reliable computational classification of core-promoters becomes possible. An example of the type of question that must be answered is how CK2 phosphorylation of TAF1 may switch TFIID binding specificity from a DCE to a DPE function (Fig. 3.3).

A number of statistical and machine learning approaches that can discriminate between known promoter sequences and pseudo-promoter sequences have been applied to TSS prediction. In a large-scale comparison [6], eight prediction algorithms were compared. Among the most successful algorithms were Eponine [32] (which trains Relevance Vector Machines to recognize a TATA-box motif in a G+C rich domain and uses Monte Carlo sampling), McPromoter [65] (based on neural networks, interpolated Markov models, and physical properties of promoter regions), FirstEF [28] (based on quadratic discriminant analysis of promoters, first exons, and the first donor site), and DragonGSF [5] (based on artificial neural networks). However, DragonGSF is not publicly available and uses additional binding-site information based on the TRANSFAC database, exploiting specific information that is typically not available for unknown promoters.

Two new de novo promoter prediction algorithms have emerged that further improve accuracy. One is ARTS [81], which is based on Support Vector Machines with multiple sophisticated sequence kernels. It claims to find about 35 % of true positives at a false-positive rate of 1/1,000, where the abovementioned methods find only about half as many true positives (18 %). ARTS uses only downstream genic sequences as the negative set (non-promoters), and therefore it may produce more false positives in upstream nongenic regions. Furthermore, ARTS does not distinguish whether a promoter is CpG-island related or not, and it is not clear how ARTS performs on non-CpG-island-related promoters. Another novel TSS prediction algorithm is CoreBoost [94], which is based on simple LogitBoosting with stumps. It has a false-positive rate of 1/5,000 at the same sensitivity level. CoreBoost uses both immediate upstream and downstream fragments as negative sets and trains separate classifiers for each before combining the two. Its training samples are 300-bp fragments (−250, +50); hence it is more localized than ARTS, whose training samples are 2-kb fragments (−1 kb, +1 kb).

The ideal application of TSS prediction algorithms is to combine them with gene prediction algorithms [6, 93] or with CAGE-tag and ChIP-chip PIC mapping data [23, 46]. The future challenges are how to integrate CAGE-tag [76] and epigenomic [46] data to identify tissue- or development-specific (alternative) promoters and TSSs.

## 3.4 Genomic Arrays and aCGH/CNP Analysis

Genomic (BAC/oligo or tiling) microarrays were developed to answer important new questions. What is the total human transcriptome [24], especially the locations of alternative transcripts (including non-poly(A) transcripts) and noncoding transcripts (including antisense transcripts and pseudogenes/retro-elements)? Where are all the in vivo binding sites for a given transcription factor [46]? And how is the genome rearranged (duplicated, deleted, inverted, translocated, etc.) in disease, development, or evolution [78]? Using aCGH to detect CNVs has become a very powerful method for mapping copy number polymorphisms (CNPs) and cancer genes (oncogenes are often found in amplified regions and tumor-suppressor genes in deleted regions).

[…] \(x_1, \ldots, x_N\); breakpoints \(0 < y_{1} <\ldots < y_{N} < x_{N}\); and levels \(\mu _{1},\ldots,\mu _{n}\) and error variances \(\sigma_1, \ldots, \sigma_N\), the likelihood is […] \(N\) of the breakpoints […]

A comprehensive view of CNVs among 270 HapMap samples was reported using high-density SNP genotyping arrays and BAC array CGH [73]. A novel algorithm, which combines GIM for intensity pre-processing, SW-ARRAY for pairwise CNV detection, and a maximum clique algorithm for CNV extraction, was applied to Affymetrix GeneChip Human Mapping 500K Early Access (500K EA) array data to identify 1,203 CNVs ranging in size from 960 bp to 3.4 Mb [47]. Recently, a new HMM-based algorithm, PennCNV, has also been developed for CNV detection in whole-genome high-density SNP genotyping data [88].

## 3.5 Introduction on Computational Analysis of Transcriptional Genomics Data

Transcriptional regulatory data includes data about gene expression (most commonly from gene expression microarrays), data about binding sites of TFs (such as ChIP-chip data or ChIP-seq data), and several other kinds of data compiled in databases, often manually curated from individual research projects (such as the TRANSFAC database).

## 3.6 Modeling Regulatory Elements

Transcription factor binding sites are often referred to as *cis*-regulatory elements. The word “element” is used because these sites are the elemental units that are combined inside regulatory regions, such as promoters, to encode the information of the transcriptional regulatory programs. These binding sites are the actual nucleotides in the genome that are recognized by the DNA binding domains of transcription factors and are usually thought of as being contiguous in the sequence. When modeling binding sites, we seek to incorporate as much useful information as possible. However, the most sophisticated models have two main problems: (1) they do not lend themselves to algorithmic manipulation, and (2) they are overly complex and lead to overfitting of current data. Models that attempt to incorporate too much information are therefore currently inappropriate for general analysis. In this section we describe commonly used methods of modeling regulatory elements.

It is important throughout to distinguish *motifs* from *regulatory elements* or binding sites. In a general setting, the term “motif” describes a recurring property or component in data, or a statistical summary of a data sample. Here we use the term to describe sets of genomic sites; in our application these can be viewed as samples from genomic sequences. We will describe different representations for motifs, but it is important to remember that binding sites are DNA segments that may match a motif (*motif occurrences*) but are not the motif itself.

### 3.6.1 Word-Based Representations

The simplest way to describe a set of binding sites is with words over the DNA alphabet \(\{A,C,G,T\}\). A *consensus sequence* for a set of binding sites is the word containing, at each position, the base appearing most frequently at that position across the binding sites. This assumes that the binding sites all have the same length, as do most representations for binding sites. Consensus sequences are useful in many contexts, primarily because their simplicity makes them easy to remember, communicate, and manipulate in a computer, and because statistics related to them are often easily derived.
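
As an illustration of the definition above, a consensus sequence can be computed column by column from an alignment of equal-length sites. This is a minimal sketch; the function name and the example sites are hypothetical, and ties between equally frequent bases are broken arbitrarily (here, by order of first appearance).

```python
from collections import Counter

def consensus(sites):
    """Consensus word for a list of equal-length, gap-free binding sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all binding sites must have the same length")
    # zip(*sites) yields the alignment columns; take the most frequent base in each.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))

# Hypothetical example sites (not taken from any database).
example_sites = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
example_consensus = consensus(example_sites)
```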

However, many TFs bind to sites with significant *degeneracy*, meaning that any two binding sites for the same TF may have quite different sequences. Judging a sequence as *similar* to the consensus by counting the number of positions at which it mismatches the consensus is still simple, but it ignores the fact that different positions in a site are more or less important to the binding affinity of the TF for that site. In general, a consensus sequence representation for TF binding sites is not adequate for use in computational analysis. Representations like regular expressions allow a great deal of flexibility; for example, wildcard characters can be used to indicate that a particular position may be occupied by either of a pair of bases. Regular expressions have been used successfully to model protein domains [4] but have seen relatively limited use in describing regulatory elements. The IUPAC has developed a nomenclature for nucleic acids that contains special symbols to represent subsets of the nucleotides; for example, the symbol “R” is used for purines (see [43] for the full nomenclature and further details). However, relaxations like this still do not provide one of the most useful characteristics of a binding-site model: a statistical foundation.
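
Both word-based ideas discussed above, counting mismatches to a consensus and matching degenerate IUPAC words, are easy to express in code. The sketch below uses Python's standard `re` module; the function names and the example pattern and sequence are hypothetical, and the table covers the standard IUPAC nucleotide codes.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def mismatches(site, word):
    """Number of positions at which `site` differs from `word` (equal lengths)."""
    return sum(a != b for a, b in zip(site, word))

def find_sites(pattern, sequence):
    """0-based start positions of all (possibly overlapping) matches of an IUPAC word."""
    body = "".join(IUPAC[symbol] for symbol in pattern.upper())
    # A lookahead matches without consuming characters, so overlapping hits are kept.
    regex = re.compile("(?=(" + body + "))")
    return [match.start() for match in regex.finditer(sequence.upper())]

hits = find_sites("TATAWR", "GGTATAAAGTATATGC")
```

Note that every matching word is treated as equally good, which is exactly the lack of a statistical foundation mentioned above.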

### 3.6.2 The Matrix-Based Representation

The most popular way to represent motifs is the matrix-based representation, which has been validated repeatedly through use in successful large-scale analysis projects. This representation has gone by many names: profiles, alignment matrices, position-frequency matrices, and weight matrices; the terminology can be confusing. In addition, there are a few different (but related) kinds of matrices that people use to represent motifs, and certain names have been used by different researchers to describe different kinds of matrices. In this section we will describe what is referred to as a *count matrix* and a *position-weight matrix* through the rest of this tutorial.

[…] *count matrix*. Let \(M\) be a matrix with four rows (one for each DNA base) and \(w\) columns (with \(w\) being the *width* of \(M\)). When the binding sites for the TF associated with \(M\) are aligned (without gaps), column \(i\) of \(M\) contains the counts of bases appearing in column \(i\) of the alignment. We use \(M_i(j)\) to denote the \(j\)th row in the \(i\)th column of \(M\), and this value is the number of times base \(j\) appears in column \(i\) of the alignment of binding sites. So each entry in \(M\) must be nonnegative, and the sum of the entries in any column of \(M\) must equal the number of binding sites in the alignment. We can visualize such a matrix as follows:

\[M = \begin{pmatrix} M_1(A) & M_2(A) & \cdots & M_w(A) \\ M_1(C) & M_2(C) & \cdots & M_w(C) \\ M_1(G) & M_2(G) & \cdots & M_w(G) \\ M_1(T) & M_2(T) & \cdots & M_w(T) \end{pmatrix}\]

A *position-weight matrix*, or PWM, is very similar to a count matrix, except that the columns of a PWM are normalized. To construct a PWM, first take the count matrix obtained from an alignment of sites and then divide each entry by the sum of the entries in its column. We remark that the term PWM is frequently used in the literature to describe other kinds of matrices, including the *scoring matrices* described in Sect. 3.7.

Count matrices and PWMs contain almost equivalent information, and many databases and programs can treat them as equivalent (e.g., the TRANSFAC matrix table contains both count matrices and PWMs). Throughout this tutorial, unless otherwise stated or made clear by context, when we refer to motifs, we assume they are represented as PWMs.
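
The construction of a count matrix and the corresponding PWM from aligned sites can be sketched as follows. The data layout (a list of columns, each a dictionary from base to value), the function names, and the example sites are choices made for this illustration and are not a format used by any particular database.

```python
BASES = "ACGT"

def count_matrix(sites):
    """Count matrix of gap-free aligned sites: one {base: count} dict per column."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all binding sites must have the same length")
    return [{base: column.count(base) for base in BASES} for column in zip(*sites)]

def pwm_from_counts(counts):
    """Normalize every column of a count matrix so that it sums to 1."""
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append({base: column[base] / total for base in BASES})
    return pwm

# Hypothetical example sites.
counts = count_matrix(["TATAAA", "TATATA", "TATAAG", "CATAAA"])
pwm = pwm_from_counts(counts)
```

Each column of `counts` sums to the number of sites (four here), and each column of `pwm` sums to 1, as required by the definitions above.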

The matrix-based representation makes two assumptions:

- Binding sites for a TF all have the same width.
- Contributions of different positions in the binding site to the site’s function (usually directly related to the binding affinity of the TF and site) are independent.

While these assumptions are generally not true, they can describe the binding specificity of most TFs with high accuracy. Over time the matrix-based representation has become regarded as the most useful in general, and while the sites for any particular TF may be better modeled using some other representation, the matrix-based representation can usually provide a sufficiently accurate description.

### 3.6.3 Other Representations

As we just stated, the assumptions of independent positions and fixed-width binding sites have been very useful, but we know they are not strictly valid. A conceptually naive model that eliminates these assumptions would list probabilities of each sequence being a binding site for the given TF. Such a model is impractical due to the number of parameters required to describe each motif. Profile hidden Markov models (HMMs) were used by [57] to represent binding site motifs in a way that allows occurrences to have indels (gaps). This technique is taken directly from work on proteins, where profile HMMs have seen great successes. Durbin et al. [33] give details on using profile HMMs to model motifs (focusing on proteins). A slightly different approach was taken by Gupta and Liu [41], who incorporated parameters for insertions and deletions into a “stochastic dictionary model” for motifs, which is a more general Bayesian model to describe entire sequences based on their pattern content. Models that describe dependencies between positions in binding sites have also been designed. Barash et al. [8] described models based on Bayesian networks, where the probability of a base at a given position may depend on the identity of bases at other positions. The structure of these dependencies can be arbitrary: the probability of the first position might depend on the base appearing at the second position, the third position, both the second and third positions, or neither. Zhou and Liu [95] defined a “generalized weight matrix” that allows correlations between arbitrary pairs of positions but is restricted in complexity by requiring that pairs be independent.

## 3.7 Predicting Transcription Factor Binding Sites

If we have a motif that describes the binding specificity of some TF, the most common first task is to find sites in sequences that appear similar to the pattern described by the motif. Those sites are called *occurrences*, and we will refer to them as such even when we have not stated exact criteria for which sites are *similar* to a motif. In this section we describe some fundamental ideas about matrix-based motifs and their occurrences in sequences.

### 3.7.1 The Multinomial Model for Describing Sequences

Consider a sequence \(S\) over the DNA alphabet \(\{A,C,G,T\}\) as having been generated by sampling \(w\) symbols according to a multinomial model over the alphabet. For example, the probabilities could be \(f = (f(A),f(C),f(G),f(T)) = (0.2,0.3,0.3,0.2)\). Thinking of sequences in this way allows us to calculate the probability of observing a particular sequence. If \(S = ACG\), and therefore \(w = 3\), then the likelihood of \(S\) being generated by the multinomial \(f\) would be

\[L(S \mid f) = f(A)\, f(C)\, f(G) = 0.2 \times 0.3 \times 0.3 = 0.018.\]

[…] \(CG\) […] \(CG\) in the human genome, because of general CpG depletion. More complex models (e.g., Markov models) allow dependence between positions in sequences to be described. When estimating the parameters of a multinomial model, we usually just use the frequencies of nucleotides in some set of sequences (e.g., promoters) or the entire genome.
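
The calculation in the example above, and the estimation of \(f\) from base frequencies, can be written out directly. This is a minimal sketch with hypothetical function names; it assumes the sequences contain only the four bases.

```python
import math
from collections import Counter

def multinomial_likelihood(seq, f):
    """Likelihood of `seq` when each base is drawn independently from f."""
    return math.prod(f[base] for base in seq)

def base_frequencies(sequences):
    """Estimate f as the observed base frequencies in a set of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

f = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
likelihood = multinomial_likelihood("ACG", f)  # 0.2 * 0.3 * 0.3
```

For long sequences the product underflows quickly, so in practice one would sum logarithms instead of multiplying probabilities.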

### 3.7.2 Motifs as Probabilistic Models

Let \(M\) be a matrix, just like the position-weight matrix described in Sect. 3.6, where each column in \(M\) has been normalized to have unit sum. So entry \(M_i(j)\) gives the probability that a sequence generated from \(M\) will have the \(j\)th base at position \(i\). The probability for a given sequence \(S\) of width \(w\) is the product of the matrix entries that correspond to the bases at each position:

\[\Pr(S \mid M) = \prod_{i=1}^{w} M_i(S_i),\]

where \(S_i\) is the base at position \(i\) of \(S\). […] *match score* of the sequence \(S\) with respect to the motif \(M\).

### 3.7.3 Pseudocounts

[…] \(M\) will often have 0 entries, and these are usually corrected by using a *pseudocount*, which adds a value to each entry of the matrix [33]. The pseudocount values can be calculated in a few different ways. The simplest method is to add a small constant positive value \(\varepsilon\) to each entry:

\[M'_i(j) = M_i(j) + \varepsilon.\]

[…] \(f\). The value of \(\varepsilon\) is usually much smaller than any value already in \(M\). When the entries in the matrix are counts (nonnegative integers instead of rationals), adding a count of 1 to each entry is called “Laplace’s method”:

\[M'_i(j) = M_i(j) + 1.\]

[…] \(w\) can be some arbitrary weight. Additional details on mathematical properties of these matrices can be found in Rahmann et al. [72].
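
As a small illustration of the constant-pseudocount idea, the sketch below adds a pseudocount to every entry of a count matrix and then normalizes the columns, so that no entry of the resulting PWM is zero. The function names and the column-of-dictionaries layout are assumptions of this example; with `pseudo=1` it corresponds to Laplace's method as described above.

```python
def add_pseudocount(counts, pseudo=1.0):
    """Add the same pseudocount to every entry of a count matrix.

    `counts` is a list of columns, each a {base: count} dict.
    """
    return [{base: value + pseudo for base, value in column.items()}
            for column in counts]

def normalize(counts):
    """Divide each entry by its column sum, giving a PWM."""
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append({base: value / total for base, value in column.items()})
    return pwm

raw = [{"A": 0, "C": 1, "G": 0, "T": 3}]   # one column from four hypothetical sites
smoothed = add_pseudocount(raw, pseudo=1.0)
pwm = normalize(smoothed)
```

Bases never observed in a column keep a small positive probability, which is what later log-based scores require.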

### 3.7.4 Scoring Matrices and Searching Sequences

[…] *scoring matrices*. Let \(M\) be a position-weight matrix with nonnegative entries such that each column has unit sum. Let \(f\) be the parameters of a multinomial distribution describing the expected frequency of bases (e.g., in the genome; in promoters). For entry \(i\) in column \(j\), the scoring matrix has the value

\[\log \frac{M_j(i)}{f(i)}.\]

This requires the entries of \(M\) and \(f\) to be nonzero, which is usually ensured by using some pseudocount method (see Sect. 3.7.3). In practice, base frequencies in \(f\) are never 0.
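
A scoring matrix of log-ratios and the resulting score of a candidate site can be sketched as follows. This assumes that the match score of a site is the sum of the scoring-matrix entries selected by its bases (i.e., a log-likelihood ratio of motif versus background), and it uses natural logarithms; the function names and the toy PWM are hypothetical.

```python
import math

def scoring_matrix(pwm, f):
    """Log-ratio scoring matrix: log(M_j(i) / f(i)) for base i in column j."""
    return [{base: math.log(column[base] / f[base]) for base in column}
            for column in pwm]

def match_score(site, scores):
    """Sum of the scoring-matrix entries selected by the bases of `site`."""
    if len(site) != len(scores):
        raise ValueError("site and scoring matrix must have the same width")
    return sum(column[base] for column, base in zip(scores, site))

# Toy PWM of width 2 (all entries nonzero) and a uniform background.
toy_pwm = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
           {"A": 0.125, "C": 0.5, "G": 0.25, "T": 0.125}]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_scores = scoring_matrix(toy_pwm, uniform)
```

Positive scores indicate sites that are more likely under the motif than under the background, and negative scores the opposite.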

### 3.7.5 Algorithmic Techniques for Identifying High-Scoring Sites

The naive method of identifying motif occurrences is to align the scoring matrix with each possible position in a sequence and calculate the associated match score. This method works well in many applications, producing exact scores with a time complexity of *O*(*wn*) for motif width *w* and sequence length *n*. However, there are applications where very large amounts of sequence must be searched for occurrences of hundreds or thousands of motifs. Finding motif occurrences is often a subproblem involved in motif discovery (see Sect. 3.10), and in such cases identifying occurrences of some candidate motif can be a major bottleneck in the computation.
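
The naive scan can be written in a few lines. This is a sketch under simple assumptions: the scoring matrix is a list of `{base: score}` columns, only one strand is scanned, and the function name, toy scores, and threshold are hypothetical.

```python
def scan(sequence, scores, threshold):
    """Score every window of the sequence; return (start, score) for hits.

    Runs in O(w * n) time for motif width w and sequence length n.
    """
    w = len(scores)
    hits = []
    for start in range(len(sequence) - w + 1):
        score = sum(scores[i][sequence[start + i]] for i in range(w))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy integer scoring matrix favouring the word "AC".
toy_scores = [{"A": 2, "C": -1, "G": -1, "T": -1},
              {"A": -1, "C": 2, "G": -1, "T": -1}]
hits = scan("GACTACA", toy_scores, threshold=4)
```

In a real analysis one would usually also scan the reverse complement and choose the threshold by one of the significance criteria discussed in Sect. 3.7.6.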

Here we describe three programs that implement three different approaches for finding motif occurrences in sequences. The match program [45] is a tool developed in close association with the TRANSFAC database [59]. The scoring function used in match is based on scoring matrices, as described above. The search used in match is very simple but incorporates a heuristic speedup based on the idea of a matrix “core.” The core is a set of consecutive matrix columns with high information content relative to the rest of the matrix. Cores of size 5 are used by match, which pre-computes the match score between each 5-mer and the core scoring matrix. The search proceeds by identifying the locations in the sequences where a subset of the 5-mers scoring above some cutoff occurs. Those 5-mers are then extended to the full width of the original matrix to obtain the final match score. This method is heuristic because it could miss occurrences that score highly with respect to the full motif but score below the cutoff with respect to the core.

The possumsearch program of Beckstette et al. [9] is based on the enhanced suffix array data structure [1]. The storm program [77] implements a search similar to that of possumsearch, with suffix trees used in place of suffix arrays. The technique of using suffix trees to increase the speed of identifying matrix occurrences was also used by Dorohonceanu and Nevill-Manning [31]. These methods all incorporate two important ideas: heavily preprocessing the sequences in which to search and using “look-ahead” scoring. The preprocessing, in the form of enhanced suffix arrays or suffix trees, allows the search to match identical substrings of the sequence only once against a prefix of the matrix. This helps the search a great deal because the small DNA alphabet means that relatively short segments may appear many times in a sequence and would otherwise have to be matched many times. The “look-ahead” scoring is similar to a branch-and-bound search. While matching the motif against a segment of the sequence, if the score for the initial positions is sufficiently low, then regardless of the identity of the remaining positions in the segment, a high score cannot be attained. This knowledge allows the algorithms to decide early that the segment cannot possibly be an occurrence. Versions of match, storm, and possumsearch are freely available for academic use.
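
The look-ahead idea on its own, without any suffix array or suffix tree preprocessing, can be sketched as follows. For each matrix column we precompute the best score still attainable from the remaining columns; a window is abandoned as soon as its partial score plus that bound falls below the threshold. This is only an illustration of the pruning principle, not the implementation used by possumsearch or storm, and all names are hypothetical.

```python
def scan_lookahead(sequence, scores, threshold):
    """Scan with early termination; returns the same hits as an exhaustive scan."""
    w = len(scores)
    # max_rest[i]: highest score attainable from columns i .. w-1.
    max_rest = [0.0] * (w + 1)
    for i in range(w - 1, -1, -1):
        max_rest[i] = max_rest[i + 1] + max(scores[i].values())
    hits = []
    for start in range(len(sequence) - w + 1):
        score = 0.0
        for i in range(w):
            score += scores[i][sequence[start + i]]
            if score + max_rest[i + 1] < threshold:
                break  # even a perfect remainder cannot reach the threshold
        else:
            # The loop finished without pruning, so score >= threshold.
            hits.append((start, score))
    return hits

toy_scores = [{"A": 2, "C": -1, "G": -1, "T": -1},
              {"A": -1, "C": 2, "G": -1, "T": -1}]
hits = scan_lookahead("GACTACA", toy_scores, threshold=4)
```

The pruning never discards a true hit, because the bound is the best case for the unread positions; the gain grows with motif width and with the stringency of the threshold.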

### 3.7.6 Measuring Statistical Significance of Matches

There are several ways to determine whether match scores for a motif are statistically significant. One of the first methods was proposed by Staden [82]. For a motif of width *w*, this method assumes sequences are generated by selecting *w* bases independently at random according to a multinomial model describing base frequencies. The *p*-value for a match score is the proportion of sequences generated according to that procedure that would have a match score at least as high. Staden [82] also described a simple dynamic programming algorithm for calculating those *p*-values. An alternative, incorporated in match [45], identifies putative sites based on their expected frequency. Given input sequences, scoring thresholds are set based on a *p*-value selected from the distribution of all potential site scores. Additional methods are described by Schones et al. [77]. When selecting a method for measuring statistical significance of matches, the most important consideration is the hypothesis underlying each method: each method makes a different set of assumptions, and different assumptions might be appropriate in different contexts.

Scoring thresholds can also be described using the *functional depth* of the score. The functional depth of a score cutoff is a normalization of the score to the [0, 1] interval. The normalization takes the cutoff score and subtracts the minimum possible score for the motif; this value is then divided by the difference between the maximum and minimum possible scores for the motif.
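
Both quantities described above can be sketched compactly. The *p*-value routine below builds the exact distribution of match scores column by column, in the spirit of the dynamic programming approach attributed to Staden [82]; it is not a reproduction of that algorithm and assumes that scores have been discretized (e.g., to integers) so that equal totals can be merged. All function names and the toy matrix are hypothetical.

```python
from collections import defaultdict

def score_distribution(scores, f):
    """Distribution {total score: probability} over random w-mers drawn from f."""
    dist = {0: 1.0}
    for column in scores:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for base, s in column.items():
                nxt[total + s] += prob * f[base]
        dist = dict(nxt)
    return dist

def p_value(score, scores, f):
    """Probability that a random w-mer scores at least `score`."""
    return sum(p for s, p in score_distribution(scores, f).items() if s >= score)

def functional_depth(cutoff, scores):
    """Normalize a cutoff to [0, 1] using the minimum and maximum attainable scores."""
    lowest = sum(min(column.values()) for column in scores)
    highest = sum(max(column.values()) for column in scores)
    return (cutoff - lowest) / (highest - lowest)

toy_scores = [{"A": 2, "C": -1, "G": -1, "T": -1},
              {"A": -1, "C": 2, "G": -1, "T": -1}]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
```

With integer scores the number of distinct totals stays small, so the cost is proportional to the motif width times the number of attainable totals rather than to the \(4^w\) possible words.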

Even with a highly accurate motif model, and using the most stringent matching criteria, scanning genomic sequences as described in this section is not by itself an effective means of identifying functional transcription factor binding sites. Using such a simple procedure will result in high rates of both false-positive and false-negative predictions. The main problems are not the assumptions of independent positions in the motif or the assumptions of fixed binding-site width. Most of the difficulty stems from the complexity of transcription factor function, including genome organization, chromatin structure, and protein–protein interactions. Still, when used in conjunction with other information, the methods just described can become highly effective and represent a fundamental technique in regulatory sequence analysis. In later sections we will describe how additional information can be incorporated into the process.

## 3.8 Modeling Motif Enrichment in Sequences

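If a motif is relevant to the regulation of a set of co-regulated genes, we may expect it to have one or more of the following properties. It may:

- Occur more often in promoters of the co-regulated genes
- Have stronger occurrences in those promoters
- Occur with a particular strength or frequency in a significantly high proportion of those promoters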

Motifs with various combinations of these three properties for a particular set of sequences are said to be *enriched* or *overrepresented* in the sequences. These properties are always evaluated relative to what we expect to observe in the sequences.

### 3.8.1 Motif Enrichment Based on Likelihood Models

Many important measures of motif enrichment in sequences are based on using a mixture model to describe the sequences, where one component of the mixture describes the motif occurrences in the sequences, and the other component describes the remaining parts of the sequences. The importance of this general formulation comes from its use in a variety of influential algorithms, and the flexibility that makes this method able to incorporate different kinds of information. An early use of mixture models to describe sequences containing motif occurrences was by Lawrence and Reilly [52], and the model was later extended by Bailey and Elkan [3], Liu et al. [55] and others. In Sect. 3.10.2 we will explain how these likelihood formulations of enrichment are used in motif discovery.

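Consider a sequence \(S\), a multinomial distribution \(f\) over the DNA bases, a motif \(M\), and a location \(z\) where \(M\) occurs in \(S\). Then by multiplying the likelihoods for the corresponding parts of \(S\), the likelihood that \(S\) was generated according to the two models is

\[L(S \mid z, M, f) = \prod_{i=1}^{z-1} f(S_i)\; \prod_{j=1}^{w} M_j(S_{z+j-1})\; \prod_{i=z+w}^{n} f(S_i),\]

where \(w = \vert M\vert\) and \(n = \vert S\vert\).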

*M*occurs in multiple (non-overlapping) locations in

*S*. Let \(Z =\{ z_{1},\ldots,z_{k}\}\) be a set of start positions in

*S*for occurrences of

*M*, where the difference |

*z*

_{i}−

*z*

_{j}| ≥

*w*for any

*z*

_{i},

*z*

_{j}∈

*Z*, and let \({Z}^{c} =\{ i : 1 \leq i \leq n,i < z,z + w \leq i,\forall z \in Z\}\). Then we define

*Z*describes occurrence locations for all \(S \in \mathcal{F}\). As with the case of a single sequence, the occurrence indicators may be constrained to indicate some specific number of locations in each sequence.
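The likelihood calculation described in this section can be illustrated with a minimal Python sketch. The function name `log_likelihood`, the dictionary-based motif representation, and the toy sequence and motif are all assumptions made for illustration. Logarithms are used so that the products become sums.

```python
import math

BASES = "ACGT"

def log_likelihood(seq, motif, starts, f):
    """log L(S | M, Z, f) for non-overlapping occurrence starts (1-based).

    motif[j][b] is the probability of base b in column j + 1 of the motif;
    f[b] is the background probability of base b.
    """
    w = len(motif)
    starts = sorted(starts)
    # occurrences must fit inside the sequence and must not overlap
    assert all(1 <= z <= len(seq) - w + 1 for z in starts)
    assert all(b - a >= w for a, b in zip(starts, starts[1:]))
    covered = {}  # 0-based sequence position -> motif column index
    for z in starts:
        for j in range(w):
            covered[z - 1 + j] = j
    total = 0.0
    for i, base in enumerate(seq):
        if i in covered:
            total += math.log(motif[covered[i]][base])  # motif component
        else:
            total += math.log(f[base])                  # background component
    return total

# Hypothetical toy data: a width-3 motif favoring ATG, uniform background.
f = {b: 0.25 for b in BASES}
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
seq = "CCATGCC"
with_site = log_likelihood(seq, motif, [3], f)  # ATG starts at position 3
no_site = log_likelihood(seq, motif, [], f)     # base composition only
print(with_site - no_site)                      # log-likelihood ratio
```

Because *f* is fixed here, the background terms outside the occurrence cancel in the difference, so only the three motif positions contribute to the log-likelihood ratio.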

At this point we comment that the base composition *f* used in these likelihood calculations will often be determined beforehand, either from the sequence *S* or some larger set of sequences of which *S* is a member, or *f* may be determined using, for example, the frequencies of bases in a genome. Sometimes *f* is calculated using *S* but only at positions in \({Z}^{c}\) as defined above. In those cases *f* depends on *Z*. Unless otherwise indicated, we will assume *f* is fixed in our calculations and will be implicit in our formulas.

Given a motif *M*, we can use Eq. 3.3 to calculate the likelihood of \(\mathcal{F}\) given *M*, over all possible occurrence locations *Z*, as

\[L(\mathcal{F}\vert M) =\sum_{Z}L(\mathcal{F}\vert M,Z)\Pr (Z),\]

where the sum is over all *Z* that are valid according to the constraints we wish to model on the number of occurrences per sequence and on disallowing overlapping occurrences. Often the *Z* are assumed to be uniformly distributed, which simplifies the above formula and makes it feasible to evaluate. Similarly, we can calculate the likelihood of \(\mathcal{F}\) given *M*, when maximized over valid *Z*, as

\[\max_{Z}L(\mathcal{F}\vert M,Z).\]

For two motifs \(M_{1}\) and \(M_{2}\), if \(L(\mathcal{F}\vert M_{1}) > L(\mathcal{F}\vert M_{2})\), then we say that \(M_{1}\) is more enriched than \(M_{2}\) in \(\mathcal{F}\). Comparing enrichment of motifs is an important part of many motif discovery algorithms. When we are only concerned with a single motif, we can ask if the likelihood of \(\mathcal{F}\) is greater under the mixture model than if we assume \(\mathcal{F}\) has no motif occurrences, using only the base composition (as illustrated in Eq. 3.1). If the base composition *f* is held fixed, then the logarithm of the likelihood ratio

\[\log \frac{L(\mathcal{F}\vert M)}{L(\mathcal{F}\vert f)},\]

where \(L(\mathcal{F}\vert f)\) is the likelihood of \(\mathcal{F}\) under the base composition alone, is positive exactly when the mixture model gives the greater likelihood, and it can serve as a measure of the enrichment of *M* in \(\mathcal{F}\).

The constraints on *Z* are commonly expressed through one of three models. In the *one occurrence per sequence* (OOPS) model, we assume that each sequence was generated from a mixture model and exactly one location per sequence is the start of a motif occurrence. The *zero or one occurrence per sequence* (ZOOPS) model allows slightly more flexibility: when we attempt to identify the motif occurrences whose locations maximize the likelihood, we may assume any particular sequence contains no occurrence of the motif. By relaxing our assumptions further, we arrive at the *two-component mixture* (TCM) model. In this model each sequence may have any number of occurrences (as long as they do not overlap).
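To make the difference between the three models concrete, the following Python sketch computes a maximized log-likelihood ratio under each of them. It assumes a fixed base composition *f*, so that background terms cancel and each candidate site contributes only its log-odds score. The function names, the dynamic program used for the TCM case, and the toy data are our own illustrative choices rather than the procedure of any particular program.

```python
import math

def site_scores(seq, motif, f):
    """Log-odds score log(L(S_k|M) / L(S_k|f)) for every start k (0-based).

    Assumes len(seq) >= len(motif).
    """
    w = len(motif)
    return [
        sum(math.log(motif[j][seq[k + j]] / f[seq[k + j]]) for j in range(w))
        for k in range(len(seq) - w + 1)
    ]

def oops_score(scores):
    # exactly one occurrence: take the best site, even if its score is negative
    return max(scores)

def zoops_score(scores):
    # zero or one occurrence: a sequence may contribute nothing
    return max(0.0, max(scores))

def tcm_score(scores, w):
    # any number of non-overlapping occurrences: dynamic programming over
    # prefixes, where best[i] is the best total score using the first i bases
    n = len(scores) + w - 1
    best = [0.0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]
        if i >= w:
            best[i] = max(best[i], best[i - w] + scores[i - w])
    return best[n]

# Hypothetical toy data: a width-3 motif favoring ATG, uniform background.
f = {b: 0.25 for b in "ACGT"}
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
seqs = ["CCATGCC", "ATGATGC", "CCCCCCC"]  # one site, two sites, no site
w = len(motif)
all_scores = [site_scores(s, motif, f) for s in seqs]
oops_total = sum(oops_score(sc) for sc in all_scores)
zoops_total = sum(zoops_score(sc) for sc in all_scores)
tcm_total = sum(tcm_score(sc, w) for sc in all_scores)
print(oops_total, zoops_total, tcm_total)
```

In this toy example the third sequence has no good site, so OOPS is penalized for being forced to place an occurrence there, ZOOPS skips it, and TCM additionally credits both sites in the second sequence.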

### 3.8.2 Relative Enrichment Between Two Sequence Sets

The methods described above consider enrichment relative to what we expect if the sequences had been generated randomly from some (usually simple) statistical model. Biological sequences are rarely described well using simple statistical models, and currently no known models provide adequate descriptions of transcriptional regulatory sequences like promoters. It is usually more appropriate to measure enrichment of motifs in a given set of sequences relative to some other set of sequences. The set of sequences in which we wish to test motif enrichment is called the *foreground* set, and enrichment of a motif is then measured relative to what we observe in some *background* set of sequences.

At the beginning of Sect. 3.8, we gave three characteristics that we intuitively associate with motif enrichment, and those three can easily be understood when a background sequence set is used. Motifs that are enriched in the foreground relative to the background should occur more frequently in the foreground than the background, the foreground occurrences should be stronger than those in the background, and more sequences in the foreground than in the background should contain an occurrence.

The exact measures of enrichment could be adapted from the likelihood-based measures of Sect. 3.8.1, perhaps by examining the difference or ratio of the enrichment calculated for a motif in the foreground sequences and that calculated in the background sequences. However, a more general and extremely powerful method is available when measuring relative enrichment: using properties of motif occurrences in the sequences to classify the foreground and background sequences. For example, if we fix some criteria for which sites in a sequence are occurrences of a motif, and under those criteria 90 % of our foreground sequences contain at least one occurrence, but only 20 % of our background sequences contain an occurrence, then (1) we could use the property of “containing an occurrence” to predict the foreground sequences with high accuracy, and (2) that motif is clearly enriched in the foreground relative to the background. This classification-based approach also benefits from a great deal of theory and algorithms from the machine learning community. The idea of explicitly using a background sequence set in this way is due originally to Barash et al. [7].
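The 90 % versus 20 % example can be turned into numbers with a short Python sketch. The set sizes (50 foreground and 200 background sequences) are hypothetical, and the hypergeometric tail probability is used here as one plausible enrichment statistic, in the spirit of the title of [7]. It should not be read as the exact procedure of that paper.

```python
from math import comb

def hypergeom_tail(k, n_fg, K, N):
    """P(X >= k) when n_fg sequences are drawn without replacement from N
    sequences, K of which contain an occurrence of the motif."""
    upper = min(n_fg, K)
    favorable = sum(comb(K, x) * comb(N - K, n_fg - x) for x in range(k, upper + 1))
    return favorable / comb(N, n_fg)

# Hypothetical counts matching the percentages in the text.
n_fg, fg_hits = 50, 45     # 90 % of foreground sequences contain an occurrence
n_bg, bg_hits = 200, 40    # 20 % of background sequences contain an occurrence

N = n_fg + n_bg
K = fg_hits + bg_hits

# Accuracy of the rule "contains an occurrence => foreground"
accuracy = (fg_hits + (n_bg - bg_hits)) / N

# Probability of seeing at least fg_hits occurrences in the foreground
# if occurrences were spread over all sequences at random
p_value = hypergeom_tail(fg_hits, n_fg, K, N)
print(accuracy, p_value)
```

A small tail probability indicates that the occurrences are concentrated in the foreground far more than expected by chance, which is the sense of enrichment described above.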

The purpose of using a background set is to provide some way of expressing properties of sequences that are not easily described using some simple statistical distribution. For that reason selection of the background set is often critical. The background set should be selected to control extraneous variables and should often be as similar as possible to the foreground in all properties except the defining property of the foreground. While it is not always possible to find an ideal background set, multiple background sets can be used to control different characteristics.

When the foreground is a set of proximal promoters for co-regulated genes, a natural background to use might be a set of random promoters. In a case like this, simply taking random sequences would not control for the properties common to many promoters, such as CpG islands or TATA-box motifs. Using random promoters can control for these kinds of patterns and help reveal motifs that are specific to the foreground sequence set, and not promoters in general. When the foreground has been derived from microarray expression data, and the promoters of interest correspond to upregulated genes, a reasonable background may be promoters of house-keeping genes or promoters of downregulated genes in the same experiment. Using house-keeping genes for control is akin to selecting a set of promoters that are not likely to be regulated specifically in any given condition. Using downregulated genes may illuminate the interesting differences related to the particular experiment. Similarly, if the foreground contains sequences showing high binding intensity in a ChIP-on-chip experiment, we might consider sequences showing low affinity in the same experiment as a background. Alternatively, we might suspect that certain motifs are enriched in the general regions where binding has been observed. Using a background set of sequences also taken from those regions might be able to control for such effects (possibly related to chromatin structure in those regions).

Often the size of the foreground sequence set is dictated by the experiment that identified the sequences, but the size of the background set can be selected. There is no correct size for a background sequence set, and each individual program or statistical method applied to the sequences may either limit the size of the background set or require a minimum size. In general, however, it is advisable to have a large background set and to have a background set that is similar in size (count and length) to the foreground set.

A related means of measuring enrichment, similar in spirit to the classification method, has been applied when it is difficult to assign sequences to discrete classes, such as a foreground and background. This technique uses regression instead of classification and attempts to use properties of sequences to fit some experimentally derived function, such as binding intensity in a ChIP-on-chip experiment. This strategy has been used by Das et al. [26, 27], Bussemaker et al. [21], and Conlon et al. [25].
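As a minimal illustration of the regression idea, the sketch below fits a straight line relating a per-sequence motif score to a measured intensity and reports the fraction of variance explained. The data values are invented for illustration, and ordinary least squares with a single predictor is an assumption on our part. The cited methods use their own, more elaborate models.

```python
def fit_line(x, y):
    """Ordinary least squares fit y ~ intercept + slope * x; returns
    (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical data: best motif score in each sequence and the binding
# intensity measured for that sequence.
motif_score = [0.0, 1.0, 2.0, 3.0, 4.0]
intensity = [1.0, 3.1, 4.9, 7.2, 8.8]

slope, intercept, r_squared = fit_line(motif_score, intensity)
print(slope, intercept, r_squared)
```

A motif whose scores explain a large share of the variation in intensity can be regarded as enriched in this regression sense, without ever dividing the sequences into foreground and background.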

Before concluding this section, we remark that a great deal of work has been done on measures of enrichment for motifs that are represented as words. These measures of enrichment usually involve some statistical evaluations of the number of occurrences of words in sequences, since in general word-based representations do not consider the strength of an occurrence. The central questions concern the probabilities of observing words, either as exact or approximate matches (see Sect. 3.10.1), with particular frequencies in individual sequences or in sets of sequences. These and many related questions have been answered very well in the literature [2, 15, 63, 89].

## 3.9 Phylogenetic Conservation of Regulatory Elements

In Sect. 3.7 we explained why scanning genomic sequences with scoring matrices is likely to result in a very high rate of false-positive predictions. Here we describe how comparative genomics can help resolve this problem. Cross-species comparison is the most popular and generally useful means of adding confidence to computational predictions of binding sites, and in this context is often called *phylogenetic footprinting*, because comparison between species allows us to examine the evolutionary “footprint” resulting from constrained evolution at important genomic sites [16] and as an analog to the biochemical footprinting technique [69]. In this section we describe the most common general methods for using phylogenetic conservation to identify regulatory elements and discuss some related issues.

### 3.9.1 Three Strategies for Identifying Conserved Binding Sites

Here we will assume that we already have multispecies alignments for the genomic regions of interest, and that we have high confidence in the accuracy of those alignments.

#### 3.9.1.1 Searching Inside Genomic Regions Identified as Conserved

The most basic method is to use the multispecies alignments to define regions that are conserved and then use techniques such as those described in Sect. 3.7 to predict binding sites within those regions. It is often hypothesized, and in some cases demonstrated, that noncoding genomic regions with high conservation across several species play important regulatory roles [10, 67, 91]. Although this method is very simple, it is also extremely crude. Such highly conserved regions are the “low-hanging fruit” and are not necessarily relevant to any particular regulatory context. In cases where extremely stringent criteria for conservation are not useful, the proper definition for conserved regions (in terms of size and degree of conservation) may be difficult to determine.

#### 3.9.1.2 Using a Continuous Measure of Conservation

The strategy of defining conserved regions will only identify individual sites that exist within much larger regions that also appear to be under selection. Functional regulatory elements that show strong conservation across a number of species are also frequently identified as very small islands of conservation. For these reasons it is often useful to describe conservation by assigning a conservation value to each base in the genome, recognizing that one base might be under selection while the adjacent bases may be undergoing neutral evolution.

One popular measure of conservation that assigns a score to each base in a genome is the *conservation* score presented in the UCSC Genome Browser. This score is calculated using the phastCons algorithm of Siepel et al. [79], which is based on a *phylogenetic hidden Markov model* (phylo-HMM) trained on multi-species genome alignments (e.g., the 17-species vertebrate alignments available through the UCSC Genome Browser). Phylo-HMMs originated in the work of Felsenstein and Churchill [37] and, as used in phastCons, essentially describe the genomic alignment columns as evolving according to either a constrained evolutionary model or a neutral model. The phastCons scores can be interpreted as the probabilities that individual alignment columns are under negative selection. The phastCons scores have become important tools in genomics, but they do have some problems for regulatory sequence analysis, including disproportionate weight given to positions that align with very distant species (which are the regions most difficult to align accurately), and a smoothing parameter that introduces dependence between the scores at adjacent positions. Smoothing is desirable for coding sequences, in which the regions under selection are relatively large, but for regulatory regions, abrupt transitions might be more appropriate. Parameters in the phastCons algorithm can be adjusted to alleviate these problems, and in general the existing pre-computed phastCons scores remain highly useful.

#### 3.9.1.3 Evaluating Sites Using the Full Alignment

Both of the two strategies just described ultimately ignore information in the alignment columns. It is desirable to also consider the pattern of substitutions in the alignment columns within a site when we predict whether or not that site is under selection. Recent work by Moses et al. [60, 61] has resulted in much more sophisticated techniques that show great promise in helping us exploit cross-species conservation in identifying regulatory elements. This work is based on models initially developed to account for site-specific evolution in protein-coding sequences [42]. A given motif is used to construct a set of evolutionary models, one model for each column of the motif. A likelihood is calculated for each alignment column using the evolutionary model constructed from the corresponding motif column, and the product of these likelihoods is taken. The special property of each column-specific evolutionary model is that the substitution matrix is structured to favor substitutions resulting in bases that have greater frequency in the corresponding motif column. The final likelihood for a candidate binding site under such a model can be compared with the likelihood under a neutral model, indicating whether the site appears more likely to have evolved under selective pressure associated with the motif.

In addition to considering the patterns of substitutions in sites, we might also want to consider that evolutionary forces likely work to constrain properties of entire binding sites, rather than individual positions within those sites. Other functional elements in the genome, particularly RNAs that fold into important structures, have been found to have compensatory substitutions. This likely also happens in binding sites, for which the functional requirement is the overall affinity of the site for the binding domain of its cognate TF. Measuring conservation at a site by independently evaluating the aligned sites in each species could account for such events, but currently it is not clear how best to combine such individual species scores.

### 3.9.2 Considerations When Using Phylogenetic Footprinting

#### 3.9.2.1 Which Alignments to Use

The question of which alignments to use is extremely difficult to answer. Often the easiest choice will be to use pre-computed full-genome alignments, such as those produced using tba/multiz [14] or mlagan [18]. The alternative of actually computing the multispecies alignments allows one to select the set of species used and to tune the alignment parameters to something more appropriate for regulatory regions. Most global multiple-sequence alignment algorithms currently in use were originally designed to align coding sequences, including the popular clustal algorithm [86]. The conreal algorithm [11] is designed for pairwise alignment of promoters but must assume while computing the alignment that functional regulatory elements exist in the sequences. The promoterwise algorithm [35] is also designed specifically for promoters and allows inversion and translocation events, the utility of which is debatable. Neither of these algorithms is designed for more than two sequences.

#### 3.9.2.2 What Set of Species to Use

Although it is well known that conservation of putative binding sites suggests function, it is not known what proportion of functional binding sites are conserved. Lack of binding site conservation could be due to the loss of that particular regulatory function: a particular TF can regulate a particular gene in one species, but not in another species even if that other species has strong orthologs for both the TF and target. Another possibility is that the TF-target relationship is conserved between two species, but the binding sites through which the regulatory relationship is implemented are not orthologous.

Much of the diversity seen between closely related species such as mammals is attributed to differences in gene regulation, and therefore it makes sense that many regulatory elements will not be conserved. More closely related species will have a greater number of conserved binding sites, and those sites will have a greater degree of conservation. When selecting the species to use, one should consider how deeply conserved is the particular biological phenomenon that initiated the search for regulatory elements. We also might want to check that the DNA binding domain of the TF in question has a high degree of conservation in the species selected. If there is significant adaptation of the DNA binding domain, the aligned sites might show large differences in order to retain their function. If there is reason to believe that the binding specificity of the orthologous TF has changed in a particular species, that species may best be left out of the analysis.

#### 3.9.2.3 Binding Site Turnover

Another issue that must be considered is the possibility of binding site “turnover” [30]. Turnover refers to the process by which orthologous regulatory regions, with equivalent function in two species, come to contain equivalent binding sites that do not align. It is assumed that those binding sites do not share a common ancestral site but are both evolving under pressure to preserve a similar function. The commonly proposed mechanism is for a single site to exist in the ancestral sequence, but an additional site capable of similar function emerges by random mutation along one lineage. Because only one site is needed, the original site can mutate in the lineage that acquired the new site. Such a scenario would require many conditions to be satisfied, and the spontaneous emergence of new sites that can perform the function of existing sites seems to require that the precise location of the site (relative to other sites or the TSS) does not matter. Additionally, it is reasonable to assume that short binding sites are more likely to be able to emerge by chance. However, there is increasing evidence that such turnover events are important [44, 64], and efforts are being made to model these events to help understand regulatory sequences [12, 61, 62].

## 3.10 Motif Discovery

The problem of motif discovery is to identify motifs that optimize some measure of enrichment without relying on any given set of motifs; this is often referred to as de novo motif identification. Regardless of the motif representation or measure of motif enrichment being optimized, motif discovery is computationally difficult. The algorithmic strategies that have been applied to motif discovery are highly diverse, but several techniques have emerged as useful after 20 years of research and are critical components of the most powerful motif discovery programs. Tompa et al. [87] provide a review and comparison of many available motif discovery programs. In this section we will describe some of these techniques, grouping them into the broad categories of either word-based and enumerative or based on general statistical algorithms.

### 3.10.1 Word-Based and Enumerative Methods

When motifs are represented using words, as described in Sect. 3.6, the motif discovery problem can be solved by simply generating (i.e., enumerating) each possible word of some specified width and then evaluating the enrichment of each word in the sequences. Such an algorithm is an exact algorithm in that the most enriched word will be identified. When the measure of enrichment for each word can be evaluated rapidly, and the width of the words is sufficiently short, this technique works very well. This technique was used very early by Waterman et al. [90], and with advanced data structures (e.g., suffix trees, suffix arrays, and hashing) can be feasible for widths much greater than 10 bases, depending on the size of the sequence data and measure of enrichment used.
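The enumerative strategy can be sketched in a few lines of Python. Here the measure of enrichment is simply the number of sequences that contain the word at least once, which is an assumption chosen for brevity, and the function names and toy sequences are hypothetical.

```python
from itertools import product

def sequences_with_word(word, seqs):
    """Number of sequences containing at least one exact occurrence of word."""
    return sum(word in s for s in seqs)

def best_words(seqs, w, top=3):
    """Enumerate all 4**w DNA words of width w and rank them by enrichment."""
    words = ("".join(p) for p in product("ACGT", repeat=w))
    scored = [(sequences_with_word(word, seqs), word) for word in words]
    # highest count first; ties are broken alphabetically
    return sorted(scored, key=lambda t: (-t[0], t[1]))[:top]

# Hypothetical toy sequences; three of them contain GACGT.
seqs = ["TTGACGTCAA", "CCGACGTGG", "AGACGTT", "CTTTTTC"]
print(best_words(seqs, 5, top=1))
```

Because every word is evaluated, the most enriched word under the chosen measure is guaranteed to be found. The cost grows as \(4^{w}\), which is why the more advanced data structures mentioned above matter for larger widths.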

We also explained in Sect. 3.6 that word-based representations are not the most appropriate for TF binding sites, because they cannot adequately describe the degeneracy observed in binding sites for most TFs. However, because words are more easily manipulated and much more amenable to algorithmic optimization than matrix-based motifs, methods have been developed to increase their expressiveness.

One way to increase the expressiveness of word-based motifs is to relax our definition of what it means to be an occurrence of a word. A common relaxation is to fix some number of mismatches and define a match to a word as any subsequence of our input sequences that has at most that many mismatches when compared with the word. The occurrences are said to be approximate matches to the word-based motif, and such a representation can be highly effective when the likelihood of degeneracies in binding sites for a TF is roughly uniform across the positions in the sites. Several algorithms have been developed to discover words that have a surprisingly high number of such approximate occurrences in a set of sequences [19, 36, 53, 58, 68, 84].
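The notion of an approximate match can be written down directly. The sketch below reports every position at which a window of the sequence differs from the word in at most `d` positions. The function names and the example sequence are hypothetical, and a simple linear scan is used rather than any of the specialized algorithms cited above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approx_occurrences(word, seq, d):
    """1-based start positions of windows within d mismatches of word."""
    w = len(word)
    return [
        k + 1
        for k in range(len(seq) - w + 1)
        if hamming(word, seq[k:k + w]) <= d
    ]

seq = "TTGACGTCAAGATGTCC"                          # hypothetical sequence
exact_sites = approx_occurrences("GACGT", seq, 0)  # exact matches only
relaxed_sites = approx_occurrences("GACGT", seq, 1)
print(exact_sites, relaxed_sites)
```

Allowing one mismatch recovers the degenerate site GATGT in addition to the exact occurrence, at the price of treating a mismatch at any position as equally acceptable.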

Another way to increase the expressiveness of word-based representations is to actually expand the representation to include wildcard characters and even restricted regular expressions. pratt [17] and splash [22] are among the first regular expression-based motif discovery algorithms. Given a set of sequences on a fixed alphabet and a substitution matrix, these algorithms discover motifs composed of tokens (care positions), no-care positions, and gaps (pratt can find motifs with flexible gaps). These algorithms proceed by enumerating regular expressions of a specified length and extending significant motifs. Motif sites are identified deterministically through matching, where each substring in the input sequence either does or does not match a regular expression. Motif significance is evaluated analytically, by evaluating the likelihoods of motifs (and associated sites) based on motif structure and the size and composition of the input sequence set.
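Deterministic matching of a word with wildcards can be illustrated using the standard IUPAC nucleotide codes and Python regular expressions. This sketch covers only the matching step: it does not enumerate or extend patterns as pratt and splash do, and the pattern, sequence, and function name are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_sites(pattern, seq):
    """1-based start positions where seq matches an IUPAC pattern.

    A lookahead is used so that overlapping matches are also reported.
    """
    regex = re.compile("(?=(" + "".join(IUPAC[c] for c in pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(seq)]

# Hypothetical example: N is a no-care position in the middle of the word.
sites = find_sites("TGANTCA", "CCTGACTCATTGAGTCAGG")
print(sites)
```

Each substring either matches or does not match the expression, which is the deterministic notion of a motif site described above.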

While the most popular motif discovery programs in use today produce motifs that are represented as matrices, almost without exception these methods employ some sort of word-based procedure at some point in the algorithm. This should be seen as analogous to the way in which sequence database search algorithms, such as blast, initially screen for short exact matches but subsequently apply a more rigorous procedure to evaluate those initial matches.

### 3.10.2 General Statistical Algorithms Applied to Motif Discovery

Here we describe how two related and general algorithms, expectation maximization (EM) and Gibbs sampling, are applied to the motif discovery problem. Both of these algorithms can be used to identify maximum likelihood estimates of parameters for mixture models and are easily applied to estimate parameters of the enrichment models described in Sect. 3.8.1.

The occurrence indicators *Z*, which were used as sets of locations for the beginnings of motif occurrences, will now have a value for every possible location in the sequences that could be the start of a motif occurrence. So given a motif *M* of width *w* and a sequence *S* of length *n*, \(Z(S_{k})\) indicates the event that an occurrence of *M* starts at position *k* in *S*, for \(1 \leq k \leq n - w + 1\). In this way the occurrence indicators *Z* can take values of exactly 0 or 1, corresponding to complete knowledge about the occurrences of *M* in *S*. But now we introduce the variables \(\bar{Z}\), which lie *between* 0 and 1, and \(\bar{Z}(S_{k})\) is interpreted as the probability or the expectation that an occurrence of *M* begins at position *k* in *S*: \(\Pr (Z(S_{k}) = 1) = E(Z(S_{k})) =\bar{ Z}(S_{k})\). In Sect. 3.8.1 we placed restrictions on the values taken by *Z* to require, for example, that each sequence have exactly one motif occurrence (recall the OOPS model). Analogous restrictions can be used when fractional values in \(\bar{Z}\) are interpreted as expectations: the OOPS model would require that the *sum* of values of \(\bar{Z}\) for a particular sequence be exactly one:

\[\sum_{k=1}^{n-w+1}\bar{Z}(S_{k}) = 1,\]

where *n* = |*S*| and *w* = |*M*|. Different restrictions on \(\bar{Z}\) can be made to correspond to the requirements of the ZOOPS or TCM models.

### 3.10.3 Expectation Maximization

The expectation maximization algorithm is a general method for finding maximum likelihood estimates of parameters to a mixture model [29] and was originally applied to motif discovery by Lawrence and Reilly [52]. Starting with some initial value for the motif *M* (possibly a guess), the idea is to calculate the expected values of the occurrence indicators *Z* and then update *M* as the maximum likelihood estimate given the expected values \(\bar{Z}\). These two steps, called the *expectation step* and *maximization step*, are iterated until the likelihood corresponding to the maximum likelihood estimate of *M* converges.

*Expectation Step.* In the expectation step, for the given *M*, we calculate the expected value \(\bar{Z}\) using the formula

\[\bar{Z}(S_{k}) = \frac{L(S_{k}\vert M)}{L(S_{k}\vert M) + L(S_{k}\vert f)}, \tag{3.4}\]

where \(L(S_{k}\vert M)\) and \(L(S_{k}\vert f)\) are the likelihoods for the width *w* subsequence of *S*, starting at position *k*, according to the motif *M* and base composition *f*, respectively (refer back to Sect. 3.8.1 for how these individual likelihoods are calculated).

*Maximization Step.* In the maximization step, we update the estimate of the motif *M* using the values of \(\bar{Z}\) from the expectation step. Define the function \(I(X)\) to take a value 1 when *X* is true and 0 otherwise. Then the maximum likelihood estimate of the value for base *j* in the *i*th column of *M* is

\[M_{i}(j) = \frac{1}{N}\sum_{S\in \mathcal{F}}\sum_{k=1}^{\vert S\vert -w+1}\bar{Z}(S_{k})\,I(S_{k+i-1} = j), \tag{3.5}\]

where *N* is a normalizing factor to ensure that the columns of *M* sum to exactly 1. The value of *N* can be obtained directly from \(\bar{Z}\), and is actually the sum of the values in \(\bar{Z}\) over all sequences, and all positions in the sequences. When *f* is not specified as fixed, but instead estimated from the sequences, it can be updated during the maximization step in a manner similar to that of \(M_{i}(j)\).

The above description assumes that we have uniform prior probabilities on the values of *Z* during the expectation step. The formula for \(\bar{Z}(S_{k})\) can be easily modified to incorporate prior probabilities on *Z* that account for additional information that might be available about which positions in the sequences are more likely to be the start of a motif occurrence. Although there is no rule for how the initial value of *M* is obtained, the most common approach is to use a word-based algorithm, and construct the initial matrix-based motif from word-based motifs. Also, motif discovery programs based on EM typically try many different initial motifs, because the algorithm tends to converge quickly. meme [3] is the most widely known motif discovery program based on EM, but several other programs use EM at some point, often just prior to returning the motifs, because EM does a nice job of optimizing a motif that is nearly optimal. The major early criticism of EM in motif discovery is that it provides no guarantee that the motifs produced are globally optimal, and instead depends critically on the starting point (i.e., the initial motif) selected. It is now recognized that in practice, most matrix-based motif discovery algorithms have this problem, which illustrates the value of word-based methods in identifying good starting points.
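The two steps can be put together in a compact Python sketch. Several choices here are assumptions made for illustration and are not taken from meme or any other program: the OOPS restriction is imposed by normalizing the expectations within each sequence so that they sum to one, a small pseudocount is added in the maximization step so that no motif entry becomes exactly zero, the base composition *f* is held fixed, and the initial motif is built from the word GACGT. All function names and sequences are hypothetical.

```python
BASES = "ACGT"

def site_ratio(seq, k, motif, f):
    """L(S_k | M) / L(S_k | f) for the width-w site starting at 0-based k."""
    r = 1.0
    for j, col in enumerate(motif):
        b = seq[k + j]
        r *= col[b] / f[b]
    return r

def e_step(seqs, motif, f):
    """Expected occurrence indicators, normalized per sequence (OOPS)."""
    w = len(motif)
    zbar = []
    for s in seqs:
        ratios = [site_ratio(s, k, motif, f) for k in range(len(s) - w + 1)]
        total = sum(ratios)
        zbar.append([r / total for r in ratios])
    return zbar

def m_step(seqs, zbar, w, pseudo=0.01):
    """Re-estimate each motif column from expectation-weighted base counts."""
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for s, zs in zip(seqs, zbar):
        for k, z in enumerate(zs):
            for i in range(w):
                counts[i][s[k + i]] += z
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

def em(seqs, motif, f, iters=50):
    """Alternate the two steps for a fixed number of iterations (iters >= 1)."""
    for _ in range(iters):
        zbar = e_step(seqs, motif, f)
        motif = m_step(seqs, zbar, len(motif))
    return motif, zbar

# Hypothetical data: three exact GACGT sites and one degenerate site GATGT.
seqs = ["TTGACGTCAA", "CCGACGTGG", "AGACGTT", "CTGATGTC"]
f = {b: 0.25 for b in BASES}
# Word-based starting point: 0.7 for the word's base, 0.1 for the others.
motif0 = [{b: (0.7 if b == c else 0.1) for b in BASES} for c in "GACGT"]

final_motif, zbar = em(seqs, motif0, f)
best_starts = [max(range(len(z)), key=z.__getitem__) for z in zbar]
print(best_starts)
```

A fixed iteration count stands in for a proper convergence test on the likelihood. In a fuller implementation one would stop when the likelihood no longer improves and would restart from several initial motifs, as discussed above.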

### 3.10.4 Gibbs Sampling

The Gibbs sampling algorithm is a general method for sampling from distributions involving multiple random variables [39]. It was first applied to motif discovery in the context of protein motifs by Lawrence et al. [51]. As with the EM algorithm, Gibbs proceeds in two alternating steps: one using an estimate of the motif to update information about its occurrences and the other step using information about those occurrences to refine the estimate of the motif. Applied to motif discovery, the main difference between Gibbs sampling and EM is that instead of using the expected values of *Z* to update the motif *M*, Gibbs randomly samples exact occurrence locations for *M* (i.e., 0 or 1 values for variables in *Z*). We denote the indicators for the sampled occurrence location with \(\hat{Z}\).

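*Sampling Step.* In the sampling step, for the current estimate of *M*, we first calculate the expected values of variables in \(\bar{Z}\) using Formula 3.4. Then we update the values of \(\hat{Z}\) as

\[\hat{Z}(S_{k}) = \begin{cases} 1 & \text{with probability } \bar{Z}(S_{k}), \\ 0 & \text{with probability } 1 -\bar{Z}(S_{k}), \end{cases}\]

subject to whatever restrictions we have placed on *Z*. For example, the OOPS condition would require that exactly one *k* satisfy \(\hat{Z}(S_{k}) = 1\) for any \(S \in \mathcal{F}\).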

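*Predictive Update Step.* In the predictive update step, the parameters of the motif *M* are estimated from the sampled \(\hat{Z}\) variables as follows:

\[M_{i}(j) = \frac{1}{N}\sum_{S\in \mathcal{F}}\sum_{k=1}^{\vert S\vert -w+1}\hat{Z}(S_{k})\,I(S_{k+i-1} = j),\]

where *N* is the same normalizing factor used in Eq. 3.5.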

In the original application of Gibbs sampling to motif discovery, each sampling step updated the values of \(\hat{Z}\) for only one sequence *S*, and the values corresponding to other sequences retained their values from previous iterations [51]. While the original application was concerned with the OOPS model, the Gibbs sampling strategy has seen many generalizations and extensions. Liu et al. [55] provide a detailed and rigorous treatment. Note that, in contrast with EM, Gibbs sampling does not converge absolutely, and implementations use different criteria to determine how many iterations to execute. Programs implementing the Gibbs sampling strategy for motif discovery include gibbssampler [55], alignace [74], and mdscan [54], which are widely used. alignace is designed specifically for intergenic DNA sequences and masks occurrences of earlier-discovered motifs to facilitate discovery of multiple independent motifs. The mdscan program is designed to work on a ranked sequence set and uses overrepresented words in the highest ranking sequences to obtain starting points for the Gibbs sampling procedure.
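A minimal site sampler under the OOPS model might look as follows in Python. This is a sketch in the spirit of the one-sequence-at-a-time scheme described above, not a reimplementation of any of the cited programs. The pseudocount, the fixed number of sweeps, the fixed random seed, and all function names and sequences are assumptions made for illustration.

```python
import random

BASES = "ACGT"

def build_motif(seqs, starts, w, skip=None, pseudo=0.5):
    """Predictive update: estimate M from the currently sampled sites,
    optionally leaving out the sequence with index `skip`."""
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for idx, (s, z) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for i in range(w):
            counts[i][s[z + i]] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

def gibbs(seqs, w, f, sweeps=200, seed=0):
    """Return sampled 0-based start positions (one per sequence) and the
    motif estimated from them."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]  # random initial sites
    for _ in range(sweeps):
        for idx, s in enumerate(seqs):
            motif = build_motif(seqs, starts, w, skip=idx)
            # weight of each candidate start: L(S_k | M) / L(S_k | f)
            weights = []
            for k in range(len(s) - w + 1):
                r = 1.0
                for i in range(w):
                    r *= motif[i][s[k + i]] / f[s[k + i]]
                weights.append(r)
            # sampling step: draw exactly one start for this sequence (OOPS)
            starts[idx] = rng.choices(range(len(s) - w + 1), weights=weights)[0]
    return starts, build_motif(seqs, starts, w)

seqs = ["TTGACGTCAA", "CCGACGTGG", "AGACGTT", "CTGATGTC"]  # hypothetical data
f = {b: 0.25 for b in BASES}
starts, motif = gibbs(seqs, w=5, f=f)
print(starts)
```

Because the starts are drawn at random rather than chosen deterministically, different seeds can give different results, and in practice one would run several chains and keep the sites with the highest likelihood.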

### References

- 1.Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms 2(1):53–86MATHMathSciNetCrossRefGoogle Scholar
- 2.Apostolico A, Bock ME, Lonardi S (2002) Monotony of surprise and large-scale quest for unusual words. In: Proceedings of the sixth annual international conference on computational biology. ACM Press, New York, pp 22–31Google Scholar
- 3.Bailey TL, Elkan C (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn 21(1–2):51–80Google Scholar
- 4.Bairoch A (1992) PROSITE: a dictionary of site and patterns in proteins. Nucl Acids Res 20:2013–2018CrossRefGoogle Scholar
- 5.Bajic V, Seah S (2003) Dragon gene start finder identifies approximate locations of the 5
^{′}ends of genes. Nucleic Acids Res 31:3560–3563CrossRefGoogle Scholar - 6.Bajic V, Tan S, Suzuki Y, Sugano S (2004) Promoter prediction analysis on the whole human genome. Nat Biotechnol 22:1467–1473CrossRefGoogle Scholar
- 7. Barash Y, Bejerano G, Friedman N (2001) A simple hyper-geometric approach for discovering putative transcription factor binding sites. Lect Notes Comput Sci 2149:278–293
- 8. Barash Y, Elidan G, Friedman N, Kaplan T (2003) Modeling dependencies in protein-DNA binding sites. In: Miller W, Vingron M, Istrail S, Pevzner P, Waterman M (eds) Proceedings of the seventh annual international conference on computational molecular biology, ACM Press, New York, pp 28–37. doi:10.1145/640075.640079
- 9. Beckstette M, Stothmann D, Homann R, Giegerich R, Kurtz S (2004) Possumsearch: fast and sensitive matching of position specific scoring matrices using enhanced suffix arrays. In: Proceedings of the German conference in bioinformatics. pp 53–64
- 10. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D (2004) Ultraconserved elements in the human genome. Science 304(5675):1321–1325
- 11. Berezikov E, Guryev V, Plasterk RH, Cuppen E (2004) CONREAL: conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res 14(1):170–178. doi:10.1101/gr.1642804
- 12. Berg J, Willmann S, Lassig M (2004) Adaptive evolution of transcription factor binding sites. BMC Evol Biol 4(1):42. doi:10.1186/1471-2148-4-42. URL http://www.biomedcentral.com/1471-2148/4/42
- 13. Bernal A, Crammer K, Hatzigeorgiou A, Pereira F (2007) Global discriminative learning for high-accuracy computational gene prediction. PLoS Comput Biol 3:e54
- 14. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14(4):708–715
- 15. Blanchette M, Sinha S (2001) Separating real motifs from their artifacts. In: Brunak S, Krogh A (eds) Proceedings of the annual international symposium on intelligent systems for molecular biology, pp 30–38
- 16. Blanchette M, Tompa M (2002) Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res 12(5):739–748
- 17. Brazma A, Jonassen I, Ukkonen E, Vilo J (1996) Discovering patterns and subfamilies in biosequences. In: Proceedings of the annual international symposium on intelligent systems for molecular biology, pp 34–43
- 18. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13(4):721–731
- 19. Buhler J, Tompa M (2002) Finding motifs using random projections. J Comput Biol 9(2):225–242
- 20. Burge C, Karlin S (1997) Prediction of complete gene structure in human genomic DNA. J Mol Biol 268:78–94
- 21. Bussemaker HJ, Li H, Siggia ED (2001) Regulatory element detection using correlation with expression. Nat Genet 27(2):167–171
- 22. Califano A (2000) SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics 16(4):341–357
- 23. Carninci P, et al (2006) Genomewide analysis of mammalian promoter architecture and evolution. Nat Genet 38:626–635
- 24. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR (2005) Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308(5725):1149–1154
- 25. Conlon EM, Liu XS, Lieb JD, Liu JS (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci USA 100(6):3339–3344
- 26. Das D, Banerjee N, Zhang MQ (2004) Interacting models of cooperative gene regulation. Proc Natl Acad Sci USA 101(46):16234–16239
- 27. Das D, Nahle Z, Zhang M (2006) Adaptively inferring human transcriptional subnetworks. Mol Syst Biol 2:2006.0029
- 28. Davuluri R, Grosse I, Zhang M (2001) Computational identification of promoters and first exons in the human genome. Nat Genet 29:412–417; Erratum: Nat Genet 32:459
- 29. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38
- 30. Dermitzakis ET, Clark AG (2002) Evolution of transcription factor binding sites in mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol 19(7):1114–1121
- 31. Dorohonceanu B, Nevill-Manning C (2000) Accelerating protein classification using suffix trees. In: Proceedings of the 8th international conference on intelligent systems for molecular biology (ISMB). pp 128–133
- 32. Down T, Hubbard T (2002) Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res 12:458–461
- 33. Durbin R, Eddy SR, Krogh A, Mitchison G (1999) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
- 34. Duda R, Hart P, Stork D (2000) Pattern classification, 2nd edn. Wiley, New York
- 35. Ettwiller L, Paten B, Souren M, Loosli F, Wittbrodt J, Birney E (2005) The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biol 6(12):R104
- 36. Evans PA, Smith AD (2003) Toward optimal motif enumeration. In: Dehne FKHA, Ortiz AL, Sack JR (eds) Workshop on algorithms and data structures. Lecture notes in computer science, vol 2748, pp 47–58
- 37. Felsenstein J, Churchill G (1996) A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol 13(1):93–104
- 38. Fiegler H, et al (2006) Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Res 16:1566–1574
- 39. Gelfand AE, Smith AFM (1990) Sampling-based approaches to calculating marginal densities. J Am Stat Assoc 85:398–409
- 40. Guigó R, et al (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(Suppl 1):S2.1–S2.3
- 41. Gupta M, Liu J (2003) Discovery of conserved sequence patterns using a stochastic dictionary model. J Am Stat Assoc 98(461):55–66
- 42. Halpern A, Bruno W (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15(7):910–917
- 43. IUPAC-IUB Commission on Biochemical Nomenclature (1970) Abbreviations and symbols for nucleic acids, polynucleotides and their constituents: recommendations 1970. J Biol Chem 245(20):5171–5176. URL http://www.jbc.org
- 44. Costas J, Casares F, Vieira J (2003) Turnover of binding sites for transcription factors involved in early Drosophila development. Gene 310:215–220
- 45. Kel A, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis O, Wingender E (2003) MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucl Acids Res 31(13):3576–3579
- 46. Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, Wu Y, Green RD, Ren B (2005) A high-resolution map of active promoters in the human genome. Nature 436:876–880
- 47. Komura D, et al (2006) Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res 16:1575–1584
- 48. Korbel JO, et al (2007) Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci USA 104:10110–10115
- 49. Krogh A (1997) Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol 5:179–186
- 50. Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol 4:134–142
- 51. Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A, Wootton J (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208–214
- 52. Lawrence C, Reilly AA (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct Funct Genet 7:41–51
- 53. Li M, Ma B, Wang L (2002) On the closest string and substring problems. J ACM 49(2):157–171
- 54. Liu XS, Brutlag DL, Liu JS (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol 20(8):835–839
- 55. Liu JS, Lawrence CE, Neuwald A (1995) Bayesian models for multiple local sequence alignment and its Gibbs sampling strategies. J Am Stat Assoc 90:1156–1170
- 56. Majoros W, Pertea M, Salzberg S (2004) TigrScan and GlimmerHMM: two open source ab initio eukaryotic genefinders. Bioinformatics 20:2878–2879
- 57. Marinescu VD, Kohane IS, Riva A (2005) The MAPPER database: a multi-genome catalog of putative transcription factor binding sites. Nucl Acids Res 33(Suppl 1):D91–D97
- 58. Marsan L, Sagot MF (2000) Extracting structured motifs using a suffix tree – algorithms and application to promoter consensus identification. In: Minoru S, Shamir R (eds) Proceedings of the annual international conference on computational molecular biology. ACM Press, New York, pp 210–219
- 59. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E (2003) TRANSFAC®: transcriptional regulation, from patterns to profiles. Nucl Acids Res 31(1):374–378
- 60. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB (2004) MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol 5(12):R98
- 61. Moses AM, Pollard DA, Nix DA, Iyer VN, Li XY, Biggin MD, Eisen MB (2006) Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput Biol 2(10):e130
- 62. Mustonen V, Lassig M (2005) Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA 102(44):15936–15941. doi:10.1073/pnas.0505537102. URL http://www.pnas.org/cgi/content/abstract/102/44/15936
- 63. Nicodeme P, Salvy B, Flajolet P (2002) Motif statistics. Theor Comput Sci 287:593–617
- 64. Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, MacIsaac KD, Rolfe PA, Conboy CM, Gifford DK, Fraenkel E (2007) Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat Genet 39(6):730–732; Published online: 21 May 2007
- 65. Ohler U, Liao G, Niemann H, Rubin G (2002) Computational analysis of core promoters in the Drosophila genome. Genome Biol 3(12):RESEARCH0087
- 66. Pearson H (2006) What is a gene? Nature 441:398–401
- 67. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Fick I, Akiyama J, Val SD, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature 444(7118):499–502
- 68. Pevzner P, Sze S (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In: Bourne P, et al (eds) Proceedings of the annual international symposium on intelligent systems for molecular biology. AAAI Press, Menlo Park, pp 269–278
- 69. Portugal J (1989) Footprinting analysis of sequence-specific DNA-drug interactions. Chem Biol Interact 71(4):311–324
- 70. Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucl Acids Res 33(11):3455–3464
- 71. Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
- 72. Rahmann S, Muller T, Vingron M (2003) On the power of profiles for transcription factor binding site detection. Stat Appl Genet Mol Biol 2(1):7
- 73. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME (2006) Global variation in copy number in the human genome. Nature 444:444–454
- 74. Roth F, Hughes J, Estep P, Church G (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16(10):939–945
- 75. Salamov A, Solovyev V (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res 10:516–522
- 76. Sandelin A, et al (2007) Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet 8:424–436
- 77. Schones D, Smith A, Zhang M (2007) Statistical significance of cis-regulatory modules. BMC Bioinform 8:19
- 78. Sebat J, et al (2004) Large-scale copy number polymorphism in the human genome. Science 305:525–528
- 79. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15(8):1034–1050
- 80. Solovyev VV, et al (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucl Acids Res 22:5156–5163
- 81. Sonnenburg S, Zien A, Ratsch G (2006) ARTS: accurate recognition of transcription starts in human. Bioinformatics 22:e472–e480
- 82. Staden R (1989) Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5(2):89–96
- 83. Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19(Suppl 2):II215–II225
- 84. Sumazin P, Chen G, Hata N, Smith AD, Zhang T, Zhang MQ (2005) DWE: discriminating word enumerator. Bioinformatics 21(1):31–38
- 85. Thomas M, Chiang C (2006) The general transcription machinery and general cofactors. Crit Rev Biochem Mol Biol 41:105–178
- 86. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 22(22):4673–4680
- 87. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23(1):137–144
- 88. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M (2007) PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17(11):1665–1674
- 89. Waterman MS (1995) Introduction to computational biology: maps, sequences and genomes. Chapman and Hall, London
- 90. Waterman MS, Arratia R, Galas DJ (1984) Pattern recognition in several sequences: consensus and alignment. Bull Math Biol 46:515–527
- 91. Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJK, Cooke JE, Elgar G (2005) Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol 3(1):e7
- 92. Zhang M (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc Natl Acad Sci USA 94:565–568
- 93. Zhang M (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698–709
- 94. Zhao X, Xuan Z, Zhang MQ (2006) Boosting with stumps for predicting transcription start sites. Genome Biol 8:R17
- 95. Zhou Q, Liu JS (2004) Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics 20(6):909–916