Computational Reconstruction of Ancestral DNA Sequences
This chapter introduces the problem of ancestral sequence reconstruction: given a set of extant orthologous genomic DNA sequences (or even whole genomes), together with a phylogenetic tree relating these sequences, predict the DNA sequence of all ancestral species in the tree. Blanchette et al. (1) have shown that for certain sets of species (in particular, for eutherian mammals), very accurate reconstruction can be obtained. We explain the main steps involved in this process, including multiple sequence alignment, insertion and deletion inference, substitution inference, and gene arrangement inference. We also describe a simulation-based procedure to assess the accuracy of the reconstructed sequences. The whole reconstruction process is illustrated using a set of mammalian sequences from the CFTR region.
Key Words: Ancestral DNA sequence reconstruction; multiple sequence alignment; mammalian phylogeny; mammalian evolution; substitutions and indels reconstruction; ancestral sequence reconstruction accuracy
Following the completion of the human genome sequence, there is now considerable interest in obtaining a more comprehensive understanding of its evolution (2, 3, 4). Patterns of evolutionary conservation are used to screen human DNA mutations to predict those that will be deleterious to protein function and to identify noncoding sequences that are under negative selection, and hence may perform regulatory or structural functions (5, 6, 7). Long periods of conservation followed by sudden change may provide clues to the evolution of new human traits (8,9). All of these efforts depend, directly or indirectly, on reconstructing the evolutionary history of the bases in the human genome, and hence on reconstructing the genomes of our distant ancestors.
Although some information about ancestral species has been irrevocably lost during evolution, there is still the possibility that large regions of the genomes of ancestral species with many modern descendants can be approximately inferred from the genomes of modern species using a model of molecular evolution. Indeed, it has recently been reported that in the specific case of mammalian evolution, ancestral genome reconstruction was possible to a surprising degree of accuracy (1).
The ideal target species for a genomic reconstruction is one that has generated a large number of independent, successful descendant lineages through a rapid series of early speciation events. In this case, the problem can be viewed as attempting to reconstruct an original from many independent noisy copies. In the limit of an instantaneous radiation, the accuracy of the reconstruction approaches 100% exponentially fast as the number of copies increases. From the Cretaceous period, a good choice for reconstruction would be the genome of the eutherian ancestor, as this species is believed to have spawned the relatively rapid radiation of the different lineages of modern placental mammals (10,11). This ancient species also has the added advantage of being a human ancestor, so its reconstruction, however speculative, may shed additional light on our own evolution, perhaps helping to explain features of the human and other modern mammalian genomes.
In this chapter, we describe the set of computational approaches and tools that exist for reconstructing ancestral sequences and for estimating the accuracy of such a reconstruction. This area being relatively new, there is no single tool that performs all the steps involved in the reconstruction. Instead, tools developed by different authors need to be used sequentially. The methods are illustrated on a 1.8-Mb region of mammalian genomes, containing the CFTR gene, sequenced by the ENCODE project (12).
2.1 Sequence Data
To reconstruct the ancestral sequences, orthologous DNA regions from as many descendants as possible need to be compared. The more orthologous sequences are available, the more accurate the reconstruction will be, provided accurate evolutionary models are used. For vertebrate sequences, a good repository of complete genome sequences is the UCSC Genome Browser (http://genome.ucsc.edu). Besides raw DNA sequences, multiple genome alignments and various types of genome annotation are accessible from the same site.
For the purpose of this chapter, we illustrate the process of ancestral sequence reconstruction using a 1.8-Mb region of the human genome including the CFTR gene, together with orthologous regions from 19 other mammals (available from the UCSC Genome Browser). This deep coverage is not currently available over all the genome, but only for the targeted sequencing of the ENCODE project (12).
2.2 Phylogenetic Information
An important component of ancestral sequence reconstruction is the knowledge of the phylogenetic relationships among the species being compared. Knowing the correct tree topology and estimating the length of its branches are crucial for an accurate reconstruction, as well as for estimating the accuracy of that reconstruction through simulations. Accepted phylogenetic trees are now available for many sets of species (see, e.g., refs. 10, 14). For others, the exact phylogenetic relationships remain unclear and need to be inferred prior to reconstruction, using programs like Phylip (15), PAUP (16), or MrBayes (17). These tools are also necessary to estimate the branch lengths of the phylogenetic tree using a maximum likelihood approach.
2.3 Sequence Annotation
In some cases, functional annotation of extant sequences can be used to obtain more accurate reconstruction of ancestral sequences. This is particularly the case for coding region annotation and repetitive region annotation. For metazoans, good sources of such annotations are the UCSC Genome Browser and the Ensembl Genome Browser (http://www.ensembl.org).
This section introduces the techniques that have been developed for predicting ancestral DNA sequences based on their extant descendants, and for estimating the accuracy of the reconstruction. We illustrate this reconstruction process (see also Note 1) and the type of information that can be derived from it using a 1.8-Mb region surrounding the CFTR gene in mammals (see ref. 1 and Note 2 for more details).
3.1 Predicting Ancestral Sequences
The prediction of ancestral genomes can be divided into four main steps. A crucial first step toward the reconstruction is to build an accurate multiple alignment of the extant orthologous sequences, thus establishing orthology relationships among the nucleotides of each sequence. Second, the process of indel reconstruction determines the most likely scenario of insertions and deletions that may have led to the extant sequences. Third, substitution history is reconstructed using a maximum likelihood approach. The last step involves dealing with genome rearrangements (inversions, transpositions, translocations, duplications, and chromosome fusions and fissions).
3.1.1 Multiple Sequence Alignment
Given a set of orthologous sequences, the multiple alignment problem consists of identifying (by aligning them together) the sets of nucleotides derived from a common ancestor through direct inheritance or through substitution. Many approaches have been developed to align multiple large genomic regions. Some of the most popular approaches include programs like MAVID (18), MLAGAN (5,19), and TBA (20). All these approaches fall under the category of progressive alignment methods and require prior knowledge of the topology of the phylogenetic tree that relates the extant sequences being compared (see Subheading 2.2.). The threaded blocks aligner (TBA) program, based on the well-established pairwise alignment program BLASTZ (21), has been shown to be particularly accurate for aligning mammalian sequences and is thus a tool of choice for ancestral reconstruction for these species. The program is available at http://www.bx.psu.edu/miller_lab/. The multiple sequence alignment problem is discussed in more detail in Chapter 9.
3.1.2 Indel Reconstruction
Given a multiple sequence alignment of the repeat-soft-masked extant sequences and a phylogenetic tree with known topology and branch lengths, the next step consists of predicting, for each ancestral node in the tree, which columns of the alignment correspond to ancestral bases and which correspond to nucleotides inserted after the ancestor. Although the problem of parsimonious indel inference has recently been shown to be NP-Hard (22), good heuristics have been developed by Fredslund et al. (23), Blanchette et al. (1), and Chindelevitch et al. (22). Currently, the only publicly available program for indel reconstruction is the inferAncestors program based on the greedy approach of Blanchette et al. (1). This section describes briefly how the program works.
Given a multiple alignment, all the gaps in the alignment are first marked as unexplained. The algorithm iteratively selects the insertion or deletion, performed along a specific edge of the tree and spanning one or more columns of the alignment, which yields the largest number of alignment gaps explained per unit of cost. The number of gaps explained by a deletion is the number of unexplained gaps in the subtree above which the deletion occurs. The number of gaps explained by an insertion is the number of unexplained gaps in the complement of the subtree above which the insertion occurs. The costs can be defined heuristically. The cost of a deletion is given by 1 + 0.01 log(L) − 0.01b, where L is the length of the deletion and b is the length of the branch along which the event takes place. The cost of an insertion is given by 1 + 0.01 log(L) − 0.01b − r, where r is a term that takes value 0.5 if the repetitive content of the segment inserted is more than 90%. Once the best insertion or deletion has been identified, its gaps are marked as “explained.” This does not preclude them from being part of other indels, but they will not count in their evaluation. Finally, heuristics are used to reduce errors related to incorrect alignment, in particular to reduce the problems caused by two repetitive regions from two distantly related species mistakenly aligned to each other, with other species having gaps in that region.
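The scoring used by the greedy selection can be sketched in a few lines of Python. This is only an illustration of the cost formulas quoted above, not the inferAncestors implementation: the base of the logarithm is not stated in the text (the natural logarithm is assumed here), and the candidate-event dictionaries with the keys `kind`, `length`, `branch_length`, `repeat_fraction`, and `gaps_explained` are hypothetical.

```python
import math

def deletion_cost(length, branch_length):
    """Heuristic deletion cost: 1 + 0.01*log(L) - 0.01*b (natural log assumed)."""
    return 1.0 + 0.01 * math.log(length) - 0.01 * branch_length

def insertion_cost(length, branch_length, repeat_fraction):
    """Heuristic insertion cost; r = 0.5 when the segment is > 90% repetitive."""
    r = 0.5 if repeat_fraction > 0.9 else 0.0
    return 1.0 + 0.01 * math.log(length) - 0.01 * branch_length - r

def event_score(event):
    """Unexplained gaps explained per unit of cost (hypothetical event dict)."""
    if event["kind"] == "deletion":
        cost = deletion_cost(event["length"], event["branch_length"])
    else:
        cost = insertion_cost(event["length"], event["branch_length"],
                              event["repeat_fraction"])
    return event["gaps_explained"] / cost

def pick_best_event(candidates):
    """One greedy step: return the candidate with the highest score."""
    return max(candidates, key=event_score)

# Two made-up candidate events competing for the same greedy step.
candidates = [
    {"kind": "deletion", "length": 10, "branch_length": 0.1,
     "repeat_fraction": 0.0, "gaps_explained": 3},
    {"kind": "insertion", "length": 10, "branch_length": 0.1,
     "repeat_fraction": 0.95, "gaps_explained": 2},
]
best = pick_best_event(candidates)
print(best["kind"], round(event_score(best), 3))
```

In this toy case the repetitive insertion wins even though it explains fewer gaps, because the 0.5 discount for highly repetitive segments roughly halves its cost. A full implementation would then mark the chosen gaps as explained, recompute the counts of the remaining candidates, and repeat.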
3.1.3 Substitutions Reconstruction
After having established which positions of the multiple alignment correspond to bases in the ancestor, the inferAncestors program predicts which nucleotide (A, C, G, or T) was present at each position in the ancestor using the standard posterior probability approach (24). The approach is based on a dinucleotide substitution model in which substitutions at two adjacent positions are independent except for CpG, whose substitution rate to TpG is 10 times higher than those of other transitions (25). This phase of the reconstruction relies on the availability of accurate branch-length estimates for the phylogenetic tree, which can be obtained as described under Subheading 2.2.
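To make the posterior probability approach concrete, the following Python sketch computes the posterior distribution of the base at the root of a small tree for a single alignment column. It deliberately swaps in the much simpler Jukes-Cantor single-nucleotide model in place of the context-dependent dinucleotide model described above, assumes a uniform prior on the root base, and uses a hypothetical nested-tuple tree encoding `(left, left_branch_length, right, right_branch_length)` with leaves given as observed bases.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor probability that base i becomes base j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node):
    """Pruning recursion: likelihood of the leaves below `node` for each base at `node`."""
    if isinstance(node, str):  # leaf: the observed base has likelihood 1
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    pl = partial_likelihoods(left)
    pr = partial_likelihoods(right)
    out = {}
    for x in BASES:
        like_left = sum(jc_prob(x, y, t_left) * pl[y] for y in BASES)
        like_right = sum(jc_prob(x, y, t_right) * pr[y] for y in BASES)
        out[x] = like_left * like_right
    return out

def root_posterior(tree):
    """Posterior probability of each base at the root, with a uniform prior."""
    like = partial_likelihoods(tree)
    total = sum(0.25 * v for v in like.values())
    return {b: 0.25 * like[b] / total for b in BASES}

# One alignment column observed in four leaves: A, A, A, G.
tree = (("A", 0.1, "A", 0.1), 0.05, ("A", 0.1, "G", 0.1), 0.05)
posterior = root_posterior(tree)
best_base = max(posterior, key=posterior.get)
print(best_base, round(posterior[best_base], 4))
```

The predicted base is the one with the highest posterior, and the posterior itself plays the role of a per-base confidence value. Handling internal nodes other than the root, gaps, and the CpG context dependence would all require extensions not shown here.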
3.1.4 The inferAncestors Program
The inferAncestors program, available from http://www.mcb.mcgill.ca/~blanchem/software, integrates the steps of indel and substitution inference. The algorithm takes as input a multiple alignment in FASTA format, together with a phylogenetic tree in New Hampshire format. The program outputs a predicted ancestral sequence for each internal node of the phylogenetic tree. Two additional output files describe the confidence of the prediction made for each base of each ancestral sequence. The first describes the confidence in the prediction of presence or absence of a base at each position of each ancestral sequence. The second describes the confidence of the actual nucleotide (A, C, G, or T) predicted. The inferAncestors program is written in C++ and has been tested on Linux and Mac OS X.
3.1.5 Genome Rearrangements
To complete the inference of ancestral genomes, the ancestral DNA sequences inferred for each block of orthologous sequences need to be ordered into a single, contiguous genome. This problem is made challenging by the presence of genome rearrangements (inversions, transpositions, translocations, and duplications/losses). One of the most popular computer programs for inferring ancestral gene arrangement is MGR (http://www.cse.ucsd.edu/groups/bioinformatics/MGR), which is described in detail in Chapter 10.
3.2 Assessing Reconstruction Accuracy Through Simulations
This section describes a simulation-based method for assessing the accuracy of the reconstructed ancestor. An alternate approach based on retrotransposons is described in (1).
To assess the reconstructability of ancestral genomic sequences from their extant descendants, the simplest method is to use simulations of sequence evolution. Starting from a known (but synthetic) ancestral sequence, we let the sequence evolve along the branches of the tree until the leaves are reached. The ancestral sequence reconstruction procedure is then applied to the set of simulated leaves, and the prediction made is compared to the known ancestral sequence.
The simulation program Simali (http://www.bx.psu.edu/miller_lab/), based on the Rose program (27), can be used to mimic the evolution of sequences under no selective pressure. Given a phylogenetic tree, the program simulates sequence evolution by performing random substitutions, deletions, and insertions along each branch, in proportion to its length. The program allows for the insertion of retrotransposons, which is an important source of error in sequence alignment, and thus in ancestral sequence reconstruction.
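The idea of evolving a synthetic ancestor down a tree can be illustrated with a short Python sketch. This is not Simali or Rose: it uses single-base insertions and deletions only, uniform substitutions, no retrotransposons, and event probabilities that are simply proportional to branch length (a first-order approximation valid for short branches). The rate parameters `sub_rate` and `indel_rate` and the tree encoding (a leaf is a species name, an internal node is a list of `(child, branch_length)` pairs) are assumptions made for the example.

```python
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng, sub_rate=1.0, indel_rate=0.1):
    """Evolve `seq` along one branch with per-site event probabilities
    proportional to the branch length."""
    p_del = p_ins = 0.5 * indel_rate * branch_length
    p_sub = sub_rate * branch_length
    out = []
    for base in seq:
        u = rng.random()
        if u < p_del:                      # deletion: drop this base
            continue
        if rng.random() < p_sub:           # substitution to a different base
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
        if p_del <= u < p_del + p_ins:     # insertion of one random base
            out.append(rng.choice(BASES))
    return "".join(out)

def simulate(tree, seq, rng):
    """Return {leaf name: simulated sequence} for all leaves below `tree`."""
    if isinstance(tree, str):
        return {tree: seq}
    leaves = {}
    for child, branch_length in tree:
        leaves.update(simulate(child, evolve(seq, branch_length, rng), rng))
    return leaves

rng = random.Random(0)  # fixed seed so that runs are repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(200))
toy_tree = [("human", 0.05), ([("mouse", 0.1), ("rat", 0.1)], 0.2)]
leaves = simulate(toy_tree, ancestor, rng)
for name, s in sorted(leaves.items()):
    print(name, len(s))
```

Because the ancestor is known, any reconstruction obtained from the simulated leaves can be scored against it directly.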
After generating a set of simulated sequences, the sequences are first soft-repeat-masked using RepeatMasker (31) and then aligned using one of the methods under Subheading 3.1.1. The repeat-masked multiple alignment is then fed into the inferAncestors program, which produces a prediction of the ancestral sequence at each internal node of the phylogenetic tree. To compare the actual ancestral sequence generated by simulations to the predicted ancestral sequence, we align them and count the number of missing bases (those present in the actual ancestor but not in the reconstruction), added bases (present in the reconstruction but not in the actual ancestor), and mismatch errors (positions in the reconstruction assigned the incorrect nucleotide). The sum of the rates of all three types of errors, calculated separately at each ancestral node in the phylogenetic tree, is used to estimate the reconstructability of a given ancestor.
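The three error counts can be computed from a pairwise alignment of the actual and predicted ancestors, as in the Python sketch below. The function name and the choice of denominator (the number of bases in the actual ancestor) are assumptions; the text does not specify how the rates are normalized. Comparison is case-insensitive so that soft-masked (lowercase) bases are treated like any other.

```python
def reconstruction_errors(actual_aln, predicted_aln):
    """Count missing, added, and mismatched bases between two gapped,
    equal-length strings ('-' denotes a gap) and return them as rates."""
    if len(actual_aln) != len(predicted_aln):
        raise ValueError("aligned sequences must have the same length")
    missing = added = mismatch = 0
    for a, p in zip(actual_aln, predicted_aln):
        if a == "-" and p == "-":
            continue                      # column empty in both: ignore
        if p == "-":
            missing += 1                  # in actual ancestor, not in prediction
        elif a == "-":
            added += 1                    # in prediction, not in actual ancestor
        elif a.upper() != p.upper():
            mismatch += 1                 # both present, wrong nucleotide
    n = sum(1 for a in actual_aln if a != "-")  # assumed denominator
    return {
        "missing": missing / n,
        "added": added / n,
        "mismatch": mismatch / n,
        "total": (missing + added + mismatch) / n,
    }

errors = reconstruction_errors("ACGT-ACGT", "AC-TTACGA")
print(errors)
# {'missing': 0.125, 'added': 0.125, 'mismatch': 0.125, 'total': 0.375}
```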
An alternate approach to assessing the accuracy of a reconstruction is through a pseudo cross-validation procedure. Instead of reconstructing an ancestral sequence based on all the extant sequences available, do so using a (large) subset of these species. Different subsets of species will produce slightly different ancestral reconstructions, and the variability between these reconstructions will give an idea of the expected error rate of the reconstruction that is based on all species.
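A minimal Python sketch of this pseudo cross-validation is given below. It assumes that every reconstruction is expressed in the coordinates of the same multiple alignment, so that reconstructions can be compared column by column, and it takes the reconstruction method as a caller-supplied function. The `majority_consensus` function is only a toy stand-in used to make the example runnable; it is not a real ancestral reconstruction method.

```python
from collections import Counter
from itertools import combinations

def majority_consensus(seqs):
    """Toy stand-in for a reconstruction method: column-wise majority base."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def disagreement(a, b):
    """Fraction of alignment columns at which two reconstructions differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cross_validation_spread(aligned, reconstruct, n_drop=1):
    """Reconstruct from every subset that leaves out `n_drop` species and
    return the mean pairwise disagreement between the reconstructions.
    `aligned` maps species name -> aligned sequence."""
    names = sorted(aligned)
    reconstructions = []
    for dropped in combinations(names, n_drop):
        kept = [aligned[s] for s in names if s not in dropped]
        reconstructions.append(reconstruct(kept))
    pairs = list(combinations(reconstructions, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)

aligned = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTACGT",
    "sp3": "ACGAACGT",
    "sp4": "ACGTACGT",
    "sp5": "ACCTACGT",
}
spread = cross_validation_spread(aligned, majority_consensus)
print(spread)
```

A spread near zero indicates that the reconstruction is insensitive to the removal of any one species; larger values point to positions whose prediction rests on only a few sequences.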
3.3 Reconstruction of Actual Mammalian Sequences
Blanchette et al. (1) applied the reconstruction method described above to actual high-quality sequence data from a region containing the human CFTR locus, using 18 additional orthologous mammalian genomic regions generated by the NISC Comparative Sequencing Program (http://www.nisc.nih.gov). Simulations on synthetic data like those described above indicate that for the topology and set of branch lengths for these 19 species, the ancestral sequence that can be the most accurately reconstructed based on the sequences available is the Boreoeutherian ancestor, and that neutrally evolving regions of this ancestral genome can be reconstructed with an accuracy of about 96%. On a site-specific basis, simulations suggest that more than 90% of the bases of the predicted ancestor can be assigned confidence values greater than 99%. The reconstructed ancestor and site-specific confidence estimates are available at http://genome.ucsc.edu/ancestors.
Notice that despite the fact that the alignment of certain species (in particular, mouse, rat, and hedgehog) appears somewhat unreliable, the inference of the presence or absence of a Boreoeutherian ancestral base at a given position is quite straightforward given the alignment, and so is, to a lesser extent, the prediction of the actual ancestral base itself. The MER20 consensus is shown for comparison. Most positions in which the reconstructed Boreoeutherian ancestral base disagrees with the MER20 consensus are likely owing to substitutions in this MER20 relic that predated the Boreoeutherian ancestor, since the support of the reconstructed base is very strong in the extant species. If the MER20 consensus sequence is used as an outgroup in the reconstruction procedure, only two bases (indicated by a longer arrow) are reconstructed differently, indicating that the reconstructed ancestral sequence is very stable and most of it is likely to be correct.
The accuracy of the reconstruction depends crucially on the length of the early branches of the phylogenetic tree. In the context of the ancestral mammalian sequence reconstruction, Blanchette et al. (1) have shown that if the major placental lineages had diverged instantaneously, they would be able to reconstruct the simulated Boreoeutherian ancestral sequence, including repetitive regions, with less than 1% error. In contrast, if the early branch lengths inferred by Eizirik et al. (10) turned out to underestimate the actual lengths by a factor of two, the error rate would jump to 3%, and to 6% if they were underestimated by a factor of 4.
- 2. One of the nonintuitive results presented by Blanchette et al. (1) is the observation that more ancient ancestral genomes can often be reconstructed more accurately than their more recent descendants. Why exactly is this so? For simplicity, consider the case of reconstructing a single binary ancestral character state in the root species (e.g., purine vs pyrimidine at a given site) under a simple model in which the prior probability distribution on the ancestral character is uniform, substitution rates are known, symmetric, homogeneous, and not too high, and the total branch length in the phylogenetic tree from the root ancestor to each of the modern species is the same (i.e., assume a molecular clock). Here each of n modern species has a state that differs from the ancestral one with the same probability p < 1/2. If the tree exhibits a star topology, in which each of the modern species derives directly from the ancestor on an independent branch, then it is clear that the maximum likelihood and Bayesian maximum a posteriori reconstructions of the ancestral character agree, and the reconstructed state is the one that is most often observed in the n modern species. The probability of an error in reconstruction is then at most the probability that at least half of the n modern species differ from the ancestor, $\sum_{i \ge n/2} \binom{n}{i} p^{i} (1-p)^{n-i}$ (see refs. 32, 33; Lemma 5, p. 479). This error approaches zero exponentially fast as n increases. The star topology has a kind of “phase transition” where the ancestor becomes highly reconstructible once enough present day sequences are available to compensate for the length of the branches leading back to the ancestor.
In contrast, a nonstar topology such as a binary tree that has the same total root-to-leaf branch length and the same number n of modern species at the leaves has two nonzero length branches from the root ancestor R leading to intermediate ancestors A and B, and information is irrevocably lost along these two branches. No matter how large the number n of modern descendant species derived from A and B, one can do no better at reconstructing the state at R than if one knew for certain the state in its immediate descendants A and B. Even with this knowledge, the accuracy of reconstruction of R from A and B will be strictly less than 100% for all reasonable models and nonzero branch lengths. The reconstruction gets poorer the longer the branch lengths are to A and B. This extends to the case where the ancestor R being reconstructed has a bounded number of independent immediate descendants and to the case where descendants of an earlier ancestor of R (outgroups) are also available. The long branches connecting them to the rest of the tree are why some more recent ancestral sequences in the tree of Fig. 1 are less reconstructible than the Boreoeutherian ancestor, which acts almost like the root of a star topology (see ref. 34 for a discussion of optimal tree topologies for ancestral reconstructability).
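The contrast between the two topologies can be checked numerically with the Python sketch below, which follows the binary-character model of this note. For the star topology it computes the exact error of majority voting over n independent leaves (ties, possible only for even n, are assumed to be broken by a fair coin) and compares it with the standard Chernoff bound (4p(1 − p))^(n/2), shown here only as one exponentially decaying upper bound and not necessarily the one given in the cited references. For the binary tree it computes the error of the best guess at R when the states of A and B are known exactly, with q (a symbol introduced here) denoting the probability of a state change along each of the two root branches.

```python
from math import comb

def majority_error(n, p):
    """Star topology: probability that majority voting over n independent
    leaves, each wrong with probability p, returns the wrong state."""
    err = 0.0
    for k in range(n + 1):                 # k = number of leaves that differ
        prob = comb(n, k) * p**k * (1 - p)**(n - k)
        if 2 * k > n:
            err += prob                    # strict majority is wrong
        elif 2 * k == n:
            err += 0.5 * prob              # tie: fair coin (assumption)
    return err

def chernoff_bound(n, p):
    """Upper bound on P(at least n/2 leaves differ), for p < 1/2."""
    return (4 * p * (1 - p)) ** (n / 2)

def two_child_error(q):
    """Binary tree: error of the best guess at R given the exact states of
    A and B, each differing from R with probability q < 1/2."""
    err = 0.0
    for flip_a in (0, 1):
        for flip_b in (0, 1):
            prob = (q if flip_a else 1 - q) * (q if flip_b else 1 - q)
            if flip_a and flip_b:
                err += prob                # A and B agree on the wrong state
            elif flip_a != flip_b:
                err += 0.5 * prob          # A and B disagree: coin flip
    return err

for n in (1, 5, 25):
    print(n, majority_error(n, 0.2), chernoff_bound(n, 0.2))
print(two_child_error(0.05))
```

Under these assumptions the star-topology error keeps shrinking as leaves are added, whereas the two-child error works out to q itself (q² + q(1 − q) = q): knowing both A and B exactly is no better than knowing one of them, and no number of additional descendants below A and B can push the error beneath that floor.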
- 14. Maddison, D. R. and Schulz, K.-S. (eds.) (2004) The Tree of Life Web Project. http://tolweb.org
- 15. Felsenstein, J. (1989) PHYLIP-Phylogeny inference package (Version 3.2). Cladistics 5, 164–166.
- 16. Swofford, D. L. (2003) PAUP: Phylogenetic Analysis Using Parsimony. Sinauer, Sunderland, MA.
- 22. Chindelevitch, L., Li, Z., Blais, E., and Blanchette, M. (2006) On the inference of parsimonious indel evolutionary scenarios. J. Bioinformatics Comput. Biol. in press.
- 23. Fredslund, J., Hein, J., and Scharling, T. (2003) A large version of the small parsimony problem. Lecture Notes in Bioinformatics, Proceedings of WABI’03. 2812, 417–432.
- 25. Siepel, A. and Haussler, D. (2003) Combining phylogenetic and hidden Markov models in biosequence analysis. Proceedings of the 7th Annual International Conference on Research in Computational Molecular Biology. pp. 277–286.
- 31. Smit, A. and Green, P. (1999) RepeatMasker, http://ftp.genome.washington.edu/RM/RepeatMasker.html
- 33. Le Cam, L. (1986) Asymptotic Methods in Statistical Decision Theory, Springer, New York.