Abstract
Evolution operates on whole genomes through direct rearrangements of genes, such as inversions, transpositions, and inverted transpositions, as well as through operations, such as duplications, losses, and transfers, that also affect the gene content of the genomes. Because these events are rare relative to nucleotide substitutions, gene order data offer the possibility of resolving ancient branches in the tree of life; the combination of gene order data with sequence data also has the potential to provide more robust phylogenetic reconstructions, since each can elucidate evolution at different time scales. Distance corrections greatly improve the accuracy of phylogeny reconstructions from DNA sequences, enabling distance-based methods to approach the accuracy of the more elaborate methods based on parsimony or likelihood at a fraction of the computational cost. This paper focuses on developing distance correction methods for phylogeny reconstruction from whole genomes. The main question we investigate is how to estimate evolutionary histories from whole genomes with equal gene content, and we present a technique, the empirically derived estimator (EDE), that we have developed for this purpose. We study the use of EDE on whole genomes with identical gene content, and we explore the accuracy of phylogenies inferred using EDE with the neighbor joining and minimum evolution methods under a wide range of model conditions. Our study shows that tree reconstruction under these two methods is much more accurate when based on EDE distances than when based on other distances previously suggested for whole genomes.
References
Bader D, Moret B, Yan M (2001) A linear time algorithm for computing inversion distances between signed permutations with an experimental study. J Comput Biol 8(5):483–491
Belda E, Moya A, Silva F (2005) Genome rearrangement distances and gene order phylogeny in γ-proteobacteria. Mol Biol Evol 22:1456–1467
Blanchette M, Bourque G, Sankoff D (1997) Breakpoint phylogenies. In: Miyano S, Takagi T (eds) Genome informatics. University Academy Press, Tokyo, pp 25–34
Blanchette M, Kunisawa M, Sankoff D (1999) Gene order breakpoint evidence in animal mitochondrial phylogeny. J Mol Evol 49:193–203
Boore J (1999) Animal mitochondrial genomes. Nucleic Acids Res 27:1767–1780
Boore J, Brown W (1998) Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Curr Opin Genet Dev 8(6):668–674
Boore JL, Collins TM, Stanton D, Daehler LL, Brown WM (1995) Deducing arthropod phylogeny from mitochondrial DNA rearrangements. Nature 376:163–165
Bourque G, Pevzner P (2002) Genome scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12(1):26–36
Cosner M, Jansen R, Moret B, Raubeson L, Wang LS, Warnow T, Wyman S (2000) A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In: Proceedings, 8th International Conference on Intelligent Systems for Molecular Biology (ISMB’00). AAAI Press, Menlo Park, California, pp 104–115
Desper R, Gascuel O (2002) Fast and accurate phylogeny reconstruction algorithms based on the minimum evolution principle. J Comput Biol 9(5):687–705
Downie S, Palmer J (1992) Use of chloroplast DNA rearrangements in reconstructing plant phylogeny. In: Soltis P, Soltis D, Doyle J (eds) Molecular systematics of plants, Vol 49. Chapman & Hall, London, pp 14–35
El-Mabrouk N (2001) Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithms 1(1):105–122
El-Mabrouk N (2002) Reconstructing an ancestral genome using minimum segments duplications and reversals. J Comput Syst Sci 65:442–464
El-Mabrouk N, Sankoff D (2000) Duplication, rearrangement and reconciliation. In: Comparative genomics: empirical and analytical approaches to gene order dynamics, map alignment and the evolution of gene families, Vol 1. Kluwer Academic, New York, pp 537–550
Hannenhalli S, Pevzner P (1995) Transforming cabbage into turnip (polynomial algorithm for genomic distance problems). In: Proceedings, 27th Annual ACM Symposium on the Theory of Computing (STOC’95). ACM Press, New York, pp 178–189
Heard SB (1992) Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution 46:1818–1826
Kim J (1998) Large scale phylogenies and measuring the performance of phylogenetic estimators. Syst Biol 47(1):43–60
Larget B, Simon D, Kadane J (2002) Bayesian phylogenetic inference from animal mitochondrial genome arrangements (with discussion). J Roy Stat Soc Ser B 64:681–693
Larget B, Simon D, Kadane J, Sweet D (2004) A Bayesian analysis of metazoan mitochondrial genome arrangements. Mol Biol Evol 22(3):486–495
Marron M, Swenson K, Moret B (2004) Genomic distances under deletions and insertions. Theor Comput Sci 325(3):347–360 (Special issue: papers from COCOON’03)
Moret B, Warnow T (2005) Advances in phylogeny reconstruction from gene order and content data. In: Zimmer E, Roalson E (eds) Molecular evolution, producing the biochemical data, Part B, 395. Elsevier, Amsterdam, pp 673–700
Moret B, Wang LS, Warnow T, Wyman S (2001) New approaches for reconstructing phylogenies based on gene order. Bioinformatics 17(Suppl):165–173
Moret B, Tang J, Wang LS, Warnow T (2002) Steps toward accurate reconstructions of phylogenies from gene order data. J Comput Syst Sci 65:508–525
Moret B, Tang J, Warnow T (2005) Reconstructing phylogenies from gene content and gene-order data. In: Gascuel O (ed) Mathematics of evolution and phylogeny. Oxford University Press, New York, pp 321–352
Nakhleh L, Moret B, Roshan U, John KS, Sun J, Warnow T (2002a) The accuracy of fast phylogenetic methods for large datasets. In: Proceedings, 7th Pacific Symposium on Biocomputing (PSB’02), pp 211–222
Nakhleh L, Roshan U, Vawter L, Warnow T (2002b) Estimating the deviation from a molecular clock. In: Lecture Notes in Computer Science: Proceedings of the 2nd Workshop for Algorithms and Bioinformatics (WABI’02), Vol 2452. Springer Verlag, New York, pp 287–299
Pinter R, Skiena S (2002) Genomic sorting with length weighted reversals. Genome Inform 13:103–111
Raubeson L, Jansen R (1992) Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science 255:1697–1699
Raubeson L, Jansen R (2005) Chloroplast genomes of plants. In: Henry R (ed) Diversity and evolution of plants genotypic and phenotypic variation in higher plants. CABI, London, pp 45–68
Rokas A, Holland PWH (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol Evol 15:454–459
Rzhetsky A, Nei M (1992) A simple method for estimating and testing minimum-evolution trees. Mol Biol Evol 35:367–375
Rzhetsky A, Sitnikova T (1996) When is it safe to use an oversimplified substitution model in treemaking? Mol Biol Evol 13(9):1255–1265
Saitou N, Imanishi T (1989) Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol Biol Evol 6(5):514–525
Saitou N, Nei M (1987) The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
Sanderson MJ (2003) Analysis of rates (r8s) of evolution, v1.6; available at: http://ginger.ucdavis.edu/r8s/
Sankoff D, Blanchette M (1998) Multiple genome rearrangement and breakpoint phylogeny. J Comput Biol 5:555–570
Sourdis J, Krimbas C (1987) Accuracy of phylogenetic trees estimated from DNA sequence data. Mol Biol Evol 4:159–166
Swenson K, Marron M, Earnest-DeYoung J, Moret B (2005) Approximating the true evolutionary distance between two genomes. In: Proceedings, 7th Workshop on Algorithm Engineering and Experiments (ALENEX’05). SIAM Press, Philadelphia, Pennsylvania, pp 37–46
Swofford D (2001) PAUP* 4.0. Sinauer Associates, Sunderland, MA
Swofford D, Olson G, Waddell P, Hillis D (1996) Phylogenetic inference. In: Hillis D, Moritz C, Mable B (eds) Molecular systematics, 2nd ed. Sinauer Associates, Sunderland, MA, chap 11
Tang J, Moret B (2003) Phylogenetic reconstruction from gene rearrangement data with unequal gene contents. In: Lecture Notes in Computer Science: Proceedings, 8th Workshop on Algorithms and Data Structures (WADS’03), Vol 3069, pp 37–46
Tannier E, Sagot M (2004) Sorting by reversals in subquadratic time. In: Lecture Notes in Computer Science: Proceedings, 15th Symposium on Combinatorial Pattern Matching (CPM’04), Vol 3109. Springer Verlag, New York, pp 1–13
Tesler G, Pevzner P (2003) Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci USA 100(13):7672–7677
Tesler G (2002) Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci 65:587–609
Wang LS, Jansen R, Moret B, Raubeson L, Warnow T (2002) Fast phylogenetic methods for the analysis of genome rearrangement data: an empirical study. In: Proceedings of the Fifth Pacific Symposium of Biocomputing (PSB’02), pp 524–535
Zwickl DJ, Hillis DM (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51(4):588–598
Acknowledgments
We thank the two anonymous reviewers for their comments, and one of them in particular for suggesting the Rzhetsky-Nei interior branch length test. This research was supported by National Science Foundation Grants EIA0121680, EF0331453, DEB0120709, DEB0075700, IIS0113654, EF0331654, IIS0121377, IIS0113095, and ANI020203584. Bernard Moret would like to acknowledge support from the IBM Corporation under a DARPA grant for the HPCS initiative and from the NIH under Grant 2R01GM056120-05A1 through a subcontract to the University of Arizona. Li-San Wang was supported in part by an NIH Training Grant in Bioinformatics. Tandy Warnow would like to acknowledge the support of the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program in Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at the University of Texas at Austin.
Additional information
[Reviewing Editor: Dr. Martin Kreitman]
Appendix
EDE: The Empirically Derived Estimator
EDE produces the best results under all model conditions, even when the evolutionary model is exclusively transpositions. For details about the mathematical derivation of this technique, see Moret et al. (2001).
EDE is based on inverting a function for the expected minimum inversion distance produced by a sequence of random inversions. Deriving the expected inversion distance produced by k random inversions analytically proved to be quite difficult, so we studied it by simulation. Our initial studies showed little difference in behavior between 120 genes (typical of chloroplasts) and 37 genes (typical of mitochondria) and, in particular, suggested that it should be possible to express the normalized expected inversion distance as a function of the normalized number of random inversions. Therefore, we sought a simple function f(k/n) that approximates E[d_INV(G_0, G_k)/n] well, where k is the number of random inversions, n is the number of genes, G_0 is the initial genome, and G_k is the result of applying k random inversions to G_0.
The function f should have the following properties.
1. 0 ≤ f(x) ≤ x, since the inversion distance is always less than or equal to the actual number of inversions.
2. lim_{x→∞} f(x) ≈ 1, as our simulations show that the normalized expected inversion distance is close to 1 when a large number of random inversions is applied.
3. f′(0) = 1, since a single random inversion always produces a genome that is inversion distance 1 away.
4. f⁻¹(y) is defined for all y ∈ [0, 1].
We used n f(x) to estimate E[d_INV(G_nx, G_0)], the expected inversion distance after nx inversions are applied. The nonlinear formula

$$f(x) = \frac{a x^2 + b x}{x^2 + c x + b}$$

satisfies constraints (2) and (4).

We tried several different values for the constant a, and observed in our experiments that setting a = 1 produced the best results in subsequent phylogeny reconstructions using neighbor joining, for all values of n (the number of genes). The estimation of the constants b and c then amounts to a least-squares nonlinear regression; using simulated data we obtained b = 0.5956 and c = 0.4577. However, with this setting for a, b, and c, the formula does not satisfy the first constraint. Hence, we modified the formula to ensure that constraint (1) holds, and obtained

$$f(x) = \min\left\{ x,\ \frac{a x^2 + b x}{x^2 + c x + b} \right\}.$$

The inverse of f is given by the formula

$$f^{-1}(y) = \max\left\{ y,\ \frac{-(b - c y) + \sqrt{(b - c y)^2 + 4 b y (a - y)}}{2 (a - y)} \right\}.$$
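As a rough illustration of how the fitted curve and its inverse behave, the following sketch assumes the curve has the rational form f(x) = (ax² + bx)/(x² + cx + b), clamped to min{x, f(x)} so that constraint (1) holds. That form is an assumption of this sketch, as are the names `f`, `f_star`, and `f_star_inverse`; only the constants a, b, and c come from the text. The inverse is obtained by solving the quadratic (a − y)x² + (b − cy)x − by = 0 for its nonnegative root.

```python
import math

# Constants quoted in the text: a fixed at 1, b and c from the regression.
A, B, C = 1.0, 0.5956, 0.4577


def f(x: float) -> float:
    """Assumed rational form of the fitted curve (an assumption of this sketch)."""
    return (A * x * x + B * x) / (x * x + C * x + B)


def f_star(x: float) -> float:
    """Clamp f so that constraint (1), f(x) <= x, holds for every x >= 0."""
    return min(x, f(x))


def f_star_inverse(y: float) -> float:
    """Invert f_star for 0 <= y < 1.

    Solves (A - y) x^2 + (B - C y) x - B y = 0 for its nonnegative root,
    then takes the max with y to undo the min{x, f(x)} clamp.
    The sketch rejects y >= 1, where the quadratic degenerates.
    """
    if not 0.0 <= y < A:
        raise ValueError("this sketch only handles 0 <= y < 1")
    p = B - C * y
    root = (-p + math.sqrt(p * p + 4.0 * B * y * (A - y))) / (2.0 * (A - y))
    return max(y, root)
```

With these constants the unclamped curve exceeds x whenever x < 1 − c ≈ 0.54, which is why the clamp is needed; in that range the inverse simply returns y.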
Using the function f given above, we can thus define EDE, a method of moments estimator, as follows.
Step 1: Given genomes G and G′, compute the inversion distance d.
Step 2: Return n f⁻¹(d/n), where n is the number of genes, as the estimate of the actual number of rearrangement events.
Since the function f is directly invertible, this allows us to estimate distances efficiently.
Theorem 1 (Moret et al. 2001). Let m be the number of genomes and let n be the number of genes. We can compute the pairwise EDE distance between every pair of genomes in O(nm²) time. If the inversion distance matrix is already computed, then we can compute the EDE distance matrix in O(m²) time.
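The second claim of the theorem can be illustrated with a minimal sketch that turns a precomputed inversion distance matrix into an EDE matrix, doing constant work per pair. The function names `ede` and `ede_matrix` are hypothetical, the closed-form inverse assumes the fitted curve is f(x) = min{x, (ax² + bx)/(x² + cx + b)}, and rejecting d/n ≥ 1 is a simplification of this sketch rather than part of the published method. Step 1 (computing inversion distances) is assumed to have been done elsewhere.

```python
import math
from typing import List, Sequence

# Constants quoted in the text.
A, B, C = 1.0, 0.5956, 0.4577


def ede(d: int, n: int) -> float:
    """Step 2: map an inversion distance d on n genes to n * f^{-1}(d / n)."""
    y = d / n
    if not 0.0 <= y < 1.0:
        raise ValueError("this sketch only handles 0 <= d/n < 1")
    p = B - C * y
    # Nonnegative root of (A - y) x^2 + (B - C y) x - B y = 0.
    root = (-p + math.sqrt(p * p + 4.0 * B * y * (A - y))) / (2.0 * (A - y))
    return n * max(y, root)


def ede_matrix(inv: Sequence[Sequence[int]], n: int) -> List[List[float]]:
    """Correct every entry of a symmetric m x m inversion distance matrix.

    Each pair costs O(1), so the whole matrix takes O(m^2) time.
    """
    m = len(inv)
    out = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            out[i][j] = out[j][i] = ede(inv[i][j], n)
    return out


# Made-up inversion distances for three genomes on n = 120 genes.
example = ede_matrix([[0, 12, 96], [12, 0, 90], [96, 90, 0]], n=120)
```

Small distances are left essentially unchanged, while distances near saturation are stretched upward, which is the intended effect of the correction.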
Wang, LS., Warnow, T., Moret, B.M.E. et al. Distance-Based Genome Rearrangement Phylogeny. J Mol Evol 63, 473–483 (2006). https://doi.org/10.1007/s00239-005-0216-y