Distance-Based Genome Rearrangement Phylogeny

Wang, Li-San; Warnow, Tandy; Moret, Bernard M. E.; Jansen, Robert K.; Raubeson, Linda A.

doi:10.1007/s00239-005-0216-y

Published: 04 October 2006

Volume 63, pages 473–483, (2006)

Journal of Molecular Evolution

Li-San Wang¹,⁶,
Tandy Warnow²,
Bernard M. E. Moret³,
Robert K. Jansen⁴ &
…
Linda A. Raubeson⁵

Abstract

Evolution operates on whole genomes through direct rearrangements of genes, such as inversions, transpositions, and inverted transpositions, as well as through operations, such as duplications, losses, and transfers, that also affect the gene content of the genomes. Because these events are rare relative to nucleotide substitutions, gene order data offer the possibility of resolving ancient branches in the tree of life; the combination of gene order data with sequence data also has the potential to provide more robust phylogenetic reconstructions, since each can elucidate evolution at different time scales. Distance corrections greatly improve the accuracy of phylogeny reconstructions from DNA sequences, enabling distance-based methods to approach the accuracy of the more elaborate methods based on parsimony or likelihood at a fraction of the computational cost. This paper focuses on developing distance correction methods for phylogeny reconstruction from whole genomes. The main question we investigate is how to estimate evolutionary histories from whole genomes with equal gene content, and we present a technique, the empirically derived estimator (EDE), that we have developed for this purpose. We study the use of EDE on whole genomes with identical gene content, and we explore the accuracy of phylogenies inferred using EDE with the neighbor joining and minimum evolution methods under a wide range of model conditions. Our study shows that tree reconstruction under these two methods is much more accurate when based on EDE distances than when based on other distances previously suggested for whole genomes.

References

Bader D, Moret B, Yan M (2001) A linear time algorithm for computing inversion distances between signed permutations with an experimental study. J Comput Biol 8(5):483–491
Belda E, Moya A, Silva F (2005) Genome rearrangement distances and gene order phylogeny in γ-proteobacteria. Mol Biol Evol 22:1456–1467
Blanchette M, Bourque G, Sankoff D (1997) Breakpoint phylogenies. In: Miyano S, Takagi T (eds) Genome informatics. University Academy Press, Tokyo, pp 25–34
Blanchette M, Kunisawa M, Sankoff D (1999) Gene order breakpoint evidence in animal mitochondrial phylogeny. J Mol Evol 49:193–203
Boore J (1999) Animal mitochondrial genomes. Nucleic Acids Res 27:1767–1780
Boore J, Brown W (1998) Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Curr Opin Genet Dev 8(6):668–674
Boore JL, Collins TM, Stanton D, Daehler LL, Brown WM (1995) Deducing arthropod phylogeny from mitochondrial DNA rearrangements. Nature 376:163–165
Bourque G, Pevzner P (2002) Genome scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12(1):26–36
Cosner M, Jansen R, Moret B, Raubeson L, Wang LS, Warnow T, Wyman S (2000) A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In: Proceedings, 8th International Conference on Intelligent Systems for Molecular Biology (ISMB’00). AAAI Press, Menlo Park, California, pp 104–115
Desper R, Gascuel O (2002) Fast and accurate phylogeny reconstruction algorithms based on the minimum evolution principle. J Comput Biol 9(5):687–705
Downie S, Palmer J (1992) Use of chloroplast DNA rearrangements in reconstructing plant phylogeny. In: Soltis P, Soltis D, Doyle J (eds) Molecular systematics of plants, Vol 49. Chapman & Hall, London, pp 14–35
El-Mabrouk N (2001) Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithms 1(1):105–122
El-Mabrouk N (2002) Reconstructing an ancestral genome using minimum segments duplications and reversals. J Comput Syst Sci 65:442–464
El-Mabrouk N, Sankoff D (2000) Duplication, rearrangement and reconciliation. In: Comparative genomics: empirical and analytical approaches to gene order dynamics, map alignment and the evolution of gene families, Vol 1. Kluwer Academic, New York, pp 537–550
Hannenhalli S, Pevzner P (1995) Transforming cabbage into turnip (polynomial algorithm for genomic distance problems). In: Proceedings, 27th Annual ACM Symposium on the Theory of Computing (STOC’95). ACM Press, New York, pp 178–189
Heard SB (1992) Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution 46:1818–1826
Kim J (1998) Large scale phylogenies and measuring the performance of phylogenetic estimators. Syst Biol 47(1):43–60
Larget B, Simon D, Kadane J (2002) Bayesian phylogenetic inference from animal mitochondrial genome arrangements (with discussion). J Roy Stat Soc Ser B 64:681–693
Larget B, Simon D, Kadane J, Sweet D (2004) A Bayesian analysis of metazoan mitochondrial genome arrangements. Mol Biol Evol 22(3):486–495
Marron M, Swenson K, Moret B (2004) Genomic distances under deletions and insertions. Theor Comput Sci 325(3):347–360 (Special issue: papers from COCOON’03)
Moret B, Warnow T (2005) Advances in phylogeny reconstruction from gene order and content data. In: Zimmer E, Roalson E (eds) Molecular evolution, producing the biochemical data, Part B, 395. Elsevier, Amsterdam, pp 673–700
Moret B, Wang LS, Warnow T, Wyman S (2001) New approaches for reconstructing phylogenies based on gene order. Bioinformatics 17(Suppl):165–173
Moret B, Tang J, Wang LS, Warnow T (2002) Steps toward accurate reconstructions of phylogenies from gene order data. J Comput Syst Sci 65:508–525
Moret B, Tang J, Warnow T (2005) Reconstructing phylogenies from gene content and gene-order data. In: Gascuel O (ed) Mathematics of evolution and phylogeny. Oxford University Press, New York, pp 321–352
Nakhleh L, Moret B, Roshan U, John KS, Sun J, Warnow T (2002a) The accuracy of fast phylogenetic methods for large datasets. In: Proceedings, 7th Pacific Symposium on Biocomputing (PSB’02), pp 211–222
Nakhleh L, Roshan U, Vawter L, Warnow T (2002b) Estimating the deviation from a molecular clock. In: Lecture Notes in Computer Science: Proceedings of the 2nd Workshop for Algorithms and Bioinformatics (WABI’02), Vol 2452. Springer Verlag, New York, pp 287–299
Pinter R, Skiena S (2002) Genomic sorting with length weighted reversals. Genome Inform 13:103–111
Raubeson L, Jansen R (1992) Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science 255:1697–1699
Raubeson L, Jansen R (2005) Chloroplast genomes of plants. In: Henry R (ed) Diversity and evolution of plants genotypic and phenotypic variation in higher plants. CABI, London, pp 45–68
Rokas A, Holland PWH (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol Evol 15:454–459
Rzhetsky A, Nei M (1992) A simple method for estimating and testing minimum-evolution trees. Mol Biol Evol 35:367–375
Rzhetsky A, Sitnikova T (1996) When is it safe to use an oversimplified substitution model in treemaking? Mol Biol Evol 13(9):1255–1265
Saitou N, Imanishi T (1989) Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum evolution, and neighbor joining methods of phylogenetic tree construction in obtaining the correct tree. Mol Biol Evol 6(5):514–525
Saitou N, Nei M (1987) The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
Sanderson MJ (2003) Analysis of rates (r8s) of evolution, v1.6; available at: http://ginger.ucdavis.edu/r8s/
Sankoff D, Blanchette M (1998) Multiple genome rearrangement and breakpoint phylogeny. J Comput Biol 5:555–570
Sourdis J, Krimbas C (1987) Accuracy of phylogenetic trees estimated from DNA sequence data. Mol Biol Evol 4:159–166
Swenson K, Marron M, Earnest-DeYoung J, Moret B (2005) Approximating the true evolutionary distance between two genomes. In: Proceedings, 7th Workshop on Algorithm Engineering and Experiments (ALENEX’05). SIAM Press, Philadelphia, Pennsylvania, pp 37–46
Swofford D (2001) PAUP* 4.0. Sinauer Associates, Sunderland, MA
Swofford D, Olson G, Waddell P, Hillis D (1996) Phylogenetic inference. In: Hillis D, Moritz C, Mable B (eds) Molecular systematics, 2nd ed. Sinauer Associates, Sunderland, MA, chap 11
Tang J, Moret B (2003) Phylogenetic reconstruction from gene rearrangement data with unequal gene contents. In: Lecture Notes in Computer Science: Proceedings, 8th Workshop on Algorithms and Data Structures (WADS’03), Vol 3069, pp 37–46
Tannier E, Sagot M (2004) Sorting by reversals in subquadratic time. In: Lecture Notes in Computer Science: Proceedings, 15th Symposium on Combinatorial Pattern Matching (CPM’04), Vol 3109. Springer Verlag, New York, pp 1–13
Tesler G, Pevzner P (2003) Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci USA 100(13):7672–7677
Tesler G (2002) Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci 65:587–609
Wang LS, Jansen R, Moret B, Raubeson L, Warnow T (2002) Fast phylogenetic methods for the analysis of genome rearrangement data: an empirical study. In: Proceedings of the Fifth Pacific Symposium of Biocomputing (PSB’02), pp 524–535
Zwickl DJ, Hillis DM (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51(4):588–598

Acknowledgments

We thank the two anonymous reviewers for their comments, and one of them in particular for suggesting the Rzhetsky-Nei interior branch length test. This research was supported by National Science Foundation Grants EIA0121680, EF0331453, DEB0120709, DEB0075700, IIS0113654, EF0331654, IIS0121377, IIS0113095, and ANI020203584. Bernard Moret would like to acknowledge support from the IBM Corporation under a DARPA grant for the HPCS initiative and from the NIH under Grant 2R01GM056120-05A1 through a subcontract to the University of Arizona. Li-San Wang was supported in part by an NIH Training Grant in Bioinformatics. Tandy Warnow would like to acknowledge the support of the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program in Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at the University of Texas at Austin.

Author information

Authors and Affiliations

1. Department of Biology, University of Pennsylvania, Philadelphia, PA, 19104, USA (Li-San Wang)
2. Department of Computer Sciences, University of Texas at Austin, Austin, TX, 78712, USA (Tandy Warnow)
3. Department of Computer Science, University of New Mexico, Albuquerque, NM, 87131, USA (Bernard M. E. Moret)
4. Section of Integrative Biology and Institute of Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, 78712, USA (Robert K. Jansen)
5. Department of Biological Sciences, Central Washington University, Ellensburg, WA, 98926, USA (Linda A. Raubeson)
6. 203 Goddard Laboratories, 415 South University Avenue, Philadelphia, PA, 19104, USA (Li-San Wang)

Corresponding author

Correspondence to Li-San Wang.

Additional information

[Reviewing Editor: Dr. Martin Kreitman]

Appendix

EDE: The Empirically Derived Estimator

EDE produces the best results under all model conditions, even when the evolutionary model is exclusively transpositions. For details about the mathematical derivation of this technique, see Moret et al. (2001).

EDE is based on inverting a function for the expected minimum inversion distance produced by a sequence of random inversions. Deriving this expectation analytically (i.e., solving for the expected inversion distance produced by k random inversions) proved to be quite difficult, so we studied it by simulation. Our initial studies showed little difference in behavior between 120 genes (typical of chloroplasts) and 37 genes (typical of mitochondria) and, in particular, suggested that it should be possible to express the normalized expected inversion distance as a function of the normalized number of random inversions. Therefore, we attempted to define a simple function f(k/n) that approximates E[d_INV(G₀, G_k)/n] well, where k is the number of random inversions, n the number of genes, G₀ the initial genome, and G_k the result of applying k random inversions to G₀.

The function f should have the following properties.

1. 0 ≤ f(x) ≤ x, since the inversion distance is always less than or equal to the actual number of inversions.
2. lim_{x→∞} f(x) ≈ 1, as our simulations show that the normalized expected inversion distance is close to 1 when a large number of random inversions is applied.
3. f′(0) = 1, since a single random inversion always produces a genome that is inversion distance 1 away.
4. f⁻¹(y) is defined for all y ∈ [0, 1].

We used n f(x) to estimate E[d_INV(G_nx, G₀)], the expected inversion distance after nx inversions are applied. The nonlinear formula

$$ f(x) = \frac{a x^{2} + b x}{x^{2} + c x + b} \tag{1} $$

satisfies constraints (2) and (4).

We tried several different values for the constant a, and observed in our experiments that setting a = 1 produced the best results in subsequent phylogeny reconstructions using neighbor joining, for all values of n (the number of genes). The estimation of the constants b and c then amounts to a least-squares nonlinear regression; using simulated data we obtained b = 0.5956 and c = 0.4577. However, with this setting for a, b, and c, the formula does not satisfy the first constraint. Hence, we modified the formula to ensure that constraint (1) holds, and obtained

$$ f(x) = \min\left\{ x,\ \frac{a x^{2} + b x}{x^{2} + c x + b} \right\} \tag{2} $$
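To make the need for the clamp concrete, the following minimal Python sketch evaluates the unclamped formula (1) and the clamped formula (2) with the reported constants a = 1, b = 0.5956, c = 0.4577. The function names `g` and `f` and the sample points are illustrative choices, not part of the original study. With a = 1, simple algebra gives g(x) − x = x²(1 − c − x)/(x² + cx + b), so the unclamped value exceeds x exactly when 0 < x < 1 − c.

```python
# Constants reported in the text: a = 1, with b and c from the least-squares fit.
A, B, C = 1.0, 0.5956, 0.4577


def g(x):
    """Unclamped rational formula (1); the name g is an illustrative choice."""
    return (A * x * x + B * x) / (x * x + C * x + B)


def f(x):
    """Clamped formula (2): the value never exceeds x, as constraint (1) requires."""
    return min(x, g(x))


# For a = 1, g(x) > x exactly on 0 < x < 1 - c, which is where the clamp is active.
threshold = 1.0 - C

if __name__ == "__main__":
    for x in (0.1, 0.3, 0.5, threshold, 0.8, 2.0, 10.0):
        print(f"x = {x:7.4f}   g(x) = {g(x):.4f}   f(x) = {f(x):.4f}")
```

Below the threshold 1 − c the clamped function is simply f(x) = x; above it, f follows the rational formula and approaches 1 as x grows.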

The inverse of f is given by the formula

$$ f^{-1}(d) = \max\left\{ d,\ \frac{-(b - cd) + \sqrt{(b - cd)^{2} + 4bd(1 - d)}}{2(1 - d)} \right\} \tag{3} $$
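The inverse can be checked numerically. The sketch below implements formula (3) with a = 1 and the reported values of b and c, then verifies that applying f followed by f⁻¹ recovers the input. The function names are illustrative, and returning infinity for d ≥ 1 is an assumption made here only to avoid division by zero; the text does not say how such values are handled.

```python
import math

# Constants reported in the text (a = 1 is built into the inverse formula).
B, C = 0.5956, 0.4577


def f(x):
    """Formula (2) with a = 1."""
    return min(x, (x * x + B * x) / (x * x + C * x + B))


def f_inv(d):
    """Formula (3): inverse of f for a normalized distance 0 <= d < 1.

    Assumption: d >= 1 maps to infinity, because the closed form divides
    by (1 - d). This handling is hypothetical, not taken from the source.
    """
    if d >= 1.0:
        return math.inf
    disc = (B - C * d) ** 2 + 4.0 * B * d * (1.0 - d)
    root = (-(B - C * d) + math.sqrt(disc)) / (2.0 * (1.0 - d))
    return max(d, root)


if __name__ == "__main__":
    for x in (0.2, 0.6, 1.0, 3.0):
        d = f(x)
        print(f"x = {x:.2f}   f(x) = {d:.6f}   f_inv(f(x)) = {f_inv(d):.6f}")
```

The closed form is the positive root of (1 − d)x² + (b − cd)x − bd = 0, which is what results from setting the rational part of f equal to d and clearing the denominator.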

Using the function f given above, we can thus define EDE, a method of moments estimator, as follows.

Step 1: Given genomes G and G′, compute the inversion distance d.
Step 2: Return n·f⁻¹(d/n), where n is the number of genes, as the estimate of the actual number of rearrangement events.

Since the function f is directly invertible, this allows us to estimate distances efficiently.

Theorem 1 (Moret et al. 2001). Let m be the number of genomes and let n be the number of genes. We can compute the pairwise EDE distance between every pair of genomes in O(nm²) time. If the inversion distance matrix is already computed, then we can compute the EDE distance matrix in O(m²) time.
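The second part of the theorem can be illustrated with a short Python sketch that converts a precomputed inversion distance matrix into an EDE distance matrix, doing constant work per pair of genomes. The function names, the example matrix, and the treatment of d/n ≥ 1 (mapped to infinity) are assumptions made for illustration; computing the inversion distances themselves is outside this sketch.

```python
import math

# Constants reported in the text (a = 1).
B, C = 0.5956, 0.4577


def f_inv(d):
    """Inverse of f from formula (3); d >= 1 -> infinity is an assumption."""
    if d >= 1.0:
        return math.inf
    disc = (B - C * d) ** 2 + 4.0 * B * d * (1.0 - d)
    return max(d, (-(B - C * d) + math.sqrt(disc)) / (2.0 * (1.0 - d)))


def ede_matrix(inv_dist, n):
    """Apply Step 2 of EDE to every pair: O(m^2) for m genomes.

    inv_dist is a symmetric m x m matrix of inversion distances and
    n is the number of genes shared by all genomes.
    """
    m = len(inv_dist)
    ede = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            estimate = n * f_inv(inv_dist[i][j] / n)
            ede[i][j] = ede[j][i] = estimate
    return ede


# Hypothetical inversion distances among three genomes with n = 120 genes.
example = [
    [0, 30, 90],
    [30, 0, 100],
    [90, 100, 0],
]

if __name__ == "__main__":
    for row in ede_matrix(example, 120):
        print(["%.2f" % value for value in row])
```

With these constants the estimate equals the inversion distance d whenever d/n ≤ 1 − c ≈ 0.5423, so the correction only takes effect for the more distant pairs.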

About this article

Cite this article

Wang, LS., Warnow, T., Moret, B.M.E. et al. Distance-Based Genome Rearrangement Phylogeny. J Mol Evol 63, 473–483 (2006). https://doi.org/10.1007/s00239-005-0216-y

Received: 13 September 2005
Accepted: 26 June 2006
Published: 04 October 2006
Issue Date: October 2006
DOI: https://doi.org/10.1007/s00239-005-0216-y
