Distance-Based Genome Rearrangement Phylogeny

Journal of Molecular Evolution

Abstract

Evolution operates on whole genomes through direct rearrangements of genes, such as inversions, transpositions, and inverted transpositions, as well as through operations, such as duplications, losses, and transfers, that also affect the gene content of the genomes. Because these events are rare relative to nucleotide substitutions, gene order data offer the possibility of resolving ancient branches in the tree of life; the combination of gene order data with sequence data also has the potential to provide more robust phylogenetic reconstructions, since each can elucidate evolution at different time scales. Distance corrections greatly improve the accuracy of phylogeny reconstructions from DNA sequences, enabling distance-based methods to approach the accuracy of the more elaborate methods based on parsimony or likelihood at a fraction of the computational cost. This paper focuses on developing distance correction methods for phylogeny reconstruction from whole genomes. The main question we investigate is how to estimate evolutionary histories from whole genomes with equal gene content, and we present a technique, the empirically derived estimator (EDE), that we have developed for this purpose. We study the use of EDE on whole genomes with identical gene content, and we explore the accuracy of phylogenies inferred using EDE with the neighbor joining and minimum evolution methods under a wide range of model conditions. Our study shows that tree reconstruction under these two methods is much more accurate when based on EDE distances than when based on other distances previously suggested for whole genomes.

References

  • Bader D, Moret B, Yan M (2001) A linear time algorithm for computing inversion distances between signed permutations with an experimental study. J Comput Biol 8(5):483–491

  • Belda E, Moya A, Silva F (2005) Genome rearrangement distances and gene order phylogeny in γ-proteobacteria. Mol Biol Evol 22:1456–1467

  • Blanchette M, Bourque G, Sankoff D (1997) Breakpoint phylogenies. In: Miyano S, Takagi T (eds) Genome informatics. University Academy Press, Tokyo, pp 25–34

  • Blanchette M, Kunisawa M, Sankoff D (1999) Gene order breakpoint evidence in animal mitochondrial phylogeny. J Mol Evol 49:193–203

  • Boore J (1999) Animal mitochondrial genomes. Nucleic Acids Res 27:1767–1780

  • Boore J, Brown W (1998) Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Curr Opin Genet Dev 8(6):668–674

  • Boore JL, Collins TM, Stanton D, Daehler LL, Brown WM (1995) Deducing arthropod phylogeny from mitochondrial DNA rearrangements. Nature 376:163–165

  • Bourque G, Pevzner P (2002) Genome scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12(1):26–36

  • Cosner M, Jansen R, Moret B, Raubeson L, Wang LS, Warnow T, Wyman S (2000) A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In: Proceedings, 8th International Conference on Intelligent Systems for Molecular Biology (ISMB’00). AAAI Press, Menlo Park, California, pp 104–115

  • Desper R, Gascuel O (2002) Fast and accurate phylogeny reconstruction algorithms based on the minimum evolution principle. J Comput Biol 9(5):687–705

  • Downie S, Palmer J (1992) Use of chloroplast DNA rearrangements in reconstructing plant phylogeny. In: Soltis P, Soltis D, Doyle J (eds) Molecular systematics of plants, Vol 49. Chapman & Hall, London, pp 14–35

  • El-Mabrouk N (2001) Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithms 1(1):105–122

  • El-Mabrouk N (2002) Reconstructing an ancestral genome using minimum segments duplications and reversals. J Comput Syst Sci 65:442–464

  • El-Mabrouk N, Sankoff D (2000) Duplication, rearrangement and reconciliation. In: Comparative genomics: empirical and analytical approaches to gene order dynamics, map alignment and the evolution of gene families, Vol 1. Kluwer Academic, New York, pp 537–550

  • Hannenhalli S, Pevzner P (1995) Transforming cabbage into turnip (polynomial algorithm for genomic distance problems). In: Proceedings, 27th Annual ACM Symposium on the Theory of Computing (STOC’95). ACM Press, New York, pp 178–189

  • Heard SB (1992) Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution 46:1818–1826

  • Kim J (1998) Large scale phylogenies and measuring the performance of phylogenetic estimators. Syst Biol 47(1):43–60

  • Larget B, Simon D, Kadane J (2002) Bayesian phylogenetic inference from animal mitochondrial genome arrangements (with discussion). J Roy Stat Soc Ser B 64:681–693

  • Larget B, Simon D, Kadane J, Sweet D (2004) A Bayesian analysis of metazoan mitochondrial genome arrangements. Mol Biol Evol 22(3):486–495

  • Marron M, Swenson K, Moret B (2004) Genomic distances under deletions and insertions. Theor Comput Sci 325(3):347–360 (Special issue: papers from COCOON’03)

  • Moret B, Warnow T (2005) Advances in phylogeny reconstruction from gene order and content data. In: Zimmer E, Roalson E (eds) Molecular evolution, producing the biochemical data, Part B, 395. Elsevier, Amsterdam, pp 673–700

  • Moret B, Wang LS, Warnow T, Wyman S (2001) New approaches for reconstructing phylogenies based on gene order. Bioinformatics 17(Suppl):165–173

  • Moret B, Tang J, Wang LS, Warnow T (2002) Steps toward accurate reconstructions of phylogenies from gene order data. J Comput Syst Sci 65:508–525

  • Moret B, Tang J, Warnow T (2005) Reconstructing phylogenies from gene content and gene-order data. In: Gascuel O (ed) Mathematics of evolution and phylogeny. Oxford University Press, New York, pp 321–352

  • Nakhleh L, Moret B, Roshan U, John KS, Sun J, Warnow T (2002a) The accuracy of fast phylogenetic methods for large datasets. In: Proceedings, 7th Pacific Symposium on Biocomputing (PSB’02), pp 211–222

  • Nakhleh L, Roshan U, Vawter L, Warnow T (2002b) Estimating the deviation from a molecular clock. In: Lecture Notes in Computer Science: Proceedings of the 2nd Workshop for Algorithms and Bioinformatics (WABI’02), Vol 2452. Springer Verlag, New York, pp 287–299

  • Pinter R, Skiena S (2002) Genomic sorting with length weighted reversals. Genome Inform 13:103–111

  • Raubeson L, Jansen R (1992) Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science 255:1697–1699

  • Raubeson L, Jansen R (2005) Chloroplast genomes of plants. In: Henry R (ed) Diversity and evolution of plants genotypic and phenotypic variation in higher plants. CABI, London, pp 45–68

  • Rokas A, Holland PWH (2000) Rare genomic changes as a tool for phylogenetics. Trends Ecol Evol 15:454–459

  • Rzhetsky A, Nei M (1992) A simple method for estimating and testing minimum-evolution trees. Mol Biol Evol 35:367–375

  • Rzhetsky A, Sitnikova T (1996) When is it safe to use an oversimplified substitution model in treemaking? Mol Biol Evol 13(9):1255–1265

  • Saitou N, Imanishi T (1989) Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol Biol Evol 6(5):514–525

  • Saitou N, Nei M (1987) The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

  • Sanderson MJ (2003) Analysis of rates (r8s) of evolution, v1.6; available at: http://ginger.ucdavis.edu/r8s/

  • Sankoff D, Blanchette M (1998) Multiple genome rearrangement and breakpoint phylogeny. J Comput Biol 5:555–570

  • Sourdis J, Krimbas C (1987) Accuracy of phylogenetic trees estimated from DNA sequence data. Mol Biol Evol 4:159–166

  • Swenson K, Marron M, Earnest DeYoung J, Moret B (2005) Approximating the true evolutionary distance between two genomes. In: Proceedings, 7th Workshop on Algorithm Engineering and Experiments (ALENEX’05). SIAM Press, Philadelphia, Pennsylvania, pp 37–46

  • Swofford D (2001) PAUP* 4.0. Sinauer Associates, Sunderland, MA

  • Swofford D, Olson G, Waddell P, Hillis D (1996) Phylogenetic inference. In: Hillis D, Moritz C, Mable B (eds) Molecular systematics, 2nd ed. Sinauer Associates, Sunderland, MA, chap 11

  • Tang J, Moret B (2003) Phylogenetic reconstruction from gene rearrangement data with unequal gene contents. In: Lecture Notes in Computer Science: Proceedings, 8th Workshop on Algorithms and Data Structures (WADS’03), Vol 3069, pp 37–46

  • Tannier E, Sagot M (2004) Sorting by reversals in subquadratic time. In: Lecture Notes in Computer Science: Proceedings, 15th Symposium on Combinatorial Pattern Matching (CPM’04), Vol 3109. Springer Verlag, New York, pp 1–13

  • Tesler G, Pevzner P (2003) Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci USA 100(13):7672–7677

  • Tesler G (2002) Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci 65:587–609

  • Wang LS, Jansen R, Moret B, Raubeson L, Warnow T (2002) Fast phylogenetic methods for the analysis of genome rearrangement data: an empirical study. In: Proceedings, 7th Pacific Symposium on Biocomputing (PSB’02), pp 524–535

  • Zwickl DJ, Hillis DM (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51(4):588–598

Acknowledgments

We thank the two anonymous reviewers for their comments, and one of them in particular for suggesting the Rzhetsky-Nei interior branch length test. This research was supported by National Science Foundation Grants EIA0121680, EF0331453, DEB0120709, DEB0075700, IIS0113654, EF0331654, IIS0121377, IIS0113095, and ANI020203584. Bernard Moret would like to acknowledge support from the IBM Corporation under a DARPA grant for the HPCS initiative and from the NIH under Grant 2R01GM056120-05A1 through a subcontract to the University of Arizona. Li-San Wang was supported in part by an NIH Training Grant in Bioinformatics. Tandy Warnow would like to acknowledge the support of the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program in Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at the University of Texas at Austin.

Corresponding author

Correspondence to Li-San Wang.

Additional information

[Reviewing Editor: Dr. Martin Kreitman]

Appendix

EDE: The Empirically Derived Estimator

EDE produces the best results under all model conditions, even when the evolutionary model is exclusively transpositions. For details about the mathematical derivation of this technique, see Moret et al. (2001).

EDE is based on inverting a function for the expected minimum inversion distance produced by a sequence of random inversions. A theoretical approach (i.e., deriving analytically the expected inversion distance produced by k random inversions) proved quite difficult, so we studied this quantity by simulation. Our initial studies showed little difference in behavior between 120 genes (typical of chloroplasts) and 37 genes (typical of mitochondria) and, in particular, suggested that it should be possible to express the normalized expected inversion distance as a function of the normalized number of random inversions. Therefore, we attempted to define a simple function f(k/n) that approximates E[d_INV(G_0, G_k)/n] well, where k is the number of random inversions, n the number of genes, G_0 the initial genome, and G_k the result of applying k random inversions to G_0.
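The simulation setup described above can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names `invert` and `random_inversions` are hypothetical, genomes are assumed to be linear signed permutations stored as lists of nonzero integers, and the two endpoints of each inversion are assumed to be drawn independently and uniformly. Computing the inversion distance itself (for example with the linear-time algorithm of Bader et al. 2001) is not shown.

```python
import random
from typing import List, Optional


def invert(genome: List[int], i: int, j: int) -> List[int]:
    """Return a copy of genome with positions i..j reversed and their signs flipped."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


def random_inversions(genome: List[int], k: int,
                      rng: Optional[random.Random] = None) -> List[int]:
    """Apply k inversions whose endpoints are drawn independently and uniformly
    (an assumed model of a 'random inversion')."""
    rng = rng or random.Random()
    current = list(genome)
    n = len(current)
    for _ in range(k):
        i, j = sorted((rng.randrange(n), rng.randrange(n)))
        current = invert(current, i, j)
    return current


if __name__ == "__main__":
    g0 = list(range(1, 38))  # 37 genes, the mitochondrial size mentioned in the text
    gk = random_inversions(g0, k=10, rng=random.Random(0))
    print(gk)
```

Averaging the normalized inversion distance between `g0` and many such `gk` for each k would give the empirical curve that f is meant to approximate.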

The function f should have the following properties.

  1. 0 ≤ f(x) ≤ x, since the inversion distance is always less than or equal to the actual number of inversions.

  2. lim_{x→∞} f(x) ≈ 1, as our simulations show that the normalized expected inversion distance is close to 1 when a large number of random inversions is applied.

  3. f′(0) = 1, since a single random inversion always produces a genome at inversion distance 1.

  4. f^{-1}(y) is defined for all y ∈ [0, 1].

We used n f(x) to estimate E[d_INV(G_{nx}, G_0)], the expected inversion distance after nx inversions are applied. The nonlinear formula

$$ f(x) = \frac{a x^{2} + b x}{x^{2} + c x + b} $$
(1)

satisfies constraints (2) and (4).

We tried several different values for the constant a, and observed in our experiments that setting a = 1 produced the best results in subsequent phylogeny reconstructions using neighbor joining, for all values of n (the number of genes). The estimation of the constants b and c then amounts to a least-squares nonlinear regression; using simulated data we obtained b = 0.5956 and c = 0.4577. However, with this setting for a, b, and c, the formula does not satisfy the first constraint. Hence, we modified the formula to ensure that constraint (1) holds, and obtained

$$ f(x) = \min\left\{ x,\ \frac{a x^{2} + b x}{x^{2} + c x + b} \right\} $$
(2)
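A minimal sketch of formulas (1) and (2) with the constants reported above may help show why the clamp is needed. The function names `rational` and `f` are illustrative only. With a = 1, a short algebraic check shows that the unclamped expression exceeds x exactly when 0 < x < 1 − c ≈ 0.5423, which is where constraint (1) would otherwise fail.

```python
A, B, C = 1.0, 0.5956, 0.4577  # constants reported in the text


def rational(x: float) -> float:
    """Unclamped formula (1): (a x^2 + b x) / (x^2 + c x + b)."""
    return (A * x * x + B * x) / (x * x + C * x + B)


def f(x: float) -> float:
    """Formula (2): take the minimum with x so that f(x) <= x (constraint 1)."""
    return min(x, rational(x))


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0, 2.0, 5.0):
        print(f"x={x:4.1f}  rational={rational(x):.4f}  f={f(x):.4f}")
```

Below the crossover point the estimator therefore treats the normalized inversion distance as already equal to the normalized number of inversions.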

The inverse of f is given by the formula

$$ f^{-1}(d) = \max\left\{ d,\ \frac{-(b - cd) + \sqrt{(b - cd)^{2} + 4bd(1 - d)}}{2(1 - d)} \right\} $$
(3)
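Formula (3) can be checked by solving d = f(x) for x on the branch where the rational expression is the smaller argument of the minimum. A short derivation, assuming a = 1 as chosen above and 0 ≤ d < 1:

$$ d = \frac{x^{2} + bx}{x^{2} + cx + b} \;\Longrightarrow\; (1 - d)\,x^{2} + (b - cd)\,x - bd = 0 $$

$$ x = \frac{-(b - cd) + \sqrt{(b - cd)^{2} + 4bd(1 - d)}}{2(1 - d)} $$

The root with the plus sign is taken because the product of the two roots, −bd/(1 − d), is negative for 0 < d < 1, so exactly one root is positive. Since f takes the minimum of x and the rational expression, its inverse takes the maximum of d and this root.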

Using the function f given above, we can thus define EDE, a method-of-moments estimator, as follows.

  • Step 1: Given genomes G and G′, compute the inversion distance d.

  • Step 2: Return n f^{-1}(d/n), where n is the number of genes, as the estimate of the actual number of rearrangement events.

Since the function f is directly invertible, this allows us to estimate distances efficiently.
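The two steps can be sketched as below, assuming the inversion distance d has already been computed by some other routine (Step 1 is not implemented here). The names `f_inverse` and `ede` are illustrative. Normalized distances of 1 or more make the denominator of formula (3) vanish or change sign; the text does not say how such values are handled, so this sketch simply rejects them, which is an assumption.

```python
import math

B, C = 0.5956, 0.4577  # regression constants from the text (with a = 1)


def f_inverse(d: float) -> float:
    """Formula (3), evaluated for a normalized distance 0 <= d < 1."""
    if not 0.0 <= d < 1.0:
        # Assumption of this sketch: values outside [0, 1) are rejected.
        raise ValueError("normalized distance must lie in [0, 1)")
    p = B - C * d
    root = (-p + math.sqrt(p * p + 4.0 * B * d * (1.0 - d))) / (2.0 * (1.0 - d))
    return max(d, root)


def ede(inversion_distance: int, n_genes: int) -> float:
    """Step 2: return n * f^{-1}(d / n)."""
    return n_genes * f_inverse(inversion_distance / n_genes)


if __name__ == "__main__":
    for d in (10, 60, 90, 108):
        print(d, round(ede(d, 120), 2))
```

For small distances the estimate equals the inversion distance itself, and the correction grows as d/n approaches 1, where the inversion distance saturates.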

Theorem 1 (Moret et al. 2001). Let m be the number of genomes and let n be the number of genes. We can compute the pairwise EDE distance between every pair of genomes in O(nm^2) time. If the inversion distance matrix is already computed, then we can compute the EDE distance matrix in O(m^2) time.
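The second part of the theorem follows because each matrix entry needs only one constant-time evaluation of formula (3). A hedged sketch, assuming the inversion distances are supplied as a symmetric m × m list of lists with every normalized entry below 1; `ede_matrix` and `f_inverse` are illustrative names, not an existing library API:

```python
import math
from typing import List, Sequence

B, C = 0.5956, 0.4577  # regression constants from the text (with a = 1)


def f_inverse(d: float) -> float:
    """Formula (3) for 0 <= d < 1; other values are rejected (sketch assumption)."""
    if not 0.0 <= d < 1.0:
        raise ValueError("normalized distance must lie in [0, 1)")
    p = B - C * d
    root = (-p + math.sqrt(p * p + 4.0 * B * d * (1.0 - d))) / (2.0 * (1.0 - d))
    return max(d, root)


def ede_matrix(inv_dist: Sequence[Sequence[int]], n_genes: int) -> List[List[float]]:
    """Correct every pairwise inversion distance: O(1) work per pair, O(m^2) overall."""
    m = len(inv_dist)
    out = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            estimate = n_genes * f_inverse(inv_dist[i][j] / n_genes)
            out[i][j] = out[j][i] = estimate
    return out


if __name__ == "__main__":
    distances = [[0, 30, 100], [30, 0, 90], [100, 90, 0]]  # made-up illustrative values
    for row in ede_matrix(distances, n_genes=120):
        print([round(v, 1) for v in row])
```

The resulting matrix can then be passed to a distance-based method such as neighbor joining.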

Cite this article

Wang, LS., Warnow, T., Moret, B.M.E. et al. Distance-Based Genome Rearrangement Phylogeny. J Mol Evol 63, 473–483 (2006). https://doi.org/10.1007/s00239-005-0216-y