Models and Algorithms for Genome Evolution

Volume 19 of the series Computational Biology pp 147-171

Rearrangements in Phylogenetic Inference: Compare, Model, or Encode?

  • Bernard M. E. Moret, Laboratory for Computational Biology and Bioinformatics, EPFL (email author)
  • Yu Lin, Laboratory for Computational Biology and Bioinformatics, EPFL
  • Jijun Tang, Department of Computer Science and Engineering, University of South Carolina

We survey phylogenetic inference from rearrangement data, as viewed through the lens of the work of our group in this area, in tribute to David Sankoff, pioneer and mentor.

Genomic rearrangements were first used for phylogenetic analysis in the late 1920s, but it was not until the 1990s that this approach was revived, with the advent of genome sequencing. G. Watterson et al. proposed to measure the inversion distance between two genomes, J. Palmer et al. studied the evolution of mitochondrial and chloroplast genomes, and D. Sankoff and W. Day published the first algorithmic paper on phylogenetic inference from rearrangement data, giving rise to a fertile field of mathematical, algorithmic, and biological research.

Distance measures for sequence data are simple to define, but those based on rearrangements proved to be complex mathematical objects. The first approaches for phylogenetic inference from rearrangement data, due to D. Sankoff, used model-free distances, such as synteny (colocation on a chromosome) or breakpoints (disrupted adjacencies). The development of algorithms for distance and median computations led to modeling approaches based on biological mechanisms. However, the multiplicity of such mechanisms poses serious challenges. A unifying framework, proposed by S. Yancopoulos et al. and popularized by D. Sankoff, has supported major advances, such as precise distance corrections and efficient algorithms for median estimation, thereby enabling phylogenetic inference using both distance and maximum-parsimony methods.
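To make the notion of a model-free distance concrete, the following sketch computes a breakpoint distance between two signed gene orders on a single linear chromosome. It is a minimal illustration, not the authors' implementation; the framing convention and function name are our own.

```python
def breakpoint_distance(g1, g2):
    """Breakpoint distance between two signed gene orders on the same gene set.

    A sketch for single, linear chromosomes: count the adjacencies of g1
    that do not appear (in either reading direction) in g2.
    """
    def adjacencies(g):
        framed = [0] + list(g) + [len(g) + 1]  # frame with telomere markers
        # The adjacency (a, b) read left-to-right equals (-b, -a) read
        # right-to-left, so store one canonical form per adjacency.
        return {min((a, b), (-b, -a)) for a, b in zip(framed, framed[1:])}

    return len(adjacencies(g1) - adjacencies(g2))
```

For instance, inverting the single gene 2 in the order 1, 2, 3 disrupts two adjacencies, so `breakpoint_distance([1, 2, 3], [1, -2, 3])` returns 2.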

Likelihood-based methods outperform distance and maximum-parsimony methods, but using such methods with rearrangements has proved problematic. Thus we have returned to an approach we first proposed 12 years ago: encoding the genome structure into sequences and using likelihood methods on these sequences. With a suitable bias in the ground probabilities, we attain levels of performance comparable to the best sequence-based methods. Unsurprisingly, the idea of injecting such a bias was first proposed by D. Sankoff.
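The encoding step above can be sketched as follows: represent each genome as a binary string over the union of observed adjacencies, one column per adjacency. This is an illustrative sketch under our own conventions (function names and the telomere-framing trick are assumptions, not the chapter's code); the resulting binary characters would then be handed to a standard maximum-likelihood sequence method, with transition probabilities biased so that losing a specific adjacency (1 to 0) is far likelier than gaining one (0 to 1).

```python
def encode_genomes(genomes):
    """Encode signed gene orders as binary presence/absence sequences.

    Each column corresponds to one adjacency seen somewhere in the input;
    '1' marks its presence in that genome. A sketch for single, linear
    chromosomes.
    """
    def adjacencies(g):
        framed = [0] + list(g) + [len(g) + 1]  # frame with telomere markers
        # (a, b) read left-to-right equals (-b, -a) read right-to-left.
        return {min((a, b), (-b, -a)) for a, b in zip(framed, framed[1:])}

    adj = {name: adjacencies(g) for name, g in genomes.items()}
    columns = sorted(set().union(*adj.values()))  # fixed column order
    return {name: ''.join('1' if c in s else '0' for c in columns)
            for name, s in adj.items()}
```

For two three-gene genomes differing by one single-gene inversion, the two genomes share two of their four adjacencies each, so the encoding has six columns with four 1s per row.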