Advances in Data Analysis and Classification, Volume 3, Issue 2, pp 95–108

Comparison of alignment free string distances for complete genome phylogeny

  • Frédéric Guyon
  • Céline Brochier-Armanet
  • Alain Guénoche
Regular Article

DOI: 10.1007/s11634-009-0041-z

Cite this article as:
Guyon, F., Brochier-Armanet, C. & Guénoche, A. Adv Data Anal Classif (2009) 3: 95. doi:10.1007/s11634-009-0041-z

Abstract

In this paper, we compare the accuracy of four string distances, computed on complete genomes, for reconstructing phylogenies from simulated and real biological data. These distances are based on common words shared by raw genomic sequences and do not require preliminary processing steps such as gene identification or sequence alignment. Moreover, they are computable in linear time. The first distance is based on Maximum Significant Matches (MSM). The second is computed from the frequencies of all the words of length k (KW). The third distance is based on the Average length of maximum Common Substrings at any position (ACS). The last one is based on the Ziv–Lempel compression algorithm (ZL). We describe a process that simulates evolution to generate a set of sequences evolving along a random tree topology T. This process allows both base substitution and fragment insertion/deletion, including horizontal transfers. The distances between the generated sequences are computed using the four formulas, and the corresponding trees T′ are reconstructed using Neighbor-Joining (NJ). T and T′ are compared according to topological criteria. These comparisons show that the MSM distance outperforms the others regardless of the parameters used to generate the sequences. Finally, we test the MSM and KW distances on real biological data (complete prokaryotic genomes) and compare the NJ trees to a Maximum Likelihood 16S + 23S RNA tree. We show that the MSM distance provides accurate results for studying intra-phylum relationships, much better than those given by KW.
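To make the idea of a word-frequency distance concrete, here is a minimal Python sketch in the spirit of the KW distance. The abstract does not give the formula used by the authors, so the Euclidean distance between relative k-word frequency vectors is an assumption made here for illustration only. The function names `kword_frequencies` and `kw_distance`, and the default `k=3`, are hypothetical.

```python
from collections import Counter
import math


def kword_frequencies(seq, k):
    """Return the relative frequency of every overlapping word of length k in seq."""
    if k <= 0 or len(seq) < k:
        raise ValueError("sequence must contain at least one word of length k")
    # Slide a window of width k over the sequence and count each word.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kw_distance(seq_a, seq_b, k=3):
    """Euclidean distance between the k-word frequency vectors of two sequences.

    ASSUMPTION: the Euclidean metric is an illustrative choice; the paper's
    KW distance may be defined differently.
    """
    freq_a = kword_frequencies(seq_a, k)
    freq_b = kword_frequencies(seq_b, k)
    # Words absent from one sequence have frequency 0 there.
    words = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in words)
    )


if __name__ == "__main__":
    genome_a = "ACGTACGTTGCA"
    genome_b = "ACGTTCGATGCA"
    print(kw_distance(genome_a, genome_b, k=3))
```

A matrix of such pairwise values, computed over a set of genomes, is the kind of input that a distance-based method such as Neighbor-Joining takes. No alignment or gene annotation step is involved: only the raw sequences are read.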

Keywords

  • Phylogeny
  • String distances
  • Complete bacterial genomes

Mathematics Subject Classification (2000)

  • 05C05
  • 68R15
  • 90C27
  • 92B10

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Frédéric Guyon (1)
  • Céline Brochier-Armanet (2)
  • Alain Guénoche (3)

  1. MTI, INSERM-Université Denis Diderot, Paris, France
  2. IBSM-LCB, CNRS-Université de Provence, Marseille, France
  3. IML, CNRS-Université de la Méditerranée, Marseille, France