A Distance Measure for Genome Phylogenetic Analysis

  • Minh Duc Cao
  • Lloyd Allison
  • Trevor Dix
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5866)


Phylogenetic analyses of species based on single genes or parts of genomes are often inconsistent because of factors such as variable rates of evolution and horizontal gene transfer. The growing number of sequenced genomes allows phylogenies to be constructed from complete genomes, an approach that is less sensitive to such inconsistency. For such long sequences, construction methods such as maximum parsimony and maximum likelihood are often infeasible because of their intensive computational requirements. Another class of tree construction methods, namely distance-based methods, requires a measure of distance between any two genomes. Some measures, such as the evolutionary edit distance of gene order and gene content, are computationally expensive or do not perform well when the gene content of the organisms is similar. This study presents an information-theoretic measure of genetic distance between genomes based on the biological compression algorithm known as the expert model. We demonstrate that our distance measure can be applied to reconstruct the consensus phylogenetic tree of a number of Plasmodium parasites from their genomes, whose statistical bias would mislead conventional analysis methods. Our approach is also used to construct a plausible evolutionary tree for the γ-Proteobacteria group, whose genomes are known to contain many horizontally transferred genes.
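To make the idea of a compression-based distance concrete, the following is a minimal sketch and not the authors' method. It assumes the generic normalized compression distance and uses Python's general-purpose `zlib` compressor as a stand-in for the expert model, which is not reproduced here. The function names `compressed_size`, `ncd`, and `mutate` are hypothetical and chosen only for illustration.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Size in bytes of seq after zlib compression.

    Assumption: zlib stands in for a dedicated DNA compressor such as the
    expert model; it is far weaker on real genomes.
    """
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between sequences x and y.

    Small when compressing x and y together costs little more than
    compressing the larger of the two alone, i.e. when they share content.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Copy seq, replacing each base with probability `rate` by a uniformly
    random base (which may happen to be the same base)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)


if __name__ == "__main__":
    rng = random.Random(0)
    ancestor = "".join(rng.choice("ACGT") for _ in range(2000))
    relative = mutate(ancestor, 0.01, rng)          # lightly diverged copy
    stranger = "".join(rng.choice("ACGT") for _ in range(2000))
    print("related:  ", ncd(ancestor, relative))
    print("unrelated:", ncd(ancestor, stranger))
```

A matrix of such pairwise distances is the kind of input a distance-based tree construction method expects. Note that `zlib` only finds repeats within a 32 KB window, so this sketch is suitable for short toy sequences rather than complete genomes.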


Keywords (added by machine, not by the authors):

  • Horizontal Gene Transfer
  • Compression Algorithm
  • Plasmodium Species
  • Expert Model
  • Yersinia Pestis





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Minh Duc Cao (1)
  • Lloyd Allison (2)
  • Trevor Dix (1)

  1. Clayton School of Information Technology, Monash University
  2. National ICT Australia Victorian Research Laboratory, University of Melbourne
