An Alignment-Free Distance Measure for Closely Related Genomes

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5267)


Phylogeny reconstruction on a genome scale remains computationally challenging even for closely related organisms. Here we propose an alignment-free pairwise distance measure, K_r, for genomes separated by less than approximately 0.5 mismatches/nucleotide. We have implemented the computation of K_r based on enhanced suffix arrays in the program kr, which is freely available. The software is applied to genomes obtained from three sets of taxa: 27 primate mitochondria, eight Streptococcus agalactiae strains, and 12 Drosophila species. Subsequent clustering of the K_r values always recovers phylogenies that are similar or identical to the accepted branching order.
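
The clustering step mentioned at the end of the abstract can be illustrated with a small sketch. The abstract does not say which clustering algorithm was used, so the example below assumes plain average-linkage clustering (UPGMA) as a stand-in. The function name `upgma`, the taxon labels, and the distance values are all hypothetical; they are not K_r values from the paper.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by average linkage (UPGMA) and return a nested-tuple tree.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    # Each current cluster is keyed by its (sub)tree and stores its leaf count.
    sizes = {name: 1 for name in names}
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    while len(sizes) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    (tree,) = sizes
    return tree


# Made-up pairwise distances for four hypothetical taxa (illustration only).
pairs = {
    ("A", "B"): 0.02,
    ("A", "C"): 0.10,
    ("B", "C"): 0.10,
    ("A", "D"): 0.30,
    ("B", "D"): 0.30,
    ("C", "D"): 0.30,
}
distances = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)  # ('D', ('C', ('A', 'B')))
```

The nested tuples give the branching order only: A and B join first, then C, then D. Branch lengths are omitted to keep the sketch short, and a different method such as neighbor-joining could be substituted without changing the input format.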


Keywords: Drosophila Species; Guide Tree; Drosophila Genome; Phylogeny Reconstruction; Streptococcus Agalactiae
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  1. Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
  2. Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
  3. Institute of Genetics, Universität zu Köln, Cologne, Germany
