Large-Scale Neighbor-Joining with NINJA

  • Travis J. Wheeler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5724)

Abstract

Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and clustering order depends on a metric related to those distances. The canonical algorithm requires O(n³) time and O(n²) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative biology studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4GB RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are viewed in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because of external-memory-efficient data structures. A free implementation may be obtained from http://nimbletwist.com/software/ninja
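
For orientation, the canonical procedure that the abstract refers to can be sketched as follows. This is a minimal illustration of textbook neighbor-joining only, using the standard selection criterion Q(i,j) = (r − 2)·d(i,j) − R(i) − R(j) over the r active nodes; it is not the NINJA algorithm and does none of the filtering or external-memory work described above. The function name `neighbor_joining`, the edge-list output, and the numbering of internal nodes are assumptions made for this example, and the input is assumed to be a symmetric matrix with at least two sequences.

```python
from itertools import combinations


def neighbor_joining(dist):
    """Textbook neighbor-joining on a symmetric distance matrix (list of lists).

    Returns a list of edges (node, neighbor, branch_length). Leaves are
    numbered 0..n-1; internal nodes are numbered from n upward (an
    illustrative convention, not taken from the paper).
    """
    n = len(dist)
    # Working copy as a dict of dicts so nodes can be added and removed.
    d = {i: {j: float(dist[i][j]) for j in range(n) if j != i} for i in range(n)}
    edges = []
    next_id = n
    while len(d) > 2:
        r = len(d)
        # Row sums over the currently active nodes.
        total = {i: sum(d[i].values()) for i in d}
        # Scan every active pair for the minimum Q value. This O(n^2) scan,
        # repeated over O(n) joins, is the source of the O(n^3) running time.
        i, j = min(
            combinations(sorted(d), 2),
            key=lambda p: (r - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        dij = d[i][j]
        # Branch lengths from the joined pair to the new internal node.
        len_i = 0.5 * dij + (total[i] - total[j]) / (2 * (r - 2))
        len_j = dij - len_i
        u = next_id
        next_id += 1
        edges.append((i, u, len_i))
        edges.append((j, u, len_j))
        # Distances from the new node to every remaining node.
        new_row = {k: 0.5 * (d[i][k] + d[j][k] - dij) for k in d if k not in (i, j)}
        del d[i], d[j]
        for k in d:
            del d[k][i], d[k][j]
            d[k][u] = new_row[k]
        d[u] = new_row
    # Connect the last two active nodes directly.
    a, b = sorted(d)
    edges.append((a, b, d[a][b]))
    return edges
```

The full matrix held in `d` corresponds to the O(n²) space requirement, and the exhaustive pair scan in each iteration is the step whose cost the paper aims to reduce by viewing far fewer distance values.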

Keywords

Phylogeny inference, Neighbor joining, External memory

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Travis J. Wheeler
    Department of Computer Science, The University of Arizona, Tucson, USA
