Fast Discovery of Similar Sequences in Large Genomic Collections

  • Yaniv Bernstein
  • Michael Cameron
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


Detection of highly similar sequences within genomic collections has a number of applications, including the assembly of expressed sequence tag data, genome comparison, and clustering sequence collections for improved search speed and accuracy. While several approaches exist for this task, they are becoming infeasible, either in space or in time, as genomic collections continue to grow at a rapid pace. In this paper we present an approach based on document fingerprinting for identifying highly similar sequences. Our approach uses a modest amount of memory and runs in time roughly proportional to the size of the collection. We demonstrate substantial speed improvements compared to the CD-HIT algorithm, the most successful existing approach for clustering large protein sequence collections.
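
The abstract does not spell out the fingerprinting procedure, so the following is only a minimal sketch of the general idea behind fingerprint-based similarity detection, not the authors' algorithm. It assumes fixed-length overlapping chunks, an in-memory inverted index keyed by the chunk text itself (a real system would more likely store hashes of selected chunks), and a simple shared-chunk threshold. The names `chunks`, `similar_pairs`, `k`, and `min_shared` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def chunks(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def similar_pairs(seqs, k=4, min_shared=3):
    """Return {(i, j): n} for sequence pairs sharing at least min_shared distinct chunks.

    Illustrative only: chunk length and threshold are assumed values,
    and the chunk text stands in for a fingerprint hash.
    """
    # Inverted index: chunk -> set of ids of sequences containing it.
    index = defaultdict(set)
    for sid, seq in enumerate(seqs):
        for c in chunks(seq, k):
            index[c].add(sid)

    # Count, for every pair that co-occurs in a posting set,
    # how many distinct chunks the two sequences share.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


if __name__ == "__main__":
    collection = ["ACGTACGTAC", "ACGTACGTAG", "TTTTGGGGCC"]
    print(similar_pairs(collection))
```

In this naive form, a chunk that occurs in many sequences generates a quadratic number of candidate pairs, so the sketch does not by itself achieve the near-linear running time the abstract reports; it only shows how an index of shared chunks lets candidate pairs be found without comparing every sequence against every other.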







Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yaniv Bernstein (1)
  • Michael Cameron (1)
  1. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
