Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory

  • Simon J. Puglisi
  • W. F. Smyth
  • Andrew Turpin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4209)

Abstract

Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we examine the resource requirements of compressed suffix array algorithms against compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams, thus allowing full pattern matching capabilities, rather than simple word-based search, making its functionality equivalent to that of the compressed suffix array structures. When using equivalent memory for the two structures, inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.
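To make the comparison concrete, the following is a minimal, uncompressed sketch of the two kinds of index the abstract contrasts. It is not the authors' implementation: the function names, the plain Python lists used for postings, and the naive suffix array construction are all assumptions chosen for illustration, and no compression is applied. It shows how an inverted file over q-grams can locate an arbitrary pattern of length at least q by intersecting shifted position lists, and how a suffix array reports the same positions through binary search.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map each q-gram of text to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate_inverted(index, q, pattern):
    """Locate pattern (length >= q) by intersecting shifted postings lists."""
    m = len(pattern)
    if m < q:
        raise ValueError("pattern must be at least q characters long")
    # Tile the pattern with q-grams at offsets 0, q, 2q, ...; add a final
    # overlapping tile at m - q so the tail of the pattern is also covered.
    offsets = list(range(0, m - q + 1, q))
    if offsets[-1] != m - q:
        offsets.append(m - q)
    candidates = None
    for off in offsets:
        postings = index.get(pattern[off:off + q], [])
        shifted = {p - off for p in postings}  # implied pattern start
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []
    return sorted(candidates)


def build_suffix_array(text):
    """Naive construction (illustrative only): sort suffix start positions."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate_suffix_array(text, sa, pattern):
    """Locate pattern with two binary searches over the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "abracadabra"
qgram_index = build_qgram_index(text, 2)
suffix_array = build_suffix_array(text)
hits_inverted = locate_inverted(qgram_index, 2, "abra")
hits_suffix = locate_suffix_array(text, suffix_array, "abra")
```

In this sketch both structures return the same sorted list of start positions; the trade-offs the paper measures (memory use and search time under compression) are not modelled here.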

Keywords

Pattern Match; Search Time; Index Size; English Text; Inverted Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Simon J. Puglisi (1)
  • W. F. Smyth (1, 2)
  • Andrew Turpin (3)

  1. Curtin University of Technology, Perth, Australia
  2. McMaster University, Hamilton, Canada
  3. RMIT University, Melbourne, Australia