Reducing Space Requirements for Disk Resident Suffix Arrays

  • Alistair Moffat
  • Simon J. Puglisi
  • Ranjan Sinha
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5463)

Abstract

Suffix trees and suffix arrays are important data structures for string processing, providing efficient solutions for many applications involving pattern matching. Recent work by Sinha et al. (SIGMOD 2008) addressed the problem of arranging a suffix array on disk so that querying is fast, and showed that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered many times faster than alternative disk-based suffix trees. A drawback of their LOF-SA structure, and one common to all current disk resident suffix tree/array approaches, is that the space requirement of the data structure, though on disk, is large relative to the text: for the LOF-SA, it is 13n bytes including the underlying n-byte text. In this paper we explore techniques for reducing the space required by the LOF-SA. Experiments show that these methods cut the data structure to nearly half its original size without any impact on search times for large strings that necessitate on-disk structures.

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
  2. Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary codes. Information Retrieval 8(1), 151–166 (2005)
  3. Apostolico, A.: The myriad virtues of subword trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO ASI Series F12, pp. 85–96. Springer, Berlin (1985)
  4. Baeza-Yates, R.A., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Information Systems 21(6), 497–514 (1996)
  5. Brisaboa, N.R., Fariña, A., Navarro, G., Esteller, M.F.: (S,C)-dense coding: An optimized compression code for natural language text databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 122–136. Springer, Heidelberg (2003)
  6. Cheung, C.-F., Xu Yu, J., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 90–105 (2005)
  7. Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1–35 (2002)
  8. Culpepper, J.S., Moffat, A.: Enhanced byte codes with restricted prefix properties. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 1–12. Springer, Heidelberg (2005)
  9. Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. ACM Journal of Experimental Algorithmics 12(3.4), 1–24 (2008)
  10. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. Journal of the ACM 47(6), 987–1011 (2000)
  11. González, R., Navarro, G.: Compressed text indexes with fast locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)
  12. Gusfield, D.: Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge (1997)
  13. Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
  14. Larsson, J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)
  15. Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
  16. Manzini, G.: Two space saving tricks for linear time LCP array computation. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, pp. 372–383. Springer, Heidelberg (2004)
  17. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)
  18. Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Computing Surveys 39(1) (2007)
  19. Phoophakdee, B., Zaki, M.J.: Genome-scale disk-based suffix tree indexing. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 833–844. ACM, New York (2007)
  20. Sinha, R., Puglisi, S.J., Moffat, A., Turpin, A.: Improving suffix array locality for fast pattern matching on disk. In: Lakshmanan, L.V.S., Ng, R.T., Shasha, D. (eds.) Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 661–672. ACM, New York (2008)
  21. Smyth, B.: Computing Patterns in Strings. Pearson Addison-Wesley, Essex (2003)
  22. Tian, Y., Tata, S., Hankins, R.A., Patel, J.M.: Practical methods for constructing suffix trees. The VLDB Journal 14(3), 281–299 (2005)
  23. Ukkonen, E.: Online construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
  24. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pp. 1–11. IEEE Computer Society, Washington (1973)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Alistair Moffat (1)
  • Simon J. Puglisi (2)
  • Ranjan Sinha (1)

  1. Department of Computer Science and Software Engineering, The University of Melbourne, Australia
  2. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
