Adding Compression and Blended Search to a Compact Two-Level Suffix Array

  • Simon Gog
  • Alistair Moffat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8214)

Abstract

The suffix array is an efficient in-memory data structure for pattern search, and two-level variants exist that are suited to external searching and can handle strings larger than the available memory. Assuming the latter situation, we introduce a factor-based mechanism for compressing the text string that integrates seamlessly with the in-memory index search structure, rather than requiring a separate dictionary. We also introduce a mixture of indexed and sequential pattern search, a trade-off that allows further space savings. Experiments on a 4 GB computer with 62.5 GB of English text show that a two-level arrangement is possible in which around 2.5% of the text size is required as an index, in which the disk-resident components, including the text itself, occupy less than twice the space of the original text, and with which count queries can be carried out using two disk accesses and less than two milliseconds of CPU time.
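As background for the count queries mentioned above, the following is a minimal sketch of counting pattern occurrences with a plain, fully in-memory suffix array. It is not the two-level, compressed structure described in the abstract; the function names `build_suffix_array` and `count_occurrences` are illustrative assumptions, and the quadratic-time construction is chosen only for brevity.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction for illustration only; practical builders are far faster.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern using two binary searches over sa."""
    m = len(pattern)

    # First binary search: leftmost suffix whose m-character prefix
    # is not smaller than the pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Second binary search: leftmost suffix whose m-character prefix
    # is strictly greater than the pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[left:lo] start with the pattern.
    return lo - left


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(count_occurrences(text, sa, "abra"))
```

Every occurrence of the pattern is a prefix of some suffix, and those suffixes are contiguous in the sorted order, so the count is the width of the range delimited by the two boundaries.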

Keywords

string search, pattern matching, suffix array, Burrows-Wheeler transform, succinct data structure, disk-based algorithm, experimental evaluation

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Simon Gog (1)
  • Alistair Moffat (1)
  1. Department of Computing and Information Systems, The University of Melbourne, Australia
