Adding Compression and Blended Search to a Compact Two-Level Suffix Array
The suffix array is an efficient in-memory data structure for pattern search, and two-level variants exist that are suited to external searching and can handle strings larger than the available memory. Assuming the latter situation, we introduce a factor-based mechanism for compressing the text string that integrates seamlessly with the in-memory index search structure, rather than requiring a separate dictionary. We also introduce a mixture of indexed and sequential pattern search, a trade-off that allows further space savings. Experiments on a 4 GB computer with 62.5 GB of English text show that a two-level arrangement is possible in which around 2.5% of the text size is required as an index, in which the disk-resident components, including the text itself, occupy less than twice the space of the original text, and with which count queries can be carried out using two disk accesses and less than two milliseconds of CPU time.
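For readers unfamiliar with the underlying structure, a count query on a plain, single-level suffix array can be sketched as below. This is only a toy in-memory illustration and not the two-level, compressed arrangement described in the abstract; the function names `build_suffix_array` and `count_occurrences` are hypothetical, and the naive construction is assumed to be acceptable only for very small strings.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting whole suffixes; suitable for toy inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of pattern using two binary searches over sa."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [first, lo) starts with pattern.
    return lo - first


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(count_occurrences(text, sa, "ana"))
```

Each binary search step compares the pattern against one suffix of the text, so when the text and suffix array live on disk every step may cost a disk access; two-level arrangements of the kind discussed in the abstract aim to keep a small part of the index in memory so that only a few disk accesses remain.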
Keywords: string search, pattern matching, suffix array, Burrows-Wheeler transform, succinct data structure, disk-based algorithm, experimental evaluation