Abstract
Given a sequence S of n symbols over an alphabet Σ, we develop a new compression method that (i) is very simple to implement; (ii) provides O(1)-time random access to any symbol of the original sequence; and (iii) allows efficient pattern matching over the compressed sequence. Our simplest solution uses at most 2h + o(h) bits of space, where h = n(H_0(S) + 1) and H_0(S) is the zeroth-order empirical entropy of S. We discuss a number of improvements and trade-offs over the basic method, apply the new method to text compression, and propose average-case optimal string matching algorithms.
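To make the space bound concrete, the quantity h can be computed directly from symbol counts. The following is a minimal sketch, assuming the standard definition H_0(S) = Σ_c (n_c/n) log2(n/n_c), where n_c is the number of occurrences of symbol c in S. The function names are hypothetical and chosen only for illustration; the code evaluates the bound and does not implement the compression method itself.

```python
from collections import Counter
from math import log2


def zeroth_order_entropy(s):
    """Zeroth-order empirical entropy H_0(S), in bits per symbol.

    Assumes the standard definition: sum over symbols c of
    (n_c / n) * log2(n / n_c), with n_c the count of c in s.
    """
    n = len(s)
    if n == 0:
        return 0.0
    counts = Counter(s)
    return sum((c / n) * log2(n / c) for c in counts.values())


def space_bound_h(s):
    """h = n * (H_0(S) + 1), the quantity used in the stated space bound."""
    return len(s) * (zeroth_order_entropy(s) + 1)


if __name__ == "__main__":
    text = "abracadabra"  # illustrative input, not taken from the paper
    h = space_bound_h(text)
    print(f"n = {len(text)}")
    print(f"H_0(S) = {zeroth_order_entropy(text):.4f} bits/symbol")
    print(f"h = {h:.2f} bits, leading term 2h = {2 * h:.2f} bits")
```

Note that h exceeds the entropy-coded size n·H_0(S) by n bits, so for a sequence with a single distinct symbol (H_0(S) = 0) the leading term 2h is still 2n bits.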
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Fredriksson, K., Nikitin, F. (2007). Simple Compression Code Supporting Random Access and Fast String Matching. In: Demetrescu, C. (ed.) Experimental Algorithms. WEA 2007. Lecture Notes in Computer Science, vol 4525. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72845-0_16
DOI: https://doi.org/10.1007/978-3-540-72845-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72844-3
Online ISBN: 978-3-540-72845-0
eBook Packages: Computer Science, Computer Science (R0)