Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array

Sadakane, Kunihiko

doi:10.1007/3-540-40996-3_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1969)

Included in the following conference series:

International Symposium on Algorithms and Computation

Abstract

A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only \( O(n) \) bits for a text of length n; however, it also requires the text itself, which occupies \( O(n\log |\Sigma |) \) bits for the alphabet Σ. In contrast, our data structure does not use the text itself, and it supports the important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text in \( O(|P|\log n + occ\log ^\varepsilon n) \) time and decompress a part of the text of length l in \( O(l + \log ^\varepsilon n) \) time for any given \( 1 \geq \varepsilon > 0 \). Our data structure occupies only \( n\left(\frac{2}{\varepsilon }\left(\frac{3}{2} + H_0 + 2\log H_0 \right) + 2 + \frac{4\log ^\varepsilon n}{\log ^\varepsilon n - 1}\right) + o(n) + O(|\Sigma |\log |\Sigma |) \) bits, where \( H_0 \leq \log |\Sigma | \) is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
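To make the three operations named in the abstract concrete, the following is a minimal sketch and not the paper's data structure: it keeps the suffix array, its inverse, and the successor function Ψ (with Ψ[i] being the rank of the suffix that starts one position after the suffix of rank i) as plain Python lists, so it shows none of the stated space bounds. The class name `TextIndex`, its method names, and the `$` sentinel are assumptions made for illustration. What the sketch does show is that, once Ψ and the sorted first characters are available, search and decompress can run without consulting the original text.

```python
class TextIndex:
    """Uncompressed illustration of inverse, search and decompress."""

    def __init__(self, text, sentinel="$"):
        # Assumption: the sentinel is smaller than every character of the text.
        assert all(sentinel < c for c in text)
        t = text + sentinel
        n = len(t)
        # Naive suffix sorting; adequate only for small examples.
        self.sa = sorted(range(n), key=lambda i: t[i:])
        self.inv = [0] * n
        for rank, pos in enumerate(self.sa):
            self.inv[pos] = rank
        # psi[i] = rank of the suffix starting one position to the right.
        self.psi = [self.inv[(pos + 1) % n] for pos in self.sa]
        # first[i] = first character of the suffix of rank i.
        self.first = sorted(t)
        self.n = n
        # Note: t itself is not stored.

    def inverse(self, pos):
        """Rank of the suffix starting at text position pos."""
        return self.inv[pos]

    def _extract(self, rank, length):
        # Decode characters by following psi from the given rank.
        out = []
        for _ in range(length):
            out.append(self.first[rank])
            rank = self.psi[rank]
        return "".join(out)

    def decompress(self, pos, length):
        """Recover text[pos:pos+length] without access to the text."""
        length = max(0, min(length, self.n - 1 - pos))
        return self._extract(self.inv[pos], length)

    def search(self, pattern):
        """Sorted start positions of pattern, found by two binary searches."""
        m = len(pattern)
        lo, hi = 0, self.n
        while lo < hi:  # first rank whose m-character prefix is >= pattern
            mid = (lo + hi) // 2
            if self._extract(mid, m) < pattern:
                lo = mid + 1
            else:
                hi = mid
        left, hi = lo, self.n
        while lo < hi:  # first rank whose m-character prefix is > pattern
            mid = (lo + hi) // 2
            if self._extract(mid, m) <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return sorted(self.sa[left:lo])


idx = TextIndex("abracadabra")
print(idx.search("abra"), idx.decompress(3, 4), idx.inverse(0))
```

Each binary-search step decodes |P| characters, which gives |P| log n character steps for locating the range of matching suffixes. A compressed structure would replace the full `sa` and `inv` lists with sampled values and encode Ψ compactly; those details are outside what this sketch attempts.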

References

J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei. A Locally Adaptive Data Compression Scheme. Communications of the ACM, 29(4):320–330, April 1986.
M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital SRC Research Report, 1994.
P. Elias. Universal codeword sets and representation of the integers. IEEE Trans. Inform. Theory, IT-21(2):194–203, March 1975.
M. Farach and T. Thorup. String-matching in Lempel-Ziv Compressed Strings. In 27th ACM Symposium on Theory of Computing, pages 703–713, 1995.
P. Ferragina and G. Manzini. Opportunistic Data Structures with Applications. Technical Report TR00-03, Dipartimento di Informatica, Università di Pisa, March 2000.
R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000. http://www.cs.duke.edu/~jsv/Papers/catalog/node68.html.
D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, 1998.
G. Jacobson. Space-efficient Static Trees and Graphs. In 30th IEEE Symp. on Foundations of Computer Science, pages 549–554, 1989.
P. Jokinen and E. Ukkonen. Two Algorithms for Approximate String Matching in Static Texts. In A. Tarlecki, editor, Proceedings of Mathematical Foundations of Computer Science, LNCS 520, pages 240–248, 1991.
J. Kärkkäinen and E. Sutinen. Lempel-Ziv Index for q-Grams. Algorithmica, 21(1):137–154, 1998.
T. Kasai, H. Arimura, R. Fujino, and S. Arikawa. Text data mining based on optimal pattern discovery — towards a scalable data mining system for large text databases—. In Summer DB Workshop, SIGDBS-116-20, pages 151–156. IPSJ, July 1998. (in Japanese).
T. Kida, Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. A Unifying Framework for Compressed Pattern Matching. In Proc. IEEE String Processing and Information Retrieval Symposium (SPIRE’99), pages 89–96, September 1999.
S. Kurtz. Reducing the Space Requirement of Suffix Trees. Technical Report 98–03, Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik, 1998.
U. Manber and G. Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, October 1993.
E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.
E. Moura, G. Navarro, and N. Ziviani. Indexing compressed text. In Proc. of WSP’97, pages 95–111. Carleton University Press, 1997.
J. I. Munro. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Computer Science (FSTTCS’96), LNCS 1180, pages 37–42, 1996.
J. I. Munro. Personal communication, July 2000.
K. Sadakane. A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression. In Proceedings of IEEE Data Compression Conference (DCC’99), page 548, 1999. poster session.
K. Sadakane and H. Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In Advances in Database Technologies, number 1552 in LNCS, pages 434–445, 1999.
K. Sadakane and H. Imai. Text Retrieval by using k-word Proximity Search. In Proceedings of International Symposium on Database Applications in Non-Traditional Environments (DANTE’99), pages 23–28. Research Project on Advanced Databases, 1999.

Author information

Authors and Affiliations

Department of System Information Sciences, Graduate School of Information Sciences, Tohoku University, Japan
Kunihiko Sadakane

Editor information

Editors and Affiliations

Karlsruhe University, Germany
Gerhard Goos
Cornell University, NY, USA
Juris Hartmanis
Utrecht University, The Netherlands
Jan van Leeuwen
Academia Sinica, Institute of Information Science, 128 Academia Road, Section 2, Nankang, 115, Taipei, Taiwan, R.O.C.
D. T. Lee
Department of Computer Science and Akamai Technologies, University of Illinois at Urbana Champaign, 500 Technology Square, 02139, Cambridge, MA, USA
Shang-Hua Teng

About this paper

Cite this paper

Sadakane, K. (2000). Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array. In: Goos, G., Hartmanis, J., van Leeuwen, J., Lee, D.T., Teng, SH. (eds) Algorithms and Computation. ISAAC 2000. Lecture Notes in Computer Science, vol 1969. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40996-3_35

DOI: https://doi.org/10.1007/3-540-40996-3_35
Published: 29 January 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41255-7
Online ISBN: 978-3-540-40996-0
eBook Packages: Springer Book Archive
