Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array

  • Conference paper
Algorithms and Computation (ISAAC 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1969)

Abstract

A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies \( O(n\log |\Sigma |) \) bits for the alphabet \( \Sigma \). In contrast, our data structure does not use the text itself and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text in \( O(|P|\log n + occ\log ^\varepsilon n) \) time and decompress a part of the text of length l in \( O(l + \log ^\varepsilon n) \) time for any given 1 ≥ ε > 0. Our data structure occupies only \( n\left(\frac{2}{\varepsilon}\left(\frac{3}{2} + H_0 + 2\log H_0\right) + 2 + \frac{4\log^\varepsilon n}{\log^\varepsilon n - 1}\right) + o(n) + O(|\Sigma|\log|\Sigma|) \) bits, where \( H_0 \le \log|\Sigma| \) is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
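
To make the quantities in the abstract concrete, the following is a minimal Python sketch of the uncompressed baseline: a plain suffix array (in the style of Manber and Myers) searched by binary search, plus the order-0 entropy \( H_0 \). It is not the compressed structure proposed in the paper; it keeps the text itself and stores the suffix array as ordinary integers. The function names `build_suffix_array`, `find_occurrences` and `order0_entropy` are hypothetical, chosen only for this illustration.

```python
import math
from collections import Counter


def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of pattern, via two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def order0_entropy(text: str) -> float:
    """Order-0 entropy H_0 in bits per character: sum of p_c * log2(1/p_c)."""
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "abra"))  # [0, 7]
    print(order0_entropy(text) <= math.log2(len(set(text))))  # True
```

Each binary-search step compares at most |P| characters and there are O(log n) steps, which gives the \( O(|P|\log n) \) term for locating the range of matching suffixes. In this sketch the occurrences are then read directly from the stored array; the \( occ\log ^\varepsilon n \) term in the abstract is what it costs to report each position when the array is held only in compressed form.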

References

  1. J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei. A Locally Adaptive Data Compression Scheme. Communications of the ACM, 29(4):320–330, April 1986.

  2. M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital SRC Research Report, 1994.

  3. P. Elias. Universal codeword sets and representation of the integers. IEEE Trans. Inform. Theory, IT-21(2):194–203, March 1975.

  4. M. Farach and T. Thorup. String-matching in Lempel-Ziv Compressed Strings. In 27th ACM Symposium on Theory of Computing, pages 703–713, 1995.

  5. P. Ferragina and G. Manzini. Opportunistic Data Structures with Applications. Technical Report TR00-03, Dipartimento di Informatica, Università di Pisa, March 2000.

  6. R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000. http://www.cs.duke.edu/~jsv/Papers/catalog/node68.html.

  7. D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, 1998.

  8. G. Jacobson. Space-efficient Static Trees and Graphs. In 30th IEEE Symp. on Foundations of Computer Science, pages 549–554, 1989.

  9. P. Jokinen and E. Ukkonen. Two Algorithms for Approximate String Matching in Static Texts. In A. Tarlecki, editor, Proceedings of Mathematical Foundations of Computer Science, LNCS 520, pages 240–248, 1991.

  10. J. Kärkkäinen and E. Sutinen. Lempel-Ziv Index for q-Grams. Algorithmica, 21(1):137–154, 1998.

  11. T. Kasai, H. Arimura, R. Fujino, and S. Arikawa. Text data mining based on optimal pattern discovery — towards a scalable data mining system for large text databases—. In Summer DB Workshop, SIGDBS-116-20, pages 151–156. IPSJ, July 1998. (in Japanese).

  12. T. Kida, Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. A Unifying Framework for Compressed Pattern Matching. In Proc. IEEE String Processing and Information Retrieval Symposium (SPIRE’99), pages 89–96, September 1999.

  13. S. Kurtz. Reducing the Space Requirement of Suffix Trees. Technical Report 98–03, Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik, 1998.

  14. U. Manber and G. Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, October 1993.

  15. E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.

  16. E. Moura, G. Navarro, and N. Ziviani. Indexing compressed text. In Proc. of WSP’97, pages 95–111. Carleton University Press, 1997.

  17. J. I. Munro. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Computer Science (FSTTCS’ 96), LNCS 1180, pages 37–42, 1996.

  18. J. I. Munro. Personal communication, July 2000.

  19. K. Sadakane. A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression. In Proceedings of IEEE Data Compression Conference (DCC’99), page 548, 1999. poster session.

  20. K. Sadakane and H. Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In Advances in Database Technologies, number 1552 in LNCS, pages 434–445, 1999.

  21. K. Sadakane and H. Imai. Text Retrieval by using k-word Proximity Search. In Proceedings of International Symposium on Database Applications in Non-Traditional Environments (DANTE’99), pages 23–28. Research Project on Advanced Databases, 1999.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sadakane, K. (2000). Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array. In: Goos, G., Hartmanis, J., van Leeuwen, J., Lee, D.T., Teng, SH. (eds) Algorithms and Computation. ISAAC 2000. Lecture Notes in Computer Science, vol 1969. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40996-3_35

  • DOI: https://doi.org/10.1007/3-540-40996-3_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41255-7

  • Online ISBN: 978-3-540-40996-0

  • eBook Packages: Springer Book Archive
