Self-indexing Natural Language

  • Conference paper
String Processing and Information Retrieval (SPIRE 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Abstract

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. Self-indexes represent a string in space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases.
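
The idea of treating text as a string of words can be illustrated with a minimal sketch. It is not the paper's compressed self-index: it builds a plain, uncompressed suffix array over word identifiers, only to show how a phrase search reduces to a binary search over word-level suffixes. All names (tokenize, build_word_suffix_array, locate_phrase) are hypothetical, and discarding separators and case is an assumed simplification of the searchable/non-searchable split mentioned above.

```python
import re


def tokenize(text):
    """Split text into lowercase words, dropping separators (a simplification)."""
    return re.findall(r"\w+", text.lower())


def build_word_suffix_array(text):
    """Map each distinct word to an integer and sort all word-level suffixes."""
    words = tokenize(text)
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction: fine for a demo, far too slow and large for real text.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return ids, vocab, sa


def _boundary(ids, sa, pat, strict):
    """First suffix-array index whose length-m prefix is >= pat (or > pat if strict)."""
    lo, hi = 0, len(sa)
    m = len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        window = ids[sa[mid]:sa[mid] + m]
        if window < pat or (strict and window == pat):
            lo = mid + 1
        else:
            hi = mid
    return lo


def locate_phrase(phrase, ids, vocab, sa):
    """Return the word positions where the phrase occurs, in text order."""
    pat = []
    for w in tokenize(phrase):
        if w not in vocab:
            return []  # a word absent from the vocabulary cannot match
        pat.append(vocab[w])
    if not pat:
        return []
    start = _boundary(ids, sa, pat, strict=False)
    end = _boundary(ids, sa, pat, strict=True)
    return sorted(sa[start:end])


ids, vocab, sa = build_word_suffix_array("To be, or not to be: that is the question.")
print(locate_phrase("to be", ids, vocab, sa))
```

Because the alphabet here is the vocabulary rather than the character set, a phrase of m words costs m integer comparisons per binary-search step, regardless of how long the individual words are.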

Funded in part (for the Spanish group) by MEC grant (TIN2006-15071-C03-03), and (for the third author) by Yahoo! Research grant “Compact Data Structures”. We also thank AECI grant A/8065/07.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brisaboa, N.R., Fariña, A., Navarro, G., Places, A.S., Rodríguez, E. (2008). Self-indexing Natural Language. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_13

  • DOI: https://doi.org/10.1007/978-3-540-89097-3_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89096-6

  • Online ISBN: 978-3-540-89097-3

  • eBook Packages: Computer Science, Computer Science (R0)
