Self-indexing Natural Language

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • Angeles S. Places
  • Eduardo Rodríguez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5280)

Abstract

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. A self-index represents a string in space close to its compressed size and supports indexed searching on it. For natural language, a compressed inverted index built over the compressed text already provides a reasonable alternative, in both space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. Several challenges are involved, such as handling a very large alphabet and separating searchable content from non-searchable presentation aspects of the text. We show that the resulting self-index requires space very close to that of the best word-based compressors, and that it achieves better search times than inverted indexes (using the same overall space) when searching for phrases.
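
The idea of treating text as a string of words can be illustrated with a small sketch. The code below is not the self-index described in the paper: it is a plain, uncompressed word-level suffix array, built naively, and it only shows how a phrase query becomes a binary search once every distinct word is mapped to an integer symbol. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) and the tokenization rule (lowercased runs of word characters, with separators simply discarded) are assumptions made for illustration.

```python
import re
from bisect import bisect_left, bisect_right


def tokenize(text):
    """Return the searchable words, lowercased; punctuation and spaces are dropped."""
    return [w.lower() for w in re.findall(r"\w+", text)]


def build_word_suffix_array(words):
    """Map each distinct word to an integer id and sort all word-level suffixes."""
    # The "alphabet" is the vocabulary, which is why it can become very large.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction: order suffix start positions by the id sequence they begin.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def find_phrase(phrase, vocab, ids, sa):
    """Return the sorted word positions at which the phrase occurs."""
    pattern = []
    for w in tokenize(phrase):
        if w not in vocab:
            return []  # a word absent from the text cannot be part of a match
        pattern.append(vocab[w])
    if not pattern:
        return []
    m = len(pattern)
    # Suffixes truncated to m symbols stay in sorted order, so all matches are
    # contiguous in sa. The key list is materialized here only for brevity.
    keys = [ids[i:i + m] for i in sa]
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return sorted(sa[lo:hi])


text = "To be, or not to be: that is the question."
vocab, ids, sa = build_word_suffix_array(tokenize(text))
print(find_phrase("to be", vocab, ids, sa))  # [0, 4]
```

Positions are word offsets, not character offsets. Because the sketch discards separators and case, it cannot reproduce the original text; a real system would have to store that presentation information separately from the searchable word sequence.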

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Antonio Fariña (1)
  • Gonzalo Navarro (2)
  • Angeles S. Places (1)
  • Eduardo Rodríguez (1)

  1. Database Lab., Univ. da Coruña, Spain
  2. Dept. of Computer Science, Univ. of Chile, Chile
