Self-indexing Natural Language

Brisaboa, Nieves R.; Fariña, Antonio; Navarro, Gonzalo; Places, Angeles S.; Rodríguez, Eduardo

doi:10.1007/978-3-540-89097-3_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. A self-index represents a string in space close to its compressed size and provides indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. This involves several challenges, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects of the text. We show that the resulting self-index requires space very close to that of the best word-based compressors, and that it obtains better search times than inverted indexes (using the same overall space) when searching for phrases.
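
To make the idea of "a string of words" concrete, the following minimal sketch builds a plain, uncompressed suffix array over word identifiers and uses it for phrase search. It is only an illustration under stated assumptions: the abstract does not specify the self-index used, and a real self-index would replace both the word sequence and the suffix array with a compressed representation. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) and the tokenization rule are hypothetical choices made for this example.

```python
import re


def tokenize(text):
    # Assumption: the searchable content is the lowercased alphanumeric words;
    # separators and punctuation (presentation aspects) are simply dropped here.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_suffix_array(text):
    words = tokenize(text)
    # The vocabulary plays the role of the (very large) alphabet:
    # each distinct word becomes one integer symbol.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction: sort word positions by the suffix starting there.
    # Fine for a demonstration, far too slow and large for real collections.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return ids, vocab, sa


def find_phrase(ids, vocab, sa, phrase):
    pattern = []
    for w in tokenize(phrase):
        if w not in vocab:
            return []  # a word outside the vocabulary cannot occur in the text
        pattern.append(vocab[w])
    m = len(pattern)
    if m == 0:
        return []

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All occurrences lie in sa[start:lo]; report word positions in text order.
    return sorted(sa[start:lo])
```

For example, on the text "to be or not to be that is the question", searching for the phrase "to be" returns the word positions 0 and 4. The whole phrase is located with two binary searches over a single range of the suffix array, whereas an inverted index would intersect the position lists of the individual words.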

Funded in part (for the Spanish group) by MEC grant (TIN2006-15071-C03-03), and (for the third author) by Yahoo! Research grant “Compact Data Structures”. We also thank AECI grant A/8065/07.

Author information

Authors and Affiliations

Database Lab., Univ. da Coruña, Spain
Nieves R. Brisaboa, Antonio Fariña, Angeles S. Places & Eduardo Rodríguez
Dept. of Computer Science, Univ. of Chile, Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Department of Computer Science, Bar-Ilan University, 52900, Ramat-Gan, Israel
Amihood Amir
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Andrew Turpin
NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, Australia
Alistair Moffat

Cite this paper

Brisaboa, N.R., Fariña, A., Navarro, G., Places, A.S., Rodríguez, E. (2008). Self-indexing Natural Language. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_13

DOI: https://doi.org/10.1007/978-3-540-89097-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89096-6
Online ISBN: 978-3-540-89097-3
eBook Packages: Computer Science, Computer Science (R0)
