Abstract
Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. Self-indexes represent a string in space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and separating searchable content from non-searchable presentation aspects of the text. We show that the resulting self-index requires space very close to that of the best word-based compressors, and that it achieves better search times than inverted indexes (using the same overall space) when searching for phrases.
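To make the idea of treating text as a string of words concrete, the following is a minimal sketch in Python. It is not the structure proposed in the paper: it uses a plain, uncompressed suffix array over word identifiers as a stand-in for a self-index, and the tokenization rule and all function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) are assumptions made for illustration only. It shows the two points the abstract raises: the alphabet is the vocabulary of words rather than characters, and separators and case are detached from the searchable content.

```python
import re


def tokenize(text):
    # Assumption: searchable content is the sequence of lowercase
    # alphanumeric words; separators and case are presentation only.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_suffix_array(text):
    words = tokenize(text)
    # The "alphabet" is the vocabulary: one integer symbol per distinct word.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Plain suffix array over word ids (quadratic-space construction,
    # acceptable only for a small illustrative example).
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def find_phrase(phrase, vocab, ids, sa):
    # Map the phrase to word ids; an unknown word means no occurrence.
    pattern = []
    for w in tokenize(phrase):
        if w not in vocab:
            return []
        pattern.append(vocab[w])
    m = len(pattern)
    if m == 0:
        return []

    # First suffix whose m-word prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-word prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Word positions (0-based) where the phrase starts, in text order.
    return sorted(sa[start:lo])


vocab, ids, sa = build_word_suffix_array(
    "The cat sat on the mat, and the cat slept."
)
print(find_phrase("the cat", vocab, ids, sa))
```

A phrase query here costs two binary searches over the suffix array regardless of how frequent the individual words are, whereas an inverted index has to intersect the position lists of the phrase's words. A real self-index would additionally replace both the text and the suffix array with a compressed representation.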
Funded in part (for the Spanish group) by MEC grant (TIN2006-15071-C03-03), and (for the third author) by Yahoo! Research grant “Compact Data Structures”. We also thank AECI grant A/8065/07.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brisaboa, N.R., Fariña, A., Navarro, G., Places, A.S., Rodríguez, E. (2008). Self-indexing Natural Language. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_13
DOI: https://doi.org/10.1007/978-3-540-89097-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89096-6
Online ISBN: 978-3-540-89097-3
eBook Packages: Computer Science (R0)