Smaller Self-indexes for Natural Language

Brisaboa, Nieves R.; Navarro, Gonzalo; Ordóñez, Alberto

doi:10.1007/978-3-642-34109-0_39

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7608)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Self-indexes for natural-language texts, which treat the text as a sequence of tokens (words or separators), achieve very attractive space and search time. However, they suffer a space penalty due to their large vocabulary. In this paper we show that replacing the Huffman encoding they implicitly use with the slightly weaker Hu-Tucker encoding, which respects the lexical order of the vocabulary, improves both their space and time.
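
The trade-off named in the abstract can be illustrated with a small sketch. It compares the total codeword length of a Huffman code with that of an optimal order-preserving (alphabetic) code over a toy vocabulary. Several things here are assumptions and not taken from the paper: the word list and frequencies are invented, the function names are hypothetical, and the alphabetic code is found with a simple O(n^3) dynamic program used as a stand-in for the Hu-Tucker algorithm (which reaches the same optimum in O(n log n) time). It is not the authors' implementation.

```python
import heapq
from functools import lru_cache


def huffman_cost(weights):
    """Total weighted codeword length (in bits) of a Huffman code."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, a + b)
    return total


def alphabetic_code(weights):
    """Optimal order-preserving prefix code via dynamic programming.

    Stand-in for Hu-Tucker: same optimal cost, but O(n^3) time.
    Returns (total weighted length, list of codewords in input order).
    """
    n = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    @lru_cache(maxsize=None)
    def best(i, j):
        # (cost, split point) of the best tree over leaves i..j inclusive
        if i == j:
            return (0, i)
        span = prefix[j + 1] - prefix[i]
        return min(
            (best(i, k)[0] + best(k + 1, j)[0] + span, k) for k in range(i, j)
        )

    codes = [""] * n

    def assign(i, j, code):
        if i == j:
            codes[i] = code or "0"  # single-symbol vocabulary gets "0"
            return
        k = best(i, j)[1]
        assign(i, k, code + "0")      # left subtree: lexically smaller words
        assign(k + 1, j, code + "1")  # right subtree: lexically larger words

    assign(0, n - 1, "")
    return best(0, n - 1)[0], codes


if __name__ == "__main__":
    # Hypothetical vocabulary, already in lexical order, with made-up counts.
    words = ["a", "and", "index", "of", "text", "the"]
    freqs = [20, 15, 3, 18, 4, 40]
    cost, codes = alphabetic_code(freqs)
    print("Huffman bits:   ", huffman_cost(freqs))
    print("Alphabetic bits:", cost)
    for word, code in zip(words, codes):
        print(f"{word:>6} -> {code}")
```

Because the alphabetic codewords sort in the same order as the words, a range of lexically consecutive words maps to a contiguous range of codes. The price is a total length that can never be smaller than Huffman's, which is the "slightly weaker" compression the abstract refers to.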

Funded by the Spanish MICINN (PGE and FEDER) refs. TIN2009-14560-C03-0, MICINN ref. AP2010-6038 (FPU Program) for Alberto Ordóñez, and Fondecyt Grant 1-110066, Chile for Gonzalo Navarro.

Author information

Authors and Affiliations

Database Lab., Univ. of A Coruña, Spain
Nieves R. Brisaboa & Alberto Ordóñez
Dept. of Computer Science, Univ. of Chile, Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Information Technologies Research Group, Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
Liliana Calderón-Benavides
Information Technologies and Research Group, Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
Cristina González-Caro
School of Physics and Mathematics, Universidad Michoacana, Edificio "B", Ciudad Universitaria, 58000, Morelia, Mexico
Edgar Chávez
Department of Computer Science, Universidade Federal de Minas Gerais, Av. Antonio Carlos 6627, Pampulha, 31270-010, Belo Horizonte, Brazil
Nivio Ziviani

About this paper

Cite this paper

Brisaboa, N.R., Navarro, G., Ordóñez, A. (2012). Smaller Self-indexes for Natural Language. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds) String Processing and Information Retrieval. SPIRE 2012. Lecture Notes in Computer Science, vol 7608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_39

DOI: https://doi.org/10.1007/978-3-642-34109-0_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34108-3
Online ISBN: 978-3-642-34109-0
eBook Packages: Computer Science, Computer Science (R0)