Text Index Compression

Konow, Roberto; Navarro, Gonzalo

doi:10.1007/978-1-4614-8265-9_945

Reference work entry
First Online: 01 January 2018

Synonyms

Inverted index/list/file compression

Definition

Text index compression is the problem of designing a reduced-space data structure that provides fast search on a text collection, seen as a set of documents. In information retrieval (IR) the search queries are usually one or a set of words or phrases. Full-text searching aims to retrieve the documents where all or some of the query words/phrases appear. Relevance ranking aims at retrieving a ranked list of the documents that are most relevant to the query, according to some criterion. As inverted indexes (sometimes also called inverted lists or inverted files) are by far the most popular type of text index in IR, this entry focuses on different techniques to compress inverted indexes, depending on whether they are oriented to full-text searching or to relevance ranking.
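As a rough illustration of the definition above, the following sketch builds a tiny inverted index, compresses each posting list, and answers a full-text query that requires all query words. The compression scheme (storing gaps between consecutive document identifiers and coding each gap with a variable-byte code) is an assumption chosen for brevity, since it is one commonly used approach; it is not taken from the text above, and all function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits at a time, low bits first.

    The high bit is set on the last byte of each integer (one of several
    possible conventions; assumed here for illustration).
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of the current integer
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(postings):
    """Store the first document id, then the gaps between consecutive ids."""
    if not postings:
        return b""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild document ids by accumulating the decoded gaps."""
    postings, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        postings.append(total)
    return postings


def search_all(compressed_index, words):
    """Return the ids of the documents that contain every query word."""
    result = None
    for word in words:
        ids = set(decompress_postings(compressed_index.get(word.lower(), b"")))
        result = ids if result is None else result & ids
    return sorted(result or [])


docs = [
    "compressed text index",
    "inverted index compression",
    "text retrieval with inverted index",
]
compressed = {w: compress_postings(p) for w, p in build_index(docs).items()}
print(search_all(compressed, ["inverted", "index"]))
```

Gaps tend to be small for frequent words, so most of them fit in a single byte here. This sketch covers only full-text searching; an index oriented to relevance ranking would also need to store per-document weights or frequencies.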

Historical Background

Text indexing techniques have been known at least since the 1960s (see, e.g., the book by Salton [16], one of the pioneers in the area)....

Recommended Reading

[1] Anh V, Moffat A. Simplified similarity scoring using term ranks. In: Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval; 2005. p. 226–33.
[2] Anh V, Moffat A. Improved word-aligned binary compression for text indexing. IEEE Trans Knowl Data Eng. 2006;18(6):857–61.
[3] Arroyuelo D, Gil Costa V, González S, Marín M, Oyarzún M. Distributed search based on self-indexed compressed text. Inf Process Manag. 2012;48(5):819–27.
[4] Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York/Toronto: Addison-Wesley; 2011.
[5] Brisaboa N, Fariña A, Ladra S, Navarro G. Implicit indexing of natural language text by reorganizing bytecodes. Inf Retr. 2012;15(6):527–57.
[6] Das A, Jain A. Indexing the world wide web: the journey so far. In: Next Generation Search Engines: Advanced Models for Information Retrieval. IGI Global; 2012. p. 1–28.
[7] Ding S, Suel T. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th ACM International Conference on Research and Development in Information Retrieval; 2011. p. 993–1002.
[8] Fariña A, Brisaboa N, Navarro G, Claude F, Places A, Rodríguez E. Word-based self-indexes for natural language text. ACM TOIS. 2012;30(1):article 1.
[9] Kane A, Tompa FW. Skewed partial bitvectors for list intersection. In: Proceedings of the 37th ACM International Conference on Research and Development in Information Retrieval; 2014. p. 263–72.
[10] Konow R, Navarro G, Clarke C, López-Ortíz A. Faster and smaller inverted indices with treaps. In: Proceedings of the 36th ACM International Conference on Research and Development in Information Retrieval; 2013. p. 193–202.
[11] Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Software: Practice and Experience; 2013, to appear. https://doi.org/10.1002/spe.2203.
[12] Moffat A, Culpepper JS. Hybrid bitvector index compression. In: Proceedings of the 12th Australasian Document Computing Symposium; 2007. p. 25–31.
[13] Navarro G. Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput Surv. 2014;46(4):article 52.
[14] Navarro G, Mäkinen V. Compressed full-text indexes. ACM Comput Surv. 2007;39(1):article 2.
[15] Persin M, Zobel J, Sacks-Davis R. Filtered document retrieval with frequency-sorted indexes. J Am Soc Inf Sci. 1996;47(10):749–64.
[16] Salton G. Automatic information organization and retrieval. New York: McGraw-Hill; 1968.
[17] Solomon D. Variable-length codes for data compression. London: Springer; 2007.
[18] Witten I, Moffat A, Bell T. Managing gigabytes. 2nd ed. New York: Van Nostrand Reinhold; 1999.
[19] Zobel J, Moffat A. Inverted files for text search engines. ACM Comput Surv. 2006;38(2):article 6.
[20] Zukowski M, Héman S, Nes N, Boncz PA. Super-scalar RAM-CPU cache compression. In: Proceedings of the 22nd IEEE International Conference on Data Engineering; 2006. p. 59–71.

Author information

Authors and Affiliations

Department of Computer Science, University of Chile, Santiago, Chile
Roberto Konow & Gonzalo Navarro

Corresponding author

Correspondence to Roberto Konow.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Cite this entry

Konow, R., Navarro, G. (2018). Text Index Compression. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_945

DOI: https://doi.org/10.1007/978-1-4614-8265-9_945
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
