Synonyms
Inverted index/list/file compression
Definition
Text index compression is the problem of designing a reduced-space data structure that provides fast search on a text collection, seen as a set of documents. In information retrieval (IR), a search query usually consists of one or more words or phrases. Full-text searching aims to retrieve the documents where all or some of the query words/phrases appear. Relevance ranking aims to retrieve a ranked list of the documents that are most relevant to the query, according to some criterion. As inverted indexes (sometimes also called inverted lists or inverted files) are by far the most popular type of text index in IR, this entry focuses on different techniques to compress inverted indexes, depending on whether they are oriented to full-text searching or to relevance ranking.
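To make the definition concrete, the following minimal sketch builds an inverted index over a tiny collection, compresses each list of document identifiers, and answers a full-text query that requires all words to appear. It assumes one common combination, gap (difference) encoding followed by variable-byte coding, chosen here purely for illustration; the text above does not single out this scheme, and all function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the increasing list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].append(doc_id)  # ids are appended in increasing order
    return dict(index)


def vbyte_encode(numbers):
    """Encode non-negative integers with 7 data bits per byte, low bits first.

    The high bit of a byte is set when more bytes of the same number follow.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


def compress_postings(postings):
    """Store the first id, then the gaps between consecutive ids (non-empty list)."""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document ids by accumulating the decoded gaps."""
    postings, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        postings.append(total)
    return postings


def conjunctive_query(compressed_index, words):
    """Full-text AND query: ids of the documents that contain every query word."""
    lists = [set(decompress_postings(compressed_index.get(w, b""))) for w in words]
    return sorted(set.intersection(*lists)) if lists else []


docs = ["compressed text index", "inverted index compression", "text retrieval"]
index = build_inverted_index(docs)
compressed = {word: compress_postings(ids) for word, ids in index.items()}
result = conjunctive_query(compressed, ["text", "index"])
```

Gap encoding pays off because the gaps between consecutive identifiers in a long list are small numbers, and variable-byte coding spends fewer bytes on small numbers. A ranked-retrieval index would additionally store per-document weights such as term frequencies, which this sketch omits.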
Historical Background
Text indexing techniques have been known at least since the 1960s (see, e.g., the book by Salton [16], one of the pioneers in the area)....
Recommended Reading
Anh V, Moffat A. Simplified similarity scoring using term ranks. In: Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval; 2005. p. 226–33.
Anh V, Moffat A. Improved word-aligned binary compression for text indexing. IEEE Trans Knowl Data Eng. 2006;18(6):857–61.
Arroyuelo D, Gil Costa V, González S, Marín M, Oyarzún M. Distributed search based on self-indexed compressed text. Inf Process Manag. 2012;48(5):819–27.
Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York/Toronto: Addison-Wesley; 2011.
Brisaboa N, Fariña A, Ladra S, Navarro G. Implicit indexing of natural language text by reorganizing bytecodes. Inf Retr. 2012;15(6):527–57.
Das A, Jain A. Indexing the world wide web: the journey so far. In: Next Generation Search Engines: Advanced Models for Information Retrieval. IGI Global; 2012. p. 1–28.
Ding S, Suel T. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th ACM International Conference on Research and Development in Information Retrieval; 2011. p. 993–1002.
Fariña A, Brisaboa N, Navarro G, Claude F, Places A, Rodríguez E. Word-based self-indexes for natural language text. ACM TOIS. 2012;30(1):article 1.
Kane A, Tompa FW. Skewed partial bitvectors for list intersection. In: Proceedings of the 37th ACM International Conference on Research and Development in Information Retrieval; 2014. p. 263–72.
Konow R, Navarro G, Clarke C, López-Ortiz A. Faster and smaller inverted indices with treaps. In: Proceedings of the 36th ACM International Conference on Research and Development in Information Retrieval; 2013. p. 193–202.
Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Softw Pract Exp. 2013; to appear. https://doi.org/10.1002/spe.2203.
Moffat A, Culpepper JS. Hybrid bitvector index compression. In: Proceedings of the 12th Australasian Document Computing Symposium; 2007. p. 25–31.
Navarro G. Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput Surv. 2014;46(4):article 52.
Navarro G, Mäkinen V. Compressed full-text indexes. ACM Comput Surv. 2007;39(1):article 2.
Persin M, Zobel J, Sacks-Davis R. Filtered document retrieval with frequency-sorted indexes. J Am Soc Inf Sci. 1996;47(10):749–64.
Salton G. Automatic information organization and retrieval. New York: McGraw-Hill; 1968.
Solomon D. Variable-length codes for data compression. London: Springer; 2007.
Witten I, Moffat A, Bell T. Managing gigabytes. 2nd ed. New York: Van Nostrand Reinhold; 1999.
Zobel J, Moffat A. Inverted files for text search engines. ACM Comput Surv. 2006;38(2):article 6.
Zukowski M, Héman S, Nes N, Boncz PA. Super-scalar RAM-CPU cache compression. In: Proceedings of the 22nd IEEE International Conference on Data Engineering; 2006. p. 59–71.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Konow, R., Navarro, G. (2018). Text Index Compression. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_945
DOI: https://doi.org/10.1007/978-1-4614-8265-9_945
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering