Text Index Compression

Synonyms

Inverted index/list/file compression

Definition

Text index compression is the problem of designing a reduced-space data structure that provides fast search on a text collection, seen as a set of documents. In information retrieval (IR), search queries usually consist of one or more words or phrases. Full-text searching aims to retrieve the documents in which all or some of the query words or phrases appear. Relevance ranking aims to retrieve a ranked list of the documents that are most relevant to the query, according to some criterion. Because inverted indexes (sometimes also called inverted lists or inverted files) are by far the most popular type of text index in IR, this entry focuses on techniques for compressing inverted indexes, which differ depending on whether the index is oriented to full-text searching or to relevance ranking.
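
As a rough illustration of the structure being compressed, the following minimal Python sketch builds an uncompressed inverted index over a toy collection and answers the two kinds of full-text query mentioned above (all query words, or some of them). The function names, the whitespace tokenization, and the sample documents are assumptions made for this example only; they are not taken from the entry, and no compression or relevance ranking is applied.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase whitespace tokenization is enough for a toy example.
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_all(index, words):
    """Documents containing every query word (conjunctive query)."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(w.lower(), []) for w in words), key=len)
    if not lists:
        return []
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result


def search_any(index, words):
    """Documents containing at least one query word (disjunctive query)."""
    return sorted({d for w in words for d in index.get(w.lower(), [])})


# Hypothetical three-document collection.
docs = [
    "compressed inverted index",
    "inverted files for text search",
    "text compression and text search",
]
index = build_inverted_index(docs)
print(search_all(index, ["text", "search"]))
print(search_any(index, ["compressed", "compression"]))
```

Each posting list here is a sorted list of document identifiers. Such lists are what an inverted index stores for every word, and therefore what a compressed representation has to encode while still supporting operations such as the intersection above.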

Historical Background

Text indexing techniques have been known at least since the 1960s (see, e.g., the book by Salton [16], one of the pioneers in the area)....

Recommended Reading

  1. Anh V, Moffat A. Simplified similarity scoring using term ranks. In: Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval; 2005. p. 226–33.

  2. Anh V, Moffat A. Improved word-aligned binary compression for text indexing. IEEE Trans Knowl Data Eng. 2006;18(6):857–61.

  3. Arroyuelo D, Gil Costa V, González S, Marín M, Oyarzún M. Distributed search based on self-indexed compressed text. Inf Process Manag. 2012;48(5):819–27.

  4. Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York/Toronto: Addison-Wesley; 2011.

  5. Brisaboa N, Fariña A, Ladra S, Navarro G. Implicit indexing of natural language text by reorganizing bytecodes. Inf Retr. 2012;15(6):527–57.

  6. Das A, Jain A. Indexing the world wide web: the journey so far. In: Next Generation Search Engines: Advanced Models for Information Retrieval. IGI Global; 2012. p. 1–28.

  7. Ding S, Suel T. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th ACM International Conference on Research and Development in Information Retrieval; 2011. p. 993–1002.

  8. Fariña A, Brisaboa N, Navarro G, Claude F, Places A, Rodríguez E. Word-based self-indexes for natural language text. ACM Trans Inf Syst. 2012;30(1):article 1.

  9. Kane A, Tompa FW. Skewed partial bitvectors for list intersection. In: Proceedings of the 37th ACM International Conference on Research and Development in Information Retrieval; 2014. p. 263–72.

  10. Konow R, Navarro G, Clarke C, López-Ortíz A. Faster and smaller inverted indices with treaps. In: Proceedings of the 36th ACM International Conference on Research and Development in Information Retrieval; 2013. p. 193–202.

  11. Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Softw Pract Exp. 2013, to appear. https://doi.org/10.1002/spe.2203.

  12. Moffat A, Culpepper JS. Hybrid bitvector index compression. In: Proceedings of the 12th Australasian Document Computing Symposium; 2007. p. 25–31.

  13. Navarro G. Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput Surv. 2014;46(4):article 52.

  14. Navarro G, Mäkinen V. Compressed full-text indexes. ACM Comput Surv. 2007;39(1):article 2.

  15. Persin M, Zobel J, Sacks-Davis R. Filtered document retrieval with frequency-sorted indexes. J Am Soc Inf Sci. 1996;47(10):749–64.

  16. Salton G. Automatic information organization and retrieval. New York: McGraw-Hill; 1968.

  17. Solomon D. Variable-length codes for data compression. London: Springer; 2007.

  18. Witten I, Moffat A, Bell T. Managing gigabytes. 2nd ed. New York: Van Nostrand Reinhold; 1999.

  19. Zobel J, Moffat A. Inverted files for text search engines. ACM Comput Surv. 2006;38(2):article 6.

  20. Zukowski M, Héman S, Nes N, Boncz PA. Super-scalar RAM-CPU cache compression. In: Proceedings of the 22nd IEEE International Conference on Data Engineering; 2006. p. 59–71.

Corresponding author

Correspondence to Roberto Konow.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Konow, R., Navarro, G. (2018). Text Index Compression. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_945
