Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Text Index Compression

  • Roberto Konow
  • Gonzalo Navarro
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_945

Synonyms

Inverted index/list/file compression

Definition

Text index compression is the problem of designing a reduced-space data structure that provides fast search on a text collection, seen as a set of documents. In information retrieval (IR), a search query usually consists of one or more words or phrases. Full-text searching aims to retrieve the documents where all or some of the query words or phrases appear. Relevance ranking aims to retrieve a ranked list of the documents that are most relevant to the query, according to some criterion. As inverted indexes (sometimes also called inverted lists or inverted files) are by far the most popular type of text index in IR, this entry focuses on techniques to compress inverted indexes, which differ depending on whether the index is oriented to full-text searching or to relevance ranking.
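As an illustration of the definition above, the following minimal sketch builds a tiny inverted index and compresses each list of document identifiers. It assumes one widely used scheme, storing the gaps between consecutive identifiers with a variable-byte code; this is not necessarily the technique the entry goes on to describe. All function names are hypothetical, and the index supports only conjunctive full-text queries, with no relevance ranking.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the increasing list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    return dict(index)


def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits per byte, low-order bits first.

    The high bit is set on the last byte of each number (an assumed convention).
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of the current number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store the first id, then the differences between consecutive ids."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


def search_all(compressed_index, words):
    """Full-text AND query: ids of the documents that contain every query word."""
    result = None
    for word in words:
        ids = set(decompress_postings(compressed_index.get(word, b"")))
        result = ids if result is None else result & ids
    return sorted(result or [])


docs = [
    "compressed text index",
    "inverted index compression",
    "text search on an inverted index",
]
index = build_inverted_index(docs)
compressed_index = {w: compress_postings(ids) for w, ids in index.items()}
matches = search_all(compressed_index, ["inverted", "index"])
```

Because the identifiers in a list are increasing, the gaps are small for frequent words and most of them fit in a single byte, which is where the space saving comes from in this sketch.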

Historical Background

Text indexing techniques have been known at least since the 1960s (see, e.g., the book by Salton [16], one of the pioneers in the area)....


Recommended Reading

  1. Anh V, Moffat A. Simplified similarity scoring using term ranks. In: Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval; 2005. p. 226–33.
  2. Anh V, Moffat A. Improved word-aligned binary compression for text indexing. IEEE Trans Knowl Data Eng. 2006;18(6):857–61.
  3. Arroyuelo D, Gil Costa V, González S, Marín M, Oyarzún M. Distributed search based on self-indexed compressed text. Inf Process Manag. 2012;48(5):819–27.
  4. Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York/Toronto: Addison-Wesley; 2011.
  5. Brisaboa N, Fariña A, Ladra S, Navarro G. Implicit indexing of natural language text by reorganizing bytecodes. Inf Retr. 2012;15(6):527–57.
  6. Das A, Jain A. Indexing the world wide web: the journey so far. In: Next Generation Search Engines: Advanced Models for Information Retrieval. IGI Global; 2012. p. 1–28.
  7. Ding S, Suel T. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th ACM International Conference on Research and Development in Information Retrieval; 2011. p. 993–1002.
  8. Fariña A, Brisaboa N, Navarro G, Claude F, Places A, Rodríguez E. Word-based self-indexes for natural language text. ACM TOIS. 2012;30(1):article 1.
  9. Kane A, Tompa FW. Skewed partial bitvectors for list intersection. In: Proceedings of the 37th ACM International Conference on Research and Development in Information Retrieval; 2014. p. 263–72.
  10. Konow R, Navarro G, Clarke C, López-Ortíz A. Faster and smaller inverted indices with treaps. In: Proceedings of the 36th ACM International Conference on Research and Development in Information Retrieval; 2013. p. 193–202.
  11. Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Software: Practice and Experience; 2013, to appear. https://doi.org/10.1002/spe.2203.
  12. Moffat A, Culpepper JS. Hybrid bitvector index compression. In: Proceedings of the 12th Australasian Document Computing Symposium; 2007. p. 25–31.
  13. Navarro G. Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput Surv. 2014;46(4):article 52.
  14. Navarro G, Mäkinen V. Compressed full-text indexes. ACM Comput Surv. 2007;39(1):article 2.
  15. Persin M, Zobel J, Sacks-Davis R. Filtered document retrieval with frequency-sorted indexes. J Am Soc Inf Sci. 1996;47(10):749–64.
  16. Salton G. Automatic information organization and retrieval. New York: McGraw-Hill; 1968.
  17. Solomon D. Variable-length codes for data compression. London: Springer; 2007.
  18. Witten I, Moffat A, Bell T. Managing gigabytes. 2nd ed. New York: Van Nostrand Reinhold; 1999.
  19. Zobel J, Moffat A. Inverted files for text search engines. ACM Comput Surv. 2006;38(2):article 6.
  20. Zukowski M, Héman S, Nes N, Boncz PA. Super-scalar RAM-CPU cache compression. In: Proceedings of the 22nd IEEE International Conference on Data Engineering; 2006. p. 59–71.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, University of Chile, Santiago, Chile