Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Inverted Index Compression

  • Giulio Ermanno Pibiri
  • Rossano Venturini
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_52

Definitions

The inverted index is the data structure at the core of today's large-scale search engines, social networks, and storage architectures. Given a collection of documents, consider, for each distinct term t appearing in the collection, the integer sequence that lists in sorted order the identifiers of all the documents (docIDs in the following) in which the term appears. This sequence is called the inverted list or posting list of the term t. The inverted index is the collection of all such lists.
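The definition above can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercasing, and docIDs assigned by position in the input list; none of these choices come from the entry, and the function name build_inverted_index is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each distinct term to the sorted list of docIDs containing it.

    Assumptions (not from the entry): a docID is the document's position
    in `documents`, and terms are lowercased whitespace-separated tokens.
    """
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            # A set records each docID at most once per term.
            postings[term].add(doc_id)
    # Each posting list is stored in sorted docID order.
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
index = build_inverted_index(docs)
print(index["quick"])  # posting list of the term "quick"
```

Keeping each list sorted is what later allows queries to be answered by linear merging and, in a compressed index, what makes the lists amenable to encoding.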

This entry surveys the most important encoding algorithms developed for efficient inverted index compression and fast retrieval.

Overview

The inverted index owes its popularity to its efficient resolution of queries, each expressed as a set of terms {t1, …, tk} combined with a query operator. The simplest operators are Boolean AND and OR. For example, given an AND query, the index has to report all the docIDs of the documents containing all the terms {t1, …, tk}....
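As a rough sketch of how Boolean AND and OR queries can be resolved over uncompressed posting lists: AND corresponds to intersecting the sorted lists of the query terms, and OR to taking their union. The toy index and the function names intersect, and_query, and or_query are hypothetical, and processing the shortest lists first is a common heuristic rather than something stated in the entry.

```python
def intersect(a, b):
    """Intersect two sorted docID lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Return the docIDs of documents containing every query term."""
    # Shortest lists first, so intermediate results stay small (a heuristic).
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for posting_list in lists[1:]:
        result = intersect(result, posting_list)
        if not result:
            break  # no document can match any more
    return result


def or_query(index, terms):
    """Return the docIDs of documents containing at least one query term."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))


# Hypothetical toy index: term -> sorted posting list.
toy_index = {"the": [0, 1, 2], "quick": [0, 2], "dog": [1, 2], "lazy": [1]}
print(and_query(toy_index, ["quick", "dog"]))
print(or_query(toy_index, ["quick", "lazy"]))
```

A term missing from the index is treated as having an empty posting list, so an AND query containing it returns no results.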


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Pisa, Pisa, Italy