Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Structures for Large Data Sets

  • Peiquan Jin
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_168



Bloom filter (Bloom 1970): A Bloom filter is a bit-vector data structure that provides a compact representation of a set of elements. It uses a group of hash functions to map each element of a data set S = {s1, s2, …, sm} to positions in a bit vector of n bits. Membership queries on the bit vector can return false positives but never false negatives.
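
The mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the construction from Bloom (1970): the class name `BloomFilter`, the parameter `k_hashes`, and the use of salted SHA-256 digests to play the role of the group of hash functions are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit vector of n bits and k hash functions."""

    def __init__(self, n_bits: int, k_hashes: int) -> None:
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray((n_bits + 7) // 8)  # n bits, all initially 0

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by salting SHA-256 with the
        # function index i. Practical filters usually use cheaper hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item: str) -> None:
        # Set the k bits selected by the hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True if all k bits are set; this may be a false positive.
        # False means the item was definitely never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(n_bits=1024, k_hashes=3)
for element in ["s1", "s2", "s3"]:
    bf.add(element)
```

After these insertions, `bf.might_contain("s1")` is guaranteed to be true, whereas a query for an element that was never added is usually, but not always, false.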

LSM tree (O’Neil et al. 1996): The LSM tree (log-structured merge-tree) is a data structure designed to provide low-cost indexing for files that experience a high rate of inserts and deletes. It cascades data over time from smaller, faster (but more expensive) stores to larger, slower (and less expensive) stores.
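
The cascading idea can be illustrated with a toy two-tier store. The sketch below is a deliberate simplification and not the rolling-merge algorithm of O’Neil et al. (1996): a small in-memory table stands in for the fast store, a list of sorted runs stands in for the larger store, and the names `TinyLSM`, `memtable_limit`, and `compact` are hypothetical.

```python
import bisect

_TOMBSTONE = object()  # sentinel that marks a deleted key


class TinyLSM:
    """Toy LSM-style store: writes hit a small table, then cascade to runs."""

    def __init__(self, memtable_limit: int = 4) -> None:
        self.memtable_limit = memtable_limit
        self.memtable = {}  # small, fast component
        self.runs = []      # larger component: (keys, values) runs, newest last

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key) -> None:
        # A delete is just another write; the tombstone hides older values.
        self.put(key, _TOMBSTONE)

    def get(self, key, default=None):
        if key in self.memtable:
            value = self.memtable[key]
            return default if value is _TOMBSTONE else value
        for keys, values in reversed(self.runs):  # newest run first
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return default if values[i] is _TOMBSTONE else values[i]
        return default

    def _flush(self) -> None:
        # Move the full memtable to the larger store as one sorted run.
        keys = sorted(self.memtable)
        self.runs.append((keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def compact(self) -> None:
        # Merge all runs into one; newer values win, tombstones are dropped.
        merged = {}
        for keys, values in self.runs:  # oldest first
            merged.update(zip(keys, values))
        keys = sorted(k for k, v in merged.items() if v is not _TOMBSTONE)
        self.runs = [(keys, [merged[k] for k in keys])] if keys else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable is full: flushed as the first run
db.put("a", 10)
db.delete("b")    # second flush; the newer run shadows the older one
```

Writes never modify an existing run in place, which is what keeps inserts and deletes cheap; the cost is deferred to the merge step.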

Skip list (Black 2014): A skip list is a randomized variant of an ordered linked list with additional, parallel lists. Parallel lists at higher levels skip geometrically more items. Searching begins at the highest level, to quickly get to the right part of the list, and then uses progressively lower-level lists. A new item is added by randomly selecting a level, then...
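
The top-down search can be sketched as follows. Because the description of insertion is cut off in this text, the splice step in `insert` follows the commonly used formulation in the style of Pugh (1990) and should be read as an assumption, as should the names `SkipList`, `max_level`, `p`, and `seed`.

```python
import random


class _Node:
    def __init__(self, key, level: int) -> None:
        self.key = key
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    """Illustrative skip list holding distinct, mutually comparable keys."""

    def __init__(self, max_level: int = 16, p: float = 0.5, seed=None) -> None:
        self.max_level = max_level
        self.p = p              # chance that a node also appears one level up
        self.level = 1          # number of levels currently in use
        self.head = _Node(None, max_level)
        self._rng = random.Random(seed)

    def _random_level(self) -> int:
        level = 1
        while level < self.max_level and self._rng.random() < self.p:
            level += 1
        return level

    def _find_predecessors(self, key):
        # Walk from the highest level down, recording the last node
        # before `key` on every level.
        update = [self.head] * self.max_level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def contains(self, key) -> bool:
        candidate = self._find_predecessors(key)[0].forward[0]
        return candidate is not None and candidate.key == key

    def insert(self, key) -> None:
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            return  # key already present
        level = self._random_level()
        self.level = max(self.level, level)
        node = _Node(key, level)
        for i in range(level):  # splice into each list up to the chosen level
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def __iter__(self):
        node = self.head.forward[0]  # level 0 holds every key in order
        while node is not None:
            yield node.key
            node = node.forward[0]


sl = SkipList(seed=42)
for key in [5, 1, 9, 3, 7]:
    sl.insert(key)
```

Whatever levels the random generator picks, the bottom list always contains every key in sorted order; the higher levels only affect how quickly a search reaches the right position.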


References

  1. Bender M, Kuszmaul B (2013) Data structures and algorithms for big databases. In: 7th extremely large databases conference, workshop, and tutorials (XLDB), Stanford University, California
  2. Black P (2014) Skip list. In: Pieterse V, Black P (eds) Dictionary of algorithms and data structures. https://www.nist.gov/dads/HTML/skiplist.html
  3. Bloom B (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7):422–426
  4. Boldi P, Rosa M, Vigna S (2011) HyperANF: approximating the neighbourhood function of very large graphs on a budget. In: Srinivasan S et al (eds) Proceedings of the 20th international conference on World Wide Web, March 2011, Hyderabad, India, p 625–634
  5. Bonomi F, Mitzenmacher M, Panigrahy R, Singh S, Varghese G (2006) An improved construction for counting Bloom filters. In: Azar Y, Erlebach T (eds) Algorithms – ESA 2006, the 14th annual European symposium on algorithms, September 2006, LNCS 4168, Zurich, Switzerland, p 684–695
  6. Broder A, Charikar M, Frieze A, Mitzenmacher M (1998) Min-wise independent permutations. In: Vitter J (ed) Proceedings of the thirtieth annual ACM symposium on the theory of computing, May 1998, Dallas, Texas, p 327–336
  7. Chen K, Jin P, Yue L (2014) A novel page replacement algorithm for the hybrid memory architecture involving PCM and DRAM. In: Hsu C et al (eds) Proceedings of the 11th IFIP WG 10.3 international conference on network and parallel computing, September 2014, Ilan, Taiwan, p 108–119
  8. Cooper B, Ramakrishnan R, Srivastava U, Silberstein A, Bohannon P, Jacobsen H, Puz N, Weaver D, Yerneni R (2008) PNUTS: Yahoo!’s hosted data serving platform. Proc VLDB Endowment 1(2):1277–1288
  9. Cormen T, Leiserson C, Rivest R, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Boston, pp 253–280
  10. Das A, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: Williamson C et al (eds) Proceedings of the 16th international conference on World Wide Web, May 2007, Banff, Alberta, p 271–280
  11. Graefe G (2004) Write-optimized B-trees. In: Nascimento M, Özsu M, Kossmann D et al (eds) Proceedings of the thirtieth international conference on very large data bases, Toronto, Canada, p 672–683
  12. Henzinger M (2006) Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Efthimiadis E et al (eds) Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, August 2006, Seattle, Washington, p 284–291
  13. Jin P, Yang P, Yue L (2015) Optimizing B+-tree for hybrid storage systems. Distrib Parallel Databases 33(3):449–475
  14. Jin P, Yang C, Jensen C, Yang P, Yue L (2016) Read/write-optimized tree indexing for solid-state drives. VLDB J 25(5):695–717
  15. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Leighton F et al (eds) Proceedings of the twenty-ninth annual ACM symposium on the theory of computing, May 1997, El Paso, Texas, p 654–663
  16. Knuth D (1998) The art of computer programming. 3: sorting and searching, 2nd edn. Addison-Wesley, New York, pp 513–558
  17. Li X, Da Z, Meng X (2008) A new dynamic hash index for flash-based storage. In: Jia Y et al (eds) Proceedings of the ninth international conference on web-age information management, July 2008, Zhangjiajie, China, p 93–98
  18. Li Y, He B, Yang J, Luo Q, Yi K (2010) Tree indexing on solid state drives. Proc VLDB Endowment 3(1):1195–1206
  19. Li L, Jin P, Yang C, Wan S, Yue L (2016) XB+-tree: a novel index for PCM/DRAM-based hybrid memory. In: Cheema M et al (eds) Databases theory and applications – proceedings of the 27th Australasian database conference, September 2016, LNCS 9877, Sydney, Australia, p 357–368
  20. Liu L, Özsu M (2009) Encyclopedia of database systems. Springer, New York
  21. Maggs B, Sitaraman R (2015) Algorithmic nuggets in content delivery. SIGCOMM Comput Commun Rev 45(3):52–66
  22. O’Neil P, Cheng E, Gawlick D, O’Neil E (1996) The log-structured merge-tree (LSM-tree). Acta Informatica 33(4):351–385
  23. Pournaras E, Warnier M, Brazier F (2013) A generic and adaptive aggregation service for large-scale decentralized networks. Complex Adapt Syst Model 1:19
  24. Pugh W (1990) Skip lists: a probabilistic alternative to balanced trees. Commun ACM 33(6):668
  25. Roh H, Kim W, Kim S, Park S (2009) A B-tree index extension to enhance response time and the life cycle of flash memory. Inf Sci 179(18):3136–3161
  26. Wang L, Wang H (2010) A new self-adaptive extendible hash index for flash-based DBMS. In: Hao Y et al (eds) Proceedings of the 2010 IEEE international conference on information and automation, June 2010, Harbin, China, p 2519–2524
  27. Wang J, Liu W, Kumar S, Chang S (2016) Learning to hash for indexing big data – a survey. Proc IEEE 104(1):34–57
  28. Yang C, Lee K, Kim M, Lee Y (2009) An efficient dynamic hash index structure for NAND flash memory. IEICE Trans Fundam Electron Commun Comput Sci 92(7):1716–1719
  29. Yang C, Jin P, Yue L, Zhang D (2016) Self-adaptive linear hashing for solid state drives. In: Hsu M et al (eds) Proceedings of the 32nd IEEE international conference on data engineering, May 2016, Helsinki, Finland, p 433–444
  30. Yoo M, Kim B, Lee D (2012) Hybrid hash index for NAND flash memory-based storage systems. In: Lee S et al (eds) Proceedings of the 6th international conference on ubiquitous information management and communication, February 2012, Kuala Lumpur, Malaysia, p 55:1–55:5
  31. Zeinalipour-Yazti D, Lin S, Kalogeraki V, Gunopulos D, Najjar W (2005) MicroHash: an efficient index structure for flash-based sensor devices. In: Gibson G (ed) Proceedings of the FAST ’05 conference on file and storage technologies, December 2005, San Francisco, California, p 1–14

Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, China

Section editors and affiliations

  • Bingsheng He
  • Behrooz Parhami (1)
  1. Dept. of Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, USA