Fast Set Intersection through Run-Time Bitmap Construction over PForDelta-Compressed Indexes

  • Xiaocheng Zou
  • Sriram Lakshminarasimhan
  • David A. Boyuka II
  • Stephen Ranshous
  • Houjun Tang
  • Scott Klasky
  • Nagiza F. Samatova
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)

Abstract

Set intersection is a fundamental operation for evaluating conjunctive queries in scientific data analysis. The state-of-the-art approach to set intersection, compressed bitmap indexing, achieves high computational efficiency because bitwise operations are cheap; however, its overall efficiency is often nullified by the HPC I/O bottleneck, because compressed bitmap indexes typically have a heavy storage footprint. Conversely, the recently presented PForDelta-compressed index has been shown to be storage-lightweight, but it offers limited performance for set intersection. Thus, a more effective set intersection approach should be efficient in both computation and I/O.

We propose a fast set intersection approach that couples the storage-lightweight PForDelta indexing format with computationally efficient bitmaps through a specialized on-the-fly conversion. The resulting challenge is to make this conversion fast enough to preserve the performance gains of both PForDelta and bitmaps. To this end, we contribute two key enhancements to PForDelta, BitRun and BitExp, which improve bitmap conversion through bulk bit-setting and a more streamlined PForDelta decoding process, respectively. Our experimental results show that the integrated PForDelta-bitmap method speeds up conjunctive queries by up to 7.7x over the state-of-the-art approach, while using indexes that require 15%-60% less storage in most cases.
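
The general idea of converting a delta-encoded index into a bitmap at query time and then intersecting with bitwise AND can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain delta gaps in place of real PForDelta (no bit-packing or exception handling), represents a bitmap as a Python integer, and only loosely mirrors the bulk bit-setting idea by setting each run of consecutive IDs with one mask. All function names (`delta_encode`, `gaps_to_bitmap`, `bitmap_to_ids`, `intersect`) are hypothetical.

```python
from typing import List, Sequence


def delta_encode(sorted_ids: Sequence[int]) -> List[int]:
    """Encode strictly increasing IDs as gaps; the first gap is relative to -1."""
    gaps = []
    prev = -1
    for rid in sorted_ids:
        gaps.append(rid - prev)
        prev = rid
    return gaps


def gaps_to_bitmap(gaps: Sequence[int]) -> int:
    """Decode gaps directly into a bitmap (a Python int).

    Assumption for illustration: a stretch of gaps equal to 1 denotes a run
    of consecutive IDs, which is set with a single mask instead of bit by bit.
    """
    bitmap = 0
    pos = -1
    i = 0
    n = len(gaps)
    while i < n:
        pos += gaps[i]          # first ID of the next run
        start = pos
        i += 1
        while i < n and gaps[i] == 1:
            pos += 1            # extend the run of consecutive IDs
            i += 1
        run_len = pos - start + 1
        bitmap |= ((1 << run_len) - 1) << start  # set the whole run at once
    return bitmap


def bitmap_to_ids(bitmap: int) -> List[int]:
    """List the positions of the set bits in increasing order."""
    ids = []
    pos = 0
    while bitmap:
        if bitmap & 1:
            ids.append(pos)
        bitmap >>= 1
        pos += 1
    return ids


def intersect(gap_lists: Sequence[Sequence[int]]) -> List[int]:
    """Intersect several delta-encoded ID lists via run-time bitmaps and AND."""
    result = None
    for gaps in gap_lists:
        bm = gaps_to_bitmap(gaps)
        result = bm if result is None else result & bm
    return [] if result is None else bitmap_to_ids(result)


if __name__ == "__main__":
    a = delta_encode([1, 2, 3, 4, 10, 11, 20])
    b = delta_encode([2, 3, 4, 5, 11, 20, 21])
    print(intersect([a, b]))
```

In this sketch the stored form stays compact (small gaps), while the intersection itself runs on bitmaps, which is the trade-off the abstract describes; how BitRun and BitExp actually achieve this over PForDelta is detailed in the paper, not here.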

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Xiaocheng Zou (1, 2)
  • Sriram Lakshminarasimhan (3)
  • David A. Boyuka II (1, 2)
  • Stephen Ranshous (1, 2)
  • Houjun Tang (1, 2)
  • Scott Klasky (2)
  • Nagiza F. Samatova (1, 2)

  1. North Carolina State University, Raleigh, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
  3. IBM India Research Lab, Bangalore, India
