Effective and efficient data sampling using bitmap indices

Abstract

With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing sizes make these datasets extremely hard to store, manage, disseminate, analyze, and visualize, especially because the memory capacity of parallel machines, memory access speeds, and disk bandwidths are not increasing at the same rate as computing power. Sampling can be an effective technique for addressing these challenges, but it is essential that dataset characteristics are preserved and that the loss of accuracy stays within acceptable levels. In this paper, we address the data explosion problem by developing a novel sampling approach and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that, to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, bitmap indexing can not only support subsetting over scientific datasets effectively, but can also help create samples that preserve both the value and the spatial distributions of the data. We have developed algorithms that use bitmap indices to sample datasets. We also show that a small amount of additional metadata stored with the bitvectors can help assess the loss of accuracy at a particular subsampling level. Other properties of this approach include: (1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and (2) no data reorganization is needed once the bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications, and demonstrated the effectiveness of our approach.
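To make the idea of sampling from a bitmap index concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes an uncompressed, binned bitmap index held as plain Python lists, and it assumes a simple scheme in which each bitvector is cut into fixed-length position segments and the same fraction of set bits is drawn from every segment. The function names `build_bitmap_index` and `sample_from_index` and the parameters `bin_edges`, `fraction`, and `stride` are hypothetical and chosen only for illustration.

```python
import random


def build_bitmap_index(values, bin_edges):
    """Build one bitvector (a list of 0/1) per value bin.

    Bin i covers [bin_edges[i], bin_edges[i + 1]); the last bin also
    includes its upper edge. Values outside all bins get no bit set.
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [[0] * len(values) for _ in range(n_bins)]
    for pos, v in enumerate(values):
        for b in range(n_bins):
            lo, hi = bin_edges[b], bin_edges[b + 1]
            is_last = b == n_bins - 1
            if lo <= v < hi or (is_last and v == hi):
                bitvectors[b][pos] = 1
                break
    return bitvectors


def sample_from_index(bitvectors, fraction, stride, seed=0):
    """Pick roughly `fraction` of the set bits of every bitvector.

    Assumed scheme (illustrative only): each bitvector is split into
    segments of `stride` consecutive positions and sampled segment by
    segment. Sampling per bin keeps the value histogram roughly intact;
    sampling per segment spreads the picks across the position range.
    A non-empty segment always contributes at least one position, so
    sparse bins are slightly over-represented.
    """
    rng = random.Random(seed)
    n = len(bitvectors[0]) if bitvectors else 0
    chosen = []
    for bits in bitvectors:
        for start in range(0, n, stride):
            end = min(start + stride, n)
            hits = [p for p in range(start, end) if bits[p]]
            if not hits:
                continue
            k = max(1, round(len(hits) * fraction))
            chosen.extend(rng.sample(hits, k))
    return sorted(chosen)


# Toy data: 12 values laid out in position order, two value bins.
values = [0.1, 0.9, 0.4, 0.5, 0.2, 0.8, 0.3, 0.7, 0.6, 0.0, 1.0, 0.45]
index = build_bitmap_index(values, [0.0, 0.5, 1.0])
positions = sample_from_index(index, fraction=0.5, stride=6)
sample = [values[p] for p in positions]
```

Note that the sampler reads only the bitvectors. The original values are touched only when the selected positions are fetched at the end, which is consistent with the abstract's point that no data reorganization is needed once the index exists. A production index would normally use compressed bitvectors rather than Python lists.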

Notes

  1. http://en.wikipedia.org/wiki/Histogram

  2. http://en.wikipedia.org/wiki/Q-Q_plot

  3. http://www.sqlite.org

  4. http://en.wikipedia.org/wiki/Radix_sort

Acknowledgments

This work was supported by the Department of Energy (DOE) Office of Science (OSC) Advanced Scientific Computing Research (ASCR) and NSF award IIS-0916196 to the Ohio State University.

Author information

Corresponding author

Correspondence to Yu Su.

About this article

Cite this article

Su, Y., Agrawal, G., Woodring, J. et al. Effective and efficient data sampling using bitmap indices. Cluster Comput 17, 1081–1100 (2014). https://doi.org/10.1007/s10586-014-0360-5

  • DOI: https://doi.org/10.1007/s10586-014-0360-5

Keywords

  • Big data
  • Bitmap indexing
  • Data sampling
  • Multi-resolution
  • Parallel processing