Cluster Computing, Volume 17, Issue 4, pp 1081–1100

Effective and efficient data sampling using bitmap indices

  • Yu Su
  • Gagan Agrawal
  • Jonathan Woodring
  • Kary Myers
  • Joanne Wendelberger
  • James Ahrens

Abstract

With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing dataset sizes make it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially because the memory capacity of parallel machines, memory access speeds, and disk bandwidths are not increasing at the same rate as computing power. Sampling can be an effective technique to address these challenges, but it is extremely important to ensure that dataset characteristics are preserved and that the loss of accuracy stays within acceptable levels. In this paper, we address the data explosion problem by developing a novel sampling approach and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that, to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, bitmap indexing can not only effectively support subsetting over scientific datasets, but can also help create samples that preserve both the value and spatial distributions of those datasets. We have developed algorithms for using bitmap indices to sample datasets. We have also shown how a small amount of additional metadata stored with the bitvectors can help assess the loss of accuracy at a particular subsampling level. Other properties of this approach include: (1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and (2) no data reorganization is needed once the bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications, and demonstrated the effectiveness of our approach.
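
As a rough illustration of the general idea only, and not the algorithm developed in the paper, the sketch below builds one bitvector per value bin and then draws the same fraction of positions from every bin, so the sample keeps roughly the same value histogram as the full data. The function names (`build_bitmap_index`, `set_positions`, `sample_from_bitmaps`), the use of Python integers as uncompressed bitvectors, and the fixed bin edges are all assumptions made for this example. Spatial stratification and the accuracy metadata mentioned in the abstract are not modeled here.

```python
import bisect
import random


def build_bitmap_index(values, bin_edges):
    """Build one bitvector per value bin (hypothetical helper).

    Bit `pos` of bitvector `b` is set when values[pos] falls in bin `b`.
    Bins are [edge[i], edge[i+1]), except the last bin, which also
    includes its upper edge. Bitvectors are plain Python ints here;
    a real index would typically use a compressed encoding.
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [0] * n_bins
    for pos, v in enumerate(values):
        if v < bin_edges[0] or v > bin_edges[-1]:
            raise ValueError(f"value {v!r} is outside the bin range")
        b = min(bisect.bisect_right(bin_edges, v) - 1, n_bins - 1)
        bitvectors[b] |= 1 << pos
    return bitvectors


def set_positions(bitvector):
    """Return the positions of the set bits, in increasing order."""
    positions = []
    pos = 0
    while bitvector:
        if bitvector & 1:
            positions.append(pos)
        bitvector >>= 1
        pos += 1
    return positions


def sample_from_bitmaps(bitvectors, fraction, seed=0):
    """Draw `fraction` of the positions from each bin's bitvector.

    Sampling every bin at the same rate keeps the per-bin proportions
    of the sample close to those of the full dataset. Only the index
    is consulted; the data itself is not reorganized.
    """
    rng = random.Random(seed)
    sample = []
    for bv in bitvectors:
        positions = set_positions(bv)
        k = round(len(positions) * fraction)
        sample.extend(rng.sample(positions, k))
    return sorted(sample)
```

For example, calling `build_bitmap_index(values, [0, 2, 5, 10])` and then `sample_from_bitmaps(bitvectors, 0.1)` returns a sorted list of array positions to read. A value-based subsetting predicate could be approximated in this sketch by passing only the bitvectors of the bins of interest.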

Keywords

  • Big data
  • Bitmap indexing
  • Data sampling
  • Multi-resolution
  • Parallel processing

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Yu Su (1)
  • Gagan Agrawal (1)
  • Jonathan Woodring (2)
  • Kary Myers (2)
  • Joanne Wendelberger (2)
  • James Ahrens (2)
  1. Computer Science and Engineering, The Ohio State University, Columbus, USA
  2. Los Alamos National Laboratory, Los Alamos, USA
