HEDC++: An Extended Histogram Estimator for Data in the Cloud

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

With the increasing popularity of cloud-based data management, improving query performance in the cloud has become an urgent problem. Summaries of data distribution and statistical information have long been used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could also support query optimization and efficient utilization of computing resources in the cloud: they provide reference information for generating optimal query plans, and they supply basic statistics that help guarantee load balance during query processing. Since constructing an exact histogram on massive data is too expensive, building an approximate histogram is a more feasible solution. This problem, however, is challenging in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC++, an extended histogram estimator for data in the cloud, which provides efficient approximation approaches for both equi-width and equi-depth histograms. We design the histogram estimation workflow based on an extended MapReduce framework, and propose novel sampling mechanisms that balance sampling efficiency against estimation accuracy. We experimentally validate our techniques on Hadoop, and the results demonstrate that HEDC++ can provide promising histogram estimates for massive data in the cloud.
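To make the two histogram types concrete, the following is a minimal in-memory sketch of estimating them from a sample. It assumes simple uniform random sampling on a single machine, which is not the sampling mechanism or the MapReduce workflow of HEDC++; the function names and parameters are hypothetical and chosen only for illustration.

```python
import random


def equi_width_estimate(sample, lo, hi, num_buckets, population_size):
    """Estimate equi-width bucket counts for the full data from a sample.

    Assumes every sampled value lies in [lo, hi].
    """
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in sample:
        # Clamp so that v == hi falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    # Scale sample counts up to the size of the full data set.
    scale = population_size / len(sample)
    return [c * scale for c in counts]


def equi_depth_boundaries(sample, num_buckets):
    """Estimate equi-depth bucket boundaries as sample quantiles."""
    ordered = sorted(sample)
    n = len(ordered)
    # The i-th boundary is the value at rank i*n/num_buckets in the sample.
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


if __name__ == "__main__":
    random.seed(0)
    data = [random.random() for _ in range(100_000)]
    sample = random.sample(data, 2_000)  # 2% uniform sample (assumption)
    print(equi_width_estimate(sample, 0.0, 1.0, 10, len(data)))
    print(equi_depth_boundaries(sample, 4))
```

In an equi-width histogram the bucket ranges are fixed and the counts are estimated; in an equi-depth histogram the counts are fixed (roughly equal) and the boundaries are estimated. A larger sample generally tightens both estimates at the cost of reading more data.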

Additional information

This research was partially supported by the National Natural Science Foundation of China under Grant Nos. 61070055, 91024032, 91124001, the Fundamental Research Funds for the Central Universities of China, the Research Funds of Renmin University of China under Grant No. 11XNL010, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010701, 2013AA013204.

About this article

Cite this article

Shi, YJ., Meng, XF., Wang, F. et al. HEDC++: An Extended Histogram Estimator for Data in the Cloud. J. Comput. Sci. Technol. 28, 973–988 (2013). https://doi.org/10.1007/s11390-013-1392-7

  • DOI: https://doi.org/10.1007/s11390-013-1392-7
