Cluster Computing, Volume 22, Supplement 1, pp 2471–2484

An iterative sampling method for online aggregation

  • Zhiqiang Zhang
  • Jianghua Hu
  • Xiaoqin Xie
  • Haiwei Pan
  • Xiaoning Feng


Online aggregation (OLA) saves cost by returning approximate early answers of acceptable quality. Compared with exact results, approximate ones are cheaper to compute, especially on large-scale datasets. The user can terminate processing at any time, once satisfied with the quality of the result. The performance of OLA depends on the sampling approach and the estimation model, and realizing OLA efficiently in a large-scale distributed computing environment remains a challenging problem. In this paper, we consider the problem of providing OLA in a distributed computing environment and propose a Hadoop-based iterative sampling method for online aggregation. The precision desired by the user can be met with two sampling iterations. To avoid the effects of data bias, we propose a “layered sampling” method to ensure that the approximate aggregation result is statistically meaningful. The experimental results show that the “layered sampling” method accounts not only for time efficiency but also for the usage of Hadoop's computing and storage resources.
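The abstract does not spell out the sampling procedure, so the following is only a minimal single-machine sketch of the general idea of reaching a target precision in two sampling rounds, in the spirit of classical double sampling. It is not the paper's Hadoop-based algorithm. The function names (`required_sample_size`, `two_phase_mean`), the use of a normal-approximation confidence interval, and the 95% quantile `Z_95` are all assumptions made for illustration.

```python
import math
import random
import statistics

Z_95 = 1.96  # assumed: normal quantile for a two-sided 95% confidence interval


def required_sample_size(pilot, half_width, z=Z_95):
    """Smallest n with z * s / sqrt(n) <= half_width, using the pilot's std dev s.

    Never returns less than the pilot size, since those rows are already drawn.
    """
    s = statistics.stdev(pilot)
    return max(len(pilot), math.ceil((z * s / half_width) ** 2))


def two_phase_mean(data, pilot_size, half_width, seed=0, z=Z_95):
    """Estimate the mean of `data` in two sampling rounds (hypothetical helper).

    Round 1 draws a small pilot sample to estimate the variance.
    Round 2 extends the sample to the size the pilot suggests is needed.
    Returns (estimated mean, achieved CI half-width, rows sampled).
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)  # a random permutation gives sampling without replacement

    pilot = [data[i] for i in order[:pilot_size]]
    n = min(len(data), required_sample_size(pilot, half_width, z))

    sample = [data[i] for i in order[:n]]  # pilot rows are reused, then extended
    mean = statistics.fmean(sample)
    achieved = z * statistics.stdev(sample) / math.sqrt(n)
    return mean, achieved, n


if __name__ == "__main__":
    gen = random.Random(42)
    rows = [gen.gauss(100.0, 15.0) for _ in range(100_000)]
    estimate, half, used = two_phase_mean(rows, pilot_size=200, half_width=1.0)
    print(f"mean ~ {estimate:.2f} +/- {half:.2f} using {used} of {len(rows)} rows")
```

Because the second-round size is computed from the pilot's variance estimate, the achieved half-width is only approximately the requested one. A skewed dataset would additionally call for sampling within strata, which is the concern the “layered sampling” method is said to address.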


Keywords: Online aggregation · Iteration · Sampling · Query processing · Hadoop



This work is supported by the National Natural Science Foundation of China (Nos. 61672181, 61202090, 61272184), the Natural Science Foundation of Heilongjiang Province (Nos. LC2017029, F2016005), the Science and Technology Innovation Talents Special Fund of Harbin (Nos. 2016RAXXJ 036, 2015RQQXJ067), and the Opening Fund of the Key Laboratory of Machine Perception (Ministry of Education), Peking University (K-2016-02).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. School of Information, Zhejiang University of Finance & Economics, Hangzhou, China
  2. College of Computer Science and Technology, Harbin Engineering University, Harbin, China
