PF-OLA: a high-performance framework for parallel online aggregation

Distributed and Parallel Databases

Abstract

Online aggregation provides estimates of the final result of a computation while the computation is still running. The user can stop the computation as soon as the estimate is accurate enough, which typically happens early in the execution. This enables interactive exploration of even the largest datasets.
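As a rough illustration of the general idea only (not the estimator or the framework introduced in this paper), the following minimal single-threaded sketch estimates a SUM while scanning the data in random order and stops once a normal-approximation confidence interval is tight enough. The function name `online_sum` and the parameters `rel_error`, `z`, `check_every`, and `seed` are assumptions made for this example.

```python
import math
import random


def online_sum(data, rel_error=0.01, z=1.96, check_every=1000, seed=0):
    """Estimate sum(data) from a random-order scan, stopping early.

    Illustrative sketch: returns (estimate, half_width, items_scanned),
    where half_width is the half-width of an approximate confidence
    interval (z=1.96 corresponds to roughly 95% under a normal model).
    """
    order = list(data)
    random.Random(seed).shuffle(order)  # random order makes each prefix a sample
    total_items = len(order)

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for value in order:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)

        if n > 1 and n % check_every == 0:
            variance = m2 / (n - 1)
            # Finite population correction: the sample is drawn without replacement.
            fpc = 1.0 - n / total_items
            half_width = z * total_items * math.sqrt(variance / n * fpc)
            estimate = total_items * mean
            if half_width <= rel_error * abs(estimate):
                return estimate, half_width, n  # accurate enough: stop early

    # Every item was scanned, so the result is exact.
    return total_items * mean, 0.0, n
```

On data whose values are not extremely skewed, a call such as `online_sum(values)` typically returns after scanning a small fraction of the input. The accuracy check is only as good as the normal approximation and the random scan order it relies on.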

In this paper, we introduce the first framework for parallel online aggregation in which estimation incurs virtually no overhead on top of the actual execution. We define a generic interface that can express any estimation model while completely abstracting away the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution, without incurring overhead, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size.

Acknowledgements

This work was supported in part by a gift from LogicBlox.

Author information

Corresponding author

Correspondence to Florin Rusu.

Additional information

Communicated by Feifei Li and Suman Nath.

About this article

Cite this article

Qin, C., Rusu, F. PF-OLA: a high-performance framework for parallel online aggregation. Distrib Parallel Databases 32, 337–375 (2014). https://doi.org/10.1007/s10619-013-7132-8

  • DOI: https://doi.org/10.1007/s10619-013-7132-8
