PF-OLA: a high-performance framework for parallel online aggregation

Qin, Chengjie; Rusu, Florin

doi:10.1007/s10619-013-7132-8

Published: 09 August 2013

Distributed and Parallel Databases, Volume 32, pages 337–375 (2014)

Abstract

Online aggregation provides estimates of the final result of a computation while the computation is still running. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for interactive data exploration of the largest datasets.

In this paper we introduce the first framework for parallel online aggregation in which estimation incurs virtually no overhead on top of the actual execution. We define a generic interface for expressing any estimation model that completely abstracts away the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution, without incurring overhead, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size.
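To make the idea of online aggregation concrete, the following is a minimal Python sketch for a SUM query. It is not the estimator or the interface introduced in the paper: it uses the textbook estimator for simple random sampling without replacement, and the names `SumEstimator`, `accumulate`, `merge`, and `estimate` are hypothetical. It assumes that the table size is known and that the rows processed so far, across all workers, behave like a simple random sample of the table (for example, because the data is randomly partitioned and randomly ordered).

```python
import math
import random
from dataclasses import dataclass

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval


@dataclass
class SumEstimator:
    """Running state for estimating SUM(value) over a table of known size.

    Hypothetical illustration only; not the estimator from the paper.
    """
    total_rows: int   # N: rows in the full table (assumed known)
    seen: int = 0     # n: rows processed so far
    s1: float = 0.0   # running sum of values
    s2: float = 0.0   # running sum of squared values

    def accumulate(self, value: float) -> None:
        """Fold one processed row into the local state."""
        self.seen += 1
        self.s1 += value
        self.s2 += value * value

    def merge(self, other: "SumEstimator") -> None:
        """Combine the partial state of another worker into this one."""
        self.seen += other.seen
        self.s1 += other.s1
        self.s2 += other.s2

    def estimate(self):
        """Return (estimate, lower, upper) for the final SUM.

        Assumes the rows seen so far form a simple random sample
        without replacement of the whole table.
        """
        n, big_n = self.seen, self.total_rows
        if n < 2:
            raise ValueError("need at least two rows to estimate variance")
        mean = self.s1 / n
        # Sample variance; clamp tiny negative values caused by rounding.
        var = max(0.0, (self.s2 - n * mean * mean) / (n - 1))
        # Finite population correction: the error vanishes when n == N.
        fpc = (big_n - n) / big_n
        std_err = big_n * math.sqrt(var / n * fpc)
        est = big_n * mean
        return est, est - Z_95 * std_err, est + Z_95 * std_err


def run_demo(seed: int = 0, workers: int = 4, fraction: float = 0.05):
    """Simulate workers that each scan a prefix of a shuffled partition."""
    rng = random.Random(seed)
    data = [rng.uniform(0.0, 100.0) for _ in range(100_000)]
    rng.shuffle(data)
    partitions = [data[i::workers] for i in range(workers)]

    merged = SumEstimator(total_rows=len(data))
    for part in partitions:
        local = SumEstimator(total_rows=len(data))
        for value in part[: int(len(part) * fraction)]:
            local.accumulate(value)
        merged.merge(local)  # combine partial states into one estimate
    return merged.estimate(), sum(data)


if __name__ == "__main__":
    (est, low, high), exact = run_demo()
    print(f"estimate={est:.1f} bounds=[{low:.1f}, {high:.1f}] exact={exact:.1f}")
```

The sketch keeps only three numbers per worker, so partial states are cheap to merge. Once every row has been processed, the finite population correction drives the interval width to zero and the estimate equals the exact result.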

Acknowledgements

This work was supported in part by a gift from LogicBlox.

Author information

Authors and Affiliations

University of California, Merced, 5200 N Lake Road, Merced, CA, 95343, USA
Chengjie Qin & Florin Rusu

Corresponding author

Correspondence to Florin Rusu.

Additional information

Communicated by Feifei Li and Suman Nath.

About this article

Cite this article

Qin, C., Rusu, F. PF-OLA: a high-performance framework for parallel online aggregation. Distrib Parallel Databases 32, 337–375 (2014). https://doi.org/10.1007/s10619-013-7132-8

Issue Date: September 2014
