Sampling Estimators for Parallel Online Aggregation

  • Conference paper
Big Data (BNCOD 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7968)

Abstract

Online aggregation provides estimates of the final result of a computation while the computation is still running. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. When coupled with parallel processing, this allows for interactive exploration of the largest datasets. In this paper, we identify the main functional requirements of sampling-based parallel online aggregation: partial aggregation, parallel sampling, and estimation. We argue for overlapped online aggregation as the only scalable solution to combine computation and estimation. We analyze the properties of existing estimators and design a novel sampling-based estimator that is robust to node delay and failure. When executed over a massive 8 TB TPC-H instance, the proposed estimator provides accurate confidence bounds early in the execution, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size, and achieves linear scalability.
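
To make the idea of online aggregation concrete, the following is a minimal single-node sketch and not the estimator proposed in the paper. It assumes the data is scanned in a uniformly random order, so the first n items form a simple random sample without replacement, and it uses the textbook scaled-mean estimator for SUM with a normal-approximation confidence interval. The function name `online_sum_estimates` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist


def online_sum_estimates(values, confidence=0.95, report_every=1000, seed=0):
    """Yield (n_seen, estimate, lower, upper) for SUM(values) during a scan.

    Assumption: the scan order is a uniform random permutation, so the
    items seen so far are a simple random sample without replacement.
    """
    total_count = len(values)
    order = list(range(total_count))
    random.Random(seed).shuffle(order)  # simulate a random scan order
    # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for idx in order:
        x = values[idx]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 or n == total_count:
            sample_var = m2 / (n - 1) if n > 1 else 0.0
            # Finite population correction: uncertainty vanishes at n == N.
            fpc = max(1.0 - n / total_count, 0.0)
            std_err = total_count * math.sqrt(fpc * sample_var / n)
            estimate = total_count * mean
            yield n, estimate, estimate - z * std_err, estimate + z * std_err
```

A caller could consume the generator and stop as soon as the interval is narrow enough for its purposes, for example when `(upper - lower) / 2` falls below a chosen fraction of `estimate`. The parallel setting discussed in the abstract additionally requires combining partial aggregates across nodes and handling node delay and failure, which this sketch does not attempt.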



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Qin, C., Rusu, F. (2013). Sampling Estimators for Parallel Online Aggregation. In: Gottlob, G., Grasso, G., Olteanu, D., Schallhart, C. (eds) Big Data. BNCOD 2013. Lecture Notes in Computer Science, vol 7968. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39467-6_19

  • DOI: https://doi.org/10.1007/978-3-642-39467-6_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39466-9

  • Online ISBN: 978-3-642-39467-6

  • eBook Packages: Computer Science, Computer Science (R0)
