Improving Online Aggregation Performance for Skewed Data Distribution

  • Yuxiang Wang
  • Junzhou Luo
  • Aibo Song
  • Jiahui Jin
  • Fang Dong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7238)

Abstract

Online aggregation is a commonly used technique for answering aggregation queries quickly with progressively refined approximate answers, each within an estimated confidence interval. However, we observe that low selectivity and an inappropriate sample proportion significantly degrade online aggregation performance when the data distribution is skewed. To overcome this problem, we propose a Partition-based Online Aggregation System called POAS. In POAS, partition and shuffle strategies enable efficient pruning of unneeded data, which reduces the side effect of low selectivity, and an appropriate sample proportion is approximated as closely as possible by drawing samples (tuples) from the relevant partitions with a dynamic sample size. Moreover, POAS applies statistical approaches to calculate estimates from the relevant partitions. We have implemented POAS and conducted an extensive experimental study on the TPC-H benchmark with a skewed data distribution. Our results demonstrate the efficiency and effectiveness of POAS.
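
The general idea of pruning irrelevant partitions and then estimating an aggregate from per-partition samples can be illustrated with a small sketch. The abstract does not specify how POAS partitions, shuffles, or sizes its samples, so the code below is not the authors' method: it is a generic stratified-sampling estimator of SUM with a normal-approximation confidence interval, and every name in it (`prune`, `estimate_sum`, the `min`/`max`/`rows` partition layout, the `sample_fraction` parameter) is a hypothetical choice made for illustration.

```python
import math
import random
from statistics import NormalDist


def prune(partitions, lo, hi):
    """Keep only partitions whose [min, max] key range overlaps the query range [lo, hi]."""
    return [p for p in partitions if p["max"] >= lo and p["min"] <= hi]


def estimate_sum(partitions, lo, hi, sample_fraction=0.1, confidence=0.95, seed=0):
    """Estimate SUM(value) over rows with lo <= key <= hi.

    Each partition is a dict {"min", "max", "rows"}, where rows is a list of
    (key, value) tuples. This layout is an assumption of the sketch.
    Returns (estimate, half_width) of an approximate confidence interval.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided normal quantile
    estimate = 0.0
    variance = 0.0
    for part in prune(partitions, lo, hi):
        rows = part["rows"]
        size = len(rows)
        if size == 0:
            continue
        # Per-partition sample size; a real system could adapt this at run time.
        n = min(size, max(1, math.ceil(sample_fraction * size)))
        sample = rng.sample(rows, n)  # simple random sample without replacement
        # Rows that fail the predicate contribute 0 to the sum.
        ys = [v if lo <= k <= hi else 0.0 for k, v in sample]
        mean = sum(ys) / n
        estimate += size * mean  # scale the sample mean up to the partition
        if n > 1:
            s2 = sum((y - mean) ** 2 for y in ys) / (n - 1)
            # Variance of the scaled mean, with finite population correction.
            variance += size * size * (1 - n / size) * s2 / n
    return estimate, z * math.sqrt(variance)
```

Called repeatedly with a growing `sample_fraction`, the estimator returns a progressively tighter interval, and at `sample_fraction=1.0` it reduces to the exact sum over the relevant partitions with a zero-width interval. Stratifying by partition is one common way to limit the effect of skew, because a partition holding a few very large values is sampled on its own instead of being diluted in a global sample.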


Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yuxiang Wang (1)
  • Junzhou Luo (1)
  • Aibo Song (1)
  • Jiahui Jin (1)
  • Fang Dong (1)
  1. School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China
