
Knowledge and Information Systems, Volume 57, Issue 2, pp 437–473

Efficiently processing deterministic approximate aggregation query on massive data

  • Xixian Han
  • Bailing Wang
  • Jianzhong Li
  • Hong Gao
Regular Paper

Abstract

In practical applications, aggregation is an important operation that returns statistical characterizations of a subset of the data set. On massive data, approximate aggregation is often preferable because it is more timely and responsive. This paper focuses on deterministic approximate aggregation, which returns a running aggregate within a progressive deterministic error interval. Existing methods either return approximate results with fixed accuracy, return online approximate aggregates with probabilistic confidence intervals, or incur a high I/O cost on massive data. This paper proposes the LDA algorithm to compute deterministic approximate aggregates on massive data efficiently. LDA uses a hierarchically structured selection attribute lattice to distribute tuples and obtain a horizontal partitioning of the table. In each partition, each selection attribute is kept in a column file and each ranking attribute is transposed into bit-slices. Given a selection condition, only the relevant partitions are involved in computing the running aggregate. A compact storage scheme based on the Z-order space-filling curve is proposed to reduce the management cost of the partitions. An error reduction method is devised to reduce the error interval when computing the running aggregate. Extensive experimental results on synthetic and real data sets show that LDA has a significant performance advantage over the existing algorithms.
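
To make the idea of a running aggregate with a deterministic error interval concrete, the following minimal sketch shows how reading bit-slices from the most significant bit downward can bound a SUM at every step. It is an illustration under stated assumptions, not the LDA algorithm itself: it assumes non-negative integer values of a fixed bit width and a selection given as a 0/1 list, and the names `to_bit_slices` and `progressive_sum` are hypothetical.

```python
from typing import Iterator, List, Tuple


def to_bit_slices(values: List[int], width: int) -> List[List[int]]:
    """Transpose non-negative integers into bit-slices, most significant first."""
    return [[(v >> bit) & 1 for v in values] for bit in range(width - 1, -1, -1)]


def progressive_sum(
    slices: List[List[int]], selected: List[int]
) -> Iterator[Tuple[int, int]]:
    """Yield (lower, upper) bounds on SUM over the selected rows, one slice at a time.

    Assumption for illustration: `selected` is a 0/1 list marking the rows
    that satisfy the selection condition.
    """
    width = len(slices)
    count = sum(selected)  # number of selected rows
    lower = 0
    for i, bits in enumerate(slices):
        weight = 1 << (width - 1 - i)  # place value of the current slice
        ones = sum(b for b, s in zip(bits, selected) if s)
        lower += ones * weight
        # The unread low-order bits add at most (weight - 1) per selected row.
        upper = lower + count * (weight - 1)
        yield lower, upper


values = [5, 3, 6, 1]    # one attribute, 3 bits per value
selected = [1, 0, 1, 1]  # rows 0, 2 and 3 satisfy the selection
intervals = list(progressive_sum(to_bit_slices(values, 3), selected))
print(intervals)
```

In this sketch every interval is guaranteed to contain the exact sum of the selected values, and the interval collapses to a single point once the last slice has been read.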

Keywords

  • Deterministic approximate aggregation
  • LDA
  • Massive data
  • Selection attribute lattice
  • Error reduction processing

Notes

Acknowledgements

We thank the anonymous reviewers for their very useful comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61402130, 61632010, 61602129, and 61502121, the National Key Research and Development Program under Grant No. 2016YFB1000703, and the Shandong Province Science and Technology Major Project No. 2015ZDXX0210B02.

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2017

Authors and Affiliations

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
