
MISS: finding optimal sample sizes for approximate analytics

Distributed and Parallel Databases

Abstract

Sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a general query under given error constraints, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally, a solution to the SSO problem achieves statistical accuracy, computational efficiency and broad applicability at the same time. Existing approaches either make idealistic assumptions about the statistical properties of the query or disregard them completely, and may therefore overemphasize one of the three goals while neglecting the others. To overcome these limitations, we first carefully examine the statistical properties shared by common analytical queries. Based on these properties, we propose a linear model, called the error model, that describes the relationship between sample sizes and the approximation errors of a query. We then propose a Model-guided Iterative Sample Selection (MISS) framework to solve the SSO problem in general. Building on the MISS framework, we propose a concrete algorithm, called \(L^{2}\textsc{Miss}\), to find optimal sample sizes under the \(L^{2}\) norm error metric. Moreover, we extend the \(L^{2}\textsc{Miss}\) algorithm to handle other error metrics. Finally, we show theoretically and empirically that the \(L^{2}\textsc{Miss}\) algorithm and its extensions achieve satisfactory accuracy and efficiency for a considerably wide range of analytical queries.
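The abstract does not state the exact form of the error model or the steps of the MISS framework, so the following Python sketch only illustrates the general idea of model-guided iterative sample selection and is not the authors' algorithm. It rests on several assumptions that are not taken from the paper: the query is a simple mean, the error is a bootstrap estimate of its standard error, and the model is linear in log-log space, \(\log e = \alpha + \beta \log n\). All function names and parameters are hypothetical.

```python
import math
import random
import statistics


def bootstrap_error(sample, trials=100, rng=None):
    """Bootstrap estimate of the standard error of the sample mean (assumed error metric)."""
    rng = rng or random.Random(0)
    n = len(sample)
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(trials)]
    # Floor the estimate so that its logarithm is always defined.
    return max(statistics.pstdev(means), 1e-12)


def fit_log_linear(sizes, errors):
    """Least-squares fit of log(error) = alpha + beta * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


def find_sample_size(data, target_error, init_sizes=(50, 100, 200),
                     max_iters=10, seed=0):
    """Fit the model on small probes, predict a size, test it, and refit if needed."""
    rng = random.Random(seed)
    sizes, errors = [], []
    # Probe a few small sample sizes to obtain initial (size, error) observations.
    for n in init_sizes:
        sizes.append(n)
        errors.append(bootstrap_error(rng.sample(data, n), rng=rng))
    for _ in range(max_iters):
        alpha, beta = fit_log_linear(sizes, errors)
        if beta >= 0:
            # The fitted model is not decreasing in n; fall back to doubling.
            n_next = min(2 * max(sizes), len(data))
        else:
            # Solve alpha + beta * log(n) = log(target_error) for log(n).
            log_n = (math.log(target_error) - alpha) / beta
            if log_n >= math.log(len(data)):
                n_next = len(data)
            else:
                n_next = max(2, math.ceil(math.exp(log_n)))
        err = bootstrap_error(rng.sample(data, n_next), rng=rng)
        sizes.append(n_next)
        errors.append(err)
        if err <= target_error or n_next == len(data):
            return n_next, err
    # Iteration budget exhausted; return the last size tried and its error.
    return sizes[-1], errors[-1]
```

For data with standard deviation 10 and a target standard error of 0.5, the textbook relation \(\sigma/\sqrt{n} = 0.5\) gives \(n = 400\), so a call such as `find_sample_size(data, target_error=0.5)` would be expected to stop in that neighbourhood, although the bootstrap noise means the exact size varies from run to run.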

Acknowledgements

This paper was supported by NSFC grants (Grant Nos. U1866602 and 71773025) and by the National Key Research and Development Program of China (Grant No. 2020YFB1006104).

Author information

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Su, X., Wang, H. MISS: finding optimal sample sizes for approximate analytics. Distrib Parallel Databases 40, 165–200 (2022). https://doi.org/10.1007/s10619-021-07376-5

  • DOI: https://doi.org/10.1007/s10619-021-07376-5
