MISS: finding optimal sample sizes for approximate analytics

Su, Xuebin; Wang, Hongzhi

doi:10.1007/s10619-021-07376-5

Published: 21 October 2021

Volume 40, pages 165–200, (2022)

Distributed and Parallel Databases

Abstract

Sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a query under given error constraints, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally, a solution to the SSO problem achieves statistical accuracy, computational efficiency, and broad applicability all at once. Existing approaches either make idealistic assumptions about the statistical properties of the query or disregard them completely, which may overemphasize one of the three goals while neglecting the others. To overcome these limitations, we first carefully examine the statistical properties shared by common analytical queries. Based on these properties, we propose a linear model, called the error model, that describes the relationship between sample sizes and the approximation errors of a query. We then propose a Model-guided Iterative Sample Selection (MISS) framework to solve the SSO problem in general. Building on the MISS framework, we propose a concrete algorithm, called \(L^{2}\textsc{Miss}\), to find optimal sample sizes under the \(L^{2}\) norm error metric. Moreover, we extend the \(L^{2}\textsc{Miss}\) algorithm to handle other error metrics. Finally, we show theoretically and empirically that the \(L^{2}\textsc{Miss}\) algorithm and its extensions achieve satisfactory accuracy and efficiency for a considerably wide range of analytical queries.
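
The model-guided iterative idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' \(L^{2}\textsc{Miss}\) algorithm: the abstract does not state the exact form of the error model, so the sketch assumes a model that is linear in log-log space, log(error) = b0 + b1 * log(n), and it assumes the error is estimated by a bootstrap standard deviation. The names `bootstrap_error`, `fit_log_linear`, and `find_sample_size` are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def bootstrap_error(sample, estimator, rng, rounds=100):
    """Assumed error estimate: bootstrap standard deviation of the estimator."""
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(rounds)]
    return max(statistics.pstdev(estimates), 1e-12)  # avoid log(0) later


def fit_log_linear(sizes, errors):
    """Least-squares fit of the assumed model log(error) = b0 + b1 * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def find_sample_size(data, estimator, target_error,
                     init_sizes=(200, 400, 800), max_iter=10, seed=0):
    """Iterate: fit the model, predict a size, test it, and refit on failure."""
    rng = random.Random(seed)
    sizes = list(init_sizes)
    errors = [bootstrap_error(rng.sample(data, n), estimator, rng) for n in sizes]
    for _ in range(max_iter):
        b0, b1 = fit_log_linear(sizes, errors)
        # A non-negative slope means the fit is unusable; fall back to all data.
        log_n = (math.log(target_error) - b0) / b1 if b1 < 0 else math.inf
        if log_n >= math.log(len(data)):
            n = len(data)
        else:
            n = max(2, math.ceil(math.exp(log_n)))
        err = bootstrap_error(rng.sample(data, n), estimator, rng)
        if err <= target_error or n == len(data):
            return n, err
        sizes.append(n)  # keep the failed attempt as a new observation
        errors.append(err)
    return sizes[-1], errors[-1]  # best effort after max_iter rounds


if __name__ == "__main__":
    gen = random.Random(42)
    data = [gen.gauss(100.0, 10.0) for _ in range(20000)]
    print(find_sample_size(data, statistics.fmean, target_error=0.5))
```

Each failed attempt is added to the observations before refitting, so the model is refined near the target error rather than only at the initial sizes. A real system would need a cheaper error estimator and a stopping rule with statistical guarantees, which this sketch does not provide.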

Acknowledgements

This paper was supported by NSFC grants (Grant Nos. U1866602 and 71773025) and the National Key Research and Development Program of China (Grant No. 2020YFB1006104).

Author information

Authors and Affiliations

Harbin Institute of Technology & Peng Cheng Lab, Shenzhen, China
Xuebin Su & Hongzhi Wang

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Su, X., Wang, H. MISS: finding optimal sample sizes for approximate analytics. Distrib Parallel Databases 40, 165–200 (2022). https://doi.org/10.1007/s10619-021-07376-5

Accepted: 15 September 2021
Published: 21 October 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10619-021-07376-5
