Abstract
Sampling is a fundamental method in data science that reduces dataset size and lowers computational cost. A basic requirement is identically distributed sampling: the sample must preserve the probability distribution of the original data. Numerous sampling methods have been proposed, but how to estimate the sampling size under a constraint on the probability distribution remains unclear. In this paper, we formulate a Probably Approximately Correct (PAC) problem for sampling, which keeps the distribution difference within a given error bound at a given confidence level. We then apply Hoeffding's inequality to estimate the sampling size, decomposing the joint probability distribution into conditional distributions based on Bayesian networks. In the experiments, we simulate five Bayesian datasets of size 1,000,000 and report the estimated sampling size for different error bounds and confidence levels. When the error bound is 0.05 and the confidence level is 0.99, at least \(80\%\) of the samples can be excluded according to the estimated sampling size.
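To make the role of Hoeffding's inequality concrete, the following is a minimal sketch rather than the paper's own estimator. It assumes each probability is estimated as the mean of i.i.d. samples bounded in [0, 1], and it assumes (for illustration only) that a union bound is taken over `num_params` separately estimated probabilities, such as entries of conditional probability tables. The function name `hoeffding_sample_size` and the parameter `num_params` are hypothetical and do not come from the paper.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float, num_params: int = 1) -> int:
    """Smallest n satisfying 2 * num_params * exp(-2 * n * epsilon**2) <= delta.

    epsilon    -- error bound on each estimated probability
    delta      -- allowed failure probability (1 - confidence level)
    num_params -- number of probabilities covered by the union bound
                  (an illustrative assumption, not the paper's decomposition)
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if num_params < 1:
        raise ValueError("num_params must be at least 1")
    # Two-sided Hoeffding bound for means of variables in [0, 1]:
    #   P(|estimate - truth| >= epsilon) <= 2 * exp(-2 * n * epsilon**2)
    # Solving 2 * num_params * exp(-2 * n * epsilon**2) <= delta for n:
    n = math.log(2.0 * num_params / delta) / (2.0 * epsilon ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # Error bound 0.05 and confidence level 0.99 (delta = 0.01), one estimate.
    print(hoeffding_sample_size(0.05, 0.01))
    # The same setting with a union bound over 100 estimated probabilities.
    print(hoeffding_sample_size(0.05, 0.01, num_params=100))
```

The bound grows only logarithmically with the number of estimated probabilities and does not depend on the dataset size, which suggests why a large fraction of a million-row dataset may be unnecessary. This sketch ignores that each conditional distribution needs enough samples per parent configuration, so the sizes it returns should not be read as the paper's reported values.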
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (No. 62072153), the Anhui Provincial Key Technologies R&D Program (2022h11020015), the International Science and Technology Cooperation Project of the Shenzhen Science and Technology Commission (GJHZ20200731095804014), and the Program of Introducing Talents of Discipline to Universities (111 Program) (B14025).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yang, J., Ren, Z., Wang, J., Li, L. (2022). Determining the Sampling Size with Maintaining the Probability Distribution. In: Cai, Z., Chen, Y., Zhang, J. (eds) Theoretical Computer Science. NCTCS 2022. Communications in Computer and Information Science, vol 1693. Springer, Singapore. https://doi.org/10.1007/978-981-19-8152-4_4
DOI: https://doi.org/10.1007/978-981-19-8152-4_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8151-7
Online ISBN: 978-981-19-8152-4
eBook Packages: Computer Science (R0)