Determining the Sampling Size with Maintaining the Probability Distribution

Yang, Jiaoyun; Ren, Zhenyu; Wang, Junda; Li, Lian

doi:10.1007/978-981-19-8152-4_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1693)

Included in the following conference series:

National Conference of Theoretical Computer Science

Abstract

Sampling is a fundamental method in data science that reduces dataset size and lowers computational cost. A basic requirement is identically distributed sampling, that is, the sample must preserve the probability distribution of the original data. Numerous sampling methods have been proposed. However, how to estimate a bound on the sampling size under the constraint of preserving the probability distribution is still unclear. In this paper, we formulate a Probably Approximately Correct (PAC) problem for sampling, which requires the difference between the sample distribution and the original distribution to stay within a given error bound at a given confidence level. We then apply Hoeffding's inequality to estimate the sampling size, decomposing the joint probability distribution into conditional distributions based on Bayesian networks. In the experiments, we simulate 5 Bayesian datasets of size 1,000,000 and report the sampling size for different error bounds and confidence levels. When the error bound is 0.05 and the confidence level is 0.99, at least 80% of the samples can be excluded according to the estimated sampling size.
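
To illustrate the kind of calculation the abstract refers to, the following is a minimal sketch of the textbook two-sided Hoeffding bound for a single probability estimated from independent draws: if 2·exp(-2nε²) ≤ δ, the empirical frequency deviates from the true probability by more than ε with probability at most δ. The function name `hoeffding_sample_size` and the `num_estimates` parameter are hypothetical, and the union bound over several estimated probabilities is an assumption made for this example only. It is not the chapter's actual bound, whose decomposition over the Bayesian network is not given in the abstract.

```python
import math


def hoeffding_sample_size(epsilon: float, confidence: float, num_estimates: int = 1) -> int:
    """Smallest n such that 2 * k * exp(-2 * n * epsilon**2) <= 1 - confidence.

    epsilon:       allowed absolute error of each estimated probability
    confidence:    required probability that all estimates are within epsilon
    num_estimates: k, the number of probabilities estimated at once
                   (handled with a union bound; an assumption of this sketch)
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie in (0, 1)")
    if num_estimates < 1:
        raise ValueError("num_estimates must be at least 1")

    delta = 1.0 - confidence  # allowed failure probability
    # Solve 2 * k * exp(-2 * n * epsilon^2) <= delta for n.
    n = math.log(2.0 * num_estimates / delta) / (2.0 * epsilon ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # Error bound 0.05 and confidence level 0.99, the setting quoted in the abstract.
    print(hoeffding_sample_size(0.05, 0.99))  # 1060
```

For a single probability the bound depends only on the error bound and the confidence level, not on the dataset size. The figure it produces is for this simplified setting and should not be read as the sampling size reported in the chapter.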

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 62072153), the Anhui Provincial Key Technologies R&D Program (2022h11020015), the International Science and Technology Cooperation Project of the Shenzhen Science and Technology Commission (GJHZ20200731095804014), and the Program of Introducing Talents of Discipline to Universities (111 Program) (B14025).

Author information

Authors and Affiliations

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 230601, China
Jiaoyun Yang, Zhenyu Ren & Lian Li
National Smart Eldercare International Science and Technology Cooperation Base, Hefei University of Technology, Hefei, 230601, China
Jiaoyun Yang, Zhenyu Ren & Lian Li
Department of Computer Science, University of Rochester, Rochester, 14627, USA
Junda Wang

Corresponding author

Correspondence to Zhenyu Ren.

Editor information

Editors and Affiliations

National University of Defense Technology, Changsha, China
Zhiping Cai
Shanghai Jiao Tong University, Shanghai, China
Yijia Chen
Chinese Academy of Sciences, Beijing, China
Jialin Zhang

About this paper

Cite this paper

Yang, J., Ren, Z., Wang, J., Li, L. (2022). Determining the Sampling Size with Maintaining the Probability Distribution. In: Cai, Z., Chen, Y., Zhang, J. (eds) Theoretical Computer Science. NCTCS 2022. Communications in Computer and Information Science, vol 1693. Springer, Singapore. https://doi.org/10.1007/978-981-19-8152-4_4

DOI: https://doi.org/10.1007/978-981-19-8152-4_4
Published: 10 December 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8151-7
Online ISBN: 978-981-19-8152-4
eBook Packages: Computer Science, Computer Science (R0)
