Determining the Sampling Size with Maintaining the Probability Distribution

  • Conference paper
Theoretical Computer Science (NCTCS 2022)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1693))

Abstract

Sampling is a fundamental method in data science that reduces dataset size and computational complexity. A basic requirement is identically distributed sampling, that is, the sample must maintain the probability distribution of the original data. Numerous sampling methods have been proposed. However, how to estimate the bound on the sampling size under the constraint of the probability distribution is still unclear. In this paper, we formulate a Probably Approximately Correct (PAC) problem for sampling, which keeps the distribution difference within a given error boundary at a given confidence level. We further apply Hoeffding’s inequality to estimate the sampling size by decomposing the joint probability distribution into conditional distributions based on Bayesian networks. In the experiments, we simulate 5 Bayesian datasets of size 1,000,000 and report the sampling size for different error boundaries and confidence levels. When the error boundary is 0.05 and the confidence level is 0.99, at least \(80\%\) of the samples can be excluded according to the estimated sampling size.
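To give a feel for how Hoeffding’s inequality turns an error boundary and a confidence level into a sample size, the following is a minimal sketch and not the paper’s estimator. It assumes each estimated probability is the mean of i.i.d. values bounded in [0, 1], applies the two-sided bound \(P(|\hat{p} - p| \ge \epsilon) \le 2\exp(-2n\epsilon^2)\), and uses a plain union bound over a hypothetical count `num_parameters` of conditional-probability entries. The function name and the union-bound treatment are illustrative assumptions; the paper’s decomposition over a Bayesian network may yield a different bound.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float, num_parameters: int = 1) -> int:
    """Smallest n with 2 * k * exp(-2 * n * epsilon**2) <= delta.

    epsilon: error boundary on each estimated probability.
    delta: allowed failure probability (1 - confidence level).
    num_parameters: k, the number of probabilities bounded simultaneously
        (hypothetical; handled here with a simple union bound).
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0 and num_parameters >= 1):
        raise ValueError("require 0 < epsilon < 1, 0 < delta < 1, num_parameters >= 1")
    # Solve 2k * exp(-2 n eps^2) <= delta for n and round up.
    return math.ceil(math.log(2.0 * num_parameters / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Error boundary 0.05, confidence level 0.99 (delta = 0.01), one parameter.
    print(hoeffding_sample_size(0.05, 0.01))        # 1060
    # Same setting with 100 parameters bounded at once.
    print(hoeffding_sample_size(0.05, 0.01, 100))   # 1981
```

Under these assumptions the required size grows only logarithmically with the number of parameters and does not depend on the dataset size, which is consistent with a large share of a 1,000,000-row dataset being excludable.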

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 62072153), the Anhui Provincial Key Technologies R&D Program (2022h11020015), the International Science and Technology Cooperation Project of the Shenzhen Science and Technology Commission (GJHZ20200731095804014), and the Program of Introducing Talents of Discipline to Universities (111 Program) (B14025).

Author information

Corresponding author

Correspondence to Zhenyu Ren.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yang, J., Ren, Z., Wang, J., Li, L. (2022). Determining the Sampling Size with Maintaining the Probability Distribution. In: Cai, Z., Chen, Y., Zhang, J. (eds) Theoretical Computer Science. NCTCS 2022. Communications in Computer and Information Science, vol 1693. Springer, Singapore. https://doi.org/10.1007/978-981-19-8152-4_4

  • DOI: https://doi.org/10.1007/978-981-19-8152-4_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8151-7

  • Online ISBN: 978-981-19-8152-4

  • eBook Packages: Computer Science, Computer Science (R0)
