Frontiers of Computer Science, Volume 12, Issue 6, pp 1220–1240

Distribution-free data density estimation in large-scale networks

  • Minqi Zhou
  • Rong Zhang (Email author)
  • Weining Qian
  • Aoying Zhou
Research Article

Abstract

Estimating the global data distribution in large-scale networks is an important problem that has yet to be well addressed. It can benefit many applications, especially in the cloud computing era, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate (number) generation, in this paper we present a novel model, called distribution-free data density estimation, for large ring-based networks. It achieves high estimation accuracy at low estimation cost regardless of the distribution model of the underlying data. The model generates random samples for an arbitrary distribution by sampling the global cumulative distribution function, and it is free from sampling bias. With this estimation method, we can estimate data densities over both one-dimensional and multidimensional tuple sets, where the domain of each dimension may be either continuous or discrete. In large-scale networks, the key idea of distribution-free estimation is to sample a small subset of peers and use it to estimate the global data distribution over the data domain. We introduce algorithms for computing and sampling the global cumulative distribution function, from which the global data distribution is estimated, together with a detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in large ring-based networks.
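
The inversion method mentioned in the abstract can be illustrated with a minimal, single-process sketch. It is not the paper's distributed algorithm: it assumes the data domain has already been split into buckets with known integer tuple counts, and the names `build_cdf`, `sample_bucket`, and `estimate_density` are hypothetical, chosen here for illustration only. The sketch draws a uniform number u, finds the first bucket whose cumulative value reaches u, and uses the hit frequencies as the density estimate.

```python
import bisect
import random


def build_cdf(bucket_counts):
    """Turn integer per-bucket tuple counts into cumulative values in (0, 1]."""
    total = sum(bucket_counts)
    if total <= 0:
        raise ValueError("bucket_counts must contain at least one tuple")
    cdf, running = [], 0
    for count in bucket_counts:
        running += count
        cdf.append(running / total)  # the last entry is exactly 1.0
    return cdf


def sample_bucket(cdf, u):
    """Inversion step: index of the first bucket whose cumulative value is >= u."""
    return bisect.bisect_left(cdf, u)


def estimate_density(bucket_counts, num_samples, seed=0):
    """Estimate per-bucket densities from num_samples inversion draws."""
    rng = random.Random(seed)
    cdf = build_cdf(bucket_counts)
    hits = [0] * len(bucket_counts)
    for _ in range(num_samples):
        u = 1.0 - rng.random()  # uniform on (0, 1], so empty buckets are never hit
        hits[sample_bucket(cdf, u)] += 1
    return [h / num_samples for h in hits]


if __name__ == "__main__":
    # Hypothetical bucket counts; the true densities are 0.25, 0.0, 0.75.
    print(estimate_density([10, 0, 30], num_samples=20000))
```

Because each draw depends only on the cumulative function, the same procedure applies whatever the shape of the underlying distribution, which is the property the abstract refers to as distribution-free.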

Keywords

distribution-free, data density estimation, random sampling

Notes

Acknowledgements

This study was partially supported by the National Natural Science Foundation of China (Grant Nos. 61332006 and 61232002), the National High-Tech Research and Development Program (863 Program) of China (2015AA015303), and Infosys.

Supplementary material

11704_2016_6194_MOESM1_ESM.ppt (190 kb)
Supplementary material, approximately 190 KB.

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Minqi Zhou (1, 2)
  • Rong Zhang (1, 2), Email author
  • Weining Qian (1)
  • Aoying Zhou (1)
  1. Data Science and Engineering Institute, East China Normal University, Shanghai, China
  2. State Key Lab of Software Engineering, Wuhan University, Wuhan, China
