The VLDB Journal, Volume 22, Issue 6, pp 753–772

Non-uniformity issues and workarounds in bounded-size sampling

Regular Paper

Abstract

A variety of schemes have been proposed in the literature to speed up query processing and analytics by incrementally maintaining a bounded-size uniform sample from a dataset in the presence of a sequence of insertion, deletion, and update transactions. These algorithms vary according to whether the dataset is an ordinary set or a multiset and whether the transaction sequence consists only of insertions or can include deletions and updates. We report on subtle non-uniformity issues that we found in a number of these prior bounded-size sampling schemes, including some of our own. We provide workarounds that can avoid the non-uniformity problem; these workarounds are easy to implement and incur negligible additional cost. We also consider the impact of non-uniformity in practice and describe simple statistical tests that can help detect non-uniformity in new algorithms.
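As a rough illustration of the kind of check the abstract alludes to, the sketch below pairs the classic insertion-only reservoir scheme with a chi-square goodness-of-fit statistic over all possible size-k samples. It is not taken from the paper: the function names `reservoir_sample` and `chi_square_subsets`, the toy parameters, and the choice of a chi-square test are assumptions made here for illustration, and the paper's own tests and its handling of deletions and updates may differ.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def reservoir_sample(stream, k, rng):
    """Insertion-only reservoir sampling: keep at most k items from the stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / i.
            j = rng.randrange(i)
            if j < k:
                sample[j] = item
    return sample


def chi_square_subsets(sampler, n, k, trials, seed=0):
    """Hypothetical helper: chi-square statistic over all k-subsets of range(n).

    Under uniformity every k-subset is equally likely, so each one is
    expected to appear trials / C(n, k) times.
    """
    rng = random.Random(seed)
    counts = Counter(
        frozenset(sampler(range(n), k, rng)) for _ in range(trials)
    )
    expected = trials / comb(n, k)
    stat = sum(
        (counts[frozenset(s)] - expected) ** 2 / expected
        for s in combinations(range(n), k)
    )
    return stat, comb(n, k) - 1  # statistic and degrees of freedom


if __name__ == "__main__":
    stat, dof = chi_square_subsets(reservoir_sample, n=6, k=2, trials=30000)
    print(f"chi-square = {stat:.2f} with {dof} degrees of freedom")
```

A statistic far above what a chi-square distribution with the reported degrees of freedom would plausibly produce suggests that some samples are favored over others. Enumerating all subsets is only feasible for very small n and k, so a check like this is best run on toy instances of a new algorithm.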

Keywords

Database sampling, Reservoir sampling, Bernoulli sampling, Sample maintenance

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Rainer Gemulla (1)
  • Peter J. Haas (2)
  • Wolfgang Lehner (3)
  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
  2. IBM Almaden Research Center, San Jose, USA
  3. Technische Universität Dresden, Dresden, Germany
