Processing Data-Stream Join Aggregates Using Skimmed Sketches

  • Sumit Ganguly
  • Minos Garofalakis
  • Rajeev Rastogi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2992)

Abstract

There is growing interest in on-line algorithms for analyzing and querying data streams that examine each stream element only once and have only a limited amount of memory at their disposal. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low logarithmic processing time per stream element, and (4) not assuming any a priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate estimates for join sizes compared to earlier sketch-based techniques.
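
The skim-then-join idea described in the abstract can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the paper's algorithm: the class `HashSketch`, the functions `sketch_product` and `skimmed_join_size`, and the parameters `depth`, `width`, `seed`, and `threshold` are all hypothetical names chosen for this example. It uses explicit random lookup tables in place of limited-independence hash functions, finds dense values by scanning the whole domain (so it does not reproduce the logarithmic per-element cost or the space bound claimed above), and picks the density threshold by hand.

```python
import random
from statistics import median_low


class HashSketch:
    """Toy hash sketch: `depth` tables of `width` signed counters (assumed design)."""

    def __init__(self, domain, depth=7, width=256, seed=0):
        rng = random.Random(seed)
        self.domain = domain
        # Random lookup tables stand in for limited-independence hash functions.
        self.bucket = [[rng.randrange(width) for _ in range(domain)] for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(domain)] for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, value, delta=1):
        """Apply an insert (delta > 0) or a delete (delta < 0) of `value`."""
        for row, bucket, sign in zip(self.table, self.bucket, self.sign):
            row[bucket[value]] += delta * sign[value]

    def estimate(self, value):
        """Median over the tables of the signed counter that `value` hashes to."""
        return median_low(
            row[bucket[value]] * sign[value]
            for row, bucket, sign in zip(self.table, self.bucket, self.sign)
        )

    def skim(self, threshold):
        """Extract values whose estimated frequency is >= threshold and
        subtract them from the sketch, leaving only the residual frequencies."""
        dense = {}
        for v in range(self.domain):
            est = self.estimate(v)
            if est >= threshold:
                dense[v] = est
        for v, est in dense.items():
            self.update(v, -est)
        return dense


def sketch_product(a, b):
    """Estimate the inner product of two frequency vectors from their sketches.
    Both sketches must share the same domain, depth, width and seed."""
    return median_low(
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a.table, b.table)
    )


def skimmed_join_size(f, g, threshold):
    """Join-size estimate as the sum of four subjoins. Skims f and g in place."""
    dense_f, dense_g = f.skim(threshold), g.skim(threshold)
    # Dense-dense subjoin is computed directly from the extracted frequencies.
    dense_dense = sum(c * dense_g[v] for v, c in dense_f.items() if v in dense_g)
    # Cross terms pair extracted dense values with the other skimmed sketch.
    dense_sparse = sum(c * g.estimate(v) for v, c in dense_f.items())
    sparse_dense = sum(c * f.estimate(v) for v, c in dense_g.items())
    # Sparse-sparse subjoin uses only the skimmed sketches.
    sparse_sparse = sketch_product(f, g)
    return dense_dense + dense_sparse + sparse_dense + sparse_sparse


if __name__ == "__main__":
    domain, rng = 1000, random.Random(1)
    f, g = HashSketch(domain, seed=42), HashSketch(domain, seed=42)
    freq_f, freq_g = [0] * domain, [0] * domain
    for _ in range(20000):  # skewed synthetic streams
        a = min(int(rng.paretovariate(1.2)) - 1, domain - 1)
        b = min(int(rng.paretovariate(1.2)) - 1, domain - 1)
        f.update(a)
        g.update(b)
        freq_f[a] += 1
        freq_g[b] += 1
    exact = sum(x * y for x, y in zip(freq_f, freq_g))
    print("exact join size:", exact)
    print("skimmed estimate:", skimmed_join_size(f, g, threshold=200))
```

The two sketches are built with the same seed so that a value maps to the same bucket and sign in both; without that, the per-table products carry no information about the join. Because every counter update is linear, a delete is just an update with a negative `delta`, which is also what lets `skim` subtract the dense frequencies after the fact.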

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Sumit Ganguly
    • 1
  • Minos Garofalakis
    • 1
  • Rajeev Rastogi
    • 1
  1. Bell Laboratories, Lucent Technologies, Murray Hill, USA
