A General Method for Estimating Correlated Aggregates Over a Data Stream

Algorithmica

Abstract

On a stream \(\mathcal{S}\) of two-dimensional data items \((x,y)\), where \(x\) is an item identifier and \(y\) is a numerical attribute, a correlated aggregate query \(C(\sigma, AGG, \mathcal{S})\) first applies a selection predicate \(\sigma\) along the \(y\) dimension and then an aggregation \(AGG\) along the \(x\) dimension. For selection predicates of the form \((y < c)\) or \((y > c)\), where the parameter \(c\) is provided at query time, we present new streaming algorithms and lower bounds for estimating correlated aggregates. Our main result is a general method that, for any aggregate satisfying certain conditions, reduces the estimation of a correlated aggregate \(AGG\) to the streaming computation of \(AGG\) over an entire stream. This yields the first sublinear-space algorithms for the correlated estimation of a large family of statistics, including frequency moments. Our experimental validation shows that the memory requirements of these algorithms are significantly smaller than those of existing linear-storage solutions, and that they achieve fast per-record processing times. We also study the setting in which items have weights. When weights can be negative, we give a strong space lower bound that holds even if the algorithm is allowed up to a logarithmic number of passes over the data. We complement this with a small-space algorithm that uses a logarithmic number of passes.
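
To make the query semantics concrete, the following is a minimal sketch of an exact, linear-storage evaluation of a correlated frequency moment under the predicate \((y < c)\). It is only an illustrative baseline of the kind the abstract compares against, not the sublinear-space method of the article; the function name `correlated_fk`, its parameters, and the sample stream are assumptions made for this example.

```python
from collections import Counter


def correlated_fk(stream, c, k=2):
    """Exact correlated k-th frequency moment for the predicate y < c.

    Hypothetical reference baseline: it keeps a counter for every selected
    identifier, so its space is linear in the number of distinct items.
    """
    # Selection along the y dimension: keep identifiers whose attribute is below c.
    freq = Counter(x for x, y in stream if y < c)
    # Aggregation along the x dimension: sum of k-th powers of the frequencies.
    return sum(f ** k for f in freq.values())


# Illustrative stream of (identifier, attribute) pairs.
stream = [("a", 1), ("b", 5), ("a", 3), ("c", 2), ("a", 9), ("b", 1)]

# The threshold c is supplied only at query time.
print(correlated_fk(stream, c=4, k=2))
```

Because \(c\) is known only at query time, this baseline has to retain enough of the stream to answer for any threshold, which is the storage cost a sublinear-space summary aims to avoid.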

Acknowledgments

Srikanta Tirthapura is supported in part by NSF CNS-0834743, CNS-0831903.

Author information

Corresponding author

Correspondence to Srikanta Tirthapura.

Additional information

A preliminary version of this article appeared in Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE 2012), pages 162–173.

About this article

Cite this article

Tirthapura, S., Woodruff, D.P. A General Method for Estimating Correlated Aggregates Over a Data Stream. Algorithmica 73, 235–260 (2015). https://doi.org/10.1007/s00453-014-9917-1

  • DOI: https://doi.org/10.1007/s00453-014-9917-1
