Abstract
On a stream \(\mathcal{S}\) of two-dimensional data items \((x,y)\), where \(x\) is an item identifier and \(y\) is a numerical attribute, a correlated aggregate query \(C(\sigma ,AGG,\mathcal{S})\) first applies a selection predicate \(\sigma \) along the \(y\) dimension and then an aggregation \(AGG\) along the \(x\) dimension. For selection predicates of the form \((y < c)\) or \((y > c)\), where the parameter \(c\) is provided at query time, we present new streaming algorithms and lower bounds for estimating correlated aggregates. Our main result is a general method that reduces the estimation of a correlated aggregate \(AGG\) to the streaming computation of \(AGG\) over an entire stream, for any aggregate that satisfies certain conditions. This yields the first sublinear-space algorithms for the correlated estimation of a large family of statistics, including frequency moments. Our experimental validation shows that the memory requirements of these algorithms are significantly smaller than those of existing linear-storage solutions, and that they achieve a fast per-record processing time. We also study the setting in which items have weights. When weights can be negative, we give a strong space lower bound that holds even if the algorithm is allowed up to a logarithmic number of passes over the data. We complement this with a small-space algorithm that uses a logarithmic number of passes.
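To make the query definition concrete, the following is a minimal sketch of an exact, linear-storage baseline, not the sublinear-space method of the article. It assumes the whole stream is kept in memory (needed here because \(c\) is only known at query time), and the names `correlated_aggregate`, `f0` and `f2` are hypothetical helpers introduced for illustration. The aggregates shown are the frequency moments \(F_0\) (number of distinct identifiers) and \(F_2\) (sum of squared frequencies).

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# One stream item (x, y): x is the identifier, y the numerical attribute.
Item = Tuple[str, float]


def correlated_aggregate(
    stream: Iterable[Item],
    predicate: Callable[[float], bool],
    agg: Callable[[Counter], float],
) -> float:
    """Exact C(sigma, AGG, S): select on y, then aggregate over x.

    Illustrative baseline only: it scans the fully stored stream,
    so its space is linear in the stream length.
    """
    # Frequency of each identifier x among items whose y passes sigma.
    freq = Counter(x for x, y in stream if predicate(y))
    return agg(freq)


def f0(freq: Counter) -> float:
    """Zeroth frequency moment: number of distinct identifiers."""
    return float(len(freq))


def f2(freq: Counter) -> float:
    """Second frequency moment: sum of squared frequencies."""
    return float(sum(f * f for f in freq.values()))


# Hypothetical toy stream; the threshold c is chosen at query time.
stream = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 5.0), ("a", 7.0)]
c = 4.0

below_f2 = correlated_aggregate(stream, lambda y: y < c, f2)
below_f0 = correlated_aggregate(stream, lambda y: y < c, f0)
above_f2 = correlated_aggregate(stream, lambda y: y > c, f2)
```

With \(c = 4\), the predicate \((y < c)\) keeps the items with \(y \in \{1, 2, 3\}\), so identifier `a` has frequency 2 and `b` has frequency 1. A streaming algorithm would have to answer the same query for any \(c\) without storing the items themselves.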
Acknowledgments
Srikanta Tirthapura is supported in part by NSF CNS-0834743, CNS-0831903.
Additional information
A preliminary version of this article appeared in Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE 2012), pages 162–173.
About this article
Cite this article
Tirthapura, S., Woodruff, D.P. A General Method for Estimating Correlated Aggregates Over a Data Stream. Algorithmica 73, 235–260 (2015). https://doi.org/10.1007/s00453-014-9917-1
DOI: https://doi.org/10.1007/s00453-014-9917-1