An Approximate Lp-Difference Algorithm for Massive Data Streams
Several recent papers have shown how to approximate the difference $\sum_i |a_i - b_i|$ or $\sum_i |a_i - b_i|^2$ between two functions when the function values $a_i$ and $b_i$ are given in a data stream and their order is chosen by an adversary. These algorithms use little space (much less than would be needed to store the entire stream) and little time to process each item in the stream, and they give approximations with small relative error. Using different techniques, we show how to approximate the Lp-difference $\sum_i |a_i - b_i|^p$ for any rational-valued $p \in (0,2]$, with comparable efficiency and error. We also show how to approximate $\sum_i |a_i - b_i|^p$ for larger values of $p$, but with a worse error guarantee. These results can be used to assess the difference between two chronologically or physically separated massive data sets, making one quick pass over each data set, without buffering the data or requiring the data source to pause.
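To make the streaming setting concrete, the p = 2 case from the earlier work can be illustrated with the random-sign ("tug-of-war") sketch of Alon, Matias, and Szegedy. The Python below is a minimal sketch under stated assumptions and is not the technique introduced in this paper, which handles general p by different means. The class name `L2DiffSketch`, the parameters `groups`, `per_group`, and `seed`, and the representation of a stream item as a (source, index, value) triple are all hypothetical choices made for illustration.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # modulus for the polynomial hash (assumption: indices < PRIME)


class L2DiffSketch:
    """Illustrative estimator of sum_i (a_i - b_i)^2 from interleaved streams.

    Each counter holds Z = sum_i s(i) * (a_i - b_i) for a pseudo-random sign
    function s. With 4-wise independent signs, E[Z^2] equals the squared
    difference, so averaging Z^2 and taking a median of averages
    concentrates the estimate.
    """

    def __init__(self, groups=9, per_group=100, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        k = groups * per_group
        # One random degree-3 polynomial per counter (4-wise independent hash).
        self.coeffs = [[rng.randrange(PRIME) for _ in range(4)] for _ in range(k)]
        self.counters = [0] * k

    @staticmethod
    def _sign(coeffs, i):
        h = 0
        for c in coeffs:  # Horner evaluation of the polynomial at i
            h = (h * i + c) % PRIME
        return 1 if h & 1 else -1

    def update(self, source, i, value):
        """Process one stream item; source is +1 for a_i and -1 for b_i."""
        for j, coeffs in enumerate(self.coeffs):
            self.counters[j] += source * self._sign(coeffs, i) * value

    def estimate(self):
        """Median over groups of the mean of squared counters."""
        means = []
        for g in range(self.groups):
            chunk = self.counters[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(z * z for z in chunk) / len(chunk))
        return statistics.median(means)
```

Items may arrive in any order, including an adversarial interleaving of the two sources, because each update only adds to the counters. The sketch stores `groups * per_group` counters and hash coefficients regardless of how many items are processed.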
Keywords: Data Stream, Massive Data, Small Relative Error, Error Guarantee, Synopsis Data Structure