Frequency Estimation of Internet Packet Streams with Limited Space
We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes per second. As a result, network routers can collect only relatively few statistics about the data. The central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic stream. A particularly difficult and representative subproblem is to determine the top k categories to which the most packets belong, for a desired value of k and for a given notion of categorization such as the destination IP address.
We present an algorithm that deterministically finds, in particular, all categories having a frequency above 1/(m+1) using m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case where packet categories follow an arbitrary distribution but their order over time is permuted uniformly at random. Under this model, our algorithm identifies flows above a frequency threshold of roughly 1/√(nm) with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability 1/n), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows according to frequency within any desired constant factor of accuracy.
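The abstract does not spell out the deterministic algorithm, so the following is only a minimal sketch of one well-known counter-based scheme (in the style of Misra-Gries) that meets the stated guarantee: with m counters, every category whose frequency exceeds 1/(m+1) is among the surviving candidates. The function name `frequent_candidates` and the sample stream are hypothetical, chosen for illustration.

```python
def frequent_candidates(stream, m):
    """Return at most m candidate categories from a single pass.

    Illustrative sketch (not taken from the paper): any category that
    makes up more than 1/(m+1) of the stream is guaranteed to be a key
    of the returned dict. Other keys may be false positives.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Category already tracked: count this packet.
            counters[item] += 1
        elif len(counters) < m:
            # A free counter exists: start tracking this category.
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop
            # those that reach zero. The new item is discarded.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of destination labels: "a" is 5 of 10 packets,
# which is above the 1/(m+1) = 1/3 threshold for m = 2.
packets = ["a", "b", "c", "a", "d", "a", "b", "a", "b", "a"]
candidates = frequent_candidates(packets, 2)
```

The stored counts are lower bounds on the true counts, and the candidate set can include categories below the threshold. Where exact frequencies matter, a second pass over the stream (when one is available) can confirm them.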
Keywords: Frequency Estimation, Frequency Threshold, Current Element, Popular Category, Packet Stream