
Improved Algorithms for Distributed Entropy Monitoring

Algorithmica
Abstract

Modern data management systems often need to deal with massive, dynamic, and inherently distributed data sources: the data is collected by a distributed network, while a global view of the data is maintained at a central coordinator using a minimal amount of communication. Such applications are captured by the distributed monitoring model, which has attracted a lot of attention in recent years. In this paper we study the monitoring of entropy functions, which are very useful in network monitoring applications such as detecting distributed denial-of-service attacks. Our results improve the previous best results by Arackaparambil et al. (ICALP, pp. 95–106, 2009). Our technical contribution also includes implementing the celebrated AMS sampling method (Alon et al. in J Comput Syst Sci 58(1):137–147, 1999) in the distributed monitoring model, which may be of independent interest.


Notes

  1. \(\tilde{O}(\cdot )\) ignores all polylog terms; see Table 2 for details.

  2. It is well-known that in sensor networks, communication is by far the biggest battery drain [26].

  3. To see this, note that \(\mathbf {E}[X] = \sum _{j \in [m]} \mathbf {E}[X | a_J = j] \mathbf {Pr}[a_J = j] = \sum _{j \in [m]} (\mathbf {E}[f(R) - f(R-1) | a_J = j] \cdot m_j / m)\), and \(\mathbf {E}[f(R) - f(R-1) | a_J = j] = \sum _{k \in [m_j]} \frac{f(k) - f(k-1)}{m_j} = \frac{f(m_j)}{m_j}\).
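The telescoping step in this note can be sanity-checked numerically. The sketch below is ours (not from the paper), with an illustrative choice of \(f\); it verifies that the conditional expectation collapses to \(f(m_j)/m_j\) for any \(f\) with \(f(0) = 0\):

```python
import math

def f(x):
    # Illustrative choice of f with f(0) = 0; the telescoping identity
    # holds for any such f.
    return 0.0 if x == 0 else x * math.log2(x)

def telescoped_mean(mj):
    """E[f(R) - f(R-1) | a_J = j], with R uniform on [m_j]: the sum
    sum_{k=1}^{m_j} (f(k) - f(k-1)) / m_j telescopes to f(m_j) / m_j."""
    return sum((f(k) - f(k - 1)) / mj for k in range(1, mj + 1))
```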

  4. In practice, one can generate a random binary string of, say, \(10\log m\) bits as its rank, and w.h.p. all ranks will be different.
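A quick illustration of this note (ours, not from the paper): with \(10\log m\)-bit ranks, a union bound over the \(\binom{m}{2}\) pairs gives collision probability at most \(m^2/2^{10\log m} = m^{-8}\), which is negligible.

```python
import math
import random

def assign_ranks(m, seed=0):
    """Draw an independent random rank of 10 * ceil(log2 m) bits for each
    of m items; by a union bound, all ranks are distinct w.h.p."""
    rng = random.Random(seed)
    bits = 10 * max(1, math.ceil(math.log2(m)))
    return [rng.getrandbits(bits) for _ in range(m)]
```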

References

  1. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)


  2. Arackaparambil, C., Brody, J., Chakrabarti, A.: Functional monitoring without monotonicity. In: ICALP, pp. 95–106, (2009)


  3. Babcock, B., Olston, C.: Distributed top-k monitoring. In: SIGMOD, pp. 28–39, (2003)

  4. Chakrabarti, A., Ba, K.D., Muthukrishnan, S.: Estimating entropy and entropy norm on data streams. Internet Math. 3(1), 63–78 (2006)


  5. Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for estimating the entropy of a stream. ACM Trans. Alg. 6(3), 51 (2010)


  6. Chan, H.-L., Lam, T.W., Lee, L.-K., Ting, H.-F.: Continuous monitoring of distributed data streams over a time-based sliding window. Algorithmica 62(3–4), 1088–1111 (2012)


  7. Cormode, G., Garofalakis, M.N.: Sketching streams through the net: Distributed approximate query tracking. In: VLDB, pp. 13–24, (2005)

  8. Cormode, G., Garofalakis, M.N., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In: SIGMOD, pp. 25–36, (2005)

  9. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Alg. 55(1), 58–75 (2005)


  10. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. In: SODA, pp. 1076–1085, (2008)

  11. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10 (2012)


  12. Cormode, G., Muthukrishnan, S., Zhuang, W.: What’s different: Distributed, continuous monitoring of duplicate-resilient aggregates on data streams. In: ICDE, p. 57, (2006)

  13. Cormode, G., Muthukrishnan, S., Zhuang, W.: Conquering the divide: Continuous clustering of distributed data streams. In: ICDE, pp. 1036–1045, (2007)

  14. Cormode, G., Yi, K.: Tracking distributed aggregates over time-based sliding windows. In: SSDBM, pp. 416–430, (2012)

  15. Dilman, M., Raz, D.: Efficient reactive monitoring. In: INFOCOM, pp. 1012–1019, (2001)

  16. Ganguly, S., Garofalakis, M.N., Rastogi, R.: Tracking set-expression cardinalities over continuous update streams. VLDB J. 13(4), 354–369 (2004)


  17. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: SPAA, pp. 281–291, (2001)

  18. Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72, (2002)

  19. Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: SODA, pp. 733–742, (2006)

  20. Harvey, N.J.A., Nelson, J., Onak, K.: Sketching and streaming entropy via approximation theory. In: FOCS, pp. 489–498, (2008)

  21. Huang, Z., Yi, K., Zhang, Q.: Randomized algorithms for tracking distributed count, frequencies, and ranks. In: PODS, pp. 295–306, (2012)

  22. Jain, A., Hellerstein, J.M., Ratnasamy, S., Wetherall, D.: A wakeup call for internet monitoring systems: the case for distributed triggers. In: Proceedings of HotNets-III, (2004)

  23. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD, pp. 289–300, (2006)

  24. Lakhina, A., Crovella, M., Diot, C.: Mining anomalies using traffic feature distributions. In: SIGCOMM, pp. 217–228, (2005)

  25. Lall, A., Sekar, V., Ogihara, M., Xu, J.J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: SIGMETRICS/Performance, pp. 145–156, (2006)

  26. Madden, S., Franklin, M.J., Hellerstein, J.M., Hong, W.: The design of an acquisitional query processor for sensor networks. In: SIGMOD, pp. 491–502, (2003)

  27. Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: ICDE, pp. 767–778, (2005)

  28. Olston, C., Jiang, J., Widom, J.: Adaptive filters for continuous queries over distributed data streams. In: SIGMOD, pp. 563–574, (2003)

  29. Sharfman, I., Schuster, A., Keren, D.: A geometric approach to monitoring threshold functions over distributed data streams. In: SIGMOD, pp. 301–312, (2006)

  30. Sharfman, I., Schuster, A., Keren, D.: Shape sensitive geometric monitoring. In: PODS, pp. 301–310, (2008)

  31. Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: DISC, pp. 283–297, (2011)

  32. Woodruff, D.P., Zhang, Q.: Tight bounds for distributed functional monitoring. In: STOC, pp. 941–960, (2012)

  33. Xu, K., Zhang, Z., Bhattacharyya, S.: Profiling internet backbone traffic: behavior models and applications. In: SIGCOMM, pp. 169–180, (2005)

  34. Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. In: PODS, pp. 167–174, (2009)


Acknowledgments

We would like to thank the Algorithmica anonymous reviewers for their insightful reviews.


Corresponding author

Correspondence to Qin Zhang.

Additional information

Work supported in part by NSF CCF-1525024, IIS-1633215, and IU’s office of the Vice Provost for Research through the Faculty Research Support Program.

Appendices

Appendix 1: Proof of Lemma 4

Proof

We can assume that \(U_1,\ldots ,U_m\) are distinct, since the event that \(U_i = U_j\) for some \(i \ne j\) has measure 0. Let \(T_i = \min \{U_1,\ldots ,U_i \}\), let \(\text {OD}_i\) denote the relative order of \(U_1,\ldots ,U_i\), and let \(\Sigma _i\) be the set of all permutations of \([i]\). Since the relative order of \(U_1,\ldots ,U_i\) does not depend on the minimum value of that sequence, for any \(\sigma \in \Sigma _i\) we have \(\mathbf {Pr}[\text {OD}_i = \sigma \ |\ T_i > t] = \mathbf {Pr}[\text {OD}_i = \sigma ] = \frac{1}{i!}\). Consequently,

$$\begin{aligned} \mathbf {Pr}[\text {OD}_i = \sigma , T_i> t]&= \mathbf {Pr}[\text {OD}_i = \sigma \ |\ T_i> t] \cdot \mathbf {Pr}[T_i> t] \\&= \mathbf {Pr}[\text {OD}_i = \sigma ] \cdot \mathbf {Pr}[T_i > t]. \end{aligned}$$

Therefore, the events \(\{\text {OD}_i=\sigma \}\) and \(\{ T_i > t\}\) are independent.

For any given \(\sigma \in \Sigma _{i-1}\) and \(z \in \{0,1\}\) :

$$\begin{aligned} \mathbf {Pr}[J_i = z\ |\ \text {OD}_{i-1}=\sigma ]&= \lim _{t \rightarrow 0} \mathbf {Pr}[J_i = z, T_{i-1}> t\ |\ \text {OD}_{i-1}=\sigma ] \nonumber \\&= \lim _{t \rightarrow 0}\frac{\mathbf {Pr}[J_i = z\ |\ T_{i-1}> t, \text {OD}_{i-1}=\sigma ] \cdot \mathbf {Pr}[T_{i-1}> t, \text {OD}_{i-1}=\sigma ]}{\mathbf {Pr}[\text {OD}_{i-1}=\sigma ]} \end{aligned}$$
(11)
$$\begin{aligned}&= \lim _{t \rightarrow 0}\frac{\mathbf {Pr}[J_i = z\ |\ T_{i-1}> t] \cdot \mathbf {Pr}[T_{i-1}> t] \cdot \mathbf {Pr}[\text {OD}_{i-1}=\sigma ]}{\mathbf {Pr}[\text {OD}_{i-1}=\sigma ]} \\&= \lim _{t \rightarrow 0} \mathbf {Pr}[J_i = z, T_{i-1} > t]\nonumber \\&= \mathbf {Pr}[J_i = z],\nonumber \end{aligned}$$
(12)

where the step from (11) to (12) holds because the events \(\{ J_i = z\}\) and \(\{ \text {OD}_{i-1}=\sigma \}\) are conditionally independent given \(\{ T_{i-1} > t\}\), and the events \(\{\text {OD}_{i-1}=\sigma \}\) and \(\{ T_{i-1} > t\}\) are independent, as shown above.

Therefore, \(J_i\) and \(\text {OD}_{i-1}\) are independent. Consequently, \(J_i\) is independent of \(J_1,\ldots , J_{i-1}\), since the latter sequence is fully determined by \(\text {OD}_{i-1}\).

$$\begin{aligned} \mathbf {Pr}[J_i = 1]= & {} \mathbf {Pr}[U_i < \min \{U_1, \ldots , U_{i-1} \}] \\= & {} \int _0^1(1-x)^{i-1}dx = \frac{1}{i}. \end{aligned}$$

Thus \(\mathbf {E}[J_i] = \mathbf {Pr}[J_i = 1] = \frac{1}{i}\). By the linearity of expectation, \(\mathbf {E}[J] = \sum _{i \in [m]} \frac{1}{i} \approx \log m\).

Since \(J_1, \ldots , J_m\) are independent, \(\mathbf {Pr}[J > 2 \log m] < m^{-1/3}\) follows from a Chernoff Bound. \(\square \)
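As a sanity check (ours, not part of the proof), the identities \(\mathbf {Pr}[J_i = 1] = \frac{1}{i}\) and \(\mathbf {E}[J] = \sum _{i \in [m]} \frac{1}{i}\) can be verified by simulation: \(J_i\) indicates that \(U_i\) is a strict new minimum of the sequence so far, and the expected number of such record minima is the harmonic number \(H_m\).

```python
import random

def one_trial(m, rng):
    """Draw U_1..U_m i.i.d. Uniform(0,1) and count J, the number of
    indices i at which U_i is a strict new minimum (i.e. J_i = 1)."""
    j = 0
    cur_min = float("inf")
    for _ in range(m):
        u = rng.random()
        if u < cur_min:
            cur_min = u
            j += 1
    return j

def mean_J(m, trials=2000, seed=42):
    """Empirical average of J over independent trials."""
    rng = random.Random(seed)
    return sum(one_trial(m, rng) for _ in range(trials)) / trials

def harmonic(m):
    """H_m = sum_{i in [m]} 1/i, the exact value of E[J]."""
    return sum(1.0 / i for i in range(1, m + 1))
```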

Appendix 2: The Assumption of Tracking m Exactly

We explain here why it suffices to assume that m can be maintained at the coordinator exactly without any cost. First, note that we can always use CountEachSimple to maintain a \((1+\epsilon ^2)\)-approximation of m using \(O(\frac{k}{\epsilon ^2}\log m)\) bits of communication, which will be dominated by the cost of other parts of the algorithm for tracking the Shannon entropy. Second, the additional error introduced for the Shannon entropy by the \(\epsilon ^2 m\) additive error of m is negligible: let \(g_x(m) = f_m(x) - f_m(x-1) = x\log \frac{m}{x} - (x-1)\log \frac{m}{x-1}\), and recall (in the proof of Lemma 7) that \(X > 0.5\) under any \(A\in \mathcal {A'}\). It is easy to verify that

$$\begin{aligned} \left| g_x((1\pm \epsilon ^2)m) - g_x(m)\right| = O(\epsilon ^2) \le O(\epsilon ^2) X, \end{aligned}$$

which is negligible compared with \(|X - \hat{X}| \le O(\epsilon ) X\) (the error introduced by using \(\hat{R}\) to approximate R). Similar arguments also apply to \(X'\), and to the analysis of the Tsallis Entropy.
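The \(O(\epsilon ^2)\) bound can be made explicit: scaling \(m\) by a constant factor \(c\) shifts \(g_x\) by exactly \(\log c\), independently of \(x\), since

$$\begin{aligned} g_x(cm) - g_x(m)&= x\log \frac{cm}{x} - (x-1)\log \frac{cm}{x-1} - \left( x\log \frac{m}{x} - (x-1)\log \frac{m}{x-1}\right) \\&= x\log c - (x-1)\log c = \log c. \end{aligned}$$

Taking \(c = 1 \pm \epsilon ^2\) gives \(\left| g_x((1\pm \epsilon ^2)m) - g_x(m)\right| = \left| \log (1\pm \epsilon ^2)\right| = O(\epsilon ^2)\).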

Appendix 3: Pseudocode for CountEachSimple

Algorithms 10 and 11 describe how we can maintain a \((1 + \epsilon )\)-approximation to the frequency of an element e.

[Algorithms 10 and 11: pseudocode figures not reproduced here.]
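A protocol in this spirit can be sketched as follows. This is an illustrative reconstruction, not necessarily the paper's Algorithms 10 and 11 (all names are ours): each site reports its local count of e to the coordinator whenever that count grows by a \((1+\epsilon )\) factor, so each site sends only \(O(\log _{1+\epsilon } m) = O(\frac{1}{\epsilon }\log m)\) messages, and the coordinator's running sum is always within a \((1+\epsilon )\) factor of the true frequency.

```python
class Coordinator:
    """Keeps the last count reported by each site and sums them."""
    def __init__(self):
        self.reported = {}

    def update(self, site_id, count):
        self.reported[site_id] = count

    def estimate(self):
        # Always an underestimate, and at least true/(1 + eps).
        return sum(self.reported.values())

class Site:
    """One of k sites, tracking local occurrences of element e."""
    def __init__(self, site_id, coordinator, eps):
        self.site_id = site_id
        self.coordinator = coordinator
        self.eps = eps
        self.count = 0
        self.last_reported = 0

    def observe(self):
        """Element e arrives at this site; report only when the local
        count has grown by a (1 + eps) factor since the last report."""
        self.count += 1
        if self.count >= (1 + self.eps) * self.last_reported:
            self.coordinator.update(self.site_id, self.count)
            self.last_reported = self.count

# Usage sketch: 5 sites, each forwarding a random share of 2000 arrivals.
import random
rng = random.Random(1)
eps = 0.1
coord = Coordinator()
sites = [Site(i, coord, eps) for i in range(5)]
true_count = 0
for _ in range(2000):
    rng.choice(sites).observe()
    true_count += 1
```

Between reports, each site's true count is below \((1+\epsilon )\) times its last report, which is what gives the accuracy guarantee at the coordinator.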


Cite this article

Chen, J., Zhang, Q. Improved Algorithms for Distributed Entropy Monitoring. Algorithmica 78, 1041–1066 (2017). https://doi.org/10.1007/s00453-016-0194-z

