
Improved Algorithms for Distributed Entropy Monitoring

Algorithmica
Abstract

Modern data management systems often need to deal with massive, dynamic, and inherently distributed data sources: the data is collected by a distributed network, while a global view of the data is maintained at a central coordinator using a minimal amount of communication. Such applications are captured by the distributed monitoring model, which has attracted a lot of attention in recent years. In this paper we study the monitoring of entropy functions, which are very useful in network monitoring applications such as detecting distributed denial-of-service attacks. Our results improve the previous best results by Arackaparambil et al. (ICALP, pp. 95–106, 2009). Our technical contribution also includes implementing the celebrated AMS sampling method (Alon et al. in J Comput Syst Sci 58(1):137–147, 1999) in the distributed monitoring model, which may be of independent interest.


Notes

  1. \(\tilde{O}(\cdot )\) ignores all polylog terms; see Table 2 for details.

  2. It is well-known that in sensor networks, communication is by far the biggest battery drain [26].

  3. To see this, note that \(\mathbf {E}[X] = \sum _{j \in [m]} \mathbf {E}[X | a_J = j] \mathbf {Pr}[a_J = j] = \sum _{j \in [m]} (\mathbf {E}[f(R) - f(R-1) | a_J = j] \cdot m_j / m)\), and \(\mathbf {E}[f(R) - f(R-1) | a_J = j] = \sum _{k \in [m_j]} \frac{f(k) - f(k-1)}{m_j} = \frac{f(m_j)}{m_j}\).
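The telescoping step in this note can be sanity-checked numerically. The sketch below is ours (not from the paper), with an illustrative choice of \(f\); it verifies that the conditional expectation collapses to \(f(m_j)/m_j\) for any \(f\) with \(f(0) = 0\):

```python
import math

def f(x):
    # Illustrative choice of f with f(0) = 0; the telescoping identity
    # holds for any such f.
    return 0.0 if x == 0 else x * math.log2(x)

def telescoped_mean(mj):
    """E[f(R) - f(R-1) | a_J = j], with R uniform on [m_j]: the sum
    sum_{k=1}^{m_j} (f(k) - f(k-1)) / m_j telescopes to f(m_j) / m_j."""
    return sum((f(k) - f(k - 1)) / mj for k in range(1, mj + 1))
```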

  4. In practice, one can generate a random binary string of, say, \(10\log m\) bits as its rank, and w.h.p. all ranks will be different.
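A quick illustration of this note (ours, not from the paper): with \(10\log m\)-bit ranks, a union bound over the \(\binom{m}{2}\) pairs gives collision probability at most \(m^2/2^{10\log m} = m^{-8}\), which is negligible.

```python
import math
import random

def assign_ranks(m, seed=0):
    """Draw an independent random rank of 10 * ceil(log2 m) bits for each
    of m items; by a union bound, all ranks are distinct w.h.p."""
    rng = random.Random(seed)
    bits = 10 * max(1, math.ceil(math.log2(m)))
    return [rng.getrandbits(bits) for _ in range(m)]
```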

References

  1. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)


  2. Arackaparambil, C., Brody, J., Chakrabarti, A.: Functional monitoring without monotonicity. In: ICALP, pp. 95–106, (2009)


  3. Babcock, B., Olston, C.: Distributed top-k monitoring. In: SIGMOD, pp. 28–39, (2003)

  4. Chakrabarti, A., Ba, K.D., Muthukrishnan, S.: Estimating entropy and entropy norm on data streams. Internet Math. 3(1), 63–78 (2006)


  5. Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for estimating the entropy of a stream. ACM Trans. Alg. 6(3), 51 (2010)


  6. Chan, H.-L., Lam, T.W., Lee, L.-K., Ting, H.-F.: Continuous monitoring of distributed data streams over a time-based sliding window. Algorithmica 62(3–4), 1088–1111 (2012)


  7. Cormode, G., Garofalakis, M.N.: Sketching streams through the net: Distributed approximate query tracking. In: VLDB, pp. 13–24, (2005)

  8. Cormode, G., Garofalakis, M.N., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In: SIGMOD, pp. 25–36, (2005)

  9. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Alg. 55(1), 58–75 (2005)


  10. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. In: SODA, pp. 1076–1085, (2008)

  11. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10 (2012)


  12. Cormode, G., Muthukrishnan, S., Zhuang, W.: What’s different: Distributed, continuous monitoring of duplicate-resilient aggregates on data streams. In: ICDE, p. 57, (2006)

  13. Cormode, G., Muthukrishnan, S., Zhuang, W.: Conquering the divide: Continuous clustering of distributed data streams. In: ICDE, pp. 1036–1045, (2007)

  14. Cormode, G., Yi, K.: Tracking distributed aggregates over time-based sliding windows. In: SSDBM, pp. 416–430, (2012)

  15. Dilman, M., Raz, D.: Efficient reactive monitoring. In: INFOCOM, pp. 1012–1019, (2001)

  16. Ganguly, S., Garofalakis, M.N., Rastogi, R.: Tracking set-expression cardinalities over continuous update streams. VLDB J. 13(4), 354–369 (2004)


  17. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: SPAA, pp. 281–291, (2001)

  18. Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72, (2002)

  19. Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: SODA, pp. 733–742, (2006)

  20. Harvey, N.J.A., Nelson, J., Onak, K.: Sketching and streaming entropy via approximation theory. In: FOCS, pp. 489–498, (2008)

  21. Huang, Z., Yi, K., Zhang, Q.: Randomized algorithms for tracking distributed count, frequencies, and ranks. In: PODS, pp. 295–306, (2012)

  22. Jain, A., Hellerstein, J.M., Ratnasamy, S., Wetherall, D.: A wakeup call for internet monitoring systems: the case for distributed triggers. In: Proceedings of HotNets-III, (2004)

  23. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD, pp. 289–300, (2006)

  24. Lakhina, A., Crovella, M., Diot, C.: Mining anomalies using traffic feature distributions. In: SIGCOMM, pp. 217–228, (2005)

  25. Lall, A., Sekar, V., Ogihara, M., Xu, J.J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: SIGMETRICS/Performance, pp. 145–156, (2006)

  26. Madden, S., Franklin, M.J., Hellerstein, J.M., Hong, W.: The design of an acquisitional query processor for sensor networks. In: SIGMOD, pp. 491–502, (2003)

  27. Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: ICDE, pp. 767–778, (2005)

  28. Olston, C., Jiang, J., Widom, J.: Adaptive filters for continuous queries over distributed data streams. In: SIGMOD, pp. 563–574, (2003)

  29. Sharfman, I., Schuster, A., Keren, D.: A geometric approach to monitoring threshold functions over distributed data streams. In: SIGMOD, pp. 301–312, (2006)

  30. Sharfman, I., Schuster, A., Keren, D.: Shape sensitive geometric monitoring. In: PODS, pp. 301–310, (2008)

  31. Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: DISC, pp. 283–297, (2011)

  32. Woodruff, D.P., Zhang, Q.: Tight bounds for distributed functional monitoring. In: STOC, pp. 941–960, (2012)

  33. Xu, K., Zhang, Z., Bhattacharyya, S.: Profiling internet backbone traffic: behavior models and applications. In: SIGCOMM, pp. 169–180, (2005)

  34. Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. In: PODS, pp. 167–174, (2009)


Acknowledgments

We would like to thank the Algorithmica anonymous reviewers for their insightful reviews.


Corresponding author

Correspondence to Qin Zhang.

Additional information

Work supported in part by NSF CCF-1525024, IIS-1633215, and IU’s office of the Vice Provost for Research through the Faculty Research Support Program.

Appendices

Appendix 1: Proof of Lemma 4

Proof

We can assume that \(U_1,\ldots ,U_m\) are distinct, since the event that \(U_i = U_j\) for some \(i \ne j\) has measure 0. Let \(T_i = \min \{U_1,\ldots ,U_i \}\), let \(\text {OD}_i\) denote the relative order of \(U_1,\ldots ,U_i\), and let \(\Sigma _i\) be the set of all permutations of \([i]\). Since the relative order of \(U_1,\ldots ,U_i\) does not depend on the minimum value of that sequence, for any \(\sigma \in \Sigma _i\) we have \(\mathbf {Pr}[\text {OD}_i = \sigma \ |\ T_i > t] = \mathbf {Pr}[\text {OD}_i = \sigma ] = \frac{1}{i!}\). Consequently,

$$\begin{aligned} \mathbf {Pr}[\text {OD}_i = \sigma , T_i> t]&= \mathbf {Pr}[\text {OD}_i = \sigma \ |\ T_i> t] \cdot \mathbf {Pr}[T_i> t] \\&= \mathbf {Pr}[\text {OD}_i = \sigma ] \cdot \mathbf {Pr}[T_i > t]. \end{aligned}$$

Therefore, the events \(\{\text {OD}_i=\sigma \}\) and \(\{ T_i > t\}\) are independent.

For any given \(\sigma \in \Sigma _{i-1}\) and \(z \in \{0,1\}\) :

$$\begin{aligned} \mathbf {Pr}[J_i = z\ |\ \text {OD}_{i-1}=\sigma ]&= \lim _{t \rightarrow 0} \mathbf {Pr}[J_i = z, T_{i-1}> t\ |\ \text {OD}_{i-1}=\sigma ] \nonumber \\&= \lim _{t \rightarrow 0}\frac{\mathbf {Pr}[J_i = z\ |\ T_{i-1}> t, \text {OD}_{i-1}=\sigma ] \cdot \mathbf {Pr}[T_{i-1}> t, \text {OD}_{i-1}=\sigma ]}{\mathbf {Pr}[\text {OD}_{i-1}=\sigma ]} \end{aligned}$$
(11)
$$\begin{aligned}&= \lim _{t \rightarrow 0}\frac{\mathbf {Pr}[J_i = z\ |\ T_{i-1}> t] \cdot \mathbf {Pr}[T_{i-1}> t] \cdot \mathbf {Pr}[\text {OD}_{i-1}=\sigma ]}{\mathbf {Pr}[\text {OD}_{i-1}=\sigma ]} \\&= \lim _{t \rightarrow 0} \mathbf {Pr}[J_i = z, T_{i-1} > t]\nonumber \\&= \mathbf {Pr}[J_i = z],\nonumber \end{aligned}$$
(12)

where the step from (11) to (12) holds because the events \(\{ J_i = z\}\) and \(\{ \text {OD}_{i-1}=\sigma \}\) are conditionally independent given \(\{ T_{i-1} > t\}\), and the events \(\{\text {OD}_{i-1}=\sigma \}\) and \(\{ T_{i-1} > t\}\) are independent, as shown above.

Therefore, \(J_i\) and \(\text {OD}_{i-1}\) are independent. Consequently, \(J_i\) is independent of \(J_1,\ldots , J_{i-1}\), since the latter sequence is fully determined by \(\text {OD}_{i-1}\).

$$\begin{aligned} \mathbf {Pr}[J_i = 1]= & {} \mathbf {Pr}[U_i < \min \{U_1, \ldots , U_{i-1} \}] \\= & {} \int _0^1(1-x)^{i-1}dx = \frac{1}{i}. \end{aligned}$$

Thus \(\mathbf {E}[J_i] = \mathbf {Pr}[J_i = 1] = \frac{1}{i}\). By the linearity of expectation, \(\mathbf {E}[J] = \sum _{i \in [m]} \frac{1}{i} \approx \log m\).

Since \(J_1, \ldots , J_m\) are independent, \(\mathbf {Pr}[J > 2 \log m] < m^{-1/3}\) follows from a Chernoff Bound. \(\square \)
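As a sanity check (ours, not part of the proof), the identities \(\mathbf {Pr}[J_i = 1] = \frac{1}{i}\) and \(\mathbf {E}[J] = \sum _{i \in [m]} \frac{1}{i}\) can be verified by simulation: \(J_i\) indicates that \(U_i\) is a strict new minimum of the sequence so far, and the expected number of such record minima is the harmonic number \(H_m\).

```python
import random

def one_trial(m, rng):
    """Draw U_1..U_m i.i.d. Uniform(0,1) and count J, the number of
    indices i at which U_i is a strict new minimum (i.e. J_i = 1)."""
    j = 0
    cur_min = float("inf")
    for _ in range(m):
        u = rng.random()
        if u < cur_min:
            cur_min = u
            j += 1
    return j

def mean_J(m, trials=2000, seed=42):
    """Empirical average of J over independent trials."""
    rng = random.Random(seed)
    return sum(one_trial(m, rng) for _ in range(trials)) / trials

def harmonic(m):
    """H_m = sum_{i in [m]} 1/i, the exact value of E[J]."""
    return sum(1.0 / i for i in range(1, m + 1))
```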

Appendix 2: The Assumption of Tracking m Exactly

We explain here why it suffices to assume that m can be maintained at the coordinator exactly without any cost. First, note that we can always use CountEachSimple to maintain a \((1+\epsilon ^2)\)-approximation of m using \(O(\frac{k}{\epsilon ^2}\log m)\) bits of communication, which will be dominated by the cost of other parts of the algorithm for tracking the Shannon entropy. Second, the additional error introduced for the Shannon entropy by the \(\epsilon ^2 m\) additive error of m is negligible: let \(g_x(m) = f_m(x) - f_m(x-1) = x\log \frac{m}{x} - (x-1)\log \frac{m}{x-1}\), and recall (in the proof of Lemma 7) that \(X > 0.5\) under any \(A\in \mathcal {A'}\). It is easy to verify that

$$\begin{aligned} \left| g_x((1\pm \epsilon ^2)m) - g_x(m)\right| = O(\epsilon ^2) \le O(\epsilon ^2) X, \end{aligned}$$

which is negligible compared with \(|X - \hat{X}| \le O(\epsilon ) X\) (the error introduced by using \(\hat{R}\) to approximate R). Similar arguments also apply to \(X'\), and to the analysis of the Tsallis Entropy.
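The \(O(\epsilon ^2)\) bound can be made explicit: scaling \(m\) by a constant factor \(c\) shifts \(g_x\) by exactly \(\log c\), independently of \(x\), since

$$\begin{aligned} g_x(cm) - g_x(m)&= x\log \frac{cm}{x} - (x-1)\log \frac{cm}{x-1} - \left( x\log \frac{m}{x} - (x-1)\log \frac{m}{x-1}\right) \\&= x\log c - (x-1)\log c = \log c. \end{aligned}$$

Taking \(c = 1 \pm \epsilon ^2\) gives \(\left| g_x((1\pm \epsilon ^2)m) - g_x(m)\right| = \left| \log (1\pm \epsilon ^2)\right| = O(\epsilon ^2)\).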

Appendix 3: Pseudocode for CountEachSimple

Algorithms 10 and 11 describe how we can maintain a \((1 + \epsilon )\)-approximation to the frequency of an element e.

[Algorithms 10 and 11: pseudocode figures not reproduced here.]
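A protocol in this spirit can be sketched as follows. This is an illustrative reconstruction, not necessarily the paper's Algorithms 10 and 11 (all names are ours): each site reports its local count of e to the coordinator whenever that count grows by a \((1+\epsilon )\) factor, so each site sends only \(O(\log _{1+\epsilon } m) = O(\frac{1}{\epsilon }\log m)\) messages, and the coordinator's running sum is always within a \((1+\epsilon )\) factor of the true frequency.

```python
class Coordinator:
    """Keeps the last count reported by each site and sums them."""
    def __init__(self):
        self.reported = {}

    def update(self, site_id, count):
        self.reported[site_id] = count

    def estimate(self):
        # Always an underestimate, and at least true/(1 + eps).
        return sum(self.reported.values())

class Site:
    """One of k sites, tracking local occurrences of element e."""
    def __init__(self, site_id, coordinator, eps):
        self.site_id = site_id
        self.coordinator = coordinator
        self.eps = eps
        self.count = 0
        self.last_reported = 0

    def observe(self):
        """Element e arrives at this site; report only when the local
        count has grown by a (1 + eps) factor since the last report."""
        self.count += 1
        if self.count >= (1 + self.eps) * self.last_reported:
            self.coordinator.update(self.site_id, self.count)
            self.last_reported = self.count

# Usage sketch: 5 sites, each forwarding a random share of 2000 arrivals.
import random
rng = random.Random(1)
eps = 0.1
coord = Coordinator()
sites = [Site(i, coord, eps) for i in range(5)]
true_count = 0
for _ in range(2000):
    rng.choice(sites).observe()
    true_count += 1
```

Between reports, each site's true count is below \((1+\epsilon )\) times its last report, which is what gives the accuracy guarantee at the coordinator.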


Cite this article

Chen, J., Zhang, Q. Improved Algorithms for Distributed Entropy Monitoring. Algorithmica 78, 1041–1066 (2017). https://doi.org/10.1007/s00453-016-0194-z

