Sketching distributed sliding-window data streams

  • Regular Paper, The VLDB Journal

Abstract

While traditional data management systems focus on evaluating single, ad hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is “stale” and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 h). Such distributed data streaming applications mandate novel algorithmic solutions that are both time and space efficient (to manage high-speed data streams) and also communication efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point, as well as inner product, queries and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving merging of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.

Notes

  1. The geometric method is trivially extended to handle matrices instead of vectors by applying vectorization to the matrices and adjusting the monitored function to use the corresponding vector dimensions. We use the matrix notation for the sketches only for convenience.

  2. Available from http://www.caida.org/data/

References

  1. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

  2. Arlitt, M., Jin, T.: A workload characterization study of the 1998 World Cup Web site. IEEE Netw. 14(3), 30–37 (2000)

  3. Busch, C., Tirthapura, S.: A deterministic algorithm for summarizing asynchronous streams over a sliding window. In: STACS, pp. 465–476 (2007)

  4. Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for estimating the entropy of a stream. ACM Trans. Algorithms 6(3), 51:1–51:21 (2010)

  5. Chan, H.L., Lam, T.W., Lee, L.K., Ting, H.F.: Continuous monitoring of distributed data streams over a time-based sliding window. Algorithmica 62(3–4), 1088–1111 (2012)

  6. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: ICALP, pp. 693–703 (2002)

  7. Cohen, E., Strauss, M.J.: Maintaining time-decaying stream aggregates. J. Algorithms 59(1), 19–36 (2006)

  8. Cormode, G., Garofalakis, M.: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2) (2008)

  9. Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In: SIGMOD, pp. 25–36 (2005)

  10. Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: Tracking most frequent items dynamically. In: PODS, pp. 296–306 (2003)

  11. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)

  12. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Optimal sampling from distributed streams. In: PODS, pp. 77–86 (2010)

  13. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10:1–10:25 (2012)

  14. Cormode, G., Tirthapura, S., Xu, B.: Time-decaying sketches for robust aggregation of sensor data. SIAM J. Comput. 39(4), 1309–1339 (2009)

  15. Cormode, G., Yi, K.: Tracking distributed aggregates over time-based sliding windows. In: SSDBM, pp. 416–430 (2012)

  16. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)

  17. Dimitropoulos, X.A., Stoecklin, M.P., Hurley, P., Kind, A.: The eternal sunshine of the sketch data structure. Comput. Netw. 52(17), 3248–3257 (2008)

  18. Garofalakis, M.N., Keren, D., Samoladas, V.: Sketch-based geometric monitoring of distributed stream queries. PVLDB 6(10), 937–948 (2013)

  19. Gibbons, P.B.: Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: VLDB, pp. 541–550 (2001)

  20. Gibbons, P.B.: Distinct-values estimation over data streams. Data Stream Management: Processing High-Speed Data Streams. Springer, New York (2007)

  21. Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72 (2002)

  22. Greenwald, M.B., Khanna, S.: Space-efficient online computation of quantile summaries. In: SIGMOD, pp. 58–66 (2001)

  23. Huang, L., Garofalakis, M., Joseph, A., Taft, N.: Communication efficient tracking of distributed cumulative triggers. In: ICDCS (2007)

  24. Huang, L., Nguyen, X., Garofalakis, M., Hellerstein, J., Jordan, M., Joseph, A., Taft, N.: Communication-efficient online detection of network-wide anomalies. In: INFOCOM, pp. 134–142 (2007)

  25. Hung, R.Y.S., Ting, H.F.: Finding heavy hitters over the sliding window of a weighted data stream. In: LATIN, pp. 699–710 (2008)

  26. Jain, A., Hellerstein, J.M., Ratnasamy, S., Wetherall, D.: A wakeup call for internet monitoring systems: the case for distributed triggers. In: SIGCOMM Workshop on Hot Topics in Networks (HotNets) (2004)

  27. Keren, D., Sharfman, I., Schuster, A., Livne, A.: Shape sensitive geometric monitoring. IEEE Trans. Knowl. Data Eng. 24(8), 1520–1535 (2012)

  28. Mirkovic, J., Prier, G., Reiher, P.L.: Attacking DDoS at the source. In: ICNP, pp. 312–321 (2002)

  29. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Found. Trends Theor. Comput. Sci. 1(2), 117–236 (2005)

  30. Olston, C., Jiang, J., Widom, J.: Adaptive filters for continuous queries over distributed data streams. In: SIGMOD, pp. 563–574 (2003)

  31. Papapetrou, O., Garofalakis, M.N., Deligiannakis, A.: Sketch-based querying of distributed sliding-window data streams. PVLDB 5(10), 992–1003 (2012)

  32. Qiao, L., Agrawal, D., El Abbadi, A.: Supporting sliding window queries for continuous data streams. In: SSDBM, pp. 85–96 (2003)

  33. Sharfman, I., Schuster, A., Keren, D.: A geometric approach to monitoring threshold functions over distributed data streams. In: SIGMOD, pp. 301–312 (2006)

  34. Tirthapura, S., Xu, B., Busch, C.: Sketching asynchronous streams over a sliding window. In: PODC, pp. 82–91 (2006)

  35. Xu, B., Tirthapura, S., Busch, C.: Sketching asynchronous data streams over sliding windows. Distrib. Comput. 20(5), 359–374 (2008)

Acknowledgments

This work was supported by the European Commission under ICT-FP7-LEADS-318809 (Large-Scale Elastic Architecture for Data-as-a-Service) and ICT-FP7-FERARI-619491 (Flexible Event pRocessing for big dAta aRchItectures).

Author information

Corresponding author

Correspondence to Odysseas Papapetrou.

Appendix

1.1 Proofs for centralized queries

Lemmas 3 and 4 provide error guarantees for point and inner product queries on ECM-sketches, for any choice of \(\epsilon _{cm}\), \(\epsilon _{sw}\), \(\delta _{cm}\), and \(\delta _{sw}\). Theorems 2 and 3 then derive the values of these parameters that minimize the total cost, given only the acceptable total error \(\epsilon \) and failure probability \(\delta \).
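
To make the notation of the proofs concrete, the following minimal Python sketch (our own illustration, not the paper's implementation; the counter object passed in stands for a sliding-window counter such as an exponential histogram or a randomized wave, with assumed add() and estimate() methods) depicts an ECM-sketch as a \(d\times w\) Count-Min grid of sliding-window counters, with the point estimate \(\hat{f}(x,r)=\min _j E(j,h_j(x),r)\).

    import random

    class ECMSketch:
        """Illustrative ECM-sketch: a d x w Count-Min grid of sliding-window counters."""

        def __init__(self, d, w, make_counter):
            # make_counter() builds one sliding-window counter (e.g., an exponential
            # histogram) exposing add(timestamp) and estimate(query_range).
            self.d, self.w = d, w
            self.grid = [[make_counter() for _ in range(w)] for _ in range(d)]
            self.seeds = [random.randrange(1 << 31) for _ in range(d)]  # one hash seed per row

        def _h(self, j, x):
            # Stand-in for the pairwise-independent hash function of row j.
            return hash((self.seeds[j], x)) % self.w

        def add(self, x, timestamp):
            # Register one arrival of item x: update one counter per row.
            for j in range(self.d):
                self.grid[j][self._h(j, x)].add(timestamp)

        def point_query(self, x, query_range):
            # \hat f(x, r) = min over rows of the window estimate at counter (j, h_j(x)).
            return min(self.grid[j][self._h(j, x)].estimate(query_range)
                       for j in range(self.d))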

Lemma 3

With probability at least \(1-\delta _{cm}-\delta _{sw}\),

$$\begin{aligned} |\hat{f}(x,r)-f(x,r)|\!\le \! {\left\{ \begin{array}{ll} (1\!+\!\epsilon _{sw})\epsilon _{cm} ||a_r||_1 &{}\text {if } \epsilon _{sw}\le \frac{\epsilon _{cm}}{1-\epsilon _{cm}},\\ \epsilon _{sw} ||a_r||_1 &{}\text {if } \epsilon _{sw}\ge \frac{\epsilon _{cm}}{1-\epsilon _{cm}}. \end{array}\right. } \end{aligned}$$

Proof

We start with an overview of the proof. The ECM-sketch estimation is susceptible to two types of errors. The first is due to hash collisions, i.e., two different items may hash to the same ECM-sketch cell; this error is relative to the L1 norm. The second is due to the sliding-window counters and is relative to the counter value, i.e., the number of items hashed to the particular counter. For the lemma, we derive a single error relative to the L1 norm by considering worst-case bounds (i.e., maximum possible values) for the combination of the two errors. We first bound the number of hash collisions that occur at counter \((j,h_j(x))\) for any row \(j\le d\) within the query range \(r\), assuming that the sliding window algorithm offers perfect accuracy. This part follows the accuracy proof for Count-Min sketches [11], differing in the step that estimates the number of hash collisions (the error is expressed relative to \(||a_r||_1-f(x,r)\) instead of \(||a_r||_1\)). We then account for the error caused by the sliding window estimation.

Error due to hash collisions. Temporarily assume that the sliding window structure enables perfect accuracy (the assumption will be lifted later). With \(I_{x,j,y}\) we denote the indicator variables which are 1 if \(x \ne y \wedge h_j(x)=h_j(y)\), and 0 otherwise. We further define the variables \(X_{x,j,r}\) to be \(X_{x,j,r}=\sum _{y \in \mathcal {D}} I_{x,j,y} f(y,r)\). By our assumption that the sliding window estimation is accurate, \(E(j,h_j(x),r) = f(x,r) + X_{x,j,r}\). Since the ECM-sketch will return \(\hat{f}(x,r) = \min _j E(j,h_j(x),r) = f(x,r) + \min _j X_{x,j,r}\) as a frequency estimate, the estimation error will be \(\hat{f}(x,r)-f(x,r) = \min _j X_{x,j,r}\), which can be bounded as follows.

By pairwise independence of the hash functions, \(E(I_{x,j,y})=Pr[h_j(x)=h_j(y)]\le 1/w=\frac{\epsilon _{cm}}{e}\). Therefore, the expected value of \(X_{x,j,r}\) is \(E(X_{x,j,r}) = E\left( \sum _{y \in \mathcal {D}} I_{x,j,y} f(y,r)\right) = \sum _{y \in \mathcal {D}\backslash \{x\}} f(y,r) E(I_{x,j,y}) \le (||a_r||_1-f(x,r))\frac{\epsilon _{cm}}{e}\). Furthermore, by the Markov inequality:

$$\begin{aligned} Pr&[\min _j X_{x,j,r} > \epsilon _{cm} (||a_r||_1 - f(x,r))] = \nonumber \\ Pr&[\forall j: X_{x,j,r} > \epsilon _{cm} (||a_r||_1 - f(x,r))] \le \nonumber \\ Pr&[\forall j: X_{x,j,r} > e E(X_{x,j,r})] < e^{-d} \le \delta _{cm} \end{aligned}$$
(1)

Error due to the sliding window estimation. In practice, the sliding window algorithm may introduce errors in the computation of \(E(j,h_j(x),r)\). Let \(R(j,h_j(x),r)\) denote the exact number of true bits contained within the query range at counter \((j,h_j(x))\). An \((\epsilon _{sw},\delta _{sw})\)-approximate sliding window algorithm then guarantees that \(Pr[|E(j,h_j(x),r)-R(j,h_j(x),r)|\le \epsilon _{sw} R(j,h_j(x),r)]\ge 1-\delta _{sw}\).

Consider the row \(j = \arg \min _j E(j,h_j(x),r)\), i.e., the row with \(E(j,h_j(x),r) = \hat{f}(x,r)\). For the case that \(\hat{f}(x,r) > R(j,h_j(x),r)\), with probability at least \(1-\delta _{sw}\) we have:

$$\begin{aligned} \hat{f}(x,r) - f(x,r) \le (1+\epsilon _{sw}) R(j,h_j(x),r) - f(x,r) = (1+\epsilon _{sw})\left( f(x,r) + X_{x,j,r}\right) - f(x,r) = \epsilon _{sw} f(x,r) + (1+\epsilon _{sw}) X_{x,j,r} \end{aligned}$$

Furthermore, \(X_{x,j,r}\) can be bounded by Inequality 1, giving: \(Pr[\hat{f}(x,r) - f(x,r) \le \epsilon _{cm} (||a_r||_1 - f(x,r)) + \epsilon _{sw} ( f(x,r) + \epsilon _{cm} (||a_r||_1 - f(x,r)))] \ge 1-\delta _{sw} -\delta _{cm}\). For convenience we define \(c=f(x,r)/||a_r||_1\). Then, \(Pr[\hat{f}(x,r) - f(x,r) \le ||a_r||_1 (\epsilon _{cm} (1-c) + \epsilon _{sw} (c+\epsilon _{cm}(1-c) ))] = Pr[\hat{f}(x,r)-f(x,r) \le ||a_r||_1 ( c (\epsilon _{sw} - \epsilon _{cm} - \epsilon _{sw}\epsilon _{cm}) + \epsilon _{cm} + \epsilon _{sw}\epsilon _{cm})] \ge 1-\delta _{sw} -\delta _{cm}\).

Variable \(c\) takes values between 0 and 1 (inclusive). When \(\epsilon _{sw}\le \epsilon _{cm}/(1-\epsilon _{cm})\), the RHS of the inequality (the error) is maximized for \(c=0\). Otherwise, the RHS is maximized for \(c=1\). Therefore, with a probability of at least \(1-\delta _{cm}-\delta _{sw}\):

$$\begin{aligned} \hat{f}(x,r) - f(x,r) \le \!&{\left\{ \begin{array}{ll} ||a_r||_1 \epsilon _{cm} (1\!+\!\epsilon _{sw}) &{}\quad \text {if }\quad \epsilon _{sw}\le \frac{\epsilon _{cm}}{1-\epsilon _{cm}},\\ ||a_r||_1 \epsilon _{sw} &{}\quad \text {if }\quad \epsilon _{sw}\ge \frac{\epsilon _{cm}}{1-\epsilon _{cm}}. \end{array}\right. } \end{aligned}$$
(2)

With a similar analysis, the case of \(\hat{f}(x,r)<R(j,h_j(x),r)\) gives a much tighter constraint:

$$\begin{aligned} Pr[f(x,r) - \hat{f}(x,r) \le \epsilon _{sw} f(x,r) ] \ge 1-\delta _{sw} \end{aligned}$$
(3)

The lemma follows directly by inequalities 2 and 3.\(\square \)
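
As a numeric illustration of the two cases (the parameter values are arbitrary), take \(\epsilon _{cm}=0.1\), so that \(\frac{\epsilon _{cm}}{1-\epsilon _{cm}}\approx 0.111\):

$$\begin{aligned} \epsilon _{sw}=0.05\le 0.111&:\quad |\hat{f}(x,r)-f(x,r)|\le (1+0.05)\cdot 0.1\, ||a_r||_1 = 0.105\, ||a_r||_1,\\ \epsilon _{sw}=0.2\ge 0.111&:\quad |\hat{f}(x,r)-f(x,r)|\le 0.2\, ||a_r||_1, \end{aligned}$$

each with probability at least \(1-\delta _{cm}-\delta _{sw}\).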

Theorem 3

Proof

We first consider an ECM-sketch with a deterministic sliding window structure, e.g., an exponential histogram. We want to derive the combination of \(\epsilon _{cm}\) and \(\epsilon _{sw}\) that minimizes the space complexity of the sketch for a given \(\epsilon \), i.e., minimizes \(C(\epsilon )=O(\frac{\ln (1/\delta _{cm})}{\epsilon _{cm}\epsilon _{sw}})\). We study the two cases of Lemma 3 separately:

Case 1 (\(\epsilon _{sw}\le \frac{\epsilon _{cm}}{1-\epsilon _{cm}}\)): We first exploit the fact that

$$\begin{aligned} \epsilon =(1+\epsilon _{sw})\epsilon _{cm} \end{aligned}$$
(4)

to eliminate \(\epsilon _{sw}\) from the space complexity of the sketch:

$$\begin{aligned} C(\epsilon )&=O\left( \frac{\ln (1/\delta _{cm})}{\epsilon _{sw} \epsilon _{cm}}\right) = O\left( \frac{\ln (1/\delta _{cm})}{\frac{\epsilon -\epsilon _{cm}}{\epsilon _{cm}} \epsilon _{cm}}\right) \\&= O\left( \frac{\ln (1/\delta _{cm})}{\epsilon -\epsilon _{cm}}\right) \end{aligned}$$

The cost is minimized when the denominator is maximized. For a fixed \(\epsilon \) chosen by the user, this happens when \(\epsilon _{cm}\) is minimized. The smallest \(\epsilon _{cm}\) satisfying Eq. (4) and the precondition of Case 1 is \(\epsilon _{cm}=\frac{\epsilon }{1+\epsilon }\), resulting in \(\epsilon _{sw}=\epsilon \).

Case 2 (\(\epsilon _{sw}\ge \frac{\epsilon _{cm}}{1-\epsilon _{cm}}\)): By Lemma 3, setting \(\epsilon _{sw}=\epsilon \) achieves the required accuracy. To derive \(\epsilon _{cm}\), notice that we want to minimize the complexity \(C(\epsilon )=O(\frac{\ln (1/\delta _{cm})}{\epsilon _{cm}\epsilon _{sw}})\), which is achieved by maximizing \(\epsilon _{cm}\). The maximum value of \(\epsilon _{cm}\) satisfying the precondition of Case 2 is \(\epsilon _{cm}=\frac{\epsilon }{1+\epsilon }\).

Notice that both cases lead to the same combination for minimizing the cost, i.e., \(\epsilon _{cm}=\frac{\epsilon }{1+\epsilon }\) and \(\epsilon _{sw}=\epsilon \).

The case of randomized waves is similar. The cost function becomes \(O(\ln (1/\delta _{cm})\ln (1/\delta _{sw})/(\epsilon _{cm}\epsilon _{sw}^2))\), with the constraint that \(\delta =\delta _{cm}+\delta _{sw}\). The cost is minimized for \(\delta _{cm}=\delta _{sw}=\delta /2\), \(\epsilon _{cm}=\frac{\epsilon }{1+\epsilon }\) and \(\epsilon _{sw}=\epsilon \).\(\square \)
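
The parameter setting of Theorem 3 is straightforward to compute. The helper below is our own illustrative sketch (not code from the paper); it assumes that a deterministic sliding-window counter (e.g., an exponential histogram) contributes no failure probability, so \(\delta _{cm}=\delta \), whereas for randomized waves the proof splits \(\delta _{cm}=\delta _{sw}=\delta /2\).

    def point_query_params(eps, delta, deterministic_window=True):
        """Return (eps_cm, eps_sw, delta_cm, delta_sw) for point queries (Theorem 3)."""
        eps_cm = eps / (1.0 + eps)   # eps_cm = eps / (1 + eps)
        eps_sw = eps                 # eps_sw = eps
        if deterministic_window:     # e.g., exponential histograms
            return eps_cm, eps_sw, delta, 0.0
        return eps_cm, eps_sw, delta / 2.0, delta / 2.0  # randomized waves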

Lemma 4

With probability at least \(1-\delta _{cm}\),

$$\begin{aligned} |\widehat{a_r \odot b_r} - a_r \odot b_r | \le {\left\{ \begin{array}{ll} ||a_r||_1||b_r||_1 \epsilon _{cm}(1+\epsilon _{sw})^2 \\ \text {if } \epsilon _{cm}\ge \frac{\epsilon _{sw}^2+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2},\\ ||a_r||_1||b_r||_1(\epsilon _{sw}^2 + 2\epsilon _{sw}) \\ \text {if } \epsilon _{cm}\le \frac{\epsilon _{sw}^2+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}. \end{array}\right. } \end{aligned}$$

Proof

We first examine the case that \(\widehat{a_r \odot b_r} > a_r \odot b_r\). Consider the estimation derived by any single row \(j\) of the ECM-sketch. With \(E_a(i,j,r)\) we denote the frequency estimation of the sliding window counter at position \((i,j)\) for stream \(a\) and for query range \(r\).

$$\begin{aligned} E((&\widehat{a_r \odot b_r})_j \!- a_r \odot b_r) \!=\!\! \sum _{i=1}^w E_a(i, j, r) E_b(i, j, r) \!- a_r \odot b_r \nonumber \\ \le&\!\sum _{i=1}^w \!\sum _{\begin{array}{c} p \in \mathcal {D}, \\ h_j(p)=i \end{array}} f_a(p,r) \sum _{\begin{array}{c} q \in \mathcal {D}, \\ h_j(q)=i \end{array}} f_b(q,r) * (1\!+\epsilon _{sw})^2 \!- a_r \odot b_r \nonumber \\ =&\sum _{i=1}^w \sum _{\begin{array}{c} x \in \mathcal {D}, \\ h_j(x)=i \end{array}} f_a(x,r) f_b(x,r) * (1+\epsilon _{sw})^2 \nonumber \\&+\sum _{i=1}^w \sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q, \\ h_j(p)=h_j(q)=i \end{array}} f_a(p,r) f_b(q,r) * (1\!+\!\epsilon _{sw})^2 - a_r \odot b_r \nonumber \\ =&(1+\epsilon _{sw})^2 \left( \sum _{\begin{array}{c} x \in \mathcal {D} \end{array}} f_a(x,r) f_b(x,r)\right. \nonumber \\&\left. +\sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q, \\ h_j(p)=h_j(q) \end{array}} f_a(p,r) f_b(q,r)\right) - a_r \odot b_r \nonumber \\ =&a_r \odot b_r \left( \epsilon _{sw}^2 +2\epsilon _{sw}\right) \nonumber \\&+ \sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q, \\ h_j(p)=h_j(q) \end{array}} f_a(p,r) f_b(q,r)(1+\epsilon _{sw})^2 \end{aligned}$$
(5)

Our next step is to bound \(\sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q, \\ h_j(p)=h_j(q) \end{array}} f_a(p,r) f_b(q,r)\). For convenience we use \(X_{i,j,r}\) as a shortcut for \(\sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q, \\ h_j(p)=h_j(q) \end{array}} f_a(p,r) f_b(q,r)\). Then,

$$\begin{aligned} E&(X_{i,j,r}) =\! \sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q \end{array}} Pr[h_j(p)=h_j(q)] f_a(p,r) f_b(q,r) \\ =&\frac{1}{w} \sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q \end{array}} f_a(p,r) f_b(q,r)\\ \le&\frac{\epsilon _{cm}}{e} (\sum _{\begin{array}{c} p,q \in \mathcal {D} \end{array}} f_a(p,r) f_b(q,r) - a_r \odot b_r) \end{aligned}$$

\(X_{i,j,r}\) can be bounded by Markov inequality:

$$\begin{aligned} Pr&[\min _j X_{i,j,r} > \epsilon _{cm} (\sum _{\begin{array}{c} p,q \in \mathcal {D} \end{array}} f_a(p,r) f_b(q,r) - a_r \odot b_r)] = \nonumber \\ Pr&[\forall j : X_{i,j,r} > e E(X_{i,j,r})] < e^{-d} \le \delta _{cm} \end{aligned}$$
(6)

Let \(c=\frac{a_r \odot b_r}{||a_r||_1 ||b_r||_1}\). Combining Eq. (5) and Inequality 6:

$$\begin{aligned}&\widehat{a_r \odot b_r} - a_r \odot b_r \\\le & {} a_r \odot b_r (\epsilon _{sw}^2 +2\epsilon _{sw})\\&+ \sum _{\begin{array}{c} p,q \in \mathcal {D}, p \ne q, \\ h_j(p)=h_j(q) \end{array}} f_a(p,r) f_b(q,r)(1+\epsilon _{sw})^2\\< & {} a_r \odot b_r (\epsilon _{sw}^2 +2\epsilon _{sw}) + (1+\epsilon _{sw})^2 \min _j X_{i,j,r} \\\le & {} a_r \odot b_r (\epsilon _{sw}^2 +2\epsilon _{sw}) \\&+ (1+\epsilon _{sw})^2 \epsilon _{cm}\left( \sum _{\begin{array}{c} p,q \in \mathcal {D} \end{array}} f_a(p,r) f_b(q,r) - a_r \odot b_r\right) \\= & {} a_r \odot b_r (\epsilon _{sw}^2 +2\epsilon _{sw})\\&+ (1+\epsilon _{sw})^2 \epsilon _{cm}( ||a_r||_1 ||b_r||_1 - a_r \odot b_r)\\= & {} c||a_r||_1 ||b_r||_1 (\epsilon _{sw}^2 +2\epsilon _{sw})\\&+ \epsilon _{cm} (1+\epsilon _{sw})^2 ||a_r||_1 ||b_r||_1 (1-c)\\= & {} ||a_r||_1 ||b_r||_1 \left( c (\epsilon _{sw}^2 +2\epsilon _{sw})\right. \\&+ \left. \epsilon _{cm}(1+\epsilon _{sw})^2 (1-c)\right) \end{aligned}$$

with probability at least \(1-\delta _{cm}\).

The values of \(c\) that maximize the error (the RHS) are \(c=1\) when \(\epsilon _{cm}< \frac{\epsilon _{sw}^2+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}\), and \(c=0\) when \(\epsilon _{cm} \ge \frac{\epsilon _{sw}^2+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}\). The corresponding maximum errors are \(||a_r||_1 ||b_r||_1 (\epsilon _{sw}^2 +2\epsilon _{sw})\) (for \(c=1\)), and \(||a_r||_1 ||b_r||_1 \epsilon _{cm}(1+\epsilon _{sw})^2\) (for \(c=0\)).

With a similar analysis, the case of \(\widehat{a_r \odot b_r} < a_r \odot b_r\) gives a tighter constraint: \(Pr[a_r \odot b_r - \widehat{a_r \odot b_r} > (\epsilon _{sw}^2 + 2\epsilon _{sw}) a_r \odot b_r]< \delta _{sw}\). The lemma follows directly.\(\square \)
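
For a concrete instance of the two cases (arbitrary parameter values), take \(\epsilon _{sw}=0.1\), so that \(\frac{\epsilon _{sw}^2+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}=\frac{0.21}{1.21}\approx 0.174\):

$$\begin{aligned} \epsilon _{cm}=0.2\ge 0.174&:\quad |\widehat{a_r \odot b_r} - a_r \odot b_r| \le 0.2\cdot (1.1)^2\, ||a_r||_1||b_r||_1 = 0.242\, ||a_r||_1||b_r||_1,\\ \epsilon _{cm}=0.1\le 0.174&:\quad |\widehat{a_r \odot b_r} - a_r \odot b_r| \le (0.01+0.2)\, ||a_r||_1||b_r||_1 = 0.21\, ||a_r||_1||b_r||_1, \end{aligned}$$

each with probability at least \(1-\delta _{cm}\).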

Theorem 2

Proof

Similar to the analysis for point queries, we need to consider the two cases of Lemma 4 separately.

Case 1 (\(\epsilon _{cm}\ge \frac{\epsilon ^2_{sw}+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}\)): By Lemma 4, we set \(\epsilon _{cm}(1+\epsilon _{sw})^2=\epsilon \) in order to achieve the required accuracy. The space complexity then becomes \(C(\epsilon )=\frac{1}{\epsilon _{sw}\epsilon _{cm}}= \frac{(1+\epsilon _{sw})^2}{ \epsilon \epsilon _{sw}}\). Since \(\frac{(1+\epsilon _{sw})^2}{\epsilon \epsilon _{sw}}\) is strictly decreasing for \(\epsilon _{sw}\) in the interval \([0,1]\), we can minimize the space complexity by setting the maximum value for \(\epsilon _{sw}\) satisfying the case’s precondition \(\epsilon _{cm}\ge \frac{\epsilon ^2_{sw}+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}\), i.e., \(\epsilon _{sw} = \sqrt{\epsilon +1}-1\). Then, \(\epsilon _{cm}\) becomes equal to \(\frac{\epsilon }{\epsilon +1}\).

Case 2 (\(\epsilon _{cm}\le \frac{\epsilon ^2_{sw}+2\epsilon _{sw}}{(\epsilon _{sw}+1)^2}\)): By Lemma 4, in order to achieve the required accuracy we need to set \(\epsilon _{sw}^2+ 2\epsilon _{sw}=\epsilon \Rightarrow \epsilon _{sw}=\sqrt{\epsilon +1}-1\). Accordingly, the maximum \(\epsilon _{cm}\) satisfying the precondition is \(\epsilon _{cm}=\frac{\epsilon }{\epsilon +1}\).

Notice that, similar to the point queries analysis, the two cases lead to the same configuration for minimizing the cost, i.e., \(\epsilon _{cm}=\frac{\epsilon }{\epsilon +1}\) and \(\epsilon _{sw}=\sqrt{\epsilon +1}-1\).\(\square \)
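
Analogously, a small helper (again our own sketch, not the paper's code) computes the inner-product configuration of Theorem 2 and verifies numerically that both case bounds of Lemma 4 collapse to the target \(\epsilon \) under this choice.

    import math

    def inner_product_params(eps):
        """Return (eps_cm, eps_sw) for inner-product queries (Theorem 2)."""
        eps_cm = eps / (eps + 1.0)
        eps_sw = math.sqrt(eps + 1.0) - 1.0
        # Both error expressions of Lemma 4 evaluate to eps for this configuration:
        assert abs(eps_cm * (1.0 + eps_sw) ** 2 - eps) < 1e-9
        assert abs(eps_sw ** 2 + 2.0 * eps_sw - eps) < 1e-9
        return eps_cm, eps_sw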

1.2 Proofs for distributed setups

Theorem 4 derives worst-case error bounds for the merging of exponential histograms. Lemma 2 and Theorem 5 prove the correctness of the algorithm for continuous self-join size queries.

Theorem 4

Proof

We argue that \(EH_\oplus \) approximates the exponential histogram of the logical stream, with a maximum relative error of \(\epsilon +\epsilon '+\epsilon \epsilon '\), where \(\epsilon \) is the error parameter of the initial exponential histograms. Consider a query for the last \(q\) time units. With \(s_q=t-q\) we denote the query starting time. Let \(Q\) denote the index of the bucket of \(EH_\oplus \) which contains \(s_q\) in its range, i.e., \(s(EH_\oplus ^Q) \le s_q \le e(EH_\oplus ^Q)\). With \(i\) and \(\hat{i}\) we denote the accurate and estimated number of true bits in the query range. According to the estimation algorithm, the estimation for the number of true bits in the stream will be \(\hat{i}=1/2 |EH_\oplus ^Q| + \sum _{1\le Y<Q}{|EH_\oplus ^Y|}\). This estimation may be influenced by two types of approximation errors: (a) a possible approximation error of the overlap of bucket \(EH_\oplus ^Q\) with the query range, denoted as \(\text {err}_1\), and (b) a possible approximation error of \(i\), denoted as \(\text {err}_2\), because of the inclusion of data that arrived before \(s_q\) in buckets \(Y \le Q\), or data that arrived after \(s_q\) in buckets \(Y > Q\). Let us now look into these two errors in more detail.

With respect to \(\text {err}_2\), recall that the contents of individual buckets are inserted into \(EH_\oplus \) using the starting time and the ending time of the buckets. Therefore, it may happen that some bits arrive before \(s_q\) but are inserted into \(EH_\oplus \) with a timestamp after \(s_q\), creating ‘false positives’. The opposite is also possible. These bits are called out-of-order bits with respect to \(s_q\). Clearly, out-of-order bits may lead to underestimation or overestimation of the query answer. According to Lemma 5, the number of out-of-order bits originating from each exponential histogram \(EH_x\) is at most \(\epsilon i_x\), with \(i_x\) denoting the accurate number of true bits that were inserted in \(EH_x\) at or after \(s_q\). The number of out-of-order bits from all streams is then bounded as follows: \(\text {err}_2 \le \sum _{x=1}^n \epsilon i_x = \epsilon \sum _{x=1}^n i_x = \epsilon i\).

Underestimation or overestimation of the overlap may also happen because of the halving of the size of bucket \(EH_\oplus ^Q\) during query time (\(\text {err}_1\)). As shown in [16], this process may introduce a maximum relative error of \(\epsilon ' r\), where \(r\) is the sum of the sizes of all buckets in \(EH_\oplus \) with an index lower than \(Q\) (i.e., with a starting time at least equal to \(s_q\)). Recall that \(r\) may also include bits that have arrived before \(s_q\) (the out-of-order bits), whose number is, however, upper bounded by \(\epsilon i\), as discussed before. Therefore, the maximum underestimation or overestimation error is \(\text {err}_1 = \epsilon ' r \le \epsilon ' (i+ \epsilon i) = \epsilon ' i + \epsilon \epsilon ' i\), with \(i=\sum _{x=1}^n i_x\).

Summing \(\text {err}_1\) and \(\text {err}_2\), we get a maximum relative error of \((\epsilon +\epsilon '+ \epsilon \epsilon ')\), which completes the proof.\(\square \)
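
As a quick numeric check (arbitrary values): merging exponential histograms built with \(\epsilon =0.05\) into an \(EH_\oplus \) with error parameter \(\epsilon '=0.05\) yields a worst-case relative error of

$$\begin{aligned} \epsilon +\epsilon '+\epsilon \epsilon ' = 0.05+0.05+0.05\cdot 0.05 = 0.1025, \end{aligned}$$

i.e., slightly more than double the per-stream guarantee.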

Lemma 5

Consider an individual exponential histogram \(EH_x\) of stream \(X\), configured with error parameter \(\epsilon \). The out-of-order bits with respect to the query starting time \(s_q\) that \(EH_x\) can generate are at most \(\epsilon i_x\), with \(i_x\) denoting the number of true bits arriving at or after \(s_q\) in \(X\).

Proof

Due to the non-decreasing nature of bucket timestamps, there can be only one bucket with a start time less than \(s_q\) and an end time greater than or equal to \(s_q\). Let this bucket be \(EH_x^j\). All other buckets have both their starting and ending times on the same side of \(s_q\), and therefore their contents are always inserted with a timestamp on the correct side of \(s_q\) and do not create out-of-order bits.

Since the ending time of \(EH_x^j\) is at or after \(s_q\), its most recent true bit has arrived at or after \(s_q\), and should be included in the query range. Therefore, the number of true bits arriving at or after \(s_q\) in stream \(X\) is \(i_x \ge 1+\sum _{b=1}^{j-1}|EH_x^b|\). Furthermore, since half of the bits of \(EH_x^j\) are inserted using the ending time and half using the starting time of the bucket, the maximum number of out-of-order bits is \(|EH_x^j|/2\). By construction (invariant 1):

$$\begin{aligned} \frac{\left| EH_x^j\right| }{2\left( 1+\displaystyle \sum _{b=1}^{j-1}\left| EH_x^b\right| \right) }\le \epsilon \Rightarrow \frac{\left| EH_x^j\right| }{2} \le \epsilon \left( 1+\sum _{b=1}^{j-1}\left| EH_x^b\right| \right) \le \epsilon i_x \end{aligned}$$

\(\square \)
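
To illustrate the invariant with arbitrary numbers, take \(\epsilon =0.25\) and a straddling bucket of size \(|EH_x^j|=8\). The invariant forces \(1+\sum _{b=1}^{j-1}|EH_x^b| \ge 16\), and since \(i_x \ge 1+\sum _{b=1}^{j-1}|EH_x^b|\), the out-of-order bits are at most

$$\begin{aligned} \frac{|EH_x^j|}{2} = 4 \le 0.25 \cdot 16 \le \epsilon \, i_x. \end{aligned}$$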

Lemma 2

Proof

The proof relies on the following properties of the \(\min \):

Monotonicity:

If \(\mathbf {x}[i]\le \mathbf {y}[i]\) for all \(i\), then \(\min _i\{\mathbf {x}[i]\}\le \min _i\{\mathbf {y}[i]\}\).

Distributivity:

For any monotonically increasing function \(f\), \(\min _i\{f(\mathbf {x}[i])\} = f(\min _i\{\mathbf {x}[i]\})\).

We want to derive sufficient conditions such that \((1 - \theta ) \varvec{f}(\mathbf {v}(t)) \le \varvec{f}(\mathbf {v}(t_0)) \le (1 + \theta ) \varvec{f}(\mathbf {v}(t))\), with \(\varvec{f}(\mathbf {v}(t))=\min _{\mathrm{row}=1}^d\{||\mathbf {v}[\mathrm{row}]||^2 \}\). By the distributivity property of the \(\min \) for monotonically increasing functions (i.e., the square root), it is sufficient to verify:

$$\begin{aligned} \sqrt{\frac{\varvec{f}(\mathbf {v}(t_0))}{1+\theta }} \le \min _{\mathrm{row}=1}^d \{||\mathbf {v}(t)[\mathrm{row}]||\} \le \sqrt{\frac{\varvec{f}(\mathbf {v}(t_0))}{1-\theta }}. \end{aligned}$$

By the triangle inequality:

$$\begin{aligned} ||\mathbf {v}(t)[\mathrm{row}] \!-\! \mathbf {v}(t_0)[\mathrm{row}]||\le & {} \sum _{j=1}^n ||\mathbf {v}_j(t)[\mathrm{row}] \!-\! \mathbf {v}_j(t_0)[\mathrm{row}]|| \nonumber \\= & {} \sum _{j=1}^n \varvec{d}_j[\mathrm{row}] = n \varvec{d}[\mathrm{row}] \Rightarrow \nonumber \\ ||\mathbf {v}(t_0)[\mathrm{row}]|| - n\varvec{d}[\mathrm{row}]\le & {} ||\mathbf {v}(t)[\mathrm{row}]|| \nonumber \\\le & {} ||\mathbf {v}(t_0)[\mathrm{row}]|| + n\varvec{d}[\mathrm{row}] \end{aligned}$$
(7)

Notice that \(||\mathbf {v}(t_0)[\mathrm{row}]||\) is constant per synchronization. Therefore, Inequality 7 bounds \(||\mathbf {v}(t)[\mathrm{row}]||\) by a linear relation of \(\varvec{d}\), i.e., it allows us to form threshold-crossing queries in the \(R^d\) space. By monotonicity of the min, it suffices to monitor the following conditions:

$$\begin{aligned} \min _{i=1}^d\{||\mathbf {v}(t_0)[i]|| + n\varvec{d}[i]\} \le \sqrt{\frac{\varvec{f}(\mathbf {v}(t_0))}{1-\theta }} \end{aligned}$$

and

$$\begin{aligned} \min _{i=1}^d\{||\mathbf {v}(t_0)[i]|| - n\varvec{d}[i]\} \ge \sqrt{\frac{\varvec{f}(\mathbf {v}(t_0))}{1+\theta }}. \end{aligned}$$

The lemma follows directly, by dividing both sides of the conditions by \(n\). \(\square \)
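
As an illustration only (our own sketch; the function name and the per-row drift input are assumptions, and the full geometric-monitoring protocol involves additional machinery such as local constraints and synchronizations), the two conditions above can be checked at a node as follows, with \(\varvec{f}(\mathbf {v}(t_0))=\min _i ||\mathbf {v}(t_0)[i]||^2\).

    import math

    def local_conditions_hold(ref_row_norms, drift_bounds, n, theta):
        """Check the two per-node threshold conditions derived in Lemma 2.

        ref_row_norms[i]: ||v(t_0)[i]|| / n for row i of the reference sketch,
                          fixed at the last synchronization.
        drift_bounds[i]:  a bound on the per-row drift d[i] observed locally.
        """
        f_ref = min((n * x) ** 2 for x in ref_row_norms)    # f(v(t_0))
        upper = math.sqrt(f_ref / (1.0 - theta)) / n
        lower = math.sqrt(f_ref / (1.0 + theta)) / n
        ok_upper = min(x + d for x, d in zip(ref_row_norms, drift_bounds)) <= upper
        ok_lower = min(x - d for x, d in zip(ref_row_norms, drift_bounds)) >= lower
        return ok_upper and ok_lower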

Theorem 5

Proof Sketch:

By construction, all counters of \(\mathbf {v}_i^u(t)\) are at least equal to the corresponding counters of \(\mathbf {v}_i(t)\). Therefore, the self-join size estimate for \(\mathbf {v}_i^u(t)\) will be at least equal to the self-join size estimate for \(\mathbf {v}_i(t)\) at all times. Using Lemma 2 to monitor \(\mathbf {v}\) but only considering the shifts which increase the counters, we get that if \(\min _{\mathrm{row}=1}^d \{ \frac{||\mathbf {v}(t_0)[\mathrm{row}]||}{n}+\varvec{d}^u[\mathrm{row}]\} \le \frac{1}{n} \sqrt{\frac{\varvec{f}(\mathbf {v}(t_0))}{1-\theta }}\), then \(\varvec{f}(\mathbf {v}(t_0)) \le (1 + \theta ) \varvec{f}(\mathbf {v}(t))\). The lower bound is shown analogously.\(\square \)

Cite this article

Papapetrou, O., Garofalakis, M. & Deligiannakis, A. Sketching distributed sliding-window data streams. The VLDB Journal 24, 345–368 (2015). https://doi.org/10.1007/s00778-015-0380-7
