Constructing fading histograms from data streams

Sebastião, Raquel; Gama, João; Mendonça, Teresa

doi:10.1007/s13748-014-0050-9

Regular Paper
Published: 11 April 2014

Progress in Artificial Intelligence, Volume 3, pages 15–28 (2014)

Raquel Sebastião^1,2,
João Gama^1,3 &
Teresa Mendonça^2,4

Abstract

The ability to collect data is changing drastically. Nowadays, data are gathered in the form of transient and finite data streams. Memory restrictions preclude keeping all received data in memory. When dealing with massive data streams, it is mandatory to create compact representations of data, also known as synopsis structures or summaries. Reducing memory occupancy is of utmost importance when handling a huge amount of data. This paper addresses the problem of constructing histograms from data streams under error constraints. When constructing online histograms from data streams, there are two main characteristics to embrace: the updating facility and the error of the histogram. Moreover, in dynamic environments, besides the need for compact summaries that capture the most important properties of data, it is also essential to forget old data. Therefore, this paper presents sliding histograms and fading histograms, an abrupt and a smooth strategy, respectively, for forgetting outdated data.

Notes

The square error is one of the most used error measures in histogram construction. It is also known as the V-Optimal measure and was introduced by [14].
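
As a rough illustration of the square error (a sketch, not the authors' code): each bucket approximates the frequencies of the values it covers by their mean, and the error is the sum of squared deviations from that mean over all buckets. The function name `square_error` and the way bucket boundaries are encoded are assumptions made for this example.

```python
def square_error(freqs, boundaries):
    """Sum of squared errors of a histogram over a frequency vector.

    freqs      -- observed frequencies, one per distinct value
    boundaries -- hypothetical bucket layout: index where each bucket
                  starts, followed by len(freqs) as the final end point
    """
    total = 0.0
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        bucket = freqs[start:end]
        mean = sum(bucket) / len(bucket)
        # Every value in the bucket is approximated by the bucket mean.
        total += sum((f - mean) ** 2 for f in bucket)
    return total


freqs = [2, 2, 8, 8, 5, 5]
print(square_error(freqs, [0, 2, 4, 6]))  # buckets match the data: 0.0
print(square_error(freqs, [0, 3, 6]))     # coarser buckets: 30.0
```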

References

Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the 21st ACM SIGMOD–SIGACT–SIGART Symposium on Principles of Database Systems, PODS ’02, pp. 1–16. ACM, New York (2002). doi:10.1145/543613.543615
Barbará, D.: Requirements for clustering data streams. SIGKDD Explor. Newsl. 3(2), 23–27 (2002). doi:10.1145/507515.507519
Chakrabarti, K., Garofalakis, M.N., Rastogi, R., Shim, K.: Approximate query processing using wavelets. In: Abbadi, A.E., Brodie, M.L., Chakravarthy, S., Dayal, U., Kamel, N., Schlageter, G., Whang, K.Y. (eds.) VLDB 2000. Proceedings of 26th International Conference on Very Large Data Bases, 10–14 September 2000, Cairo, pp. 111–122. Morgan Kaufmann, Burlington (2000)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005). doi:10.1016/j.jalgor.2003.12.001
Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: tracking most frequent items dynamically. ACM Trans. Database Syst. 30(1), 249–278 (2005). doi:10.1145/1061318.1061325
Correa, M., Bielza, C., Pamies-Teixeira, J.: Comparison of bayesian networks and artificial neural networks for quality detection in a machining process. Expert Syst. Appl. 36(3), 7270–7279 (2009). http://dblp.uni-trier.de/db/journals/eswa/eswa36.html#CorreaBP09
Freedman, D., Diaconis, P.: On the histogram as a density estimator: L2 theory. Probab. Theory Relat. Fields 57(4), 453–476 (1981). doi:10.1007/BF01025868
Gama, J., Sebastião, R., Rodrigues, P.P.: On evaluating stream learning algorithms. Mach. Learn. 90(3), 317–346 (2013)
Gibbons, P.B., Matias, Y.: Synopsis data structures for massive data sets. In: ACM–SIAM Symposium on Discrete Algorithms, pp. 909–910 (1999). doi:10.1145/314500.315083
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: One-pass wavelet decompositions of data streams. IEEE Trans. Knowl. Data Eng. 15(3), 541–554 (2003). doi:10.1109/TKDE.2003.1198389
Guha, S., Koudas, N., Shim, K.: Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst. 31(1), 396–438 (2006). doi:10.1145/1132863.1132873
Guha, S., Shim, K., Woo, J.: Rehist: relative error histogram construction algorithms. In: Proceedings of the 30th International Conference on Very Large Data Bases, pp. 300–311 (2004)
Ioannidis, Y.: The history of histograms (abridged). In: VLDB Endowment. Proceedings of the 29th International Conference on Very Large Data Bases, vol. 29, VLDB ’03, pp. 19–30 (2003). http://dl.acm.org/citation.cfm?id=1315451.1315455
Ioannidis, Y.E., Poosala, V.: Balancing histogram optimality and practicality for query result size estimation. In: Carey, M.J., Schneider, D.A. (eds.) Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, 22–25 May 1995, pp. 233–244. ACM Press, New York (1995)
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: Proceedings of the 24th International Conference on Very Large Data Bases, VLDB ’98, pp. 275–286. Morgan Kaufmann Publishers Inc., San Francisco (1998). http://dl.acm.org/citation.cfm?id=645924.671191
Karras, P., Mamoulis, N.: Hierarchical synopses with optimal error guarantees. ACM Trans. Database Syst. 33, 1–53 (2008). doi:10.1145/1386118.1386124
Lin, M.Y., Hsueh, S.C., Hwang, S.K.: Interactive mining of frequent itemsets over arbitrary time intervals in a data stream. In: Proceedings of the 19th Conference on Australasian Database, vol. 75, ADC ’08, pp. 15–21. Australian Computer Society Inc., Darlinghurst (2007). http://dl.acm.org/citation.cfm?id=1378307.1378315
Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2, 143–152 (1982). doi:10.1016/0167-6423(82)90012-0
Poosala, V., Ioannidis, Y.E., Haas, P.J., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: SIGMOD Conference, pp. 294–305 (1996)
Rodrigues, P., Gama, J., Sebastião, R.: Memoryless fading windows in ubiquitous settings. In: Proceedings of Ubiquitous Data Mining (UDM) Workshop, in conjunction with the 19th European Conference on Artificial Intelligence, ECAI 2010, pp. 27–32 (2010)
Scott, D.W.: On optimal and data-based histograms. Biometrika 66(3), 605–610 (1979). doi:10.1093/biomet/66.3.605
Street, W.N., Kim, Y.: A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 377–382. ACM Press, New York (2001)
Sturges, H.A.: The choice of a class interval. J. Am. Stat. Assoc. 21, 65–66 (1926)
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985). doi:10.1145/3147.3165

Acknowledgments

The work of Raquel Sebastião was supported by FCT (Portuguese Foundation for Science and Technology) under the PhD Grant SFRH/BD/41569/2007. The authors acknowledge the support of the European Commission through the project MAESTRA (Grant number ICT-2013-612944). This work was also funded by the European Regional Development Fund through the COMPETE Program, by the Portuguese Funds through the FCT (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-022701, and by the Projects NORTE-07-0124-FEDER-000056/000059, which are financed by the North Portugal Regional Operational Program (ON.2 O Novo Norte), under the National Strategic Reference Framework (NSRF).

Author information

Authors and Affiliations

LIAAD, INESC TEC, Campus da FEUP, Rua Dr. Roberto Frias, 378, 4200-465, Porto, Portugal
Raquel Sebastião & João Gama
Dep. Matemática, Fac. Ciências da Universidade do Porto (FCUP), Rua do Campo Alegre, s/n, 4169-007, Porto, Portugal
Raquel Sebastião & Teresa Mendonça
Fac. Economia da Universidade do Porto (FEP), Rua Dr. Roberto Frias, 4200-464, Porto, Portugal
João Gama
Dep. de Matemática, Center for Research and Developments in Mathematics and Applications (CIDMA), Campus Universitário de Santiago, 3810-193, Aveiro, Portugal
Teresa Mendonça

Corresponding author

Correspondence to Raquel Sebastião.

Appendix: Fading histograms

This appendix presents some computations on the error of approximating the fading sliding histogram with the fading histogram. Considering the definition of histogram frequencies (1), the frequencies of a sliding histogram (with $k$ buckets) constructed over a sliding window of length $w$ and computed at observation $i$ with an exponential fading factor $\alpha $ ($0 \ll \alpha < 1$) can be defined as follows:

$$\begin{aligned} F_{\alpha , w, j} (i) \!=\! \frac{\sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)}, \quad \forall j \!=\! 1,\dots , k, \end{aligned}$$

(13)
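
To make Eq. (13) concrete, the following is a minimal Python sketch that evaluates the fading sliding frequencies directly from the definition. It assumes the counts $C_j(l)$ are one-hot, i.e. each observation falls into exactly one bucket, and it encodes the stream as a list of bucket indices; the function name and this encoding are illustrative choices, not taken from the paper.

```python
def fading_sliding_frequencies(buckets, k, w, alpha):
    """Frequencies F_{alpha,w,j}(i) of Eq. (13) at the last observation.

    buckets -- bucket index (0..k-1) of each observation x_1..x_i, so that
               C_j(l) = 1 exactly when buckets[l-1] == j (an assumption
               about how the counts C_j(l) are encoded)
    """
    i = len(buckets)
    weights = [0.0] * k
    # Only the w most recent observations contribute; observation l gets
    # weight alpha**(i - l), so the newest one has weight 1.
    for l in range(max(1, i - w + 1), i + 1):
        weights[buckets[l - 1]] += alpha ** (i - l)
    total = sum(weights)
    return [wj / total for wj in weights]


stream = [0, 0, 1, 2, 2, 2, 1, 0]  # bucket index of each observation
print(fading_sliding_frequencies(stream, k=3, w=4, alpha=0.9))
```

With $\alpha = 1$ the weights are all equal and the result reduces to the plain relative frequencies over the window.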

To approximate a fading sliding histogram by a fading histogram, the data older than those within the most recent window $W = \{x_l: l=i-w+1,\ldots ,i \}$ must be taken into consideration. Therefore, for each bucket $j = 1,\ldots , k$, the proportion of weight given to old observations (with respect to $W$) in the computation of the fading histogram is defined as the bucket ballast weight:

$$\begin{aligned} B_{\alpha , w, j}(i) = \frac{\sum \nolimits _{l=w}^{i-1} \alpha ^{l}}{N_\alpha (i)},\quad \forall j = 1,\dots , k, \end{aligned}$$

(14)

where $N_\alpha (i)$ is the fading increment defined as $N_{\alpha }(i) = \sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)$.

As with the old observations, for each bucket $j = 1,\dots , k$, the proportion of weight given to observations within the most recent window $W$ is defined by

$$\begin{aligned} B_{\alpha , w, j}'(i) = 1 - B_{\alpha , w, j}(i) = \frac{\sum \nolimits _{l=0}^{w-1} \alpha ^{l}}{N_\alpha (i)}, \quad \forall j = 1,\dots , k.\nonumber \\ \end{aligned}$$

(15)
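
The ballast weights of Eqs. (14) and (15) can be evaluated numerically as sketched below. The sketch assumes that every observation falls into exactly one bucket, so that $N_\alpha (i) = \sum _{l=0}^{i-1} \alpha ^{l}$; under that assumption the geometric series gives the closed form $(\alpha ^{w} - \alpha ^{i})/(1 - \alpha ^{i})$, which is a derived convenience and not a formula quoted from the paper. Function names are hypothetical.

```python
def ballast_weight(i, w, alpha):
    """Bucket ballast weight of Eq. (14), by direct summation.

    Assumes sum_j C_j(l) = 1 for every l, so that
    N_alpha(i) = sum_{l=0}^{i-1} alpha**l.
    """
    old = sum(alpha ** l for l in range(w, i))   # l = w .. i-1
    n_alpha = sum(alpha ** l for l in range(i))  # l = 0 .. i-1
    return old / n_alpha


def ballast_weight_closed_form(i, w, alpha):
    """Same quantity via the geometric series (0 < alpha < 1, w <= i)."""
    return (alpha ** w - alpha ** i) / (1.0 - alpha ** i)


b = ballast_weight(i=1000, w=200, alpha=0.99)
# Weight on old data (Eq. 14) and on the most recent window (Eq. 15).
print(b, 1.0 - b)
```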

Hence, the error of approximating the fading sliding histogram with the fading histogram, both with $k$ buckets, can be defined as

$$\begin{aligned} \Delta _{\alpha ,w}(i)&= \sum \nolimits _{j=1}^k \Delta _{\alpha ,w, j}(i) \nonumber \\&= \sum \nolimits _{j=1}^k \left\| F_{\alpha , w, j} (i) \!-\! F_{\alpha , j} (i) \right\| . \end{aligned}$$

(16)
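
The error of Eq. (16) can be measured on a stream as in the following sketch. It maintains the fading histogram online with the recursive update "multiply all counts by $\alpha $, then add one to the bucket of the new observation" (which yields the same weights $\alpha ^{i-l}$ as the definition), and recomputes the fading sliding histogram from a buffer of the last $w$ bucket indices. The stream encoding, the function name, and the toy data are assumptions made for illustration.

```python
import random
from collections import deque


def approximation_error(buckets, k, w, alpha):
    """Delta_{alpha,w}(i) of Eq. (16) after the whole stream is consumed."""
    fading = [0.0] * k        # numerators of the fading histogram
    window = deque(maxlen=w)  # bucket indices of the w most recent observations
    for b in buckets:
        # Online fading update: age every count by alpha, add the new one.
        fading = [alpha * c for c in fading]
        fading[b] += 1.0
        window.append(b)
    # Fading sliding histogram recomputed from the window (Eq. 13).
    sliding = [0.0] * k
    for age, b in enumerate(reversed(window)):
        sliding[b] += alpha ** age
    fading_total = sum(fading)
    sliding_total = sum(sliding)
    return sum(
        abs(s / sliding_total - f / fading_total)
        for s, f in zip(sliding, fading)
    )


rng = random.Random(0)  # arbitrary seed, for a reproducible toy stream
stream = [rng.randrange(5) for _ in range(2000)]
print(approximation_error(stream, k=5, w=300, alpha=0.99))
```

The sliding variant needs memory proportional to $w$, whereas the fading histogram keeps only $k$ counters, which is the trade-off the approximation is meant to exploit.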

Theorem 1

Let $\varepsilon <1$ be an admissible ballast weight for the fading histogram. Then, $\Delta _{\alpha ,w}(i) \le 2\varepsilon $.

Proof

From the respective definitions of the histogram frequencies, it follows that the approximation error in each bucket is

$$\begin{aligned}&\Delta _{\alpha ,w, j}(i) = \left\| \ \frac{\sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)} - \frac{\sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)} \right\| ,\\&\quad \quad \qquad \qquad \qquad \forall j = 1,\ldots , k \end{aligned}$$

Splitting each of these errors into the contributions of the frequencies inside and outside the most recent window of size $w$ gives

$$\begin{aligned} \Delta _{\alpha ,w, j}(i) = \left\| \Delta _{\alpha ,w, j}(i)^\mathrm{in} - \Delta _{\alpha ,w, j}(i)^\mathrm{out}\right\| , \end{aligned}$$

where

$$\begin{aligned} \Delta _{\alpha ,w, j}(i)^\mathrm{in} = \frac{\sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)} - \frac{\sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)}, \end{aligned}$$

and

$$\begin{aligned} \Delta _{\alpha ,w, j}(i)^\mathrm{out} = \frac{\sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)}. \end{aligned}$$

Looking for an upper bound on the error, the worst case is that these two sources of error do not cancel out but add up:

$$\begin{aligned} \Delta _{\alpha ,w, j}(i) \le \left\| \Delta _{\alpha ,w, j}(i)^\mathrm{in} \right\| + \left\| \Delta _{\alpha ,w, j}(i)^\mathrm{out} \right\| . \end{aligned}$$

Hence

$$\begin{aligned} \Delta _{\alpha ,w, j}(i)&\le \left\| \frac{\left( \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^{i} \alpha ^{i-l} C_j(l) \right) - \left( \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^{i} \alpha ^{i-l} C_j(l) \right) }{ \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l) \right) } \right\| + \left\| \frac{\sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)} \right\| \\&= \left\| \frac{\left( \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} C_j(l) \right) }{ \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l) \right) } \right\| + \left\| \frac{\sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} C_j(l)}{\sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^i \alpha ^{i-l} C_j(l)} \right\| \\&= \left\| \frac{\left( \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} C_j(l) \right) }{ \left( \sum \nolimits _{j=1}^k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} C_j(l) \right) N_{\alpha }(i)} \right\| + \left\| \frac{\sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} C_j(l)}{N_{\alpha }(i)} \right\| . \end{aligned}$$

The upper bound on the error is given by considering all $C_j(l) = 1$:

$$\begin{aligned}&\Delta _{\alpha ,w, j}(i)\\&\quad \le \left\| \frac{\left( \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} \right) \left( k \sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} \right) }{ \left( k \sum \nolimits _{l=i-w+1}^i \alpha ^{i-l} \right) N_{\alpha }(i)} \right\| \\&\quad + \left\| \frac{\sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} }{N_{\alpha }(i)} \right\| = 2 \left\| \frac{\sum \nolimits _{l=1}^{i-w} \alpha ^{i-l} }{ N_{\alpha }(i)} \right\| \end{aligned}$$

Then, from the definition of the bucket ballast weight, it follows that

$$\begin{aligned} \Delta _{\alpha ,w, j}(i)&\le 2 \left\| B_{\alpha , w, j}(i)\right\| \end{aligned}$$

Considering, in each bucket $j = 1,\dots , k$, an admissible ballast weight of at most $\varepsilon /k$, it follows that

$$\begin{aligned} \Delta _{\alpha ,w}(i) = \sum \limits _{j=1}^k \Delta _{\alpha ,w, j}(i) \le \sum \limits _{j=1}^k 2\varepsilon /k = 2\varepsilon . \end{aligned}$$
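
One practical reading of Theorem 1 is sketched below: given $\alpha $, $k$, and a target $\varepsilon $, pick a window length $w$ whose ballast weight is at most $\varepsilon /k$. The sketch relies on the bound $B_{\alpha , w, j}(i) \le \alpha ^{w}$, which follows from the geometric series when every observation falls into exactly one bucket; this bound and the function name `min_window_length` are additions for illustration, not results stated in the paper.

```python
import math


def min_window_length(alpha, epsilon, k):
    """Smallest w (up to floating-point rounding) with alpha**w <= epsilon / k.

    Uses the assumed bound B_{alpha,w,j}(i) <= alpha**w, valid for
    0 < alpha < 1 and i >= w when sum_j C_j(l) = 1 for every l.
    """
    target = epsilon / k
    w = max(1, math.ceil(math.log(target) / math.log(alpha)))
    # Guard against rounding in the logarithms.
    while alpha ** w > target:
        w += 1
    return w


w = min_window_length(alpha=0.99, epsilon=0.1, k=10)
print(w)  # 459
```

For $\alpha = 0.99$, $k = 10$ and $\varepsilon = 0.1$, a fading histogram therefore stays within $2\varepsilon = 0.2$ of a fading sliding histogram over the last 459 observations, under the assumptions above.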

About this article

Cite this article

Sebastião, R., Gama, J. & Mendonça, T. Constructing fading histograms from data streams. Prog Artif Intell 3, 15–28 (2014). https://doi.org/10.1007/s13748-014-0050-9

Received: 06 December 2013
Accepted: 16 March 2014
Published: 11 April 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s13748-014-0050-9
