Progress in Artificial Intelligence, Volume 3, Issue 1, pp 15–28

Constructing fading histograms from data streams

Regular Paper

Abstract

The ability to collect data is changing drastically. Nowadays, data are gathered in the form of transient and finite data streams. Memory restrictions preclude keeping all received data in memory. When dealing with massive data streams, it is mandatory to create compact representations of data, also known as synopsis structures or summaries. Reducing memory occupancy is of utmost importance when handling huge amounts of data. This paper addresses the problem of constructing histograms from data streams under error constraints. When constructing online histograms from data streams, two main characteristics must be considered: the ease of updating the histogram and the error of the histogram. Moreover, in dynamic environments, besides the need for compact summaries that capture the most important properties of the data, it is also essential to forget old data. Therefore, this paper presents sliding histograms and fading histograms, which are, respectively, an abrupt and a smooth strategy for forgetting outdated data.
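
The abstract does not state the update rules, so the following is only a minimal sketch of the two forgetting strategies, not the authors' implementation. It assumes equal-width bins over a known range, a fixed window length for the sliding variant, and an exponential fading factor `alpha` in (0, 1) for the fading variant; all class names, function names, and parameters are hypothetical.

```python
from collections import deque


def bin_index(x, low, high, n_bins):
    """Map x to an equal-width bin, clamping out-of-range values to the edge bins."""
    width = (high - low) / n_bins
    return min(max(int((x - low) / width), 0), n_bins - 1)


class SlidingHistogram:
    """Abrupt forgetting: only the last `window` observations are counted."""

    def __init__(self, low, high, n_bins, window):
        self.low, self.high, self.n_bins = low, high, n_bins
        self.window = window
        self.counts = [0] * n_bins
        self.recent = deque()  # bin indices of observations still inside the window

    def update(self, x):
        b = bin_index(x, self.low, self.high, self.n_bins)
        self.counts[b] += 1
        self.recent.append(b)
        if len(self.recent) > self.window:
            # The oldest observation leaves the window and is dropped entirely.
            self.counts[self.recent.popleft()] -= 1


class FadingHistogram:
    """Smooth forgetting: every update scales all existing counts by `alpha`."""

    def __init__(self, low, high, n_bins, alpha):
        self.low, self.high, self.n_bins = low, high, n_bins
        self.alpha = alpha  # assumed fading factor, 0 < alpha < 1
        self.counts = [0.0] * n_bins

    def update(self, x):
        b = bin_index(x, self.low, self.high, self.n_bins)
        # Older observations lose weight geometrically; the new one counts as 1.
        self.counts = [self.alpha * c for c in self.counts]
        self.counts[b] += 1.0
```

Under these assumptions, the sliding variant must store the window contents (memory grows with the window length), whereas the fading variant stores only one weight per bin, which is one reason a fading scheme suits memory-constrained streams.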

Keywords

Data streams; Online histograms; Error constraints; Fading histograms

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Raquel Sebastião (1, 2)
  • João Gama (1, 3)
  • Teresa Mendonça (2, 4)

  1. LIAAD, INESC TEC, Porto, Portugal
  2. Dep. Matemática, Fac. Ciências da Universidade do Porto (FCUP), Porto, Portugal
  3. Fac. Economia da Universidade do Porto (FEP), Porto, Portugal
  4. Dep. de Matemática, Center for Research and Developments in Mathematics and Applications (CIDMA), Aveiro, Portugal
