An Hybrid Data Stream Summarizing Approach by Sampling and Clustering

  • Nesrine Gabsi
  • Fabrice Clérot
  • Georges Hébrail
Part of the Studies in Computational Intelligence book series (SCI, volume 292)


Computer systems generate large amounts of data that are very expensive, or even impossible, to store in terms of both space and time. At the same time, many applications need to keep a historical view of such data in order to provide historical aggregated information, perform data mining tasks, or detect anomalous behavior in computer systems. One solution is to treat the data as streams that are processed on the fly in order to build historical summaries. Many data summarizing techniques have already been developed, such as sampling, clustering, and histograms, and some of them have been extended to apply directly to data streams. This chapter presents a new approach to building such historical summaries of data streams. It is based on a combination of two existing algorithms: StreamSamp and CluStream. The combination takes advantage of the benefits of each algorithm and avoids their drawbacks. Experiments on both real and synthetic data show that the new approach gives better results than either of the two algorithms used alone.
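As a rough illustration of what summarizing a stream on the fly by sampling can look like, the sketch below implements classic reservoir sampling, which keeps a fixed-size uniform sample of a stream whose length is not known in advance. It is not StreamSamp or the hybrid approach of this chapter; the function name `reservoir_sample` and its parameters are assumptions chosen for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of at most k items from an iterable.

    Illustrative sketch of classic reservoir sampling; NOT the StreamSamp
    algorithm discussed in the chapter. All names here are hypothetical.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: summarize a stream of 1000 readings with a sample of 10.
sample = reservoir_sample(range(1000), 10, random.Random(0))
short = reservoir_sample(range(3), 10)
```

The memory footprint stays at `k` items regardless of stream length, which is the property that makes sampling attractive as a stream summary. A stream shorter than `k` is simply returned in full.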


Keywords: Data Streams, Non-specialized Summary, Sampling, Clustering





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Nesrine Gabsi (1, 2)
  • Fabrice Clérot (2)
  • Georges Hébrail (3)
  1. Institut TELECOM, TELECOM ParisTech, Paris
  2. France Télécom R&D, Lannion
  3. Institut TELECOM, TELECOM ParisTech, Paris. Partially supported by ANR (MIDAS Project ANR-07-MDO-008)
