Data Summarization Using Sampling Algorithms: Data Stream Case Study

Principles of Data Science

Abstract

Data streams challenge data processing operations such as query execution and information retrieval. Their huge volume and high arrival rate impose tight constraints on the memory space and execution time available for computation. For many applications, it is acceptable to generate approximate answers from a small portion of the data stream, called a “summary.” Sampling algorithms are used to construct such a summary: their purpose is to provide information about a large data set from a representative sample extracted from it. An effective summary of a data stream must be able to answer, approximately, any query, whatever time period is investigated. In this chapter, we present a survey of these algorithms. First, we introduce the basic concepts of data streams, windowing models, and data stream applications. Next, we review the state of the art of the sampling algorithms used in data stream environments. We classify these algorithms according to the following metrics: number of passes over the data, memory consumption, and skewing ability. Finally, we evaluate the performance of three sampling algorithms in terms of execution time and accuracy.
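To make the idea of a sampling-based summary concrete, the following minimal sketch shows classic reservoir sampling, a well-known single-pass technique that keeps a uniform random sample of fixed size k in bounded memory. The abstract does not say which algorithms the chapter evaluates, so the choice of algorithm here is an assumption made for illustration, and the names `reservoir_sample`, `stream`, `k`, and `rng` are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of at most k items from a stream.

    Illustrative sketch only: one pass over the data, O(k) memory.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # draw j uniformly from [0, i] and replace slot j if j < k.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Approximate answer to a query (here, the mean) computed from the summary.
summary = reservoir_sample(range(1_000_000), k=1_000, rng=random.Random(42))
approx_mean = sum(summary) / len(summary)
```

The summary occupies memory proportional to k regardless of the stream length, which is the trade-off the abstract describes: a small sample yields an approximate rather than exact answer.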

Author information

Corresponding author

Correspondence to Jacques Bou Abdo.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sibai, R.E., Abdo, J.B., Chabchoub, Y., Demerjian, J., Chiky, R., Barbar, K. (2020). Data Summarization Using Sampling Algorithms: Data Stream Case Study. In: Arabnia, H.R., Daimi, K., Stahlbock, R., Soviany, C., Heilig, L., Brüssau, K. (eds) Principles of Data Science. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-43981-1_6

  • DOI: https://doi.org/10.1007/978-3-030-43981-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43980-4

  • Online ISBN: 978-3-030-43981-1

  • eBook Packages: Engineering, Engineering (R0)
