Data Summarization Using Sampling Algorithms: Data Stream Case Study

Sibai, Rayane El; Abdo, Jacques Bou; Chabchoub, Yousra; Demerjian, Jacques; Chiky, Raja; Barbar, Kablan

doi:10.1007/978-3-030-43981-1_6

Part of the book series: Transactions on Computational Science and Computational Intelligence (TRACOSCI)

Abstract

Data streams challenge data processing operations such as query execution and information retrieval. Because the data arrive in huge volumes and at a high rate, computation is tightly constrained in both memory space and execution time. For many applications, it is acceptable to generate approximate answers from a small portion of the data stream, called a “summary.” Sampling algorithms are used to construct such a summary: their purpose is to provide information about a large data set from a representative sample extracted from it. An effective summary of a data stream must be able to answer, approximately, any query, whatever time period is under investigation. In this chapter, we present a survey of these algorithms. First, we introduce the basic concepts of data streams, windowing models, and data stream applications. Next, we review the state of the art of sampling algorithms used in data stream environments. We classify these algorithms according to the following criteria: number of passes over the data, memory consumption, and skewing ability. Finally, we evaluate the performance of three sampling algorithms in terms of execution time and accuracy.
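As an illustration of how a sampling algorithm can build a bounded-memory summary in a single pass, the following is a minimal Python sketch of classic uniform reservoir sampling. The abstract does not say which three algorithms the chapter evaluates, so this example is not taken from the chapter. The names `reservoir_sample` and `approximate_mean`, and the choice of a mean query, are assumptions made purely for illustration.

```python
import random
from typing import Iterable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    Hypothetical helper for illustration: one pass over the data,
    O(k) memory, and no need to know the stream length in advance.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def approximate_mean(sample: Sequence[float]) -> float:
    """Answer a mean query approximately, using only the summary."""
    return sum(sample) / len(sample)


if __name__ == "__main__":
    # Summarize a stream of 100,000 values with a 1,000-item sample.
    summary = reservoir_sample(range(100_000), 1_000, random.Random(7))
    print(len(summary), approximate_mean(summary))
```

The sample size `k` trades memory for accuracy: a larger reservoir generally gives a closer approximate answer but costs more space. This sketch samples uniformly over the whole stream and does not bias the sample toward recent items.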

Author information

Authors and Affiliations

Al Maaref University, Faculty of Engineering, Beirut, Lebanon
Rayane El Sibai
Faculty of Natural and Applied Sciences, Notre Dame University, Deir El Kamar, Lebanon
Jacques Bou Abdo
Institut supérieur d’électronique de Paris, Issy-les-Moulineaux, France
Yousra Chabchoub & Raja Chiky
Faculty of Sciences, LaRRIS, Lebanese University, Fanar, Lebanon
Jacques Demerjian & Kablan Barbar

Corresponding author

Correspondence to Jacques Bou Abdo.

Editor information

Editors and Affiliations

University of Georgia, Athens, GA, USA
Hamid R. Arabnia
University of Detroit Mercy, Detroit, MI, USA
Kevin Daimi
University of Hamburg, Hamburg, Hamburg, Germany
Robert Stahlbock
Features Analytics, Nivelles, Belgium
Cristina Soviany
University of Hamburg, Hamburg, Hamburg, Germany
Leonard Heilig
University of Hamburg, Hamburg, Hamburg, Germany
Kai Brüssau

About this chapter

Cite this chapter

Sibai, R.E., Abdo, J.B., Chabchoub, Y., Demerjian, J., Chiky, R., Barbar, K. (2020). Data Summarization Using Sampling Algorithms: Data Stream Case Study. In: Arabnia, H.R., Daimi, K., Stahlbock, R., Soviany, C., Heilig, L., Brüssau, K. (eds) Principles of Data Science. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-43981-1_6

DOI: https://doi.org/10.1007/978-3-030-43981-1_6
Published: 09 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43980-4
Online ISBN: 978-3-030-43981-1
eBook Packages: Engineering, Engineering (R0)
