Abstract
Data streams pose a challenge to data processing operations such as query execution and information retrieval. They impose tight constraints on the memory space and execution time available for computation, mainly because of the huge volume of the data and its high arrival rate. Generating approximate answers from a small portion of the data stream, called a “summary,” is acceptable for many applications. Sampling algorithms are used to construct such a summary: their purpose is to provide information about a large data set from a representative sample extracted from it. An effective summary of a data stream must be able to answer any query approximately, whatever the time period under investigation. In this chapter, we present a survey of these algorithms. First, we introduce the basic concepts of data streams, windowing models, and data stream applications. Next, we review the state of the art of the sampling algorithms used in data stream environments. We classify these algorithms according to the following metrics: number of passes over the data, memory consumption, and skewing ability. Finally, we evaluate the performance of three sampling algorithms in terms of execution time and accuracy.
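As an illustration of what building a summary by sampling can look like, here is a minimal sketch of classic one-pass reservoir sampling, which keeps a fixed-size uniform sample of a stream of unknown length. The abstract does not say which three algorithms the chapter evaluates, so this is only a representative example and not necessarily one of them. The names `reservoir_sample` and `estimate_mean` are hypothetical and chosen for this sketch.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of at most k items in a single pass.

    Memory use is O(k) regardless of how long the stream is.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill phase: the first k items are always kept.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1),
            # replacing a uniformly chosen slot of the reservoir.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_mean(sample: List[float]) -> float:
    """Approximate the stream mean from the sample (the 'summary')."""
    return sum(sample) / len(sample)


if __name__ == "__main__":
    # Hypothetical stream: the integers 0..99999, whose true mean is 49999.5.
    summary = reservoir_sample(range(100_000), k=1_000, rng=random.Random(0))
    print(len(summary), estimate_mean(summary))
```

The sample gives equal weight to every element seen so far, so it is not biased toward recent data. Queries restricted to a recent time window would call for a window-aware or biased variant.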
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sibai, R.E., Abdo, J.B., Chabchoub, Y., Demerjian, J., Chiky, R., Barbar, K. (2020). Data Summarization Using Sampling Algorithms: Data Stream Case Study. In: Arabnia, H.R., Daimi, K., Stahlbock, R., Soviany, C., Heilig, L., Brüssau, K. (eds) Principles of Data Science. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-43981-1_6
DOI: https://doi.org/10.1007/978-3-030-43981-1_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43980-4
Online ISBN: 978-3-030-43981-1
eBook Packages: Engineering (R0)