Abstract
In this paper we introduce a new strategy for summarizing a fast-changing data stream. Evolving data streams are generated by non-stationary processes, which require the knowledge discovery process to adapt to newly emerging concepts. To deal with this challenge, we propose a clustering algorithm in which each cluster is summarized by a histogram and data are allocated to clusters through a Wasserstein-derived distance. Histograms are a well-known graphical tool for representing the frequency distribution of data and are widely used in data stream mining. Unlike existing methods, however, we discover a set of histograms, each of which represents a main concept in the data. To evaluate the performance of the method, we have performed extensive tests on simulated data.
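The allocation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the stream arrives as non-overlapping windows, summarizes each window by its empirical quantiles rather than by an explicit histogram, and uses the squared L2 Wasserstein distance computed on matching quantile levels. The function names, the seeding of prototypes from the first windows, and the parameters `n_clusters` and `n_quantiles` are all hypothetical choices made for illustration.

```python
import random


def quantile_profile(window, n_quantiles=20):
    """Summarize a window by equally spaced empirical quantiles."""
    ordered = sorted(window)
    last = len(ordered) - 1
    # Order statistics at levels (i + 0.5) / n_quantiles.
    return [ordered[min(last, int((i + 0.5) / n_quantiles * len(ordered)))]
            for i in range(n_quantiles)]


def wasserstein2_sq(q_a, q_b):
    """Squared L2 Wasserstein distance approximated on matching quantile levels."""
    return sum((a - b) ** 2 for a, b in zip(q_a, q_b)) / len(q_a)


def allocate(profile, prototypes):
    """Return the index of the prototype closest to the profile."""
    return min(range(len(prototypes)),
               key=lambda k: wasserstein2_sq(profile, prototypes[k]))


def update(prototype, count, profile):
    """Running mean of quantile profiles (the 1-D Wasserstein barycenter)."""
    return [p + (q - p) / (count + 1) for p, q in zip(prototype, profile)]


def summarize_stream(windows, n_clusters=2, n_quantiles=20):
    """Assign each window to a cluster and keep one summary per cluster."""
    prototypes, counts, labels = [], [], []
    for window in windows:
        profile = quantile_profile(window, n_quantiles)
        if len(prototypes) < n_clusters:
            # Assumption: the first n_clusters windows seed the prototypes.
            prototypes.append(profile)
            counts.append(1)
            labels.append(len(prototypes) - 1)
            continue
        k = allocate(profile, prototypes)
        prototypes[k] = update(prototypes[k], counts[k], profile)
        counts[k] += 1
        labels.append(k)
    return prototypes, counts, labels


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated stream alternating between two concepts (means 0 and 10).
    stream = [[rng.gauss(0 if i % 2 == 0 else 10, 1) for _ in range(200)]
              for i in range(10)]
    _, counts, labels = summarize_stream(stream)
    print(labels, counts)
```

Averaging quantile profiles is used here because, for one-dimensional distributions, the mean of the quantile functions is the barycenter under the L2 Wasserstein distance, so each prototype stays a valid distribution summary as new windows are absorbed.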
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Balzanella, A., Rivoli, L., Verde, R. (2013). Data Stream Summarization by Histograms Clustering. In: Giudici, P., Ingrassia, S., Vichi, M. (eds) Statistical Models for Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00032-9_4
DOI: https://doi.org/10.1007/978-3-319-00032-9_4
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-00031-2
Online ISBN: 978-3-319-00032-9
eBook Packages: Mathematics and Statistics (R0)