Abstract
In recent years, data streams have become ubiquitous because many applications generate huge volumes of data automatically. Many existing data mining methods cannot be applied directly to data streams because the data must be mined in one pass. Furthermore, data streams exhibit considerable temporal locality, so a direct application of existing methods may produce misleading results. In this paper, we develop an efficient and effective approach for mining fast-evolving data streams, which integrates the micro-clustering technique with the high-level data mining process and also discovers regularities in how the data evolve. Our analysis and experiments demonstrate that two important data mining problems, namely stream clustering and stream classification, can be performed effectively using this approach, with high-quality mining results. We discuss the use of micro-clustering as a general summarization technique for solving data mining problems on streams. Our discussion illustrates the importance of our approach for a variety of mining problems in the data stream domain.
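The abstract does not spell out what a micro-cluster stores. As a rough illustration only, the sketch below assumes a micro-cluster is an additive summary holding a point count, per-dimension linear and squared sums, and sums of timestamps and squared timestamps. The class name `MicroCluster` and its methods are hypothetical and are not taken from the chapter. The point of the sketch is that such a summary can be updated in one pass and merged without revisiting the raw data.

```python
import math


class MicroCluster:
    """Hypothetical additive summary of stream points (assumed layout)."""

    def __init__(self, dim):
        self.n = 0                       # number of points absorbed
        self.linear_sum = [0.0] * dim    # per-dimension sum of values
        self.square_sum = [0.0] * dim    # per-dimension sum of squared values
        self.time_sum = 0.0              # sum of timestamps
        self.time_square_sum = 0.0       # sum of squared timestamps

    def absorb(self, point, timestamp):
        """Fold one point into the summary; the point itself is not stored."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x
        self.time_sum += timestamp
        self.time_square_sum += timestamp * timestamp

    def merge(self, other):
        """Combine two summaries by adding their statistics component-wise."""
        self.n += other.n
        for i in range(len(self.linear_sum)):
            self.linear_sum[i] += other.linear_sum[i]
            self.square_sum[i] += other.square_sum[i]
        self.time_sum += other.time_sum
        self.time_square_sum += other.time_square_sum

    def centroid(self):
        """Mean of the absorbed points, recovered from the sums."""
        return [s / self.n for s in self.linear_sum]

    def rms_radius(self):
        """Root of the summed per-dimension variance around the centroid."""
        variance = sum(
            sq / self.n - (s / self.n) ** 2
            for s, sq in zip(self.linear_sum, self.square_sum)
        )
        return math.sqrt(max(variance, 0.0))  # guard against rounding below 0
```

Because every field is a plain sum, absorbing a point costs time proportional to the dimensionality, and two summaries built over disjoint parts of the stream merge into the summary of their union.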
Copyright information
© 2007 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2007). On Clustering Massive Data Streams: A Summarization Paradigm. In: Aggarwal, C.C. (eds) Data Streams. Advances in Database Systems, vol 31. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-47534-9_2
DOI: https://doi.org/10.1007/978-0-387-47534-9_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28759-1
Online ISBN: 978-0-387-47534-9
eBook Packages: Computer Science (R0)