Abstract
Stream clustering algorithms are traditionally designed to process streams efficiently and to adapt to the evolution of the underlying population. This is done without assuming any prior knowledge about the data. However, in many cases, a certain amount of domain or background knowledge is available, and instead of simply using it for the external validation of the clustering results, this knowledge can be used to guide the clustering process. In non-stream data, domain knowledge is exploited in the context of semi-supervised clustering.
In this paper, we extend the static semi-supervised learning paradigm to streams. We present C-DenStream, a density-based clustering algorithm for data streams that incorporates domain information in the form of constraints. We also propose a novel method for using background knowledge in data streams. A performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method. To our knowledge, this is the first approach to include domain knowledge in clustering for data streams.
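As a rough illustration of what "domain information in the form of constraints" can look like, the sketch below assumes instance-level must-link and cannot-link pairs, the form commonly used in the semi-supervised clustering literature. The abstract does not specify the constraint type, so this is an assumption, and the function name `violated_constraints` and the toy data are hypothetical. It is not the C-DenStream algorithm; it only checks a given cluster assignment against a constraint set.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def violated_constraints(
    labels: Dict[Hashable, int],
    must_link: Iterable[Pair],
    cannot_link: Iterable[Pair],
) -> List[Tuple[str, Pair]]:
    """Return the constraints that a cluster assignment breaks.

    Pairs that mention a point without a label are skipped: in a stream,
    one endpoint of a constraint may not have arrived yet (an assumption
    of this sketch, not a statement about the paper's method).
    """
    violations = []
    for a, b in must_link:
        # A must-link pair is violated when both points are labelled
        # but sit in different clusters.
        if a in labels and b in labels and labels[a] != labels[b]:
            violations.append(("must-link", (a, b)))
    for a, b in cannot_link:
        # A cannot-link pair is violated when both points are labelled
        # and share a cluster.
        if a in labels and b in labels and labels[a] == labels[b]:
            violations.append(("cannot-link", (a, b)))
    return violations


# Hypothetical toy assignment: point id -> cluster id.
labels = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
must_link = [("p1", "p2"), ("p2", "p3")]
cannot_link = [("p1", "p4"), ("p3", "p4"), ("p1", "p9")]

result = violated_constraints(labels, must_link, cannot_link)
print(result)
```

A constraint-aware clustering method would use such pairs to guide cluster formation rather than only to audit the result afterwards; the check above is just the simplest way to make the notion of a constraint concrete.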
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ruiz, C., Menasalvas, E., Spiliopoulou, M. (2009). C-DenStream: Using Domain Knowledge on a Data Stream. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science, vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_23
DOI: https://doi.org/10.1007/978-3-642-04747-3_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04746-6
Online ISBN: 978-3-642-04747-3
eBook Packages: Computer Science (R0)