C-DenStream: Using Domain Knowledge on a Data Stream

  • Conference paper
Discovery Science (DS 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5808)

Abstract

Stream clustering algorithms are traditionally designed to process streams efficiently and to adapt to the evolution of the underlying population. This is done without assuming any prior knowledge about the data. However, in many cases, a certain amount of domain or background knowledge is available, and instead of simply using it for the external validation of the clustering results, this knowledge can be used to guide the clustering process. In non-stream data, domain knowledge is exploited in the context of semi-supervised clustering.

In this paper, we extend the static semi-supervised learning paradigm to streams. We present C-DenStream, a density-based clustering algorithm for data streams that includes domain information in the form of constraints. We also propose a novel method for the use of background knowledge in data streams. A performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method. To our knowledge, this is the first approach to include domain knowledge in clustering for data streams.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ruiz, C., Menasalvas, E., Spiliopoulou, M. (2009). C-DenStream: Using Domain Knowledge on a Data Stream. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science, vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_23

  • DOI: https://doi.org/10.1007/978-3-642-04747-3_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04746-6

  • Online ISBN: 978-3-642-04747-3

  • eBook Packages: Computer Science, Computer Science (R0)
