C-DenStream: Using Domain Knowledge on a Data Stream

  • Carlos Ruiz
  • Ernestina Menasalvas
  • Myra Spiliopoulou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5808)

Abstract

Stream clustering algorithms are traditionally designed to process streams efficiently and to adapt to the evolution of the underlying population. This is done without assuming any prior knowledge about the data. However, in many cases, a certain amount of domain or background knowledge is available, and instead of simply using it for the external validation of the clustering results, this knowledge can be used to guide the clustering process. In non-stream data, domain knowledge is exploited in the context of semi-supervised clustering.

In this paper, we extend the static semi-supervised learning paradigm to streams. We present C-DenStream, a density-based clustering algorithm for data streams that incorporates domain information in the form of constraints. We also propose a novel method for using background knowledge in data streams. A performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method. To our knowledge, this is the first approach to include domain knowledge in clustering for data streams.
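To make "domain information in the form of constraints" concrete, the following is a minimal Python sketch, not the C-DenStream algorithm itself. It assumes the constraints are pairwise must-link and cannot-link statements over point indices, as is common in semi-supervised clustering, and it shows how a given cluster assignment could be checked against them and scored with the Rand index. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def count_violations(labels, must_link, cannot_link):
    """Count the constraints that a cluster assignment breaks.

    labels[i] is the cluster id of point i. must_link and cannot_link
    are lists of index pairs (assumed constraint format, not taken
    from the paper).
    """
    broken_must = sum(1 for a, b in must_link if labels[a] != labels[b])
    broken_cannot = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return broken_must, broken_cannot


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two
    points together, or both keep them apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        1
        for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agreements / len(pairs)


# Toy example with five points (made-up data).
predicted = [0, 0, 1, 1, 2]
reference = [0, 0, 1, 1, 1]
must_link = [(0, 1), (2, 4)]
cannot_link = [(1, 2), (2, 3)]

violations = count_violations(predicted, must_link, cannot_link)
score = rand_index(reference, predicted)
```

In this toy case the assignment breaks one must-link constraint (points 2 and 4 are separated) and one cannot-link constraint (points 2 and 3 share a cluster). A constraint-aware algorithm would use such pairs to steer the clustering rather than only to evaluate it afterwards.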

Keywords

Data Stream, Domain Knowledge, Time Stamp, Synthetic Dataset, Rand Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Carlos Ruiz (1)
  • Ernestina Menasalvas (1)
  • Myra Spiliopoulou (2)
  1. Facultad de Informática, Universidad Politécnica de Madrid, Spain
  2. Faculty of Computer Science, Magdeburg University, Germany
