Background Knowledge Integration in Clustering Using Purity Indexes

  • Germain Forestier
  • Cédric Wemmert
  • Pierre Gançarski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6291)


In recent years, the use of background knowledge to improve the data mining process has been intensively studied. Indeed, background knowledge, along with knowledge directly or indirectly provided by the user, is often available. However, this kind of knowledge is often difficult to formalize, as it usually depends on the domain. In this article, we study the integration of knowledge in the form of labeled objects into clustering algorithms. Several criteria for evaluating the purity of a clustering are presented, and their behaviours are compared using artificial datasets. The advantages and drawbacks of each criterion are analyzed to help the user choose among them.
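As an illustration of what a purity criterion measures, the sketch below computes the standard textbook purity of a clustering from a partial set of labeled objects. It is not taken from the paper, and the specific criteria the paper compares are not reproduced here. The function name `purity`, the convention that `None` marks an unlabeled object, and the toy data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, labels):
    """Standard purity, computed only over labeled objects.

    cluster_ids: cluster assignment of each object.
    labels: class label of each object, or None when the object is
            unlabeled (hypothetical convention for this sketch).
    """
    per_cluster = defaultdict(Counter)
    n_labeled = 0
    for cluster, label in zip(cluster_ids, labels):
        if label is None:
            continue  # unlabeled objects carry no background knowledge
        per_cluster[cluster][label] += 1
        n_labeled += 1
    if n_labeled == 0:
        raise ValueError("purity is undefined without labeled objects")
    # Each cluster contributes the size of its majority class.
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / n_labeled


# Toy data: 7 objects in 2 clusters, 5 of them labeled.
clusters = [0, 0, 0, 1, 1, 1, 1]
labels = ["a", "a", "b", None, "b", "b", None]
print(purity(clusters, labels))
```

One well-known drawback of this simple measure is that it reaches 1 whenever every labeled object sits alone in its own cluster, so it tends to favour clusterings with many clusters.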


Keywords: clustering, background knowledge, semi-supervised algorithm, purity indexes





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Germain Forestier (1)
  • Cédric Wemmert (1)
  • Pierre Gançarski (1)
  1. Image Sciences, Computer Sciences and Remote Sensing Laboratory, University of Strasbourg, France
