Abstract
Clustering with partial supervision is useful when data are only partly or inaccurately labeled. This paper discusses a semi-supervised clustering algorithm based on a modified version of the fuzzy C-Means (FCM) algorithm. The objective function of the proposed algorithm consists of two components. The first performs traditional unsupervised clustering, while the second tracks the relationship between the classes (the available labels) and the clusters generated by the first component. The balance between the two components is tuned by a scaling factor. Comprehensive experimental studies are presented. First, the discriminative ability of the proposed algorithm is examined, and the algorithm is then reformulated as a classifier. The induced classifier is evaluated on completely labeled data and validated by comparison against fully supervised classifiers, namely support vector machines and neural networks. It is then evaluated and compared against three semi-supervised algorithms in the context of learning from partly labeled data. In addition, the behavior of the algorithm is discussed, and the relation between classes and clusters is investigated using a linear regression model. Finally, the complexity of the algorithm is briefly discussed.
Cite this article
BOUCHACHIA, A., PEDRYCZ, W. Data Clustering with Partial Supervision. Data Min Knowl Disc 12, 47–78 (2006). https://doi.org/10.1007/s10618-005-0019-1
DOI: https://doi.org/10.1007/s10618-005-0019-1