ConDist: A Context-Driven Categorical Distance Measure
A distance measure between objects is a key requirement for many data mining tasks such as clustering, classification, and outlier detection. For objects characterized by categorical attributes, however, defining a meaningful distance measure is challenging: the values of such attributes have no inherent order, and additional domain knowledge is often unavailable. In this paper, we propose an unsupervised distance measure for objects with categorical attributes, based on the idea that two values of a categorical attribute are similar if they co-occur with similar value distributions on correlated context attributes. The distance measure is therefore derived automatically from the given data set. We compare our distance measure to existing categorical distance measures and evaluate it on several data sets from the UCI machine-learning repository. The experiments show that our distance measure achieves similar or better results than previous approaches, and does so more robustly.
Keywords: Categorical data, Distance measure, Heterogeneous data, Unsupervised learning
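The core idea of the abstract, that two categorical values are close when they induce similar value distributions on a context attribute, can be illustrated with a minimal sketch. This is not the measure defined in the paper: the abstract does not state how distributions are compared or how context attributes are selected and weighted by correlation, so the sketch assumes a single, already chosen context attribute and uses the total variation distance as a stand-in. The function names, the toy data, and the column indices are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {c: n / sum(ctx.values()) for c, n in ctx.items()}
        for value, ctx in counts.items()
    }


def value_distance(dists, a, b):
    """Total variation distance between the context distributions of a and b.

    Assumption: total variation is used here only for illustration;
    the abstract does not specify the comparison function.
    """
    keys = set(dists[a]) | set(dists[b])
    return 0.5 * sum(abs(dists[a].get(k, 0.0) - dists[b].get(k, 0.0)) for k in keys)


# Toy data set (hypothetical): column 0 is the target attribute "colour",
# column 1 is the single context attribute "fruit".
rows = [
    ("red", "apple"),
    ("red", "cherry"),
    ("crimson", "apple"),
    ("crimson", "cherry"),
    ("yellow", "banana"),
    ("yellow", "banana"),
]

dists = conditional_distributions(rows, target=0, context=1)
red_crimson = value_distance(dists, "red", "crimson")
red_yellow = value_distance(dists, "red", "yellow")
print(red_crimson, red_yellow)
```

In this toy data, "red" and "crimson" co-occur with the same fruits in the same proportions, so their distance is zero, while "red" and "yellow" share no context values and are maximally distant. A fuller treatment would combine several context attributes and weight each by its correlation with the target attribute, as the abstract indicates.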