Joint European Conference on Machine Learning and Knowledge Discovery in Databases

ECML PKDD 2015: Machine Learning and Knowledge Discovery in Databases pp 251-266

ConDist: A Context-Driven Categorical Distance Measure

  • Markus Ring
  • Florian Otto
  • Martin Becker
  • Thomas Niebler
  • Dieter Landes
  • Andreas Hotho
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9284)

Abstract

A distance measure between objects is a key requirement for many data mining tasks such as clustering, classification, or outlier detection. For objects characterized by categorical attributes, however, defining a meaningful distance measure is challenging: the values of such attributes have no inherent order, which makes the task especially hard when no additional domain knowledge is available. In this paper, we propose an unsupervised distance measure for objects with categorical attributes, based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. The distance measure is thus derived automatically from the given data set. We compare the new distance measure to existing categorical distance measures and evaluate it on several data sets from the UCI Machine Learning Repository. The experiments show that our distance measure can be recommended, since it achieves similar or better results in a more robust way than previous approaches.
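
The core idea stated in the abstract, that two categorical values are close when they co-occur with similar value distributions on a context attribute, can be illustrated with a minimal sketch. This is not the ConDist formula: the choice of total variation distance, the use of a single context attribute, the function names, and the toy data are all assumptions made for illustration. How context attributes are selected and weighted by correlation is defined in the paper and is not reproduced here.

```python
from collections import Counter, defaultdict


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {c: n / sum(ctx.values()) for c, n in ctx.items()}
        for value, ctx in counts.items()
    }


def value_distance(dists, a, b):
    """Total variation distance between the context distributions of a and b.

    Assumption: any distance between probability distributions could be
    used here; total variation is chosen only for simplicity.
    """
    keys = set(dists[a]) | set(dists[b])
    return 0.5 * sum(abs(dists[a].get(k, 0.0) - dists[b].get(k, 0.0)) for k in keys)


# Hypothetical toy data: each row is (colour, size).
rows = [
    ("red", "small"), ("red", "small"),
    ("pink", "small"), ("pink", "small"),
    ("blue", "large"), ("blue", "large"),
]

# Distances between colour values, using size as the only context attribute.
dists = conditional_distributions(rows, target=0, context=1)
d_red_pink = value_distance(dists, "red", "pink")
d_red_blue = value_distance(dists, "red", "blue")
```

In this toy data, "red" and "pink" always occur with the same size, so their distance is zero, while "red" and "blue" never share a size and are maximally distant. No ordering of the colour values is needed; the distance is derived entirely from the data.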

Keywords

Categorical data · Distance measure · Heterogeneous data · Unsupervised learning

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Markus Ring (1)
  • Florian Otto (1)
  • Martin Becker (2)
  • Thomas Niebler (2)
  • Dieter Landes (1)
  • Andreas Hotho (2)

  1. Faculty of Electrical Engineering and Informatics, Coburg University of Applied Sciences and Arts, Coburg, Germany
  2. Data Mining and Information Retrieval Group, University of Würzburg, Würzburg, Germany
