Association-Based Dissimilarity Measures for Categorical Data: Limitation and Improvement

  • Si Quang Le
  • Tu Bao Ho
  • Le Sy Vinh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)


Measuring similarity for categorical data is a challenging task in data mining due to the poor structure of categorical data. This paper presents a dissimilarity measure for categorical data based on the relations among attributes. This measure not only has the advantage of value variance but also overcomes the limitations of the conditional probability-based measure when applied to databases whose attributes are independent. Experiments with 30 databases also showed that the proposed measure boosted the accuracy of Nearest Neighbor classification in comparison with the other tested measures.


Categorical Data, Binary Vector, Near Neighbor, Association Rule Mining, Dissimilarity Measure
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Si Quang Le (1, 2)
  • Tu Bao Ho (1)
  • Le Sy Vinh (3, 4)

  1. Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa, Japan
  2. LIRMM, Montpellier Cedex 5, France
  3. John von Neumann Institute for Computing, Juelich, Germany
  4. American Museum of Natural History, New York, USA
