Abstract
We introduce new similarity measures between two subjects for variables with multiple categories. Unlike traditional similarity indices, they also take into account the frequency of each attribute’s categories in the sample. This feature is useful when dealing with rare categories, since it makes sense to evaluate the pairwise presence of a rare category differently from the pairwise presence of a widespread one. We suggest a weighting criterion for each category derived from Shannon’s information theory. The weighted index comes in two versions: one for independent categorical variables and one for dependent variables. The suitability of the proposed indices is shown using both simulated and real-world data sets.
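The idea of weighting a shared category by its rarity can be illustrated with a minimal Python sketch. This is not the index defined in the paper, whose formula the abstract does not give. It assumes that each category is weighted by its Shannon self-information, -log(p), where p is the category's relative frequency in the sample, and that the variables are treated as independent. The function names `category_weights` and `weighted_similarity`, the normalisation, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def category_weights(column):
    """Self-information -log(p) of each category observed in one variable."""
    n = len(column)
    counts = Counter(column)
    return {cat: -math.log(count / n) for cat, count in counts.items()}


def weighted_similarity(data, i, j):
    """Information-weighted matching similarity between rows i and j.

    Assumed normalisation (not taken from the paper): the weight of the
    shared categories divided by the average total weight of the two
    subjects. Returns 0.0 when every variable is constant in the sample,
    because then no category carries any information.
    """
    columns = list(zip(*data))
    weights = [category_weights(col) for col in columns]
    shared = 0.0
    total_i = 0.0
    total_j = 0.0
    for w, a, b in zip(weights, data[i], data[j]):
        total_i += w[a]
        total_j += w[b]
        if a == b:
            shared += w[a]  # a shared rare category adds more weight
    denom = (total_i + total_j) / 2
    return shared / denom if denom > 0 else 0.0


# Toy sample: in the first variable "a" is widespread and "b" is rare.
data = [
    ("a", "x"),
    ("a", "y"),
    ("a", "x"),
    ("a", "y"),
    ("b", "x"),
    ("b", "y"),
]

print(weighted_similarity(data, 0, 1))  # pair sharing the widespread "a"
print(weighted_similarity(data, 4, 5))  # pair sharing the rare "b"
```

In this toy sample both pairs agree on the first variable and disagree on the second, yet the pair sharing the rare category scores higher. An unweighted simple matching coefficient would give both pairs the same value.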
Additional information
The authors thank the Editor and the anonymous referees for their comments and suggestions. We feel that the paper has substantially improved due to their helpful feedback.
Cite this article
Morlini, I., Zani, S. A New Class of Weighted Similarity Indices Using Polytomous Variables. J Classif 29, 199–226 (2012). https://doi.org/10.1007/s00357-012-9107-2
DOI: https://doi.org/10.1007/s00357-012-9107-2