A New Class of Weighted Similarity Indices Using Polytomous Variables

Journal of Classification

Abstract

We introduce new similarity measures between two subjects for variables with multiple categories. Unlike traditional similarity indices, they also take into account the frequency of each attribute's categories in the sample. This feature is useful when dealing with rare categories, since the joint presence of a rare category in two subjects should be evaluated differently from the joint presence of a widespread one. We suggest a weighting criterion for each category derived from Shannon's information theory. The weighted index comes in two versions: one for independent categorical variables and one for dependent variables. The suitability of the proposed indices is shown using both simulated and real-world data sets.
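The idea of weighting a shared category by its rarity can be illustrated with a minimal sketch. It is not the index defined in the paper: the abstract does not give the formula, so the weight used here (the Shannon information content, minus the log of a category's relative frequency) and the normalisation by the largest attainable score are assumptions, as are the function names and the toy data. The sketch covers only the simplest setting, with each variable weighted separately, and does not attempt the version for dependent variables.

```python
import math
from collections import Counter


def category_weights(data):
    """Return one dict per variable mapping category -> -log(relative frequency).

    Assumption: Shannon information content is used as the category weight.
    """
    n = len(data)
    weights = []
    for column in zip(*data):  # iterate over variables (columns)
        counts = Counter(column)
        weights.append({cat: -math.log(c / n) for cat, c in counts.items()})
    return weights


def simple_matching(a, b):
    """Unweighted share of variables on which two subjects agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_matching(a, b, weights):
    """Information-weighted agreement between two subjects.

    Assumption: the score is scaled by the sum of each variable's largest
    category weight, so the result lies between 0 and 1.
    """
    score = sum(w[x] for x, y, w in zip(a, b, weights) if x == y)
    best = sum(max(w.values()) for w in weights)
    return score / best if best > 0 else 0.0


# Hypothetical sample: "blue" is rarer than "red"; sizes are balanced.
data = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("red", "large"),
    ("blue", "small"),
    ("blue", "large"),
]
weights = category_weights(data)

# Both pairs agree on colour only, so simple matching cannot tell them apart.
print(simple_matching(data[0], data[2]), simple_matching(data[4], data[5]))
# The weighted score is higher for the pair sharing the rare colour.
print(weighted_matching(data[0], data[2], weights),
      weighted_matching(data[4], data[5], weights))
```

In this toy sample the two pairs receive the same simple matching score, while the weighted score ranks the pair that shares the rare category higher, which is the behaviour the abstract motivates.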

Author information

Corresponding author

Correspondence to Isabella Morlini.

Additional information

The authors thank the Editor and the anonymous referees for their comments and suggestions. We feel that the paper has substantially improved due to their helpful feedback.

About this article

Cite this article

Morlini, I., Zani, S. A New Class of Weighted Similarity Indices Using Polytomous Variables. J Classif 29, 199–226 (2012). https://doi.org/10.1007/s00357-012-9107-2

  • DOI: https://doi.org/10.1007/s00357-012-9107-2
