A New Class of Weighted Similarity Indices Using Polytomous Variables

Morlini, Isabella; Zani, Sergio

doi:10.1007/s00357-012-9107-2

Published: 19 June 2012

Journal of Classification, Volume 29, pages 199–226 (2012)

Abstract

We introduce new similarity measures between two subjects for variables with multiple categories. Unlike traditional similarity indices, they also take into account the frequency of each attribute’s categories in the sample. This feature is useful when dealing with rare categories, since the pairwise presence of a rare category should be evaluated differently from the pairwise presence of a widespread one. We suggest a weighting criterion for each category derived from Shannon’s information theory. The weighted index comes in two versions: one for independent categorical variables and one for dependent variables. The suitability of the proposed indices is shown using both simulated and real-world data sets.
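
The idea of weighting a shared category by its rarity can be illustrated with a minimal sketch. This is not the index defined in the paper, whose formulas are not part of this preview. It assumes that each category is weighted by its Shannon self-information, -ln(p), where p is the category's relative frequency in the sample, and that the similarity is the information carried by the matching categories divided by the average information of the two profiles. The function names `category_weights` and `weighted_similarity` and the normalization are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def category_weights(data):
    """For each variable, map every observed category to -ln(relative frequency).

    Assumption: Shannon self-information is used as the category weight.
    """
    n = len(data)
    weights = []
    for column in zip(*data):  # iterate over variables (columns)
        counts = Counter(column)
        weights.append({cat: -math.log(count / n) for cat, count in counts.items()})
    return weights


def weighted_similarity(a, b, weights):
    """Illustrative weighted matching index in [0, 1] (hypothetical normalization)."""
    matched = 0.0  # information of the categories shared by a and b
    total = 0.0    # average information of the two profiles
    for w, x, y in zip(weights, a, b):
        if x == y:
            matched += w[x]
        total += (w[x] + w[y]) / 2
    # total is zero only when every variable is constant in the sample,
    # in which case the two profiles are necessarily identical.
    return 1.0 if total == 0 else matched / total


# Toy sample: six subjects, two categorical variables.
data = [
    ("rare", "x"),
    ("rare", "y"),
    ("common", "x"),
    ("common", "y"),
    ("common", "z"),
    ("common", "w"),
]
weights = category_weights(data)
print(weighted_similarity(data[0], data[1], weights))  # pair sharing the rare category
print(weighted_similarity(data[2], data[3], weights))  # pair sharing the common category
```

In this toy sample both pairs agree on one variable out of two, so an unweighted simple matching coefficient would score them equally. Under the assumed weighting, the pair sharing the rare category receives the higher score.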

Author information

Authors and Affiliations

Isabella Morlini: Department of Economics, University of Modena and Reggio Emilia, Via Berengario 51, 41100, Modena, Italy
Sergio Zani: University of Parma, Parma, Italy

Corresponding author

Correspondence to Isabella Morlini.

Additional information

The authors thank the Editor and the anonymous referees for their comments and suggestions. We feel that the paper has substantially improved due to their helpful feedback.

About this article

Cite this article

Morlini, I., Zani, S. A New Class of Weighted Similarity Indices Using Polytomous Variables. J Classif 29, 199–226 (2012). https://doi.org/10.1007/s00357-012-9107-2

Issue Date: July 2012