Proximities in Statistics: Similarity and Distance

  • Hans-J. Lenz
Part of the CISM International Centre for Mechanical Sciences book series (CISM, volume 504)

Abstract

We review similarity and distance measures used in statistics for clustering and classification. Our motivation is that most of these measures fail to make adequate use of a non-uniform distribution defined on the data or sample space.

Such measures are mappings O × O → ℝ₊, where O is either a finite set of objects or a vector space such as ℝᵖ, and ℝ₊ is the set of non-negative real numbers. In most cases these mappings satisfy conditions such as symmetry and reflexivity. Further properties, such as transitivity or, for distance measures, the triangle inequality, are also of concern.
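The conditions named above can be checked numerically on a small finite set of objects. The following is a minimal sketch, not taken from the chapter: the function names `city_block`, `squared_euclidean` and `check_distance_axioms`, the sample points and the tolerance are all illustrative assumptions. Reflexivity is read here as d(x, x) = 0, the usual form for a distance.

```python
from itertools import product


def city_block(x, y):
    """City-block (L1) distance between two equal-length numeric vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    """Squared Euclidean dissimilarity; symmetric, but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def check_distance_axioms(objects, d, tol=1e-12):
    """Check four properties of d on a finite set of objects (hypothetical helper)."""
    report = {"non_negative": True, "reflexive": True,
              "symmetric": True, "triangle": True}
    for x in objects:
        if abs(d(x, x)) > tol:          # reflexivity: d(x, x) = 0
            report["reflexive"] = False
    for x, y in product(objects, repeat=2):
        if d(x, y) < -tol:              # values must lie in the non-negative reals
            report["non_negative"] = False
        if abs(d(x, y) - d(y, x)) > tol:
            report["symmetric"] = False
    for x, y, z in product(objects, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            report["triangle"] = False
    return report


# Illustrative points in R^2 (assumed data, three of them collinear).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
report_l1 = check_distance_axioms(points, city_block)
report_sq = check_distance_axioms(points, squared_euclidean)
```

On these points the city-block distance passes all four checks, while the squared Euclidean dissimilarity fails only the triangle inequality: the two end points of the collinear triple are at dissimilarity 4, but the detour through the middle point costs 1 + 1.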

We start with the list of proximity measures that Hartigan compiled in 1967. It is good practice to pay special attention to the scale types of the variables involved, i.e. nominal (often binary), ordinal and metric (interval and ratio) scales. We are interested in the algebraic structure of proximities as suggested by Hartigan (1967) and (1971), information-theoretic measures as discussed by (1971), and the probabilistic W-distance measure as proposed by Skarabis (1970). The last measure combines the distances between objects or vectors with their corresponding probabilities to improve overall discriminatory power. The idea is that rare events, i.e. sets of values with a very low probability of being observed, that are shared by a pair of objects may strongly indicate that the pair is similar.
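To illustrate why the scale type of each variable matters, the sketch below computes a similarity for objects with mixed nominal and metric variables. It follows the general idea of Gower's coefficient (entry 6 in the references) in a simplified form: equal weights, no missing values, and no ordinal variables. It is not the W-distance discussed in the chapter, and the function name `mixed_similarity`, the argument layout and the example data are assumptions made for illustration.

```python
def mixed_similarity(x, y, kinds, ranges):
    """Average per-variable similarity for mixed-scale objects (simplified sketch).

    kinds[k]  is "nominal" or "metric" (hypothetical labels).
    ranges[k] is the observed range of variable k; it is used only for metric
              variables and ignored for nominal ones.
    """
    scores = []
    for a, b, kind, r in zip(x, y, kinds, ranges):
        if kind == "nominal":
            # Nominal scale: only equality is meaningful.
            scores.append(1.0 if a == b else 0.0)
        elif kind == "metric":
            # Metric scale: absolute difference, normalised by the range.
            scores.append(1.0 - abs(a - b) / r if r > 0 else 1.0)
        else:
            raise ValueError(f"unsupported scale type: {kind!r}")
    return sum(scores) / len(scores)


# Assumed example: colour (nominal), smoker flag (binary), height in cm (metric).
kinds = ("nominal", "nominal", "metric")
ranges = (None, None, 50.0)
first = ("red", True, 170.0)
second = ("red", False, 180.0)
similarity = mixed_similarity(first, second, kinds, ranges)
```

For this pair the per-variable scores are 1, 0 and 0.8, so the similarity is about 0.6. Treating the colour codes as numbers and applying a city-block distance to all three variables would instead impose an ordering that a nominal scale does not have.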

Keywords

Distance Measure; Mahalanobis Distance; Winter Time; Proximity Measure; City Block
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Borgelt, Ch., Prototype-based Classification and Clustering, Habilitationsschrift, Otto-von-Guericke-Universität Magdeburg, Magdeburg, 2005.
  2. Cormack, R.M., A review of classification (with Discussion), J. R. Stat. Soc., A, 31, 321–367.
  3. Cox, T.F. and Cox, M.A.A., Multidimensional Scaling, 2nd ed., Chapman & Hall, Boca Raton etc., 2001.
  4. Frakes, W.B. and Baeza-Yates, R., Information Retrieval: Data Structures and Algorithms, Prentice Hall, Upper Saddle River, 1992.
  5. Godan, M., Über die Komplexität der Bestimmung der Ähnlichkeit von geometrischen Objekten in höheren Dimensionen, Dissertation, Freie Universität Berlin, 1991.
  6. Gower, J., A general coefficient of similarity and some of its properties, Biometrics, 27, 857–874.
  7. Hartigan, J.A., Representation of similarity matrices by trees, J. Am. Stat. Assoc., 62, 1140–1158, 1967.
  8. Hubálek, Z., Coefficients of association and similarity based on binary (presence-absence) data; an evaluation, Biol. Rev., 57, 669–689.
  9. Kruse, R. and Meyer, K.D., Statistics with Vague Data, D. Reidel Publishing Company, Dordrecht, 1987.
  10. Kullback, S., Information Theory and Statistics, Wiley, New York etc., 1959.
  11. Jardine, N. and Sibson, R., Mathematical Taxonomy, Wiley, London, 1971.
  12. Mahalanobis, P.C., On the Generalized Distance in Statistics. In: Proceedings Natl. Inst. Sci. India, 2, 49–55, 1936.
  13. Murtagh, F., Identifying and Exploiting Ultrametricity. In: Advances in Data Analysis, Decker, R. and Lenz, H.-J. (eds.), Springer, Heidelberg, 2007.
  14. Skarabis, H., Mathematische Grundlagen und praktische Aspekte der Diskrimination und Klassifikation, Physica-Verlag, Würzburg, 1970.
  15. Sneath, P.H.A. and Sokal, R.R., Numerical Taxonomy, Freeman and Co., San Francisco, 1973.

Copyright information

© CISM, Udine 2008

Authors and Affiliations

  • Hans-J. Lenz, Institute of Statistics and Econometrics, Freie Universität Berlin, Germany
