Abstract
An examination of many of the indices proposed as numerical measures of pairwise similarity shows that they are closely related to string-to-string measures variously known as ‘Levenshtein distance’, ‘longest common subsequence’ or ‘minimal mutation distance’. The variations among coefficients arise in several ways, including changing the set of operations, using a richer structural pattern, modifying weights, limiting the extent of operations and varying the basis for normalisation. Taken together, these measures provide a very flexible means of assessing similarity and can be extended to similarities based on collections of strings. While not denying the interest to the user of other properties, such as metricity or embedding in a euclidean space, examining the coefficients as variations on the Levenshtein theme offers a common basis for their comparison and gives the user a rational means of choosing between coefficients. However interesting this array of coefficients might be, it remains true that only some features of similarity will be captured in a minimal mutational measure. These features may be more or fewer than the user actually requires. In this paper I have made a preliminary examination of various measures, some of which are related to the Levenshtein metric, and some of which appear to capture other aspects of similarity (i.e. topological, functional, analogic and/or conceptual). These latter are all measures which I have been unable to relate to the Levenshtein distance, although I have not pursued this very far as yet. All measures were applied to vegetation data, classifying both plots and attributes into a two-way table. The SAHN algorithm has been used for most of the clusterings, so that differences between measures of similarity are the primary cause of differences in results. In a few cases other clustering algorithms have been used, and the data have been converted to presence/absence when this was necessary for the particular coefficient.
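As an illustration of how several of these variations fit one scheme, the following is a minimal sketch of a weighted edit distance in Python. It is not the chapter's own implementation. The function names, the default weights and the choice of the longer sequence length as the basis for normalisation are all assumptions made for the example. Changing the weights changes the coefficient: with unit costs the result is the Levenshtein distance, and with substitution priced as a deletion plus an insertion the distance determines the length of the longest common subsequence.

```python
from typing import Hashable, Sequence


def edit_distance(
    a: Sequence[Hashable],
    b: Sequence[Hashable],
    ins: float = 1.0,
    dele: float = 1.0,
    sub: float = 1.0,
) -> float:
    """Weighted edit distance by dynamic programming.

    The weights ins, dele and sub are illustrative parameters (assumptions),
    not values taken from the chapter.
    """
    # prev[j] holds the cost of turning the empty prefix of a into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * dele]  # cost of deleting the first i items of a
        for j, y in enumerate(b, start=1):
            match_cost = 0.0 if x == y else sub
            curr.append(
                min(
                    prev[j] + dele,             # delete x
                    curr[j - 1] + ins,          # insert y
                    prev[j - 1] + match_cost,   # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Longest common subsequence length, derived from the edit distance.

    With sub = 2 a substitution never beats delete + insert, so the
    distance equals len(a) + len(b) - 2 * LCS.
    """
    d = edit_distance(a, b, sub=2.0)
    return int((len(a) + len(b) - d) / 2)


def normalised_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unit-cost distance divided by the longer length (one possible basis)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest
```

Because the functions accept any sequence of hashable items, the same sketch could be applied to an ordered list of species codes for a plot as readily as to a character string. Other bases for normalisation, such as the sum of the two lengths, would give different coefficients from the same underlying distance.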
© 1991 Springer Science+Business Media Dordrecht
Cite this chapter
Dale, M. (1991). Mutational and Nonmutational Similarity Measures: A Preliminary Examination. In: Feoli, E., Orlóci, L. (eds) Computer assisted vegetation analysis. Handbook of vegetation science, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-3418-7_12
DOI: https://doi.org/10.1007/978-94-011-3418-7_12
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-010-5512-3
Online ISBN: 978-94-011-3418-7