Journal of Classification, Volume 5, Issue 2, pp 181–204

A study of standardization of variables in cluster analysis

  • Glenn W. Milligan
  • Martha C. Cooper
Abstract

A methodological problem in applied clustering involves the decision of whether or not to standardize the input variables prior to the computation of a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision process. The present simulation study examined the standardization problem. A variety of data structures were generated that varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments. These involved error-free data, error-perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. It was found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.
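To make the two families of standardization named in the abstract concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' code: the abstract does not give the exact formulas for the eight strategies, so the function names (`zscore`, `divide_by_range`, `euclidean_distances`), the use of the sample standard deviation, and the toy data are all hypothetical choices made here.

```python
import numpy as np

def zscore(X):
    """Center each column on its mean and divide by its sample standard deviation.

    Assumption: ddof=1 (sample SD); the abstract does not say which form was used.
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def divide_by_range(X):
    """Divide each column by its range (max - min).

    Assumption: no column is constant, otherwise the range is zero.
    """
    X = np.asarray(X, dtype=float)
    return X / (X.max(axis=0) - X.min(axis=0))

def euclidean_distances(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical toy data: two variables measured on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 1100.0],
    [9.0, 1050.0],
    [10.0, 1150.0],
])

Z = zscore(X)            # z-score standardization
R = divide_by_range(X)   # range standardization

D_raw = euclidean_distances(X)    # dominated by the large-scale variable
D_range = euclidean_distances(R)  # both variables contribute on a common scale
```

After range standardization every variable spans an interval of width one, so no single variable dominates the Euclidean distance purely because of its measurement units. The resulting distance matrix could then be passed to any of the clustering methods under comparison.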

Keywords

Standard scores; Cluster analysis

Copyright information

© Springer-Verlag New York Inc. 1988

Authors and Affiliations

  • Glenn W. Milligan (1)
  • Martha C. Cooper (2)
  1. Faculty of Management Sciences, The Ohio State University, Columbus, USA
  2. Faculty of Marketing, The Ohio State University, Columbus, USA
