A study of standardization of variables in cluster analysis

Abstract

A methodological problem in applied clustering involves the decision of whether or not to standardize the input variables prior to the computation of a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision process. The present simulation study examined the standardization problem. A variety of data structures were generated that varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments: error-free data, error-perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. It was found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.
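
To make the contrast concrete, the following is a minimal Python sketch of two of the kinds of transformation named above: the z-score and division by the range of the variable. It is an illustration only, not the procedure used in the study. The function names, the use of the sample standard deviation, and the four-point data set are all assumptions introduced here, and the eight strategies examined in the study are not reproduced.

```python
import math
import statistics


def z_score(column):
    """Center on the mean and divide by the sample standard deviation.

    Assumes the column is not constant (otherwise the divisor is zero).
    """
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def range_scale(column):
    """Divide each value by the range (max - min) of the variable.

    Assumes the column is not constant (otherwise the divisor is zero).
    """
    span = max(column) - min(column)
    return [x / span for x in column]


def standardize(rows, method):
    """Apply a per-variable transformation to a list of observations."""
    columns = list(zip(*rows))
    scaled = [method(list(col)) for col in columns]
    return [list(obs) for obs in zip(*scaled)]


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical data: the first variable separates two groups,
# the second is on a much larger scale and does not.
rows = [[1.0, 1000.0], [2.0, 1100.0], [9.0, 1050.0], [10.0, 1150.0]]

z_rows = standardize(rows, z_score)
r_rows = standardize(rows, range_scale)

# Distance between the first and third observations under each treatment.
raw_d = euclidean(rows[0], rows[2])
z_d = euclidean(z_rows[0], z_rows[2])
r_d = euclidean(r_rows[0], r_rows[2])
```

On the raw data the distance is dominated by whichever variable has the larger numeric scale. After either transformation both variables contribute on comparable scales, and the distances can then be passed to any clustering method that accepts a dissimilarity matrix.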

Cite this article

Milligan, G.W., Cooper, M.C. A study of standardization of variables in cluster analysis. Journal of Classification 5, 181–204 (1988). https://doi.org/10.1007/BF01897163

Keywords

  • Standard scores
  • Cluster analysis