Abstract
A methodological problem in applied clustering is whether to standardize the input variables before computing a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision. The present simulation study examined the standardization problem. A variety of data structures were generated which varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments: error-free data, error-perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. It was found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.
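To make the two transformations named above concrete, the following is a minimal sketch in plain Python. It assumes that the z-score means centering on the mean and dividing by the sample standard deviation, and that range standardization means dividing each variable by its range (max minus min). The abstract does not give the exact formulas for the eight strategies studied, so these definitions, the function names, and the toy data are illustrative assumptions rather than the authors' procedure.

```python
from statistics import mean, stdev


def z_score(column):
    """Assumed z-score: subtract the mean, divide by the sample std. dev."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]


def range_scale(column):
    """Assumed range standardization: divide by (max - min), no centering."""
    r = max(column) - min(column)
    return [x / r for x in column]


def standardize(rows, scaler):
    """Apply `scaler` to every variable (column) of a list of observations."""
    columns = [scaler(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical data: two variables measured on very different scales.
rows = [[1.0, 1000.0], [2.0, 1100.0], [9.0, 1050.0], [10.0, 1150.0]]

by_range = standardize(rows, range_scale)
by_z = standardize(rows, z_score)

# Distances between the first two observations under each treatment.
d_raw = euclidean(rows[0], rows[1])
d_range = euclidean(by_range[0], by_range[1])
d_z = euclidean(by_z[0], by_z[1])
```

On the raw data the second variable dominates the distance simply because of its larger scale; both transformations remove that effect, but they weight the variables differently. A constant variable would cause a division by zero in either function and would need separate handling.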
Cite this article
Milligan, G.W., Cooper, M.C. A study of standardization of variables in cluster analysis. Journal of Classification 5, 181–204 (1988). https://doi.org/10.1007/BF01897163
DOI: https://doi.org/10.1007/BF01897163