Abstract
The problem of measuring the impact of individual data points in a cluster analysis is examined. The purpose is to identify those data points that have an influence on the resulting cluster partitions. Influence of a single data point is considered present when different cluster partitions result from the removal of the element from the data set. The Hubert and Arabie (1985) corrected Rand index was used to provide numerical measures of influence of a data point. Simulated data sets consisting of a variety of cluster structures and error conditions were generated to validate the influence measures. The results showed that the measure of internal influence was 100% accurate in identifying those data elements exhibiting an influential effect. The nature of the influence, whether beneficial or detrimental to the clustering, can be evaluated with the use of the gamma and point-biserial statistics.
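The leave-one-out idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' exact procedure: the function names, the naive single-link clustering routine, and the toy data are assumptions chosen for brevity. Only the use of the Hubert and Arabie (1985) corrected Rand index to compare the partitions obtained with and without a given point comes from the text.

```python
from itertools import combinations
from math import comb, dist


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie corrected Rand index for two partitions of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    cells, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        cells[(a, b)] = cells.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case: both partitions are trivial
        return 1.0
    return (index - expected) / (max_index - expected)


def single_link(points, k):
    """Naive single-link agglomeration down to k clusters (illustrative only)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Merge the two clusters whose closest members are nearest to each other.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(points[i], points[j])
                for i in clusters[pair[0]]
                for j in clusters[pair[1]]
            ),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe because a < b
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def leave_one_out_influence(points, k, cluster=single_link):
    """For each point, the corrected Rand index between the full-data partition
    (restricted to the other points) and the partition found without it."""
    full = cluster(points, k)
    scores = []
    for i in range(len(points)):
        reduced = cluster(points[:i] + points[i + 1:], k)
        reference = full[:i] + full[i + 1:]
        scores.append(adjusted_rand_index(reference, reduced))
    return scores


# Toy data (hypothetical): two tight groups plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (30, 30)]
scores = leave_one_out_influence(points, k=2)
influential = [i for i, s in enumerate(scores) if abs(s - 1.0) > 1e-9]
print(influential)
```

An index value of 1 means that removing the point leaves the partition of the remaining points unchanged; any value below 1 marks the point as influential in the sense used above. The sketch says nothing about whether the influence is beneficial or detrimental, which the abstract addresses separately through the gamma and point-biserial statistics.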
References
ANDERBERG, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press.
BELBIN, L., FAITH, D., and MILLIGAN, G.W. (1992), “A Comparison of Two Approaches to Beta-Flexible Clustering,” Multivariate Behavioral Research, 27, 417–433.
BRECKENRIDGE, J.N. (1989), “Replicating Cluster Analysis: Method, Consistency, and Validity,” Multivariate Behavioral Research, 24, 147–161.
CHENG, R., and MILLIGAN, G.W. (1995), “Mapping Influence Regions in Hierarchical Clustering,” Multivariate Behavioral Research, 30, 547–576.
CORMACK, R.M. (1971), “A Review of Classification,” Journal of the Royal Statistical Society, Series A, 134, 321–367.
CROVELLO, T. (1968), “The Effect of Change of Number of OTU's in a Numerical Taxonomic Study,” Brittonia, 20, 346–367.
CROVELLO, T. (1969), “Effects of Change of Characters and of Number of Characters in Numerical Taxonomy,” American Midland Naturalist, 81, 68–86.
DUBES, R., and JAIN, A.K. (1980), “Clustering Methodologies in Exploratory Data Analysis,” in Advances in Computers (Vol. 19), Ed., M. C. Yovits, New York: Academic Press, 113–215.
EDELBROCK, C. (1979), “Comparing the Accuracy of Hierarchical Clustering Algorithms: The Problem of Classifying Everybody,” Multivariate Behavioral Research, 14, 367–384.
EVERITT, B.S. (1974), Cluster Analysis, New York: Wiley.
GNANADESIKAN, R., KETTENRING, J.R., and LANDWEHR, J.M. (1977), “Interpreting and Assessing the Results of Cluster Analyses,” Bulletin of the International Statistical Institute, 47, 451–463.
GOODMAN, L.A., and KRUSKAL, W.H. (1954), “Measures of Association for Cross-Classifications,” Journal of the American Statistical Association, 49, 732–764.
GORDON, A.D. (1981), Classification: Methods for the Exploratory Analysis of Multivariate Data, London: Chapman & Hall.
GORDON, A.D. (1987), “A Review of Hierarchical Classification,” Journal of the Royal Statistical Society, Series A, 150, 119–137.
GORDON, A.D., and DE CATA, A. (1988), “Stability and Influence in Sum of Squares Clustering,” Metron, 46, 347–360.
GOWER, J.C., and ROSS, G.J.S. (1969), “Minimum Spanning Trees and Single-Link Cluster Analysis,” Applied Statistics, 18, 54–64.
HUBERT, L.J., and ARABIE, P. (1985), “Comparing Partitions,” Journal of Classification, 2, 193–218.
JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice-Hall.
JOLLIFFE, I.T., JONES, B., and MORGAN, B.J.T. (1988), “Stability and Influence in Cluster Analysis,” in Data Analysis and Informatics, V, Ed., E. Diday, Amsterdam: Elsevier (North-Holland), 507–514.
MCINTYRE, R.M., and BLASHFIELD, R.K. (1980), “A Nearest-Centroid Technique for Evaluating the Minimum-Variance Clustering Procedure,” Multivariate Behavioral Research, 15, 225–238.
MILLIGAN, G.W. (1980), “An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms,” Psychometrika, 45, 325–342.
MILLIGAN, G.W. (1981), “A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis,” Psychometrika, 46, 187–199.
MILLIGAN, G.W. (1985), “An Algorithm for Generating Artificial Test Clusters,” Psychometrika, 50, 123–127.
MILLIGAN, G.W. (1989), “A Validation Study of a Variable Weighting Algorithm for Cluster Analysis,” Journal of Classification, 6, 53–71.
MILLIGAN, G.W. (1995), “Clustering Validation: Results and Implications for Applied Analyses,” in Clustering and Classification, Eds., P. Arabie, L. Hubert, and G. De Soete, River Edge, New Jersey: World Scientific Press, 345–375.
MOREY, L.C., BLASHFIELD, R.K., and SKINNER, H.A. (1983), “A Comparison of Cluster Analysis Techniques Within a Sequential Validation Framework,” Multivariate Behavioral Research, 18, 309–329.
SILVESTRI, L., and HILL, I.R. (1964), “Some Problems of the Taxonometric Approach,” in Phenetic and Phylogenetic Classification, Eds., V.H. Heywood and J. McNeill, London: The Systematics Association (The Systematics Association Publication No. 6), 87–104.
SMITH, P.S., and DUBES, R. (1980), “Stability of a Hierarchical Clustering,” Pattern Recognition, 12, 177–187.
SOKAL, R.R., KIM, J., and ROHLF, F.J. (1992), “Character and OTU Stability in Five Taxonomic Groups,” Journal of Classification, 9, 117–140.
WARD, J.H. JR. (1963), “Hierarchical Grouping to Optimize an Objective Function,” Journal of the American Statistical Association, 58, 236–244.
Cite this article
Milligan, G.W., Cheng, R. Measuring the influence of individual data points in a cluster analysis. Journal of Classification 13, 315–335 (1996). https://doi.org/10.1007/BF01246105
DOI: https://doi.org/10.1007/BF01246105