Effect of Data Standardization on the Result of k-Means Clustering
When clustering is applied to multivariate data that include some large-scale variables, the results depend on those variables more than the user intends. In such cases, the data should be standardized to control this dependency. For high-dimensional data, Doherty et al. (Appl Soft Comput 7:203–210, 2007) argued numerically that standardizing data by variable range leads to almost the same results regardless of the norm used, whereas Aggarwal et al. (Lect Notes Comput Sci 1973:420–434, 2001) showed theoretically that a fraction norm reduces the effect of the curse of high dimensionality on the k-means result more than the Euclidean norm does. However, neither study adequately considered the effects of standardization or the factors that influence the clustering results. In this paper, we verify the effects of six data standardization methods combined with various norms and examine the factors that affect the clustering results for high-dimensional data. We show that data standardization combined with the fraction norm reduces the effect of the curse of high dimensionality and gives more effective results than either data standardization with the Euclidean norm or the fraction norm without data standardization.
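To make the two ingredients named above concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes that "standardization by variable range" means rescaling each variable to [0, 1] by its range, and that a "fraction norm" means a Minkowski-type distance with exponent 0 < p < 1 (which does not satisfy the triangle inequality, so it is not a norm in the strict sense). The function names, the toy data, and the choice p = 0.5 are hypothetical, and only the assignment step of k-means is shown.

```python
import numpy as np

def range_standardize(X):
    """Rescale each column to [0, 1] by its range (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

def minkowski_distance(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p); p = 2 is Euclidean, 0 < p < 1 is fractional."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def assign_to_nearest(X, centers, p):
    """Assignment step of k-means: index of the nearest center for each row."""
    dists = np.array([[minkowski_distance(x, c, p) for c in centers] for x in X])
    return dists.argmin(axis=1)

# Toy data (hypothetical): the second variable has a much larger scale.
X = np.array([[1.0, 1000.0],
              [2.0, 3500.0],
              [9.0, 2500.0],
              [10.0, 5000.0]])
center_rows = [0, 3]  # use the first and last points as initial centers

# Without standardization the large-scale variable dominates the distance.
raw_labels = assign_to_nearest(X, X[center_rows], p=2)

# After range standardization both variables contribute comparably.
Z = range_standardize(X)
std_labels_euclid = assign_to_nearest(Z, Z[center_rows], p=2)
std_labels_frac = assign_to_nearest(Z, Z[center_rows], p=0.5)

print(raw_labels, std_labels_euclid, std_labels_frac)
```

On this toy data the unstandardized assignment follows the large-scale second variable, while both standardized assignments group the points by a balance of the two variables. The sketch illustrates only the dependency on scale; it says nothing about the high-dimensional behavior studied in the paper.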