Encyclopedia of Database Systems

Living Edition
Editors: Ling Liu, M. Tamer Özsu

K-Means and K-Medoids

Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_545-2

Definitions

K-Means

Given an integer k and a set of objects \( S = \{p_1, p_2, \ldots, p_n\} \) in Euclidean space, the problem of k-means clustering is to find a set of centre points (means) \( P = \{c_1, c_2, \ldots, c_k\} \), |P| = k, such that S can be partitioned into k corresponding clusters \( C_1, C_2, \ldots, C_k \) by assigning each object in S to the closest centre \( c_i \), and the sum of square error criterion (SEC) measure, defined as \( \sum_{i=1}^{k} \sum_{p \in C_i} \| p - c_i \|^2 \), is minimized.

K-Medoids

Given an integer k and a set of objects \( S = \{p_1, p_2, \ldots, p_n\} \) in Euclidean space, the problem of k-medoids clustering is to find a set of k objects from S as medoids, \( P = \{o_1, o_2, \ldots, o_k\} \), |P| = k, such that S can be partitioned into k corresponding clusters \( C_1, C_2, \ldots, C_k \) by assigning each object in S to the closest medoid \( o_j \), and the sum of square error criterion (SEC) measure, defined as \( \sum_{j=1}^{k} \sum_{p \in C_j} \| p - o_j \|^2 \), is minimized.
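
To make the objective concrete, the following is a minimal sketch of the SEC computation shared by both definitions (the entry itself contains no code; Python, plain coordinate tuples, and the helper name sec are illustrative assumptions):

```python
def sec(points, centres, assignment):
    """Sum of square error criterion (SEC): the sum, over all objects,
    of the squared Euclidean distance to the centre of their cluster.

    points     -- list of d-dimensional coordinate tuples
    centres    -- list of k centre tuples (means or medoids)
    assignment -- assignment[i] is the index of the centre of points[i]
    """
    total = 0.0
    for p, j in zip(points, assignment):
        total += sum((pd - cd) ** 2 for pd, cd in zip(p, centres[j]))
    return total
```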

Key Points

K-Means

The k-means algorithm starts with a random selection of k objects as the centres (means) of the k clusters. Placing these initial centres well matters, because different centre locations can lead to different clusterings; a reasonable initial choice is therefore to locate them as far away from each other as possible. An iterative process then assigns each remaining object to its nearest centre. Once the memberships of all clusters have been determined from the given initial centres, the cluster centres are recomputed, yielding a new set of k means \( P = \{c_1, c_2, \ldots, c_k\} \). For example, the mean \( m_i \) of a set of single-valued objects in cluster i with m points is defined as:
$$ m_i = \frac{1}{m} \sum_{j=1}^{m} p_{i,j} $$

The means of high-dimensional objects are calculated in the same way, by applying this formula to each dimension. When the cluster centres have been updated, the membership computation is executed again to reallocate the objects to their nearest centres. Both updates are driven by Euclidean distances, so that objects within a cluster are as close together as possible while objects in different clusters are as far apart as possible. The process continues until no update changes the centres or the memberships of the clusters. Although it is easy to prove that the procedure always terminates, the k-means algorithm does not necessarily find the globally optimal solution, i.e., the global minimum of the objective function. It is also easy to demonstrate that the algorithm is highly sensitive to the randomly selected initial centres; to mitigate this, the k-means algorithm can be executed multiple times and the best of the resulting solutions kept. The complexity of the algorithm is O(nkt) for n objects, k clusters, and t iterations.
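
The following is a compact, self-contained sketch of the procedure just described: random initial centres, nearest-centre membership, and mean recomputation until nothing changes. The function name, the max_iter safeguard, and the handling of empty clusters are illustrative choices, not part of the original description:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Iterative k-means as described above; points are coordinate tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # random selection of k objects as centres
    assignment = []
    for _ in range(max_iter):        # t iterations -> O(nkt) overall
        # Membership step: assign each object to its nearest centre.
        assignment = [
            min(range(k),
                key=lambda j: sum((pd - cd) ** 2
                                  for pd, cd in zip(p, centres[j])))
            for p in points
        ]
        # Update step: recompute each centre as the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centres.append(tuple(sum(xs) / len(members)
                                         for xs in zip(*members)))
            else:                    # keep the old centre if a cluster empties
                new_centres.append(centres[j])
        if new_centres == centres:   # no more updates -> terminate
            break
        centres = new_centres
    return centres, assignment
```

Because the outcome depends on the initial centres, one could call this sketch with several different seeds and keep the clustering with the smallest SEC, in line with the multiple-execution strategy mentioned above.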

The k-means algorithm is only applicable to objects for which a mean is defined, and the user must specify the value of k in advance. The approach is suitable only for finding convex-shaped clusters of similar sizes. Moreover, when the dataset contains noise and outliers, the means of clusters may end up at locations far away from the majority of the objects.

K-Medoids

The k-medoids algorithm avoids calculating the means of clusters, in which extremely large values may substantially distort the membership computations. It handles outliers well by selecting the most centrally located object of a cluster as its reference point, namely the medoid. The difference between k-means and k-medoids is analogous to the difference between mean and median: the mean is the average of all the data items, while the median is the value around which the data items are evenly distributed. The basic idea of k-medoids is to first arbitrarily select k of the n objects in the dataset as the initial medoids. The remaining objects are then partitioned into k clusters by assigning each object to the medoid at minimum Euclidean distance. An iterative process then considers, for each object \( p_i \), i = 1, ..., n, whether a medoid \( o_j \), j = 1, ..., k, can be replaced by a candidate object \( o_c \), c = 1, ..., n, c ≠ i. Four situations have to be considered in this process: (i) shift-out membership: an object \( p_i \) may need to be shifted from the currently considered cluster of \( o_j \) to another cluster; (ii) update of the current medoid: a new medoid \( o_c \) is found to replace the current medoid \( o_j \); (iii) no change: the objects in the current clusters yield the same or a smaller SEC than all of the possible redistributions considered; (iv) shift-in membership: an outside object \( p_i \) is assigned to the current cluster with the new (replaced) medoid \( o_c \).
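
A minimal sketch of this swap-based search, reusing the sec helper sketched under the definitions above; the helper names and the first-improvement acceptance order are illustrative assumptions rather than a definitive implementation:

```python
import random

def assign(points, medoids):
    """Nearest-medoid membership under squared Euclidean distance."""
    return [min(range(len(medoids)),
                key=lambda j: sum((pd - md) ** 2
                                  for pd, md in zip(p, medoids[j])))
            for p in points]

def kmedoids(points, k, seed=0):
    """k-medoids as described above: start from k arbitrary objects,
    then replace a medoid o_j by a non-medoid candidate o_c whenever
    the swap lowers the overall SEC; memberships shift in and out of
    clusters as a side effect of each accepted swap."""
    medoids = random.Random(seed).sample(points, k)
    current = sec(points, medoids, assign(points, medoids))
    improved = True
    while improved:
        improved = False
        for j in range(k):            # each medoid o_j ...
            for o_c in points:        # ... against each candidate object o_c
                if o_c in medoids:
                    continue
                trial = medoids[:j] + [o_c] + medoids[j + 1:]
                cost = sec(points, trial, assign(points, trial))
                if cost < current:    # update the current medoid
                    medoids, current, improved = trial, cost, True
    return medoids, assign(points, medoids)
```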

The k-medoids algorithm needs to test whether each existing medoid can be replaced by any other object; over all of these possible replacements, whenever the overall SEC is improved, the given medoid is replaced. The computational cost of k-medoids is much higher than that of k-means: each iteration costs \( O(k(n-k)^2) \), since k(n-k) medoid/candidate pairs have to be considered, each requiring an evaluation of the SEC over the remaining n-k objects. The k-medoids algorithm is therefore often combined with a sampling method to improve its computational complexity: several samples are taken from the dataset, the k-medoids algorithm is applied to each of the samples, and the set of medoids from the sample that performs best is chosen.
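
This sampling strategy is essentially the CLARA method of Kaufman and Rousseeuw (see Recommended Reading). A sketch, reusing the kmedoids, assign, and sec helpers above; the number of samples and the 40 + 2k sample size follow the commonly quoted CLARA defaults but are assumptions here:

```python
import random

def clara(points, k, num_samples=5, seed=0):
    """Run k-medoids on several random samples and keep the medoid set
    with the lowest SEC measured over the *full* dataset."""
    rng = random.Random(seed)
    sample_size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids, _ = kmedoids(sample, k)
        cost = sec(points, medoids, assign(points, medoids))
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```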

Recommended Reading

  1. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. New York: John Wiley; 1990.
  2. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol. 1; 1967. p. 281–97.
  3. Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases; 1994. p. 144–55.

Copyright information

© Springer Science+Business Media LLC 2016

Authors and Affiliations

  The University of Queensland, Brisbane, Australia

Section editors and affiliations

  • Xiaofang Zhou, School of Inf. Tech. & Elec. Eng., Univ. of Queensland, Brisbane, Australia