
Using a Set of Triangle Inequalities to Accelerate K-means Clustering

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12440)

Abstract

K-means clustering is a well-known problem in data mining and machine learning. However, the de facto standard, i.e., Lloyd’s k-means algorithm, spends a large amount of time on distance calculations. Elkan’s k-means algorithm, one prominent approach, exploits the triangle inequality to greatly reduce the distance calculations between points and centers, while achieving exactly the same clustering results with a significant speed improvement, especially on high-dimensional datasets. In this paper, we propose a set of triangle inequalities to enhance the filtering step of Elkan’s k-means algorithm. With our new filtering bounds, a filtering-based Elkan (FB-Elkan) is proposed, which preserves the same results as Lloyd’s k-means algorithm and additionally prunes unnecessary distance calculations. In addition, a memory-optimized Elkan (MO-Elkan) is provided, in which the space complexity is greatly reduced by trading off the maintenance of lower bounds against run-time efficiency. Through evaluations with real-world datasets, FB-Elkan in general accelerates the original Elkan’s k-means algorithm for high-dimensional datasets (up to 1.69x), whereas MO-Elkan outperforms the others for low-dimensional datasets (up to 2.48x). Specifically, when the datasets have a large number of points, i.e., \(n\ge 5\)M, MO-Elkan can still derive the exact clustering results, while the original Elkan’s k-means algorithm is not applicable due to memory limitations.

Keywords

K-means clustering · Acceleration · Triangle inequalities

1 Introduction

K-means clustering is one of the most popular problems in data mining and machine learning due to its simplicity and applicability. The de facto standard, i.e., Lloyd’s k-means algorithm [12], performs two steps repeatedly: 1) the assignment step matches each point to its closest center, and 2) the update step recomputes the center of each cluster from the assigned points. However, the bottleneck in terms of time complexity is identifying the closest center for each input data point, which leads to a high time complexity of O(nkd), where n is the number of data points, k is the number of centers, and d is the number of dimensions. In many applications these numbers are large, e.g., for data on the health status of patients, earth observation, or computer vision. Therefore, efficient k-means clustering algorithms are highly desired.
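
For reference, a minimal sketch of Lloyd’s two-step iteration in Python might look as follows; the function name, the naive seeding, and the convergence test are illustrative assumptions, not a specific implementation from this paper.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's k-means: O(nkd) distance work per iteration."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # naive seeding
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center (n*k distances).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return labels, centers
```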

In order to accelerate the k-means algorithm, two distinct categories are widely studied in the literature. 1) Approximate solutions: Instead of accelerating the exact k-means algorithm, the techniques in this category compute approximate solutions, e.g., [15, 17, 18], which do accelerate k-means, but the final clustering results cannot be guaranteed to be the same as those of Lloyd’s k-means algorithm. 2) Acceleration with exact results: The techniques in this category accelerate the calculation procedure while preserving exactly the same results as Lloyd’s k-means algorithm. For example, Kanungo et al. [11] and Pelleg et al. [14] propose to accelerate the nearest-neighbor search without computing distances to all k centers by using the properties of special data structures. However, the overhead of preprocessing becomes significant when the input datasets are high-dimensional. Alternatively, several acceleration techniques exploit bounds on the distances between data points and centers, e.g., [3, 5, 7, 8, 10, 13, 16]. By maintaining lower and upper bounds on the distances to the cluster centers, most of the distance calculations can be skipped. In particular, Elkan’s k-means algorithm [8], one prominent approach among them, still dominates the others on high-dimensional datasets [13]. Nevertheless, Elkan’s k-means algorithm is infeasible when the number of data points (n) or centers (k) is large, due to the size of the memory footprint for storing the lower bounds, where the space complexity is O(nk).1

With the above pros and cons, we are motivated to revisit Elkan’s k-means algorithm and propose a set of new filtering bounds based on triangle inequalities to improve the filtering step.
Fig. 1.

Overview of the optimized Elkan’s k-means algorithm, which illustrates the interactions between the different components. Our contributions focus on the filtering step, which is highlighted in green. (Color figure online)

Our Contributions: Figure 1 illustrates an overview of our contributions. We aim at the filtering step in Elkan’s k-means algorithm highlighted in green, detailed as follows:
  • Three filtering bounds are proposed based on triangle inequalities to overcome shortcomings of Elkan’s k-means algorithm, by which most of the unnecessary distance calculations between points and centers during the iterations of Elkan’s k-means algorithm can be pruned (see Sect. 4).

  • We present how to optimize the original Elkan’s k-means algorithm to alleviate the time and space overheads by applying the above filtering bounds. Two optimized algorithms are proposed: filtering-based Elkan (FB-Elkan) and memory-optimized Elkan (MO-Elkan). Specifically, MO-Elkan has a space complexity of \(O(n+k^2+kd)\), whereas Elkan’s k-means algorithm requires \(O(nk+kd)\), where n is the number of input data points, d is the number of dimensions, and k is the number of clusters (see Sect. 5).

  • Through evaluations, we show that FB-Elkan is in general faster than the original Elkan’s k-means algorithm on high-dimensional datasets, whereas MO-Elkan considerably outperforms the others on low-dimensional datasets. Specifically, MO-Elkan can still derive the exact clustering results when the number of data points is large, i.e., \(n \ge 5\)M, while the original algorithm and FB-Elkan may not be applicable due to memory limitations (see Sect. 6).

The rest of this paper is organized as follows: In Sect. 2, we review related work regarding the bound-based accelerated algorithms that produce exactly the same clustering results as the standard (Lloyd’s) k-means algorithm. Section 3 defines the notation used in this paper and presents a short, general overview of Elkan’s k-means algorithm, which we use as a backbone. Section 4 presents our new filtering conditions. In Sect. 5, we discuss how to use the proposed bounds to optimize the original Elkan’s k-means algorithm. In Sect. 6, extensive evaluation results and discussions on different real-world datasets are presented. Finally, we conclude the paper in Sect. 7.

2 Related Work

In this section, we review related work on accelerating Lloyd’s k-means algorithm with the triangle inequality, the so-called bound-based acceleration, listed as follows:
  • Elkan’s k-means algorithm  [8] takes advantage of lower bounds and upper bounds to reduce the redundant distance calculations.

  • Hamerly [10] proposes to keep only one lower bound per point, namely on the distance between the point and its second-closest center, instead of keeping \(k-1\) lower bounds per point. It is essentially a simplified version of Elkan’s k-means algorithm, but it is more efficient for low-dimensional datasets.

  • Drake and Hamerly [7] extend the above approach [10] to keep a variable number of lower bounds, which is adjusted automatically on the fly. Drake later proposes the Annulus algorithm in [6], which prunes the search space for each point using an annular region.

  • The Yinyang k-means algorithm [5] groups the cluster centers, which balances the time spent on filtering against the time spent on distance calculations.

  • Fast Yinyang k-means algorithm  [3] further proposes to approximate Euclidean distances by using block vectors, which can achieve good improvements when the dimension of data is high.

  • Newling and Fleuret [13] simplify the Yinyang and Elkan k-means algorithms and provide tighter upper and lower bounds for updating. They also propose the Exponion algorithm, which improves the Yinyang and Elkan k-means algorithms for low-dimensional datasets.

  • Ryšavý and Hamerly in  [16] propose a few methods to accelerate all aforementioned algorithms, such as producing tighter lower bounds, finding neighbor centers and accelerating k-means in the first iteration.

  • Fission-Fusion k-means algorithm  [19] keeps bounds for subgroups of clusters. It performs better for low-dimensional datasets.

Elkan’s k-means algorithm is known to suffer from the required space complexity of O(nk) to store the lower bounds, which may be infeasible for large k, as demonstrated in [5, 13]. However, Elkan’s k-means algorithm performs best in terms of run-time among the aforementioned accelerated k-means algorithms for high-dimensional datasets, as shown in [13], e.g., Gassensor \((d = 128)\), KDDcup98 \((d = 310)\), and MNIST784 \((d = 784)\).2 Therefore, we are motivated to continue in this same vein and make Elkan’s k-means algorithm even faster, or reduce its memory footprint to improve its scalability.

3 K-means Clustering and Elkan’s K-means Algorithm

For k-means clustering, we are given a positive integer k and a set \(\mathbf{X} \) of n d-dimensional data points. The objective is to partition the data points in \(\mathbf{X} \) into k clusters while minimizing the within-cluster variances, defined in terms of the Euclidean distance between each data point and the center of the cluster it belongs to. In this paper, we use \(t=0,1,2,\ldots \) to identify the discrete iterations, and each of the given data points in \(\mathbf{X} \) is classified into one of the k clusters in each iteration t. Specifically, Elkan’s k-means algorithm [8] accelerates Lloyd’s k-means algorithm using the triangle inequality.
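
For completeness, the objective described above can be written in the usual formulation (a standard statement, not quoted from this paper) as the sum of squared Euclidean distances of the points to their assigned centers:
$$\begin{aligned} \min _{C_1,\ldots ,C_k} \; \sum _{i=1}^{k} \sum _{x \in C_i} \delta (x, c_i)^2, \qquad \text {where } c_i = \frac{1}{|C_i|}\sum _{x \in C_i} x. \end{aligned}$$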

We use \(C_i(t)\) to denote the set of data points that are classified into the i-th cluster at the end of the t-th iteration. The i-th cluster at the end of the t-th iteration is defined by its cluster center \(c_i(t)\). A data point x is classified into the cluster \(C_i(t)\) if the Euclidean distance between the data point x and the cluster center is the shortest among all cluster centers. That is, \(x \in C_i(t)\) if \(\delta (x, c_i(t)) \le \delta (x, c_j(t))\) for all j, ties being broken arbitrarily, where \(\delta (x, y)\) is the Euclidean distance between two points x and y. For any t, we have \(\cup _{i=1}^{k} C_i(t) = \mathbf{X} \) and \(C_i(t) \cap C_j(t) = \emptyset \) when \(i\ne j\). In this paper, we assume that the calculation of the distance between any two points can be done in O(d) time and O(1) space.

Initially, when \(t=0\), k seeds are chosen as the initial cluster centers and each of the data points in \(\mathbf{X} \) is classified into one of the k clusters. At the beginning of the next iteration, i.e., \(t+1\), the i-th cluster center is repositioned to \(c_i(t+1)\) by calculating the mean of the data points in \(C_i(t)\). The shift of the cluster center is \(\delta (c_i(t), c_i(t+1))\). In the above procedure, updating the clustering at the end of the t-th iteration takes O(nkd) time.

Elkan’s k-means algorithm can avoid a large number of distance calculations by applying the triangle inequality to an upper bound and lower bounds maintained for each point. More precisely, in the t-th iteration, for every data point x in \(\mathbf{X} \), instead of calculating the distances of x to the k cluster centers, the algorithm maintains two types of bounds:
  • An upper bound \(ub(x, c_i(t))\) on the distance to the cluster center \(c_i(t)\) when x is classified into the i-th cluster, i.e., \(x \in C_i(t)\).

  • \(k-1\) lower bounds \(lb(x, c_j(t))\) on the distances to the other cluster centers \(c_j(t)\), for any \(x \notin C_j(t)\).

The elegance of Elkan’s k-means algorithm is to apply the triangle inequality to maintain these bounds without calculating the distances (a minimal code sketch of this maintenance is given after the two update rules below). If x remains in the same cluster, i.e., \(x \in C_i(t)\) and \(x \in C_i(t+1)\), instead of updating the distance information precisely, we simply apply the triangle inequality by setting
  • \(ub(x, c_i(t+1))\) to \(ub(x, c_i(t)) + \delta (c_i(t), c_i(t+1))\) and

  • \(lb(x, c_j(t+1))\) to \(lb(x, c_j(t)) - \delta (c_j(t), c_j(t+1))\) for any \(j \ne i\).
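
A minimal sketch of this bound maintenance, assuming the bounds are stored in arrays ub (one entry per point) and lb (one entry per point-center pair) and that center_shift[j] holds \(\delta (c_j(t), c_j(t+1))\); all names are illustrative.

```python
import numpy as np

def drift_bounds(ub, lb, labels, center_shift):
    """Loosen the bounds after the centers move, without computing any point-center distance.

    ub[x]           -- upper bound on the distance from point x to its own center
    lb[x, j]        -- lower bound on the distance from point x to center j
    center_shift[j] -- delta(c_j(t), c_j(t+1)), the movement of center j
    """
    # ub(x, c_i(t+1)) <= ub(x, c_i(t)) + delta(c_i(t), c_i(t+1))
    ub += center_shift[labels]
    # lb(x, c_j(t+1)) >= lb(x, c_j(t)) - delta(c_j(t), c_j(t+1))
    lb -= center_shift[None, :]
    np.maximum(lb, 0.0, out=lb)  # distances are never negative
    return ub, lb
```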

In the following lemma, Elkan [8] gives conditions under which a data point x in cluster \(C_i(t)\) is not going to be assigned to another cluster \(C_j(t+1)\).

Lemma 1

(Elkan [8]). Suppose that t is a non-negative integer and \(x \in C_i(t)\). Then, x is not going to be classified into another cluster \(C_j(t+1)\) at the end of the \((t+1)\)-th iteration if
$$\begin{aligned} ub(x, c_i(t+1)) < \frac{1}{2} \delta (c_i(t+1), c_j(t+1)) \end{aligned}$$
(1)
or
$$\begin{aligned} ub \left( x, c_i(t+1) \right) \le lb(x, c_j(t)) - \delta (c_j(t), c_j(t+1)) \end{aligned}$$
(2)

We note that a significant drawback of Elkan’s k-means algorithm is that the space complexity is O(nk) due to the storage of the lower bounds, in addition to the O(nd) input data. The algorithm may not be applicable when nk (or even k) is sufficiently large.
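
For illustration, the two tests of Lemma 1 can be phrased as a per-point, per-candidate-center predicate; the helper below is a hypothetical sketch of the checks, not part of the original algorithm’s pseudocode.

```python
def can_skip_center(ub_x, lb_x_j_old, center_dist_ij, shift_j):
    """Return True if point x provably cannot move to center j (Lemma 1).

    ub_x            -- ub(x, c_i(t+1)), upper bound to x's current center
    lb_x_j_old      -- lb(x, c_j(t)), the stored lower bound from the previous iteration
    center_dist_ij  -- delta(c_i(t+1), c_j(t+1)), distance between the two new centers
    shift_j         -- delta(c_j(t), c_j(t+1)), movement of center j
    """
    # Eq. (1): x is closer to c_i than half the distance between the two centers.
    if ub_x < 0.5 * center_dist_ij:
        return True
    # Eq. (2): the drifted lower bound to c_j still exceeds the upper bound to c_i.
    if ub_x <= lb_x_j_old - shift_j:
        return True
    return False
```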

4 New Filtering Bounds

Although Elkan’s k-means algorithm can avoid a great number of unnecessary distance calculations, it has two shortcomings. First, within an iteration, i.e., for fixed t, solely applying Eq. (1) to decide the impossibility of relocating a data point to another cluster can be inefficient. In Sect. 4.1, we propose a simple condition which can be used to filter out centers that no data point in \(C_i(t)\) will be relocated to at the end of the \((t+1)\)-th iteration. Moreover, the maintained lower bounds \(lb(x, c_j(t))\) can be very expensive, i.e., they require O(nk) space, and can even become too inaccurate in some scenarios. In Sect. 4.2, we present two new lower bounds that can be independently applied to improve the space complexity and the inaccuracy.

4.1 Filtering for Clusters of Points

The following theorem provides a new filtering condition ensuring that a point that is not assigned to cluster \(C_j(t)\) will not be assigned to cluster \(C_j(t+1)\) at the end of the \((t+1)\)-th iteration either.

Theorem 1

Suppose that t is a non-negative integer, \(x \in C_i(t)\), and \(j \ne i\). Moreover, assume that
$$\begin{aligned} ub(x, c_i(t)) < \frac{1}{2} \delta (c_i(t), c_j(t)). \end{aligned}$$
(3)
The data point x is not going to be classified into another data cluster \(C_j(t+1)\) at the end of the \((t+1)\)-th iteration, if
$$\begin{aligned} \frac{1}{2}\delta \left( c_i(t), c_j(t) \right) + \delta \left( c_i(t), c_i(t+1) \right) \le \frac{1}{2}\delta \left( c_i(t+1), c_j(t+1) \right) . \end{aligned}$$
(4)

Proof

Recall that the i-th center is shifted from \(c_i(t)\) to \(c_i(t+1)\) after one iteration. By the triangle inequality, we have
$$\begin{aligned} \delta (x, c_i(t+1)) &\le \delta (x, c_i(t)) + \delta (c_i(t), c_i(t+1)) \\ &\le ub(x, c_i(t)) + \delta (c_i(t), c_i(t+1)) \qquad \text {(definition of } ub \text {)} \\ &\le \frac{1}{2}\delta \left( c_i(t), c_j(t) \right) + \delta \left( c_i(t), c_i(t+1) \right) \qquad \text {(by Eq. (3))} \\ &\le \frac{1}{2}\delta \left( c_i(t+1), c_j(t+1) \right) . \qquad \text {(by Eq. (4))} \end{aligned}$$
By the above condition, i.e., \(\delta (x, c_i(t+1)) \le \frac{1}{2} \delta \left( c_i(t+1), c_j(t+1) \right) \), we can apply the key property from Elkan  [8] (summarized in Lemma 1), which concludes that the data point x is not going to be classified into cluster \(C_j(t+1)\) whenever the conditions in Eq. (3) and Eq. (4) hold.    \(\square \)

The condition in Eq. (3) is already checked when applying the original Elkan’s k-means algorithm, since it corresponds to Eq. (1) of Lemma 1 at iteration t. The difference here is that a tighter bound is applied when the condition in Eq. (4) holds. This theorem is useful when the distance \(\delta \left( c_i(t+1), c_j(t+1) \right) \) is larger than \(\delta \left( c_i(t), c_j(t) \right) \).

Corollary 1

Suppose that t is a non-negative integer and that the upper bound on the distance of every data point in cluster \(C_i(t)\) to its center \(c_i(t)\) is at most \(UB_i(t)\), i.e., \(UB_i(t) = \max _{x \in C_i(t)}ub(x,c_i(t))\). If \(UB_i(t) < \frac{1}{2} \delta (c_i(t), c_j(t))\) and the condition in Eq. (4) holds, then none of the data points in cluster \(C_i(t)\) is going to be classified into another data cluster \(C_j(t+1)\) at the end of the \((t+1)\)-th iteration.

Proof

This comes directly from Theorem 1.    \(\square \)
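
A possible code rendering of this cluster-level filter, assuming the pairwise center distances of iterations t and t+1 are available as k×k matrices; all names are illustrative and the result corresponds to the set \(\text{ Set}_i\) used later in Sect. 5.

```python
import numpy as np

def candidate_centers(i, UB_i, cc_old, cc_new, shift_i):
    """Centers j that cluster C_i(t) might still lose points to (complement of Corollary 1).

    UB_i    -- max over x in C_i(t) of ub(x, c_i(t))
    cc_old  -- k x k matrix of delta(c_a(t),   c_b(t))
    cc_new  -- k x k matrix of delta(c_a(t+1), c_b(t+1))
    shift_i -- delta(c_i(t), c_i(t+1))
    """
    # Eq. (3) with UB_i: UB_i < 1/2 * delta(c_i(t), c_j(t)) for each j.
    cond3 = UB_i < 0.5 * cc_old[i]
    # Eq. (4): 1/2*delta(c_i(t),c_j(t)) + shift_i <= 1/2*delta(c_i(t+1),c_j(t+1)).
    cond4 = 0.5 * cc_old[i] + shift_i <= 0.5 * cc_new[i]
    filtered = cond3 & cond4          # centers no point of C_i(t) can move to
    keep = ~filtered
    keep[i] = False                   # the own center is never a "new" target
    return np.flatnonzero(keep)
```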

4.2 Additional Lower Bounds

In Elkan’s k-means algorithm, the way to ensure that a data point x in \(C_i(t)\) cannot be classified into a new cluster \(C_j(t+1)\) for some \(j \ne i\) in the next iteration is to make sure that an upper bound on the distance \(\delta (x, c_i(t+1))\) is no more than a lower bound on the distance \(\delta (x, c_j(t+1))\). For the latter, \(lb(x, c_j(t)) - \delta (c_j(t), c_j(t+1))\) is used as a lower bound of \(\delta (x, c_j(t+1))\), as stated in Eq. (2) in Lemma 1.

However, this lower bound becomes very small if the shift of the j-th center is significant. In fact, when \(\delta (c_j(t), c_j(t+1))\) is large, it is possible to find a tighter (i.e., larger) lower bound of \(\delta (x, c_j(t+1))\), as presented in the following theorem:

Theorem 2

Suppose that t is a non-negative integer and \(x \in C_i(t)\). The data point x is not going to be classified into another data cluster \(C_j(t+1)\) at the end of the \((t+1)\)-th iteration for any \(j \ne i\), if
$$\begin{aligned} ub \left( x, c_i(t+1) \right) \le \delta (c_i(t), c_j(t+1))- ub(x, c_i(t)) \end{aligned}$$
(5)

Proof

By triangle inequality
$$\begin{aligned} \delta (c_i(t), c_j(t+1))- ub(x, c_i(t))&\le \delta (c_i(t), c_j(t+1)) - \delta (x, c_i(t))\\&\le \delta (x, c_j(t+1)) \end{aligned}$$
Therefore, if the condition in Eq. (5) holds, we have \(\delta (x, c_i(t+1)) \le ub(x, c_i(t+1)) \le \delta (x, c_j(t+1))\) and the theorem is proved.    \(\square \)

Moreover, the lower bound \(lb(x, c_j(t))\) may not be available if we do not want to keep track of the distances between x and the other \(k-1\) cluster centers that x does not belong to. In fact, if \(c_i(t)\) and \(c_j(t)\) are quite distant, the lower bound in the following theorem can be applied:

Theorem 3

Suppose that t is a non-negative integer and \(x \in C_i(t)\). The data point x is not going to be classified into another data cluster \(C_j(t+1)\) at the end of the \((t+1)\)-th iteration for any \(j \ne i\), if
$$\begin{aligned} ub \left( x, c_i(t+1) \right) \le \delta (c_i(t), c_j(t))- ub(x, c_i(t)) - \delta (c_j(t), c_j(t+1)) \end{aligned}$$
(6)

Proof

By triangle inequality
$$\begin{aligned}&\delta (c_i(t), c_j(t))- ub(x, c_i(t)) - \delta (c_j(t), c_j(t+1)) \\ \le \;\;&\delta (x, c_j(t)) - \delta (c_j(t), c_j(t+1))\\ \le \;\;&\delta (x, c_j(t+1)) \end{aligned}$$
Therefore, if the condition in Eq. (6) holds, we have \(\delta (x, c_i(t+1)) \le ub(x, c_i(t+1)) \le \delta (x, c_j(t+1))\) and the theorem is proved.    \(\square \)

We note that the two new lower bounds introduced in Theorems 2 and 3 only require the information of \(ub(x, c_i(t))\) and the distances between the cluster centers. Therefore, they can be used to reduce the space complexity when maintaining the lower bounds \(lb(x, c_j(t))\) for all \(x \in \mathbf{X} \) and \(x \notin C_j(t)\) is too expensive, i.e., O(nk), as detailed in Sect. 5.
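
As a sketch, the two bounds can be evaluated as simple predicates that need only \(ub(x, c_i(t))\), \(ub(x, c_i(t+1))\), and center-to-center distances; the helper names are illustrative.

```python
def skip_by_theorem2(ub_new, ub_old, dist_ci_old_to_cj_new):
    """Eq. (5): ub(x, c_i(t+1)) <= delta(c_i(t), c_j(t+1)) - ub(x, c_i(t))."""
    return ub_new <= dist_ci_old_to_cj_new - ub_old

def skip_by_theorem3(ub_new, ub_old, dist_ci_old_to_cj_old, shift_j):
    """Eq. (6): ub(x, c_i(t+1)) <= delta(c_i(t), c_j(t)) - ub(x, c_i(t)) - delta(c_j(t), c_j(t+1))."""
    return ub_new <= dist_ci_old_to_cj_old - ub_old - shift_j
```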

5 Optimized Elkan’s K-means

In this section, we present how to optimize the original Elkan’s k-means algorithm to alleviate the time and space overheads by applying the different triangle inequalities presented in Lemma 1, Theorems 1, 2, and 3, and Corollary 1. We note that the triangle inequalities based on \(lb(x, c_j(t))\), for all \(x\in \mathbf{X} \) and \(C_j(t)\) with \(x \notin C_j(t)\), are only applicable when these O(nk) lower bounds are maintained, which can be problematic for the memory usage when nk is large. That is, whenever Eq. (2) is applied, the space complexity may become a bottleneck.

Algorithm 1 presents the pseudocode of our optimized algorithms. After the initialization (Line 3), the clustering procedure keeps repeating until the process converges, i.e., all centers stop changing. If Eq. (2) is not used in the algorithm (in Line 27), we can skip the maintenance of the lower bounds in Lines 13, 18, and 35. The pseudocode consists of two procedures, one for the initialization when t is 0 (i.e., Line 8 to Line 13) and one for the \(t'\leftarrow (t+1)\)-th iteration (i.e., Line 14 to Line 35). We focus our explanation on the latter procedure.

Line 15 updates each of the k centers by calculating the Euclidean mean value of the points assigned to the cluster in the previous iteration. Line 16 calculates different distances between different centers in the last iteration t and in this iteration \(t'=t+1\). Line 17 updates the upper bound of the distance from x to its shifted center \(c_i(t+1)\) by applying a triangle inequality. The time complexity of the above steps is \(O((n+k^2)d)\) and the space complexity is \(O(n+k^2+kd)\).

Moreover, Line 18 updates the lower bounds of the distance from x to the other centers, i.e., those with \(x \notin C_j(t)\), using a triangle inequality if necessary. Line 18 requires O(nk) space and time complexity.

For the simplicity of presentation, we use an auxiliary set \(C_i(t')\) which is initialized as \(C_i(t)\) in Line 19 for every \(i=1,\ldots ,k\). We then go through each center i in the loop described between Lines 20 and 35. Line 21 defines a set \(\text{ Set}_i\) based on Corollary 1. That is, it is guaranteed that, for any \(j \notin \text{ Set}_i\), there is no possibility that a data point in \(C_i(t)\) is classified into \(C_j(t+1)\). Line 21 requires O(k) time/space complexity, provided that \(UB_i(t)\) is always maintained. For each \(j \in \text{ Set}_i\), there are two possibilities in the presented algorithms between Line 26 and Line 30:
  • We can apply Eq. (1), Eq. (2), and Eq. (5). For a given x and j, each of them takes O(1) time/space complexity. Line 29 takes O(d) time complexity. However, this requires the lower bounds maintained in Lines 13, 18, and 35. We denote this option as filtering-based Elkan, FB-Elkan.

  • We can apply Eq. (1), Eq. (5), and Eq. (6). For a given x and j, this takes O(1) time/space complexity. Line 29 takes O(d) time complexity. This combination does not require the lower bounds maintained in Lines 13, 18, and 35. We denote this option as memory-optimized Elkan, MO-Elkan.

In the pseudo-code, for the simplicity of presentation, we use an auxiliary set Temp to store the indexes of the possible new centers for a data point x, which are maintained in Line 24 and Line 30. The data point x is assigned to the closest center in Lines 31 to 35. The time complexity between Line 25 and Line 35 is \(O(|\text{ Set}_i|d) = O(kd)\) and the space complexity is O(k). Please note that Temp is just introduced for better readability in the pseudocode. A simple implementation regarding Temp can directly calculate and store the closest index \(j^*\) on the fly using a buffer (instead of calculating the distance again in Line 32).

With the above discussion, we have the following conclusion for one iteration when \(n \ge k\) and \(t \ge 1\):
  • FB-Elkan: time complexity O(nkd) and space complexity \(O(nk+kd)\).

  • MO-Elkan: time complexity O(nkd) and space complexity \(O(n+k^2+kd)\).

We note that the above time complexity analysis is asymptotic and does not reflect the actual run-time efficiency of these two algorithms. Moreover, the lower bound in Elkan’s k-means algorithm, i.e., Eq. (2), is usually stronger than Eq. (6). Therefore, if the space complexity is affordable, using Eq. (2) is more run-time efficient than using Eq. (6), as further explained in Sect. 6.
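
Since Algorithm 1 is only referenced above, the following is a coarse, self-contained sketch of how one iteration could combine the cluster-level filter (Corollary 1) with the point-level tests, switching between an FB-Elkan-like mode (lower bounds kept) and an MO-Elkan-like mode (no lower bounds). It is a simplified illustration under these assumptions, not a line-by-line rendering of the paper’s Algorithm 1; all names are illustrative.

```python
import numpy as np

def optimized_elkan_iteration(X, labels, centers, ub, lb=None):
    """One coarse iteration in the spirit of Sect. 5 (a simplified sketch, not Algorithm 1).

    If lb is an (n, k) array, Eq. (2) is also used, as in FB-Elkan;
    if lb is None, only Eq. (1), (5), and (6) are used, as in MO-Elkan.
    """
    n = X.shape[0]
    k = centers.shape[0]
    # Update step: recompute centers from the previous assignment.
    new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
    shift = np.linalg.norm(new_centers - centers, axis=1)                         # delta(c_j(t), c_j(t+1))
    cc_old = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)          # delta(c_a(t),   c_b(t))
    cc_new = np.linalg.norm(new_centers[:, None] - new_centers[None, :], axis=2)  # delta(c_a(t+1), c_b(t+1))
    cc_mix = np.linalg.norm(centers[:, None] - new_centers[None, :], axis=2)      # delta(c_a(t),   c_b(t+1))

    ub_old = ub.copy()                        # ub(x, c_i(t))
    ub = ub + shift[labels]                   # ub(x, c_i(t+1)), drifted by the triangle inequality
    if lb is not None:
        lb = np.maximum(lb - shift[None, :], 0.0)

    # UB_i(t) per cluster, needed for the Corollary 1 prefilter.
    UB = np.array([ub_old[labels == i].max() if np.any(labels == i) else 0.0
                   for i in range(k)])

    for x in range(n):
        i = labels[x]
        candidates = []                       # the set "Temp" of possibly closer centers
        for j in range(k):
            if j == i:
                continue
            # Corollary 1: no point of C_i(t) can move to c_j(t+1).
            if UB[i] < 0.5 * cc_old[i, j] and 0.5 * cc_old[i, j] + shift[i] <= 0.5 * cc_new[i, j]:
                continue
            if ub[x] < 0.5 * cc_new[i, j]:                              # Eq. (1)
                continue
            if ub[x] <= cc_mix[i, j] - ub_old[x]:                       # Eq. (5)
                continue
            if lb is not None:
                if ub[x] <= lb[x, j]:                                   # Eq. (2), drifted lower bound
                    continue
            elif ub[x] <= cc_old[i, j] - ub_old[x] - shift[j]:          # Eq. (6)
                continue
            candidates.append(j)
        if candidates:
            # Some centers survived all filters: fall back to exact distances.
            best = np.linalg.norm(X[x] - new_centers[i])
            ub[x] = best                                                # tighten the upper bound
            for j in candidates:
                d_j = np.linalg.norm(X[x] - new_centers[j])
                if lb is not None:
                    lb[x, j] = d_j                                      # exact distance is a valid lower bound
                if d_j < best:
                    best, labels[x] = d_j, j
                    ub[x] = d_j
    return labels, new_centers, ub, lb
```

Since every filtered-out center is provably no closer to x than its current center, comparing exact distances over the surviving candidates yields the same assignment as Lloyd’s algorithm in this sketch.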

6 Evaluation and Discussion

In this section, we first present our evaluation setup. Afterwards, we present the evaluation results in terms of normalized speed-up. Specifically, we show the scalability of MO-Elkan on datasets with large n, i.e., SUSY and HIGGS. Please note that the ns-bounds provided in [13] could also be included in our algorithms, but we decided not to involve them here due to the page limit.

6.1 Evaluation Setup

We compared two optimized Elkan’s k-means algorithms with the original Elkan’s k-means algorithm (denoted as Elkan)  [8]: FB-Elkan represents the combination of Eq. (1), Eq. (2), and Eq. (5) in Algorithm 1. MO-Elkan represents the combination of Eq. (1), Eq. (5), and Eq. (6) in Algorithm 1. The presented speed-up factors are all normalized according to Elkan. If the normalized value is greater than 1, the considered algorithm is faster than Elkan. Otherwise, it is slower than Elkan.

To evaluate the runtime efficiency, we considered several datasets from the following repositories: the UCI machine learning repository [2], clustering basic datasets [9], and LIBSVM [4]. To show the scalability of MO-Elkan, we specifically consider two additional datasets, i.e., SUSY (\(n=5\)M) and HIGGS (\(n=11\)M). For each dataset (excluding SUSY and HIGGS), we performed 50 test runs3 for each \(k\in \{10, 50, 100, 500\}\) and calculated the variance to show how spread out the measured results are. Although most datasets come with a given number of classes, which could be used as a natural choice of k, we test over various k to demonstrate the computational performance. All tested algorithms were executed under the same initialization via k-means++ [1], and the clustering results of all algorithms were eventually the same, as expected. All approaches were implemented in the same programming language and executed on the same machine, i.e., an Intel Core i7-8550U at 1.8 GHz with 16 GB RAM.
Table 1.

Speed-up normalized to Elkan and variances with high-dimensional datasets. For the simplicity of the presentation, the shown variance is set to 0 if the calculated value is less than \(10^{-4}\).

| Dataset | n | d | k | MO-Elkan | Variance | FB-Elkan | Variance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Covtype | 150000 | 54 | 10 | 0.33 | 0.006 | 1.14 | 0.0004 |
| | | | 50 | 0.44 | 0.49 | 1.18 | 0.04 |
| | | | 100 | 0.56 | 0.41 | 1.19 | 0.23 |
| | | | 500 | 0.67 | 4.07 | 0.70 | 1.23 |
| KDDcup98 | 95412 | 56 | 10 | 0.32 | 0.008 | 1.40 | 0.004 |
| | | | 50 | 0.39 | 0.12 | 1.18 | 0.017 |
| | | | 100 | 0.49 | 1.28 | 1.09 | 0.22 |
| | | | 500 | 0.53 | 6.33 | 0.83 | 0.25 |
| KDDcup04 | 145751 | 74 | 10 | 0.26 | 0.43 | 1.05 | 0.005 |
| | | | 50 | 0.25 | 19.81 | 1.10 | 0.60 |
| | | | 100 | 0.24 | 200.13 | 1.09 | 2.35 |
| | | | 500 | 0.18 | 339.76 | 1.17 | 8.71 |
| Gassensor | 14000 | 128 | 10 | 0.36 | 0 | 1.25 | 0 |
| | | | 50 | 0.37 | 0.002 | 1.31 | 0.0003 |
| | | | 100 | 0.43 | 0.002 | 1.22 | 0.0009 |
| | | | 500 | 0.47 | 0.049 | 1.06 | 0.07 |
| Usps | 7291 | 256 | 10 | 0.33 | 0.003 | 1.11 | 0.0003 |
| | | | 50 | 0.27 | 0.05 | 1.48 | 0.002 |
| | | | 100 | 0.39 | 0.12 | 1.69 | 0.015 |
| | | | 500 | 0.56 | 2.28 | 1.28 | 0.68 |
| MNIST784 | 60000 | 784 | 10 | 0.57 | 0.088 | 1.19 | 0.003 |
| | | | 50 | 0.3 | 1.36 | 1.28 | 0.15 |
| | | | 100 | 0.43 | 9.69 | 1.10 | 0.13 |
| | | | 500 | 0.45 | 15.73 | 1.38 | 1.23 |

Table 2.

Speed-up normalized to Elkan and variances with low-dimensional datasets. For the simplicity of the presentation, the shown variance is set to 0 if the calculated value is less than \(10^{-4}\).

| Dataset | n | d | k | MO-Elkan | Variance | FB-Elkan | Variance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| birth | 100000 | 2 | 10 | 1.03 | 0.0001 | 0.94 | 0 |
| | | | 50 | 1.64 | 0.0006 | 0.92 | 0.011 |
| | | | 100 | 1.90 | 0.018 | 0.90 | 0.0006 |
| | | | 500 | 1.69 | 0.038 | 0.89 | 0.008 |
| skin_noneskin | 245057 | 3 | 10 | 0.95 | 0 | 0.90 | 0 |
| | | | 50 | 1.42 | 0.002 | 0.92 | 0.002 |
| | | | 100 | 1.43 | 0.0036 | 0.93 | 0.0036 |
| | | | 500 | 1.49 | 0.61 | 0.95 | 0.078 |
| 3D_spatial_network | 434874 | 4 | 10 | 1.36 | 0 | 0.91 | 0 |
| | | | 50 | 2.30 | 0 | 0.86 | 0 |
| | | | 100 | 2.48 | 0.001 | 0.91 | 0.004 |
| | | | 500 | 1.27 | 0.005 | 0.93 | 0.001 |

6.2 Runtime Efficiency Evaluation

With the high-dimensional datasets (see Table 1), FB-Elkan mostly outperforms the others and achieves a speed-up of up to 1.69x. The variances also tend to grow with k for each dataset. However, when the number of clusters k is as large as 500, we observe that the benefit of the filtering routines, i.e., avoiding unnecessary distance calculations, is mitigated by the overhead of calculating the filtering bounds. For the Covtype dataset, the additional time for calculating Eq. (5) increases from \(11\%\) to over \(20\%\) when k increases from 50 to 500, whereas the original Elkan’s k-means algorithm has no such overhead.

For the low-dimensional datasets (see Table 2), the variance of the measured results is almost negligible. Moreover, MO-Elkan reaches a speed-up of up to 2.48x, whereas FB-Elkan performs slightly worse than Elkan. In fact, the overhead of checking the additional filtering bounds in FB-Elkan is higher than the benefit of filtering unnecessary distance calculations. For a similar reason, MO-Elkan requires fewer memory accesses for the filtering bounds and is therefore faster than Elkan on such datasets.

6.3 Scalability Evaluation

In order to demonstrate the improvement in scalability, we specifically evaluated Elkan, FB-Elkan, and MO-Elkan with two additional datasets with large n, i.e., SUSY (\(n=5\)M) and HIGGS (\(n=11\)M). We tested over different numbers of clusters k, where \(k \in \{5, 10, 50, 100, 500\}\), and report the normalized speed-up factor. In cases where Elkan halted due to running out of memory, we mark the corresponding entry with “v” if FB-Elkan or MO-Elkan could be successfully executed to completion. Otherwise, if FB-Elkan or MO-Elkan also halted, the corresponding entry is marked with “-”. As shown in Table 3, Elkan and FB-Elkan essentially outperform MO-Elkan when their required memory footprints are affordable. When the number of data points n multiplied by k becomes large, e.g., SUSY with \(k = 500\) or HIGGS with \(k\ge 100\), the required memory footprints clearly become a critical issue, whereas MO-Elkan can still finish the k-means clustering. We note that the memory footprint of MO-Elkan was mainly dominated by the number of data points, and its increase with respect to k was tolerable, i.e., \(\simeq 1.44\) GB for SUSY and \(\simeq 2.57\) GB for HIGGS. However, Elkan required 4.608 GB for SUSY with \(k=100\) and 6.72 GB for HIGGS with \(k=50\), and FB-Elkan required slightly more than Elkan.
Table 3.

Speed-up normalized to Elkan with large n datasets.

| Dataset | n | d | k | MO-Elkan | FB-Elkan |
| --- | --- | --- | --- | --- | --- |
| SUSY | 5M | 18 | 5 | 0.19 | 1.17 |
| | | | 10 | 0.15 | 1.05 |
| | | | 50 | 0.20 | 0.96 |
| | | | 100 | 0.28 | 0.95 |
| | | | 500 | v | - |
| HIGGS | 11M | 28 | 5 | 0.14 | 1.10 |
| | | | 10 | 0.11 | 1.08 |
| | | | 50 | 0.07 | 1.10 |
| | | | 100 | v | - |
| | | | 500 | v | - |

7 Conclusion and Outlook

In this paper, we present new filtering bounds to optimize Elkan’s k-means algorithm. Specifically, two different combinations of the proposed bounds are used to either filter more unnecessary distance calculations (FB-Elkan) or reduce the space complexity (MO-Elkan), improving the scalability of the original Elkan’s k-means algorithm. Through extensive evaluations with several real-world datasets, we conclude that FB-Elkan improves the runtime efficiency of Elkan for high-dimensional datasets, and that MO-Elkan outperforms the others for low-dimensional datasets while improving the scalability of Elkan, i.e., its memory footprint is mainly dominated by the number of data points.

In future work, we plan to integrate the proposed filtering bounds into other bound-based accelerated k-means algorithms. For example, an integration with the Fission-Fusion k-means algorithm [19] may additionally refine the bounds not only for each data point but also for each cluster. Integrating our bounds with the Yinyang [5] and fast Yinyang k-means algorithms [3] can also be expected to greatly reduce the computation time of distance calculations.

Footnotes

  1. The O(nd) space complexity of the input points is ignored in our complexity analysis.

  2. In fact, Elkan’s k-means algorithm using the ns-bounds derived from the norm of a sum in [13] sometimes outperforms the original Elkan’s k-means algorithm.

  3. Due to the amount of time required for each test, this is the number of runs we could reach for all setups to fairly demonstrate the statistical significance of the differences.


Acknowledgement

We thank our colleague Mr. Mikail Yayla for his precious comments at early stages. This paper has been supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), as part of the Collaborative Research Center (SFB 876), “Providing Information by Resource-Constrained Analysis” (project number 124020371), project A1 (http://sfb876.tu-dortmund.de).

References

  1. Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
  2. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
  3. Bottesch, T., Bühler, T., Kächele, M.: Speeding up k-means by approximating Euclidean distances via block vectors. In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, vol. 48, pp. 2578–2586. JMLR.org (2016)
  4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
  5. Ding, Y., Zhao, Y., Shen, X., Musuvathi, M., Mytkowicz, T.: Yinyang k-means: a drop-in replacement of the classic k-means with consistent speedup. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, vol. 37, pp. 579–587. JMLR.org (2015)
  6. Drake, J.: Faster k-means clustering. Master's thesis, Baylor University (2013)
  7. Drake, J., Hamerly, G.: Accelerated k-means with adaptive distance bounds. In: 5th NIPS Workshop on Optimization for Machine Learning (2012)
  8. Elkan, C.: Using the triangle inequality to accelerate k-means. In: Proceedings of the Twentieth International Conference on Machine Learning, ICML 2003, pp. 147–153. AAAI Press (2003)
  9. Fränti, P., Sieranoja, S.: K-means properties on six clustering benchmark datasets (2018). http://cs.uef.fi/sipu/datasets/
  10. Hamerly, G.: Making k-means even faster. In: SDM, pp. 130–140 (2010)
  11. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24, 881–892 (2002)
  12. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
  13. Newling, J., Fleuret, F.: Fast k-means with accurate bounds. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, pp. 936–944. New York, USA (2016)
  14. Pelleg, D., Moore, A.: Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 1999, pp. 277–281. Association for Computing Machinery, New York (1999). https://doi.org/10.1145/312129.312248
  15. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
  16. Ryšavý, P., Hamerly, G.: Geometric methods to accelerate k-means algorithms. In: Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 324–332 (2016)
  17. Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Association for Computing Machinery, New York (2010)
  18. Wang, J., Wang, J., Ke, Q., Zeng, G., Li, S.: Fast approximate k-means via cluster closures. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3037–3044 (2012)
  19. Yu, Q., Dai, B.-R.: Accelerating k-means by grouping points automatically. In: Bellatreche, L., Chakravarthy, S. (eds.) DaWaK 2017. LNCS, vol. 10440, pp. 199–213. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64283-3_15

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Design Automation for Embedded Systems Group, Department of Computer Science, TU Dortmund, Dortmund, Germany
