Also for k-means: more data does not imply better performance

Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show that more data does actually not necessarily lead to improved performance—worse even, performance can deteriorate. Clustering, however, has not been subjected to such kind of study up to now. This paper shows that k-means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k, the question remains open.


Introduction
k-Means continues to remain among the most-used techniques both in data mining and in machine learning. Next to being employed for analysis, k-means is also actively researched and further developed as such. Among recent work, for instance, we find that Stemmer (2021) studied k-means in the differentially private setting, Klochkov et al. (2021) developed a robustified version and report non-asymptotic bounds, Liu (2021) provided improved learning bounds, Ghadiri et al. (2021) came up with a fair k-means, while Cohen-Addad et al. (2021) developed and analyzed an online version. The, largely theoretical, study that we provide in this work concerns the learning curve behavior of k-means.
Learning curves (or, alternatively, sample complexity curves (Zubek and Plewczynski 2016)) consider, on a specific learning problem, the average performance of a learner against the sample size of the training set (Perlich, 2010;. At least since the work by Vallet et al. (1989), it is known that such curves can exhibit counterintuitive behavior and show deteriorating performance with increasing training set sizes, i.e., these learners can display so-called nonmonotonic behavior .
The investigation of such a, arguably, problematic property  has found renewed attention in the wake of the important work by Belkin et al. (2019). They raise various issues concerning classical learning theory in the light of modern-day overparameterized learners. The phenomenon for which this work seems to be cited primarily, however, was essentially described by Vallet et al. (1989) already (see also Loog et al., 2020). Following Belkin et al. (2019), it is nowadays referred to as double descent, also in the learning curve setting: the curve starts off as expected and improves with increasing numbers of training samples, then its performance starts to deteriorate over a consecutive number of training sizes, following which it again gets to improved performance with more training data.
There is a gamut of surprising and curious learning curve behaviors next to double descent. One of the more peculiar ones is the zigzag curve that least absolute deviation regression can display and which has been explained recently by Chen et al. (2023). Loog and Viering (2022) provide a complete overview of the present state of affairs. Our current work is, more specifically, in line with the findings from Loog et al. (2019). There it is shown that learners that rely on empirical risk minimization (ERM), being at the basis of many learners these days, can act nonmonotonically no matter the training sample size. Stated differently, with some sense of drama, Loog et al. (2019) show that even if, during the training phase, we optimize for the loss that we are also using during the test phase, 1 test-time performance can still deteriorate in expectation. The original work showed the emergence of such behavior for classification, regression, and density estimation.

Contributions and outline
In this work, we show, in a precise sense, that k-means clustering is not devoid of such quirks either. Our contribution is limited in the sense that we merely manage to prove nonmonotonicity for k = 2 . Nevertheless, this is significant, firstly, for the fact that we can at all prove that something like this happens for k-means and, secondly, because it demonstrates that nonmonotonicity extends to settings beyond classification, regression, and density estimation.
The next section goes through some preliminaries and provides further related work. It primarily puts k-means in the context of empirical risk minimization and introduces the precise notion of monotonicity. Section 3 formulates our two main results: for k = 1 the learning curve of k-means always behaves properly, i.e., it improves with more data, but for k = 2 its behavior can be problematic. More precisely, we show that no matter the size of the training set, there are clustering problems for which the 2-means performance still becomes worse with even more training data. While the proof for the former result is provided in the same section, all of Sect. 4 revolves around the proof of the second result. Section 5 discusses and concludes our work.

Preliminaries and additional related work
We formulate k-means clustering within the framework of empirical risk minimization (ERM) and touch upon some further relevant literature in this context. In addition, we make precise the notion of monotonicity that we are going to employ. Regarding the latter, we largely follow the notation and definitions as proposed in Loog et al., 2019).

Empirical risk minimization and k-means
X be a training set of size N. This is, in our k-means setting, an i.i.d. sample from a distribution D over the standard d-dimensional feature space X = d ℝ . In addition, we have as hypothesis class the set of sets of k means Note that, as every m ∈ M k is an actual set, their cardinality |m| is smaller than k in case m i = m j for one or more pairs i ≠ j . That is, redundant means are discarded.
The particular loss function that we consider in our case could be termed the clusterwise or group-wise squared loss. We, however, go with within-group squared (WGS) loss, inspired by the term within-group sum of squares that Hartigan (1978) considers in the context of k-means: The ultimate objective is to minimize the expected loss, i.e., the risk: (2) As we do not know the actual underlying distribution D, the principle of ERM (Vapnik, 1982) suggests the learner to rely on its empirical distribution, defined by the training sample of size N, and to consider the loss on this distribution: This empirical risk is then equivalent to Hartigan (1978)'s within-group sum of squares.
We can now define the following: Definition 1 (k-means clustering) k-means clustering is the learner A k that maps from the set of all samples S ∶= ⋃ ∞ i=1 i X to the hypothesis class M k , i.e., A k ∶ S → M k , according to the optimality assignment The first minimizers in our definition picks out all mean sets that minimize the empirical risk. A(S N ) can indeed be a set of solutions, in particular when N < k . The second minimizer then makes sure that we obtain a solution set of minimal cardinality.
The ERM view on k-means that we consider is equivalent to the formulation that Pollard (1981) provides. Different formulations are possible, for instance, where ones does not consider the squared distance to the closest means, but where one looks for a partitioning of the space in optimal regions. The latter can be found, for example, in the works by Dalenius (1950), MacQueen (1967, Ben-David et al. (2006). The former "center-based" formulation can also be found in Rakhlin (2005), Rakhlin and Caponnetto (2006) and Ben-David (2004) (under the name vector quantization problem). Buhmann (1998) provides a slightly different ERM setting and normalizes the influence of every cluster with its size. Bock (2007) relates different center-based and partitioning based approaches.

Idealized and practical k-means
Note that the specific k-means that we consider is, in some sense, idealized, because we actually assume that it minimizes the empirical risk globally, which is known to be an NP-hard problem (Dasgupta, 2008;Aloise et al., 2009).
For this reason, in practice, one often needs to resort to suboptimal approaches when carrying out the optimization in Eq. (5). This is where the well-known alternating optimization method of assigned points to means, then updating the means, and repeating this process comes in (Steinhaus, 1956;Jancey 1966;Bock 2007). Often, k-means is actually associated with exactly this algorithm, which we expressly do not do in this paper.
The successful k-means++ algorithm (Arthur & Vassilvitskii, 2007) has been shown to provide a reasonable solution to the k-means problem, despite it being NP-hard. More specifically, it is guaranteed to achieve a WGS risk in polynomial time that is, in expectation, no worse than log k + 2 times the optimal. It should be noted that k-means++ considers the WGS risk of the training set, while we are primarily interested in the expected loss of k-means on the full problem distribution. |m|.

Monotonic or smart?
We note that more than 20 years before Loog et al. (2019), Devroye et al. (1996) already suggested the notion of smart rules, which refers to classifiers that show nonincreasing expected error rates with increasing training set sizes. The terms smart and smartness, may be better choices than the terms monotonic and monotonicity. The latter term, as it is used originally, does of course not distinguish between increasing or decreasing curves. What we are after, in particular, is a nonincreasing curve in terms of the expected risk. As we evaluate in terms of the same loss as the one that is being optimized during training, this risk is not the error rate in our setting, but equals the expected within-group squared loss from Eq. (3).

Monotonicity
The behavior we are interested in is that we get better, or at least not worse, test performance when having more data to train on. In particular, we want k-means to not perform worse with increasing N in terms of the expected within-group squared loss as given by Eq.
(3). As adding a single very bad sample can always ruin performance, it is reasonable to merely ask for such performance non-deterioration in expectation, i.e., over all possible samples S N and S N+1 of size N and N + 1 , respectively. A basic initial definition is therefore the following.

Definition 2 (local monotonicity)
The two entities we would like to get rid of in Definition 2 are N and D. The former, because we would like our learner to act monotonically irrespective of the sample size. The latter, because we typically do not know the underlying distribution. What we do know, however, is in which domain we are operating, which is ℝ d for our k-means. Therefore, employing the difference Δ N+1 N as defined in Eq. (6), the following is appropriate.
Note that the above definition is a refinement of the original from Loog et al., 2019). In particular, we added the statement on the existence of the difference in expected risks to make sure that the learner is ℝ d -monotonic even though on some distributions the necessary integrals may not exist or the difference in itself is problematic. Such issues arise, for instance, when both expectations in the difference evaluate to infinity and Δ N+1 N would become ∞ − ∞.
The double-descent phenomenon that we covered earlier shows that learning curves can have some difficulties being monotonous from the start. The second best thing to hope for is that a learner becomes monotonic after some sample size, which leads to a weak form of monotonicity.
If the n from Definition 4 can be set to 1, the learner is called globally ℝ d -monotonic. Meek et al. (2002) were possibly the first to consider learning curves for clustering techniques in an application setting. In particular, they studied mixtures of Gaussians, but the notion can be transferred readily to the setting of k-means as seen in what follows. Interestingly, apart from Meek et al. (2002), there seems little additional learning curve work.

Learning curves and bounds
There is more work available on theoretical bounds for k-means. These report nonasymptotic bounds on the excess risk-or excess distortion as it is also called in this setting. Some of the more recent works are (Levrard, 2015;Maurer, 2016;Chichignoud & Loustau, 2014;Levrard, 2013;Biau et al., 2008). Earlier mentioned (Klochkov et al., 2021) is one of the recent additions, which, alternatively, studies a robust version of k-means. Their bounds show the typical 1/N or √ 1∕N power-law behavior in terms of the training sample size N. Clearly, as these are bounds, they do not necessarily say a lot about the actual (local) behavior of the learning curve.

Main theoretical results
Having laid the groundwork in the previous section, we can now come to our main results.
The proof follows in Sect. 3.1. It is fairly uninvolved-certainly compared to the proof of the next theorem. Nevertheless, the result is there both for completeness and to contrast it with what happens when k = 2 , in which case the behavior changes notably.

Theorem 2 For
The proof of this result is relatively involved and could be considered a bit cumbersome. We therefore dedicate a separate section to it, which is Sect. 4.

Proof of Theorem 1
Proof It is easy to check that the minimizing hypothesis of the empirical risk A 1 (S N ) is attained for the mean: x i , as we are dealing with only one cluster. Let now denote the expectation both over the training sample S N ∼ D N and over the test sample x ∼ D , where this latter expectation comes from the risk R D expressed by Eq. 3.
With the true mean of the distribution D, we can write the following for the expected within-cluster loss for sample size N: Only the last term matters as it is the only part depending on the training set size. We have Now, for Δ N+1 N (as defined in Eq. (6)) to exist, � ‖x − ‖ 2 � needs to be finite. In that case, as � ‖x − ‖ 2 � is nonnegative, we have that for any N and so A 1 is globally monotonic. ◻

Proof of Theorem 2
At a high level, the proof is fairly straightforward. We explicitly construct a class of distributions that has two free parameters and we show that by having those parameters depend on N in the right way, we can always make the learning curve go up when going from a sample of size N to one of size N + 1 on that same distribution, i.e., Δ N+1 N < 0 . As we can construct such a distribution for any N ≥ 14 (so the distribution is allowed to be different for every (N, N + 1)-pair), A 2 is not locally ℝ d -monotonic for any N ≥ 14 . As such, A 2 is also not weakly monotonic for any n ∈ ℕ , as there is always an N ≥ n and a corresponding distribution for which Δ N+1 N < 0. The distributions that we construct for this consist simply of three point masses in one-dimensional feature space ℝ . Because we can always embed this one-dimensional problem in ℝ d for any d ∈ ℕ , we can limit ourselves to this specific problem in our proof.
The complication in proving Theorem 2 stems mainly from the fact that we cannot explicitly evaluate the difference Δ N+1 N between two consecutive points on the learning curve in full. We therefore demonstrate that the curve goes up by showing that the difference is strictly negative by bounding it away from zero.
We start the preparations for the proof in the next subsection where we introduce our parameterized three-point problem. In Sect. 4.2, we then formulate and prove six lemmas, which give us different handles on parts of the behavior of the true expected risks. Sect. 4.3 brings it all together and finalizes the proof of Theorem 2.

A parameterized three-point problem and its risk
Definition 6 (three-point problem) This clustering problem, which depends on two parameters c and p, both in the open interval (0, 1), considers three locations in one-dimensional Euclidean space with associated probability mass function P. The specific points are A at −1 , B at the origin, and C at c, while the associated probability masses are P(A) = p and P(B) = P(C) = 1 2 (1 − p).
In addition, let us now introduce the following notation and definitions. Firstly, let the number of training samples from A, B, and C, equal i, j, and k, respectively. Secondly, let X (i, j, k) equal the true loss incurred at point X ∈ {A, B, C} . It is important to note that these three losses are, of course, dependent on the precise counts for i, j, and k, as those determine the hypothesis for that training set. We denote its associated hypothesis from M k by m (i, j, k). A further definition we use is which denotes the true risk given the counts (i, j, k) for the three points A, B, and C. Finally, for a training set of size N, the expected risk for the three-point clustering problem, which we simply denote by E(N) from now on, is equal to where the count for point C equals k = N − i − j.

Six preparatory lemmas
The six lemmas presented in this section provide four different types of results. To start with, Lemma 1 describes a specific situation in which a minimizer for our three-point problem can be identified easily and uniquely. Lemmas 2 and 3 provide simplifications of some of the expressions involving binomials an risks that we will encounter. Lemmas 4 and 5 provide bounds for some of the expressions that appear in the proof of Theorem 2 when considering the difference in expected risk Δ N+1 N at training sample size N and N + 1 under the three-point distribution parameterized by the same c and p. Our ultimate lemma shows a specific one-dimensional function to be negative beyond a certain point. This is merely a technical result, that will ultimately be used to lower-bound the increase in the risk.

is the unique minimizer of the empirical risk for the three-point problem if and only if
Proof First observe that using two rather than one mean, will always improve the empirical risk in the setting considered. Secondly, assigning points observed at the same location to two different means can never reduce the minimum risk. Thirdly, associating one mean with the middle location B and the other with the observations on both sides at A and C cannot be optimal. All in all, the optimal solution should either associate the one mean with A and the other with B and C or the one with A and B and the other with C. The mean associated with one point should, of course, be exactly that point. The other mean, in order to minimize the squared loss, is the weighted average of the two observed locations. As a result, either the hypothesis as specified in Eq. (12) is the optimizer or − i i+j , c is. Now, with N = i + j + k , the respective empirical risks for these two hypotheses are and One therefore uniquely chooses the first hypothesis in case the value in Eq. (14) is (strictly) smaller than that in Eq. (15), as these are the only two hypotheses that we need to consider. It is easy to see that this holds if and only if c 2 k(i + j) < i(j + k) . ◻

Lemma 2
For the three-point problem from Definition 6 and any N ≥ 1,

which is the contribution to the expected risk of all training sets that do not contain samples at A.
Proof When j = 0 or j = N , the hypotheses minimizing the empirical risk are in fact single means at c and 0, respectively. In these cases, we have For 1 ≤ j ≤ N − 1 , both B and C are in the training set and the minimizer equals m(0, j, k) = {0, c} . The associated true risk is therefore R(0, j, k) = p . Working out the summation now leads to Lemma 3 Consider the three-point problem from Definition 6 and assume c 2 < 4(N−1) The left-hand side is quadratic in i and j and takes its maximum for any i + j = N 2 . The right-hand side is quadratic and takes its minimum in i = 1 and i = N − 1 . Therefore, under the assumptions that c 2 < 4(N−1) N 2 , we indeed have that Now, the optimal hypothesis −1, c k j+k has corresponding true risk and we find the desired identity by working the sum on the left-hand side of Eq. (20).
The next equality is established with the use of the identities ∑ n k=0 � n k � = 2 n , ∑ n k=0 � n k � k = n2 n−1 , and ∑ n k=0 � n k � k 2 = (n 2 + n)2 n−2 . All of these can be readily derived from the binomial theorem: the first sum is standard, the other two identities are reported in Gould (2010) as Eq.s (1.70) and (1.67), respectively.

Lemma 4 Given
Proof To start with, observe that the term 1 − N+7 N+6 (1 − p) equals 0 when p takes on its supremum 1 N+7 . As soon as p becomes smaller than 1 N+7 , the term becomes strictly negative and we have the second, strict inequality.
To show the first inequality, note initially that we can rewrite it into the equivalent requirement that Demonstrating this for i = 1 comes down to showing that To see that this holds, multiply left and right by N 3 (N + 6) and reorganize terms to come to 5N 2 − 7N − 6 ≥ 0 . Equality is attained for N = − 3 5 and N = 2 , and so the inequality indeed holds for all N ≥ 2 , as we are dealing with a convex quadratic equation. (24) N i To show now that the inequality holds for any 1 ≤ i ≤ N − 1 , consider the derivative to i: This only becomes zero, when i 2 − 2(N + 1)i + N 2 + 2N − 2 = 0 , which only happens 3 . Both these solutions are larger than N − 1 . Therefore, on the whole interval [1, N − 1] , the derivative is positive, is strictly increasing over that same domain, and we have that for all i ∈ [1, N − 1] in general and for 1 ≤ i ≤ N − 1 in particular. ◻

Lemma 5
For the three-point problem from Definition 6, any N ≥ 2 , and any p ≤ 1 2 : Proof The term R(N, 0, 0) gives the risk in case the training set only contains samples from location A, which means we have one mean at −1 and therefore Using this term together with Eq. (16) from Lemma 2, we can rewrite the part of Eq. (29) on the left-hand side of the inequality, as For the first term in the sum of the right-hand side, as p+1 2 ≤ 1 , we have the following upper bound.
For the third term, we have d di (33) 1 2 (c 2 − 2c + 2) ≤ 1 2 2 = 1 because c ∈ (0, 1) and the quadratic function takes on its supremum at c = 0 . In addition, Proof For N > 1 , the denominator 2(N + 6)(N − 1) is positive. Multiplying all terms with this term and simplifying leads to The logarithm of the expression 16N(N − 1)(N + 6)(4N + 1)2 −N is concave for N > 1 and so we can upper-bound that log-expression by a linear function. In particular, based on the first Taylor polynomial (or tangent) at N = 14 , we may write: As the coefficient 45299 157092 − log(2) for the linear term in the latter part of the inequality is negative, we can fill in N = 14 to obtain a value that upper-bounds the expression in Eq. (37) for all N ≥ 13 . The exponent of this value, which equals 25935 128 , we can then use to upper-bound Eq. (36) for N ≥ 14: This last quadratic upper bound is concave and takes on its maximum value at N = 11 4 , which is smaller than 14. Therefore, the value of the upper bound in Eq. (38) at 14, provides an upper bound for 25935 128 − 2N 2 + 11N − 5 for all N ≥ 14 . Its value is 25935 128 − 2 ⋅ 4 2 + 11 ⋅ 4 − 5 = − 5169 128 < 0 and, as 2(N + 6)(N − 1) is positive in that case, we also have that the left-hand side of Eq. (35) is strictly smaller than 0 for all N ≥ 14 . ◻

Finalizing the proof
Proof of Theorem 2 To start with, that A 2 is not weakly monotonic for any n ∈ ℕ follows readily from the first part of the theorem. If for every integer N ≥ 14 , there exists a distribution D such that Δ N+1 N < 0 then there is always a distribution and an N ≥ n for which The remainder of the proof therefore focuses on demonstrating the first claim from the theorem.
Without loss of generality, we can limit our attention to one-dimensional ℝ , as we can always embed a problem from that space in ℝ d . As such, nonmonotonicity of A 2 in ℝ carries over to ℝ d . Let us therefore merely consider the one-dimensional problem from Definition 6. Moreover, take N ≥ 14 and take c = 2 N , in which case c 2 is strictly smaller than 4(N−1) N 2 . Additionally, take p = 1 4N 2 , which is strictly smaller than 1 N+7 for N ≥ 14 . These choices make sure that all six lemmas hold and, in addition, determines a specific three-point distribution D for every N. We now show that the learning curve increases on this D when going from N to N + 1 training samples.
Following Lemma 2 and Lemma 3, we can write Subsequently, let us consider the difference Δ N+1 N = E(N) − E(N + 1) , where both expectations are taken with respect to the same underlying distribution D. This difference needs to be smaller than 0 to show that A 2 is not monotonic on the problem defined by D when going from N to N + 1.
Using Lemmas 4 and 5, we find that where the last inequality holds because all terms in the summation are smaller than 0, so removing those for i ∈ {2, … , N} only increases the value. In our next step, we fill in our choices for c and p in the above inequality and simplifying the expression. This gives us the following bound on the change in expected risk on D:

Discussion and conclusion
Having shown that 1-means is monotonic, while 2-means is not even weakly monotonic, the obvious question that comes up is what we can say about k > 2 . Our current conviction is that we can probably design problematic cases, similar to the one for 2-means, for k > 2 as well. We have, however, not been able to do so up until now. A first step could be to empirically show that, say, for 3-means there are at all distributions where nonmonotonicity occurs. It may only then be sensible to look for a proof showing that it is not weakly monotonic. Incidentally, note that for 3-means, we need at least a four-point problem, which may point at an even more involved proof for this case. The current proof seems to strongly hinge on the discreteness of the chosen three-point distributions from Definition 6. We conjecture, however, that a similar proof can be constructed on the basis of a class of continuous distributions, rather than discrete ones. Our current idea is that every one of the three discrete locations A, B, and C can probably be replaced by a narrow enough uniform distribution around that point such that the crucial steps in our proof still go through. We do admit, however, that it may be rather nontrivial to precisely reformulate all arguments. Nonetheless, we do believe that the distribution's discreteness is not essential.
Two further research direction that, we think, are of interest are, firstly, whether k-means can be turned into a monotonic learner and, secondly, what we can say about the learning curve behavior of closely related Gaussian mixture models (Welling & Kurihara, 2006;McLachlan et al., 2019;Lücke & Forster, 2019). The latter are closely linked to k-means in particular when the covariance matrices of the mixture components are assumed to be the identity and the mixture priors are all equal. What we wonder especially is whether moving from the within-group squared loss to the (negative) log-likelihood and/or going from hard to soft assignments of points to clusters provides any benefits when it comes to monotonic behavior.
As for the former research question, there are some wrapper techniques to turn learners monotonic in expectation. Viering et al. (2020) and Bousquet et al. (2022) rely on the 0-1 loss specifically and cannot be applied directly to our setting. On the other hand, Mhammedi (2021)'s wrapper only assumes the loss to be bounded, which is readily fulfilled for k-means in case the support of the distribution is bounded as well. Unfortunately, the proof in (Mhammedi, 2021) turned out to be defective and the original result was updated in a correction mentioned in (Footnote 6 (Mhammedi, 2022)). It now only ensures monotonicity up to an additive term with an N −1 rate. Still, assuming bounded distributional support, Mhammedi (2021) at least gives us some possibility to control the monotonicity of k-means.
In the context of potential wrappers, what is important to note, however, is that these algorithms essentially change the base algorithm. Are we still dealing with k-means now or is it a different learner altogether? More directly related to the idealized version that we consider are the k-means algorithms in practical use (Sect. 2.2 provides some pointers).
From a monotonicity point of view, the fact that these typically do actually not provide a globally optimal clustering could result in a smoother, maybe even monotonous learning curve. In other words, the regularizing effect from the suboptimality of practical k-means optimization could actually promote monotonicity. Regularization can, however, both fix and create nonmonotonicity Nakkiran et al., 2020; and it could be interesting to see how different k-means++ possibly behaves compared to other lesser-optimal algorithms.
All in all, our findings add clustering to the list of potentially nonmonotonic behaving learners, next to some classifiers, regression techniques, and density estimators (cf. Loog et al, 2019). From a theoretical point of view, this is crucial as it provides us, for instance, with a deeper understanding of how learning curves can at all behave. The practitioner may object that "this does not happen on real-world problems." Though we are, in principle, willing to believe this, what proof do we really have? If anything, some recent study by Mohr et al. (2022), on a large number of data sets in combination with various classifiers, showed that nonmonotonic learning curve behavior does occur. The least any practitioner should be is aware that nonmonotonicity can happen.