Machine Learning Friendly Set Version of Johnson-Lindenstrauss Lemma

In this paper, we make a novel use of the Johnson-Lindenstrauss Lemma. The Lemma has an existential form, saying that there exists a JL transformation $f$ of the data points into a lower-dimensional space such that all pairwise distances fall into a predefined error range $\delta$. In this paper we formulate a theorem stating that, for any user-predefined failure probability $\epsilon$, we can choose the target dimensionality in a random-projection-type JL linear transformation in such a way that with probability $1-\epsilon$ all pairwise distances fall into the predefined error range $\delta$. This result is important for applications such as data clustering, where we want an a priori dimensionality-reducing transformation instead of trying out a (large) number of them, as with the traditional Johnson-Lindenstrauss Lemma. In particular, we take a closer look at the $k$-means algorithm and prove that a good solution in the projected space is also a good solution in the original space. Furthermore, under proper assumptions, local optima in the original space are also local optima in the projected space. We also define conditions under which the clusterability property of the original space is transmitted to the projected space, so that special-case algorithms for the original space are also applicable in the projected space.


Introduction
Dimensionality reduction plays an important role in many areas of data processing, and especially in machine learning (cluster analysis, classifier learning, model validation, data visualization, etc.).
Usually, it is associated with manifold learning, that is, the belief that the data in fact lie in a low-dimensional subspace that needs to be identified, onto which the data are then projected, so that the number of degrees of freedom is reduced and, as a consequence, sample sizes can be smaller without loss of reliability. Techniques such as reduced k-means [40], PCA (Principal Component Analysis), Kernel PCA, LLE (Locally Linear Embedding), LEM (Laplacian Eigenmaps), MDS (Metric Multidimensional Scaling), Isomap and SDE (Semidefinite Embedding), just to mention a few, are applied in order to achieve dimensionality reduction.
There exists, however, still another approach to the dimensionality reduction problem, applicable in particular when an intrinsic subspace in which the data are located cannot be identified. The problem of the choice of the subspace has been circumvented by several authors by the so-called random projection, applicable in extremely high-dimensional spaces (tens of thousands of dimensions) and correspondingly large data sets (of at least hundreds of points).

Johnson-Lindenstrauss lemma and dimensionality reduction
The starting point here is the Johnson-Lindenstrauss (JL) Lemma [25]. Roughly speaking, it states that there exists a linear mapping from a higher-dimensional space into a subspace of sufficiently high dimension that approximately preserves the distances between points, as needed, e.g. by the k-means algorithm [5]. In fact, when designing optimal k-means clustering algorithms, the possibility of dimensionality limitation via the Johnson-Lindenstrauss Lemma is assumed, see, e.g. [3].
To be more formal, consider a set 𝒬 = {1, . . . , m} of m objects. An object i ∈ 𝒬 may have a representation x_i ∈ R^n; the set of these representations will be denoted by Q. An object i ∈ 𝒬 may also have a representation x'_i ∈ R^{n'} in a different space; the set of these representations will then be denoted by Q'.
With this notation, let us state:

Theorem 1.1 (Johnson-Lindenstrauss) Let δ ∈ (0, 1/2). Let 𝒬 be a set of m objects and Q a set of points representing them in R^n, and let n' ≥ C ln m / δ², where C is a sufficiently large constant. Then there exists a Lipschitz mapping f : R^n → R^{n'} such that for all u, v ∈ Q

(1 − δ) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + δ) ||u − v||²   (1)

The above formulation is cited after [31]. Larsen and Nelson [28] demonstrated that if the function f has to be linear, then this lower bound on n' is optimal in terms of δ and m.
Other papers propose slightly different formulas for n'; for example, Dasgupta and Gupta [20] prove the bound n' ≥ 4 ln m / (δ²/2 − δ³/3), with a different denominator. But as δ is limited from above by 1/2, it is easily seen that the formula of [20] implies n' ≥ 8 ln m / δ², that is, the original formulation with C = 8. Still another formulation, by Frankl and Maehara [23], implies that C is around 9. Empirical studies [42] suggest that C = 2 is sufficient. Researchers are also interested in extensions of the original JL Lemma. For example, Matousek [31] considers spaces other than Euclidean, e.g. with the ℓ1 norm, or sparse spaces. Achlioptas [1] studies random projections with sparse matrices. Baraniuk et al. [13] extend the JL Lemma to manifolds. Magen [30] explored the preservation of n-dimensional angles when projecting to lower-dimensional spaces, so that volumes can be preserved as well.
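As a numerical illustration, the bounds quoted above can be compared directly; the following is a small sketch (the function names are ours):

```python
import math

def jl_dim_simple(m: int, delta: float, C: float = 8.0) -> int:
    """Lower bound n' >= C ln(m) / delta^2 (the original form, with C = 8)."""
    return math.ceil(C * math.log(m) / delta ** 2)

def jl_dim_dasgupta_gupta(m: int, delta: float) -> int:
    """Dasgupta-Gupta bound n' >= 4 ln(m) / (delta^2/2 - delta^3/3)."""
    return math.ceil(4 * math.log(m) / (delta ** 2 / 2 - delta ** 3 / 3))

# For m = 10000 points and delta = 0.1 the two bounds are close; the
# Dasgupta-Gupta bound is slightly larger, as it implies the C = 8 form.
print(jl_dim_simple(10000, 0.1))
print(jl_dim_dasgupta_gupta(10000, 0.1))
```

Note that both bounds depend only logarithmically on the number of points m, but quadratically on the inverse of the permitted error δ.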
The JL Lemma has diverse applications. It has been used for rounding, embedding into low-dimensional spaces, neural network-based machine learning, information retrieval, compressed sensing, data stream analysis, approximate nearest neighbour search, just to mention a few. Image analysis, in particular motion analysis, is a quite typical domain giving rise to high-dimensional data that can be reduced via JL Lemma (see e.g. [35]).
In this paper, we are particularly interested in applications to cluster analysis. Schulman [36] was already interested in this topic, in particular in optimizing the intracluster sum of weights in graph clustering; he used the JL Lemma to reduce the computational time. Tremblay et al. [41] also exploit the JL Lemma for the task of graph clustering, in a kind of graph sampling approach (in the spectral space).
A number of proofs and applications of Theorem 1.1 have been proposed which in fact do not prove Theorem 1.1 as such but rather create a probabilistic version of it, e.g. [1,4,17,20,24,29]. For an overview of Johnson-Lindenstrauss Lemma variants, see, e.g. [31].
Essentially, the idea behind these probabilistic proofs is as follows: Assume it has been proven that the probability of reconstructing the length of a random vector from a projection onto a subspace, within reasonable error bounds, is high.
One then inverts the thinking and states that the probability of reconstructing the length of a given vector from a projection onto a (uniformly selected) random subspace, within reasonable error bounds, is high.
But uniform sampling of high-dimensional subspaces is a hard task. So n' vectors with random coordinates are sampled instead from the original n-dimensional space and used as a coordinate system in the n'-dimensional subspace, which is a much simpler process. One hopes that the sampled vectors will be (nearly) orthogonal, and hence the coordinate system will be orthogonal, which in the case of vectors with thousands of coordinates is reasonable. That means we create a matrix M of n' rows and n columns as follows: for each row i we sample n numbers from N(0, 1), forming a row vector a_i^T. We normalize it, obtaining the row vector b_i^T = a_i^T · (a_i^T a_i)^{−1/2}. This becomes the ith row of the matrix M. Then, for any data point x in the original space, its random projection is obtained as x' = Mx.
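The sampling procedure just described can be sketched in pure Python as follows (in practice a linear-algebra library would be used; the function names are ours):

```python
import math
import random

def random_projection_matrix(n: int, n_prime: int, rng: random.Random):
    """Build the n' x n matrix M described above: each row is a vector of
    n samples from N(0, 1), normalized to unit length."""
    M = []
    for _ in range(n_prime):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in a))  # (a^T a)^{1/2}
        M.append([v / norm for v in a])
    return M

def project(M, x):
    """x' = M x: the random projection of a point x in R^n into R^{n'}."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

rng = random.Random(0)
M = random_projection_matrix(1000, 50, rng)
x_prime = project(M, [1.0] * 1000)  # a point from R^1000 mapped into R^50
```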
The mapping we seek is then this projection multiplied by a suitable factor. It is claimed afterwards that this mapping is distance-preserving not only for a single vector, but also for large sets of points, with positive but usually very small probability, as Dasgupta and Gupta [20] maintain. By applying the above process many times, one can finally get the mapping f that is needed. That is, each time we sample a subspace from the space of subspaces and check whether the condition expressed by inequality (1) holds for all the points. If it does not, we sample again, with the reasonable hope that we will get a subspace with the required properties after a finite number of steps, with the probability that we assume.
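The resample-and-check scheme described above can be sketched as follows (a pure-Python illustration; the rescaling factor n/n' for squared distances, and all names, are our assumptions rather than the paper's notation):

```python
import math
import random
from itertools import combinations

def random_projection(points, n_prime, rng):
    # one randomly sampled n' x n matrix M of normalized N(0, 1) rows
    n = len(points[0])
    M = []
    for _ in range(n_prime):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in a))
        M.append([v / norm for v in a])
    return [[sum(r * x for r, x in zip(row, p)) for row in M] for p in points]

def pairs_within_bounds(points, projected, scale, delta):
    # inequality (1): every rescaled projected squared distance must lie
    # within the (1 +/- delta) band around the original squared distance
    for (u, up), (v, vp) in combinations(list(zip(points, projected)), 2):
        d2 = sum((a - b) ** 2 for a, b in zip(u, v))
        d2p = scale * sum((a - b) ** 2 for a, b in zip(up, vp))
        if not (1 - delta) * d2 <= d2p <= (1 + delta) * d2:
            return False
    return True

def resample_until_fit(points, n_prime, delta, rng, max_tries=200):
    # the classical probabilistic JL scheme: sample subspaces over and over
    # until the error condition holds for all pairs of points
    n = len(points[0])
    scale = n / n_prime  # the 'suitable factor' for squared distances
    for attempt in range(1, max_tries + 1):
        projected = random_projection(points, n_prime, rng)
        if pairs_within_bounds(points, projected, scale, delta):
            return projected, attempt
    raise RuntimeError("no suitable subspace found in max_tries attempts")

rng = random.Random(1)
pts = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(8)]
proj, tries = resample_until_fit(pts, 100, 0.35, rng)
```

It is precisely this open-ended resampling loop that the set version of the JL Lemma, introduced below, replaces by a single sampling with a guaranteed success probability.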

JL dimensionality reduction and k-means clustering problem
In this paper, we explore the following deficiency of the discussed approach: If we want to apply, for example, a k-means clustering algorithm, we are in fact not interested in resampling the subspaces in order to find a convenient one in which the distances are sufficiently preserved. Computing, over and over again, m²/2 distances between the points in the projected space may turn out to be much more expensive than computing O(mk) distances during k-means clustering (if m ≫ k) in the original space. In fact, we are primarily interested in clustering the data. But we have no criterion for the k-means algorithm (nor, in fact, for any other clustering algorithm) that would say that this particular subspace is the right one via, e.g. minimization of the k-means criterion. Therefore, we rather seek a scheme that allows us to say that, by a certain random sampling, we have already found the suitable subspace that we sought, with a sufficiently high probability. As far as we know, this is the first time such a problem has been posed.
But from the point of view of clustering (and, more generally, of other data mining tasks), keeping the projection errors of pairs of points in a certain range is not by itself sufficient in practice. We also want to know whether or not the projection will distort the solution to the problem in the original space.
In particular, we take a closer look at the k-means algorithm and prove that a good solution in the projected space is also a good solution in the original space (see Theorem 2.3). Furthermore, under proper assumptions, local optima in the original space are also local optima in the projected space and vice versa (Theorems 2.4, 2.5). Finally, we show that a perfect k-means algorithm (an ideal algorithm that returns the global optimum solution to the k-means clustering problem) in the projected space provides a constant-factor approximation of the global optimum in the original space (Theorem 2.6). In this way, we state conditions under which it is worthwhile to perform k-means in the projected space, with guarantees that the solutions will tell us something about the original problem in the original space.
To formulate claims concerning k-means, we need to introduce additional notation. Let us denote by C a partition of 𝒬 into k clusters {C_1, . . . , C_k}. For any i ∈ 𝒬, let C(i) denote the cluster C_j to which i belongs. Let Q = {x_1, . . . , x_m} be the set of representations of these objects in some Euclidean space, and let Q' = {x'_1, . . . , x'_m} be the set of representations of these objects in another Euclidean space (after a random projection). For any set of objects C_j, let µ(C_j) = (1/|C_j|) Σ_{i∈C_j} x_i denote its gravity centre. Under this notation, the k-means cost function may be written as

J(Q, C) = Σ_{i∈𝒬} ||x_i − µ(C(i))||²   (2)

and analogously J(Q', C) for the set Q'.
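Equation (2) translates directly into code; a minimal sketch (the names are ours):

```python
def centre(points):
    """mu(C_j): the gravity centre of a set of points."""
    n = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(n)]

def kmeans_cost(X, partition):
    """Equation (2): J(Q, C) = sum_i ||x_i - mu(C(i))||^2, where partition
    is a list of clusters, each cluster a list of indices into X."""
    cost = 0.0
    for cluster in partition:
        mu = centre([X[i] for i in cluster])
        for i in cluster:
            cost += sum((X[i][d] - mu[d]) ** 2 for d in range(len(mu)))
    return cost

X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
print(kmeans_cost(X, [[0, 1], [2, 3]]))  # two tight clusters
```

For the example above, each cluster contributes 2 (two points at squared distance 1 from their centre), so the printed cost is 4.0; merging all four points into one cluster raises the cost to 104.0.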

JL dimensionality reduction and clusterability problem
We also investigate the broader issue of preserving clusterability when the JL Lemma projection is applied. We define the conditions under which the clusterability property of the original space is transmitted to the projected space, so that a broad class of clustering algorithms for the original space is applicable in the projected space. The property of clusterability is, informally speaking, understood as a special property of data that makes finding a good clustering of the data an easy task. In the literature, five brands of clusterability are usually distinguished (see [14]): Perturbation Robustness, σ-Separatedness, (c, σ)-Approximation-Stability, β-Centre-Stability and Weak Deletion Stability.

Perturbation Robustness refers to the property of the data that under slight distortions of distances between data points the optimal clustering is safeguarded. The projection under the JL Lemma may obviously lead to a violation of this property, because the projection already distorts the distances. Therefore, in Theorem 3.6 we establish conditions under which Perturbation Robustness is preserved. σ-Separatedness refers to the property that the cost of the optimal clustering into k clusters should be lower by some factor than that of the clustering into k − 1 clusters. As the JL projection changes the distances between points, the cost function value proportions will also change, in a way that cannot be determined in advance (decrease or increase). In Theorem 3.1, we establish conditions under which this property is preserved. The (c, σ)-Approximation-Stability property refers to the closeness of costs between alternative partitions. This closeness may obviously be distorted by the JL projection, so that partitions close in the original space will not be sufficiently close in the projected space and vice versa. Theorem 3.2 establishes conditions under which the property may nonetheless be preserved under projection. The β-Centre-Stability property requires that the distance of a point to its own cluster centre be smaller by a factor than the distances to other cluster centres. Under the JL projection, a violation of this property may take place, as a point may be moved further from its own cluster centre and closer to some other. Theorem 3.7 gives conditions under which the property is retained. Finally, Weak Deletion Stability requires that the removal of one cluster centre impose some minimal increase in the cost function of the clustering. Changes in the distances may clearly lead to a violation of this property, so that it cannot then be exploited in the projected space. Theorem 3.8 shows conditions under which the projection upholds the property.

Our contribution
Our contribution is as follows:
• We formulate and prove a set version of the JL Lemma (see Theorem 2.1).
• Based on it, we demonstrate that a good solution to the k-means problem in the projected space is also a good one in the original space (see Theorem 2.3).
• We show that, under proper conditions, there is a 1-1 correspondence between local k-means minima in the original and the projected spaces (see Theorems 2.4, 2.5).
• We demonstrate that a perfect k-means algorithm in the projected space is a constant-factor approximation of the global optimum in the original space (see Theorem 2.6).
• We prove that the projection preserves several clusterability properties, such as Multiplicative Perturbation Robustness (Theorem 3.6), σ-Separatedness (Theorem 3.1), (c, σ)-Approximation-Stability (Theorem 3.2), β-Centre-Stability (Theorem 3.7) and Weak Deletion Stability (Theorem 3.8).
The structure of the paper is as follows. In Sect. 2, we introduce the set-friendly version of the Johnson-Lindenstrauss Lemma, together with theorems investigating properties of the results of the k-means algorithm in the original and the projected spaces (local and global optima). In Appendix A.1, we prove the set-friendly version of the JL Lemma, whereas the proofs of the remaining theorems introduced in Sect. 2, that is, the ones relating k-means clustering results in the original and the projected spaces, are deferred to Appendix B. In Sect. 3, we demonstrate an additional advantage of our version of the JL Lemma, consisting in the preservation of various data clusterability criteria. In Sect. 4, we illustrate the advantage of Theorem 2.1 by some numerical simulation results, showing at the same time the impact of the various parameters of our version of the Johnson-Lindenstrauss Lemma on the dimensionality of the projected space. We show that k-means clustering behaves as expected under the JL Lemma projection. We also verify via simulations the clusterability-preserving properties of the JL-Lemma-related projection. In Sect. 5, we recall the corresponding results of other authors. Section 6 contains some concluding remarks.

Relationship between k-means solutions in the projected and original space
While it is a desirable property if we can reduce the dimensionality of the space in which the data are embedded (e.g. the computational burden is reduced), we should be aware of the fact that we distort the data, and in this way we run the risk that, operating on the projected data, we miss the solution of our original problem. This is the primary concern about the original JL Lemma as expressed in Theorem 1.1. This risk is particularly worth considering in the domain of k-centre data clustering. Distortion of distances may lead to the undesirable situation that some data points switch between clusters, and in an extreme case the optimal clusters in the original and in the projected domains barely overlap.

Reformulated JL lemma and the impact on cost function value in projected space
We begin this section with the presentation of our version of the JL Lemma, expressed in Theorem 2.1. While the original JL Lemma is concerned with the existence of a solution to the projection problem (a projection fitting error bounds on pairwise distances), we establish under what circumstances the projection almost surely fits the pairwise distance error bounds. This theorem constitutes a significant step forward towards applicability of the projection. However, it is not enough: only the keeping of error bounds is ensured, but we need more, namely assurance that the solutions to the clustering problem in the projected space faithfully represent the solutions in the original space. Therefore, we subsequently present a series of claims about the relationship between the results of the k-means clustering algorithm in the projected and the original space. A close relationship is needed if the random projection is to be used as a step preceding application of the k-means clustering algorithm to the data. We express the properties of the original clustering problem required so that the dimensionality reduction makes real sense.
We restrict in this way our attention to a particular class/family of clustering algorithms. Such a restriction is on the one hand necessary in order to get precise results, and on the other hand the popularity of k-means family is so widespread that the results can still be of vital interest for a broad audience.
This limitation, as shown in Theorem 2.3, allows us to exploit the particular form of JL Lemma to limit directly the solution distortions in the projected space compared to the original one.
This limitation is also related to specific properties of the k-means algorithm family. These algorithms aim at optimizing a cost function that has quite a rough landscape; therefore, the algorithms frequently get stuck at a local minimum. To demonstrate the equivalence of solutions in the original and projected spaces, we show in Theorem 2.4 that the local minima of the original space may, under some conditions, correspond to local minima in the projected space. This means that if we look for local minima in the original space, we can as well limit our attention to the projected space. Theorem 2.5 considers the relationship in the opposite direction: local minima in the projected space may well be local minima in the original space. This means that if we have found a local minimum in the projected space, we have found one in the original space.
Concerning the global optimum, we demonstrate in Theorem 2.6 that the global optimum in the projected space is a constant-factor approximation of the global optimum of the original space. Hence it makes sense to look for a global optimum in the projected space.
So let us turn to the general theorem on random projection.
Theorem 2.1 Let δ ∈ (0, 1/2) and ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C^n, and let n' satisfy the lower bound (4). Let C^{n'} be a randomly selected n'-dimensional orthogonal coordinate system. For each v ∈ Q, let v' be its projection onto C^{n'}. Then, for all pairs u, v ∈ Q, relation (5) holds with probability at least 1 − ε.
The proof of this Theorem can be found in Appendix A.1.
The fundamental new insight of this theorem is that it refers explicitly to the failure probability ε. In the literature, see, e.g. [10-12,20], the failure probability was not treated as a control parameter of dimensionality; it was rather derived from the other parameters controlling n' in the respective formulas. With our formulation, the user has the freedom to choose ε as low as she/he wants, to have probability of success 1 − ε of getting (in a single pass) a random projection with the pairwise (relative) distance error limited by δ.
Note that the lower bound on the dimensionality n' grows with the inverted square of the permissible relative error range δ (since −ln(1 + δ) + δ ≈ δ²/2). One can see this in Fig. 1 (the black line). The sample size m, on the other hand, affects n' only logarithmically, as visible in Fig. 2 (black line). Figure 3 (black line) illustrates the dependence of the lower bound on n' on the failure probability ε (logarithmic dependence for values below 1).
The formula (4) provides an explicit expression for computing a lower bound on n'. We will therefore sometimes refer to it as n'_E, "E" standing for Explicit. It is also independent of n. The question may be raised whether it is possible to reduce n' via exploitation of n.
We propose for this purpose Algorithm 1. The algorithm seeks, via bisection, the solution of the minimization problem for the function I(n') expressing the failure probability for a given target dimensionality n'. The algorithm proceeds as follows: One starts with n'_L := 1, n'_H := round(n/3) − 1. If I(n'_H) > ε, then the search for n'_I has failed. Otherwise, one determines in a loop n'_M := round((n'_L + n'_H)/2) and computes I(n'_L), I(n'_M), I(n'_H); then, if I(n'_M) < ε, one sets n'_H := n'_M, otherwise n'_L := n'_M (n'_M is always rounded up to the next integer). This process is continued until n'_M does not change. n'_I is then set to n'_H.

Theorem 2.2 Let δ ∈ (0, 1/2), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C^n, and let n' = n'_I be computed according to Algorithm 1. Let C^{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. For each v ∈ Q, let v' be its projection onto C^{n'}. Then, for all pairs u, v ∈ Q, the relation (5) holds with probability at least 1 − ε.

Let us stress at this point the significance of these theorems. Earlier forms of the JL Lemma required sampling of the coordinates over and over again, with quite a low success rate (e.g. in [20] about 1/m), until a mapping is found satisfying the error constraints (see, e.g. [39]). In our theorems, we need only one sampling in order to achieve the required success probability of selecting a subspace suitable for performing k-means.
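The bisection of Algorithm 1 can be sketched as follows. Note that the failure-probability function I used here is purely illustrative (a union-bound-style surrogate of our own choosing) and is NOT the function defined in the paper:

```python
import math

def bisect_min_dim(I, eps, n):
    """Algorithm 1 (sketch): bisection for the smallest n' with I(n') below
    eps.  I must be non-increasing in n'."""
    n_lo, n_hi = 1, round(n / 3) - 1
    if I(n_hi) > eps:
        raise ValueError("even n' = round(n/3) - 1 fails: no n'_I exists")
    while n_hi - n_lo > 1:
        n_mid = (n_lo + n_hi + 1) // 2  # rounded up to the next integer
        if I(n_mid) < eps:
            n_hi = n_mid
        else:
            n_lo = n_mid
    return n_hi

# Illustrative surrogate only, NOT the paper's I(n'): a union-bound-style
# failure probability m^2 * exp(-n' * delta^2 / 8), decreasing in n'.
m, delta, n = 1000, 0.2, 20000
I = lambda k: min(1.0, m * m * math.exp(-k * delta ** 2 / 8))
n_I = bisect_min_dim(I, eps=0.05, n=n)
```

The loop maintains the invariant that I(n'_H) is already below ε while I(n'_L) is not, so on termination n'_H is the smallest admissible dimensionality.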
These sampling theorems can be used with any algorithm that requires a dimensionality-reduced sample satisfying some distortion constraints.
It turns out, however, that the properties of this sampling technique are particularly relevant for k-means.
We make the following claim for the k-means objective:

Theorem 2.3 Let Q be a set of m representatives of objects from 𝒬 in an n-dimensional orthogonal coordinate system C^n. Let δ ∈ (0, 1/2), ε ∈ (0, 1), and let n' satisfy condition (4) or (7). Let C^{n'} be a randomly selected (via sampling from a normal distribution) n'-dimensional orthogonal coordinate system. Let the set Q' consist of m points such that for each i ∈ 𝒬, x'_i ∈ Q' is the projection of x_i ∈ Q onto C^{n'}. If C is a partition of 𝒬, then inequality (8) holds with probability at least 1 − ε.

Note that inequality (8) can be rewritten to state that the relative estimation error is limited by δ. The above theorem thus tells us how closely we can approximate the cost function of k-means in the original space via the solution to the k-means problem in the projected space.
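The content of Theorem 2.3 can be checked empirically on random data. The sketch below uses a single Gaussian random projection and rescales the projected cost by n/n'; this scaling choice, and all names, are our assumptions:

```python
import math
import random

def kmeans_cost(X, partition):
    # J(Q, C): sum of squared distances of points to their cluster centre
    cost = 0.0
    for cluster in partition:
        pts = [X[i] for i in cluster]
        mu = [sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0]))]
        cost += sum(sum((X[i][d] - mu[d]) ** 2 for d in range(len(mu)))
                    for i in cluster)
    return cost

rng = random.Random(7)
n, n_prime, m = 300, 150, 30
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

# one random projection: n' rows of normalized N(0, 1) samples
M = []
for _ in range(n_prime):
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    nrm = math.sqrt(sum(v * v for v in a))
    M.append([v / nrm for v in a])
Xp = [[sum(r * x for r, x in zip(row, xi)) for row in M] for xi in X]

partition = [list(range(0, 15)), list(range(15, 30))]
J_orig = kmeans_cost(X, partition)
J_proj = (n / n_prime) * kmeans_cost(Xp, partition)  # rescaled projected cost
rel_err = abs(J_proj - J_orig) / J_orig
```

On such data, the relative deviation between the original and the rescaled projected cost is small, in line with the theorem's δ-bounded estimation error.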

Local and global cost function minima in projected space
But the interdependence between solutions to the k-means problem in the two spaces is even deeper, as the two subsequent theorems show. Let us look at the local minima of k-means, at which the k-means algorithms usually get stuck. Theorem 2.4 states conditions under which the local minima of the original space correspond to local minima in the projected space; Theorem 2.5 considers the relationship in the opposite direction.
Before proceeding, let us introduce the concept of the gap between clusters. It is obvious that if there are points at the border of two clusters, then under projection there exists a high risk that these points would move to the other cluster. Therefore, in order to keep cluster membership unchanged under projection, a gap between the clusters needs to exist. We shall understand the gap as follows: k-means assures that for any two clusters C_1, C_2 there exists a hyperplane h orthogonal to the line segment connecting both cluster centres and cutting it at half of the distance between these centres. The points of one cluster lie on one side of this plane, the points of the other on the other side. The absolute gap G between the two clusters is understood as twice the smallest distance of any element of these two clusters to this hyperplane h. That is, for an absolute gap G_{C_1,C_2} between the two clusters, the distance between any point of these clusters and the border is larger than G_{C_1,C_2}/2. We prefer subsequently to refer to the relative gap g, that is, G divided by the distance between the two cluster centres.

Theorem 2.4 Under the assumptions and notation of Theorem 2.3, if the partition C* yields a local minimum (in the original space) of J(Q, C) over all possible partitions C of 𝒬, if for any two clusters, for some α ∈ [0, 1), g = 2(1 − α) is lower than or equal to the relative gap between these clusters, and if δ satisfies an appropriate upper bound depending on α and p (p to be defined later by inequality (38) on page 46), then this same partition is (in the projected space) also a local minimum of J(Q', C) over C, with probability at least 1 − ε.
Theorem 2.4 tells us that we need to meet two conditions if we want local minima in the original space to have corresponding local minima in the projected space. The first refers to the need for separation of each pair of clusters by a gap: the relative width of the area between the clusters (where no data points are present) must amount to at least some g. The second condition imposes restrictions on the upper bound for δ, which no longer equals 1/2 but may be a smaller value, depending on the mentioned gap (the smaller the gap, the smaller the δ). δ is also influenced by a factor p representing the compactness of the clusters. Indirectly, of course, the projected space dimensionality is affected, because the smaller δ is, the larger n' will be.
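The relative gap just defined can be computed directly; a minimal sketch (the names are ours):

```python
import math

def relative_gap(cluster1, cluster2):
    """Relative gap g between two clusters: twice the smallest distance of
    any of their points to the separating hyperplane h (orthogonal to the
    segment joining the centres, through its midpoint), divided by the
    distance between the centres."""
    def centre(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0]))]
    c1, c2 = centre(cluster1), centre(cluster2)
    axis = [b - a for a, b in zip(c1, c2)]
    L = math.sqrt(sum(v * v for v in axis))       # distance between centres
    mid = [(a + b) / 2 for a, b in zip(c1, c2)]
    def dist_to_h(p):
        # absolute projection of (p - mid) onto the unit inter-centre axis
        return abs(sum((pi - mi) * ai
                       for pi, mi, ai in zip(p, mid, axis))) / L
    G = 2 * min(dist_to_h(p) for p in cluster1 + cluster2)  # absolute gap
    return G / L                                            # relative gap g

g = relative_gap([[0.0, 0.0], [1.0, 0.0]], [[9.0, 0.0], [10.0, 0.0]])
```

For the two clusters above, the closest points lie at distance 4 from the separating hyperplane, so the absolute gap is 8, while the centres are 9 apart, giving g = 8/9.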
Quite similar behaviour is observed if we request that a local minimum found in the projected space should correspond to a local minimum in the original space.

Theorem 2.5 Under the assumptions and notation of Theorem 2.3, if the clustering C* constitutes a local minimum (in the projected space) of J(Q', C) over C, if for any two clusters 1 − α times the distance between their centres is the gap between these clusters, where α ∈ [0, 1), and if δ satisfies the respective upper-bound condition, then the very same partition C* is also (in the original space) a local minimum of J(Q, C) over C, with probability at least 1 − ε.
Note that in both theorems the conditions on δ are quite similar and converge to one another for vanishing δ. In fact, the condition of Theorem 2.5 implies that of Theorem 2.4, so whenever local minima in the projected space correspond to local minima in the original space, those of the original space also correspond to those of the projected space.
The fact that the local minima correspond to each other in both spaces does not automatically mean that the global minimum in the original space is also the global minimum in the projected space. However, the following theorem indicates that the values at both optima correspond closely to one another. That is, if we find the global minimum in the projected space, then we can estimate the global optimum in the original space quite well.

Theorem 2.6 Under the assumptions and notation of Theorem 2.3, if C_G denotes the clustering reaching the global optimum in the original space, and C'_G denotes the clustering reaching the global optimum in the projected space, then the costs of these two clusterings differ by at most a constant factor, with probability at least 1 − ε. That is, the perfect k-means algorithm in the projected space is a constant-factor approximation of the k-means optimum in the original space.
We defer the proofs of Theorems 2.3-2.6 to Appendix B. We derive the basic Theorem 2.1 in Appendix A.1; the derivation is essentially based on results reported by Dasgupta and Gupta [20]. The proof of Theorem 2.2 can be found in Appendix A.2.

Clusterability and the dimensionality reduction
In Sect. 2, we stated conditions under which our formulation of the JL Lemma (Theorem 2.1) allows us to use random projection to perform the clustering in the projected space instead of in the original space, using a specific class of clustering algorithms, namely k-means. One may rephrase our results by saying that we have formulated conditions for the k-means family under which the transformed data can be clustered in the same way as the original data.
Let us turn in this section to the somewhat related topic of clusterability of data. The property that the data are clusterable usually means that a clustering of the data can be found easily, i.e. with an algorithm of polynomial complexity. This is a very useful property for large data sets. Clusterability conditions usually include requirements of identical clustering under some data transformations. An important question is whether or not adding the extra random projection as a data transformation may lead to the loss of the clusterability property. The specific contribution of this section is a discussion of how well the general property of clusterability is preserved under random projection using our version of the JL Lemma, in the case when the cost function is defined as for k-means.
While the clusterability property has some intuitive appeal as a property of data allowing for "easy clustering", concrete formal notions of clusterability differ substantially. The "easiness" generally refers to (low) algorithm complexity given some restrictions imposed on the form of the data, but these restrictions may take different forms, hence the various notions of clusterability.
And in fact, in the literature a number of notions of the so-called clusterability have been introduced [2,6-8,14,15,32]. Under these notions of clusterability, algorithms have been developed that cluster the data nearly optimally in polynomial time when some constraints are matched by the clusterability parameters.
It therefore seems worthwhile to examine whether the aforementioned projection technique affects the clusterability property/properties of the data sets.

Selected notions of clusterability and issues with JL projection
Let us consider, as representatives, the following notions of clusterability, present in the literature: σ -Separatedness [32] means that the cost J(Q, C k ) of optimal clustering C k of the data set Q into k clusters is less than σ 2 (0 < σ < 1) times the cost J(Q, C k−1 ) of optimal clustering C k−1 into k − 1 clusters -(c, σ )-Approximation-Stability [8] means that if the cost function values of two partitions , then the distance (in some space) between the partitions is at most σ (d(C a , C b ) ≤ σ for some distance function d between partitions). As Ben-David [14] remarks, this implies the uniqueness of optimal solution. -Perturbation Robustness means that small perturbations of distances / positions in space of set elements do not result in a change in the optimal clustering for that data set. Two kinds may be distinguished: additive [2] and multiplicative ones [15] (the limit of perturbation is upper-bounded either by an absolute value or by a coefficient). The s-Multiplicative Perturbation Robustness (0 < s < 1) holds for a data set with d 1 being its distance function if the following holds. Let C be an optimal clustering of data points with respect to this distance. Let d 2 be any distance function over the same set of points such that for any two points u, v, ). Then, the same clustering C is optimal under the distance function d 2 .
The s-Additive Perturbation Robustness (s > 0) holds for a data set with d 1 being its distance function if the following holds. Let C be an optimal clustering of data points for this distance. Let d 2 be any distance function over the same set of points such that for any two points u, v, Then, the same clustering C is optimal under the distance function d 2 .
Next, we are interested only in the multiplicative version.
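The multiplicative condition can be checked mechanically for a concrete pair of distance matrices. The sketch below assumes the two-sided form s·d1(u,v) ≤ d2(u,v) ≤ d1(u,v)/s, which is our reading of the definition above:

```python
import numpy as np

def is_multiplicative_perturbation(d1, d2, s):
    """Check whether the distance matrix d2 is an s-multiplicative
    perturbation of d1, i.e. s*d1(u,v) <= d2(u,v) <= d1(u,v)/s for all
    pairs (our assumed reading of the definition), with 0 < s < 1."""
    off = ~np.eye(d1.shape[0], dtype=bool)           # ignore the zero diagonal
    return bool(np.all(s * d1[off] <= d2[off]) and
                np.all(d2[off] <= d1[off] / s))

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 3))
d1 = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
print(is_multiplicative_perturbation(d1, 0.95 * d1, s=0.9))  # True
print(is_multiplicative_perturbation(d1, 0.5 * d1, s=0.9))   # False
```

Uniformly rescaling all distances by a factor inside [s, 1/s] is a valid s-perturbation, while halving them is not, as the example shows.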
- β-Centre Stability [7] means, for any centric clustering, that under the optimal clustering the distance of an element to its own cluster centre is β > 1 times smaller than its distance to any other cluster centre.
- (1 + β)-Weak Deletion Stability [6] (β > 0) means that, given the optimal cost function value OPT for k centric clusters, the cost function value of a clustering obtained by deleting one of the cluster centres and assigning the elements of that cluster to one of the remaining clusters must be bigger than (1 + β) · OPT.

Subsequently, we consider the question of the impact of random projection, under our proposal of dimensionality reduction, on the preservation of these forms of clusterability. In Theorem 3.1, we show that the σ-Separatedness in the projected space increases as a function of δ, which means that the larger the permitted error δ, the larger must be the advantage of the (optimal) clustering into k clusters over the clustering into k − 1 clusters in order for clustering algorithms to be able to exploit the clusterability property (a smaller σ indicates a bigger advantage of clustering into k clusters over clustering into k − 1 clusters, and usually σ must lie below some threshold for optimal algorithms to be applicable). Conversely, if this advantage is lower in the original space, then δ has to be lower too, and therefore only a smaller dimensionality reduction is permitted. Theorem 3.2 points out that random projection narrows (with increasing δ) the range c of clustering cost functions that have to lie within the same distance σ from a given clustering for (c, σ)-Approximation-Stability. The decrease in c, as shown by Balcan et al. [8], is disadvantageous, as it worsens the approximation complexity of the optimal clustering of a data set, possibly to the point of NP-hardness. Therefore, low error values δ are preferred.
The next notion of clusterability that we will investigate is the Multiplicative Perturbation Stability. The effect of random projection on this property will be explained in Theorem 3.6. In order to prove it, we will need two lemmas: Lemma 3.3 on double perturbation and Lemma 3.5 on conditions when multiplicative perturbation stability ensures that the global k-means optima in the original and the projected space are identical.
Finally, we will turn to theorems on β-Centre-Stability (Theorem 3.7) and 1 + β Weak-Deletion-Stability (Theorem 3.8). The discussion of both these properties will be performed in conjunction with Multiplicative Perturbation Stability.

σ-separatedness in the projected space
Let us first have a look at the σ-Separatedness. Assume that the data Q in the original space have the property of σ-Separatedness for some σ. Let C_{G,k} denote an optimal clustering into k clusters in the original space and C′_{G,k} one in the projected space.
From Theorem 2.6, we know that

J(Q′, C′_{G,k}) ≤ (1 + δ)(n′/n) J(Q, C_{G,k}),    (15)

J(Q′, C′_{G,k−1}) ≥ (1 − δ)(n′/n) J(Q, C_{G,k−1}),    (16)

and σ-Separatedness (14) implies that J(Q, C_{G,k}) < σ² J(Q, C_{G,k−1}). Hence,

J(Q′, C′_{G,k}) ≤ (1 + δ)(n′/n) σ² J(Q, C_{G,k−1}),

where the latter inequality was obtained by applying inequality (15) in the numerator, and

J(Q′, C′_{G,k}) / J(Q′, C′_{G,k−1}) ≤ σ² (1 + δ)/(1 − δ),

which was obtained by applying inequality (16) in the denominator. We have thus proved

Theorem 3.1 Under the assumptions and notation of Theorem 2.3, if the data set Q has the property of σ-Separatedness in the original space, then with probability at least 1 − ε it has the property of σ√((1 + δ)/(1 − δ))-Separatedness in the projected space.

The fact that this Separatedness parameter increases under projection is of course a deficiency, because clustering algorithms require as low a Separatedness as possible (the clusters are then better separated). Recall that Ostrovsky et al. [32] developed a series of algorithms for efficient computation of near-optimal k-means clustering in polynomial time in case the data possess the σ-Separatedness property (with the required σ being a small number, usually 0.01 or lower). The above theorem establishes to what extent random projection preserves this property, so that these special algorithms remain applicable to the projected data.
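As a quick numerical illustration, the sketch below assumes that the separatedness parameter scales by √((1+δ)/(1−δ)) under projection, which is our reading of Theorem 3.1, and shows how fast the small-σ budget of Ostrovsky-style algorithms is consumed by the projection error δ:

```python
import math

def projected_separatedness(sigma, delta):
    """Separatedness parameter guaranteed in the projected space,
    assuming it scales by sqrt((1+delta)/(1-delta)) (Theorem 3.1)."""
    return sigma * math.sqrt((1 + delta) / (1 - delta))

# sigma = 0.01 is the order of magnitude required by Ostrovsky-style algorithms.
for delta in (0.05, 0.1, 0.2):
    print(delta, projected_separatedness(0.01, delta))
```

Even a moderate δ pushes σ beyond 0.01; the allowed δ must therefore be matched to the separatedness threshold required by the clustering algorithm.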

(c, σ)-approximation-stability in the projected space
Let us turn to the (c, σ)-Approximation-Stability property. We can reformulate it as follows: if the distance (in some space) between two partitions is more than σ, then the cost function values of the two partitions differ by more than the factor c > 1.
So assume we have the (c, σ)-Approximation-Stability property in the original space. Consider two partitions C_1, C_2 with d(C_1, C_2) > σ. Without loss of generality, assume that in the original space J(Q, C_1) > c · J(Q, C_2) holds. By applying Theorem 2.3 to both clusterings C_1, C_2, we get

J(Q′, C_1) ≥ (1 − δ)(n′/n) J(Q, C_1) and J(Q′, C_2) ≤ (1 + δ)(n′/n) J(Q, C_2),

and hence

J(Q′, C_1) > c · (1 − δ)/(1 + δ) · J(Q′, C_2).

This result holds for any two clusterings at distance more than σ, yielding (c·(1 − δ)/(1 + δ), σ)-Approximation-Stability in the projected space, with appropriate probability. Thus, we have

Theorem 3.2 Under the assumptions and notation of Theorem 2.3, if the data set Q has the property of (c, σ)-Approximation-Stability in the original space, and c > (1 + δ)/(1 − δ), then it has the property of (c·(1 − δ)/(1 + δ), σ)-Approximation-Stability in the projected space with probability at least 1 − ε.
Recall that Balcan et al. [8] developed a series of algorithms, including one for k-means (see their Lemma 4.1), that solve the clustering problem efficiently for data with (c, σ)-Approximation-Stability. The above theorem states under what conditions the very same algorithms are applicable in the projected space. Note, however, the probabilistic nature of this stability; note also that the property can be lost altogether if c is too small.

s-multiplicative perturbation robustness in the projected space
Let us now consider s-Multiplicative Perturbation Robustness. First, we need two auxiliary results. The first one concerns the transitivity of this kind of robustness. We claim that:

Lemma 3.4 For a set Q of representatives of objects with the distance function d_1, let Q_p be a set of representatives of the very same objects with a distance function d_2. If the data set Q has the property of s-Multiplicative Perturbation Robustness under the distance d_1, and the set Q_p is its ν-perturbation with distance d_2 and s = ν · s_p, where 0 < ν, s_p < 1, then the set Q_p has the property of s_p-Multiplicative Perturbation Robustness.
Proof Q_p is a ν-perturbation of Q and, as ν > s, it is also an s-perturbation of Q; therefore, by the definition of Multiplicative Perturbation Robustness, both share the same optimal clustering. Let Q_q be an s_p-perturbation of Q_p, that is, one with a distance d_3 such that s_p d_2(u, v) ≤ d_3(u, v) ≤ d_2(u, v)/s_p for any two points u, v. Then,

s d_1(u, v) = ν s_p d_1(u, v) ≤ s_p d_2(u, v) ≤ d_3(u, v) ≤ d_2(u, v)/s_p ≤ d_1(u, v)/(ν s_p) = d_1(u, v)/s,

that is, Q_q is an s-perturbation of Q, so both share the same optimal clustering. Hence, Q_p and Q_q share a common optimal clustering, and Q_p has the property of s_p-Multiplicative Perturbation Robustness.

Let us note here in passing that Additive Perturbation Robustness implies Multiplicative Perturbation Robustness: for a data set of bounded diameter, a multiplicative s-perturbation deviates from d_1 by a bounded amount and is hence also an additive perturbation.

The second lemma, needed to prove Theorem 3.6 on Multiplicative Robustness, relates the global minima of the original and the projected space when multiplicative perturbation robustness is present in the data. We claim that:

Lemma 3.5 Under the assumptions and notation of Theorem 2.3, if the data set Q has the property of s-Multiplicative Perturbation Robustness with s² ≤ 1 − δ, and if C_G is the global optimum of k-means in Q, then it is also the global optimum in Q′ with probability at least 1 − ε.

Proof Assume to the contrary that in Q′ some other clustering C′_G is the global optimum. Let us define the distances d_1(i, j) = ‖x_i − x_j‖ over the original representation and d_2(i, j) = √(n/n′) ‖x′_i − x′_j‖ over the projected one. The distance d_2 is a realistic distance in the coordinate system of the projected space, as we assume n > n′. As the k-means optimum does not change under rescaling, C′_G is also an optimal solution of the clustering task under d_2. But Theorem 2.3 implies √(1 − δ) d_1(i, j) ≤ d_2(i, j) ≤ √(1 + δ) d_1(i, j); hence, as s² ≤ 1 − δ, the distance d_2 is an s-multiplicative perturbation of d_1, and so C_G should be optimal under d_2 as well, as we assumed the s-multiplicative perturbation robustness of the data set. Thus, we arrive at a contradiction and the claim of the lemma follows.
This leads to

Theorem 3.6 Under the assumptions and notation of Theorem 2.3, if the data set Q has the property of s-Multiplicative Perturbation Robustness with s = s_p √((1 − δ)/(1 + δ)), 0 < s_p < 1 (note that then s² ≤ 1 − δ), then its projection Q′ has the property of s_p-Multiplicative Perturbation Robustness with probability at least 1 − 2ε.

Proof Let Q′ be the projection of the data set Q. In order to demonstrate that Q′ has the property of s_p-Multiplicative Perturbation Robustness, we need to show that each s_p-perturbation of Q′ has the same optimum as Q′. Obviously, any data set in the projected space, so also each perturbation of Q′, is a projection of some set from the original space. So we take any set Q_o from the original space and look at its projection Q′_o. If Q′_o happens to be an s_p-perturbation of Q′, then we show that Q_o is an s-perturbation of Q. We demonstrate that, due to Q's s-robustness, Q_o, Q′_o and Q′ have the same optima. As this holds for all s_p-perturbations of Q′, Q′ is s_p-robust. In detail, we proceed as follows. As s² ≤ 1 − δ, Lemma 3.5 implies that the global optima of Q in the original space and of Q′ in the projected space are identical. So assume that in the original space the distance function is d_1(i, j) = ‖x_i − x_j‖, where x_i are the elements of Q. Furthermore, let the set Q_o, with elements y_i, i = 1, ..., m, be another representation of the set Q in the original space, and let Q′_o be its image under the very same projection that was applied to Q; let this projection keep the error range δ as well. The probability that this happens amounts to 1 − 2ε, as we now have twice as many projected points as were originally considered when calculating n′ for Q. Define d_2(i, j) = ‖y_i − y_j‖. If Q′_o is an s_p-perturbation of Q′, i.e. s_p ‖x′_i − x′_j‖ ≤ ‖y′_i − y′_j‖ ≤ ‖x′_i − x′_j‖/s_p, then combining this with the error bounds √((1 − δ) n′/n) d_1(i, j) ≤ ‖x′_i − x′_j‖ ≤ √((1 + δ) n′/n) d_1(i, j) and √((1 − δ) n′/n) d_2(i, j) ≤ ‖y′_i − y′_j‖ ≤ √((1 + δ) n′/n) d_2(i, j), we obtain

s_p √((1 − δ)/(1 + δ)) · d_1(i, j) ≤ d_2(i, j) ≤ (1/s_p) √((1 + δ)/(1 − δ)) · d_1(i, j).

So d_2 is a perturbation of d_1 with the factor s = s_p √((1 − δ)/(1 + δ)).
As Q with d_1 is s-multiplicative perturbation robust, therefore (by the definition of multiplicative perturbation robustness) both Q and Q_o have the same optimal clustering C_G. What is more, the above derivation also shows that Q_o with d_2 is an s_p √((1 − δ)/(1 + δ))-perturbation, that is an s-perturbation, of Q with d_1. As Q with d_1 is s-multiplicative perturbation robust with s² ≤ 1 − δ (by the assumption on s), Lemma 3.5 applies to the perturbed data as well, and the counterpart Q′_o of Q_o has the same optimum clustering C_G as Q_o with d_2. As we already showed, Q_o has the same optimal clustering as Q, and Q the same as Q′, so Q′_o has the same optimal clustering as Q′.
Note that for any s_p-perturbation Q′_o of Q′ there exists a Q_o in the original space such that Q′_o is its image under the projection, and, as it turned out, it yields the same optimal solution as Q′. So, with probability at least 1 − 2ε (the factor 2 is taken as we deal with two data sets, comprising the points x_i and y_i), the projection Q′ of Q possesses s_p-Multiplicative Perturbation Robustness.
Let us note at this point that Balcan et al. [9] developed a special polynomial-time clustering algorithm, suitable among others for k-means, exploiting Multiplicative Perturbation Robustness. The above theorem shows that with a careful choice of dimensionality reduction we can uphold the applicability of such algorithms. Via Lemma 3.4, it is also possible to extend these results to Additive Perturbation Robustness.

β-centre- and (1 + β)-weak-deletion-stability in the projected space
Let us now discuss the two remaining clusterability properties, β-Centre-Stability and (1 + β)-Weak-Deletion-Stability. From the perspective of k-means under projection, they differ substantially from the previously discussed ones. The earlier notions allowed us to establish a kind of link between the clustering in the original space and in the projected space. For Multiplicative Perturbation Robustness, for example, we were able to say when the optimal clusterings in the original and in the projected space are identical. σ-Separatedness dealt with optimal clustering costs for clusterings into k and k − 1 clusters, and we knew from Theorem 2.6 the relationship between the optimal clustering costs in the original and the projected space. In the case of (c, σ)-Approximation-Stability, we considered all possible clusterings, not just the optimal ones. In the two types of stability that we shall consider now, we have to handle the optimal clusterings explicitly, and we have no possibility to derive a relationship between the optimal clusterings of the original and the projected space from the stability property alone. Hence, we need some additional knowledge about the optimal clusterings. We have chosen here the Multiplicative Perturbation Robustness, as it establishes a straightforward relation between the optimal clusterings in both spaces; hence our formulation of the subsequent results.
We claim that:

Theorem 3.7 Under the assumptions and notation of Theorem 2.3, if the data set Q has the property of β-Centre Stability and of s-Multiplicative Perturbation Robustness with s² ≤ 1 − δ, then with probability at least 1 − ε it has the property of β√((1 − δ)/(1 + δ))-Centre Stability in the projected space.

Proof The s-Multiplicative Perturbation Robustness ensures that the original and the projected space share the same optimal clustering (see Lemma 3.5). To prove the claim, we hence need to explore, for each data point, to what extent the distance to its own cluster centre can relatively increase, while the distance to the other cluster centres can decrease, upon projection. When considering the relationship of a point to its own cluster centre, we use the same technique as in the proof of Lemma B.1. When considering the relationship of a point to some other cluster centre, we proceed as if we merged the point into the other cluster; the relationship of Lemma B.1 then applies again, to the extended cluster, and we only need to explore the relationship between the centre of the other cluster and the centre of the extended cluster. Consider a data point x_i and a cluster C of the optimal clustering not containing i. Then, x_i, μ(C) and μ(C ∪ {i}) are collinear. So are x′_i, μ′(C) and μ′(C ∪ {i}), that is, the respective (linear) projections.
Furthermore, ‖x_i − μ(C ∪ {i})‖ / ‖μ(C) − μ(C ∪ {i})‖ = |C|, hence

‖x_i − μ(C)‖ = ((|C| + 1)/|C|) ‖x_i − μ(C ∪ {i})‖.    (17)

Likewise, in the projected space,

‖x′_i − μ′(C)‖ = ((|C| + 1)/|C|) ‖x′_i − μ′(C ∪ {i})‖.    (18)

Upon projection, the distance to the own cluster centre can increase relatively by √(1 + δ) and the distance to the centre of C ∪ {i} can decrease by √(1 − δ) (up to the scaling factor √(n′/n)); see Lemma B.1. That means

‖x′_i − μ′(C_own)‖ ≤ √((1 + δ) n′/n) ‖x_i − μ(C_own)‖,    (19)

where C_own denotes the cluster of the optimal clustering containing x_i, and

‖x′_i − μ′(C ∪ {i})‖ ≥ √((1 − δ) n′/n) ‖x_i − μ(C ∪ {i})‖.    (20)

Due to the aforementioned relations, that is, if we multiply both sides of (20) by (|C| + 1)/|C| and substitute (18) on the left-hand side and (17) on the right-hand side of (20), we obtain

‖x′_i − μ′(C)‖ ≥ √((1 − δ) n′/n) ‖x_i − μ(C)‖.    (21)

Due to β-Centre-Stability in the original space, we have ‖x_i − μ(C)‖ > β ‖x_i − μ(C_own)‖. Hence, by substituting on the right-hand side of (21), we get

‖x′_i − μ′(C)‖ > β √((1 − δ) n′/n) ‖x_i − μ(C_own)‖.    (22)

By substituting with (19) on the right-hand side of (22), we get

‖x′_i − μ′(C)‖ > β √((1 − δ)/(1 + δ)) ‖x′_i − μ′(C_own)‖.

That is, the centre stability can drop to β√((1 − δ)/(1 + δ)).
Awasthi et al. [7] developed algorithms that find the optimal k-means clustering in polynomial time if the data meet the requirements of β-Centre Stability. The above theorem indicates to what extent their algorithms can be applied to randomly projected data, given that the original data meet the requirements.
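For experimentation, the β-Centre-Stability parameter of a given clustering can be estimated directly from the data. The helper below is our own diagnostic, not part of the referenced algorithms; it returns the largest β for which the stated inequality holds for every point:

```python
import numpy as np

def centre_stability(points, labels, centres):
    """Largest beta such that every point is beta times closer to its own
    cluster centre than to any other centre (beta-Centre Stability)."""
    d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=-1)
    own = np.maximum(d[np.arange(len(points)), labels], 1e-12)
    d[np.arange(len(points)), labels] = np.inf   # mask out the own centre
    return float((d.min(axis=1) / own).min())

# Two well-separated pairs on a line: highly centre-stable.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
lab = np.array([0, 0, 1, 1])
cen = np.array([[0.05, 0.0], [5.05, 0.0]])
print(centre_stability(pts, lab, cen))  # ~99
```

Comparing this value before and after projection gives an empirical counterpart to the β√((1−δ)/(1+δ)) degradation bound.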
β-Centre Stability implies that in the optimal solution each data point preserves some proportion of distances to the neighbouring cluster centres. Awasthi et al. also considered a somewhat weakened condition, in which a constraint is imposed on the cost function under deletion of a cluster centre. Also in this case, we can identify conditions under which random projection upholds this deletion-based stability condition.
We claim that:

Theorem 3.8 Under the assumptions and notation of Theorem 2.3, if the data set Q has the property of (1 + β)-Weak Deletion Stability and of s-Multiplicative Perturbation Robustness with s² ≤ 1 − δ, and (1 + β)(1 − δ)/(1 + δ) > 1, then with probability at least 1 − ε it has the property of (1 + β)(1 − δ)/(1 + δ)-Weak Deletion Stability in the projected space.

Proof The s-Multiplicative Perturbation Robustness ensures that the original and the projected space share the same optimal clustering (see Lemma 3.5). Let this optimal clustering be called C_o. By C, denote any clustering obtained from C_o by deleting one of the cluster centres and assigning the elements of that cluster to one of the remaining clusters. Theorem 2.3 implies that (1 − δ)(n′/n) J(Q, C) ≤ J(Q′, C′) and J(Q′, C′_o) ≤ (1 + δ)(n′/n) J(Q, C_o). By the assumption of (1 + β)-Weak Deletion Stability, (1 + β) J(Q, C_o) ≤ J(Q, C). Therefore,

J(Q′, C′) ≥ (1 − δ)(n′/n) J(Q, C) ≥ (1 + β)(1 − δ)(n′/n) J(Q, C_o) ≥ (1 + β)(1 − δ)/(1 + δ) · J(Q′, C′_o),

which implies the claim.
Awasthi et al. [6] have demonstrated, among others for k-means, that data having the property of (1 + β)-Weak Deletion Stability also possess the so-called PTAS property (existence of a polynomial-time approximation scheme). The above theorem states conditions under which this property is preserved under random projection.
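The weak-deletion condition can likewise be measured empirically. The helper below is a hypothetical diagnostic in which the cluster centres are kept fixed rather than recomputed after deletion; it returns the smallest cost ratio over all centre deletions, so the clustering is (1 + β)-Weak-Deletion stable when the returned value exceeds 1 + β:

```python
import numpy as np

def weak_deletion_ratio(points, labels, centres):
    """Smallest cost ratio J(C)/J(C_o) over clusterings C obtained from
    the clustering C_o by deleting one centre and assigning its whole
    cluster to the cheapest remaining centre.  Centres are kept fixed
    (not recomputed), as in the definition used above."""
    k = len(centres)
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    base = d2[np.arange(len(points)), labels].sum()
    best = np.inf
    for dead in range(k):
        mask = labels == dead
        keep = base - d2[mask, dead].sum()   # cost of the untouched clusters
        cost = min(keep + d2[mask, j].sum() for j in range(k) if j != dead)
        best = min(best, cost)
    return float(best / base)

# Two tight clusters on a line: deleting either centre is very costly.
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labs = np.array([0, 0, 1, 1])
cens = np.array([[0.5], [10.5]])
print(weak_deletion_ratio(pts, labs, cens))  # 201.0
```

A nonzero base cost is assumed; the routine applies to any centric clustering, not only the optimal one.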
Summarizing this section, we can say that random projection may, under suitable conditions, preserve several clusterability properties known in the literature; thus, conditions may be identified under which efficient clustering, according to the k-means cost function, is applicable in the projected space if it is applicable in the original space. So not only distance relations but also clusterability can be maintained under projection according to the JL Lemma.

Experiments
In the experimental part of this work, in a series of numerical experiments, we want to demonstrate the validity of various aspects of the JL Theorem applied to projected data, from the perspective of the k-means algorithm.

Numerical experiments on importance of differences between various dimensionality reduction formulas
A frequently raised question is to what extent a concrete formula for dimensionality reduction overshoots the real need for embedding dimensions. To give an idea of how effective the random projection is, see Fig. 4, which illustrates a typical distribution of distortions of distances between pairs of points in the projected space for one of the runs characterized in the figure caption. It shows the distribution of discrepancies between squared distances in the projected and in the original space, the distortions being expressed as relative deviations of the (rescaled) squared projected distances from the original ones. One can see that they correspond quite well to the imposed constraints. The vast majority of point pairs have a distortion much lower than δ. There exist, however, sufficiently many pairs for which the distortion is close to δ (in terms of the order of magnitude); therefore, one can assume that not much more can be gained. Another important question related to δ is its relation to k-means clustering under projection. Figure 5 illustrates the role of the intrinsic gap between clusters in the original space and the permitted value of δ. As one would expect, the bigger the relative gap between clusters, the larger the error value δ permitted, if class membership shall not be distorted by the projection.
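Distortion statistics of the kind shown in Fig. 4 are easy to reproduce. The sketch below uses illustrative sizes far smaller than those required by the theory and a plain Gaussian projection matrix; it computes the relative distortions of rescaled squared pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, n_proj, delta = 200, 2000, 800, 0.2   # illustrative, far below the theory

def sq_dists(a):
    """Matrix of squared Euclidean distances via the Gram-matrix trick."""
    g = a @ a.T
    s = np.diag(g)
    return s[:, None] + s[None, :] - 2.0 * g

x = rng.standard_normal((m, n))
p = rng.standard_normal((n, n_proj)) / np.sqrt(n)  # E||x'||^2 = (n_proj/n)||x||^2
y = x @ p

iu = np.triu_indices(m, k=1)
distortion = np.abs(sq_dists(y)[iu] * (n / n_proj) / sq_dists(x)[iu] - 1.0)
print("median distortion:", np.median(distortion))
print("fraction of pairs within delta:", (distortion <= delta).mean())
```

As in Fig. 4, most pairs land well inside the error range δ, with a small tail approaching it.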
Finally, when discussing the various formulas on dimensionality reduction under random projection, the question may be raised whether or not, under realistic values of the parameters δ, ε, m and n, there is a real advantage of our newly derived formulas for computing n′ over the ones provided by the literature, in particular that of Dasgupta and Gupta [20], and whether or not the implicit n′ computation gives us an advantage over the explicit n′ formula. (In the tables, the following symbols are used: n_I — implicit n′; n_E — explicit n′; n_G — for comparison, n′ as computed by Dasgupta and Gupta [20]; r — the number of repetitions of sampling needed to compute k-means under the Dasgupta and Gupta dimensionality reduction approach. Fig. 5 plots the required error parameter δ against the expected gap between clusters.) The practical reason for the interest in getting as low an n′ as possible is the following: the lower the dimensionality, the lower the numerical effort for computing distances between the objects. We have considered the following value ranges for these parameters: δ ∈ [0.01, 0.5], ε ∈ [0.001, 0.1], m ∈ [10, 10⁸] and n ∈ [4 · 10⁵, 10 · 10⁵]. Let us recall that under application of k-means it is vital that we have a high success probability (1 − ε) of selecting a random subspace such that, under projection of the data points onto this subspace, the distortion of distances between pairs of points is at most the assumed δ.
We investigate the behaviour of n_I versus n_E and the ratio n_G/n_I (n_G being the reduced dimensionality of Dasgupta and Gupta). We also want to know how many times the random projection has to be repeated under the Dasgupta/Gupta proposal in order to reach the assumed success probability 1 − ε. We investigated the impact of the original data dimensionality n (Table 4 and Fig. 6), the sample size m (Table 1 and Fig. 2), the accepted error δ (Table 3 and Fig. 1) and the assumed failure rate ε (Table 2 and Fig. 3). In these experiments, we investigated only the theoretical values (checking, for low sample sizes (below 1000) and low dimensionality (below 50,000), whether or not the values are confirmed in multiple (10) simulation runs; they were confirmed on each run, so no extra reporting is done). We did not experiment with whether lower values of n′ than those suggested by our formulas for n_E, n_I and the corresponding n_G would be sufficient, though this is a good subject for further research. Note that we have two formulas for computing the reduced space dimensionality n′: formula (7) for n_I and formula (4) for n_E. The latter does not engage the original dimensionality n, while it is explicit in n′. The value of n_I depends on n; however, it can only be computed iteratively.
The content of the tables indicates the limitations of dimensionality reduction via the JL Lemma. There is no point in applying dimensionality reduction via the JL Lemma if the dimensionality lies below 1000 (for δ < 0.1, ε < 0.5, m = 10).
Let us investigate the differences between the n′ computations in the implicit and explicit cases. We check the impact of the following parameters: n, the original dimensionality (see Table 4 and Fig. 6); δ, the limit on the deviation of the distances between data points in the original and the reduced space (see Table 3 and Fig. 1); m, the sample size (see Table 1 and Fig. 2); as well as ε, the maximum failure probability of the JL transformation (see Table 2 and Fig. 3). Note that in all figures the X-axis is on a log scale.
Concerning the differences between n_E and n_I, we see from Table 1 (see also Fig. 2) that an increase in sample size in the reasonable range 10–10⁸ increases the advantage of n_I over n_E to as much as 15%. On the other hand, in Table 2 (see also Fig. 3), we see that the failure probability ε does not give a particular advantage to either of these dimensionality sizes, which may be partially attributed to the considered sample size m, though the implicit value approaches the explicit one with increasing ε. We also see that when we increase the acceptable failure rate ε, the requested dimensionality n′ drops. A decrease in the error rate δ, as visible from Table 3 (see also Fig. 1), increases the gap between n_E and n_I, up to the point where the original dimensionality n is exceeded and hence the usage of dimensionality "reduction" is pointless (see the last line of Table 3). Figure 1 shows that the requested dimensionality drops quite quickly with an increased relative error range δ until a kind of saturation is reached.
As visible in Fig. 6, the value of n′ from the explicit formula does not depend on the original dimensionality n, while the implicit one does. From Table 4 (see also Fig. 6), we see that an increase in the original dimensionality gradually reduces the advantage of n_I over n_E. In fact, the value computed from the implicit formula approaches the explicit value with growing dimensionality n.
On the other hand, the implicit n departs from the explicit one with growing sample size m, as visible in Fig. 2. Both grow with increasing m.
In summary, these differences are not very big, but they can nonetheless be of significant advantage when computations over large data sets are likely to have long run times.
The behaviour of the explicit n′ is not surprising, as it is visible directly from formula (4). The important insight here is, however, the required dimensionality of the projected data: hundreds of thousands for realistic ε, δ. Thus, random projection via the Johnson-Lindenstrauss Lemma is not just another dimensionality reduction technique. It is suitable for very large data only, and it has proved to be a useful ingredient of techniques such as PCA; see, e.g., the so-called compressive PCA [37].
The behaviour of the implicit n′ for the case of increasing original dimensionality n is as expected: the explicit n′ reflects the "in the limit" behaviour of the implicit formulation. The discrepancy for ε and the divergence for growing m indicate that there is still space for better explicit formulas for n′. In particular, the case of increasing m is worth investigating, as processing in the original space becomes more expensive as m grows.
With regard to the n_I/n_G quotient, one shall stress that n_G always has a clear numerical advantage, of up to 400%, but one has to take into account that the warranty of obtaining a useful projection from the point of view of k-means is then low, according to the theoretical considerations. Hence, while we can perform the random projection only once under our dimensionality reduction method, the Dasgupta/Gupta projection needs to be repeated many times (the r column in the tables). So in Table 1, the number of needed repetitions r is already 44 in the most advantageous case of n_I/n_G. With an increase in sample size m, the quotient falls to a 100% advantage, while r increases radically, to hundreds of millions. This fact renders the Dasgupta/Gupta projection useless. The same disadvantage of the Dasgupta/Gupta projection is visible in the other tables. Note, however, that according to Table 2, with a decrease in ε the advantage of n_G over n_I rises from 90 to 120%; the increase in r is, however, disproportionate to this advantage. An increase in the original dimensionality n gives an advantage to n_G. The error range δ does not exhibit such an obvious pattern in its influence on the quotient.

Impact of projection on correctness of clustering
In this series of experiments, the validity of Theorem 2.6 along with Theorem 2.3 was verified.
The problem with an experimental investigation stems from the fact that k-means and its variants are known to usually get stuck in local minima (the price paid for their speed), especially in high-dimensional landscapes, while our Theorem 2.6 relates to the global optimum. So for an arbitrary data set, we would not know the optimum in advance, neither from experimental runs nor from theoretical considerations. Therefore, we created a special generator providing clusters that are well separated in a sense for which we knew the theoretical solution in the original space (the solution being a parameter of the generator). The solution in the projected space was not known in advance; it was considered to be the same as in the original space due to the theorems on perturbation robustness, and k-means and k-means++ were run in order to discover it experimentally. The experiment was considered a failure when either the deemed clustering was not discovered or the inequalities of the theorems were violated.
A series of experiments was performed, consisting in generating samples of n-dimensional data (n = 4900) with m records (m = 5000) and a cluster structure known in advance. In order to know the clustering in the original space in advance, the sample generator proceeded as follows: m points from the n-dimensional normal distribution centred at zero, with unit standard deviation and zero correlation between dimensions, were generated. Then, in a random manner, these data points were assigned to the desired number of clusters so that the sizes of any two clusters differ by at most 1. Then, the required distance between the balls enclosing the clusters was computed; it was set to twice the largest intrinsic cluster radius (measured from the centre to the most exterior cluster element) multiplied by the square root of the number of clusters k. Then, each cluster was moved away from the zero point along a different direction (which was possible as k < n) by the required distance divided by √2. This division by √2 ensures that the distances between the cluster-enclosing balls were equal to the required distance.
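A minimal sketch of this generator, following the recipe above literally, with the "different directions" realized as distinct coordinate axes (our simplification), could look as follows:

```python
import numpy as np

def make_separated_clusters(m, n, k, rng):
    """Gaussian cloud, balanced random labels, then each cluster shifted
    along its own coordinate axis (our stand-in for 'different
    directions') by (2 * max_radius * sqrt(k)) / sqrt(2)."""
    x = rng.standard_normal((m, n))
    labels = np.repeat(np.arange(k), -(-m // k))[:m]   # balanced assignment
    rng.shuffle(labels)
    radius = max(np.linalg.norm(x[labels == c] - x[labels == c].mean(axis=0),
                                axis=1).max() for c in range(k))
    shift = 2.0 * radius * np.sqrt(k) / np.sqrt(2.0)   # required distance / sqrt(2)
    for c in range(k):                                  # k < n is assumed
        x[labels == c, c] += shift
    return x, labels

rng = np.random.default_rng(0)
x, labels = make_separated_clusters(m=90, n=20, k=3, rng=rng)
centres = np.array([x[labels == c].mean(axis=0) for c in range(3)])
d = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
acc = (d.argmin(axis=1) == labels).mean()
print("nearest-centre accuracy:", acc)
```

With this construction, every point is closer to its own cluster centre than to any other, so the intrinsic clustering is recoverable by nearest-centre assignment.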
The data were projected onto a lower-dimensional subspace with dimensionality according to formula (4), and the k-means clustering procedure was executed both in the original space and in the target space. k-means was executed at least k/2 and up to k²/2 times, until agreement with the true clustering was obtained. If the true clustering was not obtained, k-means++ was applied up to k times. If there was still a disagreement, it was checked whether the total within-cluster sum of squares of the intrinsic clustering was lower than the one obtained from k-means++. If this were not the case in the projected space, it would mean a failure of the theory, because k-means++ would have found a better clustering than the one predicted by our theoretical considerations.
Multi-start k-means was applied because the algorithm is known to get stuck in local optima, while we were checking whether the global minimum is achieved. k-means++ is known to have a guaranteed vicinity to the intrinsic optimum (while k-means does not), though it is more time-consuming than k-means. It turned out that for larger values of k (k ≥ 9) k-means++ always had to be called to recover the intrinsic clustering, while for k = 2 one or more calls of k-means were sufficient.
Counts of complete and incomplete matches of the clusterings in the original space and in the projected space were performed for various numbers of clusters, k = 2, 3, 9, 81, 243, 729. The other parameter of the experiments, ε, was fixed at 0.05, which implied δ ≈ 0.23 and the projected dimensionality n_p = 1732.
The results are presented in Table 5. It is clearly visible that the probability of not violating projection cost constraints is even higher than the respective theorem predicts.

Impact of projection onto multiplicative perturbation stability
The experimental set-up was exactly the same as in Sect. 4.2, except that now multiplicative perturbations were performed. So we had an original data set X that was projected, resulting in the set X′. Furthermore, we had a set Y being a multiplicative perturbation of X and a set Y′ being a multiplicative perturbation of X′. We clustered each of them as indicated in the previous section via k-means or k-means++ and checked whether all these clusterings agree with the intrinsic one. We assumed that the maximal permissible multiplicative perturbation s is the one not violating the criterion of knowing the intrinsic clustering in advance. Then, we perturbed the elements of X by a factor randomly selected from the permissible range (s to 1/s) with respect to the respective cluster centre. This perturbation was effectively bigger than the theoretical value allowed by the perturbation definition; nonetheless, it is obvious that if the properties hold for a stronger perturbation, then they also hold for proper ones. In this way, the set Y was obtained. We computed the actual perturbation value s_o of Y; by the respective formula, we computed the maximal theoretically possible perturbation s_p in the projected space and applied this perturbation to X′ to obtain the set Y′. We expected that under this perturbation of X′ the result of (optimal) clustering should be the same.
The results are presented in Table 6. It is clearly visible that the probability of not violating multiplicative perturbation robustness constraints is even higher than the respective theorem predicts.

Impact of projection onto σ-separatedness
We performed the experiment in the same way as described in Sect. 4.3, computing additionally the σ-separation both in X and in X′, and checked whether they fit the restrictions of the respective theorem.
The results are presented in Table 7.

Impact of projection onto β-centric stability
We performed the experiment in the same way as described in Sect. 4.3, computing additionally the β-centric Stability both in X and in X′, and checked whether they fit the restrictions of the respective theorem.
The results are presented in Table 8.

Impact of projection onto (1 + β)-weak deletion stability
We performed the experiment in the same way as described in Sect. 4.3, computing additionally the (1 + β)-Weak Deletion Stability both in X and in X′, and checked whether they fit the restrictions of the respective theorem.
The results are presented in Table 9. Note that we did not experiment with (c, σ)-Approximation-Stability because it requires not only a substantially larger set of experiments (not only the optimal clusterings would need to be investigated, but also all the others), but also because generating samples (for the k-means algorithm) exhibiting (c, σ)-Approximation-Stability is rather tedious, since in typical large samples this property is violated (e.g. local minima exist with close cost function values and radically different structures).

Previous work
As already mentioned in the introduction, there exists a vast number of research papers that explored the consequences of the Johnson-Lindenstrauss Lemma in various directions.
We shall focus here on those aspects that are relevant to the research results presented in this paper. The original formulation of the JL Lemma was rather existential, without the primary goal of applying it to machine learning tasks. Applied research therefore concentrated on the actual computational burden related to the use of random projection. First of all, attempts were made to reduce the dimensionality of the projected space as much as possible.
The denominator of the expression for n′ in the original JL Lemma was proportional to δ² [25,28,31]. Later papers [1,20] use the denominator δ²/2 − δ³/3, so that a slight decrease in the number of dimensions is achieved. We suggest the denominator −ln(1 + δ) + δ, which reduces the required dimensionality slightly more.
Larsen and Nelson [29] concentrate on finding the largest value of n′ for which the Johnson-Lindenstrauss Lemma does not hold, demonstrating that the value they found is tight even for nonlinear mappings f. Though not directly related to our research, they discuss the flip side of the problem, that is, the dimensionality below which at least one point of the data set has to violate the constraints.
Kane and Nelson [26] and Cohen et al. [19] pursue research on the Sparse Johnson-Lindenstrauss Transform (SJLT). The SJLT deals with the problem that the original JL transform densifies vectors coming from sparse spaces. Clarkson et al. [18] also propose a low-dimensional embedding algorithm for sparse matrices whose computational complexity is low in the number of nonzero entries of the data matrix. This is an interesting research direction for sparse matrices, because traditional random projection usually densifies the projected vectors, losing the efficiency gained by dimensionality reduction. We do not pursue this problem, though densification may be an issue for versions of clustering algorithms that explicitly address sparse spaces. k-means in its original version in fact densifies the cluster centre vectors, so k-means itself would require some changes.
Note that if we set ε close to 1 and expand the ln function in the denominator of inequality (4) by its Taylor series up to three terms, then we get the value of n′ from equation (2.1) of [20]. Note, however, that setting ε to a value close to 1 does not make sense, as we want the event that the data do not fit the imposed interval to remain rare.
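The Taylor-expansion observation is easy to verify numerically. The following sketch (ours) compares the proposed denominator −ln(1 + δ) + δ with the Dasgupta–Gupta-style denominator δ²/2 − δ³/3, assuming this is the three-term expansion meant above; the two differ by roughly the next Taylor term, δ⁴/4:

```python
import math

def our_denominator(delta):
    # denominator -ln(1 + delta) + delta proposed in the paper
    return -math.log(1.0 + delta) + delta

def dg_denominator(delta):
    # Dasgupta-Gupta-style denominator delta^2/2 - delta^3/3
    return delta ** 2 / 2 - delta ** 3 / 3

for delta in (0.01, 0.05, 0.1, 0.2):
    gap = our_denominator(delta) - dg_denominator(delta)
    # the gap is positive and of order delta^4, as the Taylor series predicts
    assert 0 < gap < delta ** 4
```

A larger denominator means a smaller required n′, which is why the exact expression is slightly more economical than its truncated expansion.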
Though one may be tempted to view our results as formally similar to those of Dasgupta and Gupta, there is one major difference. Let us first recall that the original proof of Johnson and Lindenstrauss [25] is probabilistic, showing that projecting the m-point subset onto a random subspace of O(ln m/δ²) dimensions changes the (squared) distances between points only by a factor within [1 − δ, 1 + δ], with positive probability. Dasgupta and Gupta showed that this probability is at least 1/m, which is not much indeed. That is, if we pick with their method one random projection, we may fail to obtain a projection with the required properties with probability 1 − 1/m. For m = 1000, we will fail with probability 99.9%. In order to get the failure probability below, say, 0.05, one needs to proceed as follows: repeat the process of picking the random projection r times, r to be specified below. Then, among the resulting r projections, with probability P_s(r) > 1 − ε, at least one will have the required properties. But we do not know which of the r projections. So for each projection, we need to check whether or not the distances under projection have the desired properties. In our method, we need to pick only one projection, and the check is not needed. Let us turn to the estimation of r and of P_s(r). Note that each choice of a random projection is independent of the others. Therefore, the probability P_f(r) of failing to pick a projection with the desired properties in each of the r trials amounts to (1 − 1/m)^r, as 1 − 1/m is the failure probability in a single experiment. (Note that by definition P_s(r) = 1 − P_f(r).) If we want to ensure that P_f(r) < ε, we need r > ln(ε)/ln(1 − 1/m). In the case of m = 1000 and ε = 0.05, this means over r = 2995 repetitions, and with m = 1,000,000 over r = 2,995,000 repetitions. In this paper, we have shown that the success probability can be increased to 1 − ε for an ε given in advance.
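The estimate of r above can be reproduced directly; a minimal sketch (the function name is ours):

```python
import math

def repetitions_needed(m, eps):
    """Number r of independent random projections needed so that, with
    probability at least 1 - eps, at least one has the desired distortion
    properties, given a per-trial success probability of 1/m (the
    Dasgupta-Gupta bound discussed above)."""
    # failure after r trials: (1 - 1/m)^r < eps  =>  r > ln(eps)/ln(1 - 1/m)
    return math.ceil(math.log(eps) / math.log(1.0 - 1.0 / m))

# the figures quoted in the text
assert repetitions_needed(1000, 0.05) == 2995
assert repetitions_needed(1_000_000, 0.05) > 2_995_000
```

Each of these r projections must additionally be verified against all point pairs, which is what makes the repeated-sampling approach expensive in practice.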
Hereby, the increase in target dimensionality is small enough compared to the Dasgupta and Gupta formula that our random projection method is orders of magnitude more efficient. A detailed comparison is contained in the last three columns of Tables 1, 2, 3, 4. In these tables, we compare n′ computed using our formulas with those proposed by Dasgupta and Gupta, and we present the required number of repetitions of projection onto sampled subspaces needed to obtain faithful distance discrepancies with reasonable probability. Dasgupta and Gupta generally obtain a several times lower number of dimensions. However, as stated in the introduction, the number of repeated samplings nullifies this advantage, and in fact a much higher burden is to be expected when clustering.
Note that the choice of n′ has been estimated in [1] as n′ ≥ (4 + 2γ) ln m/(δ²/2 − δ³/3), where γ is some positive number. They propose a projection based on two or three discrete values randomly assigned instead of values drawn from a normal distribution. With the quantity γ, they control the probability that a single element of the set Q leaves the predefined interval ±δ. However, they do not control the probability that none of the elements leaves the interval of interest. Rather, they derive expected values of various moments.
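A projection matrix of the discrete-valued kind proposed in [1] might be sketched as follows. This is our illustration, not the authors' code; the {+1, 0, −1} distribution with probabilities 1/6, 2/3, 1/6 and the √3 scaling correspond to Achlioptas's database-friendly variant:

```python
import random

def achlioptas_matrix(n, n_prime, rng):
    """n_prime x n random projection matrix with discrete entries:
    sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}, scaled by
    1/sqrt(n_prime) so that squared lengths are preserved in expectation.
    A sketch in the spirit of [1]; names are ours."""
    s3 = 3 ** 0.5
    scale = 1.0 / n_prime ** 0.5
    return [[s3 * scale * rng.choice((1, 0, 0, 0, 0, -1))
             for _ in range(n)] for _ in range(n_prime)]

def project(matrix, x):
    # ordinary matrix-vector product
    return [sum(r * xi for r, xi in zip(row, x)) for row in matrix]
```

Since each entry has variance 1/n′ after scaling, E‖Rx‖² = ‖x‖², so the squared length of the projection is correct on average; the tail bounds discussed above quantify how tightly it concentrates.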
- His lower bound on n′ is higher than ours, because −ln(1 + δ) + δ > δ²/2 − δ³/3.
- As he does not investigate the proof of his formulation, he also fails to find the still lower bound on n′ that we found by investigating the implicit formula for n′ in Sect. A.2. We provided possibilities to reduce n′ via our implicit formulation of the conditions on n′ and by proving that the implicit function is invertible. As Table 1 shows, for example, the dimensionality reduction may be up to 15%. We are unaware of this being observed by anyone else.
- From a practical point of view, Bandeira's formulation is misleading as to the nature of the increase in n′ with increasing sample size. The formula above would superficially suggest that n′ grows linearly with the logarithm of the sample size m, while our formulation clearly shows that this growth is slower when keeping the failure probability ε fixed. See, for example, the results in Table 1. For a practical illustration, consider the case of m = 100, δ = 0.1, ε = 0.05 and accordingly τ = 0.65. Bandeira's n′ would amount to 5230. Our explicit n′ would be 5205, which is pretty close (ours is lower by 0.5%), while at the same time their failure probability m^(−τ) > 0.0501 is slightly bigger. If, however, we increase m to say 10,000, while keeping the respective parameters of Bandeira's relation, he gets n′ = 10,460, while we require only n′ = 9134 (12% less); and for m = 1,000,000 he needs 15,691 dimensions and we only 13,061 (16% less). In this sense, our formulation is more user-friendly.
- His parameter τ has no practical meaning from the point of view of a user, while our ε has clear-cut semantics.
- He does not draw conclusions with respect to the clustering task and clusterability, as done in this paper in our theorems, especially Theorem 3.6, which relies on the fact that ε is explicitly mentioned in the formula for n′. It would be really hard to explain to a user of Bandeira's formula what it would mean to have to double our ε.
The publications [11,12] by Baraniuk et al. present a theorem similar in spirit to [10], in a somewhat different area of transformations with the so-called RIP (restricted isometry property). Our above remarks apply to those publications as well.
Our research was oriented towards the applicability of the Johnson-Lindenstrauss Lemma to the task of clustering. This application area has been explored already by Schulman [36]. He optimized the intracluster sum of weights in graph clustering and used the JL Lemma to reduce the computational time. Recently, a number of papers pursued the research path of applying the JL Lemma to improve the clustering process for graphs [27]: combinations of compressive sampling with spectral clustering [33,34,38,41], spectral compressive principal component analysis [22] and similar approaches. Our research differs from these developments. Graph clustering exploits the fact that data items (graph nodes) are interdependent. Therefore, it is possible to exploit these dependencies to reduce the number of attributes (they are dependent as they are the nodes of the graph) and at the same time the sample size, and hence automatically reduce the dimensionality further (as it depends on the sample size). We considered here only the case of independent data items (objects) and therefore could not benefit from their results. Note also that typical graph-based clustering methods ignore long distances between nodes, and their usage of the JL Lemma rather preserves distances between eigenvectors of Laplacians of such graphs.
While we insisted on avoiding multiple repetitions of projections, Cannings and Samworth [16] explicitly use multiple projections for purposes of classification.
See also Fedoruk et al. [21] for an overview of theoretical and practical bounds for dimensionality reduction via JL Lemma.

Conclusions
In this paper, we investigated a novel aspect of the well-known and widely applied Johnson-Lindenstrauss Lemma: the possibility of dimensionality reduction by projection onto a random subspace.
In this paper, we made three main claims:
- the JL Lemma can be enhanced in such a way that, in the process of dimensionality reduction, with user-controlled probability, all the projected points keep within error bounds;
- the proposed framework can identify, by random projection, a suitable subspace that preserves the cluster structure of the higher-dimensional space in the embedding with some controllable error;
- in the proposed framework, we derived deterioration degrees of a number of clusterability properties under the projection.
With respect to the first claim, the original formulation of the JL Lemma means in practice that we have to check whether or not we have found a proper transformation f leading to error bounds within the required range for all pairs of points, and if necessary (and theoretically it is necessary very frequently), to repeat the random projection process over and over again.
We have shown here that it is possible to determine in advance the choice of dimensionality in the random projection process so as to assure with the desired certainty that none of the points of the data set violates the restrictions on error bounds. This new formulation, expressed in Theorem 2.1, proven in Sect. A.1 and empirically validated in Sect. 4.1, can be of importance for many data mining applications, such as clustering, where the distortion of distances influences the results in a subtle way (e.g. k-means clustering). Via some numerical examples, we have pointed at the real application areas of this kind of projection, that is, problems with a high number of dimensions, starting with tens of thousands and hundreds of thousands of dimensions. The superiority of our approach to that of Dasgupta and Gupta was also demonstrated by pointing at the computational burden resulting from the need to repeat their projections multiple times.
As far as the second claim is concerned, we have broadened the analysis of JL Lemma-based random projection for k-means algorithms in that we identified the conditions under which clusterings yielding local minima of the k-means objective coincide in the original and the projected spaces, as well as conditions under which the values of the global optima of this objective for the original and projected spaces are close to one another. This has been expressed in Theorems 2.3-2.6, proven in Appendix B. An empirical investigation was performed in Sect. 4.2.
Additionally, as stated in the third claim, we have formulated Theorems 3.1-3.8 and proved them in Sect. 3, showing when our reformulation of the JL Lemma permits upholding five well-known clusterability properties under the projection. An empirical investigation of four of them was performed in Sects. 4.3-4.6.
The scope of this investigation was restricted in a number of ways. Hence, there exist numerous possibilities to extend the research. First of all, this research and papers of others (e.g. [41,42]) indicate that the JL Lemma-induced dimensionality reduction is too conservative. We have seen this, for example, in the comparison of explicit and implicit dimensionality reduction differences. Various empirical studies suggest that the dimensionality could be radically reduced, though no analytical results are yet available.
We restricted ourselves to studying the impact of the JL Lemma on the k-means clustering algorithm cost function. It may be of interest to study particular brands of k-means algorithms in terms not of the theoretical optimum, but rather of the practically achievable ones (an "in the limit" behaviour study). An extension to other families of algorithms, based on other principles, may turn out to be interesting.
Furthermore, we insisted on keeping bounds on distance distortions for all pairs of points. From the perspective of clustering algorithms, it may not be so critical if the distance distortion bounds are violated by a sufficiently small number of points. This may be an interesting study direction for the JL Lemma itself, and also for the study of clustering algorithms based on subsampling rather than on the whole data set. One may suspect, for example, that the k-means algorithm will stabilize under increasing sample size. But the sample size increase limits the possibilities of dimensionality reduction. Hence, subsampling may be an interesting research direction for generalizations of the JL Lemma.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

A.1 Derivation of the set-friendly Johnson-Lindenstrauss lemma (Theorem 2.1)
Let us recall the process of seeking the mapping f from Theorem 1.1, as proposed by Dasgupta and Gupta [20]. We then switch to our target of selecting the size of the subspace guaranteeing that the projected distances maintain their proportionality within the required range.
Let us first consider a single vector x = (x_1, x_2, …, x_n) of n independent random variables drawn from the normal distribution N(0, 1) with mean 0 and variance 1. Let x′ = (x_1, x_2, …, x_{n′}), where n′ < n, be its projection onto the first n′ coordinates. Dasgupta and Gupta [20] in their Lemma 2.2 demonstrated that for a positive β:
- if β < 1, then Pr(‖x′‖² ≤ β (n′/n) ‖x‖²) ≤ β^{n′/2} (1 + n′(1 − β)/(n − n′))^{(n − n′)/2};
- if β > 1, then Pr(‖x′‖² ≥ β (n′/n) ‖x‖²) ≤ β^{n′/2} (1 + n′(1 − β)/(n − n′))^{(n − n′)/2}.
As probabilities are non-negative numbers, and we seek only possible projections, the latter bound makes sense only if 1 + n′(1 − β)/(n − n′) ≥ 0, that is, n′ ≤ n/β. But what happens if this condition is violated? Rewrite Pr(‖x′‖² ≥ β (n′/n) ‖x‖²) as Pr(x_1² + ⋯ + x_{n′}² ≥ β (n′/n)(x_1² + ⋯ + x_n²)) = Pr((1 − β n′/n)(x_1² + ⋯ + x_{n′}²) ≥ β (n′/n)(x_{n′+1}² + ⋯ + x_n²)). If n′ > n/β, then 1 − β n′/n < 0, which means that the mentioned probability is equal to zero, as is also the case when n′ = n/β. Therefore, the Dasgupta/Gupta Lemma is formally correct if we assume subsequently that there exists some n_TRUE being the true dimensionality and we take n = max(n_TRUE, β n′ + η) for some η > 0. Now, imagine we want to keep the error of the squared length of x upon projection bounded within a range of ±δ (relative error), where δ ∈ (0, 1). Hence, 1 − δ ≤ β ≤ 1 + δ, and the above bounds yield the probability of leaving this range. This probability bound is decreasing if its logarithm is decreasing too, so we investigate the logarithm of the right-hand side for β = 1 + δ, compute its derivative, and in the same way obtain the derivative of the logarithm for β = 1 − δ. Convergence of both resulting series expansions is granted (because the summands decrease in absolute value quicker than in a power series) if n′/(n − n′) ≤ 1/2 (or equivalently 3n′ ≤ n), which is a reasonable value for n′ if we bother about random projection at all.
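The quantity analysed here, the ratio ‖x′‖²/‖x‖² for a Gaussian vector projected onto its first n′ coordinates, can be simulated directly. A minimal sketch (ours) showing the concentration around n′/n that the bounds above quantify:

```python
import random

def squared_ratio(n, n_prime, rng):
    """||x'||^2 / ||x||^2 for x with n i.i.d. N(0,1) coordinates and
    x' its projection onto the first n_prime coordinates."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    full = sum(v * v for v in x)
    head = sum(v * v for v in x[:n_prime])
    return head / full

rng = random.Random(0)
n, n_prime, trials = 1000, 250, 400
mean_ratio = sum(squared_ratio(n, n_prime, rng) for _ in range(trials)) / trials
# the ratio concentrates around n'/n = 0.25
assert abs(mean_ratio - n_prime / n) < 0.02
```

Rescaling the projection by n/n′ then gives an (approximately) length-preserving map, and the lemma's tail bounds control how often the relative error leaves ±δ.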

B Proofs of Theorems 2.3-2.6
The proofs of Theorems 2.3-2.6 require several intermediate lemmas, establishing stepwise partial results of these theorems.

B.1 Proof of Theorem 2.3
The proof of Theorem 2.3 on the cost function under projection first requires the demonstration of Lemma B.1, showing that our Theorem 2.1 establishes limitations not only on distortions of point-to-point distances but also of point-to-its-own-cluster-centre distances.
Lemma B.1 Let δ ∈ (0, 1/2), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m representatives of elements of Q in an n-dimensional orthogonal coordinate system C_n and let inequality (4) or (7) hold. Let C_{n′} be a randomly selected (via sampling from a normal distribution) n′-dimensional orthogonal coordinate system. For each x_i ∈ Q, let x′_i ∈ Q′ be its projection onto C_{n′}. Let C be a partition of Q. Then, for all data points, (1 − δ)(n′/n)‖x_i − μ(C(i))‖² ≤ ‖x′_i − μ′(C(i))‖² ≤ (1 + δ)(n′/n)‖x_i − μ(C(i))‖² holds with probability at least 1 − ε.
Proof As we know, data points under k-means are assigned to the cluster with the closest cluster centre. On the other hand, the cluster centre μ is the average of all the data point representatives in the cluster. Hence, the cluster element i has the squared distance to its cluster centre μ(C(i)) amounting to ‖x_i − μ(C(i))‖² = ‖(1/|C(i)|) Σ_{j∈C(i)} (x_i − x_j)‖², so the bounds on pairwise distance distortions carry over to the distances to cluster centres. Note that here μ′(C(i)) is, by intention, not the projective image of μ(C(i)), but rather the centre of the projected images of the cluster elements. However, due to the linear nature of the projection, the projective image of μ(C(i)) and the centre of the projected images of the cluster elements coincide. Based on the defining equations (2) and (3), we get formula (8).
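The passage from pairwise distances to distances to cluster centres rests on the standard identity relating the within-cluster scatter to the pairwise squared distances. A quick numeric check (ours):

```python
def centroid(points):
    m = len(points)
    return tuple(sum(p[d] for p in points) / m for d in range(len(points[0])))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

points = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0), (4.0, 1.0)]
mu = centroid(points)
within = sum(sq_dist(p, mu) for p in points)
pairwise = sum(sq_dist(p, q) for p in points for q in points)
# sum_i ||x_i - mu||^2 == (1/(2|C|)) sum_i sum_j ||x_i - x_j||^2
assert abs(within - pairwise / (2 * len(points))) < 1e-9
```

Since the right-hand side involves only point-to-point distances, any uniform (1 ± δ) distortion of those distances induces the same relative distortion of the within-cluster scatter.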

Lemma B.2
Under the assumptions of the preceding lemmas, for any two clusters C_1, C_2, a separation bound involving the non-negative quantity p, defined by relation (38), which depends on the degree of balance between cluster distance and cluster shape, holds with probability at least 1 − ε. Now, let us consider the choice of δ in such a way that, with high probability, no data point will be classified into some other cluster. We claim the following: the decision on the permitted error depends on the size of the gap between clusters that we hope to observe. Lemma B.3 allows us to prove Theorem 2.4 in a straightforward manner.
Proof (Theorem 2.4) Observe that in this theorem we impose the condition of this lemma on each cluster. Then all projected points are closer to their own cluster centres than to any other centre, so the k-means algorithm would get stuck at this clustering, and hence we are at a local minimum.
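The fixed-point argument can be checked mechanically: a clustering is a stopping point of Lloyd's iteration exactly when every point is nearest to its own cluster's centre. A sketch (the helper name is ours):

```python
def is_lloyd_fixed_point(points, labels, k):
    """True iff one Lloyd iteration (recompute centres, assign each point
    to its nearest centre) leaves the given clustering unchanged."""
    dim = len(points[0])
    centres = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        centres.append(tuple(sum(p[d] for p in members) / len(members)
                             for d in range(dim)))
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return all(min(range(k), key=lambda c: sq(p, centres[c])) == l
               for p, l in zip(points, labels))
```

For instance, two well-separated pairs of points form a fixed point under the natural labelling, while a labelling that mixes the pairs is not one.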

B.3 Proofs of Theorems 2.5 and 2.6
In order to prove Theorem 2.5 on mapping local minima from the projected space back to the original space, we first need a proof of Lemma B.5 on no cluster change when stepping back from the projected space to the original space (keeping the locally optimal assignment to clusters).
Having these results, we complete the section with the proof of Theorem 2.6. Note that we can give another characterization of the situation of no cluster change, related not to cluster centres but rather to point-to-point distances. Now, let us consider the choice of δ in such a way that, with high probability, no data point will be classified into some other cluster. We claim the following: Lemma B.4 Consider two clusters C_1, C_2. Let δ ∈ (0, 1/2), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C_n and let inequality (4) or (7) hold. Let C_{n′} be a randomly selected (via sampling from a normal distribution) n′-dimensional orthogonal coordinate system. For each x_i ∈ Q, let x′_i be its projection onto C_{n′}. For two clusters C_1, C_2, obtained via k-means in the original space, let μ_1, μ_2 be their centres and μ′_1, μ′_2 be the centres of the corresponding sets of projected cluster members.
Then, with probability at least 1 − ε, all projected points of the first cluster are (each) closer to the centre of the set of projected points of the first cluster than to the centre of the set of projected points of the second, provided the condition derived in the proof below holds.
Proof For the change in clusters to be prevented, for any point j ∈ C_1, the combined sum of squared distances to cluster centres in the two clusters C_1, C_2 should be lower than or equal to such a sum for the clusters resulting from the switch of element j from C_1 to C_2, that is, for the clusters C_1 − {j}, C_2 ∪ {j}. By replacing the sums of squared distances to centres by squared distances between data points, rearranging terms, considering the centric sums of squares for the smaller clusters (C_2, C_1 − {j}), simplifying, and subtracting Σ_{i∈C_2} ‖x′_i − μ′(C_2)‖² + Σ_{i∈C_1−{j}} ‖x′_i − μ′(C_1 − {j})‖² from both sides, we arrive at a sufficient condition, which holds because of the relationship between the original and projected space distances from Theorems 2.1 and 2.3. We obtain the claim of the Lemma by dividing both sides by (1 + δ) n′/n. Note that the resulting relation can be interpreted as stating that the "mean" squared distance to elements of the own cluster, increased by the mean squared distance to the cluster centre in the opposite cluster, must be lower than the "mean" squared distance to elements of the other cluster, increased by the mean squared distance to the cluster centre within the rest of the own cluster, reduced by the factor (1 − δ)/(1 + δ). Note also that, when ignoring the (1 − δ)/(1 + δ) fraction, the condition is normally satisfied in the original space if the clustering is at a local minimum. Hence, this factor is actually the only additional requirement for the clustering to be kept under projection. Lemma B.5 Let δ ∈ (0, 1/2), ε ∈ (0, 1). Let Q ⊂ R^n be a set of m points in an n-dimensional orthogonal coordinate system C_n and let inequality (4) or (7) hold. Let C_{n′} be a randomly selected (via sampling from a normal distribution) n′-dimensional orthogonal coordinate system. For each x_i ∈ Q, let x′_i be its projection onto C_{n′}.
For any two k-means clusters C_1, C_2 in the projected space, let μ′_1, μ′_2 be their centres in the projected space and μ_1, μ_2 be the centres of the corresponding sets of cluster members in the original space. Furthermore, let d be the distance of the first cluster centre to the common border of both clusters in the projected space, and let the closest point of the first cluster to this border in that space be at a distance of αd from its cluster centre, where α ∈ [0, 1).
Then, all points of the first cluster in the original space are (each) closer to the centre of the set of points of the first than to the centre of the set of points of the second cluster in the