Optimal Fast Johnson-Lindenstrauss Embeddings for Large Data Sets

Johnson-Lindenstrauss embeddings are widely used to reduce the dimension and thus the processing time of data. To reduce the total complexity, fast algorithms for applying these embeddings are also necessary. To date, such fast algorithms are only available either for a non-optimal embedding dimension or up to a certain threshold on the number of data points. We address a variant of this problem where one aims to simultaneously embed larger subsets of the data set. Our method follows an approach by Nelson: A subsampled Hadamard transform maps points into a space of lower, but not optimal dimension. Subsequently, a random matrix with independent entries projects to an optimal embedding dimension. For subsets whose size scales at least polynomially in the ambient dimension, the complexity of this method comes close to the number of operations needed just to read the data, under mild assumptions on the size of the data set that are considerably less restrictive than in previous works. We also prove a lower bound showing that subsampled Hadamard matrices alone cannot reach an optimal embedding dimension. Hence, the second embedding cannot be omitted.


Johnson-Lindenstrauss Embeddings and Applications
Dimension reduction has played an increasingly significant role in data science in recent years due to the increasing size and dimensionality of available data. So-called Johnson-Lindenstrauss embeddings (JLEs) are of central importance in this context. Such maps reduce the dimension of an arbitrary finite set of points while preserving the pairwise distances up to a small deviation. JLEs were first introduced in [1] in the context of Banach spaces and continue to be intensively studied in both mathematics and computer science. For reasons of practicability, most works focus on linear maps based on random constructions, aiming to be independent of the data to be considered. In this spirit, we work with the following definition of a Johnson-Lindenstrauss embedding (E can be thought of as the set of pairwise distances).

Definition 1.1 (Johnson-Lindenstrauss Embedding). Let A ∈ R^{m×N} be a random matrix where m < N, ε, η ∈ (0, 1) and p ∈ N. We say that A is a (p, ε, η)-JLE (Johnson-Lindenstrauss embedding) if for any subset E ⊆ R^N with |E| = p,

| ‖Ax‖_2^2 − ‖x‖_2^2 | ≤ ε ‖x‖_2^2

holds simultaneously for all x ∈ E with a probability of at least 1 − η.
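As a quick numerical illustration of Definition 1.1 (not taken from the paper itself), the following Python sketch embeds a finite point set with a dense random ±1 matrix scaled by 1/√m, of the kind discussed next, and reports the largest relative distortion of the squared norms; all sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, p = 1024, 400, 50                          # illustrative sizes

E = rng.standard_normal((N, p))                  # columns are the points to be embedded
A = rng.choice([-1.0, 1.0], size=(m, N)) / np.sqrt(m)    # dense random-sign JLE

orig = np.sum(E ** 2, axis=0)                    # squared norms before embedding
emb = np.sum((A @ E) ** 2, axis=0)               # squared norms after embedding
print("max relative distortion:", np.max(np.abs(emb - orig) / orig))
```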
The original work of Johnson and Lindenstrauss constructed (p, ε, η)-JLEs with an embedding dimension of m ≥ C_η ε^{-2} log(p), which has recently been shown to be order optimal under minimal assumptions [2]. Various subsequent works developed simplified approaches for constructing JLEs. Notably, Achlioptas [3] considered a matrix A ∈ R^{m×N} with independent entries A_{jk} ∈ {±1} satisfying P(A_{jk} = 1) = P(A_{jk} = −1) = 1/2. In this case the normalized matrix (1/√m) A is a (p, ε, η)-JLE in the above setting, again for order optimal embedding dimension. Dasgupta and Gupta [4] showed the same for a matrix with independent normally distributed (Gaussian) entries.

In many applications of Johnson-Lindenstrauss embeddings, even such simplified constructions are of limited use due to the trade-off between the complexity benefit for the original problem resulting from the dimension reduction and the additional computations required to apply the JLE. Ailon and Chazelle [5] addressed this issue in connection with the approximate nearest neighbor search problem (ANN), the problem of finding a point x̂ in a given finite set E ⊆ R^N such that ‖x − x̂‖ ≤ (1 + ε) min_{v∈E} ‖x − v‖ for a given x ∈ R^N. Their algorithm uses a preprocessing step that transforms all p points in E using a JLE. This step is known to have a high computational complexity, but as it can be performed offline, it is not considered to be the main computational bottleneck. Rather, the focus of the analysis has been on the subsequent query steps, in which the JLE is applied to new inputs x, reducing the dimension for the subsequent computations. For this purpose, the authors design the so-called Fast Johnson-Lindenstrauss Transform (FJLT), whose application to x requires a particularly low computation time. It should be noted, however, that if only one or very few query steps are to be performed, the problem is likely to be feasible even without the FJLT; a challenging scenario of interest will be to embed a larger number of points. In the query step, these are typically significantly fewer than in the full data set; at the same time, for the preprocessing step, also a fast embedding of the entire set may be of interest.

In other applications, what would correspond to the preprocessing step in [5] is the central task of interest. That is, the data set to be embedded is given in full and the goal is to efficiently compute the dimension reduction of the whole set at once. For example, such a setup appears in various approaches for nonlinear dimension reduction such as Isomap [6], Locally Linear Embedding [7] or Laplacian eigenmaps [8]. These methods exploit that a high-dimensional data set lying near a low-dimensional manifold can be locally approximated by linear spaces. Hence it is of key importance to identify the data points close to a point of interest in the data set, as the linear approximation will be valid only for those. Consequently, one only works with points in the data set, no new points enter at a later stage, and it suffices to apply the JLE simultaneously to the whole data set before searching for multiple approximate nearest neighbors within its projection. A simultaneous fast transformation of the entire data set is also a central goal for various applications in randomized numerical linear algebra.
The following lemma, for example, introduces a randomized approach for approximate matrix multiplication: rather than computing the product of the full matrices A and B, one multiplies the matrices Â = (SA^*)^* and B̂ = SB that have been shrunk by a Johnson-Lindenstrauss embedding S.
The total complexity of this approach consists, on the one hand, of O(qmp) for the reduced product and, on the other hand, of the computational effort to calculate SB and SA^*, i.e., to apply the JLE to the entire data set (the columns of B and the rows of A). If the JLE used does not admit a fast transform and standard matrix multiplication is applied, its computational cost can easily dominate the computation time for the whole approximate product. Namely, assuming w.l.o.g. that q ≤ p, applying S to the data set requires O(mNq + mNp) = O(mNp) operations. If q ≤ N, this becomes the dominant part of the computation; if q ≤ m, it even surpasses the complexity O(qNp) of the original multiplication. With a fast Johnson-Lindenstrauss transform, this changes and the cost of the multiplication in the embedding space will typically become dominant. Namely, if applying the transform requires, say, O(N log N) operations per data point, the total cost for the JLE step is O(pN log N), which will be less than O(mNp) in basically all interesting cases. Thus minimizing the total computational cost boils down to choosing the embedding dimension m as small as possible. Given that the Fast Johnson-Lindenstrauss Transform by Ailon and Chazelle [5] and its extension by Ailon and Liberty [10] yield order-optimal embedding dimensions, they provide near-optimal solutions to these problems whenever the regime is admissible. However, these works only apply to data sets of size p ≤ exp(O_ε(N^{1/3})) or p ≤ exp(O_ε(N^{1/2−δ})) for some arbitrary δ > 0 (which will impact the constants appearing in the embedding dimension), respectively. Here the notation O_ε indicates that the involved constants may depend on ε. Follow-up works without such restrictions (e.g., [11], [12]) require additional logarithmic factors in the embedding dimension, which can directly impact the total computational cost as, e.g., in the second example discussed. To our knowledge, before this work there were no approaches for applying Johnson-Lindenstrauss embeddings with an optimal embedding dimension to data sets of size larger than exp(O_ε(N^{1/2})) which admit a fast transform, neither for an individual query step nor for transforming larger subsets or the full data set simultaneously.
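To make the trade-off concrete, the following numerical sketch carries out the randomized product described above with a dense ±1 sketching matrix S (so without a fast transform); all dimensions are illustrative, and the error is reported relative to ‖A‖_F ‖B‖_F, which is the quantity such sketches typically control.

```python
import numpy as np

rng = np.random.default_rng(1)
q, N, p, m = 60, 2000, 300, 200                  # illustrative sizes with q <= p

A = rng.standard_normal((q, N))
B = rng.standard_normal((N, p))
S = rng.choice([-1.0, 1.0], size=(m, N)) / np.sqrt(m)    # dense JLE, no fast transform

A_hat = (S @ A.T).T                              # = (S A^*)^*, the embedded rows of A
B_hat = S @ B                                    # the embedded columns of B
approx = A_hat @ B_hat                           # reduced product, O(qmp) operations

err = np.linalg.norm(approx - A @ B)
print(err / (np.linalg.norm(A) * np.linalg.norm(B)))     # roughly of order 1/sqrt(m)
```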

Contributions of this Work
The goal of this note is to provide such methods for the case of simultaneous embeddings of larger subsets of the data set or the entire set. To do this, we consider a class of matrices based on a composition of a dense matrix with random signs and a randomly subsampled Hadamard transform. Such embeddings have been considered in the literature before, for example in [13]. Our contribution is the following. Assuming only that p = exp(O_ε(N^{1−δ})) for some δ > 0, we provide a method for simultaneously applying this map to p′ points in time O(p′ N log m), where p′ ≥ N^{d(δ)} and the exponent d(δ) depends only on δ. Up to the logarithmic factor, this complexity is just the number of operations required to read the data. The assumption on p is indeed mild; it admits data sets of size close to p = exp(Θ_ε(N)), for which the identity provides a trivial order optimal embedding, so our result is almost unrestricted for large p. Note that this statement includes a fast transformation of the entire data set E, but it is more general. In particular, with explicit bounds for d(δ) for certain choices of δ, we obtain fast simultaneous embeddings of as few as N points and data sets of admissible size significantly beyond state-of-the-art results. With slight modifications this can also be made to work for embeddings into a space equipped with the ℓ1 instead of the ℓ2 norm. We also provide a lower bound on the possible embedding dimension of subsampled Hadamard transforms without the composition with the dense matrix. This shows that they cannot provide an optimal embedding dimension as desired in this paper.

Outline
We start by reviewing the construction of the Fast Johnson-Lindenstrauss Transform in Section 2.2. Section 2.3 discusses previous work on alternative fast Johnson-Lindenstrauss embeddings with no restriction on p, but an embedding dimension that is suboptimal by logarithmic factors. Section 3 then presents the composed embedding used for the analysis and our main contribution, a method for fast (simultaneous) transformations of large data sets without an essential restriction on p. The proof of our main result regarding the complexity is presented in Section 4. In the subsequent Section 5 we show that this method can be adapted for fast embeddings into spaces with the ℓ 1 norm. In Section 6 we show the lower bound on the embedding dimension for subsampled Hadamard matrices. Then we conclude with a discussion in Section 7.

Notation and Basic Terms
We make use of the standard O notation, i.e., f(x) = O(g(x)) if f(x) ≤ C g(x) for a constant C > 0 and all x > x_0 or all 0 < x < x_0 for a fixed x_0, depending on whether the limit x → ∞ or x → 0 is considered. We write f(x) = Ω(g(x)) if g(x) = O(f(x)), and f(x) = Θ(g(x)) if f(x) = O(g(x)) and f(x) = Ω(g(x)) at the same time. We write O_c for a parameter c if the corresponding x_0 and C can depend on c, and analogously Ω_c and Θ_c. Sometimes we also write f(x) = o(g(x)) if f(x)/g(x) → 0, with the limit depending on the context.

We use the notion of subgaussian random variables X and two equivalent characterizations, a tail bound of the form P(|X| ≥ t) ≤ 2 exp(−t²/K_1²) for all t ≥ 0 and a moment bound (E|X|^q)^{1/q} ≤ K_2 √q for all q ≥ 1; for the proof of equivalence and more characterizations, see [14] for a summary. Every subgaussian random variable satisfies both conditions with K_1 and K_2 that differ from each other by at most a constant factor. We define the corresponding K_1 to be the subgaussian norm ‖X‖_{ψ_2}. We call a vector ξ ∈ {±1}^N a Rademacher vector if it has independent entries and each of them takes the values ±1 with probability 1/2 each.

We use H ∈ R^{N×N} with N = 2^n a power of 2 for the Hadamard transformation on R^N. It is defined as the n-fold Kronecker product of

H_1 := (1/√2) [ 1  1 ; 1  −1 ].

H_1 is the Fourier transform on the group Z_2, thus we can also regard H as the Fourier transform on the group Z_2^n by fixing a bijection between Z_2^n (the same as F_2^n) and [N]. Then for u, v ∈ F_2^n, the corresponding entry of the matrix is given by H_{uv} = (1/√N)(−1)^{⟨u,v⟩}, where ⟨·,·⟩ denotes the inner product in F_2^n. The matrix H is unitary and all its entries have an absolute value of 1/√N. Additionally, there is a fast algorithm (the Walsh-Hadamard transform) which can compute the matrix vector product Hx for any x ∈ R^N in time O(N log N). For details about the algorithm, see Section 7 of [10]. There are also other matrices with complex entries which have these three properties, such as the discrete Fourier transform. The main results of this paper can also be adapted to these matrices. However, we are going to use the Hadamard transform since it ensures that our embedding stays real valued. For counterexamples we are also going to use specific structural properties of the Hadamard matrix. We denote by G_{n,r} the set of all r-dimensional subspaces of the vector space F_2^n. For a subset V ⊆ F_2^n, we denote by 𝟙_V ∈ R^N the indicator vector corresponding to V with the normalization ‖𝟙_V‖_2 = 1.
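Since the fast Walsh-Hadamard transform is used repeatedly below, the following self-contained Python sketch shows one possible O(N log N) implementation (an illustrative implementation, not the one described in [10]) and checks it against the explicit matrix H_{uv} = (1/√N)(−1)^{⟨u,v⟩} for a small N.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform in O(N log N), normalized to be unitary.
    The length of x must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    N, h = len(x), 1
    while h < N:
        for i in range(0, N, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(N)

# sanity check against the explicit Hadamard matrix H_{uv} = (-1)^{<u,v>} / sqrt(N)
N = 8
H = np.array([[(-1) ** bin(u & v).count("1") for v in range(N)] for u in range(N)]) / np.sqrt(N)
x0 = np.random.default_rng(2).standard_normal(N)
assert np.allclose(fwht(x0), H @ x0)
```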
We use the following fact about subspaces: if V ∈ G_{n,r}, then

H 𝟙_V = 𝟙_{V^⊥}    (2.1)

(see for example [15], Lemma II.1).
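The following short Python check illustrates (2.1) in the indexing just described, where j ∈ [N] is identified with its binary expansion in F_2^n; the choice of V as the span of the r lowest bits is an assumed concrete example, for which V^⊥ consists of the multiples of 2^r.

```python
import numpy as np

n, r = 6, 2
N, s = 2 ** n, 2 ** r
# explicit Hadamard matrix H_{uv} = (-1)^{<u,v>} / sqrt(N)
H = np.array([[(-1) ** bin(u & v).count("1") for v in range(N)] for u in range(N)]) / np.sqrt(N)

ind_V = np.zeros(N); ind_V[:s] = 1 / np.sqrt(s)              # normalized indicator of V
ind_V_perp = np.zeros(N); ind_V_perp[::s] = np.sqrt(s / N)   # normalized indicator of V^perp
assert np.allclose(H @ ind_V, ind_V_perp)
```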

Optimal Fast Johnson-Lindenstrauss Embeddings
The Fast Johnson-Lindenstrauss Transform (FJLT) by Ailon and Chazelle [5] consists of a fast Hadamard transform combined with a sparse projection. More precisely, the embedding A = P H D_ξ ∈ R^{m×N} is defined as follows:
• P ∈ R^{m×N} is a sparse projection with independent entries P_{jk} = b_{jk} g_{jk}, where all b_{jk} and g_{jk} are independent, P(b_{jk} = 1) = q = 1 − P(b_{jk} = 0), and the g_{jk} are centered Gaussians whose variance depends on q and is chosen such that norms are preserved in expectation.
• H ∈ R^{N×N} is the Hadamard matrix introduced in Section 2.1.
• D_ξ ∈ R^{N×N} is a diagonal matrix whose diagonal is a Rademacher vector ξ.
Lemma 1 in [5] states that this random matrix is a (p, ε, 2/3)-JLE for m = Θ(ε^{-2} log p). Since the transformation by H can be computed in time O(N log N) (using the Walsh-Hadamard transform described e.g. in [10]) and P is sparse with high probability, it is also shown that the entire transformation needs

O(N log N + ε^{-2} (log p)^3)    (2.2)

operations with a probability of at least 2/3. This is of order O(N log N) provided log p = O_ε(N^{1/3}). Furthermore, the embedding dimension exhibits optimal scaling in N, p and ε [2]. A further improvement of this approach is achieved by Ailon and Liberty in [10], where the bound for the fast transformation is raised from log p = O_ε(N^{1/3}) to log p = O_ε(N^{1/2−δ}) for arbitrary δ > 0.
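As an illustration, the following Python sketch assembles a map with the same P H D_ξ structure; the sparsity level q and the scaling of the Gaussian entries are illustrative assumptions rather than the exact choices of [5], and H is formed explicitly for readability instead of using the fast transform.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, q = 256, 64, 0.15                          # illustrative sizes and sparsity level

H = np.array([[(-1) ** bin(u & v).count("1") for v in range(N)] for u in range(N)]) / np.sqrt(N)
xi = rng.choice([-1.0, 1.0], size=N)             # diagonal of D_xi
mask = rng.random((m, N)) < q                    # sparsity pattern of P
P = np.where(mask, rng.standard_normal((m, N)) / np.sqrt(m * q), 0.0)   # scaled so norms are preserved in expectation

x = rng.standard_normal(N)
y = P @ (H @ (xi * x))                           # apply D_xi, then H, then the sparse P
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # fluctuates around 1
```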

Unrestricted Fast Johnson-Lindenstrauss Embeddings
A different approach for the construction of Johnson-Lindenstrauss matrices was introduced by Ailon and Liberty in [11]. Compared to the aforementioned result, this construction does not have a significant restriction on the number p of points in E for a fast transformation. However, its embedding dimension has an additional polylogarithmic factor in N as well as a suboptimal dependence on ε. The dependence on ε is improved in [12], making the construction optimal up to logarithmic factors in N. Both constructions are based on the restricted isometry property (RIP) of the embedding matrix, that is, approximate norm preservation of vectors with many vanishing entries.
Definition 2.1 ((k, δ)-RIP). A matrix A ∈ R^{m×N} is said to have the (k, δ)-restricted isometry property (RIP) if

(1 − δ) ‖x‖_2^2 ≤ ‖Ax‖_2^2 ≤ (1 + δ) ‖x‖_2^2

for all k-sparse x ∈ R^N, i.e., for all x ∈ R^N that have at most k non-zero entries.
Theorem 2.2, taken from [12], shows that RIP matrices can be turned into Johnson-Lindenstrauss transforms by randomizing their column signs: in rough terms, if A ∈ R^{m×N} has the (k, δ)-RIP for k = Ω(log(p/η)) and δ ≤ ε/4, then AD_ξ with a Rademacher vector ξ is a (p, ε, η)-JLE.
Using an argument based on Gelfand widths, one can show that for sufficiently small δ, a (k, δ)-RIP matrix in R^{m×N} must have m = Ω(k log(N/k)) (see Chapter 10 in [16]). Thus Theorem 2.2 can only yield the Johnson-Lindenstrauss property for an embedding dimension of at least m = Ω(ε^{-2} log(p) log(N/k)), while the optimal one is independent of N. Theorem 2.2 can be applied to the transform RHD_ξ with a Rademacher vector ξ ∈ R^N, a Hadamard matrix H ∈ R^{N×N} and a random projection (subsampling) R ∈ R^{m×N} selecting m entries uniformly with replacement and rescaling by √(N/m). As RH ∈ R^{m×N} is shown in [17] to have the (k, δ)-RIP with high probability for m = Θ(δ^{-2} k (log N)^4), this construction requires an embedding dimension of m = Θ(ε^{-2} log(p/η)(log N)^4), which is optimal up to a factor of (log N)^4. At the same time, the construction admits a fast computation of the matrix vector product even for large values of p: Using the fast Walsh-Hadamard transform, one can always achieve a computation time of O(N log m) per data point. Similar results can also be obtained for subsampled random convolutions instead of subsampled Hadamard transforms ([18], [19]). Improved embedding dimensions can be obtained using new RIP bounds for partial Hadamard matrices (more generally, subsampled unitary matrices with entries bounded by O(1/√N)) that have been shown in [20]. Namely, it is shown that m = Ω(δ^{-2}(log(1/δ))^2 k (log(k/δ))^2 log(N)) implies the (k, δ)-RIP with probability arbitrarily close to 1 for sufficiently large N. Thus, for k ≤ N and a fixed δ, m = Ω(k(log N)^3) is sufficient. So for a fixed ε, this implies that the JLE construction introduced above needs an embedding dimension of m = Θ(log(p)(log N)^3), improving the previous result by a factor of log N. However, for simplicity of presentation due to the simpler dependence on ε, we continue to work with the bound m = Θ(ε^{-2} log(p/η)(log N)^4) and leave the possible refinement to the reader.
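For concreteness, the following Python sketch applies the map RHD_ξ to a single point, with subsampling with replacement and rescaling by √(N/m) as described above; the Hadamard matrix is formed explicitly only for readability, whereas in practice the fast Walsh-Hadamard transform would be used.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 256, 80                                   # illustrative sizes

H = np.array([[(-1) ** bin(u & v).count("1") for v in range(N)] for u in range(N)]) / np.sqrt(N)
xi = rng.choice([-1.0, 1.0], size=N)             # Rademacher vector for D_xi
rows = rng.integers(0, N, size=m)                # m row indices sampled with replacement

x = rng.standard_normal(N)
y = np.sqrt(N / m) * (H @ (xi * x))[rows]        # R H D_xi x
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # close to 1 with high probability
```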

Rectangular Matrix Multiplication
We denote by M_E ∈ R^{N×p} the matrix with the vectors in E as columns. Then simultaneously applying a transform A ∈ R^{m×N} to all vectors in E corresponds to computing the matrix product A M_E. Thus the question of computing the simultaneous embedding closely connects to the topic of fast algorithms for matrix multiplication, which, starting with the seminal work of Strassen [21], developed into a core area of complexity theory (see, for example, Chapter 15 of [22] for an overview). Given that we are interested in dimension reduction, algorithms for the fast multiplication of rectangular matrices will be of particular relevance to this paper. When one of the two matrices is square and the other has many more columns, an asymptotically optimal family of algorithms has been provided by Lotti and Romani [23]. More precisely, their result can be summarized as follows (Proposition 2.3): for every s > 0, the product of an m × m matrix and an m × m^s matrix can be computed in O(m^{f(s)}) operations for a function f with f(s) − s ≥ 1 and f(s) − s → 1 as s → ∞. That is, in the limit, the number of operations will just be what is required to read the two matrices.

Main Result
Unless otherwise noted, in all theorems we assume that ε ∈ (0, 1) and η ∈ (0, 1/2) are arbitrary and that m < N. A helpful and already known observation for creating Johnson-Lindenstrauss embeddings with good embedding dimension and fast multiplication is that the composition of two JLEs is again a JLE. We include a proof for completeness.
Lemma 3.1. Let p ∈ N and ε, η ∈ (0, 1). If B ∈ R^{n×N} and A ∈ R^{m×n} are both (p, ε/3, η/4)-JLEs, then the composition AB ∈ R^{m×N} is a (p, ε, η)-JLE.

Proof. The probability that the norms in BE are distorted by more than a factor of 1 ± ε/3 compared to those in E is at most η/4. For each fixed value of B that satisfies this norm preservation, the probability that a norm in ABE is distorted by more than a factor of 1 ± ε/3 compared to BE is less than η/4. By a union bound, with a probability of at least 1 − η all norms in ABE lie within a factor of 1 ± ε of the corresponding norms in E, since (1 + ε/3)^2 ≤ 1 + ε and (1 − ε/3)^2 ≥ 1 − ε for ε ∈ (0, 1).

This lemma can be used to combine the advantages of fast JLEs with non-optimal embedding dimension (as in Theorem 2.2) and dense random matrices with optimal embedding dimension similar to those of Achlioptas [3]. A set of p vectors in R^N is first mapped into a space of non-optimal, but reduced dimension using a fast Johnson-Lindenstrauss transform, and then a dense matrix maps the vectors from this space of reduced dimension to a space of optimal embedding dimension. Note that this leads to a smaller dense matrix and thus a faster computation compared to a JLE consisting of only a dense matrix. We apply this procedure to the fast transform from [12] and the dense ±1 matrix from [3], obtaining a JLE of optimal embedding dimension. The usage of this construction has already been suggested, for example, in [13], see footnote 2.
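Before the construction is stated precisely in Corollary 3.2 below, the following minimal Python sketch illustrates the two-stage procedure for a single point: a subsampled Hadamard transform with random signs reduces to an intermediate dimension n, and a dense random-sign matrix G then maps to the final dimension m. All dimensions are illustrative choices, not the ones required by the theory.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, m = 512, 128, 48                           # ambient, intermediate and final dimension

H = np.array([[(-1) ** bin(u & v).count("1") for v in range(N)] for u in range(N)]) / np.sqrt(N)
xi = rng.choice([-1.0, 1.0], size=N)             # Rademacher vector for D_xi
rows = rng.integers(0, N, size=n)                # subsampling with replacement
G = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)    # dense +-1 matrix, normalized

x = rng.standard_normal(N)
z = np.sqrt(N / n) * (H @ (xi * x))[rows]        # first stage: R H D_xi x, dimension n
y = G @ z                                        # second stage: G z, dimension m
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # fluctuates around 1
```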
Corollary 3.2. Consider the following matrices.
• R ∈ R^{n×N} is a random projection that selects n of the N entries of a vector uniformly at random with replacement and rescales by √(N/n).
• H ∈ R^{N×N} is a full Hadamard matrix.
• D_ξ ∈ R^{N×N} is a diagonal matrix whose diagonal entries are given by a Rademacher vector ξ.
• G ∈ R^{m×n} is a matrix with independent entries taking the values ±1/√m, each with probability 1/2 (i.e., a matrix as in [3], normalized by 1/√m).
Then, for m = Θ(ε^{-2} log(p/η)) and n = Θ(m (log N)^4), the composed matrix A = GRHD_ξ ∈ R^{m×N} is a (p, ε, η)-JLE.
Note that the resulting construction is very similar to the one introduced in [5]. Namely, similarly to the matrix P in the construction of [5] (cf. Section 2.2 above), the matrix P̃ = GR has many zero entries and independent random entries in all other locations. The main difference is that in this construction the non-zero entries are concentrated in just a few columns, while in [5] their locations are random. As captured by the following main theorem, this fact leads to a significant speed up when simultaneously projecting large enough point sets, as the structure allows for the use of fast matrix multiplication.

Theorem 3.3. Let E ⊆ R^N be a data set of size p = exp(O_ε(N^{1−δ})) for an arbitrary δ > 0 and let η ∈ (p^{−c}, 1/2) be a failure probability for a constant c. Then for embedding dimensions m = Θ(ε^{-2} log(p/η)) and n = Θ(m (log N)^4), the transformation A = GRHD_ξ introduced in Corollary 3.2 is a (p, ε, η)-JLE for the set E. Moreover, there is an exponent d(δ) depending only on δ such that the following holds: for any finite set E′ ⊆ E with |E′| = p′, where p′ = p or p′ ≥ N^{d(δ)}, the embedding of all points in E′ can be computed simultaneously in time O(p′ N log m).

The dependence of the exponent d(δ) on δ is not explicitly stated in this theorem. This is caused by Proposition 2.3 (and the source [23]) not stating a bound on the convergence speed. However, certain estimates for specific values of the function f in Proposition 2.3 have been made before and we use those to get explicit estimates for d(δ) in Section 4.3. This will lead to the subsequent Theorem 3.4 summarizing two particularly interesting cases.

Complexity Analysis - Proof of Theorem 3.3

Simultaneous Transformation of Data Points
The main novelty in our approach is the use of Proposition 2.3 in the last step of the embedding, which is a (smaller) dense matrix G. The proposition will be applied to provide a speedup for embedding dimensions m ≥ N^{1/2−δ/4}. To prove Theorem 3.3, we have to cover the cases p′ ≥ N^{d(δ)} and p′ = p < N^{d(δ)}. Consider the first case now. Recall that by assumption p = exp(O_ε(N^{1−δ})); more precisely, we assume that p ≤ exp(c′ ε² N^{1−δ}) for a constant c′ chosen to ensure that m ≤ N^{1−δ}. To analyze the complexity, we note that computing the product A M_{E′} = GRHD_ξ M_{E′} consists of the following three steps. Recall that M_{E′} ∈ R^{N×p′} is the matrix whose columns are all the vectors in the finite set E′.
• Compute M_1 := D_ξ M_{E′}, i.e., flip the signs of the rows of M_{E′} as prescribed by ξ, which requires O(N p′) operations.
• Compute M_2 := RH M_1 ∈ R^{n×p′} by applying the fast Walsh-Hadamard transform columnwise and subsampling, which requires O(N log N) operations per column and hence O(p′ N log N) in total.
• Compute M_3 := G M_2, where G ∈ R^{m×n} and M_2 ∈ R^{n×p′}. Choose α = log p′ / log m, yielding p′ = m^α. Split G and M_2 up into blocks for r = ⌈n/m⌉ such that G_1, ..., G_r ∈ R^{m×m} and V_j ∈ R^{m×p′} for j ∈ [r], so that G M_2 = Σ_{j=1}^r G_j V_j. If necessary, pad the matrices with zero rows and columns such that the corresponding submatrices have the same size. Since n ≥ m, this only changes the numbers of rows or columns by at most a constant factor.
Each of the r block multiplications required to compute this product is a multiplication of an m × m matrix by an m × m^α matrix and thus requires O(m^{f(α)}) operations with f as in Proposition 2.3. In the end, we need O(p′ m r) operations to sum up the values for all entries of the result. In total, the number of operations for this step is

O(r m^{f(α)} + p′ m r) = O(n m^{f(α)−1} + p′ n) = O(n m^{f(α)−1}),

where in the last equality we used that, by Proposition 2.3, f(α) − α ≥ 1 and hence p′ = m^α ≤ m^{f(α)−1}.
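Before combining the costs of the three steps, the following Python sketch illustrates the block decomposition used in this last step; plain numpy multiplication stands in for the fast rectangular algorithm of Proposition 2.3, so only the splitting G M_2 = Σ_j G_j V_j is shown, not the asymptotic speedup.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, p_prime = 64, 256, 4096                    # illustrative sizes with n a multiple of m

G = rng.standard_normal((m, n))
M2 = rng.standard_normal((n, p_prime))

r = -(-n // m)                                   # r = ceil(n/m); no padding needed here
M3 = np.zeros((m, p_prime))
for j in range(r):
    Gj = G[:, j * m:(j + 1) * m]                 # m x m block of G
    Vj = M2[j * m:(j + 1) * m, :]                # m x p' block of M_2
    M3 += Gj @ Vj                                # would be done by a fast m x m times m x m^alpha algorithm

assert np.allclose(M3, G @ M2)
```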
Combining the runtime complexity of the three steps yields a total complexity of

O(p′ N log N + n m^{f(α)−1}).    (4.1)

Hence, for sufficiently large d(δ), one has that α ≥ β, i.e., f(α) − α − 1 is small enough to ensure n m^{f(α)−1} = O(p′ N). Substituting this into (4.1), using that m = O(N^{1−δ}), yields a complexity of O(p′ N log N) = O(p′ N log m), where the last equality uses that m ≥ N^{1/2−δ/4} in the case under consideration.

Now consider the case that m ≥ N^{1/2−δ/4} but p′ = p < N^{d(δ)}. Using these inequalities together with m = O(ε^{-2} log(p/η)) = O(ε^{-2} log p) (the last equality follows from η > p^{−c}), we obtain that ε^{-2} log p = Ω(N^{1/2−δ/4}). This case can be excluded by imposing the stronger condition that p = exp(O(ε^4 N^{1−δ})), which still is of the form p = exp(O_ε(N^{1−δ})). Namely, the two conditions for ε and p would then imply that N is bounded by a constant depending only on δ, and hence p < N^{d(δ)} is bounded by a constant p_0(δ). By definition of the O notation, however, it is sufficient to show the complexity bound only for p > p_0(δ). This proves the theorem for the case of embedding dimensions m ≥ N^{1/2−δ/4}.

Note that for small data sets of size p ≤ N^{d(δ)} (the second case in the previous consideration), the increased exponent of ε in our condition for p creates a gap between the applicability range of our result and the regime where the identity operator can be applied. Namely, our proof only applies to cases with embedding dimension at most m = O_δ(log(N) · N^{1/2−δ/4}), while the identity cannot be used below m = N. To circumvent this, one can apply the construction of Ailon and Chazelle [5] to avoid this gap (as resulting from (2.2), the complexity is O(pN log N + pε^{-2}(log p)^3), which is O_δ(pN (log N)^2) in this case).

Transformation of Small Data Sets
For m ≤ N^{1/2−δ/4} (the case still missing in the previous section), we will just apply the embedding individually to each point in E. As noted in [13], the map defined in Corollary 3.2 admits a fast transformation of individual points: the products with D_ξ and RH take O(N) and O(N log n) operations per point, respectively, and the multiplication with the dense matrix G ∈ R^{m×n} takes O(mn) operations. Since n = Θ(m (log N)^4) and m ≤ N^{1/2−δ/4}, we have mn = O(m²(log N)^4) = O(N), so each point is embedded in time O(N log n) and the whole data set in time O(p N log n).

Remark 4.2. As noted in the previous argument, applying the embedding pointwise for m = Θ(N^r) requires O(N log N + N^{2r}(log N)^4) operations per point. This is still linear up to logarithmic factors when r = 1/2 and close to linear for r slightly larger than 1/2. Thus, one obtains a fast transform also in these cases, in contrast to [10], whose bound does not apply in either of these cases.

The Exponent d(δ) - Proof of Theorem 3.4
While Theorem 3.3 applies for arbitrary subsets of size p′ bounded below by a polynomial in N, the exponent d(δ) remains unspecified since Proposition 2.3 (i.e., [23]) does not quantify the convergence speed of f(s) − s. For certain values of δ, explicit bounds for d(δ) can be derived from estimates for f(s) available in the literature, such as those derived in [24] and subsequently improved in [25]. More precisely, Theorem 5 in [25], adapted to the setup of Proposition 2.3, states an upper bound (4.4) on f(s) in terms of parameters k, m_2, β, β′ and a function g, where g(x) = x^x for x > 0 and g(0) = 1.
Some specific bounds for f(s) resulting from specific choices for k, m_2, β, β′ in (4.4) are given in Table 1. To determine admissible values of d(δ), we need to verify the condition derived in the previous section for the values given in Table 1. E.g., for δ = 1/4, by Table 1 it is sufficient to have s = d(δ) such that the resulting exponent is 0.99··· < 1, so this choice is admissible. Analogously we obtain that the following values are admissible. To show the second statement of Theorem 3.4, we observe that the bound (4.1) for the complexity holds independently of the upper bound on log p, which can lead to upper bounds on the complexity that are close to linear in N. From the assumption p′ ≥ N^4 ≥ m^4 we deduce that α ≥ 4 and hence, by Table 1, f(α) − α ≤ 1.1915. Thus one obtains a runtime complexity of O(p′ N log N + p′ m^{1.1915}(log N)^4), which is bounded by O(p′ N^{1.2}); this holds for any size p of the entire data set. This completes the proof of Theorem 3.4.
Embeddings from ℓ2 to ℓ1

Unlike the standard case of Johnson-Lindenstrauss embeddings with respect to the Euclidean norm considered in this work until here, it can also be interesting to consider embeddings from ℓ2 to ℓ1. To study these, we work with the following definition, which is analogous to Definition 1.1.

Definition 5.1 (ℓ2 → ℓ1 Johnson-Lindenstrauss Embedding). Let A ∈ R^{m×N} be a random matrix where m < N, ε, η ∈ (0, 1) and p ∈ N. We say that A is an ℓ2 → ℓ1 (p, ε, η)-JLE if for any subset E ⊆ R^N with |E| = p,

| ‖Ax‖_1 − ‖x‖_2 | ≤ ε ‖x‖_2

holds simultaneously for all x ∈ E with a probability of at least 1 − η.
Random matrices have already been used for embeddings into an ℓ1 space by Johnson and Schechtman [26]. Ailon and Chazelle [5] also studied an embedding of this type to give an improved algorithm for the approximate nearest neighbor search problem. Their ℓ1 approach is a version of the FJLT described in Section 2.2 that only uses a different scaling and a different probability parameter q. The resulting complexity for the transformation of a single point implies that a fast transformation of p′ points in time O(p′ N log N) is only possible if log p = O_ε(N^{1/2}). The same bound also applies to the subsequent improvement by Ailon and Liberty [10]. Without this restriction of log p = O_ε(N^{1/2}), the best method available in the literature is given by [27], although this result applies to a significantly more general scenario of embedding the entire space ℓ_2^N into ℓ_1^m. However, that result requires an embedding dimension of m = Ω(N^{1+o(1)}), which is far from optimal for the problem of embedding just finitely many points. To extend our approach to embeddings into ℓ1, we recall that Gaussian matrices also yield ℓ2 → ℓ1 JLEs. This well known fact arises for example as a special case in [28]. We include a proof for completeness.

Proof. Let x ∈ E be fixed. By rotation invariance of subgaussian variables (see Lemma 5.9 in [14]), we can write ‖Ax‖_1 − E‖Ax‖_1 = Σ_{r=1}^m X_r with independent centered subgaussian summands X_r and obtain a subgaussian concentration bound for this sum with a constant C > 0 in the exponent. Using a union bound over all x ∈ E, we get that the total failure probability is smaller than η for m > (1/C) ε^{-2}(1 + log(p/η)).
Using this construction, the results of this paper can be generalized. Lemma 3.1 also holds in the same way for the composition of an ℓ2 → ℓ2 and an ℓ2 → ℓ1 JLE, with slightly different constants. Corollary 3.2 can be adjusted to the ℓ1 norm, too. Note that this only holds if we replace the matrix G in Corollary 3.2 by the matrix given in Theorem 5.2, which has Gaussian entries instead of ±1 entries. Otherwise the equality E‖Ax‖_1 = ‖x‖_2 for the expectation would not hold. Since Theorem 3.3 is not affected by this, it also holds in the ℓ1 case, which enables the fast transformation of the entire data set or a subset of polynomial size. This is the main advantage of this method compared to the results given by Ailon and Chazelle [5] and by Ailon and Liberty [10].
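As a small numerical illustration of the Gaussian ℓ2 → ℓ1 embedding, the following sketch uses the entry scaling √(π/2)/m, which is an assumption made here precisely so that E‖Ax‖_1 = ‖x‖_2; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, m = 512, 200                                  # illustrative sizes

# Gaussian entries scaled such that E |<a_r, x>| = ||x||_2 / m for each row a_r
A = np.sqrt(np.pi / 2) / m * rng.standard_normal((m, N))

x = rng.standard_normal(N)
print(np.linalg.norm(A @ x, 1) / np.linalg.norm(x, 2))   # concentrates around 1
```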
As shown in the previous sections, the construction GRHD ξ from Corollary 3.2 provides JLEs into ℓ 2 and ℓ 1 with state-of-the-art embedding dimensions m and up to a certain number of embedded points, this is also a fast pointwise embedding. For very large data sets, however, the application of the dense matrix G prevents the complexity from staying below the desired threshold. Thus, a natural question is whether this last embedding step is required or whether the matrix construction RHD ξ without the dense G already yields JLEs for comparable joint embedding dimensions m.
In the following two sections, we provide negative answers to this question, both for embeddings into ℓ2 and into ℓ1.

A Lower Bound for Embeddings into ℓ 2
We will prove a lower bound on the possible embedding dimension necessary for RHD ξ to be a JLE into ℓ 2 , showing in particular that it has to depend on the ambient dimension N . Our approach is based on [15] which provides a lower bound on the embedding dimensions for the restricted isometry property of Hadamard matrices. We will adapt this counterexample to the setting of Johnson-Lindenstrauss embeddings.
To this end, we use the properties of the Hadamard transform H ∈ R^{N×N} and its interpretation as the Fourier transform on F_2^n mentioned in Section 2.1. Let now s = 2^r be a power of two. Let Q be the random multiset of row indices that the subsampling matrix R selects. For an r-dimensional subspace V of F_2^n, we consider the indicator variable X_V of the event {Q ∩ V^⊥ = ∅}. The proof in [15] considers the sum X := Σ_{V ∈ G_{n,r}} X_V, where G_{n,r} is the set of all r-dimensional subspaces of F_2^n. If X ≠ 0, then there is a V ∈ G_{n,r} satisfying Q ∩ V^⊥ = ∅. Recall that by (2.1), it holds that H𝟙_V = 𝟙_{V^⊥}. This implies that RH𝟙_V = R𝟙_{V^⊥} = 0, since R only selects the entries indexed by Q, none of which lies in V^⊥.
In contrast to our way of selecting the subsampled entries Q, their result works with a random subset of indices Q̃ := {j ∈ [N] | δ_j = 1} based on independent random selectors δ_j that are 1 with probability m/N and 0 otherwise. Together with an appropriate rescaling, we denote this subsampling as R̃. Note that in contrast to the matrix R used in the previous sections, the number of rows of R̃ might not exactly be m. Bounding P(X = 0) from above leads to the following main result.

Theorem 6.1 (Theorem III.1 in [15]). Let N, s and R̃ be as above. Assume min(r, n − r) ≥ 12 log_2 n. Then there is a positive constant C > 0 such that if m ≤ C s log(s) log(N/s), then with a probability of 1 − o(1) (for N → ∞), there is a subspace V ∈ G_{n,r} which satisfies R̃H𝟙_V = 0.

However, with slight modifications, this result can also be used for the subsampling matrix R instead of R̃. The key idea in [15] used to prove Theorem 6.1 is applying the second moment method (see e.g. [29], Section 21.1) to X, yielding

P(X = 0) ≤ Var(X) / (E X)^2 = ( Σ_{U,V ∈ G_{n,r}} E[X_U X_V] ) / ( |G_{n,r}| E X_{V_0} )^2 − 1,    (6.1)

where the pairs U, V are grouped according to d := dim(U^⊥ ∩ V^⊥), d_0 := max(n − 2r, 0) is the minimum dimension of U^⊥ ∩ V^⊥, and V_0 ∈ G_{n,r} is arbitrary. For the last equality, we used that E X_V is the same for all V. The bounds on the first and second moments can be adapted to subsampling with replacement as used in R.
For any V ∈ G_{n,r}, V^⊥ has N/s elements and thus each subsampled index is not in V^⊥ with probability 1 − 1/s, so that

E X_V = (1 − 1/s)^m.

For any pair of subspaces U, V ∈ G_{n,r} with dim(U^⊥ ∩ V^⊥) = d we obtain

E[X_U X_V] = (1 − 2/s + 2^d/N)^m.

Combining these expressions yields, with s ≥ 2,

E[X_U X_V] / (E X_V)^2 = ( (1 − 2/s + 2^d/N) / (1 − 1/s)^2 )^m ≤ ( 1 + 2^d / (N(1 − 1/s)^2) )^m ≤ exp(4 m 2^d / N).    (6.2)

Substituting (6.2) into (6.1) leads to the same bound as found in [15], with m/N replaced by 4m/N. Thus the proof of [15] that P(X = 0) = o(1) carries over to the subsampling matrix R, as summarized in the following corollary.

Corollary 6.2. Assume s ≥ 2. The statement of Theorem 6.1 also holds if we replace the subsampling with random selectors R̃ by the subsampling with replacement R as used in Corollary 3.2.
Now we can use this result to prove a lower bound for the embedding dimension of these matrices as Johnson-Lindenstrauss embeddings.

Theorem 6.3. For absolute constants C, c > 0 and arbitrary η > 0, there exists N(η) such that the following holds. Let N = 2^n ≥ N(η) be a power of two and R ∈ R^{m×N} the matrix representing independent random sampling with replacement and subsequent rescaling by √(N/m). Let p ≥ p_0 be a sufficiently large integer and assume it satisfies (log_2 N)^c ≤ log_2 p ≤ N/(log_2 N)^c. If m ≤ C(log p)(log log p) log(N/log p), then RHD_ξ is not a (p, ε, η)-JLE for any 0 < ε < 1.
Proof. Choose r ∈ N such that s := 2^r satisfies (1/4) log_2 p ≤ s ≤ (1/2) log_2 p, and let E be the set of all vectors D_ζ 𝟙_V with V ∈ G_{n,r} and sign vectors ζ ∈ {±1}^N. Every vector in E is s-sparse, so there are only 2^s different sign patterns corresponding to each V. In addition, the number of subspaces is |G_{n,r}| ≤ 2^{r(n−r)} (see [15]). Together this gives |E| ≤ 2^s |G_{n,r}| ≤ 2^s 2^{r(n−r)} ≤ 2^{(1/2) log_2 p} 2^{nr} ≤ p, using that (log_2 N)^c ≤ log_2 p by assumption, so that nr ≤ (1/2) log_2 p for c sufficiently large and p ≥ p_0.
To check the assumptions of Corollary 6.2, note that r ≥ log_2 log_2 p − 2 ≥ log_2(n^c) − 2 = c log_2 n − 2 ≥ 12 log_2 n for sufficiently large c and N(η). Furthermore, n − r ≥ n − log_2 log_2 p + 1 ≥ n − log_2(N/(log_2 N)^c) + 1 = n − n + c log_2 n + 1 ≥ 12 log_2 n. By the definition of s, we obtain (1/4) log_2 p ≤ s ≤ (1/2) log_2 p and thus the assumption for m is equivalent to m ≤ C′(log p)(log s) log(N/s) for a suitable constant C′. Together with this upper bound on m and N ≥ N(η), Corollary 6.2 now implies that with a probability of at least 1 − η (with respect to the randomness in R) there exists a subspace V ∈ G_{n,r} such that RH𝟙_V = 0. For any value of the Rademacher vector ξ, we have x := D_ξ 𝟙_V ∈ E and RHD_ξ x = RH𝟙_V = 0. This means that |‖RHD_ξ x‖_2^2 − ‖x‖_2^2| < ε‖x‖_2^2 cannot hold for any ε < 1.
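The obstruction used in this proof is easy to observe numerically for a single fixed subspace. In the sketch below, V is chosen as the span of the r lowest bits (an assumed concrete choice), so that V^⊥ consists of the multiples of s = 2^r; then RH𝟙_V = 0 exactly when none of the m sampled indices lies in V^⊥, which happens with probability (1 − 1/s)^m.

```python
import numpy as np

rng = np.random.default_rng(8)
n, r, m, trials = 10, 4, 40, 20000               # illustrative sizes
N, s = 2 ** n, 2 ** r

rows = rng.integers(0, N, size=(trials, m))      # m indices sampled with replacement, per trial
miss = ~np.any(rows % s == 0, axis=1)            # no sampled index lies in V^perp (multiples of s)
print(miss.mean(), (1 - 1 / s) ** m)             # empirical frequency vs. exact probability
```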
Remark 6.4. Note that the factor (log log p) log(N/log p) can take values of orders of magnitude between log N and (log N)^2. Indeed, for the lower bound observe that log log p ≥ 2 and log N − log log p ≥ 2, so one obtains (log log p) log(N/log p) ≥ (log log p) + log(N/log p) = log N. The upper bound follows as log p = O(N). The maximal order is in fact achieved for log p ≍ N^α where α is an arbitrary constant in (0, 1).

An Impossibility Result for Embeddings into ℓ 1
When considering a subsampled Hadamard matrix as a JLE into ℓ1, we obtain a much stronger impossibility result than the one just stated for ℓ2. In fact, such matrices cannot, in general, lead to a JLE into ℓ1 for any embedding dimension. To see this, again recall that the Hadamard matrix can be interpreted as the Fourier transform on F_2^n. Consider dimensions N = 2^n and s = 2^r < N that are powers of 2. Take a subspace V ⊂ F_2^n of dimension r and x^{(1)} := 𝟙_V. By (2.1), Hx^{(1)} = 𝟙_{V^⊥}. Similarly observe that for x^{(2)} := 𝟙_{{0}}, one has that Hx^{(2)} = 𝟙_{F_2^n}. Now let R ∈ R^{m×N} be a random subsampling matrix in analogy to Corollary 3.2. Noting that Hx^{(1)} has N/s entries of absolute value √(s/N) while Hx^{(2)} has N entries of absolute value 1/√N, we obtain that E‖RHx^{(1)}‖_1 and E‖RHx^{(2)}‖_1 differ by a factor of √s. Consequently, there cannot be a scaling factor γ > 0 such that γ‖RHx^{(1)}‖_1 and γ‖RHx^{(2)}‖_1 are concentrated around 1 at the same time. Also randomized column signs cannot avoid this problem. To see that, consider the set E := {D_ζ x^{(1)} | ζ ∈ {±1}^N} ∪ {±x^{(2)}}, which has 2^s + 2 elements which all have an ℓ2 norm of 1. Then the image γRHD_ξ E will always contain γRHx^{(1)} and γRHx^{(2)}, and by the previous paragraph, the norms of the vectors in γRHD_ξ E cannot all concentrate around 1. Thus γRHD_ξ cannot be a JLE for E, even for an embedding dimension m as large as N. By Theorem 5.2, in contrast, there are ℓ2 → ℓ1 JLEs of size m × N for E already if m ≳ ε^{-2} s, which can be chosen to be much smaller than N. That is, subsampled Hadamard matrices with random column signs are intrinsically suboptimal as ℓ2 → ℓ1 JLEs.
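The mismatch between the two test vectors is also easy to see numerically. In the sketch below, V is again chosen as the span of the r lowest bits (an assumed concrete choice) and H is formed explicitly for readability; the two subsampled images have ℓ1 norms that differ by a factor of roughly √s, so no common scaling γ can make both concentrate around 1.

```python
import numpy as np

rng = np.random.default_rng(9)
n, r, m = 10, 4, 128                             # illustrative sizes
N, s = 2 ** n, 2 ** r

H = np.array([[(-1) ** bin(u & v).count("1") for v in range(N)] for u in range(N)]) / np.sqrt(N)
x1 = np.zeros(N); x1[:s] = 1 / np.sqrt(s)        # normalized indicator of V
x2 = np.zeros(N); x2[0] = 1.0                    # indicator of {0}

rows = rng.integers(0, N, size=m)                # subsampling with replacement
y1 = np.sqrt(N / m) * (H @ x1)[rows]
y2 = np.sqrt(N / m) * (H @ x2)[rows]
print(np.linalg.norm(y1, 1), np.linalg.norm(y2, 1))   # the ratio is roughly sqrt(s)
```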

Discussion
In this note, we considered a family of Johnson-Lindenstrauss embeddings with order optimal embedding dimension for all possible choices of the parameters N, p and ε. We showed that it allows for a fast simultaneous embedding of p′ points in time O(p′ N log m), provided that p = exp(O_ε(N^{1−δ})) for some δ > 0 and that p′ is large enough, i.e., p′ = p or p′ ≥ N^{d(δ)}. This improves upon the previously least restricted fast Johnson-Lindenstrauss transform by Ailon and Liberty [10], which required that p = exp(O_ε(N^{1/2−δ})). Our construction incorporates algorithms for fast multiplication of dense matrices. One may of course ask whether the Hadamard transformation step in our construction is necessary or whether such matrix multiplication algorithms could also be applied directly to a full Bernoulli matrix (as studied, for example, by Achlioptas [3]). This, however, does not seem to be the case. Namely, the analysis analogous to our proof above yields for the number of operations the bound of Θ(p′ N m^{f(α)−α−1}).
So if m = Θ(N^r), i.e. p = exp(Θ_ε(N^r)) for any r > 0, we obtain a complexity of Θ(p′ N^{1+t}) for some t > 0, which is always larger than Θ(p′ N log N). Of course this argument only shows that this particular approach fails, but as the number of entries of M_{E′}, one of the matrices to be multiplied, is N p′, which is more or less the total complexity bound that we seek, we see very little room for improvements. Independently of that, it remains a very interesting question to find Johnson-Lindenstrauss embeddings with optimal embedding dimension that allow for a fast query step, i.e., a fast application to a point that is not part of the given data set. Beyond its use for the approximate nearest neighbor search outlined above, such a result would also yield improvements in other application areas such as the construction of fast transforms with the restricted isometry property as in [30].
As we found, a subsampled Hadamard matrix with randomized column signs as considered in [12] cannot achieve this optimal embedding dimension, so a more involved construction will be necessary.