On randomized sketching algorithms and the Tracy–Widom law

There is an increasing body of work exploring the integration of random projection into algorithms for numerical linear algebra. The primary motivation is to reduce the overall computational cost of processing large datasets. A suitably chosen random projection can be used to embed the original dataset in a lower-dimensional space such that key properties of the original dataset are retained. These algorithms are often referred to as sketching algorithms, as the projected dataset can be used as a compressed representation of the full dataset. We show that random matrix theory, in particular the Tracy–Widom law, is useful for describing the operating characteristics of sketching algorithms in the tall-data regime when the sample size n is much greater than the number of variables d. Asymptotic large sample results are of particular interest as this is the regime where sketching is most useful for data compression. In particular, we develop asymptotic approximations for the success rate in generating random subspace embeddings and the convergence probability of iterative sketching algorithms. We test a number of sketching algorithms on real large high-dimensional datasets and find that the asymptotic expressions give accurate predictions of the empirical performance. Supplementary Information The online version contains supplementary material available at 10.1007/s11222-022-10148-5.


Introduction
Sketching is a probabilistic data compression technique that makes use of random projection (Cormode, 2011; Mahoney, 2011; Woodruff, 2014). Suppose interest lies in an n × d dataset A. When n and/or d are large, typical data analysis tasks involve a heavy numerical computing load. This computational burden can be a practical obstacle for statistical learning with Big Data. When the sample size n is the computational bottleneck, sketching algorithms use a linear random projection to create a smaller sketched dataset of size k × d, where k ≪ n. The random projection can be represented as a k × n random matrix S, and the sketched dataset $\tilde{A}$ is generated through the linear embedding $\tilde{A} = SA$. The smaller sketched dataset $\tilde{A}$ is used as a surrogate for the full dataset A within numerical routines. Through a judicious choice of the distribution on the random sketching matrix S, it is often possible to bound, stochastically, the error introduced into calculations by using the randomized approximation $\tilde{A}$ in place of A.
The distribution of the random sketching matrix S falls into one of two categories: data-oblivious sketches, where the distribution is not a function of the source data A, and data-aware sketches, where the distribution is a function of A. The majority of data-aware sketches perform weighted sampling with replacement, and are closely connected to finite population survey sampling methods (Ma et al., 2015; Quiroz et al., 2018). The analysis of data-oblivious sketches requires different methods from those used for data-aware sketches, as there are no clear ties to finite-population subsampling. In general, data-oblivious sketches generate a dataset of k pseudo-observations, where each instance in the compressed representation $\tilde{A}$ has no exact counterpart in the original source dataset A.
Three important data-oblivious sketches are the Gaussian sketch, the Hadamard sketch and the Clarkson-Woodruff sketch. The Gaussian sketch is the simplest of these, where each element in the k × n matrix S is an independent sample from a N(0, 1/k) distribution. The Hadamard sketch uses structured elements for fast matrix multiplication, and the Clarkson-Woodruff sketch uses sparsity in S for efficient computation of the sketched dataset. The comparative performance of different distributions on S is of interest, as choosing the type of sketch involves a trade-off between the computational cost of calculating $\tilde{A}$ and the fidelity of the approximation $\tilde{A}$ to the original A. Our results help to establish guidelines for selecting the sketching distribution.
Sketching algorithms are typically framed using stochastic (δ, ε) error bounds, where the algorithm is shown to attain (1 ± ε) accuracy with probability at least 1 − δ (Woodruff, 2014). These notions are made more precise in Section 2. Existing bounds are typically developed from a worst-case non-asymptotic viewpoint (Mahoney, 2011; Woodruff, 2014; Tropp, 2011). We take a different approach, and use random matrix theory to develop asymptotic approximations to the success probability given the sketching distortion factor ε.
Our main result is an asymptotic expression for the probability that a Gaussian-based sketching algorithm satisfies general (1 ± ε) probabilistic error bounds in terms of the Tracy-Widom law (Theorem 1), which describes the distribution of the extreme eigenvalues of large random matrices (Tracy and Widom, 1994; Johnstone, 2001). We then identify regularity conditions where other data-oblivious projections are expected to demonstrate the same limiting behavior (Theorem 3). If the motivation for using a sketching algorithm is data compression due to large n, the asymptotic approximations are of particular interest, as they become more accurate precisely as the computational benefits of sketching increase.
Empirical work has found that the quality of results can be consistent across the choice of random projections (Venkatasubramanian and Wang, 2011; Le et al., 2013; Dahiya et al., 2018), and our results shed some light on this issue. An application is to determine the convergence probability when sketching is used in iterative least-squares optimisation. We test the asymptotic theory and find good agreement on datasets with large sample sizes n ≫ d. Our theoretical and empirical results show that random matrix theory has an important role in the analysis of data-oblivious sketching algorithms for data compression.

Data-oblivious sketches
As mentioned, a key component in a sketching algorithm is the distribution on S. Four important random linear maps are:
• The uniform sketch implements subsampling uniformly with replacement followed by a rescaling step. The uniform projection can be represented as $S = \sqrt{n/k}\,\Phi$. The random matrix Φ subsamples k rows of A with replacement. Element $\Phi_{r,i} = 1$ if observation i in the source dataset is selected in the rth subsampling round (r = 1, . . ., k; i = 1, . . ., n), and is zero otherwise. The uniform sketch can be implemented in O(k) time.
• A Gaussian sketch is formed by independently sampling each element of S from a N(0, 1/k) distribution. Computation of the sketched data is O(ndk).
• The Hadamard sketch is a structured random matrix (Ailon and Chazelle, 2009). The sketching matrix is formed as $S = \Phi H D/\sqrt{k}$, where Φ is a k × n matrix and H and D are both n × n matrices. The fixed matrix H is a Hadamard matrix of order n. A Hadamard matrix is a square matrix with elements that are either +1 or −1 and orthogonal rows. Hadamard matrices do not exist for all integers n, so the source dataset can be padded with zeroes until a conformable Hadamard matrix is available. The random matrix D is a diagonal matrix where each of the n diagonal entries is an independent Rademacher random variable. The random matrix Φ subsamples k rows of H with replacement. The structure of the Hadamard sketch allows for fast matrix multiplication, reducing calculation of the sketched dataset to O(nd log k) operations.
• The Clarkson-Woodruff sketch is a sparse random matrix (Clarkson and Woodruff, 2013). The projection can be represented as the product of two independent random matrices, S = ΓD, where Γ is a random k × n matrix and D is a random n × n matrix. The matrix Γ is initialized as a matrix of zeros. In each column, independently, one entry is selected at random and set to +1. The matrix D is a diagonal matrix where each of the n diagonal entries is an independent Rademacher random variable. This results in a sparse S, where there is only one nonzero entry per column. The sparsity of the Clarkson-Woodruff sketch speeds up matrix multiplication, dropping the complexity of generating the sketched dataset to O(nd).
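The four constructions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an efficient implementation: the function names are hypothetical, every matrix is formed densely (so none of the quoted complexities are achieved), the Clarkson-Woodruff row is assumed to be chosen uniformly, and the Hadamard matrix is built by the Sylvester construction, which assumes n is a power of two.

```python
import numpy as np

def uniform_sketch(n, k, rng):
    """S = sqrt(n/k) * Phi, where Phi selects k rows with replacement."""
    rows = rng.integers(0, n, size=k)
    S = np.zeros((k, n))
    S[np.arange(k), rows] = np.sqrt(n / k)
    return S

def gaussian_sketch(n, k, rng):
    """Independent N(0, 1/k) entries."""
    return rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))

def sylvester_hadamard(n):
    """Hadamard matrix of order n (assumption: n is a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_sketch(n, k, rng):
    """S = Phi H D / sqrt(k), formed densely for illustration only."""
    H = sylvester_hadamard(n)
    signs = rng.choice([-1.0, 1.0], size=n)   # diagonal of D
    rows = rng.integers(0, n, size=k)         # rows of H kept by Phi
    return H[rows, :] * signs / np.sqrt(k)

def clarkson_woodruff_sketch(n, k, rng):
    """S = Gamma D: one random +/-1 entry in each column."""
    S = np.zeros((k, n))
    rows = rng.integers(0, k, size=n)         # assumed uniform over rows
    signs = rng.choice([-1.0, 1.0], size=n)
    S[rows, np.arange(n)] = signs
    return S

rng = np.random.default_rng(0)
n, k = 16, 4  # n is a power of two, so no zero-padding is needed
sketches = {
    "uniform": uniform_sketch(n, k, rng),
    "gaussian": gaussian_sketch(n, k, rng),
    "hadamard": hadamard_sketch(n, k, rng),
    "clarkson_woodruff": clarkson_woodruff_sketch(n, k, rng),
}
```

In practice the Hadamard and Clarkson-Woodruff sketches would be applied without ever forming S, which is where their speed advantage comes from.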
The Gaussian sketch was central to early work on sketching algorithms (Sarlos, 2006). The drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking O(ndk) operations.
As such, there has been work on designing more computationally efficient random projections.

Definition 1. ε-subspace embedding
For a given n × d matrix A, we call a k × n matrix S an ε-subspace embedding for A, if for all vectors $z \in \mathbb{R}^d$
$$(1 - \epsilon)\,\|Az\|_2^2 \;\le\; \|SAz\|_2^2 \;\le\; (1 + \epsilon)\,\|Az\|_2^2.$$
An ε-subspace embedding preserves the linear structure of the original dataset up to a multiplicative (1 ± ε) factor.
Broadly speaking, the covariance matrix of the sketched dataset $\tilde{A} = SA$ is similar to the covariance matrix of the source dataset A if ε is small. Mathematical arguments show that the sketched dataset is a good surrogate for many linear statistical methods if the sketching matrix S is an ε-subspace embedding for the original dataset, with ε sufficiently small (Woodruff, 2014). Suitable ranges for ε depend on the task of interest and structural properties of the source dataset (Mahoney and Drineas, 2016).
The Gaussian, Hadamard and Clarkson-Woodruff projections are popular data-oblivious projections as it is possible to argue that they produce ε-subspace embeddings with high probability for an arbitrary data matrix A. It is considerably more difficult to establish universal worst case bounds for the uniform projection (Drineas et al., 2006; Ma et al., 2015). We include the uniform projection in our discussion as it is a useful baseline.

Sketching algorithms
Sketching algorithms have been proposed for key linear statistical methods such as low rank matrix approximation, principal components analysis, linear discriminant analysis and ordinary least squares regression (Mahoney, 2011; Woodruff, 2014; Erichson et al., 2016; Falcone et al., 2021). Sketching has also been investigated for Bayesian posterior approximation (Bardenet and Maillard, 2015; Geppert et al., 2017). A common thread throughout these works is the reliance on the generation of an ε-subspace embedding. In general, ε serves as an approximation tolerance parameter, with smaller ε guaranteeing higher fidelity to exact calculation with respect to some divergence measure.

Table 1: Properties of different data-oblivious random projections (see Woodruff (2014) and the references therein). Columns: Sketch; Sketching time; Required sketch size k. The third column refers to the necessary sketch size k to obtain an ε-subspace embedding for an arbitrary n × d source dataset with at least probability (1 − δ).

An example application of sketching is ordinary least squares regression (Sarlos, 2006). The sketched responses and predictors are defined as $\tilde{y} = Sy$, $\tilde{X} = SX$. Let $\beta_F = \arg\min_{\beta} \|y - X\beta\|_2^2$, and $\beta_S = \arg\min_{\beta} \|\tilde{y} - \tilde{X}\beta\|_2^2$. If S is an ε-subspace embedding for A = (y, X), it is possible to establish concrete bounds on the quality of $\beta_S$ as an approximation to $\beta_F$ (Sarlos, 2006); the bounds tighten as ε decreases and involve $\sigma_{\min}(X)$, the smallest singular value of the design matrix X. If ε is very small, then $\beta_S$ is a good approximation to $\beta_F$.
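The sketched least squares estimator described above can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `sketched_ols` is hypothetical, a dense Gaussian sketch is used for simplicity, and the data are simulated purely for illustration.

```python
import numpy as np

def sketched_ols(X, y, k, rng):
    """Solve least squares on the sketched data (SX, Sy) with a Gaussian S."""
    n = X.shape[0]
    S = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))
    X_s, y_s = S @ X, S @ y          # sketched predictors and responses
    beta_s, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    return beta_s

rng = np.random.default_rng(1)
n, d, k = 2000, 5, 500               # illustrative sizes with k << n
X = rng.normal(size=(n, d))
y = X @ np.arange(1.0, d + 1.0) + rng.normal(size=n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)   # full-data solution
beta_sketch = sketched_ols(X, y, k, rng)            # sketched solution
error = np.linalg.norm(beta_sketch - beta_full)
print(f"||beta_S - beta_F||_2 = {error:.4f}")
```

Increasing k shrinks the gap between the two solutions at the cost of a larger sketched problem.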
Given the central role of ε-subspace embeddings (Definition 1), the success probability, Pr(S is an ε-subspace embedding for A), is an important descriptive measure of the uncertainty attached to the randomized algorithm. The probability statement is over the random sketching matrix S with the dataset A treated as fixed. The embedding probability is difficult to characterize precisely using existing theory (Venkatasubramanian and Wang, 2011). The bounds in Table 1 only give qualitative guidance about the embedding probability. Users will benefit from more prescriptive results in order to choose the sketch size k and the type of sketch for applications (Grellmann et al., 2016; Geppert et al., 2017; Ahfock et al., 2020; Falcone et al., 2021).
Another use for sketching is in iterative solvers for ordinary least squares regression. A sketch $\tilde{X} = SX$ can be used to generate a random preconditioner, $(\tilde{X}^T\tilde{X})^{-1}$, that is then applied to the normal equations $X^T X\beta = X^T y$. Given some initial value $\beta^{(0)}$, the iteration is defined as
$$\beta^{(t+1)} = \beta^{(t)} + (\tilde{X}^T\tilde{X})^{-1} X^T\left(y - X\beta^{(t)}\right). \qquad (2)$$
If $\tilde{X}^T\tilde{X} = X^T X$ the iteration will converge in a single step. The degree of noise in the preconditioner will be influenced by the sketch size k. A sufficient condition for convergence of the iteration (2) is that S is an ε-subspace embedding for X with ε < 0.5 (Pilanci and Wainwright, 2016). As is typical with randomized algorithms, we accept some failure probability in order to relax the computational demands. It is of interest to develop expressions for the failure probability of the algorithm as a function of the sketch size k, as this can give useful guidelines in practice. It is possible to establish worst case bounds using the results in Table 1; however, we will aim to give a point estimate of the probability. Although it is possible to improve on the iteration (2) using acceleration methods (Meng et al., 2014; Dahiya et al., 2018; Lacotte et al., 2020), we focus on the basic iteration to introduce our asymptotic techniques.
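The basic preconditioned iteration can be written out as below. This is a hedged sketch: the update is assumed to add the preconditioned gradient $(\tilde{X}^T\tilde{X})^{-1}X^T(y - X\beta^{(t)})$ at each step, the function name and the stopping tolerance are illustrative choices, and the data are simulated.

```python
import numpy as np

def sketched_preconditioned_ls(X, y, S, n_steps=200, tol=1e-8):
    """Iterate beta <- beta + (Xs^T Xs)^{-1} X^T (y - X beta), with Xs = S X."""
    X_s = S @ X
    G = X_s.T @ X_s                       # sketched Gram matrix (inverse = preconditioner)
    beta = np.zeros(X.shape[1])           # start from a vector of zeros
    for t in range(n_steps):
        grad = X.T @ (y - X @ beta)       # gradient of the full-data problem
        if np.linalg.norm(grad) < tol:
            return beta, t, True
        beta = beta + np.linalg.solve(G, grad)
    converged = np.linalg.norm(X.T @ (y - X @ beta)) < tol
    return beta, n_steps, converged

rng = np.random.default_rng(2)
n, d, k = 2000, 5, 400                    # illustrative sizes, k = 80 d
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)
S = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))   # Gaussian sketch

beta, steps, converged = sketched_preconditioned_ls(X, y, S)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(steps, converged, np.linalg.norm(beta - beta_full))
```

Only the d × d sketched Gram matrix is factorised, while each step still touches the full data through one gradient evaluation.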

Operating characteristics
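Let the singular value decomposition of the source dataset be given by $A = UDV^T$. Let $\sigma_{\min}(M)$ and $\sigma_{\max}(M)$ denote the minimum and maximum singular values, respectively, of a matrix M. Likewise, let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the minimum and maximum eigenvalues of a matrix M. It is possible to show
$$\Pr(S \text{ is an } \epsilon\text{-subspace embedding for } A) = \Pr\left(\sigma_{\max}(I_d - U^T S^T S U) \le \epsilon\right), \qquad (3)$$
where U is the n × d matrix of left singular vectors of the source data matrix A (Woodruff, 2014). Now as
$$\sigma_{\max}(I_d - U^T S^T S U) = \max\left(\,|1 - \lambda_{\min}(U^T S^T S U)|,\; |1 - \lambda_{\max}(U^T S^T S U)|\,\right), \qquad (4)$$
the extreme eigenvalues of $U^T S^T S U$ are the critical factor in generating ε-subspace embeddings. The convergence behavior of the basic iteration (2) is also tied to the eigenvalues of $U^T S^T S U$, where A = X.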
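For a dataset small enough to decompose, the distortion of a given sketching matrix can be checked directly from the left singular vectors. The following is a minimal sketch, assuming A has full column rank; the function names are hypothetical.

```python
import numpy as np

def embedding_distortion(A, S):
    """Return sigma_max(I_d - U^T S^T S U), with U the left singular vectors of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    SU = S @ U
    M = SU.T @ SU                                    # U^T S^T S U, a d x d matrix
    eigvals = np.linalg.eigvalsh(np.eye(M.shape[0]) - M)
    return np.max(np.abs(eigvals))                   # symmetric, so |eigenvalue| = singular value

def is_subspace_embedding(A, S, eps):
    """True when S is an eps-subspace embedding for A."""
    return embedding_distortion(A, S) <= eps

rng = np.random.default_rng(3)
n, d, k = 1000, 4, 200
A = rng.normal(size=(n, d))
S = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))   # Gaussian sketch
distortion = embedding_distortion(A, S)
print(distortion, is_subspace_embedding(A, S, 0.5))
```

The returned distortion is the smallest ε for which the given S is an ε-subspace embedding of A.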
Providing that $\tilde{X}^T\tilde{X}$ is of rank d, the maximum eigenvalue satisfies
$$\lambda_{\max}\left((\tilde{X}^T\tilde{X})^{-1} X^T X\right) = \lambda_{\max}\left((U^T S^T S U)^{-1}\right).$$
From standard results on iterative solvers (Hageman and Young, 2012), a necessary and sufficient condition for the iteration to converge, in the sense that $\lim_{t\to\infty}\|\beta^{(t)} - \beta_F\|_2 = 0$ for every starting value, is $\lambda_{\max}\left((\tilde{X}^T\tilde{X})^{-1} X^T X\right) < 2$. The probability of convergence can then be expressed as
$$\Pr\left(\lim_{t\to\infty}\|\beta^{(t)} - \beta_F\|_2 = 0\right) = \Pr\left(\lambda_{\max}\left((U^T S^T S U)^{-1}\right) < 2\right). \qquad (5)$$
Most existing results on the probabilities (3) and (5) are finite sample lower bounds (Tropp, 2011; Nelson and Nguyên, 2013; Meng, 2014). Worst case bounds can be conservative in practice, and there is value in developing other methods to characterize the performance of randomized algorithms (Halko et al., 2011; Raskutti and Mahoney, 2014; Lopes et al., 2018; Dobriban and Liu, 2018). The embedding probability (3) and the convergence probability (5) are related to the extreme eigenvalues of $U^T S^T S U$. In Section 3 we study this distribution for the Gaussian sketch and develop a Tracy-Widom approximation. The approximation is then extended to the Clarkson-Woodruff and Hadamard sketches in Section 4.

Exact representations
Meng (2014, Section 2.3) notes that when using a Gaussian sketch, it is instructive to consider directly the distribution of the random variable $\sigma_{\max}(I_d - U^T S^T S U)$ to study the embedding probability (3). Consider an arbitrary n × d data matrix A. As S is a matrix of independent Gaussians with mean zero and variance 1/k, it is possible to show that
$$U^T S^T S U \sim \text{Wishart}(k, I_d/k).$$
The key term $U^T S^T S U$ is in some sense a pivotal quantity, as its distribution is invariant to the actual values of the data matrix A. When using a Gaussian sketch, the probability of obtaining an ε-subspace embedding has no dependence on the number of original observations n, or on the values in the data matrix A. This is a useful property for a data-oblivious sketch, as it is possible to develop universal performance guarantees that will hold for any possible source dataset. This invariance property is also noted in Meng (2014), although the derivation is different.
Let us define the random matrix $W \sim \text{Wishart}(k, I_d/k)$. The success probability of interest can then be expressed in terms of the extreme eigenvalues of the Wishart distribution. The embedding probability of interest has the representation:
$$\Pr(S \text{ is an } \epsilon\text{-subspace embedding for } A) = \Pr\left(\sigma_{\max}(I_d - W) \le \epsilon\right) = \Pr\left(\max\left(|1 - \lambda_{\min}(W)|,\, |1 - \lambda_{\max}(W)|\right) \le \epsilon\right), \qquad (6)$$
where we have made use of the expression for the maximum singular value (4).
It is difficult to obtain a mathematically tractable expression for the embedding probability as it involves the joint distribution of the extreme eigenvalues (Chiani, 2017). Meng (2014) forms a lower bound on the probability (6) using concentration results on the eigenvalues of the Wishart distribution.
The convergence probability (5) can also be related to the eigenvalues of the Wishart distribution.
Assuming k ≥ d, the matrix $\tilde{X}^T\tilde{X}$ has full rank with probability one. As such, using the same pivotal quantity $U^T S^T S U$ as before,
$$\Pr\left(\lim_{t\to\infty}\|\beta^{(t)} - \beta_F\|_2 = 0\right) = \Pr\left(\lambda_{\max}(W^{-1}) < 2\right), \qquad (7)$$
where $W \sim \text{Wishart}(k, I_d/k)$. The convergence probability (7) has no dependence on the specific response vector y or design matrix X under consideration. Problem invariance is a highly desirable property for a randomized iterative solver (Roosta-Khorasani and Mahoney, 2016; Lacotte et al., 2020). Both the embedding probability and the convergence probability are related to the extreme eigenvalues of the Wishart distribution. The extreme eigenvalues of Wishart random matrices are a well studied topic in random matrix theory (Edelman, 1988), and we can make use of existing results to analyse the operating characteristics of sketching algorithms. In the following section we develop approximations to the embedding probability and the convergence probability in the asymptotic regime:
$$n, d, k \to \infty, \qquad d/k \to \alpha. \qquad (8)$$
The limit is asymptotic in n, d and k, with the constraint that the ratio of the number of variables to the sketch size tends to a constant α. This can be interpreted as a type of Big Data asymptotic, where we consider tall and wide datasets through the limit in n and d, and increasing sketch sizes k to cope with the expanding number of variables d. Although there is no explicit dependence on n in the finite sample expressions (3) and (7) for the Gaussian sketch, the asymptotic limit in n is still used to emphasize that we are taking limits in the tall-data setting. Dobriban and Liu (2018) analyse the mean squared error of single-pass sketching algorithms for linear regression in this asymptotic framework under the assumption of a generative model. Our analysis is different as we are concerned with the embedding and convergence probabilities ((3) and (5)), rather than the accuracy of population parameter estimates. In independent work, Lacotte et al. (2020) study the limiting empirical spectral distribution of the Hadamard sketch in the asymptotic regime (8). Here we are concerned with the fluctuations of the extreme eigenvalues rather than the bulk of the spectrum.
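The link between the convergence probability and the Wishart eigenvalues can be explored by simulation. The sketch below rests on one derivation that should be read as an assumption of this example: with a Gaussian sketch the iteration matrix $I_d - (\tilde{X}^T\tilde{X})^{-1}X^TX$ is similar to $I_d - W^{-1}$, so the basic iteration converges for every starting value exactly when the spectral radius of $I_d - W^{-1}$ is below one. The function name is hypothetical.

```python
import numpy as np

def simulate_convergence_probability(k, d, n_rep, rng):
    """Monte Carlo estimate of Pr(spectral radius of I_d - W^{-1} < 1)."""
    hits = 0
    for _ in range(n_rep):
        G = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, d))
        W = G.T @ G                                # W ~ Wishart(k, I_d / k)
        lam = np.linalg.eigvalsh(W)
        rho = np.max(np.abs(1.0 - 1.0 / lam))      # eigenvalues of I - W^{-1} are 1 - 1/lam
        hits += int(rho < 1.0)
    return hits / n_rep

rng = np.random.default_rng(5)
p_large = simulate_convergence_probability(k=400, d=20, n_rep=200, rng=rng)
p_square = simulate_convergence_probability(k=20, d=20, n_rep=200, rng=rng)
print(p_large, p_square)
```

Because the simulation never touches y or X, it illustrates the problem invariance of the Gaussian sketch noted above.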

Random matrix theory
Random matrix theory involves the analysis of large random matrices (Bai and Silverstein, 2010). The Tracy-Widom law is an important result in the study of extreme eigenvalue statistics (Tracy and Widom, 1994). Johnstone (2001) showed that the Tracy-Widom law gives the asymptotic distribution of the maximum eigenvalue of a Wishart(k, I_d/k) matrix after appropriate centering and scaling. In subsequent work, Ma (2012) showed that the rate of convergence could be improved from $O(d^{-1/3})$ to $O(d^{-2/3})$ by using different centering and scaling constants than in Johnstone (2001). We build from the convergence result given by Ma.
The R package RMTstat contains a number of functions for working with the Tracy-Widom distribution (Johnstone et al., 2014).The main application of the Tracy-Widom law to statistical inference has been its use in hypothesis testing in high-dimensional statistical models (Johnstone, 2006;Bai and Silverstein, 2010).
To the best of our knowledge, the connection to sketching algorithms has not been explored in great depth.
The Tracy-Widom law can be used to approximate the embedding probability (3).
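Set $Z \sim F_1$ where $F_1$ is the Tracy-Widom distribution. Let $\psi_{n,k,d}$ give the exact embedding probability and let $\hat{\psi}_{n,k,d}$ give the asymptotic approximation to the embedding probability:
$$\psi_{n,k,d} = \Pr(S \text{ is an } \epsilon\text{-subspace embedding for } A), \qquad \hat{\psi}_{n,k,d} = \Pr\left(Z \le \frac{1 + \epsilon - \mu_{k,d}}{\sigma_{k,d}}\right),$$
where $\mu_{k,d}$ and $\sigma_{k,d}$ are the centering and scaling constants defined in the statement of the theorem. Then asymptotically in n, d and k, for any ε > 0,
$$\lim_{n,d,k\to\infty} \left|\psi_{n,k,d} - \hat{\psi}_{n,k,d}\right| = 0.$$
The proof is given in the supplementary material.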
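As a hedged illustration of how the Tracy-Widom approximation to the embedding probability (3) would be evaluated, suppose the centering and scaling constants take the form used by Ma (2012) for the largest eigenvalue of a white Wishart matrix, rescaled by 1/k because W ∼ Wishart(k, I_d/k). This form is an assumption of the example, as is the step that treats the largest eigenvalue as the dominant contribution to the distortion:

```latex
\mu_{k,d} = \frac{1}{k}\left(\sqrt{k - \tfrac{1}{2}} + \sqrt{d - \tfrac{1}{2}}\right)^{2},
\qquad
\sigma_{k,d} = \frac{1}{k}\left(\sqrt{k - \tfrac{1}{2}} + \sqrt{d - \tfrac{1}{2}}\right)
\left(\frac{1}{\sqrt{k - \tfrac{1}{2}}} + \frac{1}{\sqrt{d - \tfrac{1}{2}}}\right)^{1/3},

\Pr\left(\sigma_{\max}(I_d - W) \le \epsilon\right)
\;\approx\; \Pr\left(\lambda_{\max}(W) \le 1 + \epsilon\right)
\;\approx\; F_1\!\left(\frac{1 + \epsilon - \mu_{k,d}}{\sigma_{k,d}}\right).
```

Under these assumptions, d = 20 and k = 400 give roughly $\mu_{k,d} \approx 1.49$ and $\sigma_{k,d} \approx 0.040$, so a distortion tolerance of ε = 0.5 corresponds to evaluating $F_1$ at about 0.3.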
The convergence probability of the iterative algorithm (5) can also be approximated using the Tracy-Widom law.
Theorem 2. Suppose we have an arbitrary n × d data matrix A where n > d and A is of rank d. Furthermore, assume we take a Gaussian sketch of size k. Consider the limit in n, k and d, such that d/k → α with , and define the following centering and scaling constants where $F_1$ is the Tracy-Widom distribution. Let $\gamma_{n,k,d}$ give the exact convergence probability, and $\hat{\gamma}_{n,k,d}$ give the asymptotic approximation to the convergence probability: Then for all starting values $\beta^{(0)}$, asymptotically in n, d and k,
$$\lim_{n,d,k\to\infty} \left|\gamma_{n,k,d} - \hat{\gamma}_{n,k,d}\right| = 0.$$
The proof is given in the supplementary material.
The embedding probability for the Gaussian sketch can be estimated by simulating $W \sim \text{Wishart}(k, I_d/k)$ and using the empirical distribution of the random variable $\sigma_{\max}(I_d - W)$. To assess the accuracy of the approximation in Theorem 1, we generated B = 10,000 random Wishart matrices $W^{[1]}, \ldots, W^{[B]}$. For each simulated matrix $W^{[b]}$ we computed the distortion factor $\epsilon^{[b]} = \sigma_{\max}(I_d - W^{[b]})$ for b = 1, . . ., B.
The simulated distortion factors $\epsilon^{[1]}, \ldots, \epsilon^{[B]}$ were used to give a Monte Carlo estimate of the embedding probability:
$$\widehat{\Pr}(S \text{ is an } \epsilon\text{-subspace embedding for } A) = \frac{1}{B}\sum_{b=1}^{B} \mathbb{1}\left(\epsilon^{[b]} \le \epsilon\right). \qquad (9)$$
We used the ARPACK library (Lehoucq et al., 1998) to compute the maximum singular values $\sigma_{\max}(I_d - W^{[b]})$. The estimated embedding probabilities are displayed in Figure 1.

Asymptotic methods are useful to analyse data-oblivious sketches that do not admit interpretable finite sample distributions (Li et al., 2006; Ahfock et al., 2020; Lacotte et al., 2020). Here we describe the limiting behavior of the sketched algorithms for fixed k and d as the number of source observations n increases.
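The Monte Carlo procedure described above can be sketched as follows. The function names are hypothetical, the number of replications is kept small for speed, and a dense symmetric eigensolver is used in place of ARPACK, which is a substitution made for this example.

```python
import numpy as np

def simulate_distortions(k, d, n_rep, rng):
    """Simulate sigma_max(I_d - W) for W ~ Wishart(k, I_d / k)."""
    distortions = np.empty(n_rep)
    for b in range(n_rep):
        G = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, d))
        lam = np.linalg.eigvalsh(G.T @ G)          # eigenvalues in ascending order
        distortions[b] = max(abs(1.0 - lam[0]), abs(1.0 - lam[-1]))
    return distortions

def embedding_probability(distortions, eps):
    """Fraction of simulated distortion factors that do not exceed eps."""
    return float(np.mean(distortions <= eps))

rng = np.random.default_rng(3)
distortions = simulate_distortions(k=400, d=20, n_rep=200, rng=rng)
for eps in (0.3, 0.4, 0.5, 0.6):
    print(eps, embedding_probability(distortions, eps))
```

Evaluating the estimate over a grid of ε values traces out the kind of empirical curve against which an asymptotic approximation can be compared.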
Under an assumption on the limiting leverage scores of the source data matrix, we can establish a limit theorem.
Assumption 1. Define the singular value decomposition of the n × d source dataset as $A_{(n)} = U_{(n)} D_{(n)} V_{(n)}^T$, and let $u_{(n)i}^T$ denote the i-th row of $U_{(n)}$. Assume that the maximum leverage score tends to zero, that is
$$\lim_{n\to\infty}\; \max_{i=1,\ldots,n} \|u_{(n)i}\|_2^2 = 0.$$
The asymptotic probability of obtaining an ε-subspace embedding for the Hadamard and Clarkson-Woodruff sketches can be related to the Wishart distribution. The proof is given in the supplementary material.
Theorem 3 states that the embedding probability for the Hadamard and Clarkson-Woodruff sketches converges to that of the Gaussian sketch as n → ∞. Therefore, Theorem 1 can also be used to approximate the embedding probability. Empirical studies have shown that the Hadamard and Clarkson-Woodruff sketches can give similar quality results to the Gaussian projection (Venkatasubramanian and Wang, 2011; Le et al., 2013; Dahiya et al., 2018). Theorem 3 helps to characterize situations where this phenomenon is expected to be observed.
Remark 1. The same line of proof used in Theorem 3 can be used to show that the convergence probability of (2) using the Hadamard and Clarkson-Woodruff projections converges to that of the Gaussian sketch under Assumption 1. Theorem 2 therefore also gives an asymptotic approximation for the Hadamard and Clarkson-Woodruff sketches.
It remains to establish a formal limit theorem in terms of the Tracy-Widom distribution for the Hadamard and Clarkson-Woodruff sketches. The proof of Theorem 3 treats k and d as fixed, with only n being taken to infinity. It is possible that Assumption 1 on the leverage scores will remain sufficient in the expanding dimension scenario. For any d, the maximum leverage score must be at least as large as the average leverage score,
$$\max_{i=1,\ldots,n} \|u_i\|_2^2 \;\ge\; \frac{1}{n}\sum_{i=1}^{n} \|u_i\|_2^2 = \frac{d}{n}.$$
If we maintain that Assumption 1 holds on the leverage scores as n, d, k → ∞, this implies that d/n → 0.
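The leverage scores that appear in Assumption 1 are straightforward to compute for a dataset that fits in memory. The sketch below is illustrative: the function name is hypothetical, A is assumed to have full column rank, and a reduced QR factorisation is used because the row norms of its orthonormal factor equal those of the left singular vectors.

```python
import numpy as np

def leverage_scores(A):
    """Squared row norms of an orthonormal basis for the column space of A."""
    Q, _ = np.linalg.qr(A)            # reduced QR: Q is n x d with orthonormal columns
    return np.sum(Q ** 2, axis=1)

rng = np.random.default_rng(4)
n, d = 500, 4
A = rng.normal(size=(n, d))
A[0] *= 50.0                          # one outlying row, expected to have high leverage

h = leverage_scores(A)
print(h.max(), d / n)                 # maximum leverage score versus the average d / n
```

A single dominant row such as the one constructed here keeps the maximum leverage score far above the average d/n, which is the situation Assumption 1 rules out in the limit.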
As we have assumed that our primary motivation for sketching is data compression when n ≫ d, we feel that analysis in the asymptotic regime d/n → 0 is reasonable for this use case. The asymptotic approximations developed here are recommended for applications of sketching in tall-data problems, n ≫ d.

Uniform sketch
It is considerably more difficult to approximate the embedding probability for the uniform sketch compared to the other data-oblivious projections. Vershynin (2010) provides a bound for the uniform sketch that is useful for comparative purposes. Let S be a k × n uniform sketch of size k. Then for every t ≥ 0, with probability at least 1 − 2d exp(−ct²) one has
$$1 - t\sqrt{nm/k} \;\le\; \sigma_{\min}(SU) \;\le\; \sigma_{\max}(SU) \;\le\; 1 + t\sqrt{nm/k},$$
where c > 0 is an absolute constant. Theorem 4 can be used to give a lower bound on the probability of obtaining an ε-subspace embedding.
Both Theorem 4 and Theorem 3 involve the maximum leverage score. Holding k and d fixed, in order for the bound in Theorem 4 to remain controlled as the sample size n increases, the maximum leverage score m must decrease at a sufficient rate. In contrast, Assumption 1 does not enforce a rate of decay on the maximum leverage score, only that it eventually tends to zero as n → ∞. This suggests that the uniform projection could be more sensitive to the maximum leverage score than the Gaussian, Hadamard and Clarkson-Woodruff projections. As mentioned earlier, it is very difficult to give a general expression for the embedding probability (3) when using the uniform sketch as it will be a complicated function of the source dataset A. An advantage of the Gaussian, Hadamard and Clarkson-Woodruff projections is that a Tracy-Widom approximation can be motivated under mild regularity conditions.

The region was chosen as many associations with haemoglobin concentration were discovered in a genome-wide scan using univariable models; these associations were with variants with different allele frequencies, suggesting multiple distinct causal variants in the region. We also considered a subset of this dataset with

shows greater variance than expected. Panel (b) compares the theoretical to the simulation results on the bootstrapped dataset. In (b) there is very good agreement between the empirical distribution and the theoretical distribution. It seems that for this dataset n ≈ 400,000 is not large enough for the large sample asymptotics to take effect. At n ≈ 4 million the Tracy-Widom approximation is very good. As mentioned earlier, our motivation for using a sketching algorithm is to perform data compression with tall datasets, n ≫ d. This example highlights that the asymptotic approximations become more accurate as the sample size n grows, while the computational incentives for using sketching increase in parallel.

Iterative optimisation
We considered iterative least-squares optimisation using the song year dataset available from the UCI machine learning repository. The dataset has n = 515,344 observations, p = 90 covariates, and year of song release as the response. We assessed the convergence probability by running the iteration (2) with the sketched preconditioner. The initial parameter estimate $\beta^{(0)}$ was a vector of zeros. The iteration was run for 2000 steps, with convergence being declared if the gradient norm condition $\|X^T(y - X\beta^{(t)})\|_2 < 10^{-6}$ was satisfied at any time step t. This convergence criterion was used instead of $\|\beta_F - \beta^{(t)}\|_2$ as $\beta_F$ will not be known in practice. This was repeated one hundred times for each of the random projections discussed in Section 2.1 using different sketch sizes k. Figure 5 compares the empirical (black solid points) and theoretical convergence probabilities (dashed red line) against the sketch size k. The point-ranges represent 95% confidence intervals. The Gaussian, Hadamard and Clarkson-Woodruff sketches show near identical behavior, and the empirical convergence probabilities closely match the theoretical predictions using Theorem 2. The uniform sketch was much less successful in generating preconditioners: the algorithm did not converge in any replication at any sketch size k. In this example, the additional computational cost of the Gaussian, Hadamard and Clarkson-Woodruff sketches compared to uniform subsampling has clear benefits.

Conclusion
The analysis of the asymptotic behavior of common data-oblivious random projections revealed an important connection to the Tracy-Widom law. The probability of attaining an ε-subspace embedding (Definition 1) is an integral descriptive measure for many sketching algorithms. The asymptotic embedding probability can be approximated using the Tracy-Widom law for the Gaussian, Hadamard and Clarkson-Woodruff sketches.
The Tracy-Widom law can also be used to estimate the convergence probability for iterative schemes with a sketched preconditioner.We have tested the predictions empirically and seen close agreement.The majority of existing results for sketching algorithms have been established using non-asymptotic tools.Asymptotic results are a useful complement that can provide answers to important questions that are difficult to address concretely in a finite dimensional framework.
There was a stark contrast between the performance of the basic uniform projection and the other data-oblivious projections (Gaussian, Hadamard and Clarkson-Woodruff) in the data application. The Hadamard and Clarkson-Woodruff projections are expected to behave like the Gaussian projection under mild regularity conditions on the maximum leverage score. We observed this phenomenon when n/d was large, as is required by Theorem 3. The Hadamard and Clarkson-Woodruff projections are substantially more computationally efficient than the Gaussian projection (recall Table 1), so their universal limiting behavior implies that the trade-off between computation time and performance guarantees is asymptotically negligible in the regime (8).
The Tracy-Widom law has found many applications in high-dimensional statistics and probability (Edelman and Wang, 2013), and we have shown that it is useful for describing the asymptotic behavior of sketching algorithms. The asymptotic behaviour with respect to large n is of practical interest, as this is the regime where sketching is attractive as a data compression technique. The universal behavior of high-dimensional random matrices has practical and theoretical consequences for randomized algorithms that use linear dimension reduction (Dobriban and Liu, 2018; Lacotte et al., 2020).

S1.3 Proof of Theorem 1
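Proof. The extreme eigenvalues of a Wishart random matrix converge in probability to fixed values as both the dimension and degrees of freedom expand. The result for the largest eigenvalue is due to Geman (1980) and the result for the smallest eigenvalue is due to Silverstein (1985).
Theorem S.7. (Geman, 1980; Silverstein, 1985) Consider a sequence of Wishart(k, I_d/k) random matrices where the degrees of freedom k and dimension d are both taken to infinity. Suppose that the variables to samples ratio d/k converges to a constant, d/k → α, where α ∈ (0, 1]. Then the extreme eigenvalues of the random matrix, $\lambda_{\min}$ and $\lambda_{\max}$, converge in probability to the limits
$$\lambda_{\min} \xrightarrow{\,p\,} (1 - \sqrt{\alpha})^2, \qquad \lambda_{\max} \xrightarrow{\,p\,} (1 + \sqrt{\alpha})^2.$$
Let $W \sim \text{Wishart}(k, I_d/k)$, so that $\Pr(\sigma_{\max}(I_d - W) \le \epsilon)$ = Pr(S is an ε-subspace embedding for A), and let $\lambda_{\min}$ and $\lambda_{\max}$ denote the minimum and maximum eigenvalues of W respectively. Using Slutsky's theorem and the continuous mapping theorem we have the joint convergence result
$$\left(|1 - \lambda_{\min}|,\; |1 - \lambda_{\max}|\right) \xrightarrow{\,p\,} \left(|1 - (1 - \sqrt{\alpha})^2|,\; |1 - (1 + \sqrt{\alpha})^2|\right) = \left(2\sqrt{\alpha} - \alpha,\; 2\sqrt{\alpha} + \alpha\right), \qquad (S.12)$$
where the equality uses the fact that α ∈ (0, 1]. For large k and d, the maximum eigenvalue $\lambda_{\max}$ is expected to show greater deviation from one than the minimum eigenvalue $\lambda_{\min}$. Over the interval α ∈ (0, 1] it holds that $2\sqrt{\alpha} + \alpha > 2\sqrt{\alpha} - \alpha$. Applying the continuous mapping theorem to the random vector in (S.12),
$$\max\left(|1 - \lambda_{\min}|,\; |1 - \lambda_{\max}|\right) \xrightarrow{\,p\,} 2\sqrt{\alpha} + \alpha.$$
As $(1 + \sqrt{\alpha})^2$ is greater than one for all α > 0, the absolute value sign can be removed in the limit giving the equivalent statement max , we establish convergence of the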

Theorem 1. Suppose we have an arbitrary n × d data matrix A where n > d and A is of rank d. Furthermore, assume we take a Gaussian sketch of size k. Consider the limit in n, k and d, such that d/k → α with α ∈ (0, 1]. Define centering and scaling constants $\mu_{k,d}$ and $\sigma_{k,d}$ as

Figure 1 displays the estimated embedding probabilities for different dimensions d. The sketch size to variables ratio, k/d, was held fixed at 20. The solid red line shows the empirical probability of obtaining an ε-subspace embedding. The dashed black line gives the Tracy-Widom approximation given in Theorem 1. The agreement is consistently good over the dimensions d and the range of sketch sizes k that were considered.

Figure 1: Accuracy of Tracy-Widom approximation for embedding probability (6) for the Gaussian sketch. The dashed black line gives the asymptotic limit, the solid red line gives the empirical probability. When d ≥ 20 the approximation given in Theorem 1 is very accurate.

Theorem 3. Consider a sequence of arbitrary $n \times d$ data matrices $A_{(n)}$, where each data matrix is of rank $d$, and $d$ is fixed. Let $A_{(n)} = U_{(n)} D_{(n)} V_{(n)}^T$ represent the singular value decomposition of $A_{(n)}$. Let $S_{(n)}$ be a $k \times n$ Hadamard or Clarkson-Woodruff sketching matrix where $k$ is also fixed. Suppose that Assumption 1 is satisfied. Then as $n$ tends to infinity with $k$ and $d$ fixed,

$$\lim_{n \to \infty} \Pr\left(S_{(n)} \text{ is an } \varepsilon\text{-subspace embedding for } A_{(n)}\right) = \Pr\left(\sigma_{\max}(I_d - W) \le \varepsilon\right),$$

where $W \sim \text{Wishart}(k, I_d/k)$.
The key result is that the Hadamard and Clarkson-Woodruff sketches behave like the Gaussian projection for large n, with k and d fixed. If the Tracy–Widom approximation in Theorem 1 is good for finite k and d with the Gaussian sketch, it should also hold well for the Hadamard and Clarkson-Woodruff projections when n is sufficiently large.
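The large-n behaviour described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions: the test matrix U is a hypothetical orthonormal basis with roughly uniform leverage scores (obtained from a QR decomposition of Gaussian noise), the Clarkson-Woodruff sketch is implemented as one random ±1 entry per column of S, and the Gaussian sketch SU is simulated directly with i.i.d. N(0, 1/k) entries. All function names and parameter values are illustrative and do not reproduce the experiments reported here.

```python
import numpy as np

def clarkson_woodruff_sketch(U, k, rng):
    """Apply a k x n Clarkson-Woodruff sketch to U without forming S."""
    n, d = U.shape
    rows = rng.integers(0, k, size=n)                   # target row for each observation
    signs = rng.choice(np.array([-1.0, 1.0]), size=n)   # random sign for each observation
    SU = np.zeros((k, d))
    # Unbuffered accumulation: signed row i of U is added to row rows[i] of SU.
    np.add.at(SU, rows, signs[:, None] * U)
    return SU

def gaussian_sketch(k, d, rng):
    """Simulate SU directly: for orthonormal U, SU has i.i.d. N(0, 1/k) entries."""
    return rng.standard_normal((k, d)) / np.sqrt(k)

def distortion(SU):
    """sigma_max(I_d - (SU)^T (SU)), the smallest eps for which the embedding holds."""
    d = SU.shape[1]
    return np.linalg.norm(np.eye(d) - SU.T @ SU, ord=2)

rng = np.random.default_rng(1)           # assumption: arbitrary seed
n, d, k, B = 10_000, 5, 100, 100         # assumption: small illustrative sizes
U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # hypothetical test basis

cw = np.array([distortion(clarkson_woodruff_sketch(U, k, rng)) for _ in range(B)])
ga = np.array([distortion(gaussian_sketch(k, d, rng)) for _ in range(B)])
print(f"mean distortion: Clarkson-Woodruff {cw.mean():.3f}, Gaussian {ga.mean():.3f}")
```

With n large relative to k, the two distortion distributions should be close; a basis with a few very high leverage rows would be expected to break this agreement.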

Theorem 4 (Vershynin (2010), Theorem 5.1). Consider an $n \times d$ matrix $U$ such that $U^T U = I_d$. Let $u_i^T$ represent the $i$-th row in $U$ for $i = 1, \ldots, n$. Let $m$ give an upper bound on the leverage scores, so max

Figure 2: Analysis of a subset of the PKCε dataset (n = 407,779, d = 132) with B = 1,000 sketches of size k = 20d. The dashed black line and the solid red line give the theoretical and empirical embedding probabilities respectively. The Tracy–Widom approximation is accurate for the Gaussian, Hadamard and Clarkson-Woodruff sketches.

p = 130 representative markers identified by hierarchical clustering. When including the intercept and response, the PKCε subset has n = 407,779, d = 132, and the full PKCε dataset has n = 407,779, d = 1,034. The full PKCε dataset is of moderate size, so it was feasible to take the singular value decomposition of the full n × d dataset, $A = U D V^T$. Given the singular value decomposition we ran an oracle procedure to estimate the exact embedding probability. We generated B sketching matrices $S^{[1]}, \ldots, S^{[B]}$. These were used to compute $\varepsilon^{[b]} = \sigma_{\max}(I_d - U^T S^{[b]T} S^{[b]} U)$ for $b = 1, \ldots, B$ and give an estimated embedding probability as in (9). When working with the full PKCε dataset we simulated the sketched matrix SU directly from the matrix normal distribution $\text{MN}(I_k, I_d/k)$ for the Gaussian sketch, rather than computing the matrix multiplication SU. We took B = 1,000 sketches of the PKCε subset, and B = 100 sketches of the full PKCε dataset, using the uniform, Gaussian, Hadamard and Clarkson-Woodruff projections, with k = 20 × d.
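The oracle procedure can be sketched in a few lines. The following is a minimal illustration, assuming a Gaussian sketch formed explicitly and a small synthetic orthonormal basis U in place of the real data. The estimated embedding probability is taken to be the fraction of replicates whose distortion is at most ε, which is how we read the estimator referenced as (9); its exact form is an assumption here, as are the function name `oracle_embedding_probability` and all parameter values.

```python
import numpy as np

def oracle_embedding_probability(U, k, eps_grid, B, rng):
    """Estimate Pr(S is an eps-subspace embedding) over a grid of eps values."""
    n, d = U.shape
    eps_b = np.empty(B)
    for b in range(B):
        # Gaussian sketching matrix with i.i.d. N(0, 1/k) entries.
        S = rng.standard_normal((k, n)) / np.sqrt(k)
        SU = S @ U
        # Distortion of replicate b: sigma_max(I_d - U^T S^T S U).
        eps_b[b] = np.linalg.norm(np.eye(d) - SU.T @ SU, ord=2)
    # Empirical cdf of the distortions evaluated at each eps (assumed form of (9)).
    probs = np.array([(eps_b <= e).mean() for e in eps_grid])
    return probs, eps_b

rng = np.random.default_rng(2)            # assumption: arbitrary seed
n, d, B = 1_000, 5, 100                   # assumption: toy sizes, not the PKC data
k = 20 * d                                # sketch size k = 20 x d as in the text
U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # synthetic orthonormal basis
eps_grid = np.linspace(0.0, 1.5, 16)
probs, eps_b = oracle_embedding_probability(U, k, eps_grid, B, rng)
for e, p in zip(eps_grid, probs):
    print(f"eps = {e:.1f}  estimated embedding probability = {p:.2f}")
```

The procedure is an oracle because it requires U, and hence the singular value decomposition of the full dataset, which would not be computed in a practical use of sketching.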

Figure 2 shows the empirical and theoretical embedding probabilities for the PKCε subset (n = 407,779, d = 132) for each type of sketch. The observed and theoretical curves match well for the Gaussian, Hadamard and Clarkson-Woodruff projections. The uniform projection performs worse than the other data-oblivious random projections, as larger values of ε indicate weaker approximation bounds. The uniform projection does not satisfy a central limit theorem for fixed k, so we do not necessarily expect the Tracy–Widom law to give a good approximation for the uniform projection.

Figure 3 shows the empirical and theoretical embedding probabilities for the full PKCε dataset (n = 407,779, d = 1,034) for each type of sketch. The Tracy–Widom approximation is accurate for the Gaussian sketch, but there are some deviations for the Hadamard and the Clarkson-Woodruff sketches. Interestingly, the empirical cdf for the Hadamard sketch (red) is to the left of the theoretical value (black), indicating

Figure 3: Analysis of the full PKCε dataset (n = 407,779, d = 1,034) with B = 100 sketches of size k = 20d. The x-axis is different in each panel. The dashed black line and the solid red line give the theoretical and empirical embedding probabilities respectively. The uniform projection is much less successful at generating ε-subspace embeddings than the other data-oblivious projections.

Figure 4: Comparison of results on the original PKCε dataset (n = 407,779) and the larger bootstrapped PKCε dataset (n = 4,077,790). The dashed black line and the solid red line give the theoretical and empirical embedding probabilities respectively. As expected from Theorem 3, the accuracy of the Tracy–Widom approximation increases with n.

Figure 5: Convergence probability on the year dataset (n = 515,344, d = 91). Black solid points show the empirical convergence probability over B = 100 sketches. The red dashed line gives the theoretical convergence probability using Theorem 2. The Tracy–Widom approximation is accurate for the Gaussian, Hadamard and Clarkson-Woodruff sketches. The uniform sketch fails to generate useful preconditioners.
Theorem S.7 and the continuous mapping theorem can be used to determine the asymptotic embedding probability for the Gaussian sketch.

Lemma S.3. Suppose we have an arbitrary $n \times d$ data matrix $A$ where $n > d$ and $A$ is of rank $d$. Assume we take a Gaussian sketch of size $k$. Then asymptotically in $n$, $k$ and $d$, with $d/k \to \alpha$ where $\alpha \in (0

Table 2: Mean sketching time (seconds) over ten sketches for each dataset. The Gaussian sketch is considerably slower than the Hadamard and Clarkson-Woodruff sketches on the subset, as is expected from Table 1.