Quantum algorithms for SVD-based data representation and analysis

This paper narrows the gap between previous literature on quantum linear algebra and practical data analysis on a quantum computer, formalizing quantum procedures that speed up the solution of eigenproblems for data representations in machine learning. The power and practical use of these subroutines is shown through new quantum algorithms, sublinear in the input matrix's size, for principal component analysis, correspondence analysis, and latent semantic analysis. We provide a theoretical analysis of the run-time and prove tight bounds on the randomized algorithms' error. We run experiments on multiple datasets, simulating PCA's dimensionality reduction for image classification with the novel routines. The results show that the run-time parameters that do not depend on the input's size are reasonable and that the error on the computed model is small, allowing for competitive classification performance.


Introduction
Quantum computation is a computing paradigm that promises substantial speed-ups in a plethora of tasks that are computationally hard for classical computers. In 2009, Harrow, Hassidim, and Lloyd [23] presented quantum procedures to create a quantum state proportional to the solution of a linear system of equations Ax = b in time logarithmic in the size of A. This result has promoted further research on optimization, linear algebra, and machine learning problems, leading to faster quantum algorithms for linear regressions [8], support vector machines [48], k-means [32], and many others [3]. Following this research line, in this work, we focus on quantum algorithms for singular value based data analysis and representation. When handling big data, it is crucial to learn effective representations that reduce the data's noise and help the learner perform better on the task. Many data representation methods for machine learning, such as principal component analysis [46], correspondence analysis [18], slow feature analysis [28], or latent semantic analysis [12], heavily rely on singular value decomposition and are impractical to compute on classical computers for extensive datasets.
We have gathered and combined state-of-the-art quantum techniques to present a useful and easy-to-use framework for solving eigenvalue problems at large scale. While we focus on machine learning problems, these subroutines can be used for other problems that are classically solved via an SVD of a suitable matrix. More specifically, we formalize novel quantum procedures to compute classical estimates of the most relevant singular values, factor scores, and factor score ratios of an n × m matrix in time poly-logarithmic in nm, and of the most relevant singular vectors sub-linearly in nm. We show how to use these procedures to obtain a classical description of the models of three machine learning algorithms: principal component analysis, correspondence analysis, and latent semantic analysis. We also discuss how to represent the data in the new feature space with a quantum computer. We provide a thorough theoretical analysis for all these algorithms, bounding the run-time, the error, and the failure probability.
The remainder of the paper is organized as follows. Section 2 introduces our notation and discusses the relevant quantum preliminaries. Section 3 presents the novel quantum algorithms. In Section 4, we show applications of the algorithms to principal component analysis, correspondence analysis, and latent semantic analysis. Section 5 presents numerical experiments assessing the runtime parameters. Finally, we provide detailed information on the experiments and extensively discuss related work in quantum and classical literature in the appendix.
Quantum preliminaries and notation

Notation
Given a matrix A, we write a_{i,·} to denote its i-th row, a_{·,j} for its j-th column, and a_{ij} for the element at row i, column j. We write its singular value decomposition as A = UΣV^T. U and V are orthogonal matrices whose column vectors u_i and v_i are respectively the left and right singular vectors of A. Σ is a diagonal matrix with non-negative entries σ_i: the singular values. The row/column size of Σ is the rank of A and is denoted as r. We use λ_i to denote the i-th eigenvalue of the covariance matrix A^T A = VΣ²V^T, and λ^{(i)} = λ_i / Σ_j^r λ_j to denote the relative magnitude of each eigenvalue. Using the notation of Hsu et al [26] for correspondence analysis, we refer to the λ_i as factor scores and to the λ^{(i)} as factor score ratios. Note that λ_i = σ_i² and λ^{(i)} = σ_i² / Σ_j^r σ_j². We denote the number of non-zero elements of a matrix/vector with nnz(·). Given a scalar a, |a| is its absolute value. The ℓ∞ and ℓ0 norms of a vector a are defined as ‖a‖_∞ = max_i(|a_i|) and ‖a‖_0 = nnz(a). If the vector norm is not specified, we refer to the ℓ2 norm. The Frobenius norm of a matrix is ‖A‖_F = √(Σ_i^r σ_i²), its spectral norm is ‖A‖ = max_{x∈R^m} ‖Ax‖/‖x‖ = σ_max, and finally ‖A‖_∞ = max_i(‖a_{i,·}‖_1). A contingency table is a matrix that represents categorical variables in terms of the observed frequency counts. Finally, when stating the complexity of an algorithm, we use Õ instead of O to omit the polylogarithmic terms in the size of the input data (i.e., O(polylog(nm)) = Õ(1)), in the error, and in the failure probability.
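These quantities are straightforward to compute classically on small matrices. The following NumPy sketch (illustrative, not part of the quantum routines; the matrix is an arbitrary random example) verifies the relations λ_i = σ_i² and ‖A‖_F² = Σ_i λ_i:

```python
import numpy as np

# Toy data matrix A (n = 6 samples, m = 4 features).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Full SVD: A = U Sigma V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Factor scores lambda_i = sigma_i^2 are the eigenvalues of A^T A;
# factor score ratios lambda^(i) are their relative magnitudes.
factor_scores = s ** 2
factor_score_ratios = factor_scores / factor_scores.sum()

# The ratios sum to 1, and the squared Frobenius norm equals the sum of factor scores.
assert np.isclose(factor_score_ratios.sum(), 1.0)
assert np.isclose(np.linalg.norm(A, 'fro') ** 2, factor_scores.sum())
```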

Quantum preliminaries
We represent scalars as states of the computational basis of H_n, where n is the number of bits required for binary encoding. The quantum state corresponding to a vector v ∈ R^m is defined as the state-vector |v⟩ = (1/‖v‖) Σ_j^m v_j |j⟩. Note that to build |v⟩ we need ⌈log m⌉ qubits.
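The amplitude encoding above can be checked classically; the vector below is a hypothetical example:

```python
import numpy as np

v = np.array([3.0, 4.0, 0.0, 0.0])   # m = 4
amplitudes = v / np.linalg.norm(v)    # the state |v> has amplitudes v_j / ||v||
n_qubits = int(np.ceil(np.log2(len(v))))

assert np.isclose(np.sum(amplitudes ** 2), 1.0)  # valid quantum state
assert n_qubits == 2                             # log m qubits suffice
```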

Data access.
To access data in the form of state-vectors, we use the following definition of quantum access.
Definition 1 (Quantum access to a matrix) We have quantum access to a matrix A ∈ R^{n×m} if there exists a data structure that allows performing the mappings |i⟩|0⟩ → |i⟩|a_{i,·}⟩ = |i⟩ (1/‖a_{i,·}‖) Σ_j^m a_{ij}|j⟩, for all i, and |0⟩ → (1/‖A‖_F) Σ_i^n ‖a_{i,·}‖|i⟩, in time Õ(1).

By combining the two mappings we can create the state |A⟩ = (1/‖A‖_F) Σ_i^n Σ_j^m a_{ij} |i⟩|j⟩. Kerenidis and Prakash [29,30] have described one implementation of such quantum data access. Their implementation is based on a classical data structure such that the cost of updating/deleting/inserting one element of the matrix is poly-logarithmic in the number of its entries. In addition, their structure gives access to the Frobenius norm of the matrix and to the norms of its rows in time Õ(1). The cost of creating this data structure is Õ(nnz(A)). This input model requires the existence of a QRAM [15]. While there has been some skepticism about the possibility of error-correcting such a complex device, recent results show that bucket-brigade QRAMs are highly resilient to generic noise [22].
Sometimes it is desirable to normalize the input matrix to have a spectral norm smaller than one. Kerenidis and Prakash [30] provide an efficient routine to estimate the spectral norm.
Theorem 1 (Spectral norm estimation [30]) Let there be quantum access to the matrix A ∈ R^{n×m}, and let ε > 0 be a precision parameter. There exists a quantum algorithm that estimates ‖A‖ to additive error ε‖A‖_F in time Õ(1/ε).

If we have ‖A‖, we can create quantum access to A' = A/‖A‖ = U(Σ/σ_max)V^T in time O(nnz(A)) by dividing each entry of the data structure. Once we have quantum access to a dataset, it is possible to apply a pipeline of quantum machine learning algorithms for data representation, analysis, clustering, and classification [1,28,32,35,49,55]. Since the cost of each step of the pipeline should be evaluated independently, we consider O(nnz(A)) to be a pre-processing cost and do not include it in our run-times.
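A classical sketch of this normalization step (NumPy, for illustration only; the matrix is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

spec = np.linalg.norm(A, 2)   # spectral norm = sigma_max
A_norm = A / spec             # divide each entry of the data structure by ||A||

# The singular values are scaled by 1/sigma_max; the singular vectors are unchanged.
assert np.isclose(np.linalg.norm(A_norm, 2), 1.0)
```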
We conclude this section by stating a useful claim that connects errors on classical vectors with errors on quantum states.

Claim 2 (Closeness of state-vectors [30]) Let θ be the angle between vectors x, x̄ and assume that θ < π/2. Then ‖|x⟩ − |x̄⟩‖ ≤ √2 ‖x − x̄‖ / ‖x‖.

Useful subroutines
We state two relevant quantum linear algebra results: quantum singular value estimation (SVE) and quantum matrix-vector multiplication.
Theorem 3 (Singular value estimation [30]) Let there be quantum access to A ∈ R^{n×m}, with singular value decomposition A = Σ_i^r σ_i u_i v_i^T and r = min(n, m). Let ε > 0 be a precision parameter. It is possible to perform the mapping |b⟩ = Σ_i β_i|v_i⟩ → Σ_i β_i|v_i⟩|σ̄_i⟩, with |σ̄_i − σ_i| ≤ ε for all i, in time Õ(µ(A)/ε).

Unlike previous results based on Hamiltonian simulation [49], this algorithm enables performing conditional rotations using the singular values of a matrix without any special requirement (e.g., sparsity, being square, Hermitian, etc.). By choosing the matrix state |A⟩ = (1/‖A‖_F) Σ_i^r σ_i|u_i⟩|v_i⟩ as starting state |b⟩, we obtain a superposition of all the singular values entangled with the respective left and right singular vectors: (1/‖A‖_F) Σ_i^r σ_i|u_i⟩|v_i⟩|σ̄_i⟩. In this case, the requirement r = min(n, m) is not needed anymore, as |b⟩ = |A⟩ can be fully decomposed in terms of A's right singular vectors.
This algorithm uses phase estimation. In this work, we consider this algorithm to use a consistent version of phase estimation, so that the errors in the estimates of the singular values are consistent across multiple runs [30,54].
Theorem 4 (Matrix-vector multiplication [8], Lemmas 24, 25) Let there be quantum access to the matrix A ∈ R^{n×n}, with σ_max ≤ 1, and to a vector x ∈ R^n. Let ‖Ax‖ ≥ γ. There exists a quantum algorithm that creates a state |z⟩ such that ‖|z⟩ − |Ax⟩‖ ≤ ε in time Õ((µ(A)/γ) log(1/ε)), with probability at least 1 − 1/poly(n). Increasing the run-time by a multiplicative factor Õ(1/η), one can also retrieve an estimate of ‖Ax‖ to relative error η.
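The quantities in Theorem 4 can be illustrated classically: below, γ and the amplitudes of the ideal output state |Ax⟩ are computed for an arbitrary random matrix (an illustrative sketch, not the quantum procedure itself):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
A /= np.linalg.norm(A, 2)         # enforce sigma_max <= 1
x = rng.standard_normal(8)

Ax = A @ x
gamma = np.linalg.norm(Ax)        # the theorem requires ||Ax|| >= gamma > 0
z = Ax / gamma                    # amplitudes of the ideal output state |Ax>

assert gamma > 0
assert np.isclose(np.linalg.norm(z), 1.0)
```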

Data output.
Finally, to read out the quantum states, we state one version of amplitude amplification and estimation, and two state-vector tomographies.

Novel quantum methods
Building from the previous section's techniques, we formalize a series of quantum algorithms that allow us to retrieve a classical description of the singular value decomposition of a matrix to which we have quantum access.

Estimating the quality of the representation
Algorithms such as principal component analysis and correspondence analysis are often used for visualization or dimensionality reduction purposes. These applications work better when a small subset of factor scores has high factor score ratios. We provide a fast procedure that allows verifying whether this is the case: given efficient quantum access to a matrix A ∈ R^{n×m}, it retrieves the most relevant singular values, factor scores, and factor score ratios in time poly-logarithmic in the number of elements of A, with no strict dependencies on its rank. The main intuition behind this algorithm is that it is possible to create the state (1/‖A‖_F) Σ_i^r σ_i |u_i⟩|v_i⟩|σ̄_i⟩. The third register, when measured in the computational basis, outputs the estimate σ̄_i of a singular value with probability equal to its factor score ratio λ^{(i)} = σ_i²/‖A‖_F². This enables sampling the singular values of A directly from the factor score ratios' distribution. When a matrix has a huge number of small singular values and only a few very big ones, the ones with the greatest factor score ratios will appear many times during the measurements, while the negligible ones are not likely to be measured. This intuition has already appeared in the literature [6,20]. Nevertheless, the analysis and the problems solved in those works are different from ours. In the context of data representation and analysis, this intuition has only been sketched for sparse or low-rank square symmetric matrices by Lloyd et al [44], without a precise formalization. We thoroughly formalize it for any real matrix.
Algorithm 1 Quantum factor score ratio estimation.
Input: Quantum access to a matrix A ∈ R^{n×m}. Two precision parameters γ, ε ∈ R_{>0}. Output: An estimate of the factor score ratios λ^{(i)} > γ, with estimates of the corresponding singular values and factor scores.
1: for t = 1 to N do
2: Prepare the state (1/‖A‖_F) Σ_i^n Σ_j^m a_{ij} |i⟩|j⟩.
3: Apply SVE to get (1/‖A‖_F) Σ_i^r σ_i |u_i⟩|v_i⟩|σ̄_i⟩.
4: Measure the last register and store it in a set data structure.
5: end for
6: For each σ̄_i measured, output σ̄_i, its factor score λ̄_i = σ̄_i², and its factor score ratio λ̄^{(i)}, estimated as the relative frequency of σ̄_i in the measurements.

Theorem 8 (Quantum factor score ratio estimation) Let there be quantum access to a matrix A ∈ R^{n×m}, with singular value decomposition A = Σ_i σ_i u_i v_i^T. Let γ, ε be precision parameters. There exists a quantum algorithm that runs in time Õ((1/γ²) µ(A)/ε) and estimates:
• all the factor score ratios λ^{(i)} > γ, such that |λ^{(i)} − λ̄^{(i)}| ≤ γ, with probability at least 1 − 1/poly(r);
• the corresponding singular values σ_i, such that |σ_i − σ̄_i| ≤ ε, with probability at least 1 − 1/poly(n);
• the corresponding factor scores λ_i, such that |λ_i − λ̄_i| ≤ 2ε√λ_i, with probability at least 1 − 1/poly(n).
The proof consists in bounding the run-time, the error, and the probability of failure of Algorithm 1.
Proof By the definition of quantum access, the cost of step 2 is Õ(1). The singular value estimation in step 3 can be performed using Theorem 3 in time Õ(µ(A)/ε), such that |σ̄_i − σ_i| ≤ ε with probability at least 1 − 1/poly(n). A measurement of the third register at step 4 outputs an estimate σ̄_i with probability λ^{(i)} = σ_i²/‖A‖_F². Theorem 7 guarantees that with O(1/γ²) measurements we can get estimates λ̄^{(i)} such that |λ̄^{(i)} − λ^{(i)}| ≤ γ. In particular, Kerenidis et al [33] estimate that N = 36 log(r)/γ² measurements should suffice for our goal. Alternatively, we could consider the measurement process as performing r Bernoulli trials, one for each λ^{(i)}: if we measure σ̄_i, it is a success for the i-th Bernoulli trial and a failure for all the others. Given a confidence level z, it is possible to use the Wald confidence interval to determine a value for N such that |λ^{(i)} − ζ_{σ̄_i}/N| ≤ γ with confidence level z, where ζ_{σ̄_i} is the number of times that σ̄_i has appeared in the measurements. In this case, it suffices to choose N = z²/(4γ²) [52, Section 5.1.3]. Having |λ̄^{(i)} − λ^{(i)}| ≤ γ means measuring all the σ̄_i whose factor score ratio is greater than γ. We now proceed with the error analysis. From the measured σ̄_i, we can compute λ̄_i = σ̄_i².
If we keep the error analysis at the first order and consider that σ_i = √λ_i, we can conclude the bound: |λ̄_i − λ_i| = |σ̄_i² − σ_i²| ≤ (σ_i + ε)² − σ_i² ≈ 2εσ_i = 2ε√λ_i.

The parameter γ controls how big a factor score ratio should be for the corresponding singular value/factor score to be measured. If we choose γ bigger than the smallest factor score ratio of interest, the estimates for the smaller ones are likely to be 0, as |λ^{(i)} − 0| ≤ γ would be a plausible estimate.
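The sampling process at the heart of Algorithm 1 can be simulated classically. The sketch below (illustrative; the matrix and seed are arbitrary) draws N = 36 log(r)/γ² samples from the factor score ratios' distribution and checks the additive-error guarantee:

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((50, 8))
s = np.linalg.svd(A, compute_uv=False)
ratios = s**2 / np.sum(s**2)        # factor score ratios lambda^(i)

gamma = 0.1
r = len(s)
N = int(36 * np.log(r) / gamma**2)  # number of shots suggested by Kerenidis et al.

# Each "measurement" of the third register returns sigma_i with prob lambda^(i);
# the frequencies are an estimate of the factor score ratios.
counts = rng.multinomial(N, ratios)
est_ratios = counts / N

# All ratios are recovered to additive error gamma (with high probability).
assert np.all(np.abs(est_ratios - ratios) <= gamma)
```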
Often in data representations, the cumulative sum of the factor score ratios is a measure of the quality of the representation. By slightly modifying Algorithm 1 to use Theorem 7, it is possible to estimate this sum such that |Σ_i^k λ^{(i)} − Σ_i^k λ̄^{(i)}| ≤ kγ with probability 1 − 1/poly(r). However, a slight variation of Algorithm IV.3 for spectral norm estimation in Kerenidis and Prakash [30] provides a more accurate estimation in less time, given a threshold θ for the smallest singular value to retain.
Algorithm 2 Quantum check on the factor score ratios' sum.
Input: Quantum access to a matrix A ∈ R^{n×m}. A threshold parameter θ. Two precision parameters ε, η ∈ R_{>0}, such that the greatest singular value smaller than θ is not more than ε distant from it. Output: An estimate p̄ of the factor score ratios' sum p = Σ_{i:σ_i≥θ} λ^{(i)}.

Theorem 9 (Quantum check on the factor score ratios' sum) Let there be quantum access to a matrix A ∈ R^{n×m}, with singular value decomposition A = Σ_i σ_i u_i v_i^T. Let η, ε be precision parameters, and θ be a threshold for the smallest singular value to consider. There exists a quantum algorithm that estimates p = Σ_{i:σ_i≥θ} λ^{(i)}, such that |p̄ − p| ≤ η, in time Õ(µ(A)/(ε η √p)).

Proof As discussed in the previous proof, the cost of preparing the state and applying SVE at step 2 is Õ(µ(A)/ε). The complexity of step 3 is Õ(1), as it is an arithmetic operation that only depends on the encoding of |σ̄_i⟩. Step 4 consists in uncomputing step 2 and has its same cost. Finally, the cost of amplitude estimation, with relative precision η, on the last register being |0⟩ is O(T(U) · 1/(η√p)), where T(U) is the cost of steps 2–4 and p = Σ_{i:σ_i≥θ} σ_i² / Σ_j^r σ_j² is the probability of measuring |0⟩ (Theorem 5). The overall complexity is proven: Õ(µ(A)/(ε η √p)). Since the sum of factor score ratios p is a measure of the representation quality, in problems such as PCA, CA, and LSA, this is usually a constant bigger than 0 (i.e., often in practice, p ∈ [0.3, 1]). This makes the term 1/√p negligible in most practical applications. Moreover, we can further modify Algorithm IV.3 to perform a binary search of θ given the desired sum of factor score ratios.
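The quantity p estimated by Theorem 9 is classically computable on small instances; in this illustrative sketch the threshold θ is arbitrarily set to the median singular value:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 10))
s = np.linalg.svd(A, compute_uv=False)

theta = np.median(s)                          # hypothetical threshold
p = np.sum(s[s >= theta]**2) / np.sum(s**2)   # sum of retained factor score ratios

assert 0.0 <= p <= 1.0
assert p >= 0.5      # the retained (larger) singular values dominate the spectrum
```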
Algorithm 3 Quantum binary search for the singular value threshold.
Input: Quantum access to a matrix A ∈ R^{n×m}. The desired factor score ratios' sum p ∈ [0, 1]. Two precision parameters ε, η ∈ R_{>0}. Output: A threshold θ such that the retained factor score ratios sum to p, to precision η, or −1 if no such threshold exists. At each iteration of the search: prepare the state |A⟩ and apply SVE to get (1/‖A‖_F) Σ_i^r σ_i |u_i⟩|v_i⟩|σ̄_i⟩; then estimate the retained sum for the current candidate threshold as in Algorithm 2.
Theorem 10 (Quantum binary search for the singular value threshold [30]) Let there be quantum access to a matrix A ∈ R^{n×m}. Let η, ε be precision parameters, and let p ∈ [0, 1] be the factor score ratios' sum to retain. There exists a quantum algorithm that finds a threshold θ for the smallest singular value to consider, such that the sum of the retained factor score ratios is p up to error η, in time Õ((µ(A)/(ε η √p)) log(1/ε)). The proof consists in proving the correctness and the run-time of Algorithm 3.
Proof The algorithm searches for θ through a candidate τ ∈ [0, 1]. The search is performed using sign(p̄_τ − p) as an oracle that tells us whether to update the lower or the upper bound for τ.

The algorithm terminates when |p̄_τ − p| ≤ η/2, or when it is not possible to update τ anymore (i.e., there are not enough qubits to express the next τ). In this last case, there is no θ that satisfies the requirements and the algorithm returns −1.

In the first case, instead, we need to guarantee that |p − Σ_{i:σ_i≥θ} λ^{(i)}| = |p − p_τ| ≤ η. Since we run amplitude estimation with additive error η/2, we have |p̄_τ − p_τ| ≤ η/2, and we require |p̄_τ − p| ≤ η/2 to stop. These two conditions entail |p_τ − p| ≤ η. If we want θ to be comparable with the singular values of A and use τ for the binary search, we have to use Theorem 3 with error ε. The maximum number of updates of τ is bounded by the number of qubits that we use to store the singular values σ̄_i. This is given by the logarithm of the error used in step 6, and is O(log(1/ε)).

The run-time of this algorithm is bounded by the number of binary-search iterations times the cost of Theorem 9: Õ((µ(A)/(ε η √p)) log(1/ε)).
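The binary search of Algorithm 3 can be simulated classically. The sketch below (illustrative, with hypothetical parameter values) mirrors the stopping rule |p̄_τ − p| ≤ η/2 and the −1 failure case:

```python
import numpy as np

def ratio_sum(s, theta):
    """p(theta): sum of the factor score ratios retained above threshold theta."""
    return np.sum(s[s >= theta]**2) / np.sum(s**2)

def find_threshold(s, p_target, eta=0.01, max_iter=60):
    """Binary search on tau in [0, 1]; the threshold is theta = tau * sigma_max."""
    lo, hi = 0.0, 1.0
    for _ in range(max_iter):
        tau = (lo + hi) / 2
        p_tau = ratio_sum(s, tau * s.max())
        if abs(p_tau - p_target) <= eta / 2:
            return tau * s.max()
        if p_tau > p_target:
            lo = tau        # retaining too much: raise the threshold
        else:
            hi = tau        # retaining too little: lower the threshold
    return -1               # no threshold meets the target within eta/2

rng = np.random.default_rng(4)
s = np.sort(np.abs(rng.standard_normal(100)))[::-1]
theta = find_threshold(s, p_target=0.8, eta=0.05)
if theta != -1:
    assert abs(ratio_sum(s, theta) - 0.8) <= 0.05
```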
Using the quantum counting algorithms of Brassard et al [4] after step 3 of Algorithm 2, it is possible to count the number of singular values retained by a certain threshold θ.
Corollary 11 (Quantum reduced rank estimation) Let there be quantum access to a matrix A ∈ R^{n×m}, with singular value decomposition A = Σ_i^r σ_i u_i v_i^T and rank r. Let ε be a precision parameter, and θ be a threshold for the smallest singular value to consider. There exists a quantum algorithm that computes the exact number k of singular values such that σ_i ≥ θ. Similarly, given a parameter η, it is possible to produce an estimate k̄ such that |k − k̄| ≤ ηk.

Estimating the number of singular values retained by θ is helpful. When the singular values are dense around θ, this corollary, together with Theorem 9, can help the analyst evaluate trade-offs between a big p and a small k. On the one hand, the bigger p is, the more information on the dataset one can retain. On the other hand, the bigger k is, the slower the algorithms in the next section will be.

Algorithm 4 Quantum top-k singular vectors extraction.
Input: Quantum access to a matrix A ∈ R^{n×m}. A threshold θ that captures the top-k singular values. Two precision parameters δ, ε ∈ R_{>0}. Output: The top-k singular vectors, such that ‖u_i − ū_i‖ ≤ δ (resp. ‖v_i − v̄_i‖ ≤ δ), and, optionally, the singular values to error ε.
1: Prepare the state |A⟩ = (1/‖A‖_F) Σ_i^r σ_i |u_i⟩|v_i⟩.
2: Apply SVE to get (1/‖A‖_F) Σ_i^r σ_i |u_i⟩|v_i⟩|σ̄_i⟩.
3: Append a quantum register |0⟩ to the state and set it to |1⟩ if σ̄_i < θ.
4: Perform amplitude amplification on |0⟩, to get the state (1/√(Σ_{i:σ̄_i≥θ} σ_i²)) Σ_{i:σ̄_i≥θ} σ_i |u_i⟩|v_i⟩|σ̄_i⟩.
5: Append a second ancillary register |0⟩ and perform the controlled rotation |σ̄_i⟩|0⟩ → |σ̄_i⟩((C/σ̄_i)|0⟩ + √(1 − C²/σ̄_i²)|1⟩), where C is a normalization constant.
6: Perform amplitude amplification again on |0⟩ to get the uniform superposition (1/√k) Σ_{i:σ̄_i≥θ} |u_i⟩|v_i⟩|σ̄_i⟩.
7: Measure the last register and, according to the measured |σ̄_i⟩, apply state-vector tomography on |u_i⟩ for the i-th left singular vector or on |v_i⟩ for the right one.
8: Repeat steps 1–7 until the tomography requirements are met.
9: Output the k singular vectors ū_i or v̄_i and, optionally, the singular values σ̄_i.

Extracting the SVD representation
After introducing the procedures to test for the most relevant singular values, factor scores, and factor score ratios of A, we present a routine to extract the corresponding right/left singular vectors. The inputs of this algorithm, other than the matrix, are a parameter δ for the precision of the singular vectors, a parameter ε for the precision of the singular value estimation, and a threshold θ to discard the uninteresting singular values/vectors. The output guarantees a unit estimate x̄_i of each singular vector x_i such that ‖x_i − x̄_i‖ ≤ δ, ensuring that the estimate has a similar orientation to the original vector. Additionally, this subroutine can provide an estimate of the singular values greater than θ, to absolute error ε.
Theorem 12 (Top-k singular vectors extraction) Let there be efficient quantum access to a matrix A ∈ R^{n×m}, with singular value decomposition A = Σ_i^r σ_i u_i v_i^T. Let δ > 0 be a precision parameter for the singular vectors, ε > 0 a precision parameter for the singular values, and θ > 0 a threshold such that A has k singular values greater than θ. Define p = Σ_{i:σ_i≥θ} σ_i² / Σ_j^r σ_j². There exist quantum algorithms that estimate:
• the top k left singular vectors, with unit estimates ū_i such that ‖u_i − ū_i‖_2 ≤ δ, in time Õ((µ(A)/θ)(kn/(δ²√p)));
• the top k right singular vectors, with unit estimates v̄_i such that ‖v_i − v̄_i‖_2 ≤ δ, in time Õ((µ(A)/θ)(km/(δ²√p)));
• the top k singular values σ̄_i, factor scores λ̄_i, and factor score ratios λ̄^{(i)}, with the guarantees of Theorem 8, either in time Õ((µ(A)/(εθ))(k/√p)) or during any of the two procedures above.
The proof consists in proving the time complexity and the error of Algorithm 4.
Proof Like in the previous proofs, the cost of preparing the state at step 2 and applying SVE is Õ(µ(A)/ε); the cost of step 4 is the one of amplitude amplification, which introduces a multiplicative factor Õ(1/√p).

Step 5 is a conditional rotation and, similarly to step 3, it has a negligible cost. The next step to analyze is the amplitude amplification at step 6. The constant C is a normalization factor in the order of O(1/κ(A^{(k)})), where κ(A^{(k)}) = σ_max/σ_min is the condition number of the low-rank matrix A^{(k)}. Since by construction σ_min ≥ θ, we can bound the condition number as κ(A^{(k)}) ≤ σ_max/θ. From the famous work of Harrow, Hassidim and Lloyd [23], we know that applying amplitude amplification on the state above, with the third register being |0⟩, costs a multiplicative factor O(κ(A^{(k)})) = O(1/θ), recalling that σ_max ≤ 1. This last amplitude amplification leaves the registers in the state (1/√(Σ_{i:σ̄_i≥θ}(σ_i/σ̄_i)²)) Σ_{i:σ̄_i≥θ} (σ_i/σ̄_i)|u_i⟩|v_i⟩|σ̄_i⟩, where σ̄_i ∈ [σ_i − ε, σ_i + ε] and σ_i/σ̄_i → 1 for ε → 0. When measuring the last register of the state at step 6 in the computational basis, we measure a |σ̄_i⟩ and the first two registers collapse into the state |u_i⟩|v_i⟩. It is possible to perform state-vector tomography on |u_i⟩|v_i⟩, using Theorem 6 on the first register to retrieve ū_i, or on the second one to retrieve v̄_i. The costs are O(n log n/δ²) and O(m log m/δ²), respectively. Using a coupon collector's argument [13], if the k states |σ̄_i⟩ are uniformly distributed, to get all the k possible couples |u_i⟩|v_i⟩ at least once we would need k log k measurements on average. This proves that it is possible to estimate all the singular values, factor scores, and factor score ratios with the guarantees of Theorem 3 in time Õ((µ(A)/(εθ))(k/√p)). To perform tomography on each state-vector, one should satisfy the coupon collector the same number of times as the measurements needed by the tomography procedure. The costs of the tomography for all the left and right singular vectors are O(kn log n/δ²) and O(km log m/δ²), respectively. Therefore, choosing the SVE precision ε in O(θ), the following complexities follow: Õ((µ(A)/θ)(kn/(δ²√p))) for the left singular vectors and Õ((µ(A)/θ)(km/(δ²√p))) for the right ones. In the appendix, Section A.3, we provide experiments that show that the coupon collector's argument of Eq. 4 is accurate in practical cases. Besides 1/√p being negligible, it is interesting to note that the parameter θ can be computed using: 1. the procedures of Theorems 8, 9; 2. the binary search of Theorem 10; 3.
the available literature on the type of data stored in the input matrix A. About the latter, the original paper of latent semantic indexing [12] states that the first k = 100 singular values are enough for a good representation. We believe that, in the same way, fixed thresholds θ can be defined for different machine learning applications. The experiments of Kerenidis and Luongo [28] on the run-time parameters of the polynomial expansions of the MNIST dataset support this expectation: even though in qSFA they keep the k smallest singular values and refer to θ as the biggest singular value to retain, this value does not vary much when the dimensionality of their dataset grows.
In our experiments, we observe that different datasets for image classification have similar θs. For completeness, we also state a different version of Theorem 12, with ℓ∞ guarantees on the vectors.

Corollary 13 (Fast top-k singular vectors extraction)
The run-times of Theorem 12 can be reduced by a factor n (resp. m), with the estimation guarantees on the vectors stated in the ℓ∞ norm: ‖u_i − ū_i‖_∞ ≤ δ (resp. ‖v_i − v̄_i‖_∞ ≤ δ).
Proof The proof consists in using ℓ∞ tomography (Theorem 7) at step 7 of Algorithm 4.
Note that, given a vector with d non-zero entries, performing ℓ∞ tomography with error δ/√d provides the same guarantees as ℓ2 tomography with error δ. This implies that the extraction of the singular vectors with ℓ2 guarantees can be faster if we can make assumptions on their sparsity:
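This relation between ℓ∞ and ℓ2 guarantees can be checked numerically (an illustrative sketch with arbitrary d and δ):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 50
delta = 0.1

x = rng.standard_normal(d)
# Per-entry (l_inf) error of at most delta / sqrt(d) on d non-zero entries ...
noise = rng.uniform(-1, 1, d) * (delta / np.sqrt(d))
x_bar = x + noise

# ... implies an l_2 error of at most delta.
assert np.max(np.abs(x_bar - x)) <= delta / np.sqrt(d)
assert np.linalg.norm(x_bar - x) <= delta
```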

Applications to machine learning
The new quantum procedures can be used for principal component analysis, correspondence analysis, and latent semantic analysis. Besides extracting the orthogonal factors and measuring their importance, we provide a procedure to represent the data in PCA's reduced feature space on a quantum computer. In a similar way, it is possible to compute the representations of CA and LSA.

Principal Component Analysis
Principal component analysis is a widely-used multivariate statistical method for continuous variables with applications in machine learning. Its uses range from outlier detection to dimensionality reduction and data visualization. Given a matrix A ∈ R n×m storing information about n data points with m coordinates, its principal components are the set of orthogonal vectors along which the variance of the data points is maximized. The goal of PCA is to compute the principal components with the amount of variance they capture and rotate the data points to express their coordinates along the principal components. It is possible to represent the data using only the k coordinates that express the most variance for dimensionality reduction.

PCA Model
The model of PCA is closely related to the singular value decomposition of the data matrix A, shifted to row mean 0. The model consists of the principal components and the amount of variance they explain. The principal components coincide with the right singular vectors v_i, the factor scores λ_i = σ_i² represent the amount of variance along each of them, and the factor score ratios λ^{(i)} = λ_i / Σ_j^r λ_j express the percentage of retained variance. For datasets with mean 0, the transformation consists in a rotation along the principal components: Y = AV = UΣV^T V = UΣ ∈ R^{n×m}. When performing dimensionality reduction, it suffices to use the top k singular values and vectors.
Using the procedures from Section 3, it is possible to extract the model for principal component analysis. In particular, Theorems 8, 9, and 10 allow retrieving information on the factor scores and on the factor score ratios, while Theorem 12 allows extracting the principal components. The run-time of the model extraction is the sum of the run-times of the theorems: Õ((1/γ² + k(n+m)/(θδ²)) µ(A)). The model comes with the guarantees of those theorems: factor score ratios to error γ, singular values to error ε, and unit principal components to error δ. This run-time is generally smaller than the number of elements of the input data matrix, providing polynomial speed-ups over the best classical routines for non-sparse matrices. In writing the time complexity of the routines, we have omitted the term 1/√p because usually p is chosen to be a number greater than 0.5 (generally in the order of 0.8–0.9).
When performing dimensionality reduction, the goal is to obtain the matrix Y = UΣ ∈ R^{n×k}, where U ∈ R^{n×k} and Σ ∈ R^{k×k} are composed respectively of the top k left singular vectors and singular values. In Lemma 14, we provide a theoretical error bound for Y, using the estimated entries of U and Σ. For the sake of completeness, the error bound is also stated for VΣ. These bounds stand regardless of how the singular values and vectors are extracted and hold when the multiplication is done with a classical computer.
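The classical computation of Y = UΣ for dimensionality reduction, and its equivalence with projecting onto the top-k principal components, can be sketched as follows (the matrix and k are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 20))
A = A - A.mean(axis=0)          # shift the rows to mean 0

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
Y = U[:, :k] * s[:k]            # Y = U Sigma, the rotated coordinates

# Equivalent view: projecting the data onto the top-k principal components.
assert np.allclose(Y, A @ Vt[:k].T)
```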
Lemma 14 (Accuracy of ŪΣ̄ and V̄Σ̄) Let A ∈ R^{n×m} be a matrix. Given some approximate procedures to retrieve estimates σ̄_i of the singular values σ_i, such that |σ_i − σ̄_i| ≤ ε, and unit estimates ū_i of the left singular vectors u_i, such that ‖u_i − ū_i‖_2 ≤ δ, the error on UΣ can be bounded as ‖UΣ − ŪΣ̄‖_F ≤ √k(δσ_max + ε). We prove this result for ‖UΣ − ŪΣ̄‖_F; the proof for ‖VΣ − V̄Σ̄‖_F is analogous.
Proof We first bound the error on the columns. Because of the triangular inequality, ‖σ_i u_i − σ̄_i ū_i‖ ≤ σ_i‖u_i − ū_i‖ + |σ_i − σ̄_i|‖ū_i‖ ≤ δσ_max + ε. Since √x is a monotone increasing function, it is possible to prove: ‖UΣ − ŪΣ̄‖_F = √(Σ_i^k ‖σ_i u_i − σ̄_i ū_i‖²) ≤ √(k(δσ_max + ε)²) = √k(δσ_max + ε).

Using the matrix-vector multiplication of Theorem 4, we can obtain algorithms that produce quantum states proportional to the data representation in the new feature space. Having quantum access to V^{(k)} ∈ R^{m×k}, these routines create the new data points in almost constant time and are helpful when chained to other quantum machine learning algorithms that need to be executed multiple times.
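The bound of Lemma 14 can be verified numerically; in this illustrative sketch the estimates are produced by adding controlled perturbations to the exact SVD (the matrix, ε, and δ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, k = 60, 30, 8
A = rng.standard_normal((n, m))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s = U[:, :k], s[:k]

eps, delta = 0.01, 0.05
# Perturb singular values by at most eps, and columns along unit random
# directions by delta/2, renormalizing so the estimates stay unit vectors.
s_bar = s + rng.uniform(-eps, eps, k)
G = rng.standard_normal((n, k))
G /= np.linalg.norm(G, axis=0)
U_bar = U + G * (delta / 2)
U_bar /= np.linalg.norm(U_bar, axis=0)

# The column estimates satisfy ||u_i - u_bar_i|| <= delta, hence the lemma holds.
assert np.all(np.linalg.norm(U_bar - U, axis=0) <= delta)
err = np.linalg.norm(U_bar * s_bar - U * s, 'fro')
assert err <= np.sqrt(k) * (delta * s.max() + eps)
```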
Corollary 15 (Quantum PCA: vector dimensionality reduction) Let ξ be a precision parameter. Let there be efficient quantum access to unit estimates of the top k right singular vectors V^{(k)} ∈ R^{m×k}, with precision δ = ξ/√k. Given quantum access to a row a_i of A, there exists a quantum algorithm that creates the quantum state |ȳ_i⟩, proportional to the representation of a_i in the reduced feature space, such that ‖|ȳ_i⟩ − |y_i⟩‖ ≤ ξ. An estimate of ‖y_i‖, to relative error η, can be computed with a multiplicative overhead O(1/η).

Proof It is possible to use Theorem 4 to multiply the quantum state |a_i⟩ by V^{(k)T}, appropriately padded with 0s to be a square R^{m×m} matrix. In this way, we can create the state |ȳ_i⟩. For what concerns the error, we start by bounding ‖ȳ_i − y_i‖ and then use Claim 2 to bound the error on the quantum states. Assume we have estimates v̄_i of the right singular vectors such that ‖v_i − v̄_i‖ ≤ δ. Considering that ‖ȳ_i − y_i‖ = ‖V̄^{(k)T}a_i − V^{(k)T}a_i‖ ≤ √k δ ‖a_i‖, setting δ = ξ/√k and applying Claim 2 concludes the bound. This result also holds when a_i is a previously unseen data point, not necessarily stored in A. Note that, from the row orthogonality of V^{(k)T}, it follows that ‖y_i‖ = ‖V^{(k)T}a_i‖ ≤ ‖a_i‖. Furthermore, ‖y_i‖/‖a_i‖ is expected to be close to 1, as it is the percentage of support of a_i on the new feature space spanned by V^{(k)}. We formalize this better using Definition 2 below. It is known that, in practical machine learning datasets, α is a number fairly close to 1. We have tested the value of α for the MNIST, Fashion MNIST, and CIFAR-10 datasets, finding values over 0.85 for any p ∈ (0, 1].
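The quantity α = ‖y_i‖/‖a_i‖ can be computed classically for a given dataset; an illustrative sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((200, 30))
A = A - A.mean(axis=0)

_, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
Vk = Vt[:k].T                  # top-k right singular vectors, m x k

a = A[0]                       # one data row
y = Vk.T @ a                   # its coordinates in the reduced feature space
alpha = np.linalg.norm(y) / np.linalg.norm(a)

# Column orthonormality of Vk implies ||y|| <= ||a||, i.e. alpha in (0, 1].
assert 0.0 < alpha <= 1.0 + 1e-9
```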
The next corollary shows how to perform dimensionality reduction on the whole matrix, enabling quantum access to the data matrix in the reduced feature space.
Corollary 17 (Quantum PCA: matrix dimensionality reduction) Let ξ be a precision parameter and p be the amount of variance retained after the dimensionality reduction. Let there be efficient quantum access to A = UΣV^T ∈ R^{n×m} and to estimates of its top k right singular vectors. There exists a quantum algorithm that, with probability at least 1 − 1/poly(m), creates the state |Ȳ⟩, proportional to the dataset in the reduced feature space, such that ‖|Ȳ⟩ − |Y⟩‖ ≤ ξ.

Proof Here with V we denote V^{(k)} ∈ R^{m×k}. Using the same reasoning as in the proof above and giving a closer look at the proof of Theorem 4 (Lemma 24 [8]), we see that it is possible to create the state |0⟩(V̄^T/µ(V̄))|a_i⟩ + |0⟩^⊥ in time Õ(1), and that the term 1/√p is introduced to boost the probability of getting the right state. Indeed, if we apply Theorem 4 without the amplitude amplification step to the superposition of the rows of A, we obtain the following mapping in time Õ(1): (1/‖A‖_F) Σ_i^n ‖a_{i,·}‖|i⟩|a_{i,·}⟩ → (1/‖A‖_F) Σ_i^n ‖a_{i,·}‖|i⟩(|0⟩(V̄^T/µ(V̄))|a_{i,·}⟩ + |y_{i,·}^⊥⟩), where the y_{i,·}^⊥ are normalization factors. Keeping in mind that ‖A‖_F = √(Σ_i^r σ_i²) and that the rows' squared support on the reduced feature space sums to p‖A‖_F², we can set δ = ξ√(p/2k) to conclude the proof.

The error requirements of the two corollaries propagate to the run-time of the model extraction in the following way: the whole model can be extracted in time Õ(µ(A)k²m/(θξ²)).

Proof The procedure to train the model consists in using Theorem 10 or 8 to extract the threshold θ, given the amount of variance p to retain, and in leveraging Theorem 12 to extract the k right singular vectors that compose V ∈ R^{m×k}. The run-times of Theorems 10 and 8 are smaller than the one of Theorem 12, so we can focus on the last one.
Substituting δ = ξ√(p/2k) in the run-time of Theorem 12, we get Õ(µ(A)k²m/(p^{3/2}θξ²)). If we consider p to be a reasonable number (e.g., at least greater than 0.05), we can treat it as a constant factor that is independent of the input's size. The asymptotic run-time is proven to be Õ(µ(A)k²m/(θξ²)). When training the model for Corollary 17, the run-time has a dependency on 1/p^{3/2}. However, this term is constant and independent of the size of the input dataset. With this additional 1/p^{3/2} cost, the error of Corollary 15 drops to ξ for every row of the matrix and generally decreases in the case of new data points.
Using the same framework and proof techniques, it is possible to produce similar results for the representations of CA and LSA.
Remark: Note that Yu et al [62, Theorem 1] propose a lower bound for a quantity similar to our α. However, their result seems to be a loose bound: using their notation and setting η = 1 and θ = 1, they bound this quantity by 0, while a tight bound should give 1.

Correspondence analysis
Correspondence analysis is a multivariate statistical tool from the family of factor analysis methods. It is used to explore relationships among categorical variables. Given two random variables, X and Y, with possible outcomes in {x_1, ..., x_n} and {y_1, ..., y_m}, the model of correspondence analysis enables representing the outcomes as vectors in two related Euclidean spaces. These vectors can be used for data visualization, exploration, and other unsupervised machine learning tasks.

Model
Given a contingency table for X and Y (see Section 2), it is possible to compute the matrix A = D_X^{-1/2}(P̂_{X,Y} − p̂_X p̂_Y^T)D_Y^{-1/2}, where P̂_{X,Y} ∈ R^{n×m} is the estimated matrix of joint probabilities, p̂_X ∈ R^n and p̂_Y ∈ R^m are the vectors of marginal probabilities, and D_X = diag(p̂_X), D_Y = diag(p̂_Y). The computation of A requires time linear in the number of nonzero entries of the contingency table. The singular value decomposition of A is strictly related to the model of correspondence analysis [18,26]. The new coordinates of X's outcomes are given by the rows of D_X^{-1/2}U ∈ R^{n×k}, while those of Y by the rows of D_Y^{-1/2}V ∈ R^{m×k}. Like in PCA, it is possible to choose only a subset of the orthogonal factors as coordinates for the representation. Factor scores and factor score ratios measure how much "correspondence" is captured by the respective orthogonal factor, giving an estimate of the quality of the representation.
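For concreteness, the classical construction above can be sketched as follows. This is a minimal numpy sketch assuming the standard CA construction from the contingency table; the function name and toy table are illustrative only.

```python
import numpy as np

def ca_coordinates(N, k):
    """Correspondence analysis of an n x m contingency table N.
    Returns the k-dimensional coordinates of X's and Y's outcomes."""
    P = N / N.sum()                            # estimated joint probabilities
    px, py = P.sum(axis=1), P.sum(axis=0)      # marginal probabilities
    dx, dy = 1.0 / np.sqrt(px), 1.0 / np.sqrt(py)
    # Standardized residuals: A = D_X^{-1/2} (P - p_X p_Y^T) D_Y^{-1/2}
    A = dx[:, None] * (P - np.outer(px, py)) * dy[None, :]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X_coords = dx[:, None] * U[:, :k]          # rows of D_X^{-1/2} U
    Y_coords = dy[:, None] * Vt[:k].T          # rows of D_Y^{-1/2} V
    return X_coords, Y_coords, s[:k]

rng = np.random.default_rng(1)
N = rng.integers(1, 20, size=(6, 5)).astype(float)
Xc, Yc, s = ca_coordinates(N, 2)
```

The singular values returned here are the canonical correlations of the table, so they lie in [0, 1]; their squares give the factor scores.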
Similarly to what we have already discussed, it is possible to extract the model for CA by creating quantum access to the matrix A and using Theorems 8, 9, and 12 to extract the orthogonal factors, the factor scores, and the factor score ratios in time O((1/γ² + k(n+m)/(θδ²))µ(A)). We provide a theoretical bound for the data representations in Lemma 19.
Lemma 19 (Accuracy of the CA representation) Given some approximate procedure to retrieve unit estimates ū_i of the left singular vectors u_i such that ‖ū_i − u_i‖ ≤ δ, the error on D_X^{-1/2}U can be bounded as ‖D_X^{-1/2}Ū − D_X^{-1/2}U‖_F ≤ √k δ ‖D_X^{-1/2}‖.
Proof It suffices to note that ‖D_X^{-1/2}Ū − D_X^{-1/2}U‖_F ≤ ‖D_X^{-1/2}‖ ‖Ū − U‖_F ≤ √k δ ‖D_X^{-1/2}‖.

Latent semantic analysis
Latent semantic analysis is a data representation method used to represent words and text documents as vectors in Euclidean spaces. Using these vector spaces, it is possible to compare terms, documents, and terms and documents. LSA spaces automatically model synonymy and polysemy [12], and their applications in machine learning range from topic modeling to document clustering and retrieval.

Model
The input of LSA is a contingency table of n words and m documents, A ∈ R^{n×m}. Inner products of rows are a measure of word similarity and can be computed at once as AA^T = UΣ²U^T. Inner products of columns, A^TA = VΣ²V^T, are a measure of document similarity, and the a_ij entry of A = UΣV^T is a measure of similarity between word i and document j. We can use the SVD to express words and documents in new spaces where we can compare them with respect to this similarity measure. In particular, we can compute: 1. a representation for word comparisons, UΣ ∈ R^{n×k}; 2. a representation for document comparisons, VΣ ∈ R^{m×k}; 3. two representations for word and document comparisons, UΣ^{1/2} ∈ R^{n×k} and VΣ^{1/2} ∈ R^{m×k}. When using LSA for document indexing, as in a search engine, we need to represent the query as a vector in the document space. In this case, instead of increasing A's size and recomputing the document space, the new vector can be expressed as v_q^T = x_q^T UΣ^{-1}, where x_q ∈ R^n is obtained using the same criteria used to store a document in A. The representation of the query can then be used to compare it with the other documents in the document representation space. Finally, factor score ratios play an important role in LSA too. For instance, the columns of V can be seen as latent topics of the corpus, and the importance of each topic is proportional to the corresponding factor score ratio. This paragraph only stresses how computing the SVD of A is connected to LSA; for a better introduction to LSA and indexing, we invite the reader to consult the original paper [12].
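A minimal classical sketch of these representations and of query folding could look like the following (illustrative names; this is the classical computation, not the quantum routine):

```python
import numpy as np

def lsa_model(A, k):
    """Rank-k LSA model of a words x documents matrix A = U Sigma V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T
    word_space = Uk * sk       # U Sigma: compare words with each other
    doc_space = Vk * sk        # V Sigma: compare documents with each other
    return Uk, sk, Vk, word_space, doc_space

def fold_in_query(x_q, Uk, sk):
    """Map a query's word-count vector into the document space:
    v_q^T = x_q^T U Sigma^{-1}."""
    return (x_q @ Uk) / sk

rng = np.random.default_rng(2)
A = rng.random((30, 12))                 # 30 words, 12 documents
Uk, sk, Vk, words, docs = lsa_model(A, 5)
v_q = fold_in_query(A[:, 3], Uk, sk)     # fold in an existing document
```

Note that folding in a column of A recovers exactly the corresponding row of V_k, which is a quick sanity check for the v_q^T = x_q^T UΣ^{-1} identity.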
Even in this case, the cost of extracting the orthogonal factors and the factor scores is bounded by O((1/γ² + k(n+m)/(θδ²))µ(A)). In some applications, the data analyst might use a fixed number of singular values and vectors, regardless of the factor score ratios. In Deerwester et al [12], k = 100 is found to be a good number for document indexing. Similarly, we believe that if we scale the singular values by the spectral norm, it is possible to empirically determine a threshold θ to use in practice. Determining such a threshold would reduce the complexity of the model computation to the one of Theorem 12: O(k(n+m)µ(A)/(θδ²)).
For what concerns the error bounds, we already know that it is possible to retrieve estimates of UΣ and VΣ with precision √k(ε + δ).
Lemma 20 (Accuracy of UΣ^{1/2} and VΣ^{1/2}) The errors on UΣ^{1/2} and VΣ^{1/2} are both bounded by √k(δ + ε/(2√θ)), with the singular values scaled so that σ_max ≤ 1. We prove this result for ‖ŪΣ̄^{1/2} − UΣ^{1/2}‖_F.
Proof We start by bounding |√σ̄_i − √σ_i|. Let's define ε = γσ_i as a relative error, so that σ̄_i ≤ (1 + γ)σ_i. By definition γ = ε/σ_i, and we know that σ_min ≥ θ, so |√σ̄_i − √σ_i| ≤ (γ/2)√σ_i = ε/(2√σ_i) ≤ ε/(2√θ). Using the bound on the square roots, we can bound the columns of UΣ^{1/2}: ‖√σ̄_i ū_i − √σ_i u_i‖ ≤ √σ_i ‖ū_i − u_i‖ + |√σ̄_i − √σ_i| ≤ δ + ε/(2√θ). From the error bound on the columns, we derive the bound on the matrices: ‖ŪΣ̄^{1/2} − UΣ^{1/2}‖_F ≤ √k(δ + ε/(2√θ)).
Lemma 21 (Accuracy of UΣ^{-1} and VΣ^{-1}) Let A ∈ R^{n×m} be a matrix. Given some approximate procedures to retrieve estimates σ̄_i of the singular values σ_i such that |σ̄_i − σ_i| ≤ ε and unit estimates ū_i of the left singular vectors u_i such that ‖ū_i − u_i‖ ≤ δ, the error on UΣ^{-1} can be bounded as ‖ŪΣ̄^{-1} − UΣ^{-1}‖_F ≤ √k(δ/(θ − ε) + ε/(θ(θ − ε))). We prove this result for ‖ŪΣ̄^{-1} − UΣ^{-1}‖_F.
Proof We start by bounding |1/σ̄_i − 1/σ_i|. Knowing that σ_min ≥ θ and ε < θ, we have |1/σ̄_i − 1/σ_i| = |σ_i − σ̄_i|/(σ̄_i σ_i) ≤ ε/(θ(θ − ε)). From the bound on the inverses, we can obtain the bound on the columns of UΣ^{-1}: ‖ū_i/σ̄_i − u_i/σ_i‖ ≤ ‖ū_i − u_i‖/σ̄_i + |1/σ̄_i − 1/σ_i| ≤ δ/(θ − ε) + ε/(θ(θ − ε)). To complete the proof, we compute the bound on the matrices: ‖ŪΣ̄^{-1} − UΣ^{-1}‖_F ≤ √k(δ/(θ − ε) + ε/(θ(θ − ε))).

Experiments
All of our experiments are numerical and can be carried out on classical computers.¹ We have analysed the distribution of the factor score ratios in the MNIST, Fashion MNIST, CIFAR-10, Tiny Imagenet, and Research Papers datasets. They decrease exponentially fast (figures in the appendix), confirming the low-rank nature of the data. Focusing on MNIST, Fashion MNIST, and CIFAR-10, we have simulated PCA's dimensionality reduction for image classification. The datasets have been shifted to row mean 0 and normalized so that σ_max = 1. We have simulated Algorithm 1 by sampling 1/γ² = 1000 times from the state Σ_i^r √λ_i |u_i⟩|v_i⟩|σ_i⟩ to search for the first k principal components that account for a factor score ratios' sum p = 0.85. The simulation occurs by sampling with replacement from the discrete probability distribution given by the λ_i. We then estimated the measured λ_i using the Wald estimator (see the proof of Theorem 8) and searched for the most important k.² In all cases, sampling the singular values has been enough to decide how many to keep. However, as p increases, the gap between the factor score ratios decreases and the quality of the estimation of k or θ decreases. As discussed in Section 3.1, it is possible to detect this problem using Theorem 9 and solve it with a binary search for θ (Theorem 10). We have tested the quality of the representation by observing the accuracy of 10-fold cross-validation k-nearest neighbors with k = 7 as we introduce error in the representation's Frobenius norm (see Figure 1). To introduce the error, we have added truncated Gaussian noise to each element of UΣ so that ‖ŪΣ̄ − UΣ‖_F ≤ ξ = √k(ε + δ) (Lemma 14). The parameter δ has been estimated using the bound above, choosing the error so that the accuracy drops no more than 0.01 and fixing ε to a number that allows for correct thresholding. Table 1 summarizes the run-time parameters.
The results show that Theorems 8, 9, and 10 are already advantageous on small datasets, while Theorem 12 requires bigger datasets to express its speed-up. We have also simulated the creation of the state at step 6 of Algorithm 4 to test the average number of measurements needed to collect all the singular values as ε increases. The analysis has confirmed the run-time's expectations. Finally, we have tested the value of α (Definition 2, Claim 16) for the MNIST dataset, fixing ε = 0 and trying p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. We have observed that α = 0.97 ± 0.03, confirming that the run-time of Corollary 15 can be assumed to be O(µ(V^(k))) for the majority of the data points of a PCA-representable dataset.
We point out that the run-time parameters have been extensively discussed in other works that rely on the same parameters [28,36]. These works study the scaling of the parameters as the dataset size increases, both in features and samples, and conclude that the parameters of interest are almost constant. In addition to the existing experiments, we have studied the trend of the run-time parameters on the Tiny Imagenet dataset as the number of samples scales. While the spectral norm increases, the other run-time parameters become constant after a certain number of samples. Figure 2 shows that the algorithms discussed in Section 3.1 are already of practical interest on datasets of this size.
² Note that, in practice, one could also estimate the factor score ratios as λ̄_i = …

Conclusions
In this paper, we formulate many eigenvalue problems in machine learning within a useful framework, filling the gap between the quantum linear algebra literature and practical data analysis. After identifying the proper quantum tools and formalizing the novel quantum algorithms, the main technical difficulty was analyzing how the error propagates, in order to bound the algorithms' run-time properly. Using similar techniques, we do not expect run-time improvements that exceed poly-logarithmic or constant factors. For non-zero singular values and dense singular vectors, the run-time of the extraction cannot be smaller than kz, as one needs to read vectors of size kz. The δ² parameter is a tight bound for the ℓ2 norm of the vectors, as it results from Chernoff's bound. The ε parameter is a tight error bound from phase estimation, which is necessary to distinguish the singular vectors. θ relates to the condition number of the low-rank approximation of the matrix, and it is necessary to amplify the amplitudes of the smallest singular values.
As future work, we deem it interesting to explore quantum algorithms for incremental SVD or for datasets whose points arrive as a data stream. It might be possible to reduce the overhead due to tomography and achieve greater speed-ups in these settings. It also remains an open question whether there are particular applications and dataset distributions for which the singular vector extraction algorithms offer a practical advantage over their classical counterparts. Finally, an appropriate resource estimation that takes into consideration different quantum hardware architectures, noise models, and error correction codes is out of the scope of this paper and is left for future work.

MNIST
MNIST [41] is probably the most used dataset in image classification. It is a collection of 70000 images of 28 × 28 = 784 pixels. Each image is a black and white hand-written digit between 0 and 9 and it is paired with a label that specifies the digit. Since the images are black and white, they are represented as arrays of 784 values that encode the lightness of each pixel. The dataset, excluding the labels, can be encoded in a matrix of size 70000 × 784.

Fashion MNIST
Fashion MNIST [60] is a recent dataset for benchmarking in image classification. Like MNIST, it is a collection of 70000 images composed of 28 × 28 = 784 pixels. Each image represents a black and white fashion item among {T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot}. Each image is paired with a label that specifies the item represented in the image. Since the images are black and white, they are represented as arrays of 784 values that encode the lightness of each pixel. The dataset, excluding the labels, can be encoded in a matrix of size 70000 × 784.

CIFAR-10
CIFAR-10 [38] is another widely used dataset for benchmarking image classification. It contains 60000 colored images of 32 × 32 pixels, with the values for each of the 3 RGB colors. Each image represents an object among {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck} and is paired with the appropriate label. We use all the images, reshaping them to unroll the three channels in a single vector. The resulting size of the dataset is 60000 × 3072.

Tiny Imagenet
Tiny Imagenet [40] is a subset of Imagenet, a large dataset for image classification. It is a collection of 100000 colored images of 64 × 64 pixels. Tiny Imagenet contains images of 200 object classes; each class is composed of 500 images. We process the dataset so as to have only black and white images. Though its size is considerably smaller than that of Imagenet, its complexity is higher than CIFAR-10's. The dataset, excluding the labels, can be encoded in a matrix of size 100000 × 4096.

Research Paper
Research Paper [24] is a dataset for text classification, available on Kaggle. It contains 2507 titles of papers together with the labels of the venues where they have been published. The labels are {WWW, INFOCOM, ISCAS, SIGGRAPH, VLDB}. We pre-process the titles to compute a contingency table of papers × words: the value of the (i, j)-th cell is the number of times that the j-th word appears in the i-th title. We remove the English stop-words, the words that appear in only one document, and those that appear in more than half the documents. The result is a contingency table of size 2507 × 2010.
Except for Research Paper, all the datasets have been shifted to row mean 0 and normalized so that σ_max = 1. Figure A1 shows the factor score ratios' distributions in these datasets. They decrease exponentially fast, confirming the expectations.

A.2 Run-time parameters
We have computed the run-time parameters on the Tiny Imagenet dataset, keeping the number of features steady (i.e., 4096 black and white pixels) and observing how the parameters scale as we consider an increasing number of data points. The results are shown in Figure A2. In these plots, ε is half the gap between the least singular value to retain and the one below, leading to correct thresholding, while θ is computed as the least singular value to retain. Although we would fine-tune θ and ε better in practice, the trend and the orders of magnitude of these parameters would remain like our plots. We have computed the best µ(A) over a finite set of p ∈ [0, 1], and for any number of data points, the Frobenius norm was the most convenient. Finally, in this experiment, we did not estimate δ. This is because δ can only be estimated with respect to a specific classification task, and we did not run classification on this dataset for practical computational reasons. However, the following sections contain more run-time parameters for image classification on smaller datasets, including estimates for δ.
From the plots, we can see that the spectral norm increases with the number of data points and that the thresholding ε is independent of this quantity. All the other parameters asymptotically approach a constant after introducing a certain number of data points. Our intuition suggests that the number of data points after which the parameters are constant depends on the number of classes in the dataset. Indeed, this quantity should be related to the amount of information that a new data point adds to the dataset. The reader might find it surprising that the Frobenius norm, in Figure A2d, slightly decreases towards the end. However, this trend is justified by the fact that we compute these parameters after the dataset is divided by the spectral norm, and this parameter continues to increase (Figure A2e).
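The computation of these parameters from a classically computed spectrum can be sketched as follows. The helper is hypothetical, and it uses the Frobenius norm as µ(A), since that was the most convenient choice in our runs.

```python
import numpy as np

def runtime_parameters(A, p):
    """Estimate theta (last singular value retained), the thresholding
    epsilon (half the gap to the first excluded singular value), and
    mu(A) (here the Frobenius norm), after normalizing so sigma_max = 1."""
    A = A / np.linalg.norm(A, 2)               # divide by the spectral norm
    s = np.linalg.svd(A, compute_uv=False)
    ratios = s**2 / np.sum(s**2)               # factor score ratios
    k = int(np.searchsorted(np.cumsum(ratios), p) + 1)
    theta = s[k - 1]
    epsilon = (s[k - 1] - s[k]) / 2            # requires k < min(n, m)
    mu = np.linalg.norm(A, 'fro')
    return k, theta, epsilon, mu

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 50)) @ np.diag(np.linspace(1, 0.01, 50))
k, theta, epsilon, mu = runtime_parameters(A, 0.85)
```

After the normalization, the Frobenius norm is always at least 1 (since σ_max = 1), which is the behavior visible in Figure A2d.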
The fact that µ(A) is a positively homogeneous function implies that scaling by the spectral norm does not improve the overall run-time. If we did not divide the dataset by the spectral norm, we would have seen the effect of its trend in ε, θ, and µ(A). The decrease of µ after the normalization corresponds to a decrease of ε and θ, making the overall run-time remain the same.
We have used this data to generate the run-time plots in the main text ( Figure 2). In that figure, we can see that the algorithms of Section 3.1 are already convenient on datasets of this size. In contrast, the ones for singular vector extraction of Section 3.2 require datasets of greater size to show their potential.

A.3 Image classification with quantum PCA
To provide the reader with a clearer view of our new algorithms and their use in machine learning, we provide experiments on quantum PCA for image classification. We perform PCA on the three datasets for image classification (MNIST, Fashion MNIST, and CIFAR 10) and classify them with a K-Nearest Neighbors model. First, we simulate the extraction of the singular values and the percentage of variance explained by the principal components (top k factor score ratios' sum) using the procedure from Theorem 8. Then, we study the error of the model extraction, using Lemma 14, by introducing errors on the Frobenius norm of the representation to see how this affects the accuracy.

Estimating the number of principal components
We shift MNIST, Fashion MNIST, and CIFAR-10 to row mean 0 and divide them by their spectral norm. We directly simulate Theorem 8 to decide the number of principal components needed to retain 0.85 of the total variance. For each dataset, we classically compute the singular values with an exact classical algorithm and simulate the quantum state (1/√(Σ_j^r σ_j²)) Σ_i^r σ_i |σ_i⟩ to emulate the measurement process of Algorithm 1. After initializing the random object with the correct probabilities, we measure it 1/γ² = 1000 times and estimate the factor score ratios with a frequentist approach (i.e., dividing the number of measurements of each outcome by the total number of measurements). Measuring 1000 times guarantees an error of at most γ = 0.03 on each factor score ratio; in practice, the error is much smaller. To determine the number of principal components to retain, we sum the factor score ratios until the percentage of explained variance exceeds 0.85. We report the results of these experiments in Table A1. We obtained good results for all the datasets, estimating no more than three extra principal components than needed. We could further refine the number of principal components using Theorems 9 and 10. When we increase the percentage of variance to retain, the factor score ratios become smaller and the estimation worsens. When the factor score ratios become too small to perform efficient sampling, it is possible to establish the threshold θ for the smallest singular value to retain using Theorems 9 and 10. Suppose one is interested in refining the exact number k of principal components, rather than θ. In that case, it is possible to obtain it using a combination of Theorems 9 and 10 and the quantum counting algorithm, in time that scales with the square root of k (Theorem 11), to find a good trade-off. Once one sets the number of principal components, the next step is to use Theorem 12 to extract the top singular vectors.
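The measurement simulation can be sketched as follows (hypothetical names; the multinomial draw plays the role of the 1/γ² = 1000 measurements, outcome i appearing with probability λ_i = σ_i²/Σ_j σ_j²):

```python
import numpy as np

def sample_factor_score_ratios(s, n_meas, rng):
    """Simulate measuring the index register of the state
    (1/sqrt(sum_j sigma_j^2)) * sum_i sigma_i |sigma_i>, then estimate
    the factor score ratios by the outcomes' relative frequencies."""
    lam = s**2 / np.sum(s**2)
    counts = rng.multinomial(n_meas, lam)
    return counts / n_meas

def components_for_variance(lam_hat, p):
    """Smallest k whose estimated factor score ratios sum to at least p."""
    return int(np.searchsorted(np.cumsum(lam_hat), p) + 1)

rng = np.random.default_rng(4)
s = np.sort(rng.random(40))[::-1]              # a toy spectrum
lam_hat = sample_factor_score_ratios(s, 1000, rng)
k_hat = components_for_variance(lam_hat, 0.85)
```

With n_meas = 1000 the frequency estimates deviate by at most γ ≈ 1/√1000 ≈ 0.03 per ratio with high probability, matching the guarantee used in the experiments.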
To do so, we can retrieve the threshold θ from the previous step by checking the gap between the last singular value to retain and the first to exclude.

Studying the error in the data representation
We continue the experiment by checking how much error in the data representation a classifier can tolerate. We compute the exact PCA representation for the three datasets and the 10-fold cross-validation error using k-nearest neighbors with 7 neighbors. For each dataset, we introduce errors in the representation and check how the accuracy decreases. To simulate the error, we perturb the exact representation by adding truncated Gaussian error (zero mean and unit variance, truncated on the interval [−ξ/√(nm), ξ/√(nm)]) to each matrix entry. The graph in Figure A3 shows the distribution of the simulated error over 2000 approximations Ā of a matrix A, such that ‖Ā − A‖ ≤ 0.1. The distribution is still Gaussian, centered almost at half the bound. The results show a reasonable tolerance of the errors; we report them in two sets of figures. Figure A4 shows the drop of accuracy in classification as the error bound increases. Figure A5 shows the accuracy trend against the approximation's error.
Analyzing the run-time parameters
As discussed in Section 4, the model extraction's run-time depends on µ(A), a parameter bounded by min(‖A‖_F, ‖A‖_∞); k is the number of principal components retained; θ is the value of the last singular value retained; γ is the precision with which to estimate the factor score ratios; ε bounds the absolute error on the estimation of the singular values; δ bounds the ℓ2 norm of the distance between the singular vectors and their approximations; and z is either n or m, depending on whether we extract the left singular vectors, to compute the classical representation, or the right ones, to retrieve the model and allow for further quantum/classical computation. This run-time can be further lowered using Theorem 10 if we are not interested in the factor score ratios. This paragraph aims to show how to determine the run-time parameters for a specific dataset. We enrich the parameters of Table A1 with the ones in Table A2, and we discuss how to compute them.
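The truncated-noise perturbation described at the beginning of this section can be sketched like this. This is a simple rejection-sampling implementation of the truncated Gaussian; by construction every entry-wise error is at most ξ/√(nm), so the Frobenius error is at most ξ.

```python
import numpy as np

def perturb(A, xi, rng):
    """Add zero-mean, unit-variance Gaussian noise truncated to
    [-xi/sqrt(nm), xi/sqrt(nm)] to each entry of A, guaranteeing
    ||A_tilde - A||_F <= xi."""
    n, m = A.shape
    bound = xi / np.sqrt(n * m)
    E = rng.standard_normal((n, m))
    while True:                          # resample entries outside the interval
        mask = np.abs(E) > bound
        if not mask.any():
            break
        E[mask] = rng.standard_normal(int(mask.sum()))
    return A + E

rng = np.random.default_rng(5)
A = rng.random((10, 10))
A_tilde = perturb(A, 0.5, rng)
```

Rejection sampling is wasteful for tiny truncation intervals (an inverse-CDF sampler would be faster on large matrices), but it keeps the sketch dependency-free.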
From the previous paragraphs, it should be clear how to determine k, θ, γ, and p. We have seen that ‖A‖_F is the best µ(A) for MNIST (this is true for CIFAR-10, Fashion MNIST, Tiny Imagenet, and Research Papers as well). To compute the parameter ε, one should consider the epsilon that allows for correct singular value thresholding. We refer to this as the thresholding ε and set it as the difference between the last retained singular value and the first that is excluded. For the sake of completeness, we have run experiments to check how the Coupon Collector's problem changes as ε increases. Recall that in the proof of Theorem 12 we use the fact that the number of measurements needed to observe all the singular values is O(k log(k)), and this is true only if ε is small enough to let the singular values distribute uniformly. We observe that the thresholding ε always satisfies the Coupon Collector's argument, and we have plotted the results of our tests in Figure A6. Furthermore, we have computed δ using the bound of Lemma 14: we have fixed ‖Ā − A‖ to the biggest value in our experiments such that the accuracy doesn't drop more than 1%. These results show that Theorems 8, 9, and 10 can already provide speed-ups on datasets as small as MNIST. Even though their speed-up is not exponential, they still run sub-linearly in the number of matrix entries, even though all the entries are taken into account during the computation, offering a polynomial speed-up with respect to their traditional classical counterparts. On the other hand, Theorem 12 requires bigger datasets. These algorithms are expected to show their full speed-up on big low-rank datasets that maintain a good distribution of singular values. As a final remark, the parameters have similar orders of magnitude.

Appendix B Related works
One of the first papers that faced the problem of performing the eigendecomposition of a matrix with a quantum computer is the well-known Lloyd et al [44], which leveraged the intuition that density matrices are covariance matrices whose trace has been normalized. In this work, the authors assume quantum access to a matrix in the form of a density matrix and develop a method for fast density matrix exponentiation that enables preparing the eigendecomposition of the input matrix in time logarithmic in its dimensions. However, this algorithm requires the input matrix to be square, symmetric, and sparse or low-rank. More recently, the works of Kerenidis et al. on recommendation systems [29] and least-squares [30] have used a different definition of quantum access to a matrix (the one used throughout this work) and defined the task of singular value estimation. Their singular value decomposition scales better with respect to the error parameters, eliminates the dependency on the condition number, and does not have requirements on the input matrix. Several recent works, such as Gu et al [19], Lin et al [43], and Rebentrost et al [50], have improved or extended the quantum singular value decomposition techniques. Almost none of them have provided a formal analysis of an algorithm that ensures classical access to the singular vectors, the singular values, and the amount of variance explained by each. There have also been attempts at creating near-term quantum algorithms for singular value decomposition. These works propose quantum circuits for singular value decomposition of quantum states on noisy intermediate-scale quantum (NISQ) devices using variational circuits [5,59]. However, the complexity of such methods is unclear, and recent works have questioned the efficacy of the speed-ups of variational quantum algorithms due to (entanglement- and noise-induced) barren plateaus in the optimization landscape [45,57].
In classical computer science, most of the widespread implementations of PCA, CA, and LSA [47] rely on ARPACK [42] or similar packages, which implement improvements of the Lanczos method, like the implicitly restarted Arnoldi method (IRAM) [53]. IRAM is an improvement upon the simple Arnoldi iteration, which dates back to 1951 and is a more general version of the Lanczos algorithm, which works only for Hermitian matrices. The run-time of these algorithms is bounded by O(nmk ln(m/ε)/√ε), where ε is an approximation error related to the relative spectral gap between eigenvalues [51].
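For instance, this truncated SVD machinery is exposed in scipy, whose `svds` routine calls ARPACK's implicitly restarted Lanczos/Arnoldi iteration by default. A minimal usage sketch (toy matrix, illustrative only):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(6)
A = rng.standard_normal((300, 80))

# Top k=10 singular triplets via ARPACK (IRAM under the hood).
U, s, Vt = svds(A, k=10)
order = np.argsort(s)[::-1]        # svds does not guarantee descending order
U, s, Vt = U[:, order], s[order], Vt[order]

# Reference: full (dense) SVD spectrum, descending.
s_full = np.linalg.svd(A, compute_uv=False)
```

On sparse term-document matrices, this avoids ever forming a dense factorization, which is exactly why LSA implementations build on it.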
The realization of quantum procedures that provide exponential speed-ups in linear algebra tasks has inspired the development of classical quantum-inspired algorithms that try to achieve run-times similar to their quantum counterparts. The process of transforming a quantum algorithm into a classical algorithm with a similar speed-up is usually referred to as "dequantization". In our case, the comparison with dequantized algorithms is often not easy, as they solve problems that are different from ours. Most of these works are based on a famous algorithm by Frieze, Kannan, and Vempala, which computes a low-rank approximation of a matrix in time sub-linear in the number of entries [2,10,14]. Such algorithms promise exponential speed-ups over the traditional SVD algorithms for low-rank matrices. However, the high polynomial dependency of the run-times on the condition number, the rank, and the estimation error makes them advantageous only for matrices of extremely large dimensions, with low rank and small condition number. The research described in Arrazola et al [2] observed that dependencies like O(‖A‖_F^6/ε^6) are far from being tight in real implementations, but that these algorithms are still orders of magnitude slower than the best classical algorithms.
Concurrently with our work, an important new result [9] was able to lower the complexity of these dequantizations by better leveraging the previous literature on classical randomized linear algebra and re-framing it into a more complete mathematical framework. Indeed, previous sample-based dequantizations were just doing a form of leverage score sampling. These new algorithms seem to be tighter than previous results and offer a better comparison with quantum algorithms, solving problems related to ours. While we believe that it is not possible to have classical algorithms with run-times comparable to the ones of Theorems 8, 9, 10 (see the relationships between LLSD, SUES, and DQC1 in Cade and Montanaro [6]) and Corollaries 15 and 17, we have found that the work of Chepurko et al [9] may question the practical advantage of our Theorem 12 over a classical counterpart. At first sight, their Theorem 33 might seem relevant for this work, as it provides a set of linearly independent rows of the input matrix. We stress that this problem is not related to finding the singular vectors provided by the SVD, which are linearly independent and orthonormal. Moreover, even after further orthonormalization processing (e.g., Gram-Schmidt), the computed row basis would not necessarily be the one provided by the SVD. This is why we cannot compare the run-time of this procedure to our Theorem 12. On the other hand, their Theorem 37 is more similar to our Theorem 12 but still aims to solve a different problem. While ours provides estimates such that ‖v̄_i − v_i‖ ≤ δ for all i ∈ [k] (which, we recall, are also relative-error estimates, as ‖v_i‖ = 1), their Theorem 37 provides a rank-k projector matrix Q^(k), with orthonormal columns, such that ‖A − QQ^T A‖_F ≤ (1 + ε)‖A − A_k‖_F. While it is easy to see that Q → V as ε → 0, it is not easy to see how ‖Q − V‖_F varies as ε varies, and that becomes even less clear if we are interested in the error on a specific singular vector.
If the run-time of this algorithm were shown to be better than its quantum equivalent, it would still be possible to include it in our framework in place of Theorem 12 and continue to take advantage of the speed-ups of the other quantum procedures. One downside of using the dequantized subroutines is that, in general, the O(nnz(A)) data preprocessing step is different from the one required to provide efficient quantum access. Even though it may be possible for a classical algorithm to extract the singular vectors with a run-time comparable to the quantum one, using it would require paying additional costs both in time and space. Those costs arise from the need for an ad hoc data structure that would not be adequate to provide competitive speed-ups with respect to the other available quantum machine learning and data analysis algorithms. We believe that both the classical and quantum versions of singular vector extraction may be used in the future, depending on the computational capabilities available to the interested data analysts.

B.1 Principal component analysis
Probably no other algorithm in ML has been studied as much as PCA, so the literature around it is vast [21,27]. To mention an improvement upon the standard Lanczos method for PCA [58]: the authors used more Lanczos iterations to improve the numerical stability of PCA, obtaining a better description of the Krylov subspace (i.e., more iterations help obtain a more orthonormal basis). As mentioned, the problem of PCA has been studied previously within the model of quantum computation. He et al [25] and Lin et al [43] focus on a circuit implementation of qPCA, whose run-time has been superseded by the more recent techniques used in this paper. The work of Yu et al [62] faces the problem of performing PCA for dimensionality reduction on quantum states, achieving an exponential advantage over the best known classical algorithms. However, their algorithm is somewhat impractical due to the overall error dependence, which can be of O(ε^{-5}). Furthermore, they use old Hamiltonian simulation techniques, superseded by the techniques that we use in our paper. To our knowledge, no other works provide a theoretical analysis of the run-time of the procedures needed to select the number of singular vectors required to retain enough variance, obtain a classical description of the model, and map new data points to the new feature space with theoretical guarantees on the run-time (which we believe cannot be improved, as in this work we show that the run-time for this mapping is almost constant).

B.2 Correspondence analysis
While correspondence analysis was very popular in the past, so much so that entire books have been written about it [11,17], it seems to have fallen out of fashion in the last decades, probably overshadowed by the wave of results in deep learning. The novel formulation of Hsu et al [26] gives a new perspective on CA. The authors connect correspondence analysis to the theory of principal inertia components, making it relevant also in tasks that concern privacy in machine learning [56]. As said before, similarly to and independently from us, Koide-Majima and Majima [37] have extended the dequantized subroutines to perform canonical correspondence analysis. This algorithm is not expected to beat the performance of our quantum algorithm, let alone the performance of the best classical algorithm for CA.

B.3 Latent semantic indexing
LSA was first introduced in Deerwester et al [12], which spurred a flurry of applications [39]. Some notable works are streaming and/or distributed algorithms for incremental LSA [7,63,64]. While these works might offer inspiration for new quantum algorithms, their distributed nature makes a comparison with a single-QPU quantum algorithm unfair. LSA with neural networks has also been explored in the past years [61], albeit without guarantees on the run-time or the approximation error. During the preparation of this manuscript, we discovered a previous work on quantum LSA, which pointed out the similarities between quantum states and LSA, albeit without offering any practical algorithm [16].