
Binary classification with classical instances and quantum labels


In classical statistical learning theory, one of the most well-studied problems is that of binary classification. The information-theoretic sample complexity of this task is tightly characterized by the Vapnik-Chervonenkis (VC) dimension. A quantum analog of this task, with training data given as a quantum state, has also been intensely studied and is now known to have the same sample complexity as its classical counterpart. We propose a novel quantum version of the classical binary classification task by considering maps with classical input and quantum output and corresponding classical-quantum training data. We discuss learning strategies for the agnostic and for the realizable case and study their performance to obtain sample complexity upper bounds. Moreover, we provide sample complexity lower bounds which show that our upper bounds are essentially tight for pure output states. In particular, we see that the sample complexity is the same as in the classical binary classification task w.r.t. its dependence on accuracy, confidence and the VC-dimension.


The fields of machine learning and of quantum computation provide new ways of looking at computational problems and have seen a significant increase in academic as well as practical interest since their origins in the 1970s and 1980s. More recently, attention was directed to paths for combining ideas from these two fruitful research areas. This gave rise to new approaches under different names such as “quantum machine learning” or “quantum learning theory”.

In classical statistical learning theory, one of the most influential frameworks is that of probably approximately correct (PAC) learning due to Vapnik and Chervonenkis (1971) and Valiant (1984). It is particularly well studied for the task of binary classification. For this problem the so-called VC-dimension Vapnik and Chervonenkis (1971) is known to characterize the sample complexity of learning a function class (Blumer et al. 1989; Hanneke 2016). Motivated by these strong theoretical results, a quantum analog of this problem was soon defined and studied in a series of papers (an overview of which is given in Arunachalam and de Wolf (2017)), which culminated in the results of Arunachalam and de Wolf (2018). There it is shown that the information-theoretic complexity of the task of quantum PAC learning a 0-1-valued function class is characterized by the VC-dimension in exactly the same way as for the classical scenario.

The scenario studied in Arunachalam and de Wolf (2018) assumes the training data available to the learner to be given in a specific quantum form and allows the learner to perform quantum computational operations on that training data. The functions to be learned, however, still map classical inputs to classical outputs. We propose a different quantum version of the binary classification task by not only considering the possibility of quantum training data but by allowing the objects to be learned to be inherently quantum. More specifically, we consider functions that map classical inputs to one of two possible quantum output states (“quantum labels”). These maps describe state preparation procedures. A more general learning task of this type, for which our problem can be seen as a toy model, could be relevant for cases in which state preparation is either costly or time-consuming, e.g., preparing thermal states at low temperatures (see Brandão and Kastoryano 2019; Chowdhury 2020, and references therein). Here, one could first produce sample data, learn a predictor, and then reproduce the preparation more efficiently using the predictor.

Main results

We consider maps \(f:{\mathscr{X}}\to \{ \sigma _{0},\sigma _{1}\}\) that assign to points in a classical input space \({\mathscr{X}}\) one of two labelling quantum states {σ0,σ1}. (Here, σ0 and σ1 are, in general, mixed states described by density matrices.) Let \({\mathscr{F}}\) be a function class consisting of such functions. We assume the training data to be given as a classical-quantum state about which, according to the laws of quantum theory, we can only gain information by performing measurements.

Our learning model is that of PAC-learning with accuracy ε and confidence δ. Here, we require a learning algorithm, given as input classical-quantum training data generated according to some unknown underlying distribution, to output with probability ≥ 1 − δ over the choice of training data a hypothesis that achieves accuracy ε. (Accuracy is measured in terms of the trace distance.)

We present a learning strategy that (ε,δ)-PAC learns \({\mathscr{F}}\subseteq \{ f:{\mathscr{X}}\to \{ \sigma _{0},\sigma _{1}\}\}\) in the agnostic scenario from classical-quantum training data of size \({\mathscr{O}}\left (\frac {d}{\varepsilon ^{2}} + \frac {\log {1}/{\delta }}{\varepsilon ^{2}}\right )\), where d is the VC-dimension of the {0,1}-valued function class \(\tilde {{\mathscr{F}}}\subseteq \{\tilde {f}:{\mathscr{X}}\to \{0,1\}\}\) induced by \({\mathscr{F}}\) via σii, i = 0,1. Here, “agnostic” means that there need not be a function in \({\mathscr{F}}\) that would achieve perfect accuracy. We also show that solving this learning problem requires training data size \({\varOmega }\left (\frac {d}{\varepsilon ^{2}} + \frac {\log {1}/{\delta }}{\varepsilon ^{2}} \right )\), so our strategy is optimal w.r.t. the sample complexity dependence on ε, δ and d.

For the realizable scenario of the quantum learning problem, i.e., under the assumption that perfect accuracy can be achieved using \({\mathscr{F}}\), we prove a sample complexity upper bound of

$$ \mathscr{O}\left( \frac{1}{\varepsilon (1-2\max\lbrace \text{tr}[E_{0}\sigma_{1}],\text{tr}[E_{1}\sigma_{0}]\rbrace)^{2}} \left( d + \log{1}/{\delta}\right)\right), $$

where {E0,E1} is the Holevo-Helstrom measurement for distinguishing σ0 and σ1, and a sample complexity lower bound of \({\varOmega }\left (\frac {d}{\varepsilon } + \frac {\log {1}/{\delta }}{\varepsilon }\right )\). Also here, these bounds coincide w.r.t. their dependence on ε, δ and d. The prefactor \((1-2\max \lbrace \text {tr}[E_{0}\sigma _{1}],\text {tr}[E_{1}\sigma _{0}]\rbrace )^{-2}\) in the upper bound comes from our procedure trying to distinguish σ0 and σ1 by measuring single copies. (Note: Even though we formulate this in terms of the Holevo-Helstrom measurement, we could use any other two-outcome POVM \(\{ \tilde {E}_{0},\tilde {E}_{1}\}\) that satisfies \(\max \limits \lbrace \text {tr}[\tilde {E}_{0}\sigma _{1}],\text {tr}[\tilde {E}_{1}\sigma _{0}]\rbrace <{1}/{2}.\)).

In proving the sample complexity upper bound for the realizable scenario, we combine algorithms from Laird (1988) and Hanneke (2016) to show that \({\mathscr{O}}\left (\frac {1}{\varepsilon (1-2\eta _b)^{2}}\left (d + \log {1}/{\delta }\right )\right )\) classical examples with two-sided classification noise, i.e., in which each label is flipped with probability given by a noise rate, suffice for classical (ε,δ)-PAC learning a function class of VC-dimension d in the realizable scenario if the noise rate is bounded by ηb < 1/2. This upper bound has, to the best of our knowledge, not been observed before and, when combined with the lower bound from Arunachalam and de Wolf (2018), establishes the optimal sample complexity of this classical noisy learning problem.
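The classical subproblem can be illustrated numerically. The following Python sketch (with a hypothetical threshold class and noise rate, chosen for illustration and not taken from the text) applies the minimize-disagreements rule of Laird (1988) to labels subjected to two-sided classification noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setting: threshold functions f_t(x) = 1[x >= t] on [0,1],
# a class of VC-dimension 1, with target threshold 0.5.
target_t = 0.5
eta_b = 0.3                    # two-sided noise rate, bounded away from 1/2
m = 5000

x = rng.random(m)
clean = (x >= target_t).astype(int)
flip = rng.random(m) < eta_b   # each label flipped with probability eta_b
noisy = np.where(flip, 1 - clean, clean)

# Minimize disagreements (Laird 1988): output the candidate hypothesis
# with the fewest disagreements on the noisy sample.
candidates = np.linspace(0.0, 1.0, 201)
preds = x[:, None] >= candidates[None, :]             # shape (m, 201)
disagreements = (preds != noisy[:, None]).sum(axis=0)
best_t = candidates[np.argmin(disagreements)]
```

Despite 30% of the labels being flipped, the empirical minimizer concentrates around the true threshold as m grows, in line with the \((1-2\eta_b)^{-2}\) dependence of the bound.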

As is common in statistical learning theory, our main focus lies on the information-theoretic complexity of the learning problem, i.e., the necessary and sufficient number of quantum examples, whereas we do not discuss the computational complexity. Our proposed strategies are “semi-classical” in the following sense: After initially performing simple tensor product measurements, in which each tensor factor is a two-outcome POVM, the remaining computation is done by a classical learning algorithm. In particular, the procedure does not require (possibly hard to implement) joint measurements and its computational complexity will be determined by the (classical) computational complexity of the classical learner used as a subroutine.

Overview of the proof strategy

We first sketch how we obtain the sample complexity upper bounds. We propose a simple (semi-classical) procedure that consists of first performing local measurements on the quantum part of the training data examples to obtain classical training data and then applying a classical learning algorithm.

We observe that the learning problem to which the classical learner is applied can then be viewed as a classical binary classification problem with two-sided classification noise, i.e., in which the labels are flipped with certain error probabilities determined by the outcome probabilities of the performed quantum measurements. Therefore, we have reduced our problem to obtaining sample complexity upper bounds for a classical learning problem with noise.

In the general (so-called agnostic) case, we can use known sample complexity bounds formulated in terms of a complexity measure called Rademacher complexity to show that classical empirical risk minimization w.r.t. a suitably modified loss function (as suggested in Natarajan et al. 2013) achieves optimal sample complexity for this classical learning problem with noise.

In the realizable case, i.e., under the assumption that any non-noisy training data set can be perfectly represented by some hypothesis in our class \(\tilde {{\mathscr{F}}}\), the optimal sample complexity for binary classification with two-sided classification noise has not been established in the literature. We combine ideas from Laird (1988) and Hanneke (2016) to exhibit an algorithm that achieves information-theoretic optimality for this scenario.

To obtain the sample complexity lower bounds, we apply ideas from Arunachalam and de Wolf (2018). Namely, we observe that for sufficiently small accuracy parameter, any quantum strategy that solves our learning problem indeed has to be able to distinguish between the possible different training data states with high success probability.

In the simple case of distinguishing between two quantum states, arising from two different “hard-to-distinguish” underlying distributions, this probability can be upper bounded in terms of the trace distance of the states. In the more general case of many states, we do not study this success probability directly. Instead, we consider the information contained in the quantum training data about the choice of the underlying distribution, again chosen out of a set of “hard-to-distinguish” distributions.
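For orientation, the standard bound underlying the two-state case (stated here for equal prior probabilities of the two candidate training-data states ρ0 and ρ1) is the Holevo-Helstrom bound on the optimal success probability of any distinguishing measurement:

$$ \mathbb{P}_{\text{success}} \leq \frac{1}{2}\left(1+\frac{1}{2}\left\|\rho_{0}-\rho_{1}\right\|_{1}\right). $$

Thus, if the two training-data states are close in trace distance, no measurement can identify the underlying distribution much better than random guessing.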

Related work

Bshouty and Jackson (1998) introduced a notion of quantum training data for learning problems with classical concepts and used it to learn DNF (Disjunctive Normal Form) formulae w.r.t. the uniform distribution. This was extended to product distributions by Kanade et al. (2019). Using ideas from Fourier-based learning, this type of quantum training data was also studied in the context of fixed-distribution learning of Boolean linear functions (Bernstein and Vazirani 1993; Cross et al. 2015; Ristè et al. 2017; Grilo et al. 2017; Caro 2020), juntas (Atıcı and Servedio 2007), and Fourier-sparse functions (Arunachalam et al. 2019a). Arunachalam and de Wolf (2017) and Arunachalam et al. (2019b) study the limitations of these quantum examples. A broad overview of work on quantum learning classical functions is given in Arunachalam and de Wolf (2017).

A quantum counterpart can also be considered for the model of learning from membership queries. Servedio and Gortler (2004) showed that the number of required classical queries is at most polynomially larger than the number of required quantum queries. Recently, this polynomial relation was improved in Arunachalam et al. (2019a). A more specific scenario, namely that of learning multilinear polynomials more efficiently from quantum membership queries, is studied in Montanaro (2012).

Similarly, a quantum counterpart of the classical model of statistical query learning can be defined. This was recently studied in Arunachalam et al. (2020).

Another line of research at the intersection of learning theory and quantum information focuses on applying classical learning to concept classes arising from quantum theory, e.g., from states or measurements. This was initiated by Aaronson (2007) and studied further by Cheng et al. (2016), Aaronson (2018), and Aaronson et al. (2018).

Our learning model is similar to the one studied in Chung and Lin (2018). Also there, the inputs are assumed to be classical and the outputs are quantum states. The crucial difference from our scenario is that we assume that there are only two possible label states and that these are known in advance. In Chung and Lin (2018), there can be a continuum of possible label states.

Our additional assumption allows us to study infinite function classes \({\mathscr{F}}\), whereas the results in Chung and Lin (2018) are for classes of finite size. (We expect that the reasoning of Chung and Lin (2018) can be extended to infinite classes using the so-called “growth function” when restricting to a finite set of possible target states. This might lead to a learning procedure that can be applied in our scenario without prior knowledge of the possible quantum label states.) As a further difference between the approaches, whereas the strategy of Chung and Lin (2018) requires the ability to perform measurements in random orthonormal bases, the measurements in our procedures can be taken to be fixed and of product form and are thus potentially easier to implement.

The classical problems to which our quantum learning problems are reduced are problems of learning from noisy training data. These were first proposed by Angluin and Laird (1988) and Laird (1988) and studied further, e.g., by Aslam and Decatur (1996), Cesa-Bianchi et al. (1999), and Natarajan et al. (2013).

Structure of the paper

In Section 2 we recall some notions from learning theory as well as from quantum information and computation. The central learning problem of this contribution is formulated in Section 3. The next section exhibits strategies for solving the task and establishes sample complexity upper bounds. In doing so, we derive a tight upper bound on the sample complexity of classical binary classification with two-sided classification noise (see the Appendix). The quantum sample complexity upper bounds are complemented by lower bounds in Section 5. We conclude with open questions and the references.


Basics of quantum information and computation

A finite-dimensional quantum system is described by a (mixed) state and mathematically represented by a density matrix of some dimension \(d\in \mathbb {N}\), i.e., an element of \({\mathscr{S}}(\mathbb {C}^d):=\lbrace \rho \in \mathbb {C}^{d\times d}\ |\ \rho \geq 0, \text {tr}[\rho ]=1\rbrace \). Here, ρ ≥ 0 means that ρ is a self-adjoint and positive semidefinite matrix. The extreme points of the convex set \({\mathscr{S}}(\mathbb {C}^d)\) are the rank-1 projections, the pure states. We employ Dirac notation to denote a unit vector \(\psi \in \mathbb {C}^d\) also by \(|\psi \rangle \in \mathbb {C}^d\) and the corresponding pure state by |ψ〉〈ψ|.

To make an observation about a quantum system, a measurement has to be performed. Measurements are built from the set of effect operators \({\mathscr{E}}(\mathbb {C}^d):=\lbrace E\in \mathbb {C}^{d\times d}\ |\ 0\leq E\leq \mathbb {1}\rbrace \). For our purposes it suffices to consider a measurement as a collection \(\lbrace E_{i}\rbrace _{i=1}^{\ell }\) of effect operators \(E_{i}\in {\mathscr{E}}(\mathbb {C}^d)\) s.t. \({\sum }_{i=1}^{\ell } E_{i}=\mathbb {1}\). (For the more general notion of a POVM see Nielsen and Chuang (2009) or Heinosaari and Ziman (2012).) When performing a measurement \(\lbrace E_{i}\rbrace _{i=1}^{\ell }\) on a state ρ, output i is observed with probability tr[Eiρ]. A projective measurement is one where the effect operators are rank-1 projections, i.e., there exists an orthonormal basis \(\lbrace |i\rangle \rbrace _{i=1}^d\) s.t. Ei = |i〉〈i|.
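As a minimal numerical illustration of the outcome rule tr[Eiρ] (the state and effect operator below are arbitrary illustrative choices, not taken from the text):

```python
import numpy as np

# A qubit state rho and a two-outcome measurement {E0, E1}: each Ei is
# an effect operator (0 <= Ei <= 1) and E0 + E1 equals the identity.
rho = np.array([[0.75, 0.25],
                [0.25, 0.25]])     # density matrix: PSD with trace 1
E0 = np.diag([0.9, 0.2])           # an effect operator (not a projector)
E1 = np.eye(2) - E0

# Outcome i is observed with probability tr[Ei rho].
p0 = np.trace(E0 @ rho).real
p1 = np.trace(E1 @ rho).real       # the probabilities sum to 1
```

Because the effects sum to the identity and ρ has unit trace, p0 + p1 = 1 automatically.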

When multiple quantum systems with spaces \(\mathbb {C}^{d_i}\) are considered, the composite system is described by the tensor product \(\bigotimes _{i=1}^n \mathbb {C}^{d_i}\simeq \mathbb {C}^{{\prod }_i d_i}\) and the set of states becomes \({\mathscr{S}} (\bigotimes _{i=1}^n \mathbb {C}^{d_i} )\). Given a state \(\rho _{AB}\in {\mathscr{S}}(\mathbb {C}^{d_A}\otimes \mathbb {C}^{d_B})\) of a composite system, we can obtain states of the subsystems as partial traces ρA = trB[ρAB], ρB = trA[ρAB]. Here, the partial trace trB is defined by the relation \(\text {tr}[\text {tr}_B[\rho _{AB}]A]=\text {tr}[\rho _{AB}(A\otimes \mathbb {1}_B)]\) for all \(A\in \mathbb {C}^{d_A\times d_A}\), and analogously for trA.
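The partial trace is easy to realize concretely; a short numpy sketch (the reshape convention assumes the standard Kronecker-product ordering of composite indices):

```python
import numpy as np

def partial_trace_B(rho_AB, dA, dB):
    """tr_B: satisfies tr[tr_B[rho_AB] A] = tr[rho_AB (A (x) 1_B)]."""
    # Reshape to indices (a, b, a', b'), then sum the diagonal over
    # the two B indices.
    return np.trace(rho_AB.reshape(dA, dB, dA, dB), axis1=1, axis2=3)

# Product state: tracing out B returns the A marginal exactly.
rho_A = np.diag([0.25, 0.75])
rho_B = 0.5 * np.ones((2, 2))      # the pure state |+><+|
prod_out = partial_trace_B(np.kron(rho_A, rho_B), 2, 2)

# Maximally entangled state: the reduced state is maximally mixed.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
bell_out = partial_trace_B(np.outer(bell, bell), 2, 2)
```

The second example shows that subsystems of an entangled pure state are mixed, which is why the partial trace (and not a restriction of the vector) is the right marginal notion.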

The dynamics of a quantum system are usually described by unitary evolution or, more generally, by quantum channels. For our purposes, these dynamics will not have to be discussed explicitly since they can be considered as part of the performed measurement by changing to the so-called Heisenberg picture (see Nielsen and Chuang 2009). We will take this perspective in proving our sample complexity lower bounds because it allows us to restrict our attention to proving limitations of measurements rather than of channels.

We will also make use of some standard entropic quantities which have been generalized from their classical origins (Shannon 1948) to the realm of quantum theory. We denote the Shannon entropy of a random variable X with probability mass function p by \(H(X)=-{\sum }_x p(x)\log (p(x))\), the conditional entropy of a random variable Y given X as \(H(Y|X)=-{\sum }_{x,y} p(x,y) \log \left (\frac {p(x,y)}{p(x)}\right )\) and the mutual information between X and Y as I(X : Y ) = H(X) + H(Y ) − H(X,Y ). Similarly, the von Neumann entropy of a quantum state ρ will be denoted as \(S(\rho )=-\text {tr}[\rho \log \rho ]\) and the mutual information for a bipartite quantum state ρAB as I(ρAB) = I(A : B) = S(ρA) + S(ρB) − S(ρAB). All the standard results and inequalities connected to these quantities which appear in our arguments can be found in Nielsen and Chuang (2009) or in Wilde (2013).
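These quantities are straightforward to compute numerically; a sketch using logarithms base 2, so that all entropies are in bits (the distribution and state below are illustrative):

```python
import numpy as np

def shannon(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-12]                      # 0 log 0 = 0 convention
    return float(-np.sum(p * np.log2(p)))

def von_neumann(rho):
    """Von Neumann entropy in bits: S(rho) = -tr[rho log rho]."""
    return shannon(np.linalg.eigvalsh(rho))

# Classical mutual information I(X:Y) = H(X) + H(Y) - H(X,Y).
p_xy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])             # two perfectly correlated bits
I_XY = shannon(p_xy.sum(axis=1)) + shannon(p_xy.sum(axis=0)) \
       - shannon(p_xy.ravel())            # = 1 bit

# The maximally mixed qubit has maximal entropy log2(2) = 1.
S_mixed = von_neumann(np.eye(2) / 2)
```

The von Neumann entropy reduces to the Shannon entropy of the eigenvalue distribution, which is exactly how `von_neumann` is implemented above.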

Basics of the PAC framework and the binary classification problem

The setting of Probably Approximately Correct (PAC) learning was introduced by Vapnik and Chervonenkis (1971) and Valiant (1984). The general setting is as follows: Let \({\mathscr{X}}, {\mathscr{Y}}\) be input and output space, respectively, let \({\mathscr{F}}\subset {\mathscr{Y}}^{{\mathscr{X}}}\) be a class of functions, a concept class, and let \(\ell :{\mathscr{Y}}\times {\mathscr{Y}}\to \mathbb {R}_+\) be a loss function. A learning algorithm (to which \({\mathscr{X}},{\mathscr{Y}},{\mathscr{F}}\) and \(\ell \) are known) has access to training data of the form \(S=\lbrace (x_{i},y_{i})\rbrace _{i=1}^{m}\), where (xi,yi) are drawn i.i.d. from a probability measure \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{Y}})\). Moreover, the learner is given as input a confidence parameter δ ∈ (0,1) and an accuracy parameter ε ∈ (0,1). Then a learner must output a hypothesis \(h\in {\mathscr{Y}}^{{\mathscr{X}}}\) s.t., with probability ≥ 1 − δ w.r.t. the choice of training data,

$$ \mathbb{E}_{(x,y)\sim\mu}[\ell(y,h(x))] \leq \underset{f\in\mathscr{F}}{\inf} \mathbb{E}_{(x,y)\sim\mu}[\ell(y,f(x))] + \varepsilon. $$

Note that the first term on the right-hand side vanishes if there exists an \(f^{*}\in {\mathscr{F}}\) s.t. \(\mu (x,y)=\mu _{1}(x)\delta _{y,f^{*}(x)}\) \(\forall (x,y)\in {\mathscr{X}}\times {\mathscr{Y}}\). In this case, we call the learning problem realizable, otherwise we refer to it as agnostic.

Both in the agnostic and in the realizable scenario, a learning algorithm that always outputs a hypothesis \(h\in {\mathscr{F}}\) is called a proper learner, and otherwise it is called improper.

A quantity of major interest is the number of examples featuring in such a learning problem. Given a learning algorithm \({\mathscr{A}}\), the smallest \(m=m(\varepsilon ,\delta )\in \mathbb {N}\) s.t. the learning requirement (2.1) is satisfied with confidence 1 − δ and accuracy ε is called the sample complexity of \({\mathscr{A}}\). The sample complexity of the learning problem is the infimum over the sample complexities of all learning algorithms for the problem. This characterizes, from an information-theoretic perspective, the hardness of a learning problem, but leaves aside questions of computational complexity.

The binary classification problem now arises as a special case from the above if we specify the output space \({\mathscr{Y}}=\lbrace 0,1\rbrace \) and take the loss function to be \(\ell (y,\tilde {y})=1-\delta _{y,\tilde {y}}\), the 0-1-loss. This setting is well studied and a characterization of its sample complexity is known. At its core is the following combinatorial parameter:

Definition 1 (VC-Dimension Vapnik and Chervonenkis (1971))

Let \({\mathscr{F}}\subseteq \lbrace 0,1\rbrace ^{{\mathscr{X}}}\). A set \(S = \lbrace x_{1},\ldots ,x_{n}\rbrace \subset {\mathscr{X}}\) is said to be shattered by \({\mathscr{F}}\) if for every b ∈{0,1}n there exists \(f_b\in {\mathscr{F}}\) s.t. fb(xi) = bi for all 1 ≤ i ≤ n.

The Vapnik-Chervonenkis (VC) dimension of \({\mathscr{F}}\subset \lbrace 0,1\rbrace ^{{\mathscr{X}}}\) is defined to be

$$ \text{VCdim}(\mathscr{F}):=\sup\lbrace n\in\mathbb{N}_{0}~|~\exists S\subset\mathscr{X}~\text{s.t. } |S|=n~\text{and } S~\text{is shattered by }\mathscr{F}\rbrace. $$
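For small finite classes, the shattering condition of Definition 1 can be checked exhaustively; a Python sketch (the threshold class used here is a standard illustrative example, not taken from the text):

```python
from itertools import combinations, product

def vc_dimension(patterns):
    """Brute-force VC-dimension of a finite class of 0/1-valued
    functions, each given as its value tuple over a finite domain."""
    n_points = len(next(iter(patterns)))
    d = 0
    for n in range(1, n_points + 1):
        for S in combinations(range(n_points), n):
            restricted = {tuple(f[i] for i in S) for f in patterns}
            if len(restricted) == 2 ** n:   # all 2^n labelings: shattered
                d = n
                break
    return d

domain = [0, 1, 2, 3]
# Threshold functions f_t(x) = 1[x >= t]: monotone, hence VC-dimension 1.
thresholds = {tuple(int(x >= t) for x in domain) for t in range(5)}
d_thr = vc_dimension(thresholds)

# The class of all labelings shatters the whole domain: VC-dimension 4.
full = set(product((0, 1), repeat=4))
d_full = vc_dimension(full)
```

This brute-force check is exponential in the domain size and only meant to make the definition concrete.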

The main insight of VC-theory lies in the fact that learnability of a {0,1}-valued concept class is equivalent to finiteness of its VC-dimension. Even more, the sample complexity can be expressed in terms of the VC-dimension. This is the content of the following theorem.

Theorem 1

(see, e.g., Blumer et al. 1989; Hanneke 2016; Shalev-Shwartz and Ben-David 2014; Vershynin 2018)

In the realizable scenario, the sample complexity of binary classification for a function class \({\mathscr{F}}\) of VC-dimension d is \(m=m(\varepsilon ,\delta )={{\varTheta }}\left (\frac {1}{\varepsilon }\left (d + \log {1}/{\delta }\right )\right )\).

In the agnostic scenario, the sample complexity of binary classification for a function class \({\mathscr{F}}\) of VC-dimension d is \(m=m(\varepsilon ,\delta )={{\varTheta }}\left (\frac {1}{\varepsilon ^{2}}\left (d + \log {1}/{\delta }\right )\right )\).

The proof of the sample complexity upper bound in the agnostic case typically goes via a different complexity measure, the Rademacher complexity, which is then related to the VC-dimension. As this will reappear later on in our analysis, we also recall this definition here.

Definition 2 (Rademacher Complexity (see Bartlett and Mendelson 2002))

Let \(\mathcal {Z}\) be some space, \({\mathscr{F}}\subseteq \mathbb {R}^{\mathcal {Z}}\), \(z\in \mathcal {Z}^n\). The empirical Rademacher complexity of \({\mathscr{F}}\) w.r.t. z is

$$ \begin{array}{@{}rcl@{}} \hat{\mathscr{R}}(\mathscr{F}) &:=\underset{\sigma\sim U(\lbrace -1,1\rbrace^{n})}{\mathbb{E}}\left[\underset{f\in\mathscr{F}}{\sup}\frac{1}{n}\sum\limits_{i=1}^{n} \sigma_{i} f(z_{i})\right]\\ &=\underset{\sigma\sim U(\lbrace -1,1\rbrace^{n})}{\mathbb{E}}\left[\underset{f\in\mathscr{F}}{\sup}\frac{1}{n}\langle\sigma,f(z)\rangle\right], \end{array} $$

where U({− 1,1}n) denotes the uniform distribution on {− 1,1}n.

If we consider n i.i.d. random variables Z1,...,Zn distributed according to a probability measure μ on \(\mathcal {Z}\) and write Z = (Z1,...,Zn), the Rademacher complexities of \({\mathscr{F}}\) w.r.t. μ are defined to be \({\mathscr{R}}_n({\mathscr{F}}):=\mathbb {E}_{Z\sim \mu ^n}[\hat {{\mathscr{R}}}({\mathscr{F}})]\), \(n\in \mathbb {N}\), where the empirical Rademacher complexity is evaluated at z = Z.
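For a finite class, the empirical Rademacher complexity can be estimated by Monte Carlo sampling over σ; a sketch with two illustrative extreme cases (a singleton class and the class realizing every sign pattern):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

def empirical_rademacher(F_values, n_draws=20_000):
    """Monte Carlo estimate of E_sigma sup_f (1/n) <sigma, f(z)> for a
    finite class given by its value table F_values of shape (|F|, n)."""
    n = F_values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    sups = (sigma @ F_values.T / n).max(axis=1)   # sup over the class
    return sups.mean()

# A single function: the signed sums average out, complexity ~ 0.
r_single = empirical_rademacher(np.ones((1, 10)))

# The class realizing every sign pattern on 4 points: complexity 1,
# since for each sigma some member of the class matches it exactly.
F_full = np.array(list(product([-1.0, 1.0], repeat=4)))
r_full = empirical_rademacher(F_full)
```

The two extremes bracket the general picture: richer classes can correlate better with random signs, and it is this correlation that the generalization bounds below penalize.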

The binary classification problem with classical instances and quantum labels

We introduce a generalization of the classical binary classification problem to the quantum realm by allowing the two labels to be quantum states. Thus let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\) be two (possibly mixed) quantum states and write \({\mathscr{D}}=\lbrace \sigma _{0},\sigma _{1}\rbrace \). We assume that classical descriptions of these states (their density matrices) are known to the learning algorithm as well as the fact that only these two quantum labels appear. The class to be learned is now a class of functions \({\mathscr{F}}\subset \{ f:{\mathscr{X}}\to {\mathscr{D}} \}\) and the underlying distribution will be a \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\), where \({\mathscr{X}}\) is some space of classical objects.

We now deviate from the standard PAC setting: We assume the training data to be \(S=\lbrace (x_{i},\rho _{i})\rbrace _{i=1}^{m}\), \(m\in \mathbb {N}\), where the (xi,ρi) are drawn independently according to μ (in particular, \(\rho _{i}\in {\mathscr{D}}\) for all i). Here, the ρi are the actual quantum states, not classical descriptions of them. Therefore, our learning problem is not a classical one; we have to perform measurements on the quantum labels to extract information from them. Equivalently, we represent an example (xi,ρi) drawn from μ as the classical-quantum state

$$\underset{x, \rho}{\sum}\mu (x,\rho)|x\rangle\langle x|\otimes\rho,$$

with \(\lbrace |x\rangle \rbrace _{x\in {\mathscr{X}}}\) orthonormal.
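Such a classical-quantum state is easy to assemble explicitly; a numpy sketch with a hypothetical distribution μ over two classical points and two illustrative label states:

```python
import numpy as np

# Hypothetical label states: sigma0 = |0><0| and sigma1 = |+><+|.
sigma0 = np.array([[1.0, 0.0], [0.0, 0.0]])
sigma1 = 0.5 * np.ones((2, 2))
labels = {0: sigma0, 1: sigma1}

# A hypothetical distribution mu over (x, label index) pairs, x in {0, 1}.
mu = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Assemble sum_{x,rho} mu(x, rho) |x><x| (x) rho.
cq_state = np.zeros((4, 4))
for (x, i), p in mu.items():
    ket = np.zeros(2)
    ket[x] = 1.0
    cq_state += p * np.kron(np.outer(ket, ket), labels[i])

trace = np.trace(cq_state)                       # = 1
min_eig = np.linalg.eigvalsh(cq_state).min()     # >= 0: a valid state
```

The resulting matrix is block diagonal in the classical register, reflecting that the classical part can be read out without disturbing the quantum label.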

Note that this model for the training data differs from the one introduced by Bshouty and Jackson (1998), where the training data consists of copies of a superposition state. Instead, here we assume copies of a mixture of states. This is done mainly for two reasons: First, it allows us to naturally talk about maps with mixed state outputs. Second, it is debatable whether assuming access to superposition examples as in Bshouty and Jackson (1998) is justified (see, e.g., Ciliberto et al. 2018, section 5), and this problem remains when considering maps with quantum outputs. In contrast, the mixtures assumed in our model arise naturally as statistical ensembles of outputs of state preparation procedures, if the parameters of the preparation are chosen according to some (unknown) distribution. In that sense, the form of classical-quantum training data assumed here is both a straightforward generalization of classical training data, given the standard probabilistic interpretation of mixed states, and can (at least in the realizable scenario) be easily imagined to be obtained as outcome of multiple runs of a state preparation experiment with different parameter settings.

A quantum learner for \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε from m = m(ε,δ) quantum examples has to output, for every \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\), with probability ≥ 1 − δ over the choice of training data of size m according to μ, a hypothesis \(h\in {\mathscr{D}}^{{\mathscr{X}}}\) s.t. \(R_{\mu }(h)\leq \underset {f\in {\mathscr{F}}}{\inf }R_{\mu }(f) + \varepsilon \). As before, we can consider agnostic versus realizable and proper versus improper variants of this learning model.

Here, we define the risk of a hypothesis \(h\in {\mathscr{D}}^{{\mathscr{X}}}\) w.r.t. a distribution \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\) as

$$ R_{\mu}(h):= \int\limits_{\mathscr{X}\times\mathscr{D}} \frac{1}{2} \left\|\rho - h(x)\right\|_{1} ~ \mathrm{d}\mu(x,\rho), $$

where \(\left \|\rho - \sigma \right \|_{1} = \text {tr}[|\rho -\sigma |]=\text {tr}[\sqrt {(\rho -\sigma )^{*}(\rho -\sigma )}]\) is the Schatten 1-norm.
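The distance term \(\frac {1}{2}\left \|\rho -\sigma \right \|_{1}\) appearing in the risk can be computed from the eigenvalues of ρ − σ; a short sketch with two illustrative pairs of states:

```python
import numpy as np

def trace_distance(rho, sigma):
    """(1/2)||rho - sigma||_1 via the eigenvalues of rho - sigma."""
    return 0.5 * np.sum(np.abs(np.linalg.eigvalsh(rho - sigma)))

# Orthogonal pure states are perfectly distinguishable: distance 1.
e0 = np.diag([1.0, 0.0])
e1 = np.diag([0.0, 1.0])
d_orth = trace_distance(e0, e1)

# Non-orthogonal pure states |0> and |+>: distance 1/sqrt(2) < 1.
plus = 0.5 * np.ones((2, 2))
d_overlap = trace_distance(e0, plus)
```

Since ρ − σ is Hermitian, its Schatten 1-norm is just the sum of the absolute values of its eigenvalues, which is what `trace_distance` exploits.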

Note that our assumption on \({\mathscr{F}}\) implies that \(h(x)\in {\mathscr{D}}\ \forall x\in {\mathscr{X}}\) and therefore we can easily rewrite

$$R_{\mu} (h)=\frac{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}}{2}\mathbb{P}_{(x,\rho)\sim\mu}[h(x)\neq \rho],$$

which is just the 0-1-risk multiplied by a constant. We choose the slightly more complicated looking definition for Rμ(h) for two reasons. On the one hand, \(\frac {\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}}{2}\) is a measure of the distinguishability of σ0 and σ1 and thus a natural scale w.r.t. which to measure the prediction error. (Note: If σ0,σ1 are orthogonal pure states and thus perfectly distinguishable, the classical scenario is recovered.) On the other hand, our definition of risk can be motivated operationally as we discuss in the Appendix.

Example 1

Here, we describe a physically motivated problem that is captured by our scenario. The idea is as follows: Suppose we have available a (possibly complicated) ground state preparation procedure. Using this, we want to prepare a ground state |φ0〉 of a Hamiltonian H. However, H is perturbed by noise about which we have only partial information. We want to learn more about the noise and its influence on the prepared ground state.

We make this idea more concrete. We consider a (self-adjoint) Hamiltonian \(H\in \mathbb {C}^{(d+2)\times (d+2)}\) with (non-unique) ground state \(|\varphi _{0}\rangle :=\begin {pmatrix} 0 & 1\end {pmatrix}^T\oplus 0\). Suppose that we have a ground state preparation procedure that, if run with Hamiltonian H, prepares |φ0〉. When implementing this procedure, we have to fix values of a parameter vector \(x\in \mathbb {R}^D\). (Think, e.g., of D = 3 and x denoting the location at which the experiment is set up.) But due to the laboratory being only imperfectly shielded, there is an unknown region \(R\subset \mathbb {R}^D\) in which the system is subject to noise. For simplicity, we assume that only two types of noise can occur and lead to the location-dependent Hamiltonian \(H^{(i)}_x = H + \mathbb {1}_R(x) H^{(i)}\), with noise Hamiltonians \(H^{(0)} = \begin {pmatrix} 1 & 0 \\ 0 & -1 \end {pmatrix}\oplus 0\), H(1) = \(\begin {pmatrix} 0 & 1 \\ 1 & 0\end {pmatrix}\oplus 0\).

The noise can lead to a perturbation of the ground state. Namely:

  • For x∉R, |φ0〉 is a ground state of \(H^{(i)}_x\). (This is the case of no effective noise.)

  • For x∈R, |φ0〉 is the unique ground state of \(H^{(0)}_x\). Hence, the noise H(0) is benign from the perspective of ground state preparation.

  • For x∈R, \(|\varphi _{1}\rangle :=\frac {1}{\sqrt {2}}\begin {pmatrix} 1 & -1 \end {pmatrix}^T\oplus 0\) is the unique ground state of \(H^{(1)}_x\). Hence, the noise H(1) is malicious from the perspective of ground state preparation.

Thus, we describe the ground state preparation by a function \(f^{(i)}_R:\mathbb {R}^D\to \{|\varphi _{0}\rangle \langle \varphi _{0}|, |\varphi _{1}\rangle \langle \varphi _{1}|\}\), \(f^{(i)}_R(x) := |\varphi _{i\cdot \mathbb {1}_R(x)}\rangle \langle \varphi _{i\cdot \mathbb {1}_R(x)}|\). With this formulation, gaining information about the noise region R and the noise type i can be phrased as the problem of (PAC-)learning an unknown element of the (known) function class \({\mathscr{F}}=\left \{f^{(i)}_R\right \}_{i=0,1,~R\in {\mathscr{R}}}\subseteq \{|\varphi _{0}\rangle \langle \varphi _{0}|, |\varphi _{1}\rangle \langle \varphi _{1}|\}^{\mathbb {R}^D}\), where \({\mathscr{R}}\) is the class of possible error regions.

Note that |φ0〉 and |φ1〉 are not orthogonal and thus cannot be perfectly distinguished. Therefore, we cannot phrase the learning problem as one of binary classification with classical labels.
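Concretely, for these two states (with an arbitrary small choice of the padding dimension d, used here only for illustration):

```python
import numpy as np

d = 3  # hypothetical padding dimension: the states live in C^(d+2)
phi0 = np.concatenate(([0.0, 1.0], np.zeros(d)))
phi1 = np.concatenate((np.array([1.0, -1.0]) / np.sqrt(2), np.zeros(d)))

overlap = abs(phi0 @ phi1)              # = 1/sqrt(2): not orthogonal
# For pure states, (1/2)||.||_1 = sqrt(1 - |<phi0|phi1>|^2).
trace_dist = np.sqrt(1.0 - overlap ** 2)
# Minimal single-copy error of distinguishing them (equal priors):
helstrom_error = 0.5 * (1.0 - trace_dist)
```

The overlap of 1/√2 means that even the optimal single-copy measurement mislabels the two ground states with probability about 0.146, which is precisely the kind of label noise the learning strategies below have to cope with.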

We return to this setting in Examples 2 and 3 to illustrate our learning strategies.

We want to conclude this section by discussing a drawback of our model. We assume \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\), i.e., outputs of any \(f\in {\mathscr{F}}\) are either σ0 or σ1. Considering the convex structure of the set of quantum states, which is intimately tied to the probabilistic interpretation of quantum theory, this restriction can be considered unnatural. We nevertheless make it, for two reasons: First, it is easy to show using a Bayesian predictor that, under the assumption of μ being supported on \({\mathscr{D}}\) (which could, of course, also be contested), the optimal choice of predictors among all functions \(({\mathscr{S}}(\mathbb {C}^d))^{{\mathscr{X}}}\) is actually a function in \({\mathscr{D}}^{{\mathscr{X}}}\). Second, it is the most direct analog of the classical scenario with binary labels and we consider it a sensible first step that, as demonstrated in Example 1, can already be of physical relevance.

Sample complexity upper bounds

The agnostic case

Our learning strategy is motivated by interpreting the classical training data arising from performing a measurement on the label states as a noisy version of the true training data. Before describing the learning strategy, we recall our assumption that classical descriptions of the label states σ0, σ1 are known to the learner. Based on this knowledge, the learner can derive the optimal measurement {E0,E1} for minimum-error discrimination between the two states, the so-called Holevo-Helstrom measurement (see Watrous 2018, Theorem 3.4), by choosing E0 to be the orthogonal projector onto the eigenspaces of σ0 − σ1 corresponding to nonnegative eigenvalues. This step is where knowledge of the states σ0 and σ1 is used.
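The construction just described can be sketched numerically via a spectral decomposition of σ0 − σ1. The following is a minimal sketch (not from the paper); the two example states are hypothetical:

```python
import numpy as np

def helstrom_measurement(sigma0, sigma1):
    """E0 = orthogonal projector onto the eigenspaces of sigma0 - sigma1 with
    nonnegative eigenvalues; {E0, E1} is the Holevo-Helstrom measurement."""
    vals, vecs = np.linalg.eigh(sigma0 - sigma1)
    pos = vecs[:, vals >= 0]                 # eigenvectors with eigenvalue >= 0
    E0 = pos @ pos.conj().T
    return E0, np.eye(sigma0.shape[0]) - E0

# Hypothetical example: distinguishing |0><0| from |+><+| with equal priors.
sigma0 = np.array([[1.0, 0.0], [0.0, 0.0]])
sigma1 = np.array([[0.5, 0.5], [0.5, 0.5]])
E0, E1 = helstrom_measurement(sigma0, sigma1)
p_success = 0.5 * float(np.trace(E0 @ sigma0) + np.trace(E1 @ sigma1))
trace_norm = float(np.abs(np.linalg.eigvalsh(sigma0 - sigma1)).sum())
```

Here `p_success` equals the Holevo-Helstrom optimum \(\frac{1}{2}+\frac{1}{4}\left\|\sigma_0-\sigma_1\right\|_1\).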

The learning strategy is now the following, in which we use the Holevo-Helstrom measurement to produce classical training data and thus obtain a classical learning problem:


Note that the only non-classical step in the strategy is step (1), which consists only of performing local two-outcome measurements.

The modification of the loss function in step (3) gives an unbiased estimate of the true risk:

Lemma 1

(see Natarajan et al. 2013, Lemma 1)

Fix \(x\in {\mathscr{X}}\). With the notation introduced above, for every z ∈{0,1} it holds that

We can use a standard generalization bound in terms of Rademacher complexities (see, e.g., Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)) to obtain: With probability ≥ 1 − δ over the choice of training data \(S=\{(x_{i},y_{i}) \}_{i=1}^{m}\) according to ν, we have that for all \(\tilde {f}^{\ast }\in \mathcal {\tilde {F}}\)

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\hat{g}(x),y)] - \mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\tilde{f}^{\ast}(x),y)]\\ &\leq& 2\hat{\mathscr{R}}(\tilde{\mathcal{G}}) + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{2\ln(8/\delta)}{m}}, \end{array} $$

where we used that \(|\tilde {\ell }(y_{1},y_{2})|\leq \frac {1}{1-\eta _{0}-\eta _{1}}\) and defined the function class

$$ \tilde{\mathcal{G}} := \{ \mathscr{X}\times\{ 0,1\}\ni (x,y)\mapsto \tilde{\ell}(\tilde{f}(x),y)~|~\tilde{f}\in\tilde{\mathscr{F}}\}. $$
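Lemma 1's unbiasedness property can be checked directly. The sketch below assumes the standard noise-corrected 0-1 loss of Natarajan et al. (2013), \(\tilde{\ell}(t,y) = \frac{(1-\eta_{1\oplus y})\mathbb{1}[t\neq y] - \eta_y \mathbb{1}[t = y]}{1-\eta_0-\eta_1}\), which is the form expanded in the Appendix:

```python
def tilde_loss(t, y, eta0, eta1):
    """Noise-corrected 0-1 loss; eta_b is the flip rate of true label b."""
    eta = (eta0, eta1)
    mismatch = 1.0 if t != y else 0.0
    return ((1 - eta[1 - y]) * mismatch - eta[y] * (1 - mismatch)) / (1 - eta0 - eta1)

def expected_tilde_loss(t, b, eta0, eta1):
    """Average tilde_loss over the noisy label z: z = b with probability
    1 - eta_b and z = 1 - b with probability eta_b."""
    eta = (eta0, eta1)
    return ((1 - eta[b]) * tilde_loss(t, b, eta0, eta1)
            + eta[b] * tilde_loss(t, 1 - b, eta0, eta1))

# Unbiasedness (Lemma 1): the expectation recovers the clean 0-1 loss.
eta0, eta1 = 0.15, 0.25
checks = [abs(expected_tilde_loss(t, b, eta0, eta1) - (1.0 if t != b else 0.0)) < 1e-12
          for t in (0, 1) for b in (0, 1)]
```

The noise rates 0.15 and 0.25 are arbitrary; any pair with η0 + η1 < 1 works.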

Next, we relate the empirical Rademacher complexity of \(\tilde {\mathcal {G}}\) to that of \(\tilde {{\mathscr{F}}}\).

Lemma 2

For any training data set \(S=\{(x_{i},y_{i}) \}_{i=1}^{m}\), viewed as an element of \(({\mathscr{X}}\times \{ 0,1\})^{m}\), we have

$$ \hat{\mathscr{R}} (\tilde{\mathcal{G}}) \leq \frac{2}{1-\eta_{0}-\eta_{1}}\hat{\mathscr{R}}(\tilde{\mathscr{F}}). $$


(Sketch) The proof uses standard steps of the kind typically employed in proving the Lipschitz contraction property of the Rademacher complexity and in analyzing the Rademacher complexity in a binary classification scenario.

See the Appendix for a detailed proof. □

With this, we now reformulate the above result in terms of the VC-dimension. Suppose \(\text {VCdim} (\tilde {{\mathscr{F}}})=d<\infty \). Then \(\hat {{\mathscr{R}}}(\tilde {{\mathscr{F}}}) \leq 31\sqrt {\frac {d}{m}}\) (see, e.g., Vershynin 2018, Theorem 8.3.23). Therefore, we obtain that, with probability ≥ 1 − δ over the choice of training data \(S=\{(x_{i},y_{i}) \}_{i=1}^{m}\) according to ν,

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\hat{g}(x),y)] - \underset{\tilde{f}\in\tilde{\mathscr{F}}}{\inf}\mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\tilde{f}(x),y)]\\ &\leq& \frac{124}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{d}{m}} + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{2\ln(8/\delta)}{m}}. \end{array} $$

Note that, using Lemma 1, we can now bound

Now we can set this equal to ε and rearrange to conclude that a sample size of

$$ m\geq \frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}{4\varepsilon^{2}}\left( \frac{124}{1-\eta_{0}-\eta_{1}}\sqrt{d} + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{2\ln(8/\delta)} \right)^{2} $$

suffices to guarantee that, with probability ≥ 1 − δ, \(R_{\mu }(\hat {h}) - \underset {f\in {\mathscr{F}}}{\inf } R_{\mu } (f)\leq \varepsilon \).

If we now observe that \(\frac {1}{1-\eta _{0}-\eta _{1}}\leq \frac {4}{\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}\), we obtain the sample complexity upper bound

$$ m =m(\varepsilon,\delta) = \mathscr{O}\left( \frac{d}{\varepsilon^{2}} + \frac{\log{1}/{\delta}}{\varepsilon^{2}}\right). $$
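For concreteness, the explicit sufficient sample size derived above can be evaluated numerically. A small sketch, with the constants 124 and 5 taken verbatim from the bound (they are certainly not optimized):

```python
import math

def agnostic_sample_size(d, eps, delta, trace_dist, eta0, eta1):
    """Sufficient m from the explicit bound above;
    trace_dist = ||sigma0 - sigma1||_1, eta_b are the Helstrom error rates."""
    inner = (124 * math.sqrt(d) + 5 * math.sqrt(2 * math.log(8 / delta))) / (1 - eta0 - eta1)
    return math.ceil(trace_dist ** 2 / (4 * eps ** 2) * inner ** 2)

# Hypothetical usage: orthogonal pure label states (trace distance 2, no noise).
m_ref = agnostic_sample_size(5, 0.1, 0.05, 2.0, 0.0, 0.0)
```

As expected from the O-bound, the returned size grows linearly in d and quadratically in 1/ε.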

Remark 1

The naive version of our learning strategy would be to perform Holevo-Helstrom measurements and then apply a classical learning strategy, like empirical risk minimization, without correcting for the noise in the resulting classical labels. This learning strategy already performs reasonably well and, in certain special cases, even allows us to reduce the quantum learning problem to a fully classical one. For a detailed analysis of the performance of this simpler strategy, the reader is referred to the Appendix.

Example 2

We illustrate our agnostic learning strategy for the scenario of Example 1. As discussed in the Appendix, since both label states |φ0〉〈φ0| and |φ1〉〈φ1| are pure, we can actually dispense with the modification of the classical loss function and simply take the 0-1-loss. Therefore, the Holevo-Helstrom strategy looks as follows: We first perform local Holevo-Helstrom measurements with measurement operators \(E_{0} \propto \begin {pmatrix} -1+\sqrt {2} & 1\end {pmatrix}^T \begin {pmatrix} -1+\sqrt {2} & 1\end {pmatrix}\oplus 0\) and \(E_1 = \mathbb{1} - E_0\). This gives rise to classical training data. With that data, we then perform (classical) empirical risk minimization over the class \(\tilde {{\mathscr{F}}}=\left \{\tilde {f}^{(i)}_R \right \}_{i=0,1,~R\in {\mathscr{R}}}\), where \(\tilde {f}^{(i)}_R:\mathbb {R}^D\to \{0,1\}\). Note that \(\tilde {f}^{(0)}_R\) is the zero-function for every \(R\in {\mathscr{R}}\).
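The measurement operators of this example can be reproduced numerically. The sketch below assumes \(|\varphi_0\rangle = \begin{pmatrix}0 & 1\end{pmatrix}^T \oplus 0\), the noiseless ground state from Example 1 (not restated in this section), and drops the trivial ⊕ 0 block:

```python
import numpy as np

# Assumption (from Example 1, not restated here): |phi_0> = (0, 1)^T (+) 0.
phi0 = np.array([0.0, 1.0])
phi1 = np.array([1.0, -1.0]) / np.sqrt(2.0)
sigma0 = np.outer(phi0, phi0)
sigma1 = np.outer(phi1, phi1)

# E0 projects onto the positive eigenspace of sigma0 - sigma1.
vals, vecs = np.linalg.eigh(sigma0 - sigma1)
pos = vecs[:, vals > 1e-12]
E0 = pos @ pos.T

# Closed form stated in the text: E0 proportional to v v^T, v = (sqrt(2)-1, 1).
v = np.array([np.sqrt(2.0) - 1.0, 1.0])
E0_closed = np.outer(v, v) / (v @ v)
match = bool(np.allclose(E0, E0_closed))

# Induced label-noise rates; by symmetry both equal sin^2(pi/8) ~ 0.1464.
eta0 = float(np.trace((np.eye(2) - E0) @ sigma0))
eta1 = float(np.trace(E0 @ sigma1))
```

The symmetric error rate sin²(π/8) reflects that the two ground states have overlap 1/√2.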

Both the optimization procedure and the generalization capability depend on the class \({\mathscr{R}}\) of possible noise regions. Concerning the generalization performance, observe that, if \(\emptyset \in {\mathscr{R}}\), then \(\text {VCdim} (\tilde {{\mathscr{F}}})=\text {VCdim}(\tilde {{\mathscr{F}}}_{{\mathscr{R}}})\), where we take \(\tilde {{\mathscr{F}}}_{{\mathscr{R}}}\) to be the class of indicator functions of sets from \({\mathscr{R}}\). The VC-dimension of such classes is well known for various geometric classes \({\mathscr{R}}\). E.g., if \({\mathscr{R}}\) is the class of axis-aligned rectangles or that of Euclidean balls in \(\mathbb {R}^D\), then \(\text {VCdim} (\tilde {{\mathscr{F}}}_{{\mathscr{R}}})\) scales linearly in D, and thus the dependence of the sample complexity upper bound on the number of parameters D is linear. If, however, we take \({\mathscr{R}}\) to be the class of compact and convex subsets of \(\mathbb {R}^D\), then \(\text {VCdim} (\tilde {{\mathscr{F}}}_{{\mathscr{R}}})=\infty \) and the sample complexity upper bound becomes void. This is congruent with the intuition that, without prior assumptions on the structure of the regions that can be influenced by noise, learning the noise (in particular its region) will be hard and possibly infeasible.

The realizable case

The strategy from the previous subsection uses a generalization bound via the Rademacher complexity and yields a sample complexity bound depending quadratically on 1/ε. In the classical binary classification problem it is known (see Theorem 1) that under the realizability assumption this can be improved to a linear dependence on 1/ε, but this typically requires a different kind of reasoning via ε-nets (compare Section 28.3 of Shalev-Shwartz and Ben-David (2014)). In Theorem 6 we show how the reasoning of Hanneke (2016) can be combined with results of Laird (1988) to achieve the 1/ε-scaling also in the case of two-sided classification noise. By comparison with the lower bound in Theorem 27 of Arunachalam and de Wolf (2018), this sample complexity upper bound is seen to be optimal in its dependence on the VC-dimension d, the error rate bound η, the confidence δ and the accuracy ε.

If, as in the previous subsection, we view the classical training data obtained by measuring the quantum training data as a noisy version of a true sample, we can replace step (3) of the Holevo-Helstrom strategy with the minimum-disagreement-based classical learning strategy achieving the optimal sample complexity bound of Theorem D.2. This directly yields the following

Theorem 2

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) quantum states. Let ε ∈ (0,1), \(\delta \in (0,2\cdot (\frac {2e}{d})^d)\), where d is the VC-dimension of \({\mathscr{F}}\subset \lbrace 0,1\rbrace ^{{\mathscr{X}}}\). Then

$$m = \mathscr{O}\left( \frac{1}{\varepsilon (1 - 2\max\lbrace \text{tr}[E_{0}\sigma_{1}],\text{tr}[E_{1}\sigma_{0}]\rbrace)^{2}} \left( d + \log{1}/{\delta}\right)\right)$$

quantum examples of a function in \({\mathscr{F}}\) are sufficient for binary classification with classical instances and quantum labels σ0,σ1 with accuracy ε and confidence 1 − δ.
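Treating the implicit constant in Theorem 2 as a parameter turns the bound into a rough sample-size estimate. A sketch (the constant C is unspecified by the theorem and chosen hypothetically here):

```python
import math

def realizable_sample_size(d, eps, delta, err0, err1, C=1.0):
    """Order-of-magnitude m from Theorem 2. C stands in for the unspecified
    absolute constant hidden in the O(.) notation;
    err0 = tr[E0 sigma1] and err1 = tr[E1 sigma0] are the Helstrom error rates."""
    gap = 1.0 - 2.0 * max(err0, err1)
    return math.ceil(C * (d + math.log(1.0 / delta)) / (eps * gap ** 2))

# Hypothetical usage with symmetric 10% measurement error rates.
m_ref = realizable_sample_size(10, 0.01, 0.05, 0.1, 0.1)
```

Note the 1/ε-scaling, in contrast to the 1/ε² of the agnostic bound above.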

Example 3

When considering this learning strategy in the setting of Example 1, we first perform the Holevo-Helstrom measurements as in Example 2 to obtain classical data. Again, this is followed by a classical learning procedure for the class \(\tilde {{\mathscr{F}}}=\left \{\tilde {f}^{(i)}_R \right \}_{i=0,1,~R\in {\mathscr{R}}}\).

Whereas the sample complexity bound derived for the agnostic case in Section 4.1 applies to any (noise-corrected) classical empirical risk minimization, the procedure leading to the bound in Theorem 2 is a specific one, presented in the proof of Theorem D.2. First, the classical data is processed, using the subsampling algorithm of Hanneke (2016) (see Algorithm 2), to generate a collection of subsamples. For each of those subsamples, we then apply Algorithm 1: We use a first part of the subsample to group the elements of \(\tilde {{\mathscr{F}}}\) into equivalence classes (according to how they act on that part of the subsample), and the remainder is used to test the performance of each equivalence class. Afterwards, we output as hypothesis for that subsample a representative of the equivalence class that performs best in that test, i.e., that minimizes the number of disagreements with the part of the subsample used for testing. Whether and how the grouping into equivalence classes and finding minimum disagreement strategies can be done (efficiently) depends on \(\tilde {{\mathscr{F}}}\), and thus on \({\mathscr{R}}\). Finally, we take a majority vote over all the subsample hypotheses to get the output hypothesis of the classical learning procedure.
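The classical post-processing just described can be sketched in code. The following is a simplified illustration of the grouping-and-minimum-disagreement step and the final majority vote (the actual Algorithms 1 and 2 involve a specific recursive subsampling scheme not reproduced here); the threshold class in the usage example is hypothetical:

```python
from collections import Counter

def min_disagreement_hypothesis(hypotheses, grouping_part, testing_part):
    """Group hypotheses into equivalence classes by their behavior on the
    first part of the subsample, then return a representative of the class
    with the fewest disagreements on the second part."""
    classes = {}
    for h in hypotheses:
        key = tuple(h(x) for x, _ in grouping_part)   # behavior on first part
        classes.setdefault(key, h)                    # one representative per class
    return min(classes.values(),
               key=lambda h: sum(h(x) != y for x, y in testing_part))

def majority_vote(hyps):
    """Combine subsample hypotheses by a pointwise majority vote."""
    def voted(x):
        return Counter(h(x) for h in hyps).most_common(1)[0][0]
    return voted

# Hypothetical usage: threshold functions x -> 1[x >= t] on integers.
hypotheses = [lambda x, t=t: int(x >= t) for t in (0, 1, 2, 3)]
grouping_part = [(0, 0), (2, 1)]
testing_part = [(0, 0), (2, 1), (3, 1)]
best = min_disagreement_hypothesis(hypotheses, grouping_part, testing_part)
voted = majority_vote([best, best, hypotheses[0]])
```

In the toy run, the threshold t = 1 class wins, since it is the only one consistent with all three test points.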

The dependence of the sample complexity on \(\tilde {{\mathscr{F}}}\) via the VC-dimension of the class of indicator functions of sets from \({\mathscr{R}}\) is analogous to Example 2.

Remark 2

From the description of our noise-corrected Holevo-Helstrom strategy (either in the form of Section 4.1 or that of this subsection), we can directly see that whether it is a proper or an improper learner depends on whether the classical learning algorithm in step (3) is. As the classical learning algorithm used in Section 4.1 is a simple Empirical Risk Minimization, it is in particular proper. So our noise-corrected Holevo-Helstrom strategy for the agnostic case is proper as well. The classical learner used in this subsection, however, is in general improper. So also the noise-corrected Holevo-Helstrom strategy for the realizable case is in general improper.

Sample complexity lower bounds

Whereas the goal of the previous section was to give strategies for solving the binary classification problem with classical instances and quantum labels and to prove upper bounds on the sufficient number of classical-quantum examples, we now turn to the complementary question of lower bounds on the number of required examples. In this section, we derive lower bounds that match the respective upper bounds from the previous section, and therefore, we conclude that the procedures described in Section 4 are optimal w.r.t. sample size in terms of the dependence on ε, δ, and d.

The agnostic case

We prove the sample complexity lower bounds in two parts, the first depending on the confidence parameter δ but not on the VC-dimension of the function class and conversely for the second.

We establish the VC-dimension-independent sample complexity lower bound in the following

Lemma 3

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\), let \(\varepsilon \in (0,\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{2\sqrt {2}})\), δ ∈ (0,1). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class. Suppose \({\mathscr{A}}\) is a learning algorithm that solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples. Then \(m\geq {\varOmega }\left (\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}^{2}\frac {\log {1}/{\delta }}{\varepsilon ^{2}}\right )\).


(Sketch) As \({\mathscr{F}}\) is non-trivial, there exist concepts \(f, g\in {\mathscr{F}}\) and a point \(x\in {\mathscr{X}}\) s.t. f(x) = σ0 and g(x) = σ1. Let \(\lambda =\frac {\varepsilon }{2\left \|{\sigma _{0} - \sigma _{1}}\right \|_{1}}\in (0,1)\). Define probability distributions μ± on \({\mathscr{X}}\times {\mathscr{D}}\) via

$$ \mu_{\pm}(x,f(x)) = \frac{1\pm \lambda}{2},\quad \mu_{\pm}(x,g(x))=\frac{1\mp\lambda}{2}. $$

By explicitly evaluating the risk R±(h), we see that achieving an excess risk ≤ ε with probability ≥ 1 − δ, requires the learner to distinguish between the underlying distributions μ±, and thus the corresponding training data states \(\rho _{\pm }^{\otimes m}\), with probability ≥ 1 − δ.

It is well known (see, e.g., Nielsen and Chuang 2009, chapter 9) that the optimal success probability of this quantum distinguishing task is given by

$$p_{\text{opt}} = \frac{1}{2}(1+\frac{1}{2}\left\|{\rho_{+}^{\otimes m} - \rho_{-}^{\otimes m}}\right\|_{1}).$$

Via the Fuchs-van de Graaf inequalities, which state that

$$ \frac{1}{2}\left\|{\rho_{1}^{\otimes m} - \rho_{2}^{\otimes m}}\right\|_{1} \leq \sqrt{1-F(\rho_{1}^{\otimes m}, \rho_{2}^{\otimes m})^{2}} = \sqrt{1-F(\rho_{1}, \rho_{2})^{2m}}, $$

this can be upper bounded using lower bounds on the fidelity \(F(\rho _{+}^{\otimes m},\rho _{-}^{\otimes m}) = F(\rho _{+},\rho _{-})^{m}\). The fidelity \(F(\rho _{+},\rho _{-})\) can be lower-bounded using its strong concavity and the explicit expressions for ρ±. The result then follows by comparing the obtained upper bound with the required lower bound popt ≥ 1 − δ.
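The two ingredients of this step, multiplicativity of the fidelity over tensor powers and the Fuchs-van de Graaf upper bound on the trace distance, can be checked numerically. A minimal sketch on random mixed states (not the specific ρ± of the proof):

```python
import numpy as np

def psd_sqrt(rho):
    """Matrix square root of a positive semidefinite Hermitian matrix."""
    vals, vecs = np.linalg.eigh(rho)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = tr sqrt( sqrt(rho) sigma sqrt(rho) )."""
    s = psd_sqrt(rho)
    return float(np.trace(psd_sqrt(s @ sigma @ s)).real)

def trace_distance(rho, sigma):
    return 0.5 * float(np.abs(np.linalg.eigvalsh(rho - sigma)).sum())

def random_state(d, rng):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(0)
rho1, rho2 = random_state(2, rng), random_state(2, rng)
m = 3
rho1_m, rho2_m = rho1, rho2
for _ in range(m - 1):
    rho1_m, rho2_m = np.kron(rho1_m, rho1), np.kron(rho2_m, rho2)

F = fidelity(rho1, rho2)
mult_ok = abs(fidelity(rho1_m, rho2_m) - F ** m) < 1e-8     # F(rho^m) = F^m
fvg_ok = trace_distance(rho1_m, rho2_m) <= np.sqrt(1.0 - F ** (2 * m)) + 1e-10
```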

See the Appendix for a detailed proof. □

For the proof of the VC-dimension-dependent part of the lower bound we need a well known observation about the eigenvalues of a statistical mixture of two pure quantum states, which is the content of the following

Lemma 4

Let \(|\psi \rangle ,|\phi \rangle \in \mathbb {C}^n\) be distinct pure quantum states. Let α,β ≥ 0 be real numbers. Then the non-zero eigenvalues of the mixture ρ := α|ψ〉〈ψ| + β|ϕ〉〈ϕ| are given by

$$ \begin{array}{@{}rcl@{}} \lambda_{1/2}(\rho) = \frac{\alpha+\beta\pm\sqrt{(\alpha-\beta)^{2} + 4\alpha\beta |\langle\psi |\phi\rangle|^{2}}}{2}. \end{array} $$
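Lemma 4 is easy to verify numerically. A sketch comparing the closed form against a numerical eigendecomposition for random pure states:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_pure(d, rng):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

d, alpha, beta = 4, 0.3, 0.6          # alpha + beta need not equal 1 in the lemma
psi, phi = random_pure(d, rng), random_pure(d, rng)
rho = alpha * np.outer(psi, psi.conj()) + beta * np.outer(phi, phi.conj())

# Closed form from Lemma 4 for the two non-zero eigenvalues.
overlap_sq = abs(np.vdot(psi, phi)) ** 2
disc = np.sqrt((alpha - beta) ** 2 + 4 * alpha * beta * overlap_sq)
lam_formula = np.array([(alpha + beta - disc) / 2, (alpha + beta + disc) / 2])

# Numerical eigenvalues: the two largest; the remaining d - 2 vanish.
lam_numeric = np.sort(np.linalg.eigvalsh(rho))[-2:]
match = bool(np.allclose(lam_numeric, lam_formula))
```

As a sanity check, the two eigenvalues sum to α + β, the trace of ρ.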

With this we can now prove a sample complexity lower bound for the case of pure label states.

Theorem 3

Let \(\sigma _{0}=|\psi _{0}\rangle \langle \psi _{0}|,\sigma _{1}=|\psi _{1}\rangle \langle \psi _{1}|\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) pure quantum states, let \(\varepsilon \in (0,\frac {\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}}{8})\), \(\delta \in (0,1-H\left (\frac {1}{4}\right ))\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class s.t. \(\tilde {{\mathscr{F}}}\) has VC-dimension d. Suppose \({\mathscr{A}}\) is a learning algorithm that solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples. Then \(m\geq {\varOmega }\left (\frac {d}{\varepsilon ^{2}}\right )\).


(Sketch) We follow the information-theoretic proof strategy of Arunachalam and de Wolf (2018). Let \(S=(s_{1},\ldots ,s_d)\in {\mathscr{X}}^d\) be a tuple shattered by \(\tilde {{\mathscr{F}}}\). For each a ∈{0,1}d, define the distribution μa on {1,…,d}×{0,1} via

$$ \mu_{a}(i,b) := \frac{1}{2d}\left( 1 + (-1)^{a_{i} + b} \frac{8\varepsilon}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}\right). $$

Note that \(\forall a\in \lbrace 0,1\rbrace ^d\ \exists f_a\in \tilde {{\mathscr{F}}}: f_a(s_i)=a_i\) by shattering and that fa is a minimum error concept w.r.t. μa. By evaluating the excess error of an \(f_{\tilde {a}}\) compared to fa, we see that solving the learning problem with confidence 1 − δ requires the learner to output, with probability ≥ 1 − δ, a hypothesis described by a string whose Hamming distance to the true underlying string is \(\leq \frac {d}{4}\). We can use this observation to obtain the lower bound I(A : B) ≥Ω(d) on the mutual information between underlying string A (drawn uniformly at random) and corresponding quantum training data B.

We can also upper bound the mutual information. A standard argument shows I(A : B) ≤ mI(A : B1), where m is the number of copies of the quantum example state and B1 describes a single quantum example state. Using Lemma 4 and the explicit expression for a quantum example state, we can compute I(A : B1) and use Taylor expansion to see that \(I(A:B_{1})\leq {\mathscr{O}}(\varepsilon ^{2})\). Comparing the lower and upper bounds on I(A : B) now gives \(m\geq {\varOmega }\left (\frac {d}{\varepsilon ^{2}}\right )\).

See the Appendix for a detailed proof. □

If we now combine Lemma 3 and Theorem 3 with the result of Section 4.1 we obtain

Corollary 1

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) pure quantum states, let \(\varepsilon \in (0,\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{8})\), \(\delta \in (0,1-H\left (\frac {1}{4}\right ))\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class s.t. \(\tilde {{\mathscr{F}}}\) has VC-dimension d. Then a sample size of \({{\varTheta }}\left (\frac {d}{\varepsilon ^{2}} + \frac {\log {1}/{\delta }}{\varepsilon ^{2}} \right )\) is necessary and sufficient for solving the binary classification task with classical instances and quantum labels σ0,σ1 and hypothesis class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε.

Therefore, we have shown that the strategy from Section 4.1 is, for pure states, optimal in sample complexity w.r.t. its dependence on the VC-dimension, the accuracy and the confidence. However, we do not make a statement on optimality w.r.t. the dependence on the distinguishability of the label states, because the parameter \(\left \|{\sigma _{0} - \sigma _{1}}\right \|_{1}\) does not appear in our lower bound.

The realizable case

We now show analogous lower bounds for the sample complexity in the realizable scenario with the same proof strategy.

Lemma 5

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\), let \(\varepsilon \in (0,\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{2})\), \(\delta \in (0,\frac {1}{2})\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class. Suppose \({\mathscr{A}}\) is a learning algorithm which solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples in the realizable scenario. Then \(m\geq {\varOmega }\left (\frac {\log {1}/{\delta }}{\varepsilon }\right )\).


This can be proved similarly to Lemma 3. See the Appendix for a detailed proof. □

We now provide the analog of Theorem 3 for the realizable case.

Theorem 4

Let \(\sigma _{0}=|\psi _{0}\rangle \langle \psi _{0}|,\sigma _{1}=|\psi _{1}\rangle \langle \psi _{1}|\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) pure quantum states, let \(\varepsilon \in (0,\frac {\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}}{8})\), \(\delta \in (0,\frac {1}{2})\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class s.t. \(\tilde {{\mathscr{F}}}\) has VC-dimension d + 1. Suppose \({\mathscr{A}}\) is a learning algorithm which solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples in the realizable case. Then \(m\geq {\varOmega }\left (\frac {d}{\varepsilon }\right )\).


This can be proved similarly to Theorem 3. See the Appendix for a detailed proof. □

Thus, we have obtained a sample complexity lower bound that matches the upper bound proved in Section 4.2 in the dependence on the VC-dimension, the confidence and the accuracy, but we do not make a statement about optimality w.r.t. the dependence on \(\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}\).

Remark 3

As already discussed in Section 2.1, in proving the sample complexity lower bounds we resort to the Heisenberg picture, which allows us to absorb the intermediate quantum channels performed by a learner into the measurement. These lower bounds therefore hold even for quantum learning algorithms that perform coherent and adaptive measurements on the training data. In particular, the information-theoretic complexity of our learning problem does not change if we restrict the quantum learner to only performing two-outcome POVMs locally (i.e., on one subsystem only). This is perhaps not too surprising, since the optimal measurement for distinguishing states drawn uniformly at random from \(\{\bigotimes _{i=1}^{m} \sigma _{x_{i}}\}_{x\in \{0,1\}^{m}}\) can, using the Holevo-Yuen-Kennedy-Lax optimality criterion (Holevo 1973; Yuen et al. 1975), be seen to be exactly given by local Holevo-Helstrom measurements.

Conclusion and outlook

We have proposed a novel way of modifying the classical binary classification problem to obtain a quantum counterpart. The conceptual difference to the framework of quantum PAC learning as discussed in Arunachalam and de Wolf (2017) is that we work with maps whose outputs are themselves quantum states, not classical labels. This naturally gives rise to training data given by quantum states, which is one aspect in which our setting differs from Aaronson (2007).

Using results from classical learning theory on dealing with classification noise in the training data, we exhibited learning strategies (based on the Holevo-Helstrom measurement) for binary classification with classical instances and quantum labels. The learning strategies consist of two main steps: First, classical information is extracted from the training data by performing a (localized) measurement. Second, classical learning strategies are applied. We complemented these procedures with sample complexity lower bounds, thereby establishing the information-theoretic optimality of these strategies for pure label states w.r.t. the dependence on VC-dimension, confidence and accuracy.

We conclude with some questions that we leave open for further research:

  • Can we derive sample complexity lower bounds which explicitly incorporate factors related to the hardness of distinguishing σ0 and σ1, e.g., in terms of \(\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}\) or \(\max \limits \lbrace \text {tr}[E_{0}\sigma _{1}],\text {tr}[E_{1}\sigma _{0}]\}\)? Or can the corresponding factors in the upper bounds be eliminated? Could this be related to another complexity measure from classical learning theory, the “fat-shattering dimension” of the class

    $$\{\mathscr{X}\times\mathscr{E}(\mathbb{C}^{d}) \ni (x,E)\mapsto \text{tr}[Ef(x)]~|~f\in\mathscr{F}\}?$$
  • Our analysis is focused on the information-theoretic part of the learning problem, i.e., the sample complexity. Can we improve the computational complexity?

  • For deriving our sample complexity upper bounds, we used specific classical learning procedures applied to the post-measurement training data: in the agnostic case, empirical risk minimization; in the realizable case, a combination of a minimum-disagreement approach with a subsampling procedure. In both cases, we chose these algorithms to achieve the (essentially) optimal sample complexity characterized via the VC-dimension.

    However, we could use other classical learning procedures for “post-processing”. Can we identify situations in which procedures like structural risk minimization, compression schemes, or stable learning procedures yield useful sample complexity bounds?

  • We considered the case of classical instances. Can this be extended to a scenario of quantum instances with classical (or even quantum) labels? Whereas we were able to study the case of classical instances and quantum labels with methods from learning with label noise, once the instances themselves are quantum, we might have to employ ideas from learning models with restricted access to the instances such as that of “learning with restricted focus of attention” proposed in Ben-David and Dichterman (1998).

  • Our strategy uses the Holevo-Helstrom measurement which can be understood as inducing the minimum amount of noise. However, in classical learning theory it is well known that adding noise to the training data can be helpful in preventing overfitting. In this spirit, can we justify other measurements than the Holevo-Helstrom measurement?

  • We assumed throughout our analysis that the learning algorithm has to output a hypothesis that maps into {σ0,σ1}. What if we allow for hypotheses that map into \(\text {conv}\left (\lbrace \sigma _{0},\sigma _{1}\rbrace \right )\) or \({\mathscr{S}}(\mathbb {C}^d)\)?

  • Finally, we assume throughout that the label states σ0, σ1 are known in advance. Can this assumption be removed? Here, it might be helpful that Theorem 6 does not need explicit knowledge of the error rates η0, η1, but merely of an upper bound ηb on them.


  1. Aaronson S (2007) The learnability of quantum states. Proc Roy Soc A Math Phys Eng Sci 463(2088):3089–3114

  2. Aaronson S (2018) Shadow tomography of quantum states. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2018). Association for Computing Machinery, New York, pp 325–338

  3. Aaronson S, Hazan EE, Chen X, Kale S, Nayak A (2018) Online learning of quantum states. In: Advances in Neural Information Processing Systems, pp 8962–8972

  4. Angluin D, Laird P (1988) Learning from noisy examples. Mach Learn 2(4):343–370

  5. Arunachalam S, de Wolf R (2017) Guest column: A survey of quantum learning theory. SIGACT News 48

  6. Arunachalam S, de Wolf R (2018) Optimal quantum sample complexity of learning algorithms. J Mach Learn Res 19(71):1–36

  7. Arunachalam S, Chakraborty S, Lee T, Paraashar M, de Wolf R (2019a) Two new results about quantum exact learning. In: Baier C, Chatzigiannakis I, Flocchini P, Leonardi S (eds) 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), Leibniz International Proceedings in Informatics (LIPIcs), vol 132. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, pp 16:1–16:15

  8. Arunachalam S, Grilo AB, Sundaram A (2019b) Quantum hardness of learning shallow classical circuits. arXiv:1903.02840

  9. Arunachalam S, Grilo AB, Yuen H (2020) Quantum statistical query learning

  10. Aslam JA, Decatur SE (1996) On the sample complexity of noise-tolerant learning. Inf Process Lett 57(4):189–195

  11. Atıcı A, Servedio RA (2007) Quantum algorithms for learning and testing juntas. Quantum Inf Process 6(5):323–348

  12. Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: Risk bounds and structural results. J Mach Learn Res 3(Nov):463–482

  13. Ben-David S, Dichterman E (1998) Learning with restricted focus of attention. J Comput Syst Sci 56(3):277–298

  14. Bernstein E, Vazirani U (1993) Quantum complexity theory. In: Kosaraju R (ed) Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing. ACM, New York, pp 11–20

  15. Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the Vapnik-Chervonenkis dimension. J ACM 36(4):929–965

  16. Brandão FGSL, Kastoryano MJ (2019) Finite correlation length implies efficient preparation of quantum thermal states. Commun Math Phys 365(1):1–16

  17. Bshouty NH, Jackson JC (1998) Learning DNF over the uniform distribution using a quantum example oracle. SIAM J Comput 28(3):1136–1153

  18. Caro MC (2020) Quantum learning Boolean linear functions w.r.t. product distributions. Quantum Inf Process 19(6):1–41

  19. Cesa-Bianchi N, Dichterman E, Fischer P, Shamir E, Simon HU (1999) Sample-efficient strategies for learning in the presence of noise. J ACM 46(5):684–719

  20. Cheng HC, Hsieh MH, Yeh PC (2016) The learnability of unknown quantum measurements. Quantum Inf Comput 16(7-8):615–656

  21. Chowdhury AN, Low GH, Wiebe N (2020) A variational quantum algorithm for preparing quantum Gibbs states

  22. Chung KM, Lin HH (2018) Sample efficient algorithms for learning quantum channels in PAC model and the approximate state discrimination problem

  23. Ciliberto C, Herbster M, Ialongo AD, Pontil M, Rocchetto A, Severini S, Wossnig L (2018) Quantum machine learning: A classical perspective. Proc Roy Soc A Math Phys Eng Sci 474(2209):20170551

  24. Cross AW, Smith G, Smolin JA (2015) Quantum learning robust against noise. Phys Rev A 92(1):97

  25. Grilo AB, Kerenidis I, Zijlstra T (2017) Learning with errors is easy with quantum samples. arXiv:1702.08255

  26. Hanneke S (2016) The optimal sample complexity of PAC learning. J Mach Learn Res 17(1):1319–1333

  27. Heinosaari T, Ziman M (2012) The mathematical language of quantum theory: From uncertainty to entanglement. Cambridge University Press, Cambridge

  28. Holevo A (1973) Statistical decision theory for quantum systems. J Multivar Anal 3(4):337–394

  29. Kanade V, Rocchetto A, Severini S (2019) Learning DNFs under product distributions via μ-biased quantum Fourier sampling. Quantum Inf Comput 19(15&16):1261–1278

  30. Laird PD (1988) Learning from good and bad data. The Kluwer International Series in Engineering and Computer Science: Knowledge Representation, Learning and Expert Systems, vol 47. Springer, Boston

  31. Montanaro A (2012) The quantum query complexity of learning multilinear polynomials. Inf Process Lett 112(11):438–442

  32. Natarajan N, Dhillon IS, Ravikumar P, Tewari A (2013) Learning with noisy labels. In: Advances in Neural Information Processing Systems, pp 1196–1204

  33. Nielsen MA, Chuang IL (2009) Quantum computation and quantum information, 10th edn. Cambridge Univ. Press, Cambridge

  34. Ristè D, da Silva MP, Ryan CA, Cross AW, Córcoles AD, Smolin JA, Gambetta JM, Chow JM, Johnson BR (2017) Demonstration of quantum advantage in machine learning. npj Quantum Inf 3(1):16

  35. Servedio RA, Gortler SJ (2004) Equivalences and separations between quantum and classical learnability. SIAM J Comput 33(5):1067–1092

  36. Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: From theory to algorithms. Cambridge University Press, Cambridge

  37. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423

  38. Valiant LG (1984) A theory of the learnable. Commun ACM 27(11):1134–1142

  39. Vapnik VN, Chervonenkis AY (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Its Appl 16(2):264–280

  40. Vershynin R (2018) High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, vol 47. Cambridge University Press, Cambridge

  41. Watrous J (2018) The theory of quantum information. Cambridge University Press, Cambridge

  42. Wilde M (2013) Quantum information theory. Cambridge University Press, Cambridge

  43. Yuen H, Kennedy R, Lax M (1975) Optimum testing of multiple hypotheses in quantum detection theory. IEEE Trans Inf Theory 21(2):125–134



Acknowledgements

Open Access funding enabled and organized by Projekt DEAL. M.C.C. wants to thank Michael M. Wolf for suggesting this problem, Gael Sentís and Otfried Gühne for the opportunity to present and discuss the ideas of this paper at the University of Siegen, Srinivasan Arunachalam for his detailed feedback on an earlier draft, and Benedikt Graswald for discussions leading to Example 1. Also, M.C.C. thanks the anonymous reviewers at QTML 2020 and at Springer Quantum Machine Intelligence for their suggestions.

Support from the TopMath Graduate Center of the TUM Graduate School at the Technische Universität München, Germany, from the TopMath Program at the Elite Network of Bavaria, and from the German Academic Scholarship Foundation (Studienstiftung des deutschen Volkes) is gratefully acknowledged.

Author information



Corresponding author

Correspondence to Matthias C. Caro.

Additional information


Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix 1: Proofs

Proof of Lemma 2

Let \(z=((x_{i},y_{i}))_{i=1}^{m}\in ({\mathscr{X}}\times \{ 0,1\})^{m}\). Using the definitions of \(\tilde{\mathcal{G}}\) and \(\tilde{\ell}\), we can rewrite

$$ \begin{array}{@{}rcl@{}} \hat{\mathscr{R}}(\tilde{\mathcal{G}}) &=& \mathbb{E}_{\sigma}[\underset{\tilde{f}\in\tilde{\mathscr{F}}}{\sup} \frac{1}{m}\sum\limits_{i=1}^{m} \sigma_{i} \tilde{\ell}(\tilde{f}(x_{i}),y_{i})]\\ &=& \mathbb{E}_{\sigma}\left[\underset{\tilde{f}\in\tilde{\mathscr{F}}}{\sup} \frac{1}{m}\sum\limits_{i=1}^{m} \sigma_{i} \frac{1}{1-\eta_{0}-\eta_{1}}\right.\\ &&\times\left( (1-\eta_{1\oplus y_{i}}) \frac{1 - (1-2\tilde{f}(x_{i}))(1-2y_{i})}{2}\right.\\ &&-\left.\left.\eta_{y_{i}}\frac{1 + (1-2\tilde{f}(x_{i}))(1-2y_{i})}{2} \right)\right]. \end{array} $$

Next, we use that \(\mathbb {E}_{\sigma } [\sigma _i]=0\) and that σi and (1 − 2yi)σi have the same distribution for all i. With this we obtain from the above

$$ \begin{array}{@{}rcl@{}} \hat{\mathscr{R}}(\tilde{\mathcal{G}}) &=& \frac{1}{1-\eta_{0}-\eta_{1}}\mathbb{E}_{\sigma}\left[\underset{\tilde{f}\in\tilde{\mathscr{F}}}{\sup} \frac{1}{m}\sum\limits_{i=1}^{m} \sigma_{i} (1-\eta_{1\oplus y_{i}} + \eta_{y_{i}})\tilde{f}(x_{i})\right]\\ &=& \frac{1}{2(1-\eta_{0}-\eta_{1})}\mathbb{E}_{\sigma_{2},\ldots,\sigma_{m}}\left[\underset{\tilde{f},\tilde{f}^{\prime}\in\tilde{\mathscr{F}}}{\sup} \frac{1}{m}\underbrace{(1-\eta_{1\oplus y_{1}} + \eta_{y_{1}}) (\tilde{f}(x_{1})-\tilde{f}^{\prime}(x_{1}))}_{\leq 2| \tilde{f}(x_{1})-\tilde{f}^{\prime}(x_{1})|}\right.\\ &&+\left.\frac{1}{m}\sum\limits_{i=2}^{m} \sigma_{i} (1-\eta_{1\oplus y_{i}} + \eta_{y_{i}})(\tilde{f}(x_{i})+\tilde{f}^{\prime}(x_{i}))\right]\\ &\leq& \frac{1}{1-\eta_{0}-\eta_{1}}\mathbb{E}_{\sigma}\left[ \underset{\tilde{f}\in\tilde{\mathscr{F}}}{\sup} \frac{2}{m}\sigma_{1}\tilde{f}(x_{1}) + \frac{1}{m}\sum\limits_{i=2}^{m} \sigma_{i} (1-\eta_{1\oplus y_{i}} + \eta_{y_{i}})\tilde{f}(x_{i})\right], \end{array} $$

where the last step used that the expression is invariant w.r.t. interchanging \(\tilde {f}\) and \(\tilde {f}^{\prime }\), so we can drop the absolute value. Now we can iterate this reasoning for i = 2,…,m and obtain

$$ \begin{array}{@{}rcl@{}} \hat{\mathscr{R}}(\tilde{\mathcal{G}}) &\leq& \frac{2}{1-\eta_{0}-\eta_{1}}\mathbb{E}_{\sigma}\left[ \sup\limits_{\tilde{f}\in\tilde{\mathscr{F}}}\frac{1}{m}\sum\limits_{i=1}^{m} \sigma_{i} \tilde{f}(x_{i})\right]\\ &=& \frac{2}{1-\eta_{0}-\eta_{1}}\hat{\mathscr{R}}(\tilde{\mathscr{F}}), \end{array} $$

the desired inequality. □

Proof of Lemma 3

As \({\mathscr{F}}\) is non-trivial, there exist concepts \(f, g\in {\mathscr{F}}\) and a point \(x\in {\mathscr{X}}\) s.t. f(x) = σ0 and g(x) = σ1. Let λ ∈ (0, 1) (to be chosen appropriately later in the proof). Define probability distributions μ± on \({\mathscr{X}}\times {\mathscr{D}}\) via

$$ \mu_{\pm}(x,f(x)) = \frac{1\pm \lambda}{2},\quad \mu_{\pm}(x,g(x))=\frac{1\mp\lambda}{2}. $$

The risk of a hypothesis \(h\in {\mathscr{D}}^{{\mathscr{X}}}\) w.r.t. these probability measures is given by

$$ \begin{array}{@{}rcl@{}} R_{\pm} (h) &=& \frac{1\pm \lambda}{4}\left\|\sigma_{0} - h(x)\right\|_{1} + \frac{1\mp\lambda}{4}\left\|\sigma_{1} - h(x)\right\|_{1}\\ &=&\begin{cases} \frac{1\pm \lambda}{4}\left\|\sigma_{0} - \sigma_{1}\right\|_{1}\quad \text{if } h(x) = \sigma_{1}\\ \frac{1\mp \lambda}{4}\left\|\sigma_{0} - \sigma_{1}\right\|_{1}\quad \text{if } h(x) = \sigma_{0} \end{cases}, \end{array} $$

in particular the optimal achievable risk is \(\frac {1- \lambda }{4}\left \|\sigma _{0} - \sigma _{1}\right \|_{1}\). Note that a hypothesis which predicts the suboptimal label state for x has an excess risk of

$$ \frac{1+ \lambda}{4}\left\|\sigma_{0} - \sigma_{1}\right\|_{1} - \frac{1- \lambda}{4}\left\|\sigma_{0} - \sigma_{1}\right\|_{1} = \frac{\lambda}{2}\left\|\sigma_{0} - \sigma_{1}\right\|_{1}. $$

So if we pick \(\lambda =\frac {\varepsilon }{2\left \|\sigma _{0} - \sigma _{1}\right \|_{1}}<1\), then in order to achieve an excess risk ≤ ε with probability ≥ 1 − δ, the learning algorithm has to be able to distinguish between the underlying distributions μ± with probability ≥ 1 − δ.

As the algorithm has access to the underlying distribution only via the training data, this means that the algorithm has to be able to distinguish the corresponding training data ensembles with probability ≥ 1 − δ. Here, we observe that the training data being drawn i.i.d. according to μ± is equivalent to the learning algorithm having access to m copies of the state

$$ \rho_{\pm} := \mu_{\pm} (x,f(x)) |x\rangle\langle x|\otimes\sigma_{0} + \mu_{\pm} (x,g(x)) |x\rangle\langle x|\otimes\sigma_{1}, $$

because this mixed state simply describes the statistical mixture. The optimal success probability for distinguishing between two quantum states is a well-studied object in quantum information theory. It can be characterized by the trace distance between the two states and is given (in our case) by (see, e.g., Nielsen and Chuang 2009)

$$ p_{\text{opt}} = \frac{1}{2}(1+\frac{1}{2}\left\|\rho_{+}^{\otimes m} - \rho_{-}^{\otimes m}\right\|_{1}). $$

As the trace distance of tensor products is not that easy to deal with, we will instead work with the fidelity defined as

$$ F(\rho,\sigma):= \text{tr}[\sqrt{\rho^{\frac{1}{2}}\sigma\rho^{\frac{1}{2}}}]. $$

According to the Fuchs-van de Graaf inequalities we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{2}\left\|\rho_{+}^{\otimes m} - \rho_{-}^{\otimes m}\right\|_{1} \leq \sqrt{1-F(\rho_{+}^{\otimes m}, \rho_{-}^{\otimes m})^{2}}\\ &=& \sqrt{1-F(\rho_{+}, \rho_{-})^{2m}}, \end{array} $$

where the last step uses multiplicativity of the fidelity under tensor products. Now we require popt ≥ 1 − δ and rearrange to obtain

$$ F(\rho_{+}, \rho_{-})^{2m} \leq 4\delta(1-\delta) $$

or equivalently after taking logarithms

$$ m\geq\frac{\log(4\delta(1-\delta))}{\log(F(\rho_{+}, \rho_{-})^{2})}. $$

By strong concavity of the fidelity, we have

$$ \begin{array}{@{}rcl@{}} F(\rho_{+},\rho_{-}) &\geq& \sqrt{\frac{1+\lambda}{2}\frac{1-\lambda}{2}}F(|x\rangle\langle x| \otimes f(x), |x\rangle\langle x| \otimes f(x)) \\ &&+ \sqrt{\frac{1-\lambda}{2}\frac{1+\lambda}{2}}F(|x\rangle\langle x| \otimes g(x), |x\rangle\langle x| \otimes g(x))\\ &=& \sqrt{1-\lambda^{2}}. \end{array} $$

This now implies

$$ m\!\geq\!\frac{\log(4\delta(1-\delta))}{\log(F(\rho_{+}, \rho_{-})^{2})} = \frac{\log\left( \frac{1}{4\delta (1-\delta)}\right)}{\log\left( \frac{1}{F(\rho_{+},\rho_{-})^{2}}\right)} \!\geq\! \frac{\log\left( \frac{1}{4\delta (1-\delta)}\right)}{\log\left( \frac{1}{1-\lambda^{2}}\right)}. $$

Thus, we obtain (after Taylor-expanding the logarithm in the denominator)

$$ m\geq {\varOmega}\left( \left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}\frac{\log\left( \frac{1}{\delta}\right)}{\varepsilon^{2}}\right), $$

as desired. □
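The quantum-information facts used in this proof, namely multiplicativity of the fidelity under tensor products, the Fuchs-van de Graaf inequality, and the strong concavity lower bound \(F(\rho_{+},\rho_{-})\geq \sqrt{1-\lambda^{2}}\), can be sanity-checked numerically. The following sketch is our own illustration with arbitrary toy states; it drops the common classical register |x⟩⟨x|, which affects none of the distances involved:

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a Hermitian PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = tr sqrt( sqrt(rho) sigma sqrt(rho) )."""
    s = sqrtm_psd(rho)
    return np.trace(sqrtm_psd(s @ sigma @ s)).real

def trace_dist(rho, sigma):
    """(1/2) || rho - sigma ||_1 for Hermitian matrices."""
    return 0.5 * np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

# toy label states f(x), g(x) (pure qubit states) and mixing parameter lambda
f = np.diag([1.0, 0.0]).astype(complex)                   # |0><0|
v = np.array([np.cos(0.3), np.sin(0.3)], dtype=complex)
g = np.outer(v, v.conj())                                 # |phi><phi|
lam = 0.4
rho_p = (1 + lam) / 2 * f + (1 - lam) / 2 * g
rho_m = (1 - lam) / 2 * f + (1 + lam) / 2 * g

F = fidelity(rho_p, rho_m)
# multiplicativity under tensor products: F(rho⊗rho, sigma⊗sigma) = F(rho, sigma)^2
assert np.isclose(fidelity(np.kron(rho_p, rho_p), np.kron(rho_m, rho_m)), F**2)
# Fuchs-van de Graaf: (1/2)||rho_p - rho_m||_1 <= sqrt(1 - F^2)
assert trace_dist(rho_p, rho_m) <= np.sqrt(1 - F**2) + 1e-12
# strong concavity bound used in the proof: F >= sqrt(1 - lam^2)
assert F >= np.sqrt(1 - lam**2) - 1e-12
```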

Proof of Lemma 4

Pick an orthonormal basis {|k〉}k= 1,…,n of \(\mathbb {C}^n\) s.t. |ψ〉 = |0〉 and \(|\phi \rangle =\cos \limits (\varphi )|0\rangle + \sin \limits (\varphi )|1\rangle \) for an angle 0 ≤ φ < 2π. Then, when restricting to the relevant subspace spanned by |0〉 and |1〉, we get

$$ \rho|_{\text{span}\lbrace |0\rangle,|1\rangle\rbrace} =\begin{pmatrix} \alpha + \beta\cos^{2}(\varphi) & \beta\cos(\varphi)\sin(\varphi)\\ \beta\cos(\varphi)\sin(\varphi) & \beta\sin^{2}(\varphi) \end{pmatrix} =:A. $$

We now easily see that

$$ \begin{array}{@{}rcl@{}} \det(A) = \alpha\beta\sin^{2}(\varphi) \overset{!}{=} \lambda_{1}\lambda_{2}\text{ and } \text{tr}[A] = \alpha + \beta \overset{!}{=} \lambda_{1}+\lambda_{2}, \end{array} $$

where λ1,λ2 are the two non-zero eigenvalues of ρ. We can solve the second of these two equations for λ2 and plug this back into the first equation to obtain

$$ {\lambda_{1}^{2}} - \lambda_{1} (\alpha+\beta) + \alpha\beta\sin^{2}(\varphi) = 0. $$

We now solve this quadratic equation and obtain the two eigenvalues

$$ \begin{array}{@{}rcl@{}} \lambda_{1/2} &=& \frac{\alpha+\beta\pm\sqrt{\alpha^{2} + \beta^{2} + 2\alpha\beta(2\cos^{2}(\varphi)-1)}}{2}\\ &=& \frac{\alpha+\beta\pm\sqrt{(\alpha-\beta)^{2} + 4\alpha\beta |\langle\psi |\phi\rangle|^{2}}}{2}, \end{array} $$

where we used that \(|\cos \limits (\varphi )| = |\langle \psi |\phi \rangle |\). □
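The eigenvalue formula of Lemma 4 is easy to verify numerically against direct diagonalization; the following check (our own illustration, with arbitrary toy parameters) uses a random rank-two mixture \(\rho = \alpha |\psi\rangle\langle\psi| + \beta |\phi\rangle\langle\phi|\):

```python
import numpy as np

rng = np.random.default_rng(7)

def random_pure(n):
    """A Haar-ish random pure state vector in C^n."""
    v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    return v / np.linalg.norm(v)

n, alpha, beta = 6, 0.35, 0.65
psi, phi = random_pure(n), random_pure(n)
rho = alpha * np.outer(psi, psi.conj()) + beta * np.outer(phi, phi.conj())

# Lemma 4 prediction for the two non-zero eigenvalues
overlap2 = abs(np.vdot(psi, phi)) ** 2
root = np.sqrt((alpha - beta) ** 2 + 4 * alpha * beta * overlap2)
predicted = np.array([(alpha + beta - root) / 2, (alpha + beta + root) / 2])

eigs = np.linalg.eigvalsh(rho)              # ascending; rho has rank <= 2
assert np.allclose(eigs[:-2], 0, atol=1e-12)   # n-2 zero eigenvalues
assert np.allclose(eigs[-2:], predicted)       # the two non-zero ones match Lemma 4
```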

Detailed proof of Theorem 3

Let \(S=(s_{1},\ldots ,s_d)\in {\mathscr{X}}^{d}\) be a set shattered by \(\tilde {{\mathscr{F}}}\). For each a ∈{0, 1}d, define the distribution μa on {1,…,d}×{0, 1} via

$$ \mu_{a}(i,b) := \frac{1}{2d}\left( 1 + (-1)^{a_{i} + b} \frac{8\varepsilon}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}\right). $$

Note that \(\forall a\in \lbrace 0,1\rbrace ^d\ \exists f_a\in \tilde {{\mathscr{F}}}: f_a(s_i)=a_i\) by shattering and that for each a ∈{0, 1}d, fa is a minimum error concept w.r.t. μa and a concept \(f_{\tilde {a}}\) has additional error

$$ \begin{array}{@{}rcl@{}} d_{H}(a,\tilde{a})\frac{8\varepsilon}{d\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}\cdot\frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2} = d_{H}(a,\tilde{a})\frac{4\varepsilon}{d} \end{array} $$

compared to fa. Hence, in order to solve the learning problem with confidence 1 − δ and accuracy ε the algorithm \({\mathscr{A}}\) has to output, with probability ≥ 1 − δ, a hypothesis (generated from the training data arising from the underlying string) that when evaluated on S yields a vector that is \(\frac {d}{4}\)-close to the underlying string in Hamming distance.

Let A be a random variable distributed uniformly on {0, 1}d (corresponding to the unknown underlying string a). Let B = B1Bm be the training data with each example generated independently from μa described by the quantum ensemble

$$ \mathscr{E}_{a} = \lbrace \mu_{a} (i,b), |s_{i}\rangle\langle s_{i}|\otimes \sigma_{b}\rbrace_{i=1,\ldots,d,\ b=0,1}, $$

or, equivalently, by the quantum state

$$ \rho_{a} = \sum\limits_{i=1}^{d} |s_{i}\rangle\langle s_{i}|\otimes\left( \mu_{a}(i,0)\sigma_{0} + \mu_{a}(i,1)\sigma_{1}\right). $$

In particular, the composite system of underlying string and corresponding training data is described by the quantum state

$$ \sigma_{AB} = \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} |a\rangle\langle a|\otimes \rho_{a}^{\otimes m}. $$

We follow the information-theoretic proof strategy from Arunachalam and de Wolf (2018), i.e., we first show a lower bound on the mutual information I(A : B) which arises from the learning requirement, then observe that I(A : B) ≤ mI(A : B1) and finally upper bound the mutual information I(A : B1).

First for the mutual information lower bound. Let h(B) ∈{0, 1}d denote the label vector assigned to S by the hypothesis produced by the learner upon input of training data B. Let Z be the indicator random variable for the event that the learner achieves accuracy ε. If Z = 1, then by the above deliberations we conclude \(d_H(A,h(B))\leq \frac {d}{4}\) and thus, given h(B), A ranges over a set of size \(\sum \limits _{i=0}^{\frac {d}{4}} \binom {d}{i}\leq 2^{H\left (\frac {1}{4}\right )d}\). Thus, we get (using data processing and the definition of conditional entropy)

$$ \begin{array}{@{}rcl@{}} I(A:B) &\geq& I(A:h(B)) = H(A)-H(A|h(B))\\ &\geq& H(A)-H(A|h(B),Z) - H(Z) \end{array} $$
$$ \begin{array}{@{}rcl@{}} \phantom{I(A:B)}&=& H(A) - \underbrace{\mathbb{P}[Z=1]}_{\leq 1} \underbrace{H(A|h(B),Z=1)}_{\leq H\left( \frac{1}{4}\right)d} \\ &&- \underbrace{\mathbb{P}[Z=0]}_{\leq\delta}\underbrace{H(A|h(B),Z=0)}_{\leq d} -\underbrace{H(Z)}_{\leq H(\delta)} \\ &\geq& d - H\left( \frac{1}{4}\right)d - \delta d - H(\delta)\\ &=& \left( 1 - H\left( \frac{1}{4}\right) - \delta\right) d - H(\delta), \end{array} $$

in particular I(A : B) ≥Ω(d). (Here we use our assumption on δ.)
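The counting estimate used above, \(\sum_{i\leq d/4}\binom{d}{i}\leq 2^{H(1/4)d}\) with H the binary entropy, is the standard entropy bound on Hamming-ball volumes; it is straightforward to confirm numerically (our own illustration):

```python
import math

def binary_entropy(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for d in range(4, 201, 4):                    # multiples of 4, so d/4 is an integer
    # size of the Hamming ball of radius d/4 around a fixed string
    ball = sum(math.comb(d, i) for i in range(d // 4 + 1))
    assert ball <= 2 ** (binary_entropy(0.25) * d)
```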

Now we show I(A : B) ≤ mI(A : B1). We reproduce the reasoning provided in Arunachalam and de Wolf (2018) for completeness:

$$ \begin{array}{@{}rcl@{}} I(A:B) &=& S(B) - S(B|A)\\ &=& S(B) - \sum\limits_{i=1}^{m} S(B_{i}|A)\\ &\leq& \sum\limits_{i=1}^{m} \left( S(B_{i}) - S(B_{i}|A)\right)\\ &=& \sum\limits_{i=1}^{m} I(A:B_{i}) = m\cdot I(A:B_{1}). \end{array} $$

Here, the first step is by definition, the second uses the product structure of the subsystem B, the third follows from subadditivity of the entropy, and the last is again by definition together with the fact that the \(B_{i}\) are identically distributed.

And finally, we prove an upper bound on I(A : B1). To this end, we have to study the reduced state

$$ \sigma_{AB_{1}} = \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} |a\rangle\langle a|\otimes \rho_{a}. $$

More precisely, we have

$$ I(A:B_{1}) = S(A) + S(B_{1}) - S(AB_{1}), $$

and thus have to study the entropies of \(\sigma _{AB_{1}}\) as well as those of the reduced states σA and \(\sigma _{B_{1}}\). As \(A\sim \text {Uniform}\left (\lbrace 0,1\rbrace ^d\right )\), we have S(A) = d. Now we consider the reduced state

$$ \begin{array}{@{}rcl@{}} \sigma_{B_{1}} &=& \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} \rho_{a}\\ &=& \sum\limits_{i=1}^{d} |s_{i}\rangle\langle s_{i}|\otimes\left( \left( \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} \mu_{a}(i,0)\right)|\psi_{0}\rangle\langle\psi_{0}| \right.\\ &&+\left. \left( \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum}\mu_{a}(i,1)\right)|\psi_{1}\rangle\langle\psi_{1}|\right). \end{array} $$

Here, we have

$$ \begin{array}{@{}rcl@{}} \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} \mu_{a}(i,0) = \frac{1}{2d} = \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} \mu_{a}(i,1). \end{array} $$

By Lemma 4 we know that \(\frac {1}{2d}|\psi _{0}\rangle \langle \psi _{0}| + \frac {1}{2d}|\psi _{1}\rangle \langle \psi _{1}|\) has non-zero eigenvalues \(\mu _{1/2} = \frac {1}{2d}(1 \pm |\langle \psi _{0}|\psi _{1}\rangle |)\) and due to the block-diagonal structure of \(\sigma _{B_{1}}\) we conclude that the non-zero eigenvalues of \(\sigma _{B_{1}}\) are also μ1/2, each of multiplicity d. In particular, we have

$$ \begin{array}{@{}rcl@{}} S(\sigma_{B_{1}}) &=& d\cdot(-\mu_{1}\log(\mu_{1}) - \mu_{2}\log(\mu_{2}))\\ &=& \log(2d) - \frac{1}{2}\left( \log(1-|\langle\psi_{0}|\psi_{1}\rangle|^{2})\right.\\ &&+ \left.|\langle\psi_{0}|\psi_{1}\rangle|\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right)\right). \end{array} $$
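This closed form for \(S(\sigma_{B_{1}})\) can be double-checked numerically from the eigenvalues \(\mu_{1/2}\) and their multiplicities (our own sketch, with arbitrary toy values for d and the overlap c = |⟨ψ0|ψ1⟩|, entropies in bits):

```python
import numpy as np

d, c = 4, 0.3                                            # toy: d points, overlap c
mu = np.array([(1 + c) / (2 * d), (1 - c) / (2 * d)])    # eigenvalues from Lemma 4
S = -d * np.sum(mu * np.log2(mu))                        # each has multiplicity d
closed_form = np.log2(2 * d) - 0.5 * (np.log2(1 - c**2)
                                      + c * np.log2((1 + c) / (1 - c)))
assert np.isclose(S, closed_form)
```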

Similarly, we see that the non-zero eigenvalues of \(\sigma _{AB_{1}}\) are

$$ \frac{1}{2^{d}}\lambda_{1/2} = \frac{1}{2^{d}}\cdot\frac{1}{2d}\left( 1 \pm |\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}\right), $$

each of multiplicity d ⋅ 2d and that therefore

$$ \begin{array}{@{}rcl@{}} S(\sigma_{AB_{1}}) &=& d + \log(2d) - \frac{1}{2}\left( \vphantom{\frac{1+|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}{1-|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}}\log\left( 1-|\langle\psi_{0}|\psi_{1}\rangle|^{2} \left( 1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}\right)\right)\right. \\ &&+\left.|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}{1-|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}\right)\right). \end{array} $$

If we combine these expressions for the different entropies, we obtain

$$ \begin{array}{@{}rcl@{}} I(A:B_{1}) &=& S(A) + S(B_{1}) - S(AB_{1})\\ &=& \frac{1}{2}\left( \log\left( 1-|\langle\psi_{0}|\psi_{1}\rangle|^{2} - \frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}} (1-|\langle\psi_{0}|\psi_{1}\rangle|^{2})\right) - \log\left( 1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}\right)\right)\\ &&+\frac{|\langle\psi_{0}|\psi_{1}\rangle|}{2}\left( \sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}{1-|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}\right)\right. \\ && - \log\left.\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right)\vphantom{\frac{1+|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}{1-|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}}\right). \end{array} $$

We now use Taylor’s theorem to understand the scaling of the different terms with ε. First, we have (by Taylor-expanding \(\log (1-|\langle \psi _{0}|\psi _{1}\rangle |^{2} - x)\) around x = 0)

$$ \begin{array}{@{}rcl@{}} &&\log \left( 1-|\langle\psi_{0}|\psi_{1}\rangle|^{2} - \frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}} (1-|\langle\psi_{0}|\psi_{1}\rangle|^{2})\right) \\&&- \log\left( 1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}\right)\\ &=& -\frac{1}{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}\cdot \frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}} (1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}) + \mathscr{O}(\varepsilon^{4})\\ &=& - \frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}} + \mathscr{O}(\varepsilon^{4}). \end{array} $$

Moreover, using the Taylor expansions

$$ \begin{array}{@{}rcl@{}} \log\left( \frac{1+a\sqrt{1+x}}{1-a\sqrt{1+x}}\right) = \log\left( \frac{1+a}{1-a}\right) + \frac{ax}{1-a^{2}} + \mathscr{O}(x^{2}) \end{array} $$

around x = 0 (with a > 0) and

$$ \begin{array}{@{}rcl@{}} &&\sqrt{1+\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}} \\&=& 1 + \frac{1}{2}\cdot\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}} + \mathscr{O}(\varepsilon^{4}) \end{array} $$

we now obtain

$$ \begin{array}{@{}rcl@{}} &&\sqrt{1+\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}{1-|\langle\psi_{0}|\psi_{1}\rangle|\sqrt{1+\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}}}}\right) \\ &&- \log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right)\\ &= &\left( 1 + \frac{1}{2}\cdot\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}} + \mathscr{O}(\varepsilon^{4})\right)\\ &&\cdot\left( \log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right) + \frac{|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}\cdot \frac{64\varepsilon^{2}}{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}^{2}}\cdot\frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{|\langle\psi_{0}|\psi_{1}\rangle|^{2}} + \mathscr{O}(\varepsilon^{4})\right)\\ &&- \log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right)\\ &= &\frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\left( \frac{1}{|\langle\psi_{0}|\psi_{1}\rangle|} + \frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{2|\langle\psi_{0}|\psi_{1}\rangle|^{2}}\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right)\right) + \mathscr{O}(\varepsilon^{4}). \end{array} $$

Plugging these approximations back in gives us

$$ \begin{array}{@{}rcl@{}} I(A:B_{1}) &=& \frac{64\varepsilon^{2}}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}\cdot \frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{4|\langle\psi_{0}|\psi_{1}\rangle|}\\&&\times\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right) + \mathscr{O}(\varepsilon^{4}) = \mathscr{O}(\varepsilon^{2}). \end{array} $$
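The leading-order expression for I(A : B1) can be checked numerically against an exact entropy computation from the spectra derived above (our own sketch, with arbitrary toy values for d, the overlap c, the trace norm T = ‖σ0 − σ1‖1, and a small ε; entropies in bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

d, c, T, eps = 3, 0.6, 1.2, 1e-3     # toy: VC-dim d, overlap c, T = ||sigma0 - sigma1||_1
t = (64 * eps**2 / T**2) * (1 - c**2) / c**2

# non-zero spectra as computed above (multiplicities d and d*2^d respectively;
# the uniform register A contributes exactly d bits, which cancel in I)
spec_B1 = np.repeat([(1 + c) / (2 * d), (1 - c) / (2 * d)], d)
u = c * np.sqrt(1 + t)
spec_AB1_block = np.repeat([(1 + u) / (2 * d), (1 - u) / (2 * d)], d)

I = entropy(spec_B1) - entropy(spec_AB1_block)   # = S(A) + S(B1) - S(AB1)
leading = (64 * eps**2 / T**2) * (1 - c**2) / (4 * c) * np.log2((1 + c) / (1 - c))
assert np.isclose(I, leading, rtol=1e-2)         # agree up to the O(eps^4) remainder
```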

Now combining our mutual information lower and upper bounds yields

$$ \begin{array}{@{}rcl@{}} {\varOmega}(d) \leq I(A:B)\leq m \cdot I(A:B_{1}) \leq m \cdot\mathscr{O}(\varepsilon^{2}), \end{array} $$

which after rearranging becomes

$$ \begin{array}{@{}rcl@{}} m\geq {\varOmega}\left( \frac{d}{\varepsilon^{2}}\right), \end{array} $$

as desired. □

Detailed proof of Lemma 5

As \({\mathscr{F}}\) is non-trivial, there exist \(f_{1},f_{2}\in {\mathscr{F}}\) and \(x_{1},x_{2}\in {\mathscr{X}}\) s.t. f1(x1) = f2(x1) = σ0 and f1(x2) = σ0 ≠ σ1 = f2(x2). Now consider the distribution μ on \({\mathscr{X}}\) defined by

$$ \begin{array}{@{}rcl@{}} \mu(x_{1})=1-\lambda,\quad \mu(x_{2})=\lambda, \end{array} $$

where λ ∈ (0, 1) is to be chosen later in the proof.

The risk of a hypothesis \(h\in {\mathscr{D}}^{{\mathscr{X}}}\) w.r.t. μ if the target concept is fi is given by

$$ R_{\mu,f_{i}}(h) = \frac{1-\lambda}{2}\left\| h(x_{1})-f_{i}(x_{1})\right\|_{1} + \frac{\lambda}{2}\left\| h(x_{2})-f_{i}(x_{2})\right\|_{1}, $$

so in particular we have

$$ \begin{array}{@{}rcl@{}} R_{\mu,f_{i}}(f_{j}) = \begin{cases} 0 &\text{if } i=j \\ \frac{\lambda}{2}\left\|\sigma_{0}-\sigma_{1}\right\|_{1} &\text{if } i\neq j \end{cases}. \end{array} $$

So if we choose \(\lambda =\frac {2\varepsilon }{\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}<1\), then the learning requirement for \({\mathscr{A}}\) implies that with probability ≥ 1 − δ, \({\mathscr{A}}\) correctly identifies whether the target concept is f1 or f2. As the algorithm has access to the underlying distribution only via the training data, this means that the algorithm has to be able to distinguish the corresponding training data ensembles with probability ≥ 1 − δ. Here, we observe that the training data being drawn i.i.d. according to μ with target concept fi is equivalent to the learning algorithm having access to m copies of the state

$$ \begin{array}{@{}rcl@{}} \rho_{i} = (1-\lambda)|x_{1}\rangle\langle x_{1}|\otimes\sigma_{0} + \lambda |x_{2}\rangle\langle x_{2}|\otimes f_{i}(x_{2}),\quad i=1,2. \end{array} $$

The optimal success probability for distinguishing between two quantum states is a well-studied object in quantum information theory. It can be characterized by the trace distance between the two states and is given (in our case) by (see, e.g., Nielsen and Chuang 2009)

$$ \begin{array}{@{}rcl@{}} p_{\text{opt}} = \frac{1}{2}(1+\frac{1}{2}\left\|\rho_{1}^{\otimes m} - \rho_{2}^{\otimes m}\right\|_{1}). \end{array} $$

As the trace distance of tensor products is not that easy to deal with, we will instead work with the fidelity defined as \(F(\rho ,\sigma ):= \text {tr}[\sqrt {\rho ^{\frac {1}{2}}\sigma \rho ^{\frac {1}{2}}}].\) According to the Fuchs-van de Graaf inequalities (see, e.g., Nielsen and Chuang 2009, Section 9.2.3) we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{2}\left\|\rho_{1}^{\otimes m} - \rho_{2}^{\otimes m}\right\|_{1} \leq \sqrt{1-F(\rho_{1}^{\otimes m}, \rho_{2}^{\otimes m})^{2}}\\ &=& \sqrt{1-F(\rho_{1}, \rho_{2})^{2m}}, \end{array} $$

where the last step uses multiplicativity of the fidelity under tensor products. Now we require popt ≥ 1 − δ and rearrange to obtain

$$ \begin{array}{@{}rcl@{}} F(\rho_{1}, \rho_{2})^{2m} \leq 4\delta(1-\delta) \end{array} $$

or equivalently after taking logarithms

$$ \begin{array}{@{}rcl@{}} m\geq\frac{\log(4\delta(1-\delta))}{\log(F(\rho_{1}, \rho_{2})^{2})}. \end{array} $$

Now we use again the Fuchs-van de Graaf inequalities which tell us (after rearranging)

$$ 1- \frac{1}{2}\left\|\rho_{1} - \rho_{2}\right\|_{1} \leq F(\rho_{1}, \rho_{2})\leq \sqrt{1-\frac{1}{4}\left\|\rho_{1} - \rho_{2}\right\|_{1}^{2}} $$

to obtain that

$$ \begin{array}{@{}rcl@{}} m&\geq&\frac{\log(4\delta(1-\delta))}{\log(F(\rho_{1}, \rho_{2})^{2})} = \frac{\log\left( \frac{1}{4\delta (1-\delta)}\right)}{\log\left( \frac{1}{F(\rho_{1},\rho_{2})^{2}}\right)} \\&\geq& \frac{\log\left( \frac{1}{4\delta (1-\delta)}\right)}{\log\left( \frac{1}{(1- \frac{1}{2}\left\|\rho_{1} - \rho_{2}\right\|_{1})^{2}} \right)} \geq \frac{\log(4\delta(1-\delta))}{2\log(1- \frac{1}{2}\left\|\rho_{1} - \rho_{2}\right\|_{1})}. \end{array} $$

It is easy to see that \(\left \|\rho _{1}-\rho _{2}\right \|_{1} = \lambda \left \|\sigma _{0} - \sigma _{1}\right \|_{1} = 2\varepsilon .\) Now Taylor expansion of the logarithm gives

$$ m\geq {\varOmega}\left( \frac{\log\left( \frac{1}{\delta}\right)}{\varepsilon}\right), $$

as desired. □
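The trace-norm identity \(\left\|\rho_{1}-\rho_{2}\right\|_{1} = \lambda \left\|\sigma_{0}-\sigma_{1}\right\|_{1}\) used at the end of this proof follows because the two states differ only on the block belonging to x2; a quick numerical confirmation (our own illustration with arbitrary toy states):

```python
import numpy as np

def trace_norm(A):
    """|| A ||_1 for a Hermitian matrix."""
    return np.abs(np.linalg.eigvalsh(A)).sum()

lam = 0.25
x1, x2 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])   # |x1><x1|, |x2><x2|
sigma0 = np.diag([1.0, 0.0])
v = np.array([np.cos(0.7), np.sin(0.7)])
sigma1 = np.outer(v, v)

rho1 = (1 - lam) * np.kron(x1, sigma0) + lam * np.kron(x2, sigma0)
rho2 = (1 - lam) * np.kron(x1, sigma0) + lam * np.kron(x2, sigma1)
# the x1-blocks cancel, so the difference is lam * |x2><x2| ⊗ (sigma0 - sigma1)
assert np.isclose(trace_norm(rho1 - rho2), lam * trace_norm(sigma0 - sigma1))
```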

Detailed proof of Theorem 4

Let \(S=(s_{0},\ldots ,s_d)\in {\mathscr{X}}^{d+1}\) be a set shattered by \(\tilde {{\mathscr{F}}}\) and define

$$ \begin{array}{@{}rcl@{}} \mu(s_{0}) = 1-\lambda,\quad \mu(s_{i})=\frac{\lambda}{d}\quad \forall 1\leq i\leq d, \end{array} $$

with λ ∈ (0, 1) to be chosen later. By shattering, \(\forall a\in \lbrace 0,1\rbrace ^d\ \exists f_a\in \tilde {{\mathscr{F}}}\) s.t.

$$ f_{a}(s_{0}) = 0\quad\text{and}\quad f_{a}(s_{i})=a_{i}\quad \forall 1\leq i\leq d. $$

Observe that w.r.t. a distribution μ and target concept fa, another concept fb has error

$$ d_{H}(a,b)\cdot \frac{\lambda}{d}\cdot\frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2}. $$

So if we pick \(\lambda =\frac {8\varepsilon }{\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}\), then by the learning requirement, with probability ≥ 1 − δ, \({\mathscr{A}}\) has to output a hypothesis h that when evaluated on S yields a label vector that is \(\frac {d}{4}\)-close to the true underlying string in Hamming distance.

Denote by \(A\sim \text {Uniform}\left (\lbrace 0,1\rbrace ^d\right )\) a random variable describing the unknown underlying string, let B = B1Bm be the corresponding quantum training data system. We want to repeat the three-step reasoning from the proof of Theorem 3. The first two steps work exactly as before. Step 3 will be slightly different. Again we have

$$ \begin{array}{@{}rcl@{}} I(A:B_{1})=S(A)+S(B_{1})-S(AB_{1}),\quad\text{and}\quad S(A)=d. \end{array} $$

In this case, the relevant composite state is

$$ \begin{array}{@{}rcl@{}} \sigma_{AB_{1}} = \frac{1}{2^{d}}\underset{a\in\lbrace 0,1\rbrace^{d}}{\sum} |a\rangle\langle a|\otimes \rho_{a}, \end{array} $$

where \(\rho _a = \sum \limits _{j=0}^d \mu (s_j) |s_j\rangle \langle s_j|\otimes \sigma _{f_a(s_j)} = (1-\lambda )|s_{0}\rangle \langle s_{0}|\otimes \sigma _{0} + \frac {\lambda }{d}\sum \limits _{j=1}^d |s_j\rangle \langle s_j|\otimes \sigma _{a_j}\).

We now again use Lemma 4 to compute eigenvalues and thus entropies. (Here our assumption that σ0 and σ1 are pure enters the proof.) We obtain

  • Each ρa has non-zero eigenvalues 1 − λ of multiplicity 1 and \(\frac {\lambda }{d}\) of multiplicity d.

  • \(\sigma _{B_{1}}=\frac {1}{2^d}\underset {a\in \lbrace 0,1\rbrace ^d}{\sum }\left ((1-\lambda )|s_{0}\rangle \langle s_{0}|\otimes \sigma _{0} + \frac {\lambda }{d}\sum \limits _{j=1}^d |s_j\rangle \langle s_j|\otimes \sigma _{a_j}\right ) = (1-\lambda )|s_{0}\rangle \langle s_{0}|\otimes \sigma _{0} + \frac {\lambda }{d}\sum \limits _{j=1}^d |s_j\rangle \langle s_j|\otimes \left (\frac {1}{2}\sigma _{0}+\frac {1}{2}\sigma _{1}\right )\) has non-zero eigenvalues 1 − λ of multiplicity 1 and \(\frac {\lambda }{d}\lambda _{1/2}\) of multiplicity d, where \(\lambda _{1/2} = \frac {1\pm |\langle \psi _{0} |\psi _{1}\rangle |}{2}\).

  • \(\sigma _{AB_{1}}\) has non-zero eigenvalues \(\frac {1}{2^d}(1-\lambda )\) of multiplicity 2d and \(\frac {\lambda }{d\cdot 2^d}\) of multiplicity d ⋅ 2d.

With this we can now compute the relevant entropies and obtain

$$ \begin{array}{@{}rcl@{}} S(B_{1}) &=& S(\sigma_{B_{1}})\\ &=& -(1-\lambda) \log(1-\lambda) + d\left( -\frac{\lambda}{d}\lambda_{1}\log\left( \frac{\lambda}{d}\lambda_{1}\right)\right.\\ &&- \left.\frac{\lambda}{d}\lambda_{2}\log\left( \frac{\lambda}{d}\lambda_{2}\right)\right)\\ &=& -(1-\lambda)\log(1-\lambda) - \lambda \left( \lambda_{1} \log\left( \frac{\lambda}{d}\lambda_{1}\right)\right. \\ &&\left.+ \lambda_{2}\log\left( \frac{\lambda}{d}\lambda_{2}\right)\right), \end{array} $$

as well as

$$ \begin{array}{@{}rcl@{}} S(AB_{1}) &=& S(\sigma_{AB_{1}})\\ &=& 2^{d} \left( -\frac{1}{2^{d}}(1-\lambda)\log\left( \frac{1}{2^{d}}(1-\lambda)\right) \right.\\&&-\left. d\cdot\frac{\lambda}{d\cdot 2^{d}}\log\left( \frac{\lambda}{d\cdot 2^{d}}\right)\right)\\ &=& -(1-\lambda)\log\left( \frac{1-\lambda}{2^{d}}\right) - \lambda\log\left( \frac{\lambda}{d\cdot 2^{d}}\right). \end{array} $$

Hence, we now have

$$ \begin{array}{@{}rcl@{}} I(A:B_{1})&=&S(A)+S(B_{1})-S(AB_{1})\\ &=& -\frac{\lambda}{2}\underbrace{\left( \log\left( \frac{1-|\langle\psi_{0}|\psi_{1}\rangle|^{2}}{4}\right) + |\langle\psi_{0}|\psi_{1}\rangle|\log\left( \frac{1+|\langle\psi_{0}|\psi_{1}\rangle|}{1-|\langle\psi_{0}|\psi_{1}\rangle|}\right)\right)}_{\leq 0 \text{ because } |\langle\psi_{0}|\psi_{1}\rangle|\in [0,1]}\\ &=&\mathscr{O}(\varepsilon). \end{array} $$

Now we can finish the proof by combining steps 1, 2 and 3 as before. □
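The closed-form expression for I(A : B1) obtained in step 3 is exact (no Taylor remainder) and can be verified directly from the eigenvalue lists above (our own sketch with arbitrary toy values; entropies in bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

d, lam, c = 3, 0.3, 0.4            # toy: VC-dim d, total weight lam on s_1..s_d, overlap c
l1, l2 = (1 + c) / 2, (1 - c) / 2

S_A = d                            # A uniform on d bits
S_B1 = entropy([1 - lam] + d * [lam / d * l1] + d * [lam / d * l2])
S_AB1 = entropy(2**d * [(1 - lam) / 2**d] + d * 2**d * [lam / (d * 2**d)])

I = S_A + S_B1 - S_AB1
closed = -(lam / 2) * (np.log2((1 - c**2) / 4) + c * np.log2((1 + c) / (1 - c)))
assert np.isclose(I, closed)
```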

Appendix 2: A physical motivation for our notion of risk

In our definition of the risk Rμ we use the trace distance. As the latter is a well-established measure of distinguishability of quantum states, it presents itself as a natural candidate loss function. Here, we give a more explicit operational reasoning as to why we choose to use the trace distance.

Imagine the learning task as a competition between two parties, a learner and a teacher. We assume that both parties obey the laws of quantum physics. The teacher knows (a classical description of) the probability distribution \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\) and will provide corresponding training data to the learner during a training phase. The learner’s goal is to persuade the teacher in a test phase that she has managed to learn the distribution μ, which was unknown to her in advance, i.e., that she has produced a good hypothesis \(h:{\mathscr{X}}\to {\mathscr{D}}\).

We first give an informal description of the test phase: The teacher prepares another (independent) example (x,ρ) drawn from μ. She then sends x to the learner. The latter applies her hypothesis h to prepare the quantum state h(x), which she then sends back to the teacher. The teacher now uses this one copy of h(x) and her knowledge of μ to evaluate whether the learner made a good prediction. As the teacher, too, is restricted by quantum theory, she can only do so by performing a measurement.

We now discuss the choice of measurement of the teacher in more detail. On the one hand, the teacher wants to maximize the probability of detecting a wrong prediction. On the other hand, she does not want to be unfair, so at the same time she tries to maximize the probability of detecting a correct prediction. In summary, the teacher wants to choose a 2-outcome measurement {Eaccept,Ereject} that maximizes

$$ \text{tr}[E_{accept}\sigma_{i}] + \text{tr}[E_{reject}\sigma_{j}], $$

where σi = ρ and \(\sigma _j\in {\mathscr{D}}\setminus \{\rho \}\). As she knows (a classical description of) the state \(\rho \in {\mathscr{D}}\) and that \(h(x)\in {\mathscr{D}}\), she can achieve this by picking {Eaccept,Ereject} to be the optimal measurement for minimum error discrimination of \({\mathscr{D}}\), where the states are taken with equal prior probabilities (see Watrous 2018, Theorem 3.4). The measurement is essentially the same independently of whether ρ = σ0 or ρ = σ1; only the outcome labels are interchanged.

Now the expected probability of the teacher rejecting the learner’s prediction is

$$ \int\limits_{\mathscr{X}\times\mathscr{D}} \text{tr}[E_{reject}(\rho) h(x)]~\mathrm{d}\mu(x,\rho). $$

The optimal measurement achieves an average success probability of

$$ \frac{1}{2}\left(\text{tr}[E_{accept}\sigma_{i}] + \text{tr}[E_{reject}\sigma_{j}]\right)=\frac{1}{2}\left( 1 + \frac{1}{2}\left\|\sigma_{0} - \sigma_{1}\right\|_{1}\right). $$

It is easy to see that under the additional assumption that σ0 and σ1 have the same purity, i.e., \(\text {tr}[\sigma _{0}^{2}]=\text {tr}[\sigma _{1}^{2}]\), the rejection probabilities are symmetric, namely

$$ \text{tr}[E_{accept}\sigma_{j}] = \text{tr}[E_{reject}\sigma_{i}] = \frac{1}{2}\left( 1 - \frac{1}{2}\left\|\sigma_{0} - \sigma_{1}\right\|_{1}\right) $$

and similarly

$$ \text{tr}[E_{accept}\sigma_{i}] = \text{tr}[E_{reject}\sigma_{j}] = \frac{1}{2}\left( 1 + \frac{1}{2}\left\|\sigma_{0} - \sigma_{1}\right\|_{1}\right). $$

Comparing the achieved expected rejection probability with the optimal one, we now obtain

$$ \begin{array}{@{}rcl@{}} &&\int\limits_{\mathscr{X}\times\mathscr{D}} \text{tr}[E_{reject}(\rho) h(x)]~\mathrm{d}\mu(x,\rho) \\&&- \underset{g:\mathscr{X}\to\mathscr{D}}{\inf}\int\limits_{\mathscr{X}\times\mathscr{D}} \text{tr}[E_{reject}(\rho) g(x)]~\mathrm{d}\mu(x,\rho)\\ &=&\int\limits_{\mathscr{X}\times\mathscr{D}} \frac{\left\|\rho-h(x)\right\|_{1}}{4}~\mathrm{d}\mu(x,\rho) = \frac{1}{2} R_{\mu}(h). \end{array} $$

So we have recovered our notion of risk, at least in the case of states of equal purity, from a more basic analysis of the test phase.
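The symmetric acceptance and rejection probabilities used above can be confirmed numerically for two qubit states of equal purity. The Bloch-vector parametrization and the particular radius and angle below are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

def bloch_state(r, theta):
    """Qubit density matrix with Bloch vector of length r in the x-z plane."""
    sx = np.array([[0.0, 1.0], [1.0, 0.0]])
    sz = np.array([[1.0, 0.0], [0.0, -1.0]])
    return 0.5 * (np.eye(2) + r * (np.sin(theta) * sx + np.cos(theta) * sz))

# Two states of equal purity (same Bloch radius) -- the assumption of the symmetric case
sigma0, sigma1 = bloch_state(0.8, 0.0), bloch_state(0.8, 1.1)

# Minimum-error (Helstrom) measurement: the accept effect is the projector onto the
# nonnegative eigenspace of sigma0 - sigma1 (here "accept" decides for sigma0)
w, v = np.linalg.eigh(sigma0 - sigma1)
E_accept = sum(np.outer(v[:, k], v[:, k]) for k in range(2) if w[k] >= 0)
E_reject = np.eye(2) - E_accept

td_half = 0.5 * np.abs(w).sum()          # (1/2) * ||sigma0 - sigma1||_1

p_accept_right = np.trace(E_accept @ sigma0).real   # tr[E_accept sigma_i] with rho = sigma0
p_reject_wrong = np.trace(E_reject @ sigma1).real   # tr[E_reject sigma_j]
assert np.isclose(p_accept_right, 0.5 * (1 + td_half))
assert np.isclose(p_reject_wrong, 0.5 * (1 + td_half))
```

Varying the two angles while keeping the radius fixed leaves the symmetry intact, in line with the equal-purity discussion above.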

Note that a similar analysis could be performed also in the case of more than two quantum labels. There, the teacher’s measurement would be the optimal measurement for minimum error discrimination of ρ and \(\frac {1}{| {\mathscr{D}}| - 1}\) \({\sum }_{\sigma \in {\mathscr{D}}\setminus \{\rho \}} \sigma \). Unfortunately, no closed-form expressions for the corresponding success probabilities are known. We do, however, see that in this scenario, using the trace distance as loss function would be too pessimistic from the perspective of the learner. As the teacher does not know the prediction state prepared by the learner, the teacher has to solve a state discrimination problem taking into account all possible label states.

Appendix 3: The Holevo-Helstrom strategy

The naive learning strategy based on the Holevo-Helstrom measurement is the following:
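The pseudocode box for this strategy did not survive the conversion of this page. Based on the surrounding discussion (a Holevo-Helstrom POVM {E0,E1}, an intermediate classical 0-1-labeled sample drawn according to ν, a classical learner producing g, and the output hypothesis h(x) = σg(x)), a hedged sketch might look as follows; `classical_learner` is a placeholder for the as-yet-unspecified classical algorithm of step 3:

```python
import numpy as np

def helstrom_povm(sigma0, sigma1):
    """Two-outcome POVM {E0, E1}; E1 projects onto the positive part of sigma1 - sigma0."""
    w, v = np.linalg.eigh(sigma1 - sigma0)
    E1 = sum(np.outer(v[:, k], v[:, k]) for k in range(len(w)) if w[k] > 0)
    return np.eye(len(w)) - E1, E1

def holevo_helstrom_strategy(data, sigma0, sigma1, classical_learner, rng):
    """data: list of (x, rho) pairs with rho in {sigma0, sigma1}.

    Step 1: measure each quantum label with {E0, E1}, yielding a classical bit.
    Step 2: run the classical learner on the induced 0/1-labeled sample.
    Step 3: return the hypothesis h(x) = sigma_{g(x)}.
    """
    E0, E1 = helstrom_povm(sigma0, sigma1)
    classical_sample = [(x, int(rng.random() < np.trace(E1 @ rho).real))
                        for x, rho in data]
    g = classical_learner(classical_sample)
    return lambda x: sigma1 if g(x) == 1 else sigma0
```

For orthogonal pure label states the measurement outcomes are deterministic and the strategy reduces exactly to classical binary classification.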


The remainder of this section is devoted to studying the performance of this simple learning procedure. Note that for now we leave open which classical learning algorithm is to be used; we first work towards characterizing the true risk Rμ(h) in terms of the intermediate classical risk \(\tilde {R}_{\nu } (g)\).

In the following we will often make use of the fact that, when identifying \(i\leftrightarrow \sigma _i\), the probability measure μ on \({\mathscr{X}}\times {\mathscr{D}}\) gives rise to a probability measure on \({\mathscr{X}}\times \lbrace 0,1\rbrace \). We will abuse notation and also denote the latter measure by μ; which measure is meant will always be clear from the context.

Recall that \(R_{\mu } (h) = \frac {\left \| \sigma _{0} - \sigma _{1} \right \|_{1}}{2} \mathbb {P}_{(x,\rho )\sim \mu }[h(x)\neq \rho ].\) We now derive a similar expression for \(\tilde {R}_{\nu } (g)\).

Lemma C.1

With the notation as in the Holevo-Helstrom strategy (in particular h(x) = σg(x)) it holds that

$$ \begin{array}{@{}rcl@{}} \tilde{R}_{\nu}(g) &=& \frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2}\mathbb{P}_{(x,\rho)\sim\mu}[h(x)\neq\rho] + \text{tr}[\sigma_{0} E_{1}] \\&&+ (\text{tr}[\sigma_{1} E_{0}]-\text{tr}[\sigma_{0} E_{1}])\mathbb{E}_{\mu_{1}}[g]. \end{array} $$


This can be shown by direct computation using the definition of ν:

$$ \begin{array}{@{}rcl@{}} \tilde{R}_{\nu}(g) &=& \int\limits_{\mathscr{X}\times\lbrace 0,1\rbrace} | y-g(x)|\mathrm{d}\nu(x,y)\\ &=& \int\limits_{\mathscr{X}}\left( \int\limits_{\lbrace 0,1\rbrace}| y-g(x)|\mathrm{d}\nu(y|x)\right)\mathrm{d}\nu_{1}(x)\\ &=& \int\limits_{\mathscr{X}}\left( | 1-g(x)|(\mu(\sigma_{1}|x)\text{tr}[\sigma_{1} E_{1}]+\mu(\sigma_{0}|x)\text{tr}[\sigma_{0} E_{1}]) \right. \\ &&+ \left.| g(x)|(\mu(\sigma_{1}|x)\text{tr}[\sigma_{1} E_{0}]+\mu(\sigma_{0}|x)\text{tr}[\sigma_{0} E_{0}])\right)\mathrm{d}\mu_{1}(x) \end{array} $$

Now we use the specific property of the Holevo-Helstrom measurement that \(\text {tr}[(\sigma _{1}-\sigma _{0})E_{1}]=\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{2}\). Moreover, as g(x) ∈{0, 1}, we have |1 − g(x)| = 1 − g(x) and |g(x)| = g(x). Thus, we obtain

$$ \begin{array}{@{}rcl@{}} \tilde{R}_{\nu}(g) &=& \frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2} \int\limits_{\mathscr{X}}\left( (1-g(x))\mu(\sigma_{1}|x)+g(x)\mu(\sigma_{0}|x)\right)\mathrm{d}\mu_{1}(x) \\ &&+\int\limits_{\mathscr{X}}\left( (1-g(x))\text{tr}[\sigma_{0} E_{1}] + g(x)\text{tr}[\sigma_{1} E_{0}]\right)\mathrm{d}\mu_{1}(x)\\ &=& \frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2} \mathbb{P}_{(x,\rho)\sim\mu}[h(x)\neq\rho]+ \text{tr}[\sigma_{0} E_{1}] \\&&+(\text{tr}[\sigma_{1} E_{0}]-\text{tr}[\sigma_{0} E_{1}])\mathbb{E}_{\mu_{1}}[g], \end{array} $$

where the last step uses h(x) = σg(x). □
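Lemma C.1 can be verified exactly on a small discrete instance. The two-point instance space, the particular states, and the conditional label probabilities below are arbitrary illustrative choices (the states need not have equal purity here):

```python
import numpy as np

def bloch(r, theta):
    sx = np.array([[0.0, 1.0], [1.0, 0.0]])
    sz = np.array([[1.0, 0.0], [0.0, -1.0]])
    return 0.5 * (np.eye(2) + r * (np.sin(theta) * sx + np.cos(theta) * sz))

sigma0, sigma1 = bloch(0.9, 0.2), bloch(0.5, 1.3)   # different purities on purpose

# Holevo-Helstrom measurement {E0, E1}
w, v = np.linalg.eigh(sigma1 - sigma0)
E1 = sum(np.outer(v[:, k], v[:, k]) for k in range(2) if w[k] > 0)
E0 = np.eye(2) - E1
tn = np.abs(w).sum()                     # ||sigma0 - sigma1||_1

# Two-point instance space; p[x] = mu(sigma1|x), g a classical hypothesis, h(x) = sigma_{g(x)}
mu1 = np.array([0.6, 0.4])
p = np.array([0.2, 0.9])
g = np.array([0, 1])

t = lambda E, s: np.trace(E @ s).real

# Left-hand side: the intermediate risk, computed directly from the definition of nu
lhs = sum(mu1[x] * (abs(1 - g[x]) * (p[x] * t(E1, sigma1) + (1 - p[x]) * t(E1, sigma0))
                    + abs(g[x]) * (p[x] * t(E0, sigma1) + (1 - p[x]) * t(E0, sigma0)))
          for x in (0, 1))

# Right-hand side: the expression in Lemma C.1
P_err = sum(mu1[x] * (p[x] if g[x] == 0 else 1 - p[x]) for x in (0, 1))
Eg = sum(mu1[x] * g[x] for x in (0, 1))
rhs = tn / 2 * P_err + t(E1, sigma0) + (t(E0, sigma1) - t(E1, sigma0)) * Eg
assert np.isclose(lhs, rhs)
```

The check also implicitly confirms the Helstrom property tr[(σ1 − σ0)E1] = ‖σ0 − σ1‖1/2 that the proof relies on.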

This allows us to easily compare the true and the intermediate risk and obtain

$$ \begin{array}{@{}rcl@{}} \tilde{R}_{\nu} (g)-R_{\mu}(h) &=& \text{tr}[\sigma_{0} E_{1}](1-2\mathbb{E}_{\mu_{1}}[g]) \\ &&+ \left( 1-\frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2}\right)\mathbb{E}_{\mu_{1}}[g]. \end{array} $$

As \(g(x)\in \lbrace 0,1\rbrace \ \forall x\in {\mathscr{X}}\) and in particular \(0\leq \mathbb {E}_{\mu _{1}}[g]\leq 1\), this gives rise to the following

Corollary 2

With the notation as in the Holevo-Helstrom strategy it holds that

$$ \begin{array}{@{}rcl@{}} &&\tilde{R}_{\nu} (g) - \max\lbrace\text{tr}[\sigma_{0} E_{1}],\text{tr}[\sigma_{1} E_{0}]\rbrace \leq R_{\mu}(h)\leq \tilde{R}_{\nu} (g) \\&&- \min\lbrace\text{tr}[\sigma_{0} E_{1}],\text{tr}[\sigma_{1} E_{0}]\rbrace. \end{array} $$

We can extend this to a comparison between the excess risks

$$ \begin{array}{@{}rcl@{}} R_{\mu} (h) - R^{*}_{\mu,\mathscr{F}}\!&:=&\!R_{\mu} (h) - \inf\limits_{\eta\in\mathscr{F}} R_{\mu}(\eta)\ \text{and}\ \tilde{R}_{\nu} (g) - \tilde{R}^{*}_{\nu,\tilde{\mathscr{F}}}\\ \!&:=&\! \tilde{R}_{\nu} (g) - \inf\limits_{\gamma\in\tilde{\mathscr{F}}}\tilde{R}_{\nu} (\gamma) \end{array} $$

which are the quantities of interest for agnostic learning scenarios.

Corollary 3

With the notation as in the Holevo-Helstrom strategy it holds that

$$ \begin{array}{@{}rcl@{}} &&\!\tilde{R}_{\nu} (g) - \tilde{R}^{*}_{\nu,\tilde{\mathscr{F}}} - |\text{tr}[\sigma_{0} E_{1}]-\text{tr}[\sigma_{1} E_{0}]| \!\leq\! R_{\mu} (h) - R^{*}_{\mu,\mathscr{F}}\\ &\leq&\! \tilde{R}_{\nu} (g) - \tilde{R}^{*}_{\nu,\tilde{\mathscr{F}}} + |\text{tr}[\sigma_{0} E_{1}]-\text{tr}[\sigma_{1} E_{0}]| \end{array} $$

So we see that solving the classical learning task in step 3 of the Holevo-Helstrom strategy does not necessarily imply success at the overall learning task if the target accuracy is ε < |tr[σ0E1] −tr[σ1E0]|. This problem is addressed by the noise-corrected Holevo-Helstrom strategy presented in Section 4.

Remark 4

We want to shortly discuss a special case in which the connection between Rμ(h) and \(\tilde {R}_{\nu } (g)\) takes a particularly appealing form. Namely, assume that σ0 and σ1 are such that the corresponding Holevo-Helstrom measurement produces equal probabilities of error, i.e., tr[E0σ1] = tr[E1σ0]. This is clearly not true in general, take, e.g., σ0 = |0〉〈0| and \(\sigma _{1}=\frac {1}{2}(|0\rangle \langle 0|+|1\rangle \langle 1|)\). It does, however, hold true in certain special cases, e.g., if both σ0 and σ1 are pure or if σ0 and σ1 have the same (non-trivial) purity and tr[E0] = tr[E1]. (The latter is, e.g., satisfied if σ0 and σ1 are qubit states of the same (non-zero) purity.)

In this simple case our previous discussion yields \(R_{\mu }(h)=\tilde {R}_{\nu } (g)\), in particular, if we succeed at the classical binary classification task in step 3, then we also succeed at the overall classification task with quantum labels, so the quantum learning task is reduced to a classical learning problem.

Appendix 4: Sample complexity of binary classification with two-sided classification noise

Here, we discuss the sample complexity of the PAC learning task of binary classification in the presence of (two-sided) classification noise in the realizable scenario. To be in congruence with the literature on this and related problems, we will use a slightly different notation than in the main body of the paper. Namely, we will consider classical input space \({\mathscr{X}}\) and classical target space {0, 1}, a concept class \({\mathscr{F}}\subset \lbrace 0,1\rbrace ^{{\mathscr{X}}}\), a probability measure \(\mu \in \text {Prob}({\mathscr{X}})\), and noise probabilities \(0\leq \eta _{0},\eta _{1}<\frac {1}{2}\), with which labels are flipped. Moreover, we will work with the 0-1-loss function and denote the corresponding risk of a hypothesis h w.r.t. a target concept f by errμ(h; f) = μ[h(x)≠f(x)]. Finally, any training data sample S splits the concept class \({\mathscr{F}}\) into so-called S-equivalence classes, where \(f_{1},f_{2}\in {\mathscr{F}}\) are equivalent if and only if \(f_{1}(x)=f_{2}(x)\ \forall x\in {\mathscr{X}}\) s.t. ∃y ∈{0, 1} with (x,y) ∈ S.

The basic learning strategy underlying our discussion is Algorithm 1. It is the natural analog of searching for a consistent function in the case of noisy labels. Namely, as such a consistent function will in general not exist, it searches for a function that disagrees with the training data on as few examples as possible.
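The pseudocode of Algorithm 1 is missing from this rendering. The text describes it as minimizing the number of disagreements with the noisy training data over the concept class; a minimal sketch over a finite hypothesis pool might look as follows (the finite pool is a simplifying assumption for illustration, whereas the paper works with a general VC class):

```python
def minimum_disagreement(sample, hypotheses):
    """Return a hypothesis from `hypotheses` disagreeing with the fewest labeled examples.

    sample: iterable of (x, y) pairs with y in {0, 1}, possibly noise-corrupted
    hypotheses: finite iterable of callables x -> {0, 1}
    """
    sample = list(sample)
    return min(hypotheses, key=lambda f: sum(1 for x, y in sample if f(x) != y))
```

For instance, if a parity target has a single flipped label in the sample, the parity function still incurs the fewest disagreements and is returned.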


Theorem 4.1

(see Laird 1988, Theorems 5.7 and 5.33)

The output hypothesis h of Algorithm 1 satisfies errμ(h; f) ≤ ε with probability at least 1 − δ.

Laird’s original proof that this algorithm solves the PAC learning problem is for the case η0 = η1. It is, however, easily generalized to our case because we still assume the same noise bound on both error rates. (We only have to adapt the expression for the error rate and the corresponding Hoeffding bounds.)

In order to apply the reasoning by Hanneke (2016) we need to slightly reformulate the result of this algorithm s.t. we obtain a bound on the error in terms of the sample size. When following the proof of Theorem 5.7 in Laird (1988) we see that m1 is used to ensure that there is a hypothesis which performs better than some given error threshold and m2 is used to ensure that such a hypothesis is actually chosen. In particular, if we use the error bound by Blumer et al. (1989) in terms of the sample size, we see that m2 depends on m1 as follows:

$$ \begin{array}{@{}rcl@{}} m_{2}&=& \frac{2}{1-\exp(-\frac{1}{2}(1-2\eta_{b})^{2})}\cdot \frac{m_{1}}{2}\cdot \frac{1}{d\log\left( \frac{2em_{1}}{d}\right)+\log\left( \frac{2}{\delta}\right)} \\ &&\cdot \ln\left( \frac{1}{\delta} ({m_{1}^{d}} + 1)\right). \end{array} $$

Remark 5

Note that we cannot directly use the tighter error bound in terms of the sample complexity proved by Hanneke (2016) here because Laird’s proof explicitly makes use of the strategy employed by Blumer et al. (1989) which works via consistency with a given training sample.

We can now easily bound

$$ \begin{array}{@{}rcl@{}} m&=& m_{1} + m_{2} \leq m_{1} \cdot\left( \vphantom{\frac{1}{1 - \frac{d\log\left( \frac{d}{2e}\right)}{d \log\left( m_{1}\right) + \log\left( \frac{2}{\delta}\right)}}}1 + \frac{1}{1-\exp(-\frac{1}{2}(1-2\eta_{b})^{2})}\right.\\ &&\left.\cdot\frac{1}{\log(e)}\cdot\frac{1}{1 - \frac{d\log\left( \frac{d}{2e}\right)}{d \log\left( m_{1}\right) + \log\left( \frac{2}{\delta}\right)}}\right). \end{array} $$

If we now further assume that δ > 0 is chosen s.t. \(\log \left (\frac {2}{\delta }\right )>2d\log \left (\frac {d}{2e}\right )\), then we can continue upper bounding this and obtain

$$ \begin{array}{@{}rcl@{}} m = m_{1} + m_{2} \leq (1 + C(\eta_{b}))m_{1}, \end{array} $$

where we defined \(C(\eta _b):=\frac {2}{1-\exp (-\frac {1}{2}(1-2\eta _b)^{2})}\). It is easy to check that for \(0\leq \eta _b < \frac {1}{2}\), \(C(\eta _b)\leq \frac {6}{(1-2\eta _b)^{2}}\), which will be used later on.
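A quick grid check (the grid itself is an arbitrary choice standing in for the elementary calculus argument) confirms a bound of the form C(η_b) ≤ 6/(1 − 2η_b)²; note that C(0) = 2/(1 − e^{−1/2}) ≈ 5.08, so the constant in such a bound cannot be taken much smaller:

```python
import numpy as np

# Grid check of C(eta_b) = 2 / (1 - exp(-(1 - 2*eta_b)^2 / 2)) <= 6 / (1 - 2*eta_b)^2
eta = np.linspace(0.0, 0.4999, 2000)
x = 1.0 - 2.0 * eta
C = 2.0 / (1.0 - np.exp(-0.5 * x ** 2))
assert np.all(C * x ** 2 <= 6.0)
print(round(float((C * x ** 2).max()), 3))  # -> 5.083, attained at eta_b = 0
```

Since C(η_b)(1 − 2η_b)² = 4t/(1 − e^{−t}) with t = (1 − 2η_b)²/2 ∈ (0, 1/2] is increasing in t, the maximum over the grid sits at η_b = 0, as the printed value shows.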

Hence, using a sample of size m ≥ 2(1 + C(ηb)) for the minimum disagreement strategy with \(m_{2}=\lceil \frac {C(\eta _b)}{1 + C(\eta _b)} m\rceil \) and m1 = m − m2 gives, using \(\frac {m}{2(1+C(\eta _b))}\leq m_{1}\leq \frac {m}{1+C(\eta _b)}\leq \frac {m_{2}}{C(\eta _b)}\), an error guarantee of

$$ \begin{array}{@{}rcl@{}} \text{err}_{\mu} (h;f^{*}) &\leq& \frac{4}{m_{1}}\left( d\log\left( \frac{2em_{1}}{d}\right)+\log\left( \frac{2}{\delta}\right)\right)\\ &\leq& \frac{8\cdot (1 + C(\eta_{b}))}{m}\left( d\log\left( \frac{2em}{d\cdot (1+C(\eta_{b}))}\right)+\log\left( \frac{2}{\delta}\right)\right). \end{array} $$

With this suboptimal base learner we will now follow the strategy by Hanneke (2016) in order to build a better learner from it. Note that Hanneke’s proof includes several steps in which the existence of a function consistent with the respective subsample is ensured. This is not necessary in our case because the minimum disagreement strategy does not require a consistent function to exist.

We recall the algorithm for preprocessing the training data to generate subsamples as introduced in Hanneke (2016) in our Algorithm 2.
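Algorithm 2's pseudocode box is likewise missing here. From the description in the proof below (S0 receives the first m − 3⌊m/4⌋ examples, S1, S2, S3 the three remaining quarters, and the three recursive calls on S0 each receive the other two quarters appended to T), a sketch could be the following; the base case, returning the single subsample S ∪ T once |S| ≤ 3, is our assumption about the omitted box:

```python
def subsample_splits(S, T=()):
    """Recursively generate the subsamples on which the base learner is trained.

    Sketch reconstructed from the proof text; the base case (|S| <= 3 returns the
    single subsample S + T) is an assumption about the omitted pseudocode.
    """
    S, T = list(S), list(T)
    if len(S) <= 3:
        return [S + T]
    q = len(S) // 4
    S0 = S[: len(S) - 3 * q]
    S1, S2, S3 = S[-3 * q : -2 * q], S[-2 * q : -q], S[-q:]
    splits = []
    for Tj in (S2 + S3 + T, S1 + S3 + T, S1 + S2 + T):  # T1, T2, T3 as in the proof
        splits += subsample_splits(S0, Tj)
    return splits
```

Each level of the recursion multiplies the number of subsamples by three, and all three recursive calls contribute equally many entries, as used in the majority-vote argument below.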


Theorem 4.2

Let ε ∈ (0, 1), \(\delta \in (0,2\cdot (\frac {2e}{d})^d)\) and \(\eta _b \in (0,\frac {1}{2})\). Let \({\mathscr{F}}\subset \lbrace 0,1\rbrace ^{{\mathscr{X}}}\) be a function class of VC-dimension d. Then \(m=m(\varepsilon ,\delta )={\mathscr{O}}\left (\frac {1}{\varepsilon (1-2\eta _b)^{2}} \left (d + \log \left (\frac {1}{\delta }\right )\right )\right )\) noisy examples from a function in \({\mathscr{F}}\) are sufficient for binary classification in the presence of two-sided classification noise with error probabilities 0 ≤ η0,η1 < ηb with accuracy ε and confidence 1 − δ.


This proof is analogous to the proof of Theorem 2 in Hanneke (2016) with some minor simplifications and adaptations and is given here only for the sake of completeness.

Fix an \(f^{*}\in {\mathscr{F}}\) and a probability measure μ over \({\mathscr{X}}\). Denote by S = S1:m the corresponding noisy training data. For any classifier h denote by \(ER(h)=\lbrace x\in {\mathscr{X}}|h(x)\neq f^{*}(x)\rbrace \) the set of instances on which h errs.

Fix c = 7200. We will show by strong induction that \(\forall m^{\prime }\in \mathbb {N}\), \(\forall \delta ^{\prime }\in (0,\ldots )\) and for all finite sequences \(T^{\prime }\), with probability \(\geq 1-\delta ^{\prime }\), the classifier

$$ \hat{h}_{m^{\prime},T^{\prime}} = \text{Majority}\left( L(\mathbb{A}(S_{1:m^{\prime}};T^{\prime}))\right) $$

satisfies the error bound

$$ \text{err}_{\mu}(\hat{h}_{m^{\prime},T^{\prime}}, f^{*})\leq \frac{cC(\eta_{b})}{1+m^{\prime}}\left( d + \ln\left( \frac{18}{\delta^{\prime}}\right)\right). $$

As base case consider \(m^{\prime }\leq C(\eta _b)c\cdot \ln (18e) - 1\). In this case, for any \(\delta ^{\prime }\in (0,1)\) and for any finite sequence \(T^{\prime }\), we trivially have

$$ \begin{array}{@{}rcl@{}} \text{err}_{\mu}(\hat{h}_{m^{\prime},T^{\prime}}, f^{*}) &\leq& 1\\ &\leq& \frac{c\cdot C(\eta_{b})}{1+m^{\prime}}\left( d + \ln(18)\right)\\ &\leq& \frac{c\cdot C(\eta_{b})}{1+m^{\prime}}\left( d + \ln\left( \frac{18}{\delta^{\prime}}\right)\right), \end{array} $$

as desired.

For the induction step, assume that for some \(m>C(\eta _b)c\cdot \ln (18e) - 1\), for all \(m^{\prime }\in \mathbb {N}\) with \(m^{\prime }<m\), for all \(\delta ^{\prime }\in (0,2\cdot (\frac {2e}{d})^d)\) and for all finite sequences \(T^{\prime }\), with probability \(\geq 1-\delta ^{\prime }\), (4.3) holds.

Note that by our choice of c we have \(C(\eta _b)c\cdot \ln (18e) - 1\geq 3\). Thus, |S1:m|≥ 4 and therefore \(\mathbb {A}\left (S_{1:m};T\right )\) returns in step 3. Let S0,S1,S2,S3 be as in \(\mathbb {A}(S;T)\). Denote T1 = S2S3T, T2 = S1S3T, T3 = S1S2T and \(h_i = \text {Majority}\left (L (\mathbb {A}(S_{0};T_i))\right )\) for each i ∈{1, 2, 3}.

Note that \(S_{0}=S_{1:(m-3\lfloor \frac {m}{4}\rfloor )}\). As m ≥ 4, \(1\leq m-3\lfloor \frac {m}{4}\rfloor < m\). Also, \(h_i = \hat {h}_{(m-3\lfloor \frac {m}{4}\rfloor ),T_i}\). So by the induction hypothesis applied under the conditional distribution given S1,S2,S3, which are independent of S0, combined with the law of total probability, for every i ∈{1, 2, 3} there exists an event Ei of probability \(\geq 1-\frac {\delta }{9}\) on which

$$ \begin{array}{@{}rcl@{}} \mu[ER(h_{i})] &\leq& \frac{cC(\eta_{b})}{1 + |S_{0}|} \left( d + \ln\left( \frac{9\cdot 18}{\delta}\right)\right)\\ &\leq& \frac{4cC(\eta_{b})}{m}\left( d + \ln\left( \frac{9\cdot 18}{\delta}\right)\right). \end{array} $$

Next, fix an i ∈{1, 2, 3} and write \(\lbrace (\tilde {X}_{i,1},\tilde {Y}_{i,1}),\ldots ,\) \((\tilde {X}_{i,N_i},\tilde {Y}_{i,N_i})\rbrace := S_i\cap (ER(h_i)\times {\mathscr{Y}})\). As hi and Si are independent, \(\tilde {X}_{i,1},\ldots ,\tilde {X}_{i,N_i}\) are conditionally independent given hi and Ni. Therefore, we can apply the error bound (4.2) for our base learner L under the conditional distribution given hi and Ni to conclude: There exists an event \(E^{\prime }_i\) of probability \(\geq 1-\frac {\delta }{9}\) s.t., if Ni > 0, then the output h of the base learner L upon input of \(S_i\cap (ER(h_i)\times {\mathscr{Y}})\) satisfies

$$ \text{err}_{\mu(\cdot|ER(h_{i}))}(h,f^{*}) \leq \frac{8 (1+ C(\eta_{b}))}{N_{i}}\left( d\log\left( \frac{2eN_{i}}{d(1+ C(\eta_{b}))}\right) + \log\left( \frac{18}{\delta}\right)\right). $$

In particular, on \(E^{\prime }_i\) (if Ni > 0) every \(h\in \underset {j\in \lbrace 1,2,3\rbrace \setminus \lbrace i\rbrace }{\bigcup }L\) \(\left (\mathbb {A}(S_{0};T_j)\right )\) satisfies

$$ \begin{array}{@{}rcl@{}} \mu[ER(h)\cap ER(h_{i})] &=& \mu[ER(h_{i})]\,\mu[ER(h)|ER(h_{i})]\\ &=& \mu[ER(h_{i})]\, \text{err}_{\mu(\cdot | ER(h_{i}))}(h,f^{*})\\ &\leq& \mu[ER(h_{i})]\,\frac{8 (1+ C(\eta_{b}))}{N_{i}}\left( d\log\left( \frac{2eN_{i}}{d(1+ C(\eta_{b}))}\right) + \log\left( \frac{18}{\delta}\right)\right). \end{array} $$

Using Chernoff bounds we get that there exists an event \(E^{\prime \prime }_i\) of probability \(\geq 1-\frac {\delta }{9}\) s.t., if \(\mu [ER(h_i)]\geq \frac {2 (\frac {10}{3})^{2}}{\lfloor \frac {m}{4}\rfloor } \ln \left (\frac {9}{\delta }\right )\), then \(N_i\geq \frac {7}{10}\mu [ER(h_i)]\lfloor \frac {m}{4}\rfloor \). In particular, on \(E^{\prime \prime }_i\) we have the implication

$$ \begin{array}{@{}rcl@{}} \mu[ER(h_{i})]\geq \frac{2 (\frac{10}{3})^{2}}{\lfloor \frac{m}{4}\rfloor} \ln\left( \frac{9}{\delta}\right)\ \Rightarrow \ N_{i}>0. \end{array} $$

If we now combine this with (4.4) and (4.7), then we see: On \(E_{i}\cap E^{\prime }_i \cap E^{\prime \prime }_i\), if \(\mu [ER(h_i)]\geq \frac {2 (\frac {10}{3})^{2}}{\lfloor \frac {m}{4}\rfloor } \ln \left (\frac {9}{\delta }\right )\), then every \(h\in \bigcup \limits _{j\in \lbrace 1,2,3\rbrace \setminus \lbrace i\rbrace } L\left (\mathbb {A}(S_{0};T_j)\right )\) satisfies

$$ \begin{array}{@{}rcl@{}} &&\mu[ER(h)\cap ER(h_{i})] \\&\leq& \frac{80\cdot C(\eta_{b})}{7 \lfloor \frac{m}{4}\rfloor} \left( d\log\left( \frac{2e\cdot\frac{7}{10}\cdot \mu[ER(h_{i})]\lfloor \frac{m}{4}\rfloor}{dC(\eta_{b})}\right) + \log\left( \frac{18}{\delta}\right)\right)\\ &\leq& \frac{80\cdot C(\eta_{b})}{7 \lfloor \frac{m}{4}\rfloor} \left( d\log\left( \frac{\frac{7e}{5}\cdot c\left( d+\ln\left( \frac{9\cdot 18}{\delta}\right)\right)}{d}\right) + \log\left( \frac{18}{\delta}\right)\right)\\ &\leq& \frac{80\cdot C(\eta_{b})}{7 \lfloor \frac{m}{4}\rfloor} \left( d\log\left( \frac{2}{5}c\left( \frac{7}{2}e + \frac{7e}{d}\ln\left( \frac{18}{\delta}\right)\right)\right) + \log\left( \frac{18}{\delta}\right)\right)\\ &\leq& \frac{80\cdot C(\eta_{b})}{7\ln(2) \lfloor \frac{m}{4}\rfloor} \left( d\ln\left( \frac{9ec}{5}\right) + 8 \ln\left( \frac{18}{\delta}\right)\right), \end{array} $$

where the last step uses the technical Lemma 5 from the Appendix of Hanneke (2016). As \(m>C(\eta _b)c\cdot \ln (18e) - 1>3200\), we have \(\lfloor \frac {m}{4}\rfloor >\frac {m-4}{4}>\frac {799}{800}\frac {m}{4} >\frac {799}{800}\frac {3200}{3201}\frac {m+1}{4}\). We use this relaxation and compute the logarithmic factors to obtain from the above that

$$ \mu[ER(h)\cap ER(h_{i})] \leq \frac{600 \cdot C(\eta_{b})}{m+1}\left( d + \ln\left( \frac{18}{\delta}\right)\right). $$
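The chain of elementary floor inequalities used in the relaxation above can be spot-checked over a finite range of sample sizes (the upper end of the range is an arbitrary choice):

```python
# Spot-check, for m > 3200, of the chain
#   floor(m/4) > (m-4)/4 > (799/800)*(m/4) > (799/800)*(3200/3201)*((m+1)/4)
def floor_chain_holds(m):
    f = m // 4
    return (f > (m - 4) / 4
            and (m - 4) / 4 > (799 / 800) * (m / 4)
            and (799 / 800) * (m / 4) > (799 / 800) * (3200 / 3201) * ((m + 1) / 4))

assert all(floor_chain_holds(m) for m in range(3201, 20001))
```

The tightest case is m = 3201, where the last inequality holds by a margin of exactly one part in 3200·3202.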

Moreover, if \(\mu [ER(h_i)]<\frac {23}{\lfloor \frac {m}{4}\rfloor }\ln \left (\frac {9}{\delta }\right )\), then simply because μ is a probability measure, we conclude

$$ \begin{array}{@{}rcl@{}} \mu[ER(h)\cap ER(h_{i})] &\leq& \mu[ER(h_{i})] < \frac{23}{\lfloor \frac{m}{4}\rfloor}\ln\left( \frac{9}{\delta}\right) \\&<& \frac{600\cdot C(\eta_{b})}{m+1}\left( d + \ln\left( \frac{18}{\delta}\right)\right). \end{array} $$

Hence, no matter what value μ[ER(hi)] takes, on the event \(E_{i}\cap E^{\prime }_i\cap E^{\prime \prime }_i\) we have for all \(h\in \bigcup \limits _{j\in \lbrace 1,2,3\rbrace \setminus \lbrace i\rbrace } L\left (\mathbb {A}(S_{0};T_j)\right )\) that

$$ \mu[ER(h)\cap ER(h_{i})] \leq \frac{600\cdot C(\eta_{b})}{m+1}\left( d + \ln\left( \frac{18}{\delta}\right)\right). $$

Now denote \(h_{\text {maj}} = \hat {h}_{m,T} = \text {Majority}(L(\mathbb {A}(S;T)))\) for S = S1:m. By definition of the majority function, for any \(x\in {\mathscr{X}}\) at least \(\frac {1}{2}\) of the classifiers h in the sequence \(L(\mathbb {A}(S;T))\) satisfy h(x) = hmaj(x). So by the strong form of the pigeonhole principle, there exists an i ∈{1, 2, 3} s.t. hi(x) = hmaj(x). Also, since each \(\mathbb {A}(S_{0};T_j)\) contributes an equal number of entries to \(\mathbb {A}(S;T)\), for each i ∈{1, 2, 3}, at least \(\frac {1}{4}\) of the classifiers \(h\in \underset {j\in \lbrace 1,2,3\rbrace \setminus \lbrace i\rbrace }{\bigcup } L\left (\mathbb {A}(S_{0};T_j)\right )\) satisfy h(x) = hmaj(x).

In particular, if I is a random variable independent of the training data and distributed uniformly on {1, 2, 3} and if \(\tilde {h}\) is a random variable conditionally given I and S uniformly distributed on \(\bigcup \limits _{j\in \lbrace 1,2,3\rbrace \setminus \lbrace I\rbrace } L\left (\mathbb {A}(S_{0};T_j)\right )\), then for any fixed xER(hmaj), with conditional probability \(\geq \frac {1}{12}\), \(h_I(x)=\tilde {h}(x)=h_{\text {maj}}(x)\) and thus \(x\in ER(h_I)\cap ER(\tilde {h})\).

Hence, for a random variable \(X\sim \mu \) independent of the data, of I and of \(\tilde {h}\), we can now conclude that \(\text {err}_{\mu }(h_{\text {maj}};f^{*}) = \mathbb {P}[X\in ER(h_{\text {maj}})] \leq 12\cdot \mathbb {E}[\mu [ER(h_{I})\cap ER(\tilde {h})]\,|\,S]\).

So on the event \(\underset {i\in \lbrace 1,2,3\rbrace }{\bigcap } E_{i}\cap E^{\prime }_i \cap E^{\prime \prime }_i\) it holds that

$$ \begin{array}{@{}rcl@{}} \text{err}_{\mu} (h_{\text{maj}};f^{*}) &\leq& 12 \mathbb{E}[\mu[ER(h_{I})\cap ER(\tilde{h})]|S]\\ &\leq& 12 \underset{i\in\lbrace 1,2,3\rbrace}{\max} \underset{j\in\lbrace 1,2,3\rbrace\setminus\lbrace i\rbrace}{\max} \underset{h\in L(\mathbb{A}(S_{0};T_{j}))}{\max} \mu[ER(h_{i})\cap ER(h)]\\ &<& \frac{7200 \cdot C(\eta_{b})}{m+1}\left( d + \ln\left( \frac{18}{\delta}\right)\right)\\ &=& \frac{c\cdot C(\eta_{b})}{m+1}\left( d + \ln\left( \frac{18}{\delta}\right)\right). \end{array} $$

Since by the union bound the event \(\underset {i\in \lbrace 1,2,3\rbrace }{\bigcap } E_{i}\cap E^{\prime }_i \cap E^{\prime \prime }_i\) has probability ≥ 1 − δ, the induction step is complete.

It remains to use the claim just proven by induction to derive the desired sample complexity upper bound. For this, take T = ∅ and note that for \(m\geq \lfloor \frac {cC(\eta _b)}{\varepsilon }\left (d + \ln \left (\frac {18}{\delta }\right )\right )\rfloor \) the right-hand side of (4.3) is ≤ ε. Therefore, such a sample size suffices for successful learning using \(\text {Majority}(L(\mathbb {A}(\cdot ;\emptyset )))\). Now recall the discussion before the theorem, where we observed that \(C(\eta _b)\leq \frac {6}{(1-2\eta _b)^{2}}\), to finish the proof. □


Caro, M.C. Binary classification with classical instances and quantum labels. Quantum Mach. Intell. 3, 18 (2021).



Keywords:

  • Quantum learning theory
  • Sample complexity
  • Binary classification
  • VC-dimension