BINARY CLASSIFICATION WITH CLASSICAL INSTANCES AND QUANTUM LABELS
Jun 2020

Abstract. In classical statistical learning theory, one of the most well-studied problems is that of binary classification. The information-theoretic sample complexity of this task is tightly characterized by the Vapnik-Chervonenkis (VC) dimension (see Blumer et al. 1989; Hanneke 2016). A quantum analog of this task, with training data given as a quantum state (see Bshouty and Jackson 1998), has also been intensely studied and is now known to have the same sample complexity as its classical counterpart (Arunachalam and de Wolf 2018). We propose a novel quantum version of the classical binary classification task by considering maps with classical input and quantum output and corresponding classical-quantum training data. We discuss learning strategies for the agnostic and for the realizable case and study their performance to obtain sample complexity upper bounds. Moreover, we provide sample complexity lower bounds which show that our upper bounds are essentially tight for pure output states. In particular, we see that the sample complexity is the same as in the classical binary classification task w.r.t. its dependence on accuracy, confidence and the VC-dimension.


Introduction
The fields of machine learning and of quantum computation provide new ways of looking at computational problems and have seen a significant increase in academic as well as practical interest since their origins in the 1970s and 1980s. More recently, attention was directed to paths for combining ideas from these two fruitful research areas. This gave rise to new approaches under different names such as "quantum machine learning" or "quantum learning theory".
In classical statistical learning theory, one of the most influential frameworks is that of probably approximately correct (PAC) learning due to (Vapnik and Chervonenkis 1971) and (Valiant 1984). It is particularly well studied for the task of binary classification. For this problem the so-called VC-dimension (Vapnik and Chervonenkis 1971) is known to characterize the sample complexity of learning a function class (Blumer et al. 1989; Hanneke 2016). Motivated by these strong theoretical results, a quantum analog of this problem was soon defined and studied in a series of papers (an overview of which is given in (Arunachalam and de Wolf 2017)), which culminated in (Arunachalam and de Wolf 2018). Therein it is shown that the information-theoretic complexity of the task of quantum PAC learning a 0-1-valued function class is characterized by the VC-dimension in exactly the same way as for the classical scenario.
The scenario studied in (Arunachalam and de Wolf 2018) assumes the training data available to the learner to be given in a specific quantum form and allows the learner to perform quantum computational operations on that training data. The functions to be learned, however, still map classical inputs to classical outputs. We propose a different quantum version of the binary classification task by not only considering the possibility of quantum training data but by allowing the objects to be learned to be inherently quantum. More specifically, we consider functions which map classical inputs to one of two possible quantum output states ("quantum labels"). These maps describe state preparation procedures. A learning task of this or a similar type could be relevant for cases in which state preparation is either costly or time-consuming, e.g., preparing thermal states at low temperatures (see Brandão and Kastoryano 2019; Chowdhury et al. 2020, and references therein). Here, one could first produce sample data, learn a predictor, and then reproduce the preparation more efficiently using the predictor.
1.1. Main Results. We consider maps $f : \mathcal{X} \to \{\sigma_0, \sigma_1\}$ that assign to points in a classical input space $\mathcal{X}$ one of two labelling quantum states $\sigma_0, \sigma_1$. Let $\mathcal{F}$ be a function class consisting of such functions. We assume the training data to be given as a classical-quantum state about which, according to the laws of quantum theory, we can only gain information by performing measurements.
Our learning model is that of PAC-learning with accuracy $\varepsilon$ and confidence $\delta$. Here, we require a learning algorithm, given as input classical-quantum training data generated according to some unknown underlying distribution, to output with probability $\geq 1-\delta$ over the choice of training data a hypothesis that achieves accuracy $\varepsilon$. (Accuracy is measured in terms of the trace distance.) We present a learning strategy that $(\varepsilon, \delta)$-PAC learns $\mathcal{F} \subseteq \{f : \mathcal{X} \to \{\sigma_0, \sigma_1\}\}$ in the agnostic scenario from classical-quantum training data of size $O\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$, where $d$ is the VC-dimension of the $\{0,1\}$-valued function class $\tilde{\mathcal{F}} \subseteq \{f : \mathcal{X} \to \{0,1\}\}$ induced by $\mathcal{F}$ via $\sigma_i \mapsto i$, $i = 0, 1$. Here, "agnostic" means that there need not be a function in $\mathcal{F}$ that would achieve perfect accuracy. We also show that solving this learning problem requires training data size $\Omega\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$, so our strategy is optimal w.r.t. the sample complexity dependence on $\varepsilon$, $\delta$ and $d$.
For the realizable scenario of the quantum learning problem, i.e., under the assumption that perfect accuracy can be achieved using $\mathcal{F}$, we prove a sample complexity upper bound of $O\left(\frac{1}{(1 - 2\max\{\operatorname{tr}[E_0\sigma_1], \operatorname{tr}[E_1\sigma_0]\})^2}\left(\frac{d}{\varepsilon} + \frac{\log 1/\delta}{\varepsilon}\right)\right)$, where $\{E_0, E_1\}$ is the Holevo-Helstrom measurement for distinguishing $\sigma_0$ and $\sigma_1$, and a sample complexity lower bound of $\Omega\left(\frac{d}{\varepsilon} + \frac{\log 1/\delta}{\varepsilon}\right)$. Also here, these bounds coincide w.r.t. their dependence on $\varepsilon$, $\delta$ and $d$. The prefactor $(1 - 2\max\{\operatorname{tr}[E_0\sigma_1], \operatorname{tr}[E_1\sigma_0]\})^{-2}$ in the upper bound comes from our procedure trying to distinguish $\sigma_0$ and $\sigma_1$ by measuring single copies. (Note: Even though we formulate this in terms of the Holevo-Helstrom measurement, we could use any other two-outcome POVM $\{\tilde{E}_0, \tilde{E}_1\}$ that satisfies $\max\{\operatorname{tr}[\tilde{E}_0\sigma_1], \operatorname{tr}[\tilde{E}_1\sigma_0]\} < 1/2$.) In proving the sample complexity upper bound for the realizable scenario, we combine algorithms from (Laird 1988) and (Hanneke 2016) to show that $O\left(\frac{1}{\varepsilon(1-2\eta_b)^2}(d + \log 1/\delta)\right)$ classical examples with two-sided classification noise, i.e., in which the label is flipped with probability given by a noise rate, suffice for classical $(\varepsilon, \delta)$-PAC learning a function class of VC-dimension $d$ in the realizable scenario if the noise rate is bounded by $\eta_b < 1/2$. This upper bound has, to the best of our knowledge, not been observed before and, when combined with the lower bound from (Arunachalam and de Wolf 2018), establishes the optimal sample complexity of this classical noisy learning problem.
As is common in statistical learning theory, our main focus lies on the information-theoretic complexity of the learning problem, i.e., the necessary and sufficient number of quantum examples, whereas we do not discuss the computational complexity. Our proposed strategies are "semi-classical" in the sense that after initially performing simple tensor-product measurements, in which each tensor factor is a two-outcome POVM, the remaining computation is done by a classical learning algorithm. In particular, the procedure does not require (possibly hard to implement) joint measurements and its computational complexity will be determined by the (classical) computational complexity of the classical learner used as a subroutine.
1.2. Overview over the Proof Strategy. We first sketch how we obtain the sample complexity upper bounds. We propose a simple (semi-classical) procedure that consists of first performing local measurements on the quantum part of the training data examples to obtain classical training data and then applying a classical learning algorithm. We observe that the learning problem for which the classical learner is applied can then be viewed as a classical binary classification problem with two-sided classification noise, i.e., in which the labels are flipped with certain error probabilities determined by the outcome probabilities of the performed quantum measurements. Therefore, we have reduced our problem to obtaining sample complexity upper bounds for a classical learning problem with noise. In the general (so-called agnostic) case, we can use known sample complexity bounds formulated in terms of yet another complexity measure, the so-called Rademacher complexity, to obtain that classical empirical risk minimization w.r.t. a suitably modified loss function (as suggested in (Natarajan et al. 2013)) achieves optimal sample complexity for this classical learning problem with noise. In the realizable case, i.e., under the assumption that any non-noisy training data set can be perfectly represented by some hypothesis in our class $\tilde{\mathcal{F}}$, the optimal sample complexity for binary classification with two-sided classification noise has not been established in the literature. We combine ideas from (Laird 1988) and (Hanneke 2016) to exhibit an algorithm that achieves information-theoretic optimality for this scenario.
To obtain the sample complexity lower bounds, we apply ideas from (Arunachalam and de Wolf 2018). Namely, we observe that for sufficiently small accuracy parameter, any quantum strategy that solves our learning problem indeed has to be able to distinguish between the possible different training data states with high success probability.
In the simple case of distinguishing between two quantum states, arising from two different "hard-to-distinguish" underlying distributions, this probability can be upper-bounded in terms of the trace distance of the states. In the more general case of many states, we do not study this success probability directly. Instead, we consider the information contained in the quantum training data about the choice of the underlying distribution, again chosen out of a set of "hard-to-distinguish" distributions.
1.3. Related Work. (Bshouty and Jackson 1998) introduced a notion of quantum training data for learning problems with classical concepts and used it to learn DNF (Disjunctive Normal Form) formulae w.r.t. the uniform distribution. This was extended to product distributions by (Kanade et al. 2018). Using ideas from Fourier-based learning, this type of quantum training data was also studied in the context of fixed-distribution learning of Boolean linear functions (Bernstein and Vazirani 1993; Cross et al. 2015; Ristè et al. 2017; Grilo et al. 2017; Caro 2020), juntas (Atıcı and Servedio 2007), and Fourier-sparse functions (Arunachalam et al. 2019a). (Arunachalam and de Wolf 2017) and (Arunachalam et al. 2019b) study the limitations of these quantum examples. A broad overview of work on quantum learning of classical functions is given in (Arunachalam and de Wolf 2017). A quantum counterpart can also be considered for the model of learning from membership queries. (Servedio and Gortler 2004) showed that the number of required classical queries is at most polynomially larger than the number of required quantum queries. Recently, this polynomial relation was improved upon in (Arunachalam et al. 2019a). A more specific scenario, namely that of learning multilinear polynomials more efficiently from quantum membership queries, is studied in (Montanaro 2012). Similarly, a quantum counterpart of the classical model of statistical query learning can be defined. This was recently studied in (Arunachalam et al. 2020).
Another line of research at the intersection of learning theory and quantum information focuses on applying classical learning to concept classes arising from quantum theory, e.g., from states or measurements. This was initiated by (Aaronson 2007) and studied further by (Cheng et al. 2016; Aaronson 2018).
Our learning model is similar to the one studied in (Chung and Lin 2018). Also there, the inputs are assumed to be classical and the outputs are quantum states. The crucial difference to our scenario is that we assume that there are only two possible label states and these are known in advance. In (Chung and Lin 2018), there can be a continuum of possible label states. Our additional assumption allows us to study infinite function classes F, whereas the results in (Chung and Lin 2018) are for classes of finite size. The reasoning of (Chung and Lin 2018) can, however, be extended to infinite classes using the so-called "growth function." As a further difference between the approaches, whereas the strategy of (Chung and Lin 2018) requires random orthonormal measurements, the measurements in our procedures can be taken to be fixed.
The classical problems to which our quantum learning problems are reduced are problems of learning from noisy training data. These were first proposed by (Angluin and Laird 1988; Laird 1988) and studied further, e.g., by (Aslam and Decatur 1996; Cesa-Bianchi et al. 1999) and (Natarajan et al. 2013).
1.4. Structure of the Paper. In Section 2 we recall some notions from learning theory as well as from quantum information and computation. The central learning problem of this contribution is formulated in Section 3. Section 4 exhibits strategies for solving the task and establishes sample complexity upper bounds. In doing so, we derive a tight upper bound on the sample complexity of classical binary classification with two-sided classification noise (see Appendix D). The quantum sample complexity upper bounds are complemented by lower bounds in Section 5. We conclude with open questions and the references.

Basics of Quantum Information and Computation.
A finite-dimensional quantum system is described by a (mixed) state and mathematically represented by a density matrix of some dimension $d \in \mathbb{N}$, i.e., an element of $\mathcal{S}(\mathbb{C}^d) := \{\rho \in \mathbb{C}^{d \times d} \mid \rho \geq 0, \operatorname{tr}[\rho] = 1\}$. Here, $\rho \geq 0$ means that $\rho$ is a self-adjoint and positive semidefinite matrix. The extreme points of the convex set $\mathcal{S}(\mathbb{C}^d)$ are the rank-1 projections, the pure states. We employ Dirac notation to denote a unit vector $\psi \in \mathbb{C}^d$ also by $|\psi\rangle \in \mathbb{C}^d$ and the corresponding pure state by $|\psi\rangle\langle\psi|$.
To make an observation about a quantum system, a measurement has to be performed. Measurements are built from the set of effect operators $\mathcal{E}(\mathbb{C}^d) := \{E \in \mathbb{C}^{d \times d} \mid 0 \leq E \leq \mathbb{1}_d\}$. For our purposes it suffices to consider a measurement as a collection $\{E_i\}_{i=1}^{\ell} \subset \mathcal{E}(\mathbb{C}^d)$ of effect operators with $\sum_{i=1}^{\ell} E_i = \mathbb{1}_d$. (For the more general notion of a POVM see (Nielsen and Chuang 2009) or (Heinosaari and Ziman 2012).) When performing a measurement $\{E_i\}_{i=1}^{\ell}$ on a state $\rho$, output $i$ is observed with probability $\operatorname{tr}[E_i \rho]$. A projective measurement is one where the effect operators are rank-1 projections, i.e., there exists an orthonormal basis $\{|i\rangle\}_{i=1}^{d}$ of $\mathbb{C}^d$ s.t. $E_i = |i\rangle\langle i|$ for all $i$.
When multiple quantum systems with spaces $\mathbb{C}^{d_i}$ are considered, the composite system is described by the tensor product $\bigotimes_{i=1}^{n} \mathbb{C}^{d_i} \simeq \mathbb{C}^{\prod_i d_i}$ and the set of states becomes $\mathcal{S}\left(\bigotimes_{i=1}^{n} \mathbb{C}^{d_i}\right)$. Given a state $\rho_{AB} \in \mathcal{S}(\mathbb{C}^{d_A} \otimes \mathbb{C}^{d_B})$ of a composite system, we can obtain states of the subsystems as partial traces $\rho_A = \operatorname{tr}_B[\rho_{AB}]$, $\rho_B = \operatorname{tr}_A[\rho_{AB}]$. Here, the partial trace is defined as satisfying the relation $\operatorname{tr}[\rho_{AB}(E_A \otimes \mathbb{1}_{d_B})] = \operatorname{tr}[\rho_A E_A]$ for all $E_A \in \mathcal{E}(\mathbb{C}^{d_A})$, and analogously for $\rho_B$.
The dynamics of a quantum system are usually described by unitary evolution or, more generally, by quantum channels. For our purposes, these dynamics will not have to be discussed explicitly since they can be considered as part of the performed measurement by changing to the so-called Heisenberg picture (see Nielsen and Chuang 2009). We will take this perspective in proving our sample complexity lower bounds because it allows us to restrict our attention to proving limitations of measurements rather than of channels.
We will also make use of some standard entropic quantities which have been generalized from their classical origins (Shannon 1948) to the realm of quantum theory. We denote the Shannon entropy of a random variable $X$ with probability mass function $p$ by $H(X) = -\sum_{x} p(x) \log p(x)$, the conditional entropy of a random variable $Y$ given $X$ as $H(Y|X) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$, and the mutual information between $X$ and $Y$ as $I(X : Y) = H(X) + H(Y) - H(X, Y)$. Similarly, the von Neumann entropy of a quantum state $\rho$ will be denoted as $S(\rho) = -\operatorname{tr}[\rho \log \rho]$ and the mutual information for a bipartite quantum state $\rho_{AB}$ as $I(\rho_{AB}) = I(A : B) = S(\rho_A) + S(\rho_B) - S(\rho_{AB})$. All the standard results and inequalities connected to these quantities which appear in our arguments can be found in (Nielsen and Chuang 2009).
We work in the PAC learning framework (Valiant 1984). The general setting is as follows: Let $\mathcal{X}, \mathcal{Y}$ be input and output space, respectively, let $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$ be a class of functions, a concept class, and let $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. A learning algorithm (to which $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{F}$ and $\ell$ are known) has access to training data of the form $S = \{(x_i, y_i)\}_{i=1}^{m}$, drawn i.i.d. from a probability measure $\mu \in \operatorname{Prob}(\mathcal{X} \times \mathcal{Y})$. Moreover, the learner is given as input a confidence parameter $\delta \in (0, 1)$ and an accuracy parameter $\varepsilon \in (0, 1)$. Then a learner must output a hypothesis $h \in \mathcal{Y}^{\mathcal{X}}$ s.t., with probability $\geq 1 - \delta$ w.r.t. the choice of training data,
$$\int_{\mathcal{X} \times \mathcal{Y}} \ell(y, h(x)) \, \mathrm{d}\mu(x, y) \leq \inf_{f \in \mathcal{F}} \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, \mathrm{d}\mu(x, y) + \varepsilon. \tag{2.1}$$
Note that the first term on the right-hand side vanishes if there exists an $f^* \in \mathcal{F}$ s.t. $\mu(x, y) = \mu_1(x)\delta_{y, f^*(x)}$ $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$. In this case, we call the learning problem realizable, otherwise we refer to it as agnostic.
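The classical entropic quantities above can be illustrated with a minimal sketch. It assumes base-2 logarithms (the text does not fix the base) and represents a joint distribution as a dictionary mapping pairs `(x, y)` to probabilities; the function names are illustrative only.

```python
import math
from collections import defaultdict


def shannon_entropy(pmf):
    """H = -sum_z p(z) log2 p(z); outcomes of probability zero contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginals(joint):
    """Marginal distributions of X and Y for a joint pmf given as {(x, y): p}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)


def conditional_entropy(joint):
    """H(Y|X) = -sum_{x,y} p(x,y) log2( p(x,y) / p(x) )."""
    px, _ = marginals(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)


def mutual_information(joint):
    """I(X:Y) = H(X) + H(Y) - H(X,Y)."""
    px, py = marginals(joint)
    return shannon_entropy(px) + shannon_entropy(py) - shannon_entropy(joint)


# Two perfectly correlated bits and two independent uniform bits.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
```

For the perfectly correlated pair the mutual information equals one bit, while for the independent pair it vanishes and $H(Y|X) = H(Y)$.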
Both in the agnostic and in the realizable scenario, a learning algorithm that always outputs a hypothesis h ∈ F is called a proper learner, and otherwise it is called improper.
A quantity of major interest, which we define now, is the number of examples featuring in such a learning problem. Given a learning algorithm A, the smallest m = m(ε, δ) ∈ N s.t. the learning requirement (2.1) is satisfied with confidence 1−δ and accuracy ε is called the sample complexity of A. The sample complexity of the learning problem is the infimum over the sample complexities of all learning algorithms for the problem. This characterizes, from an information-theoretic perspective, the hardness of a learning problem, but leaves aside questions of computational complexity.
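The binary classification problem now arises as a special case from the above if we specify the output space $\mathcal{Y} = \{0, 1\}$ and take the loss function to be $\ell(y, \tilde{y}) = 1 - \delta_{y, \tilde{y}}$, the 0-1-loss. This setting is well studied and a characterization of its sample complexity is known. At its core is the following combinatorial parameter:
Definition 2.1. (VC-Dimension (Vapnik and Chervonenkis 1971)) Let $\mathcal{F} \subseteq \{0, 1\}^{\mathcal{X}}$. A set $S = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ is shattered by $\mathcal{F}$ if for every $b \in \{0, 1\}^n$ there exists an $f_b \in \mathcal{F}$ with $f_b(x_i) = b_i$ for all $1 \leq i \leq n$. The VC-dimension of $\mathcal{F}$ is $\operatorname{VCdim}(\mathcal{F}) := \sup\{n \in \mathbb{N}_0 \mid \exists S \subseteq \mathcal{X} \text{ with } |S| = n \text{ that is shattered by } \mathcal{F}\}$.
The main insight of VC-theory lies in the fact that learnability of a $\{0, 1\}$-valued concept class is equivalent to finiteness of its VC-dimension and that the sample complexity can be expressed in terms of the VC-dimension. This is the content of the following
Theorem 2.2. In the realizable scenario, the sample complexity of binary classification for a function class $\mathcal{F}$ of VC-dimension $d$ is $m = m(\varepsilon, \delta) = \Theta\left(\frac{1}{\varepsilon}(d + \log 1/\delta)\right)$. In the agnostic scenario, the sample complexity of binary classification for a function class $\mathcal{F}$ of VC-dimension $d$ is $m = m(\varepsilon, \delta) = \Theta\left(\frac{1}{\varepsilon^2}(d + \log 1/\delta)\right)$.
The proof of the sample complexity upper bound in the agnostic case typically goes via a different complexity measure, the Rademacher complexity, which is then related to the VC-dimension. As this will reappear later on in our analysis, we also recall this definition here.
Definition 2.3. (Rademacher Complexity (see Bartlett and Mendelson 2002)) Let $\mathcal{Z}$ be some space, $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{Z}}$, $z \in \mathcal{Z}^n$. The empirical Rademacher complexity of $\mathcal{F}$ w.r.t. $z$ is
$$\hat{\mathcal{R}}(\mathcal{F}) := \mathbb{E}_{\sigma \sim U(\{-1,1\}^n)}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(z_i)\right],$$
where $U(\{-1, 1\}^n)$ denotes the uniform distribution on $\{-1, 1\}^n$. If we consider $n$ i.i.d. random variables $Z_1, \ldots, Z_n$ distributed according to a probability measure $\mu$ on $\mathcal{Z}$ and write $Z = (Z_1, \ldots, Z_n)$, the Rademacher complexities of $\mathcal{F}$ w.r.t. $\mu$ are defined to be $\mathcal{R}_n(\mathcal{F}) := \mathbb{E}_{Z \sim \mu^n}\left[\hat{\mathcal{R}}(\mathcal{F})\right]$, $n \in \mathbb{N}$, where the empirical Rademacher complexity is taken w.r.t. $Z$.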
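To make the notion of shattering behind the VC-dimension concrete, the following is a brute-force sketch for a finite hypothesis class on a finite domain. The example classes (thresholds and intervals on five points) and all function names are illustrative choices, not taken from the paper; the search is exponential and only meant for tiny instances.

```python
from itertools import combinations


def is_shattered(points, hypotheses):
    """True if every 0/1 labelling of `points` is realized by some hypothesis."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)


def vc_dimension(domain, hypotheses):
    """Largest size of a subset of `domain` that is shattered (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(domain, k)):
            d = k
        else:
            break  # subsets of shattered sets are shattered, so larger sets fail too
    return d


domain = list(range(5))
# Threshold functions x -> 1[x >= t] and indicator functions of intervals [a, b].
thresholds = [lambda x, t=t: int(x >= t) for t in range(6)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(5) for b in range(a, 5)]
```

Thresholds cannot realize the labelling (1, 0) on an ordered pair of points, and intervals cannot realize (1, 0, 1) on an ordered triple, which is why the two classes have VC-dimension 1 and 2, respectively.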
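For small samples, the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below assumes the common form of the definition, $\mathbb{E}_{\sigma}\left[\sup_{f} \frac{1}{n}\sum_i \sigma_i f(z_i)\right]$ without absolute value, and represents each function only through its values on the sample; both the normalization and the function name are assumptions made for illustration.

```python
from itertools import product


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite class on a fixed sample.

    function_values: list of tuples (f(z_1), ..., f(z_n)), one tuple per f.
    Returns the average over all sign vectors sigma in {-1, 1}^n of
    max_f (1/n) * sum_i sigma_i * f(z_i).
    """
    n = len(function_values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        best = max(sum(s * v for s, v in zip(sigma, vals))
                   for vals in function_values)
        total += best / n
    return total / 2 ** n


# All {0,1}-valued labellings of a sample of size 2 (a shattered sample).
all_labellings = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

A class consisting of a single function has empirical Rademacher complexity zero, whereas a class that realizes every labelling of the sample attains the largest possible value for {0,1}-valued functions, here 1/2.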

Binary Classification with Classical Instances and Quantum Labels
We introduce a generalization of the classical binary classification problem to the quantum realm by allowing the two labels to be quantum states. Thus let $\sigma_0, \sigma_1 \in \mathcal{S}(\mathbb{C}^n)$ be two (possibly mixed) quantum states and write $\mathcal{D} = \{\sigma_0, \sigma_1\}$. We assume that classical descriptions of these states (i.e., their density matrices) are known to the learning algorithm as well as the fact that only these two quantum labels appear. The class to be learned is now a class of functions $\mathcal{F} \subset \{f : \mathcal{X} \to \mathcal{D}\}$ and the underlying distribution will be a $\mu \in \operatorname{Prob}(\mathcal{X} \times \mathcal{D})$, where $\mathcal{X}$ is some space of classical objects.
We now deviate from the standard PAC setting: We assume the training data to be $S = \{(x_i, \rho_i)\}_{i=1}^{m}$, where the $(x_i, \rho_i)$ are drawn independently according to $\mu$ (in particular, $\rho_i \in \mathcal{D}$ for all $i$) and where the $\rho_i$ are the actual quantum states, not classical descriptions of them. Equivalently, we represent an example $(x_i, \rho_i)$ drawn from $\mu$ as a classical-quantum state. Note that this model for the training data differs from the one introduced by (Bshouty and Jackson 1998), where the quantum training data consists of copies of a superposition state. Instead, here we assume copies of a mixture of states. This is done mainly for two reasons: First, it allows us to naturally talk about maps with mixed state outputs. Second, whereas it is debatable whether assuming access to superposition examples as in (Bshouty and Jackson 1998) is justified (see, e.g., Ciliberto et al. 2018, section 5), and this problem remains when considering maps with quantum outputs, the mixtures assumed in our model arise naturally as statistical ensembles of outputs of state preparation procedures, if the parameters of the preparation are chosen according to some (unknown) distribution. In that sense, the form of classical-quantum training data assumed here is both a straightforward generalization of classical training data, given the standard probabilistic interpretation of mixed states, and can (at least in the realizable scenario) be easily imagined to be obtained as outcome of multiple runs of a state preparation experiment with different parameter settings.
A quantum learner for $\mathcal{F}$ with confidence $1 - \delta$ and accuracy $\varepsilon$ from $m = m(\varepsilon, \delta)$ quantum examples has to output, for every $\mu \in \operatorname{Prob}(\mathcal{X} \times \mathcal{D})$, with probability $\geq 1 - \delta$ over the choice of training data of size $m$ according to $\mu$, a hypothesis $h \in \mathcal{D}^{\mathcal{X}}$ s.t. $R_\mu(h) \leq \inf_{f \in \mathcal{F}} R_\mu(f) + \varepsilon$.
As before, we can consider agnostic versus realizable and proper versus improper variants of this learning model.
Here, we define the risk $R_\mu(h)$ of a hypothesis $h \in \mathcal{F}$ w.r.t. a distribution $\mu \in \operatorname{Prob}(\mathcal{X} \times \mathcal{D})$ as the expected trace distance between the predicted label state $h(x)$ and the true label state $\rho$ for $(x, \rho)$ drawn according to $\mu$. Note that our assumption on $\mathcal{F}$ implies that $h(x) \in \mathcal{D}$ $\forall x \in \mathcal{X}$ and therefore we can easily rewrite $R_\mu(h)$ as the trace distance between $\sigma_0$ and $\sigma_1$ times the probability that $h(x) \neq \rho$, which is just the 0-1-risk multiplied by a constant. We choose the slightly more complicated looking definition for $R_\mu(h)$ for two reasons. On the one hand, the trace distance between $\sigma_0$ and $\sigma_1$ is a measure for the distinguishability of $\sigma_0$ and $\sigma_1$ and thus a natural scale w.r.t. which to measure the prediction error. (Note: If $\sigma_0, \sigma_1$ are orthogonal pure states and thus perfectly distinguishable, the classical scenario is recovered.) On the other hand, our definition of risk can be motivated operationally as we discuss in Appendix B.
We want to conclude this section by discussing a drawback of our model. We assume $\mathcal{F} \subset \mathcal{D}^{\mathcal{X}}$, i.e., outputs of any $f \in \mathcal{F}$ are either $\sigma_0$ or $\sigma_1$. Considering the convex structure of the set of quantum states, which is intimately tied to the probabilistic interpretation of quantum theory, this restriction can be considered unnatural. We nevertheless make it, for two reasons: First, it is easy to show using a Bayesian predictor that, under the assumption of $\mu$ being supported on $\mathcal{D}$ (which could, of course, also be contested), the optimal choice of predictors among all functions in $\mathcal{S}(\mathbb{C}^n)^{\mathcal{X}}$ is actually a function in $\mathcal{D}^{\mathcal{X}}$. Second, it is the most direct analog of the classical scenario with binary labels and we consider it a sensible first step.

Sample Complexity Upper Bounds
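4.1. The Agnostic Case. Our learning strategy is motivated by interpreting the classical training data arising from performing a measurement on the label states as a noisy version of the true training data. Before describing the learning strategy, we recall our assumption that classical descriptions of the label states $\sigma_0, \sigma_1$ are known to the learner. Based on this knowledge, the learner can derive the optimal measurement $\{E_0, E_1\}$ for minimum-error distinction between the two states, the so-called Holevo-Helstrom measurement (see Watrous 2018, Theorem 3.4), by choosing $E_0$ to be the orthogonal projector onto the eigenspaces of $\sigma_0 - \sigma_1$ corresponding to nonnegative eigenvalues. This step is where knowledge of the states $\sigma_0$ and $\sigma_1$ is used.
The learning strategy is now the following, in which we use the Holevo-Helstrom measurement to produce classical training data and thus obtain a classical learning problem:
Noise-corrected Holevo-Helstrom strategy
Input: Training data $S = \{(x_i, \rho_i)\}_{i=1}^{m}$
Output: Hypothesis $\hat{h} : \mathcal{X} \to \mathcal{D}$
Algorithm:
(1) Perform a Holevo-Helstrom measurement on $\rho_i$ for each $i$. Let $y_i \in \{0, 1\}$ denote the observed outcome. Then one can view $(x_i, y_i)$ as being drawn independently according to the probability measure $\nu$ on $\mathcal{X} \times \{0, 1\}$ which has $\mu_1$ as first marginal and $\nu(y|x) = \sum_{j \in \{0,1\}} \mu(\sigma_j | x) \operatorname{tr}[E_y \sigma_j]$ as the conditional probability distribution of $y$ given $x$.
(2) Define the modified loss function $\tilde{\ell}$ from the 0-1-loss and the error probabilities $\eta_0, \eta_1$ of the measurement (as suggested in (Natarajan et al. 2013)).
(3) Perform classical empirical risk minimization w.r.t. $\tilde{\ell}$ on the classical data $\{(x_i, y_i)\}_{i=1}^{m}$ and output the corresponding hypothesis $\hat{h}$.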
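The construction of the Holevo-Helstrom measurement can be sketched for the simplest case. The code below is restricted, as an assumption made to keep it dependency-free, to real symmetric $2 \times 2$ density matrices given as nested lists; the names `helstrom_projector` and `error_rates` are illustrative. The error rates are taken to be $\eta_0 = \operatorname{tr}[E_1\sigma_0]$ and $\eta_1 = \operatorname{tr}[E_0\sigma_1]$, i.e., the probabilities of observing the wrong outcome.

```python
import math


def helstrom_projector(sigma0, sigma1):
    """E_0: projector onto the eigenspaces of sigma0 - sigma1 with eigenvalue >= 0.

    Only handles real symmetric 2x2 matrices given as nested lists.
    """
    a = sigma0[0][0] - sigma1[0][0]
    b = sigma0[0][1] - sigma1[0][1]
    c = sigma0[1][1] - sigma1[1][1]
    # Eigenvalues of [[a, b], [b, c]] are mean +/- radius.
    mean, radius = (a + c) / 2, math.hypot((a - c) / 2, b)
    lam_plus, lam_minus = mean + radius, mean - radius
    if lam_minus >= 0:   # all eigenvalues nonnegative (e.g. identical states)
        return [[1.0, 0.0], [0.0, 1.0]]
    if lam_plus < 0:     # cannot occur for two density matrices; kept for safety
        return [[0.0, 0.0], [0.0, 0.0]]
    # Spectral projector onto the eigenvector with eigenvalue lam_plus.
    return [[(a - lam_minus) / (2 * radius), b / (2 * radius)],
            [b / (2 * radius), (c - lam_minus) / (2 * radius)]]


def trace_product(m1, m2):
    """tr[m1 m2] for 2x2 matrices."""
    return sum(m1[i][j] * m2[j][i] for i in range(2) for j in range(2))


def error_rates(sigma0, sigma1):
    """(eta_0, eta_1) = (tr[E_1 sigma_0], tr[E_0 sigma_1]) with E_1 = 1 - E_0."""
    e0 = helstrom_projector(sigma0, sigma1)
    e1 = [[(1.0 if i == j else 0.0) - e0[i][j] for j in range(2)] for i in range(2)]
    return trace_product(e1, sigma0), trace_product(e0, sigma1)


# Example: |0><0| versus |+><+|, two non-orthogonal pure qubit states.
sigma_zero = [[1.0, 0.0], [0.0, 0.0]]
sigma_plus = [[0.5, 0.5], [0.5, 0.5]]
```

In this example $1 - \eta_0 - \eta_1$ coincides with $\frac{1}{2}\lVert\sigma_0 - \sigma_1\rVert_1 = \sqrt{1/2}$, which follows from $\operatorname{tr}[E_0(\sigma_0 - \sigma_1)]$ being the sum of the nonnegative eigenvalues of $\sigma_0 - \sigma_1$.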
Note that the only non-classical step in the strategy is step (1), which consists only of performing local two-outcome measurements.
The modification of the loss function in step (3) gives an unbiased estimate of the true risk:
Lemma 4.1. (see Natarajan et al. 2013, Lemma 1) Fix $x \in \mathcal{X}$. With the notation introduced above, for every $z \in \{0, 1\}$ it holds that the conditional expectation, given $x$, of the modified loss of $z$ on the noisy label equals that of the original loss of $z$ on the noise-free label.
We can use a standard generalization bound in terms of Rademacher complexities (see, e.g., Theorem 26.5 of (Shalev-Shwartz and Ben-David 2014) or Theorem 1.15 in (Wolf 2020)) to obtain a bound on the risk w.r.t. the modified loss $\tilde{\ell}$ that holds with probability $\geq 1 - \delta$ over the choice of training data, where we used that $|\tilde{\ell}(y_1, y_2)| \leq \frac{1}{1 - \eta_0 - \eta_1}$ and defined the function class $\tilde{\mathcal{G}}$. Next, we relate the Rademacher complexity of $\tilde{\mathcal{G}}$ to that of $\tilde{\mathcal{F}}$.
Lemma 4.2. For any training data set $S = \{(x_i, y_i)\}_{i=1}^{m}$, viewed as an element of $(\mathcal{X} \times \{0, 1\})^m$, the empirical Rademacher complexity of $\tilde{\mathcal{G}}$ w.r.t. $S$ is bounded in terms of that of $\tilde{\mathcal{F}}$.
Proof: (Sketch) The proof uses some standard steps that are typically used, for example, in proving the Lipschitz contraction property of the Rademacher complexity and in studying the Rademacher complexity in a binary classification scenario. See Appendix A for a detailed proof.
With this, we now reformulate the above result in terms of the VC-dimension. Suppose $\operatorname{VCdim}(\tilde{\mathcal{F}}) = d$. Then the Rademacher complexity of $\tilde{\mathcal{F}}$ is of order $\sqrt{d/m}$ (see Vershynin 2018, Theorem 8.3.23). Therefore we obtain a corresponding bound in terms of $d$ that holds with probability $\geq 1 - \delta$ over the choice of training data $S$. Using Lemma 4.1, we can now bound the excess risk of the output hypothesis. Setting this bound equal to $\varepsilon$ and rearranging, we conclude that a sample size of $O\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$, with a prefactor depending on the noise rates $\eta_0, \eta_1$, suffices.
Remark 4.3. The naive version of our learning strategy would be to perform Holevo-Helstrom measurements and then apply a classical learning strategy without correcting for the noise in the resulting classical labels. Actually, this learning strategy already performs reasonably well and, in certain special cases, even allows one to reduce the quantum learning problem to a fully classical one. For a detailed analysis of the performance of this simpler strategy, the reader is referred to Appendix C.
4.2. The Realizable Case. The strategy from the previous subsection uses a generalization bound via the Rademacher complexity and yields a sample complexity bound depending quadratically on $1/\varepsilon$. In the classical binary classification problem it is known (see Theorem 2.2) that under the realizability assumption this can be improved to $1/\varepsilon$, but this typically requires a different kind of reasoning via $\varepsilon$-nets. (Compare section 28.3 of (Shalev-Shwartz and Ben-David 2014).) In Theorem D.3 we show how the reasoning by (Hanneke 2016) can be combined with results by (Laird 1988) to achieve the $1/\varepsilon$-scaling also in the case of two-sided classification noise. This sample complexity upper bound is seen to be optimal in its dependence on the VC-dimension $d$, the error rate bound $\eta$, the confidence $\delta$ and the accuracy $\varepsilon$ by a comparison to the lower bound in Theorem 27 of (Arunachalam and de Wolf 2018).
If, as in the previous subsection, we consider the classical training data obtained by measuring the quantum training data as a noisy version of a true sample, we can exchange step (3) in the Holevo-Helstrom strategy by the minimum-disagreement-based classical learning strategy achieving the optimal sample complexity bound of Theorem D.3. This directly yields the following
Theorem 4.4. Let $d$ be the VC-dimension of $\tilde{\mathcal{F}}$. Then $m = O\left(\frac{1}{\varepsilon(1 - 2\max\{\operatorname{tr}[E_0\sigma_1], \operatorname{tr}[E_1\sigma_0]\})^2}(d + \log 1/\delta)\right)$ quantum examples of a function in $\mathcal{F}$ are sufficient for binary classification with classical instances and quantum labels $\sigma_0, \sigma_1$ with accuracy $\varepsilon$ and confidence $1 - \delta$.
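The unbiasedness property behind Lemma 4.1 can be checked numerically for the 0-1-loss. The sketch below uses the modified loss in the form given by Natarajan et al. (2013), rewritten for labels in $\{0, 1\}$; it assumes that $\eta_y$ denotes the probability that a true label $y$ is flipped and that $\eta_0 + \eta_1 < 1$. The function names are illustrative.

```python
def zero_one_loss(prediction, label):
    """0-1-loss: 0 if the prediction matches the label, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def corrected_loss(prediction, noisy_label, eta0, eta1, loss=zero_one_loss):
    """Noise-corrected loss in the style of Natarajan et al. (2013).

    eta_y is the probability that a true label y is observed as 1 - y.
    Requires eta0 + eta1 < 1.
    """
    eta = (eta0, eta1)
    y = noisy_label
    numerator = ((1 - eta[1 - y]) * loss(prediction, y)
                 - eta[y] * loss(prediction, 1 - y))
    return numerator / (1 - eta0 - eta1)


def expected_corrected_loss(prediction, true_label, eta0, eta1):
    """Expectation of the corrected loss over the random label flip."""
    flip = (eta0, eta1)[true_label]
    return ((1 - flip) * corrected_loss(prediction, true_label, eta0, eta1)
            + flip * corrected_loss(prediction, 1 - true_label, eta0, eta1))
```

Averaging the corrected loss over the label noise recovers the original 0-1-loss on the true label, and each value of the corrected loss is bounded in absolute value by $\frac{1}{1 - \eta_0 - \eta_1}$, which is the bound entering the generalization argument.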
Remark 4.5. From the description of our noise-corrected Holevo-Helstrom strategy (either in the form of subsection 4.1 or that of this subsection), we can directly see that whether it is a proper or an improper learner depends on whether the classical learning algorithm in step (3) is. As the classical learning algorithm used in subsection 4.1 is a simple Empirical Risk Minimization, it is in particular proper. So our noise-corrected Holevo-Helstrom strategy for the agnostic case is proper as well. The classical learner used in this subsection, however, is in general improper. So also the noise-corrected Holevo-Helstrom strategy for the realizable case is in general improper.

Sample Complexity Lower Bounds
Whereas the goal of the previous section was to give strategies for solving the binary classification problem with classical instances and quantum labels and to prove upper bounds on the sufficient number of classical-quantum examples, we now turn to the complementary question of lower bounds on the number of required examples. In this section, we derive lower bounds that match the respective upper bounds from the previous section and therefore we conclude that the procedures described in section 4 are optimal w.r.t. sample size in terms of the dependence on ε, δ, and d.
5.1. The Agnostic Case. We prove the sample complexity lower bounds in two parts, the first depending on the confidence parameter δ but not on the VC-dimension of the function class and conversely for the second.
We establish the VC-dimension-independent sample complexity lower bound in the following Lemma 5.1. Let σ 0 , σ 1 ∈ S(C n ), let ε ∈ (0, ), δ ∈ (0, 1). Let F ⊂ D X be a non-trivial concept class. Suppose A is a learning algorithm which solves the binary classification task with classical instances and (distinct) label states σ 0 , σ 1 and concept class F with confidence 1 − δ and accuracy ε using m = m(ε, δ) examples. Then m ≥ Ω σ 0 − σ 1 2 1 Proof: (Sketch) As F is non-trivial, there exist concepts f, g ∈ F and a point x ∈ X s.t. f (x) = σ 0 and g(x) = σ 1 . Let λ = ε 2 σ 0 −σ 1 1 ∈ (0, 1). Define probability distributions µ ± on X × D via By explicitly evaluating the risk R ± (h), we see that achieving an excess risk ≤ ε with probability ≥ 1 − δ, requires the learner to distinguish between the underlying distributions µ ± , and thus the corresponding training data states ρ ⊗m ± , with probability ≥ 1 − δ. It is well known that the optimal success probability of this quantum distinguishing task is given by p opt = 1 2 (1 + 1 2 ρ ⊗m + − ρ ⊗m − 1 ). Via the Fuchs-van de Graaf inequalities, which state that this can be upper-bounded using lower bounds on the fidelity F (ρ ⊗m + , ρ ⊗m − ) = F (ρ + , ρ − ) m . The fidelity F (ρ + , ρ − ) can be lower-bounded using its strong concavity and the explicit expressions for ρ ± . The result then follows by comparing the obtained upper bound with the required lower bound p opt ≥ 1 − δ. See Appendix A for a detailed proof.
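The last step of the proof sketch of Lemma 5.1 can be illustrated numerically. Assuming the fidelity convention $F(\rho, \sigma) = \lVert\sqrt{\rho}\sqrt{\sigma}\rVert_1$, which is multiplicative under tensor products, the Fuchs-van de Graaf inequalities give $p_{\mathrm{opt}} \leq \frac{1}{2}\left(1 + \sqrt{1 - F^{2m}}\right)$ for $m$ copies. The sketch below searches for the smallest $m$ at which this upper bound reaches $1 - \delta$; fewer copies cannot suffice. The function names and the restriction to $\delta < 1/2$ are choices made for this illustration.

```python
import math


def success_probability_upper_bound(fidelity, m):
    """Fuchs-van de Graaf upper bound 1/2 * (1 + sqrt(1 - F^(2m))) on p_opt."""
    return 0.5 * (1.0 + math.sqrt(1.0 - fidelity ** (2 * m)))


def min_copies(fidelity, delta):
    """Smallest m whose upper bound reaches 1 - delta (necessary, not sufficient).

    Equivalent closed form: m >= ln(1 / (4 delta (1 - delta))) / (2 ln(1 / F)).
    """
    if not (0.0 < fidelity < 1.0 and 0.0 < delta < 0.5):
        raise ValueError("need 0 < fidelity < 1 and 0 < delta < 1/2")
    m = 1
    while success_probability_upper_bound(fidelity, m) < 1.0 - delta:
        m += 1
    return m
```

The closed form in the docstring follows by solving $\sqrt{1 - F^{2m}} \geq 1 - 2\delta$ for $m$. Since hard-to-distinguish distributions lead to single-copy fidelities close to 1, the denominator $\ln(1/F)$ is small, which is the mechanism by which the required number of copies grows.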
For the proof of the VC-dimension-dependent part of the lower bound we need a well known observation about the eigenvalues of a statistical mixture of two pure quantum states, which is the content of the following Lemma 5.2. Let |ψ , |φ ∈ C n be distinct pure quantum states. Let α, β ≥ 0 be real numbers. Then the non-zero eigenvalues of the mixture ρ := α|ψ ψ| + β|φ φ| are given by With this we can now prove a sample complexity lower bound for the case of pure label states. Note that ∀a ∈ {0, 1} d ∃f a ∈F : f a (s i ) = a i by shattering and that f a is a minimum-error concept w.r.t. µ a . By evaluating the excess error of an fã compared to f a , we see that solving the learning problem with confidence 1 − δ requires the learner to output, with probability ≥ 1 − δ, a hypothesis described by a string whose Hamming distance to the true underlying string is ≤ d 4 . We can use this observation to obtain the lower bound I(A : B) ≥ Ω(d) on the mutual information between underlying string A (drawn uniformly at random) and corresponding quantum training data B. We can also upper-bound the mutual information. A standard argument shows I(A : B) ≤ mI(A : B 1 ), where m is the number of copies of the quantum example state and B 1 describes a single quantum example state. Using Lemma 5.2 and the explicit expression for a quantum example state, we can compute I(A : B 1 ) and use Taylor expansion to see that I(A : B 1 ) ≤ O(ε 2 ). Comparing the lower and upper bounds on I(A : B) now gives m ≥ Ω d ε 2 . See Appendix A for a detailed proof.
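The eigenvalue statement of Lemma 5.2 can be sanity-checked numerically. Since $\rho = \alpha|\psi\rangle\langle\psi| + \beta|\varphi\rangle\langle\varphi|$ has rank at most 2, its non-zero eigenvalues are determined by $\operatorname{tr}[\rho]$ and $\operatorname{tr}[\rho^2]$. The sketch below assumes the standard closed form $\lambda_\pm = \frac{\alpha+\beta}{2} \pm \frac{1}{2}\sqrt{(\alpha-\beta)^2 + 4\alpha\beta|\langle\psi|\varphi\rangle|^2}$ and compares both traces; vectors are plain Python sequences and all names are illustrative.

```python
import math


def mixture_eigenvalues(alpha, beta, overlap_sq):
    """Closed-form non-zero eigenvalues; overlap_sq = |<psi|phi>|^2."""
    root = math.sqrt((alpha - beta) ** 2 + 4 * alpha * beta * overlap_sq)
    return (alpha + beta + root) / 2, (alpha + beta - root) / 2


def overlap_squared(psi, phi):
    """|<psi|phi>|^2 for vectors given as sequences of (complex) numbers."""
    return abs(sum(a.conjugate() * b for a, b in zip(psi, phi))) ** 2


def mixture_matrix(alpha, beta, psi, phi):
    """Matrix of alpha |psi><psi| + beta |phi><phi|."""
    n = len(psi)
    return [[alpha * psi[i] * psi[j].conjugate() + beta * phi[i] * phi[j].conjugate()
             for j in range(n)] for i in range(n)]


def trace(matrix):
    return sum(matrix[i][i] for i in range(len(matrix)))


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Example in C^3 with a complex relative phase.
psi = (1.0, 0.0, 0.0)
phi = (1 / math.sqrt(2), 1j / math.sqrt(2), 0.0)
rho = mixture_matrix(0.3, 0.7, psi, phi)
lam_plus, lam_minus = mixture_eigenvalues(0.3, 0.7, overlap_squared(psi, phi))
```

For this example, $\lambda_+ + \lambda_-$ matches $\operatorname{tr}[\rho] = 1$ and $\lambda_+^2 + \lambda_-^2$ matches $\operatorname{tr}[\rho^2] = \alpha^2 + \beta^2 + 2\alpha\beta|\langle\psi|\varphi\rangle|^2$.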
Therefore we have shown that the strategy from subsection 4.1 is, for pure states, optimal in sample complexity w.r.t. its dependence on the VC-dimension, the accuracy and the confidence. We do not, however, make a statement on optimality w.r.t. the dependence on the distinguishability of the label states, because the parameter $\|\sigma_0 - \sigma_1\|_1$ is lacking from our lower bound.
5.2. The Realizable Case. We now show analogous lower bounds for the sample complexity in the realizable scenario with the same proof strategy.
Proof: This can be proved similarly to Lemma 5.1. See Appendix A for a detailed proof. We now provide the analog of Theorem 5.3 for the realizable case.
Proof: This can be proved similarly to Theorem 5.3. See Appendix A for a detailed proof.
Again, we have obtained a sample complexity lower bound that matches the upper bound proved in subsection 4.2 in the dependence on the VC-dimension, the confidence and the accuracy, but we do not make a statement about optimality w.r.t. the dependence on $\|\sigma_0 - \sigma_1\|_1$.
Remark 5.7. As already discussed in subsection 2.1, in proving the sample complexity lower bounds we resort to the Heisenberg picture, which allows us to absorb the intermediate quantum channels performed by a learner into the measurement. These lower bounds therefore hold even for quantum learning algorithms that perform coherent and adaptive measurements on the training data. In particular, the information-theoretic complexity of our learning problem does not change if we restrict the quantum learner to only performing two-outcome POVMs locally (i.e., on one subsystem only). This is perhaps not too surprising, since the optimal measurement for distinguishing states drawn uniformly at random from $\{\bigotimes_{i=1}^m \sigma_{x_i}\}_{x \in \{0,1\}^m}$ can be seen to be exactly given by local Holevo-Helstrom measurements using the Holevo-Yuen-Kennedy-Lax optimality criterion (Holevo 1973; Yuen et al. 1975).

Conclusion and Outlook
We have proposed a novel way of modifying the classical binary classification problem to obtain a quantum counterpart. The conceptual difference to the framework of quantum PAC learning as discussed in (Arunachalam and de Wolf 2017) is that we work with maps whose outputs are themselves quantum states, not classical labels. This naturally gives rise to training data given by quantum states, which is one aspect in which our setting differs from (Aaronson 2007).
Using results from classical learning theory on dealing with classification noise in the training data, we exhibited learning strategies (based on the Holevo-Helstrom measurement) for binary classification with classical instances and quantum labels. The learning strategies consist of two main steps: First, classical information is extracted from the training data by performing a (localized) measurement. Second, classical learning strategies are applied. We complemented these procedures by sample complexity lower bounds thereby establishing the information-theoretic optimality of these strategies for pure label states w.r.t. the dependence on VC-dimension, confidence and accuracy.
We leave the following questions open for further research:
• Can we derive sample complexity lower bounds which explicitly incorporate factors related to the hardness of distinguishing $\sigma_0$ and $\sigma_1$, e.g., in terms of $\|\sigma_0 - \sigma_1\|_1$ or $\max\{\mathrm{tr}[E_0\sigma_1], \mathrm{tr}[E_1\sigma_0]\}$? Could this be related to another complexity measure from classical learning theory, the "fat-shattering dimension" of the class?
• Our analysis is focused on the information-theoretic part of the learning problem, i.e., the sample complexity. Can we improve the computational complexity?
• We considered the case of classical instances. Can this be extended to a scenario of quantum instances with classical (or even quantum) labels?
• Our strategy uses the Holevo-Helstrom measurement, which can be understood as inducing the minimum amount of noise. However, in classical learning theory it is well known that adding noise to the training data can be helpful in preventing overfitting. In this spirit, can we justify measurements other than the Holevo-Helstrom measurement?
• We implicitly assumed throughout our analysis that the learning algorithm has to output a hypothesis that maps into $\{\sigma_0, \sigma_1\}$. What if we allow for hypotheses that map into $\mathrm{conv}(\{\sigma_0, \sigma_1\})$ or $S(\mathbb{C}^d)$?
• Finally, we assume throughout that the label states $\sigma_0, \sigma_1$ are known in advance. Can this assumption be removed? Here, it might be helpful that Theorem D.3 does not need explicit knowledge of the error rates $\eta_0, \eta_1$, but merely of an upper bound $\eta_b$ on them.
, then we can rewritê Next, we use that E σ [σ i ] = 0 and that σ i and (1 − 2y i )σ i have the same distribution for all i. With this we obtain from the abovê where the last step used that the expression is invariant w.r.t. interchangingf andf ′ , so we can drop the absolute value. Now we can iterate this reasoning for i = 2, . . . , m and obtain the desired inequality.
Proof of Lemma 5.1: As F is non-trivial, there exist concepts f, g ∈ F and a point x ∈ X s.t. f (x) = σ 0 and g(x) = σ 1 . Let λ ∈ (0, 1) (to be chosen appropriately later in the proof). Define probability distributions µ ± on X × D via The risk of a hypothesis h ∈ D X w.r.t. these probability measures is given by in particular the optimal achievable risk is 1−λ 4 σ 0 − σ 1 1 . Note that a hypothesis which predicts the suboptimal label state for x has an excess risk of So if we pick λ = ε 2 σ 0 −σ 1 1 < 1, then in order to achieve an excess risk ≤ ε with probability ≥ 1−δ, the learning algorithm has to be able to distinguish between the underlying distributions µ ± with probability ≥ 1 − δ.
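As the algorithm has access to the underlying distribution only via the training data, this means that the algorithm has to be able to distinguish the corresponding training data ensembles with probability $\geq 1 - \delta$. Here, we observe that the training data being drawn i.i.d. according to $\mu_\pm$ is equivalent to the learning algorithm having access to $m$ copies of the corresponding mixed state $\rho_\pm$, because this mixed state simply describes the statistical mixture.

The optimal success probability for distinguishing between two quantum states is a well-studied object in quantum information theory. It can be characterized by the trace distance between the two states and is given (in our case) by (see, e.g., (Nielsen and Chuang 2009))
$$p_{\mathrm{opt}} = \frac{1}{2}\left(1 + \frac{1}{2}\left\|\rho_+^{\otimes m} - \rho_-^{\otimes m}\right\|_1\right).$$
As the trace distance of tensor products is not that easy to deal with, we will instead work with the fidelity defined as $F(\rho, \sigma) := \mathrm{tr}\left[\sqrt{\rho^{1/2}\sigma\rho^{1/2}}\right]$. According to the Fuchs-van de Graaf inequalities we have
$$p_{\mathrm{opt}} \leq \frac{1}{2}\left(1 + \sqrt{1 - F(\rho_+^{\otimes m}, \rho_-^{\otimes m})^2}\right) = \frac{1}{2}\left(1 + \sqrt{1 - F(\rho_+, \rho_-)^{2m}}\right),$$
where the last step uses multiplicativity of the fidelity under tensor products. Now we require $p_{\mathrm{opt}} \geq 1 - \delta$ and rearrange to obtain
$$F(\rho_+, \rho_-)^{2m} \leq 4\delta(1 - \delta),$$
or equivalently, after taking logarithms,
$$m \geq \frac{\log(4\delta(1 - \delta))}{\log(F(\rho_+, \rho_-)^2)}.$$
By strong concavity of the fidelity, $F(\rho_+, \rho_-)$ can be lower-bounded explicitly in terms of $\lambda$. Inserting this bound and Taylor-expanding the logarithm in the denominator yields the claimed lower bound on $m$, as desired.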
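The distinguishing argument in this proof can be sanity-checked numerically. The following is a minimal sketch, not the construction from the proof: the qubit label states, the bias `lam`, and the form of the biased mixtures `rho_plus` and `rho_minus` are illustrative assumptions. It compares the Helstrom success probability with the Fuchs-van de Graaf upper bound and checks multiplicativity of the fidelity under tensor powers.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # remove tiny negative rounding errors
    return (v * np.sqrt(w)) @ v.conj().T

def trace_norm(a):
    """Trace norm of a Hermitian matrix (sum of absolute eigenvalues)."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(a))))

def fidelity(rho, sigma):
    """F(rho, sigma) = tr sqrt(rho^{1/2} sigma rho^{1/2})."""
    r = psd_sqrt(rho)
    return float(np.real(np.trace(psd_sqrt(r @ sigma @ r))))

def tensor_power(rho, m):
    """m-fold tensor power of rho."""
    out = np.eye(1)
    for _ in range(m):
        out = np.kron(out, rho)
    return out

def helstrom_success(rho, sigma):
    """Optimal success probability for equal priors."""
    return 0.5 * (1.0 + 0.5 * trace_norm(rho - sigma))

def min_copies(f, delta):
    """Lower bound m >= log(4 delta (1 - delta)) / log(F^2), for delta < 1/2."""
    return np.log(4 * delta * (1 - delta)) / np.log(f ** 2)

# Assumed example: sigma0 = |0><0|, sigma1 = |+><+|, bias lam (all hypothetical).
sigma0 = np.array([[1.0, 0.0], [0.0, 0.0]])
sigma1 = np.array([[0.5, 0.5], [0.5, 0.5]])
lam = 0.2
rho_plus = (1 + lam) / 2 * sigma0 + (1 - lam) / 2 * sigma1
rho_minus = (1 - lam) / 2 * sigma0 + (1 + lam) / 2 * sigma1

f1 = fidelity(rho_plus, rho_minus)
for m in range(1, 5):
    tp, tm = tensor_power(rho_plus, m), tensor_power(rho_minus, m)
    p_opt = helstrom_success(tp, tm)
    upper = 0.5 * (1.0 + np.sqrt(1.0 - f1 ** (2 * m)))
    print(m, round(p_opt, 4), round(upper, 4))
```

In this toy setting `min_copies(f1, delta)` gives the number of copies below which no measurement can reach success probability $1 - \delta$ according to the bound above.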
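We now solve this quadratic equation and obtain the two eigenvalues $\lambda_{1/2} = \frac{\alpha + \beta}{2} \pm \frac{1}{2}\sqrt{(\alpha - \beta)^2 + 4\alpha\beta|\langle\psi|\phi\rangle|^2}$, where we used that $|\cos(\varphi)| = |\langle\psi|\phi\rangle|$.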
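The eigenvalue statement of Lemma 5.2 is easy to check numerically. The sketch below assumes the standard closed form $\frac{\alpha+\beta}{2} \pm \frac{1}{2}\sqrt{(\alpha-\beta)^2 + 4\alpha\beta|\langle\psi|\phi\rangle|^2}$ for the two non-zero eigenvalues and compares it with a direct diagonalization; the dimension, weights, and random states are arbitrary illustrative choices.

```python
import numpy as np

def mixture_eigenvalues(alpha, beta, overlap):
    """Closed-form non-zero eigenvalues of alpha|psi><psi| + beta|phi><phi|,
    where overlap = |<psi|phi>| (assumed standard formula)."""
    mean = (alpha + beta) / 2
    radius = 0.5 * np.sqrt((alpha - beta) ** 2 + 4 * alpha * beta * overlap ** 2)
    return mean + radius, mean - radius

def random_pure_state(n, rng):
    """Random normalized complex vector in C^n."""
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
psi, phi = random_pure_state(4, rng), random_pure_state(4, rng)
alpha, beta = 0.3, 0.7

rho = alpha * np.outer(psi, psi.conj()) + beta * np.outer(phi, phi.conj())
# eigvalsh returns ascending eigenvalues; the two largest are the non-zero ones.
numeric = np.linalg.eigvalsh(rho)[-2:][::-1]
closed = mixture_eigenvalues(alpha, beta, abs(np.vdot(psi, phi)))
print(numeric, closed)
```

For $\alpha = \beta$ the formula reduces to $\alpha(1 \pm |\langle\psi|\phi\rangle|)$, which is the form used in the proof of Theorem 5.3.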
Detailed Proof of Theorem 5.3: Let $S = \{s_1, \ldots, s_d\} \subset X$ be a set shattered by $\tilde{F}$, and for each $a \in \{0, 1\}^d$ let $\mu_a$ denote the corresponding probability distribution on $X \times D$. Note that $\forall a \in \{0, 1\}^d\ \exists f_a \in \tilde{F}: f_a(s_i) = a_i$ by shattering and that for each $a \in \{0, 1\}^d$, $f_a$ is a minimum-error concept w.r.t. $\mu_a$, and a concept $f_{\tilde{a}}$ has additional error $\frac{4\varepsilon}{d}$ per position in which $\tilde{a}$ differs from $a$, compared to $f_a$. Hence, in order to solve the learning problem with confidence $1 - \delta$ and accuracy $\varepsilon$, the algorithm $\mathcal{A}$ has to output, with probability $\geq 1 - \delta$, a hypothesis (generated from the training data arising from the underlying string) that when evaluated on $S$ yields a vector that is $\frac{d}{4}$-close to the underlying string in Hamming distance.
Let $A$ be a random variable distributed uniformly on $\{0, 1\}^d$ (corresponding to the unknown underlying string $a$). Let $B = B_1 \ldots B_m$ be the training data, with each example generated independently from $\mu_a$ and described by the corresponding quantum example state. In particular, the composite system of underlying string and corresponding training data is described by a classical-quantum state $\sigma_{AB}$.

We follow an information-theoretic proof strategy in three steps. First, by the observation above, the learning requirement implies $I(A : B) \geq \Omega(d)$. Second, we have
$$I(A : B) = S(B) - S(B|A) = S(B) - \sum_{i=1}^m S(B_i|A) \leq \sum_{i=1}^m S(B_i) - \sum_{i=1}^m S(B_i|A) = m\,I(A : B_1).$$
Here, the first step is by definition, the second uses the product structure of the subsystem $B$, the third follows from subadditivity of the entropy and the last is again by definition. And finally, we prove an upper bound on $I(A : B_1)$. To this end, we have to study the reduced state $\sigma_{AB_1}$. More precisely, we have $I(A : B_1) = S(\sigma_A) + S(\sigma_{B_1}) - S(\sigma_{AB_1})$ and thus have to study the entropies of $\sigma_{AB_1}$ as well as those of the reduced states $\sigma_A$ and $\sigma_{B_1}$. As $A \sim \mathrm{Uniform}(\{0, 1\}^d)$, we have $S(A) = d$.

Now we consider the reduced state $\sigma_{B_1}$. By Lemma 5.2 we know that $\frac{1}{2d}|\psi_0\rangle\langle\psi_0| + \frac{1}{2d}|\psi_1\rangle\langle\psi_1|$ has non-zero eigenvalues $\mu_{1/2} = \frac{1}{2d}(1 \pm |\langle\psi_0|\psi_1\rangle|)$, and due to the block-diagonal structure of $\sigma_{B_1}$ we conclude that the non-zero eigenvalues of $\sigma_{B_1}$ are also $\mu_{1/2}$, each of multiplicity $d$. In particular, we have
$$S(\sigma_{B_1}) = d \cdot \left(-\mu_1\log(\mu_1) - \mu_2\log(\mu_2)\right) = \log(2d) - \frac{1}{2}\log\left(1 - |\langle\psi_0|\psi_1\rangle|^2\right) - \frac{|\langle\psi_0|\psi_1\rangle|}{2}\log\left(\frac{1 + |\langle\psi_0|\psi_1\rangle|}{1 - |\langle\psi_0|\psi_1\rangle|}\right).$$
each of multiplicity d · 2 d and that therefore If we combine these expressions for the different entropies, we obtain We now use Taylor's theorem to understand the scaling of the different terms with ε. First, we have (by Taylor-expanding Moreover, using the Taylor expansions around x = 0 (with a > 0) and Plugging these approximations back in gives us

Now combining our mutual information lower and upper bounds yields
which after rearranging becomes as desired.
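The entropy of the reduced state $\sigma_{B_1}$ used in this proof can be checked numerically. The sketch below makes several assumptions that are not spelled out here: the instances $s_1, \ldots, s_d$ are encoded as orthonormal basis states with uniform weight $\frac{1}{d}$, the averaged label state is $\frac{1}{2}|\psi_0\rangle\langle\psi_0| + \frac{1}{2}|\psi_1\rangle\langle\psi_1|$, and logarithms are taken to base 2. With $c = |\langle\psi_0|\psi_1\rangle|$, inserting the eigenvalues $\frac{1}{2d}(1 \pm c)$ into the entropy gives $\log(2d) - \frac{1}{2}\log(1 - c^2) - \frac{c}{2}\log\frac{1+c}{1-c}$, which is the closed form compared against below.

```python
import numpy as np

def von_neumann_entropy(rho):
    """Entropy in bits, ignoring numerically zero eigenvalues."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]
    return float(-np.sum(w * np.log2(w)))

def closed_form_entropy(d, c):
    """log(2d) - 1/2 log(1 - c^2) - c/2 log((1 + c)/(1 - c)), base 2."""
    return float(np.log2(2 * d) - 0.5 * np.log2(1 - c ** 2)
                 - 0.5 * c * np.log2((1 + c) / (1 - c)))

# Hypothetical parameters: d instances, two real qubit label states at angle theta.
d, theta = 3, 0.4
psi0 = np.array([1.0, 0.0])
psi1 = np.array([np.cos(theta), np.sin(theta)])
c = abs(psi0 @ psi1)

label_average = 0.5 * np.outer(psi0, psi0) + 0.5 * np.outer(psi1, psi1)
# Block-diagonal state: uniform classical instance register times averaged label state.
sigma_b1 = np.kron(np.eye(d) / d, label_average)

print(von_neumann_entropy(sigma_b1), closed_form_entropy(d, c))
```

As a plausibility check, for $c \to 1$ the closed form tends to $\log d$ and for $c = 0$ it equals $\log(2d)$, as expected for $d$ and $2d$ equally weighted orthogonal states, respectively.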
Detailed Proof of Lemma 5.5: As $F$ is non-trivial, there exist $f_1, f_2 \in F$ and $x_1, x_2 \in X$ s.t. $f_1(x_1) = f_2(x_1) = \sigma_0$ and $f_1(x_2) = \sigma_0 \neq \sigma_1 = f_2(x_2)$. Now consider a distribution $\mu$ on $X$ depending on a parameter $\lambda \in (0, 1)$ that is to be chosen later in the proof. Evaluating the risk of a hypothesis $h \in D^X$ w.r.t. $\mu$ if the target concept is $f_i$ shows the following: if we choose $\lambda = \frac{2\varepsilon}{\|\sigma_0 - \sigma_1\|_1} < 1$, then the learning requirement for $\mathcal{A}$ implies that with probability $\geq 1 - \delta$, $\mathcal{A}$ correctly identifies whether the target concept is $f_1$ or $f_2$.

As the algorithm has access to the underlying distribution only via the training data, this means that the algorithm has to be able to distinguish the corresponding training data ensembles with probability $\geq 1 - \delta$. Here, we observe that the training data being drawn i.i.d. according to $\mu$ and labeled by the target concept $f_i$ is equivalent to the learning algorithm having access to $m$ copies of a corresponding state $\rho_i$. The optimal success probability for distinguishing between two quantum states is a well-studied object in quantum information theory. It can be characterized by the trace distance between the two states and is given (in our case) by (see Nielsen and Chuang 2009)
$$p_{\mathrm{opt}} = \frac{1}{2}\left(1 + \frac{1}{2}\left\|\rho_1^{\otimes m} - \rho_2^{\otimes m}\right\|_1\right).$$
As the trace distance of tensor products is not that easy to deal with, we will instead work with the fidelity defined as $F(\rho, \sigma) := \mathrm{tr}\left[\sqrt{\rho^{1/2}\sigma\rho^{1/2}}\right]$. According to the Fuchs-van de Graaf inequalities (see Nielsen and Chuang 2009, Section 9.2.3) we have
$$p_{\mathrm{opt}} \leq \frac{1}{2}\left(1 + \sqrt{1 - F(\rho_1^{\otimes m}, \rho_2^{\otimes m})^2}\right) = \frac{1}{2}\left(1 + \sqrt{1 - F(\rho_1, \rho_2)^{2m}}\right),$$
where the last step uses multiplicativity of the fidelity under tensor products. Now we require $p_{\mathrm{opt}} \geq 1 - \delta$ and rearrange to obtain $F(\rho_1, \rho_2)^{2m} \leq 4\delta(1 - \delta)$, or equivalently, after taking logarithms,
$$m \geq \frac{\log(4\delta(1 - \delta))}{\log(F(\rho_1, \rho_2)^2)}.$$
Now we again use the Fuchs-van de Graaf inequalities, which (after rearranging) allow us to bound $F(\rho_1, \rho_2)$ and thereby $m$.
$f_a(s_0) = 0$ and $f_a(s_i) = a_i$ $\forall 1 \leq i \leq d$.
Consider the error of another concept $f_b$ w.r.t. a distribution $\mu$ and target concept $f_a$. If we pick $\lambda = \frac{8\varepsilon}{\|\sigma_0 - \sigma_1\|_1}$, then by the learning requirement, with probability $\geq 1 - \delta$, $\mathcal{A}$ has to output a hypothesis $h$ that when evaluated on $S$ yields a label vector that is $\frac{d}{4}$-close to the true underlying string in Hamming distance. Denote by $A \sim \mathrm{Uniform}(\{0, 1\}^d)$ a random variable describing the unknown underlying string, and let $B = B_1 \ldots B_m$ be the corresponding quantum training data system. We want to repeat the three-step reasoning from the proof of Theorem 5.3. The first two steps work exactly as before.
Step 3 will be slightly different. Again we have to upper-bound $I(A : B_1)$, now for a different composite state. We again use Lemma 5.2 to compute eigenvalues and thus entropies. (Here our assumption that $\sigma_0$ and $\sigma_1$ are pure enters the proof.) We obtain
• Each $\rho_a$ has non-zero eigenvalues $1 - \lambda$ of multiplicity $1$ and $\frac{\lambda}{d}$ of multiplicity $d$.
With this we can now compute the relevant entropies and obtain as well as Hence, we now have Now we can finish the proof by combining steps 1, 2 and 3 as before.

Appendix B. A Physical Motivation for our Notion of Risk
In our definition of the risk R µ we use the trace distance. As the latter is a well-established measure of distinguishability of quantum states, it presents itself as a natural candidate loss function.
Here, we give a more explicit operational reasoning as to why we choose to use the trace distance.
Imagine the learning task as a competition between two parties, a learner and a teacher. We assume that both parties obey the laws of quantum physics. The teacher knows (a classical description of) the probability distribution µ ∈ Prob(X × D) and will provide corresponding training data to the learner during a training phase. The learner's goal is to persuade the teacher in a test phase that she has managed to learn the distribution µ, which was unknown to her in advance, i.e., that she has produced a good hypothesis h : X → D.
We first give an informal description of the test phase: The teacher prepares another (independent) example (x, ρ) drawn from µ. She then sends x to the learner. The latter applies her hypothesis h to prepare the quantum state h(x) which she then sends back to the teacher. The teacher now uses this one copy of h(x) and her knowledge of µ to evaluate whether the learner made a good prediction. As also the teacher is restricted by quantum theory, she can only do so by performing a measurement.
We now discuss the teacher's choice of measurement in more detail. On the one hand, the teacher wants to maximize the probability of detecting a wrong prediction. On the other hand, she does not want to be unfair, so at the same time she tries to maximize the probability of detecting a correct prediction. In summary, the teacher wants to choose a 2-outcome measurement $\{E_{\mathrm{accept}}, E_{\mathrm{reject}}\}$ that accepts $\sigma_i$ and rejects $\sigma_j$ with the highest possible probability, where $\sigma_i = \rho$ and $\sigma_j \in D \setminus \{\rho\}$. As she knows (a classical description of) the state $\rho \in D$ and that $h(x) \in D$, she can achieve this by picking $\{E_{\mathrm{accept}}, E_{\mathrm{reject}}\}$ to be the optimal measurement for minimum error discrimination of $D$, where the states are taken with equal prior probabilities (see Watrous 2018, Theorem 3.4). The measurement is essentially the same regardless of whether $\rho = \sigma_0$ or $\rho = \sigma_1$; only the outcome labels are interchanged.
Now the expected probability of the trainer rejecting the learner's prediction is The optimal measurement satisfies It is easy to see that under the additional assumption that σ 0 and σ 1 have the same purity, i. With this we now obtain when comparing the achieved with the optimal expected rejection probability So we have recovered our notion of risk, at least in the case of states of equal purity, from a more basic analysis of the test phase.
Note that a similar analysis could also be performed in the case of more than two quantum labels. There, the teacher's measurement would be the optimal measurement for minimum error discrimination of $\rho$ and $\frac{1}{|D| - 1}\sum_{\sigma \in D \setminus \{\rho\}} \sigma$. Unfortunately, no closed-form expressions for the corresponding success probabilities are known. We do, however, see that in this scenario, using the trace distance as loss function would be too pessimistic from the perspective of the learner. As the teacher does not know the prediction state prepared by the learner, the teacher has to solve a state discrimination problem taking into account all possible label states.
Appendix C. The Holevo-Helstrom Strategy
The naive learning strategy based on the Holevo-Helstrom measurement is the following:
Holevo-Helstrom strategy
Output: Hypothesis $\hat{h} : X \to D$
Algorithm: (1) Perform a Holevo-Helstrom measurement on $\rho_i$ for each $i$ and let $y_i \in \{0, 1\}$ denote the observed outcome.
One can then view $(x_i, y_i)$ as being drawn independently according to a probability measure $\nu$ on $X \times \{0, 1\}$, which is determined by its first marginal and by the conditional probability distribution of $y$ given $x$.
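The measurement step of this strategy can be sketched in a few lines. The code below is only an illustration under stated assumptions: it takes $E_0$ to be the projector onto the positive part of $\sigma_0 - \sigma_1$ and $E_1 = \mathbb{1} - E_0$ (the usual Holevo-Helstrom construction), simulates the outcomes by sampling, and uses hypothetical qubit label states. The function names `helstrom_povm` and `measure_labels` are introduced here for illustration only.

```python
import numpy as np

def helstrom_povm(sigma0, sigma1):
    """Two-outcome measurement {E0, E1}: E0 projects onto the positive part
    of sigma0 - sigma1, and E1 is its complement."""
    w, v = np.linalg.eigh(sigma0 - sigma1)
    pos = v[:, w > 1e-12]
    e0 = pos @ pos.conj().T
    e1 = np.eye(sigma0.shape[0]) - e0
    return e0, e1

def measure_labels(label_states, sigma0, sigma1, rng):
    """Simulate the measurement on each label state; return classical labels y_i."""
    e0, _ = helstrom_povm(sigma0, sigma1)
    labels = []
    for rho in label_states:
        p0 = float(np.real(np.trace(e0 @ rho)))  # probability of outcome 0
        labels.append(0 if rng.random() < p0 else 1)
    return labels

# Hypothetical pure label states: sigma0 = |0><0|, sigma1 = |+><+|.
sigma0 = np.array([[1.0, 0.0], [0.0, 0.0]])
sigma1 = np.array([[0.5, 0.5], [0.5, 0.5]])
e0, e1 = helstrom_povm(sigma0, sigma1)

rng = np.random.default_rng(0)
labels = measure_labels([sigma0] * 1000, sigma0, sigma1, rng)
print(np.trace(e1 @ sigma0), np.trace(e0 @ sigma1), sum(labels) / len(labels))
```

From the learner's point of view, the resulting pairs $(x_i, y_i)$ are a classical training sample with label noise, where the flip probabilities are $\mathrm{tr}[E_1\sigma_0]$ and $\mathrm{tr}[E_0\sigma_1]$.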
The remainder of this subsection is devoted to studying the performance of this simple learning procedure. Note that we leave open for now which classical learning algorithm is to be used; we first work towards characterizing the true risk $R_\mu(h)$ in terms of the intermediate classical risk $\tilde{R}_\nu(g)$. In the following we will often make use of the fact that, when identifying $i \leftrightarrow \sigma_i$, the probability measure $\mu$ on $X \times D$ gives rise to a probability measure on $X \times \{0, 1\}$. We will abuse notation and also denote the latter measure by $\mu$; which measure is meant will always be clear from the context.
We now derive a similar expression for $\tilde{R}_\nu(g)$.
Lemma C.1. With the notation as in the Holevo-Helstrom strategy (in particular h(x) = σ g(x) ) it holds that Proof: This can be shown by direct computation using the definition of ν: Now we use the specific property of the Holevo-Helstrom measurement that tr and |g(x)| = g(x). Thus we obtainR where the last step uses h(x) = σ g(x) . This allows us to easily compare the true and the intermediate risk and obtaiñ As g(x) ∈ {0, 1} ∀x ∈ X and in particular 0 ≤ E µ 1 [g] ≤ 1, this gives rise to the following Corollary C.2. With the notation as in the Holevo-Helstrom strategy it holds that We can extend this to a comparison between the excess risks which are the quantities of interest for agnostic learning scenarios.
Corollary C.3. With the notation as in the Holevo-Helstrom strategy it holds that So we see that solving the classical learning task in step 3 of the Holevo-Helstrom strategy does not necessarily imply success at the overall learning task if the target accuracy is ε < In the next subsection we present an adapted strategy to overcome this problem.
Remark C.4. We want to briefly discuss a special case in which the connection between $R_\mu(h)$ and $\tilde{R}_\nu(g)$ takes a particularly appealing form. Namely, assume that $\sigma_0$ and $\sigma_1$ are such that the corresponding Holevo-Helstrom measurement produces equal probabilities of error, i.e., $\mathrm{tr}[E_0\sigma_1] = \mathrm{tr}[E_1\sigma_0]$. This is clearly not true in general; take, e.g., $\sigma_0 = |0\rangle\langle 0|$ and $\sigma_1 = \frac{1}{2}(|0\rangle\langle 0| + |1\rangle\langle 1|)$. It does, however, hold true in certain special cases, e.g., if both $\sigma_0$ and $\sigma_1$ are pure or if $\sigma_0$ and $\sigma_1$ have the same (non-trivial) purity and $\mathrm{tr}[E_0] = \mathrm{tr}[E_1]$. (The latter is, e.g., satisfied if $\sigma_0$ and $\sigma_1$ are qubit states of the same (non-zero) purity.) In this simple case our previous discussion yields $R_\mu(h) = \tilde{R}_\nu(g)$. In particular, if we succeed at the classical binary classification task in step 3, then we also succeed at the overall classification task with quantum labels, so the quantum learning task is reduced to a classical learning problem.
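The two situations contrasted in this remark can be made concrete with a small computation. The sketch below assumes the usual construction of the Holevo-Helstrom measurement ($E_0$ projects onto the positive part of $\sigma_0 - \sigma_1$, with the kernel assigned to $E_1$ as a convention) and evaluates both error probabilities for the counterexample from the remark and for a pair of pure states; the pure pair is an arbitrary illustrative choice.

```python
import numpy as np

def helstrom_errors(sigma0, sigma1):
    """Return (tr[E0 sigma1], tr[E1 sigma0]) for the Holevo-Helstrom measurement."""
    w, v = np.linalg.eigh(sigma0 - sigma1)
    pos = v[:, w > 1e-12]
    e0 = pos @ pos.conj().T
    e1 = np.eye(sigma0.shape[0]) - e0
    return (float(np.real(np.trace(e0 @ sigma1))),
            float(np.real(np.trace(e1 @ sigma0))))

ket0 = np.array([[1.0, 0.0], [0.0, 0.0]])      # |0><0|
maximally_mixed = np.eye(2) / 2                # (|0><0| + |1><1|) / 2
ket_plus = np.array([[0.5, 0.5], [0.5, 0.5]])  # |+><+|

mixed_case = helstrom_errors(ket0, maximally_mixed)
pure_case = helstrom_errors(ket0, ket_plus)
print(mixed_case, pure_case)
```

In the first case the measurement never mislabels $\sigma_0$ but mislabels $\sigma_1$ half of the time, so the induced classification noise is one-sided; in the pure case both error probabilities coincide.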

Appendix D. Sample Complexity of Binary Classification with Two-Sided Classification Noise
Here, we discuss the sample complexity of the PAC learning task of binary classification in the presence of (two-sided) classification noise in the realizable scenario. To be consistent with the majority of the literature on this and related problems, we will use a slightly different notation than in the main body of the paper. Namely, we will consider a classical input space $X$ and classical target space $\{0, 1\}$, a concept class $F \subset \{0, 1\}^X$, a probability measure $\mu \in \mathrm{Prob}(X)$, and noise probabilities $0 \leq \eta_0, \eta_1 < \frac{1}{2}$ with which labels are flipped. Moreover, we will work with the 0-1-loss function and denote the corresponding risk of a hypothesis $h$ w.r.t. a target concept $f$ by $\mathrm{err}_\mu(h; f) = \mu[h(x) \neq f(x)]$. Finally, any training data sample $S$ splits the concept class $F$ into so-called $S$-equivalence classes, where $f_1, f_2 \in F$ are equivalent if and only if $f_1(x) = f_2(x)$ $\forall x \in X$ s.t. $\exists y \in \{0, 1\}$ with $(x, y) \in S$. The basic learning strategy underlying our discussion is Algorithm 1. It is the natural analog of searching for a consistent function in the case of noisy labels: as such a consistent function will in general not exist, it searches for a function which disagrees with the training data on as few examples as possible.
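The idea of minimizing disagreements can be illustrated for a finite hypothesis class. The following is a minimal sketch and not a transcription of Algorithm 1: the threshold class on $[0, 1]$, the uniform instance distribution, the noise rates, and the convention that $\eta_0$ ($\eta_1$) is the flip probability for true label 0 (1) are all assumptions made for the example.

```python
import random

def count_disagreements(h, sample):
    """Number of labelled examples (x, y) with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y)

def minimum_disagreement(hypotheses, sample):
    """Return a hypothesis with the fewest disagreements with the sample."""
    return min(hypotheses, key=lambda h: count_disagreements(h, sample))

def make_threshold(t):
    """Threshold classifier x -> 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

rng = random.Random(0)
target = make_threshold(0.5)
eta0, eta1 = 0.1, 0.2  # assumed flip probabilities for true labels 0 and 1

# Draw a noisy sample: uniform instances, labels flipped with label-dependent rates.
sample = []
for _ in range(2000):
    x = rng.random()
    y = target(x)
    if rng.random() < (eta1 if y == 1 else eta0):
        y = 1 - y
    sample.append((x, y))

grid = [i / 20 for i in range(21)]
hypotheses = [make_threshold(t) for t in grid]
best = minimum_disagreement(hypotheses, sample)
best_t = grid[hypotheses.index(best)]
print(best_t, count_disagreements(best, sample))
```

Because both noise rates are below $\frac{1}{2}$, the target concept has the smallest expected number of disagreements, which is why the empirical minimizer concentrates around it as the sample grows.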
Laird's original proof that this algorithm solves the PAC learning problem is for the case η 0 = η 1 , it is, however, easily generalized to our case because we still assume the same noise bound on both error rates. (We only have to adapt the expression for the error rate and the corresponding Hoeffding bounds.) In order to apply the reasoning by (Hanneke 2016) we need to slightly reformulate the result of this algorithm s.t. we obtain a bound on the error in terms of the sample size. When following the proof of Theorem 5.7 in (Laird 1988) we see that m 1 is used to ensure that there is a hypothesis which performs better than some given error threshold and m 2 is used to ensure that such a hypothesis is actually chosen. In particular, if we use the error bound by (Blumer et al. With this suboptimal base learner we will now follow the strategy by (Hanneke 2016) in order to build a better learner from it. Note that Hanneke's proof includes several steps in which the existence of a function consistent with the respective subsample is ensured. This is not necessary in our case because the Minimum Disagreement Strategy does not require a consistent function to exist.
Proof: This proof is analogous to the proof of Theorem 2 in (Hanneke 2016) with some minor simplifications and adaptations and is given here only for the sake of completeness. Fix an $f^* \in F$ and a probability measure $\mu$ over $X$. Denote by $S = S_{1:m}$ the corresponding noisy training data. For any classifier $h$ denote by $\mathrm{ER}(h) = \{x \in X \mid h(x) \neq f^*(x)\}$ the set of instances on which $h$ errs. Fix $c = 3600 \cdot \ln(2)$. We will show by strong induction that $\forall m' \in \mathbb{N}$, $\forall \delta' \in (0, \ldots)$ and for all finite sequences $T'$, with probability $\geq 1 - \delta'$ the classifier $h_{m',T'} = \mathrm{Majority}(L(\mathcal{A}(S_{1:m'}; T')))$ satisfies the claimed error bound.

Hence, for a random variable $X \sim \mu$ independent of the data, of $I$ and of $h$, we can now conclude the desired bound. Since by the union bound the event $\bigcap_{i \in \{1,2,3\}} E_i \cap E_i' \cap E_i''$ has probability $\geq 1 - \delta$, the induction step is complete.
It remains to use the claim just proven by induction to derive the desired sample complexity upper bound. For this, take $T = \emptyset$ and note that for $m \geq \left\lfloor \frac{c\,C(\eta)}{\varepsilon}\left(d + \ln\frac{18}{\delta}\right) \right\rfloor$ the right hand side of (D.3) is $\leq \varepsilon$. Therefore such a sample size suffices for successful learning using $\mathrm{Majority}(L(\mathcal{A}(\cdot; \emptyset)))$. Now recall the discussion before the Theorem, where we observed that $C(\eta_b) \leq$