Binary classification with classical instances and quantum labels

In classical statistical learning theory, one of the most well-studied problems is that of binary classification. The information-theoretic sample complexity of this task is tightly characterized by the Vapnik-Chervonenkis (VC) dimension. A quantum analog of this task, with training data given as a quantum state, has also been intensely studied and is now known to have the same sample complexity as its classical counterpart. We propose a novel quantum version of the classical binary classification task by considering maps with classical input and quantum output and corresponding classical-quantum training data. We discuss learning strategies for the agnostic and for the realizable case and study their performance to obtain sample complexity upper bounds. Moreover, we provide sample complexity lower bounds which show that our upper bounds are essentially tight for pure output states. In particular, we see that the sample complexity is the same as in the classical binary classification task w.r.t. its dependence on accuracy, confidence and the VC-dimension.


Introduction
The fields of machine learning and of quantum computation provide new ways of looking at computational problems and have seen a significant increase in academic as well as practical interest since their origins in the 1970s and 1980s. More recently, attention was directed to paths for combining ideas from these two fruitful research areas. This gave rise to new approaches under different names such as "quantum machine learning" or "quantum learning theory".
In classical statistical learning theory, one of the most influential frameworks is that of probably approximately correct (PAC) learning due to Vapnik and Chervonenkis (1971) and Valiant (1984). It is particularly well studied for the task of binary classification. For this problem, the so-called VC-dimension (Vapnik and Chervonenkis 1971) is known to characterize the sample complexity of learning a function class (Blumer et al. 1989; Hanneke 2016). Motivated by these strong theoretical results, a quantum analog of this problem was soon defined and studied in a series of papers (an overview of which is given in Arunachalam and de Wolf (2017)), which culminated in the results of Arunachalam and de Wolf (2018). There it is shown that the information-theoretic complexity of the task of quantum PAC learning a 0-1-valued function class is characterized by the VC-dimension in exactly the same way as for the classical scenario.
Matthias C. Caro (caro@ma.tum.de): 1 Department of Mathematics, Technical University of Munich, Garching, Germany; 2 Munich Center for Quantum Science and Technology (MCQST), Munich, Germany
The scenario studied in Arunachalam and de Wolf (2018) assumes the training data available to the learner to be given in a specific quantum form and allows the learner to perform quantum computational operations on that training data. The functions to be learned, however, still map classical inputs to classical outputs. We propose a different quantum version of the binary classification task by not only considering the possibility of quantum training data but by allowing the objects to be learned to be inherently quantum. More specifically, we consider functions that map classical inputs to one of two possible quantum output states ("quantum labels"). These maps describe state preparation procedures. A more general learning task of this type, for which our problem can be seen as a toy model, could be relevant for cases in which state preparation is either costly or time-consuming, e.g., preparing thermal states at low temperatures (see Brandão and Kastoryano 2019;Chowdhury 2020, and references therein). Here, one could first produce sample data, learn a predictor, and then reproduce the preparation more efficiently using the predictor.

Main results
We consider maps $f : X \to \{\sigma_0, \sigma_1\}$ that assign to points in a classical input space $X$ one of two labelling quantum states $\sigma_0, \sigma_1$. (Here, $\sigma_0$ and $\sigma_1$ are, in general, mixed states described by density matrices.) Let $F$ be a function class consisting of such functions. We assume the training data to be given as a classical-quantum state about which, according to the laws of quantum theory, we can only gain information by performing measurements.
Our learning model is that of PAC-learning with accuracy $\varepsilon$ and confidence $\delta$. Here, we require a learning algorithm, given as input classical-quantum training data generated according to some unknown underlying distribution, to output with probability $\ge 1-\delta$ over the choice of training data a hypothesis that achieves accuracy $\varepsilon$. (Accuracy is measured in terms of the trace distance.) We present a learning strategy that $(\varepsilon,\delta)$-PAC learns $F \subseteq \{f : X \to \{\sigma_0,\sigma_1\}\}$ in the agnostic scenario from classical-quantum training data of size $O\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$, where $d$ is the VC-dimension of the $\{0,1\}$-valued function class $\tilde{F} \subseteq \{f : X \to \{0,1\}\}$ induced by $F$ via $\sigma_i \mapsto i$, $i = 0, 1$. Here, "agnostic" means that there need not be a function in $F$ that would achieve perfect accuracy. We also show that solving this learning problem requires training data size $\Omega\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$, so our strategy is optimal w.r.t. the sample complexity dependence on $\varepsilon$, $\delta$ and $d$.
In proving the sample complexity upper bound for the realizable scenario, we combine algorithms from Laird (1988) and Hanneke (2016) to show that $O\left(\frac{1}{\varepsilon(1-2\eta_b)^2}\left(d + \log 1/\delta\right)\right)$ classical examples with two-sided classification noise, i.e., in which each label is flipped with probability given by a noise rate, suffice for classical $(\varepsilon,\delta)$-PAC learning a function class of VC-dimension $d$ in the realizable scenario if the noise rate is bounded by $\eta_b < 1/2$. This upper bound has, to the best of our knowledge, not been observed before and, when combined with the lower bound from Arunachalam and de Wolf (2018), establishes the optimal sample complexity of this classical noisy learning problem.
As is common in statistical learning theory, our main focus lies on the information-theoretic complexity of the learning problem, i.e., the necessary and sufficient number of quantum examples, whereas we do not discuss the computational complexity. Our proposed strategies are "semiclassical" in the following sense: After initially performing simple tensor product measurements, in which each tensor factor is a two-outcome POVM, the remaining computation is done by a classical learning algorithm. In particular, the procedure does not require (possibly hard to implement) joint measurements and its computational complexity will be determined by the (classical) computational complexity of the classical learner used as a subroutine.

Overview over the proof strategy
We first sketch how we obtain the sample complexity upper bounds. We propose a simple (semi-classical) procedure that consists of first performing local measurements on the quantum part of the training data examples to obtain classical training data and then applying a classical learning algorithm.
We observe that the learning problem to which the classical learner is applied can then be viewed as a classical binary classification problem with two-sided classification noise, i.e., one in which the labels are flipped with certain error probabilities determined by the outcome probabilities of the performed quantum measurements. Therefore, we have reduced our problem to obtaining sample complexity upper bounds for a classical learning problem with noise.
In the general (so-called agnostic) case, we can use known sample complexity bounds formulated in terms of a complexity measure called Rademacher complexity to show that classical empirical risk minimization w.r.t. a suitably modified loss function (as suggested in Natarajan et al. 2013) achieves optimal sample complexity for this classical learning problem with noise.
In the realizable case, i.e., under the assumption that any non-noisy training data set can be perfectly represented by some hypothesis in our class $\tilde{F}$, the optimal sample complexity for binary classification with two-sided classification noise has not been established in the literature. We combine ideas from Laird (1988) and Hanneke (2016) to exhibit an algorithm that achieves information-theoretic optimality for this scenario.
To obtain the sample complexity lower bounds, we apply ideas from Arunachalam and de Wolf (2018). Namely, we observe that for sufficiently small accuracy parameter, any quantum strategy that solves our learning problem indeed has to be able to distinguish between the possible different training data states with high success probability.
In the simple case of distinguishing between two quantum states, arising from two different "hard-to-distinguish" underlying distributions, this probability can be upper bounded in terms of the trace distance of the states. In the more general case of many states, we do not study this success probability directly. Instead, we consider the information contained in the quantum training data about the choice of the underlying distribution, again chosen out of a set of "hard-to-distinguish" distributions.

Related work
Bshouty and Jackson (1998) introduced a notion of quantum training data for learning problems with classical concepts and used it to learn DNF (Disjunctive Normal Form) formulae w.r.t. the uniform distribution. This was extended to product distributions by Kanade et al. (2019). Using ideas from Fourier-based learning, this type of quantum training data was also studied in the context of fixed-distribution learning of Boolean linear functions (Bernstein and Vazirani 1993; Cross et al. 2015; Ristè et al. 2017; Grilo et al. 2017; Caro 2020), juntas (Atıcı and Servedio 2007), and Fourier-sparse functions (Arunachalam et al. 2019a). Arunachalam and de Wolf (2017) and Arunachalam et al. (2019b) study the limitations of these quantum examples. A broad overview of work on quantum learning of classical functions is given in Arunachalam and de Wolf (2017). A quantum counterpart can also be considered for the model of learning from membership queries. Servedio and Gortler (2004) showed that the number of required classical queries is at most polynomially larger than the number of required quantum queries. Recently, this polynomial relation was improved upon in Arunachalam et al. (2019a). A more specific scenario, namely that of learning multilinear polynomials more efficiently from quantum membership queries, is studied in Montanaro (2012).
Similarly, a quantum counterpart of the classical model of statistical query learning can be defined. This was recently studied in Arunachalam et al. (2020).
Another line of research at the intersection of learning theory and quantum information focuses on applying classical learning to concept classes arising from quantum theory, e.g., from states or measurements. This was initiated by Aaronson (2007) and studied further by Cheng et al. (2016), Aaronson (2018), and Aaronson et al. (2018).
Our learning model is similar to the one studied in Chung and Lin (2018). Also there, the inputs are assumed to be classical and the outputs are quantum states. The crucial difference to our scenario is that we assume that there are only two possible label states and these are known in advance. In Chung and Lin (2018), there can be a continuum of possible label states.
Our additional assumption allows us to study infinite function classes F , whereas the results in Chung and Lin (2018) are for classes of finite size. (We expect that the reasoning of Chung and Lin (2018) can be extended to infinite classes using the so-called "growth function" when restricting to a finite set of possible target states. This might lead to a learning procedure that can be applied in our scenario without prior knowledge of the possible quantum label states.) As a further difference between the approaches, whereas the strategy of Chung and Lin (2018) requires the ability to perform measurements in random orthonormal bases, the measurements in our procedures can be taken to be fixed and of product form and are thus potentially easier to implement.
The classical problems to which our quantum learning problems are reduced are problems of learning from noisy training data. These were first proposed by Angluin and Laird (1988) and Laird (1988) and studied further, e.g., by Aslam and Decatur (1996), Cesa-Bianchi et al. (1999), and Natarajan et al. (2013).

Structure of the paper
In Section 2 we recall some notions from learning theory as well as from quantum information and computation. The central learning problem of this contribution is formulated in Section 3. The next section exhibits strategies for solving the task and establishes sample complexity upper bounds. In doing so, we derive a tight upper bound on the sample complexity of classical binary classification with two-sided classification noise (see Appendix 4). The quantum sample complexity upper bounds are complemented by lower bounds in Section 5. We conclude with open questions and the references.

Basics of quantum information and computation
A finite-dimensional quantum system is described by a (mixed) state and mathematically represented by a density matrix of some dimension $d \in \mathbb{N}$, i.e., an element of $S(\mathbb{C}^d) := \{\rho \in \mathbb{C}^{d\times d} \mid \rho \ge 0,\ \operatorname{tr}[\rho] = 1\}$. Here, $\rho \ge 0$ means that $\rho$ is a self-adjoint and positive semidefinite matrix. The extreme points of the convex set $S(\mathbb{C}^d)$ are the rank-1 projections, the pure states. We employ Dirac notation to denote a unit vector $\psi \in \mathbb{C}^d$ also by $|\psi\rangle \in \mathbb{C}^d$ and the corresponding pure state by $|\psi\rangle\langle\psi|$.
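To make an observation about a quantum system, a measurement has to be performed. Measurements are built from the set of effect operators $E(\mathbb{C}^d) := \{E \in \mathbb{C}^{d\times d} \mid 0 \le E \le \mathbb{1}\}$. For our purposes it suffices to consider a measurement as a collection $\{E_i\}_{i=1}^k \subset E(\mathbb{C}^d)$ of effect operators with $\sum_{i=1}^k E_i = \mathbb{1}$. (For the more general notion of a POVM see Nielsen and Chuang (2009) or Heinosaari and Ziman (2012).) When performing a measurement $\{E_i\}_{i=1}^k$ on a state $\rho$, output $i$ is observed with probability $\operatorname{tr}[E_i \rho]$. A projective measurement is one where the effect operators are rank-1 projections, i.e., there exists an orthonormal basis $\{|i\rangle\}_{i=1}^d$ s.t. $E_i = |i\rangle\langle i|$.
When multiple quantum systems with spaces $\mathbb{C}^{d_i}$ are considered, the composite system is described by the tensor product $\bigotimes_{i=1}^n \mathbb{C}^{d_i} \simeq \mathbb{C}^{\prod_i d_i}$ and the set of states becomes $S\left(\bigotimes_{i=1}^n \mathbb{C}^{d_i}\right)$. Given a state $\rho_{AB} \in S(\mathbb{C}^{d_A} \otimes \mathbb{C}^{d_B})$ of a composite system, we can obtain states of the subsystems as partial traces $\rho_A = \operatorname{tr}_B[\rho_{AB}]$ and $\rho_B = \operatorname{tr}_A[\rho_{AB}]$. Here, the partial trace $\operatorname{tr}_B$ is defined as satisfying the relation $\operatorname{tr}\left[\operatorname{tr}_B[\rho_{AB}]\, E\right] = \operatorname{tr}\left[\rho_{AB}\,(E \otimes \mathbb{1})\right]$ for all $E \in E(\mathbb{C}^{d_A})$, and analogously for $\operatorname{tr}_A$.
The dynamics of a quantum system are usually described by unitary evolution or, more generally, by quantum channels. For our purposes, these dynamics will not have to be discussed explicitly since they can be considered as part of the performed measurement by changing to the so-called Heisenberg picture (see Nielsen and Chuang 2009). We will take this perspective in proving our sample complexity lower bounds because it allows us to restrict our attention to proving limitations of measurements rather than of channels.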
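The two operations just described, computing outcome probabilities and taking a partial trace, can be sketched numerically. The following is a minimal illustration with NumPy; the function names `outcome_probabilities` and `partial_trace_B` and the example states are hypothetical choices made for this sketch, not notation from the paper.

```python
import numpy as np

def outcome_probabilities(effects, rho):
    """Born rule: outcome i is observed with probability tr[E_i rho]."""
    return np.array([np.trace(E @ rho).real for E in effects])

def partial_trace_B(rho_AB, dA, dB):
    """Trace out the second tensor factor of a state on C^dA (x) C^dB."""
    return np.einsum("ajbj->ab", rho_AB.reshape(dA, dB, dA, dB))

# Projective measurement in the computational basis of C^2.
basis_effects = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]

# Pure state |+><+| with |+> = (|0> + |1>)/sqrt(2).
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho_plus = np.outer(plus, plus.conj())
probs = outcome_probabilities(basis_effects, rho_plus)

# For a product state rho_A (x) rho_B, tracing out B returns rho_A.
rho_A = np.diag([0.25, 0.75])
rho_AB = np.kron(rho_A, rho_plus)
reduced_A = partial_trace_B(rho_AB, 2, 2)
```

Measuring $|+\rangle\langle +|$ in the computational basis gives both outcomes with equal probability, and the partial trace of a product state recovers its first factor.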
We will also make use of some standard entropic quantities which have been generalized from their classical origins (Shannon 1948) to the realm of quantum theory. We denote the Shannon entropy of a random variable $X$ with probability mass function $p$ by $H(X) = -\sum_x p(x)\log p(x)$, the conditional entropy of a random variable $Y$ given $X$ as $H(Y|X) = -\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)}$ and the mutual information between $X$ and $Y$ as $I(X : Y)$. Similarly, the von Neumann entropy of a quantum state $\rho$ will be denoted as $S(\rho) = -\operatorname{tr}[\rho\log\rho]$ and the mutual information for a bipartite quantum state $\rho_{AB}$ as $I(\rho_{AB}) = I(A : B) = S(\rho_A) + S(\rho_B) - S(\rho_{AB})$. All the standard results and inequalities connected to these quantities which appear in our arguments can be found in Nielsen and Chuang (2009) or in Wilde (2013).
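As a small numerical illustration of these entropic quantities, the sketch below evaluates the von Neumann entropy and the quantum mutual information for two two-qubit states. It assumes logarithms to base 2 (the text does not fix the base), and all function names are hypothetical.

```python
import numpy as np

def shannon_entropy(p):
    """H = -sum_x p(x) log2 p(x), ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def von_neumann_entropy(rho):
    """S(rho) = -tr[rho log2 rho], computed from the eigenvalues of rho."""
    return shannon_entropy(np.linalg.eigvalsh(rho))

def reduced_states(rho_AB, dA, dB):
    """Return (rho_A, rho_B) as partial traces of rho_AB."""
    t = rho_AB.reshape(dA, dB, dA, dB)
    return np.einsum("ajbj->ab", t), np.einsum("iaib->ab", t)

def mutual_information(rho_AB, dA, dB):
    """I(A:B) = S(rho_A) + S(rho_B) - S(rho_AB)."""
    rho_A, rho_B = reduced_states(rho_AB, dA, dB)
    return (von_neumann_entropy(rho_A) + von_neumann_entropy(rho_B)
            - von_neumann_entropy(rho_AB))

# Maximally entangled two-qubit state (|00> + |11>)/sqrt(2).
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
I_bell = mutual_information(np.outer(bell, bell), 2, 2)

# Classically correlated state (|00><00| + |11><11|)/2.
I_classical = mutual_information(np.diag([0.5, 0.0, 0.0, 0.5]), 2, 2)
```

The entangled state has mutual information 2 and the classically correlated state has mutual information 1, so quantum correlations can exceed the classical maximum for a pair of bits.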

Basics of the PAC framework and the binary classification problem
The setting of Probably Approximately Correct (PAC) learning was introduced by Vapnik and Chervonenkis (1971) and Valiant (1984). The general setting is as follows: Let $X$, $Y$ be input and output space, respectively, let $F \subset Y^X$ be a class of functions, a concept class, and let $\ell : Y \times Y \to \mathbb{R}_+$ be a loss function. A learning algorithm (to which $X$, $Y$, $F$ and $\ell$ are known) has access to training data of the form $S = \{(x_i, y_i)\}_{i=1}^m$, where the $(x_i, y_i)$ are drawn i.i.d. from a probability measure $\mu \in \mathrm{Prob}(X \times Y)$. Moreover, the learner is given as input a confidence parameter $\delta \in (0,1)$ and an accuracy parameter $\varepsilon \in (0,1)$. Then a learner must output a hypothesis $h \in Y^X$ s.t., with probability $\ge 1-\delta$ w.r.t. the choice of training data,
$\int_{X\times Y} \ell(y, h(x))\, d\mu(x,y) \le \inf_{f\in F} \int_{X\times Y} \ell(y, f(x))\, d\mu(x,y) + \varepsilon$. (2.1)
Note that the first term on the right-hand side vanishes if there exists an $f^* \in F$ s.t. $\mu(x,y) = \mu_1(x)\,\delta_{y,f^*(x)}$ $\forall (x,y) \in X \times Y$. In this case, we call the learning problem realizable, otherwise we refer to it as agnostic.
Both in the agnostic and in the realizable scenario, a learning algorithm that always outputs a hypothesis h ∈ F is called a proper learner, and otherwise it is called improper.
A quantity of major interest is the number of examples featuring in such a learning problem. Given a learning algorithm $A$, the smallest $m = m(\varepsilon,\delta) \in \mathbb{N}$ s.t. the learning requirement (2.1) is satisfied with confidence $1-\delta$ and accuracy $\varepsilon$ is called the sample complexity of $A$. The sample complexity of the learning problem is the infimum over the sample complexities of all learning algorithms for the problem. This characterizes, from an information-theoretic perspective, the hardness of a learning problem, but leaves aside questions of computational complexity.
The binary classification problem now arises as a special case from the above if we specify the output space $Y = \{0,1\}$ and take the loss function to be $\ell(y,\tilde{y}) = 1 - \delta_{y,\tilde{y}}$, the 0-1-loss. This setting is well studied and a characterization of its sample complexity is known. At its core is the following combinatorial parameter:
Definition 1 (VC-Dimension (Vapnik and Chervonenkis 1971)) Let $F \subseteq \{0,1\}^X$. A set $S = \{x_1, \ldots, x_n\} \subseteq X$ is said to be shattered by $F$ if for every $b \in \{0,1\}^n$ there exists an $f \in F$ with $f(x_i) = b_i$ for all $1 \le i \le n$. The VC-dimension of $F$ is the largest cardinality of a set shattered by $F$, and it is infinite if sets of arbitrarily large cardinality are shattered.
The main insight of VC-theory lies in the fact that learnability of a $\{0,1\}$-valued concept class is equivalent to finiteness of its VC-dimension. Even more, the sample complexity can be expressed in terms of the VC-dimension. This is the content of the following
Theorem 1 (see, e.g., Blumer et al. 1989; Hanneke 2016; Shalev-Shwartz and Ben-David 2014; Vershynin 2018) In the realizable scenario, the sample complexity of binary classification for a function class $F$ of VC-dimension $d$ is $m = m(\varepsilon,\delta) = \Theta\left(\frac{1}{\varepsilon}\left(d + \log 1/\delta\right)\right)$. In the agnostic scenario, the sample complexity of binary classification for a function class $F$ of VC-dimension $d$ is $m = m(\varepsilon,\delta) = \Theta\left(\frac{1}{\varepsilon^2}\left(d + \log 1/\delta\right)\right)$.
The proof of the sample complexity upper bound in the agnostic case typically goes via a different complexity measure, the Rademacher complexity, which is then related to the VC-dimension. As this will reappear later on in our analysis, we also recall this definition here.
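For small finite examples, the VC-dimension can be computed by brute force, which may help to build intuition for the notion of shattering. The sketch below is such a brute-force check; the helper names and the two toy hypothesis classes (thresholds and intervals on six points) are assumptions chosen for illustration.

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """True if the hypotheses realize all 2^n labelings of the given points."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(domain, hypotheses):
    """Largest n such that some n-point subset of the finite domain is shattered."""
    d = 0
    for n in range(1, len(domain) + 1):
        if any(is_shattered(S, hypotheses) for S in combinations(domain, n)):
            d = n
        else:
            # Subsets of shattered sets are shattered, so no larger set can work.
            break
    return d

domain = list(range(6))
# Threshold functions x -> 1[x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]
# Interval indicators x -> 1[a <= x <= b].
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(6) for b in range(a, 6)]

vc_thresholds = vc_dimension(domain, thresholds)
vc_intervals = vc_dimension(domain, intervals)
```

Thresholds cannot realize the labeling (1, 0) on an ordered pair of points, and intervals cannot realize (1, 0, 1) on an ordered triple, which is why the two classes have VC-dimension 1 and 2, respectively.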
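Definition 2 (Rademacher Complexity (see Bartlett and Mendelson 2002)) Let $Z$ be any set, $F \subseteq \mathbb{R}^Z$ a class of real-valued functions and $z = (z_1, \ldots, z_n) \in Z^n$. The empirical Rademacher complexity of $F$ w.r.t. $z$ is $\hat{R}(F) := \mathbb{E}_{\sigma \sim U(\{-1,1\}^n)}\left[\sup_{f\in F} \frac{1}{n}\sum_{i=1}^n \sigma_i f(z_i)\right]$, where $U(\{-1,1\}^n)$ denotes the uniform distribution on $\{-1,1\}^n$. If we consider $n$ i.i.d. random variables $Z_1, \ldots, Z_n$ distributed according to a probability measure $\mu$ on $Z$ and write $Z = (Z_1, \ldots, Z_n)$, the Rademacher complexities of $F$ w.r.t. $\mu$ are defined to be $R_n(F) := \mathbb{E}_{Z\sim\mu^n}\left[\hat{R}(F)\right]$, $n \in \mathbb{N}$.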
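For a finite function class evaluated on a small sample, the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below assumes the common convention without absolute values and with normalization $1/n$, i.e., the average over $\sigma$ of $\sup_f \frac{1}{n}\sum_i \sigma_i f(z_i)$; other conventions differ by constants. The function name is hypothetical.

```python
import numpy as np
from itertools import product

def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class.

    values[k, i] = f_k(z_i) for the k-th function and the i-th sample point.
    Enumerates all 2^n sign vectors, so this is only feasible for small n.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    signs = np.array(list(product([-1.0, 1.0], repeat=n)))
    # For each sign vector, take the supremum over the class.
    sups = (signs @ values.T).max(axis=1) / n
    return float(sups.mean())

# All {0,1}-valued functions on 4 points: the sample is shattered.
all_functions = np.array(list(product([0, 1], repeat=4)))
r_all = empirical_rademacher(all_functions)

# A class consisting of a single function.
r_single = empirical_rademacher(np.ones((1, 4)))
```

A class that shatters the sample attains the value 1/2 under this convention, whereas a single function has empirical Rademacher complexity 0. This matches the intuition that the quantity measures how well the class can correlate with random signs.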

The binary classification problem with classical instances and quantum labels
We introduce a generalization of the classical binary classification problem to the quantum realm by allowing the two labels to be quantum states. Thus let $\sigma_0, \sigma_1 \in S(\mathbb{C}^n)$ be two (possibly mixed) quantum states and write $D = \{\sigma_0, \sigma_1\}$. We assume that classical descriptions of these states (their density matrices) are known to the learning algorithm as well as the fact that only these two quantum labels appear. The class to be learned is now a class of functions $F \subset \{f : X \to D\}$ and the underlying distribution will be a $\mu \in \mathrm{Prob}(X \times D)$, where $X$ is some space of classical objects.
We now deviate from the standard PAC setting: We assume the training data to be $S = \{(x_i, \rho_i)\}_{i=1}^m$, where the $(x_i, \rho_i)$ are drawn independently according to $\mu$ (in particular, $\rho_i \in D$ for all $i$). Here, the $\rho_i$ are the actual quantum states, not classical descriptions of them. Therefore, our learning problem is not a classical one: we have to perform measurements on the quantum labels to extract information from them. Equivalently, we represent an example $(x_i, \rho_i)$ drawn from $\mu$ as the classical-quantum state $\mathbb{E}_{(x,\rho)\sim\mu}\left[|x\rangle\langle x| \otimes \rho\right]$.
Note that this model for the training data differs from the one introduced by Bshouty and Jackson (1998), where the training data consists of copies of a superposition state. Instead, here we assume copies of a mixture of states. This is done mainly for two reasons: First, it allows us to naturally talk about maps with mixed state outputs. Second, it is debatable whether assuming access to superposition examples as in Bshouty and Jackson (1998) is justified (see, e.g., Ciliberto et al. 2018, section 5), and this problem remains when considering maps with quantum outputs. In contrast, the mixtures assumed in our model arise naturally as statistical ensembles of outputs of state preparation procedures, if the parameters of the preparation are chosen according to some (unknown) distribution. In that sense, the form of classical-quantum training data assumed here is both a straightforward generalization of classical training data, given the standard probabilistic interpretation of mixed states, and can (at least in the realizable scenario) be easily imagined to be obtained as the outcome of multiple runs of a state preparation experiment with different parameter settings.
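To make the notion of a classical-quantum example state concrete, the following sketch builds such a state for a finite input space. It assumes the standard block-diagonal form $\sum_{x,j} \mu(x,\sigma_j)\,|x\rangle\langle x| \otimes \sigma_j$; the function name `cq_state`, the distribution `mu` and the two label states are hypothetical choices for illustration.

```python
import numpy as np

def cq_state(mu, labels):
    """Classical-quantum state sum_{x,j} mu[x, j] |x><x| (x) labels[j].

    mu[x, j] is the probability of drawing input x together with label state j.
    """
    num_inputs = mu.shape[0]
    dim = labels[0].shape[0]
    rho = np.zeros((num_inputs * dim, num_inputs * dim), dtype=complex)
    for x in range(num_inputs):
        proj_x = np.zeros((num_inputs, num_inputs))
        proj_x[x, x] = 1.0  # classical register |x><x|
        for j, sigma in enumerate(labels):
            rho += mu[x, j] * np.kron(proj_x, sigma)
    return rho

# Hypothetical label states sigma_0 = |0><0| and sigma_1 = |+><+|.
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
labels = [np.outer(ket0, ket0), np.outer(ketp, ketp)]

# Joint distribution over 3 inputs and 2 labels (rows sum to the marginal of x).
mu = np.array([[0.30, 0.10],
               [0.05, 0.25],
               [0.10, 0.20]])
rho = cq_state(mu, labels)
block_1 = rho[2:4, 2:4]  # quantum part attached to the input x = 1
```

The resulting matrix is block diagonal: the block belonging to input $x$ is the (unnormalized) mixture of label states weighted by $\mu(x,\cdot)$, which is exactly the statistical ensemble interpretation described in the text.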
A quantum learner for $F$ with confidence $1-\delta$ and accuracy $\varepsilon$ from $m = m(\varepsilon,\delta)$ quantum examples has to output, for every $\mu \in \mathrm{Prob}(X \times D)$, with probability $\ge 1-\delta$ over the choice of training data of size $m$ according to $\mu$, a hypothesis $h$ s.t. $R_\mu(h) \le \inf_{f\in F} R_\mu(f) + \varepsilon$. As before, we can consider agnostic versus realizable and proper versus improper variants of this learning model.
Here, we define the risk of a hypothesis $h \in F$ w.r.t. a distribution $\mu \in \mathrm{Prob}(X \times D)$ as $R_\mu(h) := \int_{X\times D} \frac{1}{2}\|h(x) - \rho\|_1\, d\mu(x,\rho)$. Note that our assumption on $F$ implies that $h(x) \in D$ $\forall x \in X$ and therefore we can easily rewrite $R_\mu(h) = \frac{\|\sigma_0-\sigma_1\|_1}{2}\int_{X\times D}\left(1 - \delta_{h(x),\rho}\right) d\mu(x,\rho)$, which is just the 0-1-risk multiplied by a constant. We choose the slightly more complicated looking definition for $R_\mu(h)$ for two reasons. On the one hand, $\frac{\|\sigma_0-\sigma_1\|_1}{2}$ is a measure for the distinguishability of $\sigma_0$ and $\sigma_1$ and thus a natural scale w.r.t. which to measure the prediction error. (Note: If $\sigma_0$, $\sigma_1$ are orthogonal pure states and thus perfectly distinguishable, the classical scenario is recovered.) On the other hand, our definition of risk can be motivated operationally as we discuss in Appendix 2.
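The statement that the risk is the 0-1-risk rescaled by the trace distance of the label states can be checked numerically. The sketch below assumes that the risk is the expected trace distance $\frac{1}{2}\|h(x)-\rho\|_1$ between predicted and drawn label state, uses a finite input space, and encodes a hypothesis as a list of label indices. All names and numbers are hypothetical.

```python
import numpy as np

def trace_norm(A):
    """Trace norm of a Hermitian matrix: sum of absolute eigenvalues."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(A))))

def quantum_risk(mu, labels, h):
    """Expected trace distance between predicted label h[x] and drawn label j."""
    return sum(mu[x, j] * 0.5 * trace_norm(labels[h[x]] - labels[j])
               for x in range(mu.shape[0]) for j in range(mu.shape[1]))

def zero_one_risk(mu, h):
    """Probability that the predicted label index differs from the drawn one."""
    return sum(mu[x, j] * (h[x] != j)
               for x in range(mu.shape[0]) for j in range(mu.shape[1]))

# Hypothetical label states sigma_0 = |0><0| and sigma_1 = |+><+|.
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
labels = [np.outer(ket0, ket0), np.outer(ketp, ketp)]

mu = np.array([[0.30, 0.10],
               [0.05, 0.25],
               [0.10, 0.20]])
h = [0, 1, 1]  # hypothesis: predicted label index for each input

scale = 0.5 * trace_norm(labels[0] - labels[1])
risk_q = quantum_risk(mu, labels, h)
risk_01 = zero_one_risk(mu, h)
```

For these two pure states with overlap $1/\sqrt{2}$ the scale factor is $1/\sqrt{2}$, so the quantum risk is strictly smaller than the 0-1-risk, reflecting that mislabelling hard-to-distinguish states is penalized less.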
Example 1 Here, we describe a physically motivated problem that is captured by our scenario. The idea is as follows: Suppose we have available a (possibly complicated) ground state preparation procedure. Using this, we want to prepare a ground state $|\varphi_0\rangle$ of a Hamiltonian $H$. However, $H$ is perturbed by noise about which we have only partial information. We want to learn more about the noise and its influence on the prepared ground state.
We make this idea more concrete. We consider a (self-adjoint) Hamiltonian $H$ and a state preparation procedure that, if run with Hamiltonian $H$, prepares $|\varphi_0\rangle$. When implementing this procedure, we have to fix values of a parameter vector $x \in \mathbb{R}^D$. (Think, e.g., of $D = 3$ and $x$ denoting the location at which the experiment is set up.) But due to the laboratory being only imperfectly shielded, there is an unknown region $R \subset \mathbb{R}^D$ in which the system is subject to noise. For simplicity, we assume that only two types of noise, $H^{(0)}$ and $H^{(1)}$, can occur and lead to the location-dependent Hamiltonian $H^{(i)}_x$, $i \in \{0,1\}$. The noise can lead to a perturbation of the ground state. Namely:
- For $x \notin R$, $|\varphi_0\rangle$ is a ground state of $H^{(i)}_x$. (This is the case of no effective noise.)
- For $x \in R$, $|\varphi_0\rangle$ is the unique ground state of $H^{(0)}_x$. Hence, the noise $H^{(0)}$ is benign from the perspective of ground state preparation.
- For $x \in R$, a different state $|\varphi_1\rangle$ is the unique ground state of $H^{(1)}_x$. Hence, the noise $H^{(1)}$ is malicious from the perspective of ground state preparation.
Thus, we describe the ground state preparation by a function $f^{(i)}_R$ with $f^{(i)}_R(x) = |\varphi_1\rangle\langle\varphi_1|$ if $x \in R$ and $i = 1$, and $f^{(i)}_R(x) = |\varphi_0\rangle\langle\varphi_0|$ otherwise. With this formulation, gaining information about the noise region $R$ and the noise type $i$ can be phrased as the problem of (PAC-)learning an unknown element of the (known) function class $F = \{f^{(i)}_R \mid R \in \mathcal{R},\ i \in \{0,1\}\}$, where $\mathcal{R}$ is the class of possible error regions.
Note that $|\varphi_0\rangle$ and $|\varphi_1\rangle$ are not orthogonal and thus cannot be perfectly distinguished. Therefore, we cannot phrase the learning problem as one of binary classification with classical labels.
We return to this setting in Examples 2 and 3 to illustrate our learning strategies.
We want to conclude this section by discussing a drawback of our model. We assume $F \subset D^X$, i.e., outputs of any $f \in F$ are either $\sigma_0$ or $\sigma_1$. Considering the convex structure of the set of quantum states, which is intimately tied to the probabilistic interpretation of quantum theory, this restriction can be considered unnatural. We nevertheless make it, for two reasons: First, it is easy to show using a Bayesian predictor that, under the assumption of $\mu$ being supported on $D$ (which could, of course, also be contested), the optimal choice of predictors among all functions $(S(\mathbb{C}^d))^X$ is actually a function in $D^X$. Second, it is the most direct analog of the classical scenario with binary labels and we consider it a sensible first step that, as demonstrated in Example 1, can already be of physical relevance.

The agnostic case
Our learning strategy is motivated by interpreting the classical training data arising from performing a measurement on the label states as a noisy version of the true training data. Before describing the learning strategy, we recall our assumption that classical descriptions of the label states $\sigma_0$, $\sigma_1$ are known to the learner. Based on this knowledge, the learner can derive the optimal measurement $\{E_0, E_1\}$ for minimum-error discrimination between the two states, the so-called Holevo-Helstrom measurement (see Watrous 2018, Theorem 3.4), by choosing $E_0$ to be the orthogonal projector onto the eigenspaces of $\sigma_0 - \sigma_1$ corresponding to nonnegative eigenvalues and $E_1 = \mathbb{1} - E_0$. This step is where knowledge of the states $\sigma_0$ and $\sigma_1$ is used.
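The construction of the Holevo-Helstrom measurement from classical descriptions of the label states can be sketched as follows. The function name and the example states are hypothetical, and the noise rates $\eta_0 = \operatorname{tr}[E_1\sigma_0]$, $\eta_1 = \operatorname{tr}[E_0\sigma_1]$ are defined here as the probabilities of reporting the wrong label, which is an assumption about notation.

```python
import numpy as np

def holevo_helstrom(sigma0, sigma1):
    """Return (E0, E1), with E0 the projector onto the nonnegative part of sigma0 - sigma1."""
    evals, evecs = np.linalg.eigh(sigma0 - sigma1)
    pos = evecs[:, evals >= 0]          # eigenvectors with nonnegative eigenvalue
    E0 = pos @ pos.conj().T
    E1 = np.eye(sigma0.shape[0]) - E0
    return E0, E1

# Hypothetical label states sigma_0 = |0><0| and sigma_1 = |+><+|.
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
sigma0, sigma1 = np.outer(ket0, ket0), np.outer(ketp, ketp)

E0, E1 = holevo_helstrom(sigma0, sigma1)

# Probabilities of reporting the wrong label.
eta0 = np.trace(E1 @ sigma0).real
eta1 = np.trace(E0 @ sigma1).real

# Half the trace norm of sigma0 - sigma1 (the trace distance).
trace_distance = 0.5 * np.sum(np.abs(np.linalg.eigvalsh(sigma0 - sigma1)))
```

For this measurement one finds $1 - \eta_0 - \eta_1 = \operatorname{tr}[E_0(\sigma_0 - \sigma_1)] = \frac{1}{2}\|\sigma_0 - \sigma_1\|_1$, since $\sigma_0 - \sigma_1$ is traceless and $E_0$ collects exactly its nonnegative eigenvalues.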
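The learning strategy is now the following, in which we use the Holevo-Helstrom measurement to produce classical training data and thus obtain a classical learning problem: in step (1), each quantum label $\rho_i$ is measured with $\{E_0, E_1\}$ to obtain a classical label $y_i \in \{0,1\}$, and in step (3), classical empirical risk minimization is performed w.r.t. a suitably modified loss function $\tilde{\ell}$. Note that the only non-classical step in the strategy is step (1), which consists only of performing local two-outcome measurements.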
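A minimal simulation may help to see how measuring first and correcting for noise afterwards works in practice. The sketch below is an illustration under several assumptions that are not taken from the paper: a toy input space $[0,1)$, a finite grid of threshold hypotheses, hypothetical label states, and the unbiased loss estimator of Natarajan et al. (2013) in the form $\tilde{\ell}(t,\tilde{y}) = \frac{(1-\eta_{1-\tilde{y}})\,\ell(t,\tilde{y}) - \eta_{\tilde{y}}\,\ell(t,1-\tilde{y})}{1-\eta_0-\eta_1}$ for the 0-1-loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical label states sigma_0 = |0><0| and sigma_1 = |+><+|.
ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
sigma = [np.outer(ket0, ket0), np.outer(ketp, ketp)]

# Holevo-Helstrom measurement: E[0] projects onto the nonnegative part of sigma_0 - sigma_1.
evals, evecs = np.linalg.eigh(sigma[0] - sigma[1])
pos = evecs[:, evals >= 0]
E = [pos @ pos.T, np.eye(2) - pos @ pos.T]

# Noise rates: eta[y] is the probability of reporting 1 - y when measuring sigma_y.
eta = np.array([np.trace(E[1] @ sigma[0]), np.trace(E[0] @ sigma[1])])

def measure(y_true):
    """Measure one copy of sigma_{y_true}; return the outcome in {0, 1}."""
    return int(rng.random() < np.trace(E[1] @ sigma[y_true]))

def modified_loss(pred, y_noisy):
    """Noise-corrected 0-1-loss (unbiased estimator of the noiseless 0-1-loss)."""
    wrong = (pred != y_noisy).astype(float)
    return ((1 - eta[1 - y_noisy]) * wrong - eta[y_noisy] * (1 - wrong)) / (1 - eta.sum())

# Toy data: the target maps x to sigma_1 iff x >= 0.3 (realizable for simplicity).
m = 4000
xs = rng.random(m)
y_true = (xs >= 0.3).astype(int)
y_noisy = np.array([measure(y) for y in y_true])   # local measurements

# Empirical risk minimization w.r.t. the modified loss over a grid of thresholds.
thresholds = np.linspace(0.0, 1.0, 101)
risks = [modified_loss((xs >= t).astype(int), y_noisy).mean() for t in thresholds]
best_threshold = float(thresholds[int(np.argmin(risks))])
```

Only the call to `measure` touches the quantum labels; everything afterwards is classical post-processing of the pairs $(x_i, y_i)$. With these states the two noise rates coincide, so the correction amounts to an affine rescaling of the 0-1-loss.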
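The modification of the loss function in step (3) gives an unbiased estimate of the true risk:
Lemma 1 (see Natarajan et al. 2013, Lemma 1) Fix $x \in X$. With the notation introduced above, for every $z \in \{0,1\}$, the expectation of the modified loss $\tilde{\ell}(z,\cdot)$ over the noisy label equals the loss $\ell(z,\cdot)$ evaluated at the noiseless label.
We can use a standard generalization bound in terms of Rademacher complexities (see, e.g., Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)) to bound, with probability $\ge 1-\delta$ over the choice of training data, the excess risk of the empirical risk minimizer w.r.t. $\tilde{\ell}$ in terms of the empirical Rademacher complexity of a function class $G$ built from $\tilde{\ell}$ and $\tilde{F}$. Here, we use that $|\tilde{\ell}(y_1,y_2)| \le \frac{1}{1-\eta_0-\eta_1}$. Next, we relate the empirical Rademacher complexity of $G$ to that of $\tilde{F}$.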

Lemma 2 For any training data set
Proof (Sketch) The proof uses some standard steps that are typically used for example in proving the Lipschitz contraction property of the Rademacher complexity and in studying the Rademacher complexity in a binary classification scenario. See Appendix 1 for a detailed proof.
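With this, we now reformulate the above result in terms of the VC-dimension $d$ of $\tilde{F}$, whose Rademacher complexity can be bounded by a constant multiple of $\sqrt{d/m}$ (see Vershynin 2018, Theorem 8.3.23). Therefore, we obtain that, with probability $\ge 1-\delta$ over the choice of training data, the excess risk w.r.t. the modified loss is bounded in terms of $d$, $m$, $\delta$ and $1-\eta_0-\eta_1$. Note that, using Lemma 1, we can now bound the excess risk of the output hypothesis w.r.t. the true risk in the same way. Now we can set this bound equal to $\varepsilon$ and rearrange to obtain a sufficient sample size. If we now observe that $1-\eta_0-\eta_1 = \frac{1}{2}\|\sigma_0-\sigma_1\|_1$ for the Holevo-Helstrom measurement, we obtain the sample complexity upper bound $m = O\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$.
Remark 1 The naive version of our learning strategy would be to perform Holevo-Helstrom measurements and then apply a classical learning strategy, like empirical risk minimization, without correcting for the noise in the resulting classical labels. Actually, this learning strategy already performs reasonably well and, in certain special cases, even allows us to reduce the quantum learning problem to a fully classical one. For a detailed analysis of the performance of this simpler strategy, the reader is referred to Appendix 3.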
Example 2 We illustrate our agnostic learning strategy for the scenario of Example 1. As discussed in Appendix 3, as both label states $|\varphi_0\rangle\langle\varphi_0|$ and $|\varphi_1\rangle\langle\varphi_1|$ are pure, we can actually dispense with the modification of the classical loss function and simply take the 0-1-loss. Therefore, the Holevo-Helstrom strategy will look as follows: We first perform local Holevo-Helstrom measurements with measurement operators $E_0 \propto \begin{pmatrix} -1+\sqrt{2} \\ 1 \end{pmatrix}\begin{pmatrix} -1+\sqrt{2} & 1 \end{pmatrix} \oplus 0$ and $E_1 = \mathbb{1} - E_0$. This gives rise to classical training data. With that data, we then perform (classical) empirical risk minimization over the $\{0,1\}$-valued class $\tilde{F}$ induced by $F$.
Both the optimization procedure and the generalization capability depend on the class $\mathcal{R}$ of possible noise regions. Concerning the generalization performance, observe that, if $\emptyset \in \mathcal{R}$, then $\mathrm{VCdim}(\tilde{F}) = \mathrm{VCdim}(F_{\mathcal{R}})$, where we take $F_{\mathcal{R}}$ to be the class of indicator functions of sets from $\mathcal{R}$. The VC-dimension of such classes is well known for different geometric classes $\mathcal{R}$. E.g., if $\mathcal{R}$ is the class of axis-aligned rectangles or that of Euclidean balls in $\mathbb{R}^D$, then $\mathrm{VCdim}(F_{\mathcal{R}})$ scales linearly in $D$ and thus the dependence of the sample complexity upper bound on the number of parameters $D$ is linear. If, however, we take $\mathcal{R}$ to be the class of compact and convex subsets of $\mathbb{R}^D$, then $\mathrm{VCdim}(F_{\mathcal{R}}) = \infty$ and the sample complexity upper bound becomes void. This is congruent with the intuition that without prior assumptions on the structure of the regions that can be influenced by noise, learning the noise (in particular its region) will be hard and maybe infeasible.

The realizable case
The strategy from the previous subsection uses a generalization bound via the Rademacher complexity and yields a sample complexity bound depending quadratically on $1/\varepsilon$. In the classical binary classification problem it is known (see Theorem 1) that under the realizability assumption this can be improved to $1/\varepsilon$, but this typically requires a different kind of reasoning via $\varepsilon$-nets. (Compare Section 28.3 of Shalev-Shwartz and Ben-David (2014).) In Theorem 6 we show how the reasoning by Hanneke (2016) can be combined with results by Laird (1988) to achieve the $1/\varepsilon$-scaling also in the case of two-sided classification noise. This sample complexity upper bound is seen to be optimal in its dependence on the VC-dimension $d$, the error rate bound $\eta$, the confidence $\delta$ and the accuracy $\varepsilon$ by a comparison to the lower bound in Theorem 27 of Arunachalam and de Wolf (2018).
If, as in the previous subsection, we consider the classical training data obtained by measuring the quantum training data as a noisy version of a true sample, we can replace step (3) in the Holevo-Helstrom strategy with the minimum disagreement-based classical learning strategy achieving the optimal sample complexity bound of Theorem D.2. This directly yields a bound (Theorem 2) on the number of quantum examples of a function in $F$ that are sufficient for binary classification with classical instances and quantum labels $\sigma_0$, $\sigma_1$ with accuracy $\varepsilon$ and confidence $1-\delta$.
Example 3 When considering this learning strategy in the setting of Example 1, we first perform the Holevo-Helstrom measurements as in Example 2 to obtain classical data. Again, this is followed by a classical learning procedure for the $\{0,1\}$-valued class $\tilde{F}$ induced by $F$.
Whereas the sample complexity bound derived for the agnostic case in Section 4.1 applies to any (noise-corrected) classical empirical risk minimization, the procedure leading to the bound in Theorem 2 is a specific one, presented in the proof of Theorem D.2. First, the classical data is processed, using the subsampling algorithm of Hanneke (2016) (see Algorithm 2), to generate a collection of subsamples. For each of those subsamples, we then apply Algorithm 1: We use a first part of the subsample to group the elements of $\tilde{F}$ into equivalence classes (according to how they act on that part of the subsample), and the remainder is used to test the performance of each equivalence class. Afterwards, we output as hypothesis for that subsample a representative of the equivalence class that performs best in that test, i.e., that minimizes the number of disagreements with the part of the subsample used for testing. Whether and how the grouping into equivalence classes and finding minimum disagreement strategies can be done (efficiently) depends on $\tilde{F}$, and thus on $\mathcal{R}$. Finally, we take a majority vote over all the subsample hypotheses to get the output hypothesis of the classical learning procedure.
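The overall shape of this procedure (minimum disagreement on subsamples, followed by a majority vote) can be sketched in a few lines. The sketch below is a deliberate simplification and not the algorithm of the paper: it draws plain random subsamples instead of using the structured subsampling scheme of Hanneke (2016), and it skips the grouping into equivalence classes by working with a small finite grid of threshold hypotheses. All names, the noise rate 0.15 and the target threshold 0.6 are hypothetical.

```python
import numpy as np

def min_disagreement(xs, ys, hypotheses):
    """Return the hypothesis with the fewest disagreements with the noisy labels."""
    disagreements = [np.sum(h(xs) != ys) for h in hypotheses]
    return hypotheses[int(np.argmin(disagreements))]

def majority_vote_learner(xs, ys, hypotheses, num_subsamples=9, seed=0):
    """Run minimum disagreement on random half-size subsamples and take a majority vote."""
    rng = np.random.default_rng(seed)
    voters = []
    for _ in range(num_subsamples):
        idx = rng.choice(len(xs), size=len(xs) // 2, replace=False)
        voters.append(min_disagreement(xs[idx], ys[idx], hypotheses))

    def predict(x):
        votes = np.mean([h(x) for h in voters], axis=0)
        return (votes > 0.5).astype(int)

    return predict

# Toy data with two-sided classification noise of rate 0.15.
rng = np.random.default_rng(1)
xs = rng.random(2000)
clean = (xs >= 0.6).astype(int)
flips = rng.random(2000) < 0.15
ys = np.where(flips, 1 - clean, clean)

hypotheses = [lambda x, t=t: (np.asarray(x) >= t).astype(int)
              for t in np.linspace(0.0, 1.0, 101)]
predict = majority_vote_learner(xs, ys, hypotheses)

test_x = np.linspace(0.0, 1.0, 1001)
test_error = float(np.mean(predict(test_x) != (test_x >= 0.6)))
```

Because the output is a majority vote over several hypotheses, it need not itself belong to the hypothesis class, which illustrates why a learner of this type is in general improper.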
The dependence of the sample complexity onF via the VC-dimension of the class of indicator functions of sets from R is analogous to Example 2.
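The classical post-processing described in this example can be sketched in code. The following is a minimal illustration, not the procedure of Theorem D.2 itself: the recursive split is written from our recollection of the subsampling scheme of Hanneke (2016) and should be checked against Algorithm 2, the base learner is a plain minimum disagreement search without the two-phase split, and the threshold class, the toy data and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]        # (instance, possibly flipped 0/1 label)
Hypothesis = Callable[[float], int]


def subsamples(s: Sequence[Example], t: Sequence[Example] = ()) -> List[List[Example]]:
    """Recursive subsampling (assumed form): keep s0, drop one of s1, s2, s3."""
    if len(s) <= 3:
        return [list(s) + list(t)]
    q = len(s) // 4
    n0 = len(s) - 3 * q
    s0, s1, s2, s3 = s[:n0], s[n0:n0 + q], s[n0 + q:n0 + 2 * q], s[n0 + 2 * q:]
    t = list(t)
    return (subsamples(s0, list(s2) + list(s3) + t)
            + subsamples(s0, list(s1) + list(s3) + t)
            + subsamples(s0, list(s1) + list(s2) + t))


def min_disagreement(sample: Sequence[Example], hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Return a hypothesis with the fewest disagreements on the noisy sample."""
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in sample))


def majority_vote(hs: Sequence[Hypothesis]) -> Hypothesis:
    """Pointwise majority over the subsample hypotheses."""
    def voted(x: float) -> int:
        return Counter(h(x) for h in hs).most_common(1)[0][0]
    return voted


def learn(sample: Sequence[Example], hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    hs = [min_disagreement(sub, hypotheses) for sub in subsamples(sample)]
    return majority_vote(hs)


def threshold(c: float) -> Hypothesis:
    return lambda x: int(x >= c)


# Toy data: true threshold 0.5 on a grid of 16 points, one flipped label.
candidates = [threshold(c) for c in (0.25, 0.5, 0.75)]
order = [(5 * i) % 16 for i in range(16)]          # fixed permutation of 0..15
sample = [(i / 16, int(i >= 8)) for i in order]
sample = [(x, 1 - y) if x == 3 / 16 else (x, y) for x, y in sample]
learned = learn(sample, candidates)
```

The number of subsamples is a power of three, hence odd, so the majority vote over binary predictions never ties.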
Remark 2 From the description of our noise-corrected Holevo-Helstrom strategy (either in the form of Section 4.1 or that of this subsection), we can directly see that whether it is a proper or an improper learner depends on whether the classical learning algorithm in step (3) is. As the classical learning algorithm used in Section 4.1 is a simple Empirical Risk Minimization, it is in particular proper. So our noise-corrected Holevo-Helstrom strategy for the agnostic case is proper as well. The classical learner used in this subsection, however, is in general improper. So also the noise-corrected Holevo-Helstrom strategy for the realizable case is in general improper.

Sample complexity lower bounds
Whereas the goal of the previous section was to give strategies for solving the binary classification problem with classical instances and quantum labels and to prove upper bounds on the sufficient number of classical-quantum examples, we now turn to the complementary question of lower bounds on the number of required examples. In this section, we derive lower bounds that match the respective upper bounds from the previous section, and therefore, we conclude that the procedures described in Section 4 are optimal w.r.t. sample size in terms of the dependence on ε, δ, and d.

The agnostic case
We prove the sample complexity lower bounds in two parts, the first depending on the confidence parameter δ but not on the VC-dimension of the function class and conversely for the second.
We establish the VC-dimension-independent sample complexity lower bound in the following δ ∈ (0, 1). Let F ⊂ D X be a non-trivial concept class.
Suppose A is a learning algorithm that solves the binary classification task with classical instances and (distinct) label states $\sigma_0$, $\sigma_1$ and concept class F with confidence 1 − δ and accuracy ε using

By explicitly evaluating the risk $R_\pm(h)$, we see that achieving an excess risk ≤ ε with probability ≥ 1 − δ requires the learner to distinguish between the underlying distributions $\mu_\pm$, and thus the corresponding training data states $\rho_\pm^{\otimes m}$, with probability ≥ 1 − δ. It is well known (see, e.g., Nielsen and Chuang 2009, chapter 9) that the optimal success probability of this quantum distinguishing task is given by $p_{opt} = \frac{1}{2}\left(1 + \frac{1}{2}\|\rho_+^{\otimes m} - \rho_-^{\otimes m}\|_1\right)$. Via the Fuchs-van de Graaf inequalities, which state that $\frac{1}{2}\|\rho - \sigma\|_1 \le \sqrt{1 - F(\rho,\sigma)^2}$, this can be upper bounded using lower bounds on the fidelity $F(\rho_+^{\otimes m}, \rho_-^{\otimes m}) = F(\rho_+, \rho_-)^m$. The fidelity $F(\rho_+, \rho_-)$ can be lower-bounded using its strong concavity and the explicit expressions for $\rho_\pm$. The result then follows by comparing the obtained upper bound with the required lower bound $p_{opt} \ge 1 - \delta$.
See Appendix 1 for a detailed proof.
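The chain of inequalities in this proof sketch can be illustrated numerically. The sketch below is restricted to commuting (diagonal) states, i.e., probability vectors, which is an assumption made only to keep the example free of matrix routines; the weights (0.6, 0.4) are hypothetical toy values, and all function names are ours. It uses the Helstrom formula $p_{opt} = \frac12(1 + \frac12\|\rho - \sigma\|_1)$, the Fuchs-van de Graaf inequalities and the multiplicativity of the fidelity, and solves $F^{2m} \le 4\delta(1-\delta)$ for m.

```python
import math
from itertools import product
from typing import List, Sequence


def trace_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """(1/2)||p - q||_1 for two diagonal states given by their spectra."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def fidelity(p: Sequence[float], q: Sequence[float]) -> float:
    """Non-squared fidelity; for commuting states this is sum_i sqrt(p_i q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def tensor_power(p: Sequence[float], m: int) -> List[float]:
    """Spectrum of the m-fold tensor power of a diagonal state."""
    return [math.prod(c) for c in product(p, repeat=m)]


def helstrom_success(p: Sequence[float], q: Sequence[float]) -> float:
    """Optimal success probability for equal priors."""
    return 0.5 * (1.0 + trace_distance(p, q))


def min_copies(p: Sequence[float], q: Sequence[float], delta: float) -> int:
    """Smallest m with F^(2m) <= 4*delta*(1-delta); assumes 0 < F < 1, delta < 1/2."""
    f = fidelity(p, q)
    return math.ceil(math.log(4 * delta * (1 - delta)) / math.log(f * f))


# Hypothetical toy states: two slightly biased coins.
p, q = [0.6, 0.4], [0.4, 0.6]
m_needed = min_copies(p, q, delta=0.05)
```

Any learner that distinguishes the two toy states with success probability at least 1 − δ needs at least `m_needed` copies, because with fewer copies the Fuchs-van de Graaf upper bound on $p_{opt}$ already falls below 1 − δ.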
For the proof of the VC-dimension-dependent part of the lower bound we need a well-known observation about the eigenvalues of a statistical mixture of two pure quantum states, which is the content of the following

Lemma 4 Let $|\psi\rangle, |\varphi\rangle \in \mathbb{C}^n$ be distinct pure quantum states. Let α, β ≥ 0 be real numbers. Then the non-zero eigenvalues of the mixture $\rho := \alpha|\psi\rangle\langle\psi| + \beta|\varphi\rangle\langle\varphi|$ are given by
$$\lambda_{1/2} = \frac{\alpha+\beta}{2} \pm \frac{1}{2}\sqrt{(\alpha-\beta)^2 + 4\alpha\beta\,|\langle\psi|\varphi\rangle|^2}.$$

With this we can now prove a sample complexity lower bound for the case of pure label states.
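Lemma 4 can be sanity-checked numerically. The sketch below assumes the closed form $\lambda_{1/2} = \frac{\alpha+\beta}{2} \pm \sqrt{\frac{(\alpha-\beta)^2}{4} + \alpha\beta|\langle\psi|\varphi\rangle|^2}$, which is what solving the quadratic equation $\lambda^2 - \lambda(\alpha+\beta) + \alpha\beta\sin^2(\phi) = 0$ from the proof gives. Since the mixture has rank at most two, its non-zero eigenvalues are fixed by $\mathrm{tr}[\rho]$ and $\mathrm{tr}[\rho^2]$, so comparing these two power sums avoids an eigenvalue solver. The vectors, weights and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[complex]
Matrix = List[List[complex]]


def inner(u: Vector, v: Vector) -> complex:
    """<u|v>, conjugate-linear in the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def mixture(alpha: float, psi: Vector, beta: float, phi: Vector) -> Matrix:
    """alpha |psi><psi| + beta |phi><phi| as a dense matrix."""
    n = len(psi)
    return [[alpha * psi[i] * psi[j].conjugate() + beta * phi[i] * phi[j].conjugate()
             for j in range(n)] for i in range(n)]


def lemma4_eigenvalues(alpha: float, beta: float, overlap: float) -> Tuple[float, float]:
    """Closed form for the two non-zero eigenvalues (assumed form, see lead-in)."""
    mean = (alpha + beta) / 2
    root = math.sqrt((alpha - beta) ** 2 / 4 + alpha * beta * overlap ** 2)
    return mean + root, mean - root


def power_sums(rho: Matrix) -> Tuple[float, float]:
    """tr[rho] and tr[rho^2]."""
    n = len(rho)
    tr = sum(rho[i][i] for i in range(n)).real
    tr2 = sum(rho[i][j] * rho[j][i] for i in range(n) for j in range(n)).real
    return tr, tr2


# Hypothetical example: two non-orthogonal unit vectors in C^3.
s = 1 / math.sqrt(3)
psi = [1 + 0j, 0j, 0j]
phi = [s + 0j, s * 1j, s + 0j]
alpha, beta = 0.3, 0.7
overlap = abs(inner(psi, phi))
rho = mixture(alpha, psi, beta, phi)
lam_plus, lam_minus = lemma4_eigenvalues(alpha, beta, overlap)
```

For α = β = 1/(2d) the closed form reduces to $\frac{1}{2d}(1 \pm |\langle\psi|\varphi\rangle|)$.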
Theorem 3 Let $\sigma_0 = |\psi_0\rangle\langle\psi_0|$, $\sigma_1 = |\psi_1\rangle\langle\psi_1| \in S(\mathbb{C}^n)$ be (distinct) pure quantum states, let $\varepsilon \in \left(0, \frac{\|\sigma_0-\sigma_1\|_1}{8}\right)$,

Note that $\forall a \in \{0,1\}^d\ \exists f_a \in \tilde{F}: f_a(s_i) = a_i$ by shattering and that $f_a$ is a minimum error concept w.r.t. $\mu_a$. By evaluating the excess error of an $f_{\tilde{a}}$ compared to $f_a$, we see that solving the learning problem with confidence 1 − δ requires the learner to output, with probability ≥ 1 − δ, a hypothesis described by a string whose Hamming distance to the true underlying string is $\le \frac{d}{4}$. We can use this observation to obtain the lower bound $I(A:B) \ge \Omega(d)$ on the mutual information between the underlying string A (drawn uniformly at random) and the corresponding quantum training data B.
We can also upper bound the mutual information. If we now combine Lemma 3 and Theorem 3 with the result of Section 4.1 we obtain Then a sample size of $\Theta\left(\frac{d}{\varepsilon^2} + \frac{\log 1/\delta}{\varepsilon^2}\right)$ is necessary and sufficient for solving the binary classification task with classical instances and quantum labels $\sigma_0$, $\sigma_1$ and hypothesis class F with confidence 1 − δ and accuracy ε.
Therefore, we have shown that the strategy from Section 4.1 is, for pure states, optimal in sample complexity w.r.t. its dependence on the VC-dimension, the accuracy and the confidence. But we do not make a statement on optimality w.r.t. the dependence on the distinguishability of the label states, because the parameter $\|\sigma_0 - \sigma_1\|_1$ is lacking from our lower bound.

The realizable case
We now show analogous lower bounds for the sample complexity in the realizable scenario with the same proof strategy.
We now provide the analog of Theorem 3 for the realizable case. Proof This can be proved similarly to Theorem 3. See Appendix 1 for a detailed proof.

be a non-trivial concept class s.t. $\tilde{F}$ has VC-dimension d + 1. Suppose
Thus, we have obtained a sample complexity lower bound that matches the upper bound proved in Section 4.2 in the dependence on the VC-dimension, the confidence and the accuracy, but we do not make a statement about optimality w.r.t. the dependence on $\|\sigma_0 - \sigma_1\|_1$.
Remark 3 As already discussed in Section 2.1, in proving the sample complexity lower bounds we resort to the Heisenberg picture, which allows us to absorb the intermediate quantum channels performed by a learner into the measurement. These lower bounds therefore even hold for quantum learning algorithms that perform coherent and adaptive measurements on the training data. In particular, the information-theoretic complexity of our learning problem does not change if we restrict the quantum learner to only performing two-outcome POVMs locally (i.e., on one subsystem only). This is perhaps not too surprising, since the optimal measurement for distinguishing states drawn uniformly at random from $\left\{\bigotimes_{i=1}^{m} \sigma_{x_i}\right\}_{x\in\{0,1\}^m}$ can, using the Holevo-Yuen-Kennedy-Lax optimality criterion (Holevo 1973; Yuen et al. 1975), be seen to be exactly given by local Holevo-Helstrom measurements.

Conclusion and outlook
We have proposed a novel way of modifying the classical binary classification problem to obtain a quantum counterpart. The conceptual difference to the framework of quantum PAC learning as discussed in Arunachalam and de Wolf (2017) is that we work with maps whose outputs are themselves quantum states, not classical labels. This naturally gives rise to training data given by quantum states, which is one aspect in which our setting differs from Aaronson (2007).
Using results from classical learning theory on dealing with classification noise in the training data, we exhibited learning strategies (based on the Holevo-Helstrom measurement) for binary classification with classical instances and quantum labels. The learning strategies consist of two main steps: First, classical information is extracted from the training data by performing a (localized) measurement. Second, classical learning strategies are applied. We complemented these procedures by sample complexity lower bounds thereby establishing the information-theoretic optimality of these strategies for pure label states w.r.t. the dependence on VC-dimension, confidence and accuracy.
We conclude with some questions that we leave open for further research:
- Can we derive sample complexity lower bounds which explicitly incorporate factors related to the hardness of distinguishing $\sigma_0$ and $\sigma_1$, e.g., in terms of $\|\sigma_0 - \sigma_1\|_1$ or $\max\{\mathrm{tr}[E_0\sigma_1], \mathrm{tr}[E_1\sigma_0]\}$? Or can the corresponding factors in the upper bounds be eliminated? Could this be related to another complexity measure from classical learning theory, the "fat-shattering dimension" of the class
- Our analysis is focused on the information-theoretic part of the learning problem, i.e., the sample complexity. Can we improve the computational complexity?
- For deriving our sample complexity upper bounds, we used specific classical learning procedures applied to the post-measurement training data. In the agnostic case, we use empirical risk minimization; in the realizable case, we use a combination of a minimum disagreement approach with a subsampling procedure. In both cases, we chose these algorithms to achieve the (essentially) optimal sample complexity characterized via the VC-dimension. However, we could use other classical learning procedures for "post-processing". Can we identify situations in which procedures like structural risk minimization, compression schemes, or stable learning procedures yield useful sample complexity bounds?
- We considered the case of classical instances. Can this be extended to a scenario of quantum instances with classical (or even quantum) labels? Whereas we were able to study the case of classical instances and quantum labels with methods from learning with label noise, once the instances themselves are quantum, we might have to employ ideas from learning models with restricted access to the instances such as that of "learning with restricted focus of attention" proposed in Ben-David and Dichterman (1998).
- Our strategy uses the Holevo-Helstrom measurement, which can be understood as inducing the minimum amount of noise. However, in classical learning theory it is well known that adding noise to the training data can be helpful in preventing overfitting. In this spirit, can we justify other measurements than the Holevo-Helstrom measurement?
- We assumed throughout our analysis that the learning algorithm has to output a hypothesis that maps into $\{\sigma_0, \sigma_1\}$. What if we allow for hypotheses that map into $\mathrm{conv}(\{\sigma_0, \sigma_1\})$ or $S(\mathbb{C}^d)$?
- Finally, we assume throughout that the label states $\sigma_0$, $\sigma_1$ are known in advance. Can this assumption be removed? Here, it might be helpful that Theorem 6 does not need explicit knowledge of the error rates $\eta_0$, $\eta_1$, but merely of an upper bound $\eta_b$ on them.

Proof of Lemma 2 Let
If we use and , then we can rewrite Next, we use that $E_\sigma[\sigma_i] = 0$ and that $\sigma_i$ and $(1 - 2y_i)\sigma_i$ have the same distribution for all i. With this we obtain from the above where the last step used that the expression is invariant w.r.t. interchanging $\tilde{f}$ and $\tilde{f}'$, so we can drop the absolute value. Now we can iterate this reasoning for i = 2, . . . , m and obtain the desired inequality.

Proof of Lemma 3
As F is non-trivial, there exist concepts f, g ∈ F and a point x ∈ X s.t. f (x) = σ 0 and g(x) = σ 1 . Let λ ∈ (0, 1) (to be chosen appropriately later in the proof). Define probability distributions μ ± on X × D via The risk of a hypothesis h ∈ D X w.r.t. these probability measures is given by in particular the optimal achievable risk is 1−λ 4 σ 0 − σ 1 1 . Note that a hypothesis which predicts the suboptimal label state for x has an excess risk of So if we pick λ = ε 2 σ 0 −σ 1 1 < 1, then in order to achieve an excess risk ≤ ε with probability ≥ 1 − δ, the learning algorithm has to be able to distinguish between the underlying distributions μ ± with probability ≥ 1 − δ.
As the algorithm has access to the underlying distribution only via the training data, this means that the algorithm has to be able to distinguish the corresponding training data ensembles with probability ≥ 1 − δ. Here, we observe that the training data being drawn i.i.d. according to $\mu_\pm$ is equivalent to the learning algorithm having access to m copies of the state
$$\rho_\pm := \mu_\pm(x, f(x))\,|x\rangle\langle x| \otimes \sigma_0 + \mu_\pm(x, g(x))\,|x\rangle\langle x| \otimes \sigma_1,$$
because this mixed state simply describes the statistical mixture. The optimal success probability for distinguishing between two quantum states is a well-studied object in quantum information theory. It can be characterized by the trace distance between the two states and is given (in our case) by (see, e.g., Nielsen and Chuang 2009)
$$p_{opt} = \frac{1}{2}\left(1 + \frac{1}{2}\left\|\rho_+^{\otimes m} - \rho_-^{\otimes m}\right\|_1\right).$$
As the trace distance of tensor products is not that easy to deal with, we will instead work with the fidelity defined as
$$F(\rho, \sigma) := \mathrm{tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}.$$
According to the Fuchs-van de Graaf inequalities we have
$$\frac{1}{2}\left\|\rho_+^{\otimes m} - \rho_-^{\otimes m}\right\|_1 \le \sqrt{1 - F(\rho_+^{\otimes m}, \rho_-^{\otimes m})^2} = \sqrt{1 - F(\rho_+, \rho_-)^{2m}},$$
where the last step uses multiplicativity of the fidelity under tensor products. Now we require $p_{opt} \ge 1 - \delta$ and rearrange to obtain
$$F(\rho_+, \rho_-)^{2m} \le 4\delta(1-\delta),$$
or equivalently after taking logarithms
$$m \ge \frac{\log(4\delta(1-\delta))}{\log(F(\rho_+, \rho_-)^2)}.$$
By strong concavity of the fidelity, we have This now implies Thus, we obtain (after Taylor-expanding the logarithm in the denominator)

Proof of Lemma 4
Pick an orthonormal basis $\{|k\rangle\}_{k=1,\dots,n}$ of $\mathbb{C}^n$ s.t. $|\psi\rangle = |0\rangle$ and $|\varphi\rangle = \cos(\phi)|0\rangle + \sin(\phi)|1\rangle$ for an angle $0 \le \phi < 2\pi$. Then, when restricting to the relevant subspace spanned by $|0\rangle$ and $|1\rangle$, we get
$$\rho|_{\mathrm{span}\{|0\rangle,|1\rangle\}} = \begin{pmatrix} \alpha + \beta\cos^2(\phi) & \beta\cos(\phi)\sin(\phi) \\ \beta\cos(\phi)\sin(\phi) & \beta\sin^2(\phi) \end{pmatrix} =: A.$$
We now easily see that $\det(A) = \alpha\beta\sin^2(\phi) \overset{!}{=} \lambda_1\lambda_2$ and $\mathrm{tr}[A] = \alpha + \beta \overset{!}{=} \lambda_1 + \lambda_2$, where $\lambda_1, \lambda_2$ are the two non-zero eigenvalues of ρ. We can solve the second of these two equations for $\lambda_2$ and plug this back into the first equation to obtain $\lambda_1^2 - \lambda_1(\alpha+\beta) + \alpha\beta\sin^2(\phi) = 0$. We now solve this quadratic equation and obtain the two eigenvalues
$$\lambda_{1/2} = \frac{\alpha+\beta}{2} \pm \frac{1}{2}\sqrt{(\alpha-\beta)^2 + 4\alpha\beta\,|\langle\psi|\varphi\rangle|^2},$$
where we used that $|\cos(\phi)| = |\langle\psi|\varphi\rangle|$.

Note that $\forall a \in \{0,1\}^d\ \exists f_a \in \tilde{F}: f_a(s_i) = a_i$ by shattering and that for each $a \in \{0,1\}^d$, $f_a$ is a minimum error concept w.r.t. $\mu_a$ and a concept $f_{\tilde{a}}$ has additional error $\frac{4\varepsilon}{d}\, d_H(a, \tilde{a})$ compared to $f_a$. Hence, in order to solve the learning problem with confidence 1 − δ and accuracy ε, the algorithm A has to output, with probability ≥ 1 − δ, a hypothesis (generated from the training data arising from the underlying string) that when evaluated on S yields a vector that is $\frac{d}{4}$-close to the underlying string in Hamming distance.
Let A be a random variable distributed uniformly on {0, 1} d (corresponding to the unknown underlying string a). Let B = B 1 . . . B m be the training data with each example generated independently from μ a described by the quantum ensemble ...,d, b=0,1 , or, equivalently, by the quantum state In particular, the composite system of underlying string and corresponding training data is described by the quantum state We follow the information-theoretic proof strategy from Arunachalam and de Wolf (2018), i.e., we first show a lower bound on the mutual information I (A : B) which arises from the learning requirement, then observe that I (A : B) ≤ m · I (A : B 1 ) and finally upper bound the mutual information I (A : B 1 ).
First for the mutual information lower bound. Let h(B) ∈ {0, 1} d denote the label vector assigned to S by the hypothesis produced by the learner upon input of training data B. Let . If Z = 1, then by the above deliberations we conclude d H (A, h(B)) ≤ in particular I (A : B) ≥ Ω(d). (Here we use our assumption on δ.) Now we show I (A : B) ≤ m · I (A : B 1 ). We reproduce the reasoning provided in Arunachalam and de Wolf (2018) for completeness: Here, the first step is by definition, the second uses the product structure of the subsystem B, the third follows from subadditivity of the entropy and the last is again by definition.
And finally, we prove an upper bound on I (A : B 1 ). To this end, we have to study the reduced state More precisely, we have and thus have to study the entropies of σ AB 1 as well as those of the reduced states σ A and σ B 1 . As A ∼ Uniform {0, 1} d , we have S(A) = d. Now we consider the reduced state Here, we have By Lemma 4 we know that 1 2d |ψ 0 ψ 0 | + 1 2d |ψ 1 ψ 1 | has non-zero eigenvalues μ 1/2 = 1 2d (1 ± | ψ 0 |ψ 1 |) and due to the block-diagonal structure of σ B 1 we conclude that the non-zero eigenvalues of σ B 1 are also μ 1/2 , each of multiplicity d. In particular, we have Similarly, we see that the non-zero eigenvalues of σ AB 1 are each of multiplicity d · 2 d and that therefore If we combine these expressions for the different entropies, we obtain We now use Taylor's theorem to understand the scaling of the different terms with ε. First, we have (by Taylorexpanding Moreover, using the Taylor expansions we now obtain Plugging these approximations back in gives us

Now combining our mutual information lower and upper bounds yields
which after rearranging becomes as desired.
Detailed Proof of Lemma 5
As F is non-trivial, there exist $f_1, f_2 \in F$ and $x_1, x_2 \in X$ s.t. $f_1(x_1) = f_2(x_1) = \sigma_0$ and $f_1(x_2) = \sigma_0 \ne \sigma_1 = f_2(x_2)$. Now consider the distribution μ on X defined by where λ ∈ (0, 1) is to be chosen later in the proof. The risk of a hypothesis $h \in D^X$ w.r.t. μ if the target concept is $f_i$ is given by So if we choose $\lambda = \frac{2\varepsilon}{\|\sigma_0-\sigma_1\|_1} < 1$, then the learning requirement for A implies that with probability ≥ 1 − δ, A correctly identifies whether the target concept is $f_1$ or $f_2$. As the algorithm has access to the underlying distribution only via the training data, this means that the algorithm has to be able to distinguish the corresponding training data ensembles with probability ≥ 1 − δ. Here, we observe that the training data being drawn i.i.d. according to μ ± is equivalent to the learning algorithm having access to m copies of the state The optimal success probability for distinguishing between two quantum states is a well-studied object in quantum information theory. It can be characterized by the trace distance between the two states and is given (in our case) by (see Nielsen and Chuang 2009)
$$p_{opt} = \frac{1}{2}\left(1 + \frac{1}{2}\left\|\rho_1^{\otimes m} - \rho_2^{\otimes m}\right\|_1\right).$$
As the trace distance of tensor products is not that easy to deal with, we will instead work with the fidelity defined as
$$F(\rho, \sigma) := \mathrm{tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}.$$
According to the Fuchs-van de Graaf inequalities (see Nielsen and Chuang 2009, Section 9.2.3) we have
$$\frac{1}{2}\left\|\rho_1^{\otimes m} - \rho_2^{\otimes m}\right\|_1 \le \sqrt{1 - F(\rho_1^{\otimes m}, \rho_2^{\otimes m})^2} = \sqrt{1 - F(\rho_1, \rho_2)^{2m}},$$
where the last step uses multiplicativity of the fidelity under tensor products. Now we require $p_{opt} \ge 1 - \delta$ and rearrange to obtain
$$F(\rho_1, \rho_2)^{2m} \le 4\delta(1-\delta),$$
or equivalently after taking logarithms
$$m \ge \frac{\log(4\delta(1-\delta))}{\log(F(\rho_1, \rho_2)^2)}.$$
Now we use again the Fuchs-van de Graaf inequalities which tell us (after rearranging) .
f a (s 0 ) = 0 and f a (s i ) = a i ∀1 ≤ i ≤ d.
Observe that w.r.t. a distribution μ and target concept $f_a$, another concept $f_b$ has error So if we pick $\lambda = \frac{8\varepsilon}{\|\sigma_0-\sigma_1\|_1}$, then by the learning requirement, with probability ≥ 1 − δ, A has to output a hypothesis h that when evaluated on S yields a label vector that is $\frac{d}{4}$-close to the true underlying string in Hamming distance.
Denote by A ∼ Uniform {0, 1} d a random variable describing the unknown underlying string, let B = B 1 . . . B m be the corresponding quantum training data system. We want to repeat the three-step reasoning from the proof of Theorem 3. The first two steps work exactly as before.
Step 3 will be slightly different. Again we have In this case, the relevant composite state is |s j s j | ⊗ σ a j .
We now again use Lemma 4 to compute eigenvalues and thus entropies. (Here our assumption that $\sigma_0$ and $\sigma_1$ are pure enters the proof.) We obtain
- Each $\rho_a$ has non-zero eigenvalues $1 - \lambda$ of multiplicity 1 and $\frac{\lambda}{d}$ of multiplicity d.
With this we can now compute the relevant entropies and obtain as well as Hence, we now have Now we can finish the proof by combining steps 1, 2 and 3 as before.

Appendix 2. A physical motivation for our notion of risk
In our definition of the risk $R_\mu$ we use the trace distance. As the latter is a well-established measure of distinguishability of quantum states, it presents itself as a natural candidate loss function. Here, we give a more explicit operational reasoning as to why we choose to use the trace distance. Imagine the learning task as a competition between two parties, a learner and a teacher. We assume that both parties obey the laws of quantum physics. The teacher knows (a classical description of) the probability distribution μ ∈ Prob(X × D) and will provide corresponding training data to the learner during a training phase. The learner's goal is to persuade the teacher in a test phase that she has managed to learn the distribution μ, which was unknown to her in advance, i.e., that she has produced a good hypothesis $h : X \to D$.

We first give an informal description of the test phase: The teacher prepares another (independent) example (x, ρ) drawn from μ. She then sends x to the learner. The latter applies her hypothesis h to prepare the quantum state h(x) which she then sends back to the teacher. The teacher now uses this one copy of h(x) and her knowledge of μ to evaluate whether the learner made a good prediction. As the teacher is also restricted by quantum theory, she can only do so by performing a measurement.
We now discuss the choice of measurement of the teacher in more detail. On the one hand, the teacher wants to maximize the probability of detecting a wrong prediction. On the other hand, she does not want to be unfair, so at the same time she tries to maximize the probability of detecting where $\sigma_i = \rho$ and $\sigma_j \in D \setminus \{\rho\}$. As she knows (a classical description of) the state ρ ∈ D and that h(x) ∈ D, she can achieve this by picking $\{E_{accept}, E_{reject}\}$ to be the optimal measurement for minimum error discrimination of D (where the states are taken with equal prior probabilities; see Watrous 2018, Theorem 3.4). The measurement is basically the same independently of whether $\rho = \sigma_1$ or $\rho = \sigma_2$, only the outcome labels are interchanged. Now the expected probability of the teacher rejecting the learner's prediction is With this we now obtain when comparing the achieved with the optimal expected rejection probability So we have recovered our notion of risk, at least in the case of states of equal purity, from a more basic analysis of the test phase.

Note that a similar analysis could be performed also in the case of more than two quantum labels. There, the teacher's measurement would be the optimal measurement for minimum error discrimination of ρ and $\frac{1}{|D|-1}\sum_{\sigma \in D\setminus\{\rho\}} \sigma$. Unfortunately, no closed-form expressions for the corresponding success probabilities are known. We do, however, see that in this scenario, using the trace distance as loss function would be too pessimistic from the perspective of the learner. As the teacher does not know the prediction state prepared by the learner, the teacher has to solve a state discrimination problem taking into account all possible label states.

Appendix 3. The Holevo-Helstrom strategy
The naive learning strategy based on the Holevo-Helstrom measurement is the following: The remainder of this section is devoted to studying the performance of this simple learning procedure. Note that we leave open for now the classical learning algorithm to be used; we first work towards characterizing the true risk $R_\mu(h)$ in terms of the intermediate classical risk $\tilde{R}_\nu(g)$.
In the following we will often make use of the fact that when identifying $i \leftrightarrow \sigma_i$, the probability measure μ on X × D gives rise to a probability measure on X × {0, 1}. We will abuse notation and also denote the latter measure by μ; which measure is meant will always be clear from the context.
Lemma C.1 With the notation as in the Holevo-Helstrom strategy (in particular $h(x) = \sigma_{g(x)}$) it holds that

Proof This can be shown by direct computation using the definition of ν: Now we use the specific property of the Holevo-Helstrom measurement that tr and |g(x)| = g(x). Thus, we obtain where the last step uses $h(x) = \sigma_{g(x)}$.

This allows us to easily compare the true and the intermediate risk and obtain As g(x) ∈ {0, 1} ∀x ∈ X and in particular $0 \le E_{\mu_1}[g] \le 1$, this gives rise to the following

Corollary 2 With the notation as in the Holevo-Helstrom strategy it holds that

space X and classical target space {0, 1}, a concept class $F \subset \{0,1\}^X$, a probability measure μ ∈ Prob(X), and noise probabilities $0 \le \eta_0, \eta_1 < \frac{1}{2}$, with which labels are flipped. Moreover, we will work with the 0-1-loss function and denote the corresponding risk of a hypothesis h w.r.t. a target concept f by $\mathrm{err}_\mu(h; f) = \mu[h(x) \ne f(x)]$. Finally, any training data sample S splits the concept class F into so-called S-equivalence classes, where $f_1, f_2 \in F$ are equivalent if and only if $f_1(x) = f_2(x)$ ∀x ∈ X s.t. ∃y ∈ {0, 1} with (x, y) ∈ S.
The basic learning strategy underlying our discussion is Algorithm 1. It is the natural analog of searching for a consistent function in the case of noisy labels. Namely, as such a consistent function will in general not exist, it searches for a function that disagrees with the training data on as few examples as possible.
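The minimum disagreement idea can be sketched as follows. This is a simplified illustration under our own assumptions, not a transcription of Algorithm 1: the first `m1` examples are used only to form S-equivalence classes (one representative per label pattern), and the remaining examples are used to pick the representative with the fewest disagreements. The threshold concept class, the toy sample and all names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[int, int]          # (instance, possibly flipped 0/1 label)
Concept = Callable[[int], int]


def equivalence_classes(concepts: Sequence[Concept],
                        first_part: Sequence[Example]) -> List[Concept]:
    """One representative per label pattern on the instances of first_part."""
    xs = [x for x, _ in first_part]
    reps: Dict[Tuple[int, ...], Concept] = {}
    for f in concepts:
        reps.setdefault(tuple(f(x) for x in xs), f)
    return list(reps.values())


def disagreements(f: Concept, sample: Sequence[Example]) -> int:
    """Number of examples whose (noisy) label differs from f's prediction."""
    return sum(f(x) != y for x, y in sample)


def minimum_disagreement(concepts: Sequence[Concept],
                         sample: Sequence[Example], m1: int) -> Concept:
    """Group on the first m1 examples, then test on the remaining ones."""
    first, second = sample[:m1], sample[m1:]
    reps = equivalence_classes(concepts, first)
    return min(reps, key=lambda f: disagreements(f, second))


def threshold(t: int) -> Concept:
    return lambda x: int(x >= t)


# Toy run: true threshold 5 on instances 0..9, the label of instance 9 is flipped.
concepts = [threshold(t) for t in range(11)]
sample = [(1, 0), (4, 0), (7, 1),
          (0, 0), (2, 0), (3, 0), (5, 1), (6, 1), (8, 1), (9, 0)]
h = minimum_disagreement(concepts, sample, m1=3)
```

No consistent concept exists for this toy sample because of the flipped label, which is exactly the situation in which a consistency-based search fails and a minimum disagreement search still returns a hypothesis.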
Laird's original proof that this algorithm solves the PAC learning problem is for the case η 0 = η 1 . It is, however, easily generalized to our case because we still assume the same noise bound on both error rates. (We only have to adapt the expression for the error rate and the corresponding Hoeffding bounds.) In order to apply the reasoning by Hanneke (2016) we need to slightly reformulate the result of this algorithm s.t. we obtain a bound on the error in terms of the sample size. When following the proof of Theorem 5.7 in Laird (1988) we see that m 1 is used to ensure that there is a hypothesis which performs better than some given error threshold and m 2 is used to ensure that such a hypothesis is actually chosen. In particular, if we use the error bound by Blumer et al. (1989) in terms of the sample size, we see that m 2 depends on m 1 as follows: Remark 5 Note that we cannot directly use the tighter error bound in terms of the sample complexity proved by Hanneke (2016) here because Laird's proof explicitly makes use of the strategy employed by Blumer et al. (1989) which works via consistency with a given training sample.
We can now easily bound If we now further assume that δ > 0 is chosen s.t. log 2 δ > 2d log d 2e , then we can continue upper bounding this and obtain m = m 1 + m 2 ≤ (1 + C(η b ))m 1 , where we defined C(η b ) := . It is easy to check that for 0 ≤ η b < 1 2 , C(η b ) ≤
Hence, using a sample of size $m \ge 2(1 + C(\eta_b))$ for the minimum disagreement strategy with $m_2 = \frac{C(\eta_b)}{1+C(\eta_b)}\, m$ and $m_1 = m - m_2$ gives (using $\frac{m}{2(1+C(\eta_b))} \le m_1 \le \frac{m}{1+C(\eta_b)} \le \frac{m_2}{C(\eta_b)}$) an error guarantee of
$$\mathrm{err}_\mu(h; f^*) \le \frac{4}{m_1}\left(d \log\frac{2em_1}{d} + \log\frac{2}{\delta}\right). \qquad (4.1)$$
With this suboptimal base learner we will now follow the strategy by Hanneke (2016) in order to build a better learner from it. Note that Hanneke's proof includes several steps in which the existence of a function consistent with the respective subsample is ensured. This is not necessary in our case because the minimum disagreement strategy does not require a consistent function to exist.
We recall the algorithm for preprocessing the training data to generate subsamples as introduced in Hanneke (2016) in our Algorithm 2.

Proof This proof is analogous to the proof of Theorem 2 in Hanneke (2016) with some minor simplifications and adaptations and is given here only for the sake of completeness.
Fix an $f^* \in F$ and a probability measure μ over X. Denote by $S = S_{1:m}$ the corresponding noisy training data. For any classifier h denote by $ER(h) = \{x \in X \mid h(x) \ne f^*(x)\}$ the set of instances on which h errs.
Note that $S_0 = S_{1:(m-3\lfloor m/4\rfloor)}$. As m ≥ 4, $1 \le m - 3\lfloor m/4\rfloor < m$. Also, $h_i = \hat{h}_{(m-3\lfloor m/4\rfloor),\,T_i}$. So by the induction hypothesis applied under the conditional distribution given $S_1, S_2, S_3$, which are independent of $S_0$, combined with the law of total probability, for every i ∈ {1, 2, 3} there exists an event $E_i$ of probability $\ge 1 - \frac{\delta}{9}$ on which
$$\mu[ER(h_i)] \le \frac{c\,C(\eta_b)}{1 + |S_0|}\left(d + \ln\frac{9\cdot 18}{\delta}\right) \le \frac{4c\,C(\eta_b)}{m}\left(d + \ln\frac{9\cdot 18}{\delta}\right). \qquad (4.4)$$
Next, fix an i ∈ {1, 2, 3} and write $\{(\tilde{X}_{i,1}, \tilde{Y}_{i,1}), \dots, (\tilde{X}_{i,N_i}, \tilde{Y}_{i,N_i})\} := S_i \cap (ER(h_i) \times Y)$. As $h_i$ and $S_i$ are independent, $\tilde{X}_{i,1}, \dots, \tilde{X}_{i,N_i}$ are conditionally independent given $h_i$ and $N_i$. Therefore, we can apply the error bound (4.2) for our base learner L under the conditional distribution given $h_i$ and $N_i$ to conclude: There exists an event $E_i'$ of probability $\ge 1 - \frac{\delta}{9}$ s.t., if $N_i > 0$, then the output h of the base learner L upon input of $S_i \cap (ER(h_i) \times Y)$ satisfies If we now combine this with (4.4) and (4.7), then we see: