1 Introduction

The fields of machine learning and of quantum computation provide new ways of looking at computational problems and have seen a significant increase in academic as well as practical interest since their origins in the 1970s and 1980s. More recently, attention was directed to paths for combining ideas from these two fruitful research areas. This gave rise to new approaches under different names such as “quantum machine learning” or “quantum learning theory”.

In classical statistical learning theory, one of the most influential frameworks is that of probably approximately correct (PAC) learning due to Vapnik and Chervonenkis (1971) and Valiant (1984). It is particularly well studied for the task of binary classification. For this problem, the so-called VC-dimension (Vapnik and Chervonenkis 1971) is known to characterize the sample complexity of learning a function class (Blumer et al. 1989; Hanneke 2016). Motivated by these strong theoretical results, a quantum analog of this problem was soon defined and studied in a series of papers (an overview of which is given in Arunachalam and de Wolf (2017)), culminating in the results of Arunachalam and de Wolf (2018). There it is shown that the information-theoretic complexity of quantum PAC learning a 0-1-valued function class is characterized by the VC-dimension in exactly the same way as in the classical scenario.

The scenario studied in Arunachalam and de Wolf (2018) assumes the training data available to the learner to be given in a specific quantum form and allows the learner to perform quantum computational operations on that training data. The functions to be learned, however, still map classical inputs to classical outputs. We propose a different quantum version of the binary classification task by not only considering the possibility of quantum training data but by allowing the objects to be learned to be inherently quantum. More specifically, we consider functions that map classical inputs to one of two possible quantum output states (“quantum labels”). These maps describe state preparation procedures. A more general learning task of this type, for which our problem can be seen as a toy model, could be relevant for cases in which state preparation is either costly or time-consuming, e.g., preparing thermal states at low temperatures (see Brandão and Kastoryano 2019; Chowdhury 2020, and references therein). Here, one could first produce sample data, learn a predictor, and then reproduce the preparation more efficiently using the predictor.

1.1 Main results

We consider maps \(f:{\mathscr{X}}\to \{ \sigma _{0},\sigma _{1}\}\) that assign to points in a classical input space \({\mathscr{X}}\) one of two labelling quantum states {σ0,σ1}. (Here, σ0 and σ1 are, in general, mixed states described by density matrices.) Let \({\mathscr{F}}\) be a function class consisting of such functions. We assume the training data to be given as a classical-quantum state about which, according to the laws of quantum theory, we can only gain information by performing measurements.

Our learning model is that of PAC-learning with accuracy ε and confidence δ. Here, we require a learning algorithm, given as input classical-quantum training data generated according to some unknown underlying distribution, to output with probability ≥ 1 − δ over the choice of training data a hypothesis that achieves accuracy ε. (Accuracy is measured in terms of the trace distance.)

We present a learning strategy that (ε,δ)-PAC learns \({\mathscr{F}}\subseteq \{ f:{\mathscr{X}}\to \{ \sigma _{0},\sigma _{1}\}\}\) in the agnostic scenario from classical-quantum training data of size \({\mathscr{O}}\left (\frac {d}{\varepsilon ^{2}} + \frac {\log {1}/{\delta }}{\varepsilon ^{2}}\right )\), where d is the VC-dimension of the {0,1}-valued function class \(\tilde {{\mathscr{F}}}\subseteq \{\tilde {f}:{\mathscr{X}}\to \{0,1\}\}\) induced by \({\mathscr{F}}\) via the identification \(\sigma _{i}\mapsto i\), i = 0,1. Here, “agnostic” means that there need not be a function in \({\mathscr{F}}\) that would achieve perfect accuracy. We also show that solving this learning problem requires training data size \({\varOmega }\left (\frac {d}{\varepsilon ^{2}} + \frac {\log {1}/{\delta }}{\varepsilon ^{2}} \right )\), so our strategy is optimal w.r.t. the sample complexity dependence on ε, δ and d.

For the realizable scenario of the quantum learning problem, i.e., under the assumption that perfect accuracy can be achieved using \({\mathscr{F}}\), we prove a sample complexity upper bound of

$$ \mathscr{O}\left( \frac{1}{\varepsilon (1-2\max\lbrace \text{tr}[E_{0}\sigma_{1}],\text{tr}[E_{1}\sigma_{0}]\rbrace)^{2}} \left( d + \log{1}/{\delta}\right)\right), $$

where {E0,E1} is the Holevo-Helstrom measurement for distinguishing σ0 and σ1, and a sample complexity lower bound of \({\varOmega }\left (\frac {d}{\varepsilon } + \frac {\log {1}/{\delta }}{\varepsilon }\right )\). Here too, these bounds coincide w.r.t. their dependence on ε, δ and d. The prefactor \((1-2\max \lbrace \text {tr}[E_{0}\sigma _{1}],\text {tr}[E_{1}\sigma _{0}]\rbrace )^{-2}\) in the upper bound comes from our procedure trying to distinguish σ0 and σ1 by measuring single copies. (Note: Even though we formulate this in terms of the Holevo-Helstrom measurement, we could use any other two-outcome POVM \(\{ \tilde {E}_{0},\tilde {E}_{1}\}\) that satisfies \(\max \lbrace \text {tr}[\tilde {E}_{0}\sigma _{1}],\text {tr}[\tilde {E}_{1}\sigma _{0}]\rbrace <{1}/{2}\).)

In proving the sample complexity upper bound for the realizable scenario, we combine algorithms from Laird (1988) and Hanneke (2016) to show that \({\mathscr{O}}\left (\frac {1}{\varepsilon (1-2\eta _b)^{2}}\left (d + \log {1}/{\delta }\right )\right )\) classical examples with two-sided classification noise, i.e., in which each label is flipped with probability given by a noise rate, suffice for classical (ε,δ)-PAC learning a function class of VC-dimension d in the realizable scenario if the noise rate is bounded by ηb < 1/2. This upper bound has, to the best of our knowledge, not been observed before and, when combined with the lower bound from Arunachalam and de Wolf (2018), establishes the optimal sample complexity of this classical noisy learning problem.

As is common in statistical learning theory, our main focus lies on the information-theoretic complexity of the learning problem, i.e., the necessary and sufficient number of quantum examples, whereas we do not discuss the computational complexity. Our proposed strategies are “semi-classical” in the following sense: After initially performing simple tensor product measurements, in which each tensor factor is a two-outcome POVM, the remaining computation is done by a classical learning algorithm. In particular, the procedure does not require (possibly hard to implement) joint measurements and its computational complexity will be determined by the (classical) computational complexity of the classical learner used as a subroutine.

1.2 Overview over the proof strategy

We first sketch how we obtain the sample complexity upper bounds. We propose a simple (semi-classical) procedure that consists of first performing local measurements on the quantum part of the training data examples to obtain classical training data and then applying a classical learning algorithm.

We observe that the learning problem to which the classical learner is applied can then be viewed as a classical binary classification problem with two-sided classification noise, i.e., one in which the labels are flipped with certain error probabilities determined by the outcome probabilities of the performed quantum measurements. Therefore, we have reduced our problem to obtaining sample complexity upper bounds for a classical learning problem with noise.

In the general (so-called agnostic) case, we can use known sample complexity bounds formulated in terms of a complexity measure called Rademacher complexity to show that classical empirical risk minimization w.r.t. a suitably modified loss function (as suggested in Natarajan et al. 2013) achieves optimal sample complexity for this classical learning problem with noise.

In the realizable case, i.e., under the assumption that any non-noisy training data set can be perfectly represented by some hypothesis in our class \(\tilde {{\mathscr{F}}}\), the optimal sample complexity for binary classification with two-sided classification noise has not been established in the literature. We combine ideas from Laird (1988) and Hanneke (2016) to exhibit an algorithm that achieves information-theoretic optimality for this scenario.

To obtain the sample complexity lower bounds, we apply ideas from Arunachalam and de Wolf (2018). Namely, we observe that for sufficiently small accuracy parameter, any quantum strategy that solves our learning problem indeed has to be able to distinguish between the possible different training data states with high success probability.

In the simple case of distinguishing between two quantum states, arising from two different “hard-to-distinguish” underlying distributions, this probability can be upper bounded in terms of the trace distance of the states. In the more general case of many states, we do not study this success probability directly. Instead, we consider the information contained in the quantum training data about the choice of the underlying distribution, again chosen out of a set of “hard-to-distinguish” distributions.

1.3 Related work

Bshouty and Jackson (1998) introduced a notion of quantum training data for learning problems with classical concepts and used it to learn DNF (Disjunctive Normal Form) formulae w.r.t. the uniform distribution. This was extended to product distributions by Kanade et al. (2019). Using ideas from Fourier-based learning, this type of quantum training data was also studied in the context of fixed-distribution learning of Boolean linear functions (Bernstein and Vazirani 1993; Cross et al. 2015; Ristè et al. 2017; Grilo et al. 2017; Caro 2020), juntas (Atıcı and Servedio 2007), and Fourier-sparse functions (Arunachalam et al. 2019a). Arunachalam and de Wolf (2017) and Arunachalam et al. (2019b) study the limitations of these quantum examples. A broad overview of work on quantum learning of classical functions is given in Arunachalam and de Wolf (2017).

A quantum counterpart can also be considered for the model of learning from membership queries. Servedio and Gortler (2004) showed that the number of required classical queries is at most polynomially larger than the number of required quantum queries. Recently, this polynomial relation was improved upon in Arunachalam et al. (2019a). A more specific scenario, namely that of learning multilinear polynomials more efficiently from quantum membership queries, is studied in Montanaro (2012).

Similarly, a quantum counterpart of the classical model of statistical query learning can be defined. This was recently studied in Arunachalam et al. (2020).

Another line of research at the intersection of learning theory and quantum information focuses on applying classical learning to concept classes arising from quantum theory, e.g., from states or measurements. This was initiated by Aaronson (2007) and studied further by Cheng et al. (2016), Aaronson (2018), and Aaronson et al. (2018).

Our learning model is similar to the one studied in Chung and Lin (2018). Also there, the inputs are assumed to be classical and the outputs are quantum states. The crucial difference to our scenario is that we assume that there are only two possible label states and these are known in advance. In Chung and Lin (2018), there can be a continuum of possible label states.

Our additional assumption allows us to study infinite function classes \({\mathscr{F}}\), whereas the results in Chung and Lin (2018) are for classes of finite size. (We expect that the reasoning of Chung and Lin (2018) can be extended to infinite classes using the so-called “growth function” when restricting to a finite set of possible target states. This might lead to a learning procedure that can be applied in our scenario without prior knowledge of the possible quantum label states.) As a further difference between the approaches, whereas the strategy of Chung and Lin (2018) requires the ability to perform measurements in random orthonormal bases, the measurements in our procedures can be taken to be fixed and of product form and are thus potentially easier to implement.

The classical problems to which our quantum learning problems are reduced are problems of learning from noisy training data. These were first proposed by Angluin and Laird (1988) and Laird (1988) and studied further, e.g., by Aslam and Decatur (1996), Cesa-Bianchi et al. (1999), and Natarajan et al. (2013).

1.4 Structure of the paper

In Section 2 we recall some notions from learning theory as well as from quantum information and computation. The central learning problem of this contribution is formulated in Section 3. Section 4 exhibits strategies for solving the task and establishes sample complexity upper bounds. In doing so, we derive a tight upper bound on the sample complexity of classical binary classification with two-sided classification noise (see the Appendix). The quantum sample complexity upper bounds are complemented by lower bounds in Section 5. We conclude with open questions and the references.

2 Preliminaries

2.1 Basics of quantum information and computation

A finite-dimensional quantum system is described by a (mixed) state and mathematically represented by a density matrix of some dimension \(d\in \mathbb {N}\), i.e., an element of \({\mathscr{S}}(\mathbb {C}^d):=\lbrace \rho \in \mathbb {C}^{d\times d}\ |\ \rho \geq 0, \text {tr}[\rho ]=1\rbrace \). Here, ρ ≥ 0 means that ρ is a self-adjoint and positive semidefinite matrix. The extreme points of the convex set \({\mathscr{S}}(\mathbb {C}^d)\) are the rank-1 projections, the pure states. We employ Dirac notation to denote a unit vector \(\psi \in \mathbb {C}^d\) also by \(|\psi \rangle \in \mathbb {C}^d\) and the corresponding pure state by |ψ〉〈ψ|.

To make an observation about a quantum system, a measurement has to be performed. Measurements are built from the set of effect operators \({\mathscr{E}}(\mathbb {C}^d):=\lbrace E\in \mathbb {C}^{d\times d}\ |\ 0\leq E\leq \mathbb {1}_d\rbrace \). For our purposes it suffices to consider a measurement as a collection \(\lbrace E_{i}\rbrace _{i=1}^{\ell }\) of effect operators \(E_{i}\in {\mathscr{E}}(\mathbb {C}^d)\) s.t. \({\sum }_{i=1}^{\ell } E_{i}=\mathbb {1}_d\). (For the more general notion of a POVM see Nielsen and Chuang (2009) or Heinosaari and Ziman (2012).) When performing a measurement \(\lbrace E_{i}\rbrace _{i=1}^{\ell }\) on a state ρ, output i is observed with probability tr[Eiρ]. A projective measurement is one where the effect operators are rank-1 projections, i.e., there exists an orthonormal basis \(\lbrace |i\rangle \rbrace _{i=1}^d\) s.t. Ei = |i〉〈i|.
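The outcome probabilities tr[Eiρ] described above can be evaluated numerically. The following is a minimal sketch, assuming NumPy is available; the function name `outcome_probabilities` and the chosen qubit example are illustrative and not part of the original text.

```python
import numpy as np

def outcome_probabilities(effects, rho):
    """Return the probabilities tr[E_i rho] for each effect operator E_i."""
    return np.array([np.trace(E @ rho).real for E in effects])

# Illustrative example: measure the pure state |+><+| in the computational basis.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho_plus = np.outer(plus, plus.conj())
computational_basis = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
probs = outcome_probabilities(computational_basis, rho_plus)
```

Since the effect operators sum to the identity and ρ has unit trace, the returned probabilities sum to one.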

When multiple quantum systems with spaces \(\mathbb {C}^{d_i}\) are considered, the composite system is described by the tensor product \(\bigotimes _{i=1}^n \mathbb {C}^{d_i}\simeq \mathbb {C}^{{\prod }_i d_i}\) and the set of states becomes \({\mathscr{S}} (\bigotimes _{i=1}^n \mathbb {C}^{d_i} )\). Given a state \(\rho _{AB}\in {\mathscr{S}}(\mathbb {C}^{d_A}\otimes \mathbb {C}^{d_B})\) of a composite system, we can obtain states of the subsystems as partial traces ρA = trB[ρAB], ρB = trA[ρAB]. Here, the partial trace \(\text {tr}_B\) is defined as satisfying the relation \(\text {tr}[\text {tr}_B[\rho _{AB}]A]=\text {tr}[\rho _{AB}(A\otimes \mathbb {1}_{d_B})]\) for all \(A\in \mathbb {C}^{d_A\times d_A}\), and analogously for \(\text {tr}_A\).
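As an illustration of the partial trace, the sketch below computes reduced states of a bipartite density matrix. It assumes NumPy and the Kronecker-product ordering of `np.kron` for the tensor product; the helper name `partial_trace` is hypothetical.

```python
import numpy as np

def partial_trace(rho_ab, d_a, d_b, keep):
    """Reduced state of a bipartite density matrix on C^{d_a} (x) C^{d_b}.

    keep="A" traces out subsystem B; keep="B" traces out subsystem A.
    """
    # Index order after reshaping: (row_A, row_B, col_A, col_B).
    tensor = rho_ab.reshape(d_a, d_b, d_a, d_b)
    if keep == "A":
        return np.einsum("ibjb->ij", tensor)
    return np.einsum("aiaj->ij", tensor)

# Product state: the reduced states are the original factors.
rho_a = np.diag([0.25, 0.75])
rho_b = 0.5 * np.ones((2, 2))
product_state = np.kron(rho_a, rho_b)

# Maximally entangled two-qubit state: both reduced states are maximally mixed.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho_bell = np.outer(bell, bell.conj())
```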

The dynamics of a quantum system are usually described by unitary evolution or, more generally, by quantum channels. For our purposes, these dynamics will not have to be discussed explicitly since they can be considered as part of the performed measurement by changing to the so-called Heisenberg picture (see Nielsen and Chuang 2009). We will take this perspective in proving our sample complexity lower bounds because it allows us to restrict our attention to proving limitations of measurements rather than of channels.

We will also make use of some standard entropic quantities which have been generalized from their classical origins (Shannon 1948) to the realm of quantum theory. We denote the Shannon entropy of a random variable X with probability mass function p by \(H(X)=-{\sum }_x p(x)\log (p(x))\), the conditional entropy of a random variable Y given X as \(H(Y|X)=-{\sum }_{x,y} p(x,y) \log \left (\frac {p(x,y)}{p(x)}\right )\) and the mutual information between X and Y as I(X : Y ) = H(X) + H(Y ) − H(X,Y ). Similarly, the von Neumann entropy of a quantum state ρ will be denoted as \(S(\rho )=-\text {tr}[\rho \log \rho ]\) and the mutual information for a bipartite quantum state ρAB as I(ρAB) = I(A : B) = S(ρA) + S(ρB) − S(ρAB). All the standard results and inequalities connected to these quantities which appear in our arguments can be found in Nielsen and Chuang (2009) or in Wilde (2013).
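The entropic quantities above can be computed directly from their definitions. The sketch below assumes NumPy and uses logarithms to base 2, which is a choice made here for illustration since the text does not fix the base; all function names are hypothetical.

```python
import numpy as np

def shannon_entropy(p):
    """H = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def von_neumann_entropy(rho):
    """S(rho) = -tr[rho log2 rho], computed from the eigenvalues of rho."""
    eigenvalues = np.linalg.eigvalsh(rho)
    # Discard numerically zero eigenvalues before taking logarithms.
    return shannon_entropy(eigenvalues[eigenvalues > 1e-12])

def classical_mutual_information(joint):
    """I(X:Y) = H(X) + H(Y) - H(X,Y) for a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    h_x = shannon_entropy(joint.sum(axis=1))
    h_y = shannon_entropy(joint.sum(axis=0))
    return h_x + h_y - shannon_entropy(joint.ravel())
```

For instance, two perfectly correlated uniform bits have one bit of mutual information, whereas two independent bits have none.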

2.2 Basics of the PAC framework and the binary classification problem

The setting of Probably Approximately Correct (PAC) learning was introduced by Vapnik and Chervonenkis (1971) and Valiant (1984). The general setting is as follows: Let \({\mathscr{X}}, {\mathscr{Y}}\) be input and output space, respectively, let \({\mathscr{F}}\subset {\mathscr{Y}}^{{\mathscr{X}}}\) be a class of functions, a concept class, and let \(\ell :{\mathscr{Y}}\times {\mathscr{Y}}\to \mathbb {R}_+\) be a loss function. A learning algorithm (to which \({\mathscr{X}},{\mathscr{Y}},{\mathscr{F}}\) and \(\ell \) are known) has access to training data of the form \(S=\lbrace (x_{i},y_{i})\rbrace _{i=1}^{m}\), where (xi,yi) are drawn i.i.d. from a probability measure \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{Y}})\). Moreover, the learner is given as input a confidence parameter δ ∈ (0,1) and an accuracy parameter ε ∈ (0,1). Then a learner must output a hypothesis \(h\in {\mathscr{Y}}^{{\mathscr{X}}}\) s.t., with probability ≥ 1 − δ w.r.t. the choice of training data,

$$ \mathbb{E}_{(x,y)\sim\mu}[\ell(y,h(x))] \leq \underset{f\in\mathscr{F}}{\inf} \mathbb{E}_{(x,y)\sim\mu}[\ell(y,f(x))] + \varepsilon. $$
(2.1)

Note that the first term on the right-hand side vanishes if there exists an \(f^{*}\in {\mathscr{F}}\) s.t. \(\mu (x,y)=\mu _{1}(x)\delta _{y,f^{*}(x)}\) \(\forall (x,y)\in {\mathscr{X}}\times {\mathscr{Y}}\). In this case, we call the learning problem realizable, otherwise we refer to it as agnostic.

Both in the agnostic and in the realizable scenario, a learning algorithm that always outputs a hypothesis \(h\in {\mathscr{F}}\) is called a proper learner, and otherwise it is called improper.

A quantity of major interest is the number of examples featuring in such a learning problem. Given a learning algorithm \({\mathscr{A}}\), the smallest \(m=m(\varepsilon ,\delta )\in \mathbb {N}\) s.t. the learning requirement (2.1) is satisfied with confidence 1 − δ and accuracy ε is called the sample complexity of \({\mathscr{A}}\). The sample complexity of the learning problem is the infimum over the sample complexities of all learning algorithms for the problem. This characterizes, from an information-theoretic perspective, the hardness of a learning problem, but leaves aside questions of computational complexity.

The binary classification problem now arises as a special case from the above if we specify the output space \({\mathscr{Y}}=\lbrace 0,1\rbrace \) and take the loss function to be \(\ell (y,\tilde {y})=1-\delta _{y,\tilde {y}}\), the 0-1-loss. This setting is well studied and a characterization of its sample complexity is known. At its core is the following combinatorial parameter:

Definition 1 (VC-Dimension; Vapnik and Chervonenkis 1971)

Let \({\mathscr{F}}\subseteq \lbrace 0,1\rbrace ^{{\mathscr{X}}}\). A set \(S=\lbrace x_{1},\ldots ,x_{n}\rbrace \subset {\mathscr{X}}\) is said to be shattered by \({\mathscr{F}}\) if for every \(b\in \lbrace 0,1\rbrace ^n\) there exists \(f_b\in {\mathscr{F}}\) s.t. \(f_b(x_{i})=b_{i}\) for all \(1\leq i\leq n\).

The Vapnik-Chervonenkis (VC) dimension of \({\mathscr{F}}\subset \lbrace 0,1\rbrace ^{{\mathscr{X}}}\) is defined to be

$$ \begin{array}{@{}rcl@{}} \text{VCdim}(\mathscr{F}):=\sup\lbrace n\in\mathbb{N}_{0}~|~&\exists S\subset \mathscr{X}~\text{s.t. } |S|=n~\text{and } S~\text{is}\\ &\text{shattered by }\mathscr{F}\rbrace. \end{array} $$
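For a finite class on a finite domain, the definition can be checked by brute force. The following sketch is only meant to illustrate the notion of shattering; the function names and the two toy classes (thresholds and intervals on five points) are assumptions made here and do not appear in the original.

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """A set is shattered if the class realizes all 2^n labelings on it."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(domain, hypotheses):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    best = 0
    for n in range(1, len(domain) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(domain, n)):
            best = n
        else:
            # Subsets of shattered sets are shattered, so we can stop here.
            break
    return best

domain = range(5)
# Threshold functions x -> 1[x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in range(6)]
# Indicator functions of intervals [a, b].
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(5) for b in range(a, 5)]
```

The search is exponential in the domain size and is therefore only suitable for small toy examples.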

The main insight of VC-theory lies in the fact that learnability of a {0,1}-valued concept class is equivalent to finiteness of its VC-dimension. Even more, the sample complexity can be expressed in terms of the VC-dimension. This is the content of the following

Theorem 1

(see, e.g., Blumer et al. 1989; Hanneke 2016; Shalev-Shwartz and Ben-David 2014; Vershynin 2018)

In the realizable scenario, the sample complexity of binary classification for a function class \({\mathscr{F}}\) of VC-dimension d is \(m=m(\varepsilon ,\delta )={{\varTheta }}\left (\frac {1}{\varepsilon }\left (d + \log {1}/{\delta }\right )\right )\).

In the agnostic scenario, the sample complexity of binary classification for a function class \({\mathscr{F}}\) of VC-dimension d is \(m=m(\varepsilon ,\delta )={{\varTheta }}\left (\frac {1}{\varepsilon ^{2}}\left (d + \log {1}/{\delta }\right )\right )\).

The proof of the sample complexity upper bound in the agnostic case typically goes via a different complexity measure, the Rademacher complexity, which is then related to the VC-dimension. As this will reappear later on in our analysis, we also recall this definition here.

Definition 2 (Rademacher Complexity (see 2002))

Let \(\mathcal {Z}\) be some space, \({\mathscr{F}}\subseteq \mathbb {R}^{\mathcal {Z}}\), \(z\in \mathcal {Z}^n\). The empirical Rademacher complexity of \({\mathscr{F}}\) w.r.t. z is

$$ \begin{array}{@{}rcl@{}} \hat{\mathscr{R}}(\mathscr{F}) &:=\underset{\sigma\sim U(\lbrace -1,1\rbrace^{n})}{\mathbb{E}}\left[\underset{f\in\mathscr{F}}{\sup}\frac{1}{n}\sum\limits_{i=1}^{n} \sigma_{i} f(z_{i})\right]\\ &=\underset{\sigma\sim U(\lbrace -1,1\rbrace^{n})}{\mathbb{E}}\left[\underset{f\in\mathscr{F}}{\sup}\frac{1}{n}\langle\sigma,f(z)\rangle\right], \end{array} $$

where \(U(\lbrace -1,1\rbrace ^{n})\) denotes the uniform distribution on \(\lbrace -1,1\rbrace ^{n}\).

If we consider n i.i.d. random variables \(Z_{1},\ldots ,Z_{n}\) distributed according to a probability measure μ on \(\mathcal {Z}\) and write \(Z=(Z_{1},\ldots ,Z_{n})\), the Rademacher complexities of \({\mathscr{F}}\) w.r.t. μ are defined to be \({\mathscr{R}}_n({\mathscr{F}}):=\mathbb {E}_{Z\sim \mu ^n}\left [\hat {{\mathscr{R}}}({\mathscr{F}})\right ]\), \(n\in \mathbb {N}\), where the empirical Rademacher complexity is taken w.r.t. Z.
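For a finite function class and a small sample, the empirical Rademacher complexity can be evaluated exactly by enumerating all sign vectors. The sketch below assumes NumPy and represents the class by a matrix whose rows are the value vectors (f(z_1),...,f(z_n)); this representation and the function name are choices made here for illustration.

```python
import numpy as np
from itertools import product

def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class.

    values[k, i] holds f_k(z_i); the expectation over sigma is computed by
    enumerating all 2^n sign vectors, so n should be small.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        # Supremum over the class of (1/n) <sigma, f(z)>.
        total += np.max(values @ np.array(signs)) / n
    return total / 2 ** n
```

A class consisting of a single function has empirical Rademacher complexity zero, while richer classes can correlate better with random signs and obtain larger values.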

3 The binary classification problem with classical instances and quantum labels

We introduce a generalization of the classical binary classification problem to the quantum realm by allowing the two labels to be quantum states. Thus let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\) be two (possibly mixed) quantum states, write \({\mathscr{D}}=\lbrace \sigma _{0},\sigma _{1}\rbrace \). We assume that classical descriptions of these states (their density matrices) are known to the learning algorithm as well as the fact that only these two quantum labels appear. The class to be learned is now a class of functions \({\mathscr{F}}\subset \{ f:{\mathscr{X}}\to {\mathscr{D}} \}\) and the underlying distribution will be a \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\), where \({\mathscr{X}}\) is some space of classical objects.

We now deviate from the standard PAC setting: We assume the training data to be \(S=\lbrace (x_{i},\rho _{i})\rbrace _{i=1}^{m}\), \(m\in \mathbb {N}\), where the (xi,ρi) are drawn independently according to μ (in particular, \(\rho _{i}\in {\mathscr{D}}\) for all i). Here, the ρi are the actual quantum states, not classical descriptions of them. Therefore, our learning problem is not a classical one: we have to perform measurements on the quantum labels to extract information from them. Equivalently, we represent an example (xi,ρi) drawn from μ as the classical-quantum state

$$\underset{x, \rho}{\sum}\mu (x,\rho)|x\rangle\langle x|\otimes\rho,$$

with \(\lbrace |x\rangle \rbrace _{x\in {\mathscr{X}}}\) orthonormal.

Note that this model for the training data differs from the one introduced by Bshouty and Jackson (1998), where the training data consists of copies of a superposition state. Instead, here we assume copies of a mixture of states. This is done mainly for two reasons: First, it allows us to naturally talk about maps with mixed state outputs. Second, it is debatable whether assuming access to superposition examples as in Bshouty and Jackson (1998) is justified (see, e.g., Ciliberto et al. 2018, section 5), and this problem remains when considering maps with quantum outputs. In contrast, the mixtures assumed in our model arise naturally as statistical ensembles of outputs of state preparation procedures, if the parameters of the preparation are chosen according to some (unknown) distribution. In that sense, the form of classical-quantum training data assumed here is both a straightforward generalization of classical training data, given the standard probabilistic interpretation of mixed states, and can (at least in the realizable scenario) be easily imagined to be obtained as outcome of multiple runs of a state preparation experiment with different parameter settings.

A quantum learner for \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε from m = m(ε,δ) quantum examples has to output, for every \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\), with probability ≥ 1 − δ over the choice of training data of size m according to μ, a hypothesis \(h\in {\mathscr{D}}^{{\mathscr{X}}}\) s.t. \(R_{\mu }(h)\leq \underset {f\in {\mathscr{F}}}{\inf }R_{\mu }(f) + \varepsilon \). As before, we can consider agnostic versus realizable and proper versus improper variants of this learning model.

Here, we define the risk of a hypothesis \(h\in {\mathscr{F}}\) w.r.t. a distribution \(\mu \in \text {Prob}({\mathscr{X}}\times {\mathscr{D}})\) as

$$ R_{\mu}(h):= \int\limits_{\mathscr{X}\times\mathscr{D}} \frac{1}{2} \left\|\rho - h(x)\right\|_{1} ~ \mathrm{d}\mu(x,\rho), $$

where \(\left \|\rho - \sigma \right \|_{1} = \text {tr}[|\rho -\sigma |]=\text {tr}[\sqrt {(\rho -\sigma )^{*}(\rho -\sigma )}]\) is the Schatten 1-norm.

Note that our assumption on \({\mathscr{F}}\) implies that \(h(x)\in {\mathscr{D}}\ \forall x\in {\mathscr{X}}\) and therefore we can easily rewrite

$$R_{\mu} (h)=\frac{\left\|{\sigma_{0}-\sigma_{1}}\right\|_{1}}{2}\mathbb{P}_{(x,\rho)\sim\mu}[h(x)\neq \rho],$$

which is just the 0-1-risk multiplied by a constant. We choose the slightly more complicated looking definition for Rμ(h) for two reasons. On the one hand, \(\frac {\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}}{2}\) is a measure for the distinguishability of σ0 and σ1 and thus a natural scale w.r.t. which to measure the prediction error. (Note: If σ0,σ1 are orthogonal pure states and thus perfectly distinguishable, the classical scenario is recovered.) On the other hand, our definition of risk can be motivated operationally as we discuss in the Appendix.
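The rewritten form of the risk can be illustrated numerically. The sketch below assumes NumPy and, for simplicity, replaces the distribution μ by the uniform distribution over a finite list of labeled examples, with labels 0 and 1 standing for σ0 and σ1; the function names and the example states are hypothetical.

```python
import numpy as np

def trace_distance(rho, sigma):
    """(1/2) * ||rho - sigma||_1, via the eigenvalues of the Hermitian difference."""
    eigenvalues = np.linalg.eigvalsh(rho - sigma)
    return 0.5 * float(np.sum(np.abs(eigenvalues)))

def risk(predicted_labels, true_labels, sigma0, sigma1):
    """Risk under the uniform distribution over the listed examples:
    trace distance of the label states times the 0-1 disagreement rate."""
    predicted = np.asarray(predicted_labels)
    true = np.asarray(true_labels)
    disagreement = float(np.mean(predicted != true))
    return trace_distance(sigma0, sigma1) * disagreement

# Non-orthogonal pure label states |0><0| and |+><+|.
sigma0 = np.diag([1.0, 0.0])
sigma1 = 0.5 * np.ones((2, 2))
```

For orthogonal pure label states the prefactor equals one and the expression reduces to the classical 0-1-risk.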

Example 1

Here, we describe a physically motivated problem that is captured by our scenario. The idea is as follows: Suppose we have available a (possibly complicated) ground state preparation procedure. Using this, we want to prepare a ground state |φ0〉 of a Hamiltonian H. However, H is perturbed by noise about which we have only partial information. We want to learn more about the noise and its influence on the prepared ground state.

We make this idea more concrete. We consider a (self-adjoint) Hamiltonian \(H\in \mathbb {C}^{(d+2)\times (d+2)}\) of the form , where , with (non-unique) ground state \(|\varphi _{0}\rangle :=\begin {pmatrix} 0 & 1\end {pmatrix}^T\oplus 0\). Suppose that we have a ground state preparation procedure that, if run with Hamiltonian H, prepares |φ0〉. When implementing this procedure, we have to fix values of a parameter vector \(x\in \mathbb {R}^D\). (Think, e.g., of D = 3 and x denoting the location at which the experiment is set up.) But due to the laboratory being only imperfectly shielded, there is an unknown region \(R\subset \mathbb {R}^D\) in which the system is subject to noise. For simplicity, we assume that only two types of noise can occur and lead to the location-dependent Hamiltonian , with noise Hamiltonians \(H^{(0)} = \begin {pmatrix} 1 & 0 \\ 0 & -1 \end {pmatrix}\oplus 0\), \(H^{(1)} = \begin {pmatrix} 0 & 1 \\ 1 & 0\end {pmatrix}\oplus 0\).

The noise can lead to a perturbation of the ground state. Namely:

  • For x ∉ R, |φ0〉 is a ground state of \(H^{(i)}_x\). (This is the case of no effective noise.)

  • For x ∈ R, |φ0〉 is the unique ground state of \(H^{(0)}_x\). Hence, the noise \(H^{(0)}\) is benign from the perspective of ground state preparation.

  • For x ∈ R, \(|\varphi _{1}\rangle :=\frac {1}{\sqrt {2}}\begin {pmatrix} 1 & -1 \end {pmatrix}^T\oplus 0\) is the unique ground state of \(H^{(1)}_x\). Hence, the noise \(H^{(1)}\) is malicious from the perspective of ground state preparation.

Thus, we describe the ground state preparation by a function \(f^{(i)}_R:\mathbb {R}^D\to \{|\varphi _{0}\rangle \langle \varphi _{0}|, |\varphi _{1}\rangle \langle \varphi _{1}|\}\), with \(f^{(i)}_R(x)=|\varphi _{i}\rangle \langle \varphi _{i}|\) if \(x\in R\) and \(f^{(i)}_R(x)=|\varphi _{0}\rangle \langle \varphi _{0}|\) otherwise. With this formulation, gaining information about the noise region R and the noise type i can be phrased as the problem of (PAC-)learning an unknown element of the (known) function class \({\mathscr{F}}=\left \{f^{(i)}_R\right \}_{i=0,1,~R\in {\mathscr{R}}}\subseteq \{|\varphi _{0}\rangle \langle \varphi _{0}|, |\varphi _{1}\rangle \langle \varphi _{1}|\}^{\mathbb {R}^D}\), where \({\mathscr{R}}\) is the class of possible error regions.

Note that |φ0〉 and |φ1〉 are not orthogonal and thus cannot be perfectly distinguished. Therefore, we cannot phrase the learning problem as one of binary classification with classical labels.

We return to this setting in Examples 2 and 3 to illustrate our learning strategies.

We want to conclude this section by discussing a drawback of our model. We assume \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\), i.e., outputs of any \(f\in {\mathscr{F}}\) are either σ0 or σ1. Considering the convex structure of the set of quantum states, which is intimately tied to the probabilistic interpretation of quantum theory, this restriction can be considered unnatural. We nevertheless make it, for two reasons: First, it is easy to show using a Bayesian predictor that, under the assumption of μ being supported on \({\mathscr{D}}\) (which could, of course, also be contested), the optimal choice of predictors among all functions \(({\mathscr{S}}(\mathbb {C}^d))^{{\mathscr{X}}}\) is actually a function in \({\mathscr{D}}^{{\mathscr{X}}}\). Second, it is the most direct analog of the classical scenario with binary labels and we consider it a sensible first step that, as demonstrated in Example 1, can already be of physical relevance.

4 Sample complexity upper bounds

4.1 The agnostic case

Our learning strategy is motivated by interpreting the classical training data arising from performing a measurement on the label states as a noisy version of the true training data. Before describing the learning strategy, we recall our assumption that classical descriptions of the label states σ0, σ1 are known to the learner. Based on this knowledge, the learner can derive the optimal measurement {E0,E1} for minimum error discrimination between the two states, the so-called Holevo-Helstrom measurement (see Watrous 2018, Theorem 3.4), by choosing E0 to be the orthogonal projector onto the eigenspaces of σ0 − σ1 corresponding to nonnegative eigenvalues and E1 to be the complementary projector. This step is where knowledge of the states σ0 and σ1 is used.
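The construction of the Holevo-Helstrom measurement from the spectral decomposition of σ0 − σ1 can be sketched as follows. The code assumes NumPy; the function names are hypothetical, and the interpretation of tr[E1σ0] and tr[E0σ1] as the two label-flip probabilities is our reading of the reduction described in this section.

```python
import numpy as np

def helstrom_measurement(sigma0, sigma1):
    """Return (E0, E1): E0 projects onto the eigenspaces of sigma0 - sigma1
    with nonnegative eigenvalues, E1 is the complementary projector."""
    eigenvalues, eigenvectors = np.linalg.eigh(sigma0 - sigma1)
    # Eigenvalues that are zero up to rounding may land on either side;
    # this does not affect the total error probability.
    nonnegative = eigenvectors[:, eigenvalues >= 0]
    e0 = nonnegative @ nonnegative.conj().T
    e1 = np.eye(sigma0.shape[0]) - e0
    return e0, e1

def flip_probabilities(sigma0, sigma1):
    """Probabilities of misidentifying sigma0 and sigma1, respectively."""
    e0, e1 = helstrom_measurement(sigma0, sigma1)
    eta0 = np.trace(e1 @ sigma0).real  # outcome 1 although the state is sigma0
    eta1 = np.trace(e0 @ sigma1).real  # outcome 0 although the state is sigma1
    return eta0, eta1

# Non-orthogonal pure label states |0><0| and |+><+|.
sigma0 = np.diag([1.0, 0.0])
sigma1 = 0.5 * np.ones((2, 2))
e0, e1 = helstrom_measurement(sigma0, sigma1)
eta0, eta1 = flip_probabilities(sigma0, sigma1)
```

The two error probabilities add up to one minus the trace distance of the label states, so both are below 1/2 whenever the states are distinct pure states as in this example.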

The learning strategy is now the following, in which we use the Holevo-Helstrom measurement to produce classical training data and thus obtain a classical learning problem:

[Figure h: numbered steps of the learning strategy]

Note that the only non-classical step in the strategy is step (1), which consists only of performing local two-outcome measurements.
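From the point of view of the classical learner, the measurement step acts like two-sided label noise. The following sketch simulates this effect classically, assuming NumPy; the function name `measure_labels` and the parameters `eta0`, `eta1` (flip probabilities for true labels 0 and 1) are illustrative assumptions and not the original algorithm.

```python
import numpy as np

def measure_labels(true_labels, eta0, eta1, rng):
    """Simulate measuring each quantum label once: an example with true
    label b receives the wrong classical label with probability eta_b."""
    true_labels = np.asarray(true_labels)
    flip_probability = np.where(true_labels == 0, eta0, eta1)
    flips = rng.random(true_labels.shape) < flip_probability
    return np.where(flips, 1 - true_labels, true_labels)

rng = np.random.default_rng(0)
labels = np.array([0, 1, 1, 0])
noisy = measure_labels(labels, 0.15, 0.15, rng)
```

Because every example is measured separately, the resulting label flips are independent across examples.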

The modification of the loss function in step (3) gives an unbiased estimate of the true risk:

Lemma 1

(see Natarajan et al. 2013, Lemma 1)

Fix \(x\in {\mathscr{X}}\). With the notation introduced above, for every z ∈{0,1} it holds that

We can use a standard generalization bound in terms of Rademacher complexities (see, e.g., Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)) to obtain: With probability ≥ 1 − δ over the choice of training data \(S=\{(x_{i},y_{i}) \}_{i=1}^{m}\) according to ν, we have that for all \(\tilde {f}^{\ast }\in \tilde {{\mathscr{F}}}\)

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\hat{g}(x),y)] - \mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\tilde{f}^{\ast}(x),y)]\\ &\leq& 2\hat{\mathscr{R}}(\tilde{\mathcal{G}}) + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{2\ln(8/\delta)}{m}}, \end{array} $$

where we used that \(|\tilde {\ell }(y_{1},y_{2})|\leq \frac {1}{1-\eta _{0}-\eta _{1}}\) and defined the function class

$$ \tilde{\mathcal{G}} := \{ \mathscr{X}\times\{ 0,1\}\ni (x,y)\mapsto \tilde{\ell}(\tilde{f}(x),y)~|~\tilde{f}\in\tilde{\mathscr{F}}\}. $$

Next, we relate the empirical Rademacher complexity of \(\tilde {\mathcal {G}}\) to that of \(\tilde {{\mathscr{F}}}\).

Lemma 2

For any training data set \(S=\{(x_{i},y_{i}) \}_{i=1}^{m}\), viewed as an element of \(({\mathscr{X}}\times \{ 0,1\})^{m}\), we have

$$ \hat{\mathscr{R}} (\tilde{\mathcal{G}}) \leq \frac{2}{1-\eta_{0}-\eta_{1}}\hat{\mathscr{R}}(\tilde{\mathscr{F}}). $$

Proof

(Sketch) The proof uses standard steps that typically appear, for example, in proving the Lipschitz contraction property of the Rademacher complexity and in studying the Rademacher complexity in a binary classification scenario.

See the Appendix for a detailed proof. □

With this, we now reformulate the above result in terms of the VC-dimension. Suppose \(\text {VCdim} (\tilde {{\mathscr{F}}})=d<\infty \). Then \(\hat {{\mathscr{R}}}(\tilde {{\mathscr{F}}}) \leq 31\sqrt {\frac {d}{m}}\) (see, e.g., Vershynin 2018, Theorem 8.3.23). Therefore, we obtain that, with probability ≥ 1 − δ over the choice of training data \(S=\{(x_{i},y_{i}) \}_{i=1}^{m}\) according to ν,

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\hat{g}(x),y)] - \underset{\tilde{f}\in\tilde{\mathscr{F}}}{\inf}\mathbb{E}_{(x,y)\sim\nu} [\tilde{\ell}(\tilde{f}(x),y)]\\ &\leq& \frac{124}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{d}{m}} + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{2\ln(8/\delta)}{m}}. \end{array} $$

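Note that, using Lemma 1, we can now bound

$$ R_{\mu}(\hat{h}) - \underset{f\in\mathscr{F}}{\inf} R_{\mu}(f) \leq \frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}{2}\left( \frac{124}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{d}{m}} + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{\frac{2\ln(8/\delta)}{m}} \right). $$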

Now we can set this equal to ε and rearrange to conclude that a sample size of

$$ m\geq \frac{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}^{2}}{4\varepsilon^{2}}\left( \frac{124}{1-\eta_{0}-\eta_{1}}\sqrt{d} + \frac{5}{1-\eta_{0}-\eta_{1}}\sqrt{2\ln(8/\delta)} \right)^{2} $$

suffices to guarantee that, with probability ≥ 1 − δ, \(R_{\mu }(\hat {h}) - \underset {f\in {\mathscr{F}}}{\inf } R_{\mu } (f)\leq \varepsilon \).

If we now observe that \(\frac {1}{1-\eta _{0}-\eta _{1}}\leq \frac {4}{\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}\), we obtain the sample complexity upper bound

$$ m =m(\varepsilon,\delta) = \mathscr{O}\left( \frac{d}{\varepsilon^{2}} + \frac{\log{1}/{\delta}}{\varepsilon^{2}}\right). $$
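As a worked illustration of the explicit sufficient sample size stated above, the following sketch evaluates it for given parameters. The function name `agnostic_sample_size` and the numerical inputs are hypothetical; the constants 124 and 5 are taken from the displayed bound, and the logarithm is read as ln(8/δ).

```python
import math

def agnostic_sample_size(eps, delta, d, trace_norm, eta0, eta1):
    """Evaluate the explicit sufficient sample size from the agnostic bound.

    eps, delta : accuracy and confidence parameters
    d          : VC-dimension of the induced classical class
    trace_norm : ||sigma0 - sigma1||_1
    eta0, eta1 : label flip rates induced by the measurement
    """
    c = 1.0 / (1.0 - eta0 - eta1)
    inner = 124 * c * math.sqrt(d) + 5 * c * math.sqrt(2 * math.log(8 / delta))
    return math.ceil(trace_norm ** 2 / (4 * eps ** 2) * inner ** 2)

# Hypothetical parameters with 1 - eta0 - eta1 = trace_norm / 2.
m_coarse = agnostic_sample_size(0.1, 0.05, 3, 1.0, 0.25, 0.25)
m_fine = agnostic_sample_size(0.05, 0.05, 3, 1.0, 0.25, 0.25)
```

Halving ε roughly quadruples the returned value, in line with the 1/ε² scaling of the bound; the large constants show that the bound is meant as a scaling statement rather than a practical estimate.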

Remark 1

The naive version of our learning strategy would be to perform Holevo-Helstrom measurements and then apply a classical learning strategy, like empirical risk minimization, without correcting for the noise in the resulting classical labels. Actually, this learning strategy already performs reasonably well and, in certain special cases, even allows one to reduce the quantum learning problem to a fully classical one. For a detailed analysis of the performance of this simpler strategy, the reader is referred to the Appendix.

Example 2

We illustrate our agnostic learning strategy for the scenario of Example 1. As discussed in the Appendix, since both label states |φ0〉〈φ0| and |φ1〉〈φ1| are pure, we can actually dispense with the modification of the classical loss function and simply take the 0-1-loss. Therefore, the Holevo-Helstrom strategy will look as follows: We first perform local Holevo-Helstrom measurements with measurement operators \(E_{0} \propto \begin {pmatrix} -1+\sqrt {2} & 1\end {pmatrix}^T \begin {pmatrix} -1+\sqrt {2} & 1\end {pmatrix}\oplus 0\) and \(E_{1} = \mathbb{1} - E_{0}\). This gives rise to classical training data. With that data, we then perform (classical) empirical risk minimization over the class \(\tilde {{\mathscr{F}}}=\left \{\tilde {f}^{(i)}_R \right \}_{i=0,1,~R\in {\mathscr{R}}}\), where \(\tilde {f}^{(i)}_R:\mathbb {R}^D\to \{0,1\}\). Note that \(\tilde {f}^{(0)}_R\) is the zero-function for every \(R\in {\mathscr{R}}\).

Both the optimization procedure and the generalization capability depend on the class \({\mathscr{R}}\) of possible noise regions. Concerning the generalization performance, observe that, if \(\emptyset \in {\mathscr{R}}\), then \(\text {VCdim} (\tilde {{\mathscr{F}}})=\text {VCdim}(\tilde {{\mathscr{F}}}_{{\mathscr{R}}})\), where we take \(\tilde {{\mathscr{F}}}_{{\mathscr{R}}}\) to be the class of indicator functions of sets from \({\mathscr{R}}\). The VC-dimension of such classes is well known for different geometric classes \({\mathscr{R}}\). E.g., if \({\mathscr{R}}\) is the class of axis-aligned rectangles or that of Euclidean balls in \(\mathbb {R}^D\), then \(\text {VCdim} (\tilde {{\mathscr{F}}}_{{\mathscr{R}}})\) scales linearly in D and thus the dependence of the sample complexity upper bound on the number of parameters D is linear. If, however, we take \({\mathscr{R}}\) to be the class of compact and convex subsets of \(\mathbb {R}^D\), then \(\text {VCdim} (\tilde {{\mathscr{F}}}_{{\mathscr{R}}})=\infty \) and the sample complexity upper bound becomes void. This is congruent with the intuition that without prior assumptions on the structure of the regions that can be influenced by noise, learning the noise (in particular its region) will be hard and maybe infeasible.

4.2 The realizable case

The strategy from the previous subsection uses a generalization bound via the Rademacher complexity and yields a sample complexity bound depending quadratically on 1/ε. In the classical binary classification problem it is known (see Theorem 1) that under the realizability assumption this can be improved to 1/ε, but this typically requires a different kind of reasoning via ε-nets (compare Section 28.3 of Shalev-Shwartz and Ben-David (2014)). In Theorem 6 we show how the reasoning by Hanneke (2016) can be combined with results by Laird (1988) to achieve the 1/ε-scaling also in the case of two-sided classification noise. This sample complexity upper bound is seen to be optimal in its dependence on the VC-dimension d, the error rate bound η, the confidence δ and the accuracy ε by a comparison to the lower bound in Theorem 27 of Arunachalam and de Wolf (2018).

If, as in the previous subsection, we consider the classical training data obtained by measuring the quantum training data as a noisy version of a true sample, we can replace step (3) in the Holevo-Helstrom strategy with the minimum disagreement-based classical learning strategy achieving the optimal sample complexity bound of Theorem D.2. This directly yields the following

Theorem 2

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) quantum states. Let ε ∈ (0,1), \(\delta \in (0,2\cdot (\frac {2e}{d})^d)\), where d is the VC-dimension of \({\mathscr{F}}\subset \lbrace 0,1\rbrace ^{{\mathscr{X}}}\). Then

$$m = \mathscr{O}\left( \frac{1}{\varepsilon (1 - 2\max\lbrace \text{tr}[E_{0}\sigma_{1}],\text{tr}[E_{1}\sigma_{0}]\rbrace)^{2}} \left( d + \log{1}/{\delta}\right)\right)$$

quantum examples of a function in \({\mathscr{F}}\) are sufficient for binary classification with classical instances and quantum labels σ0,σ1 with accuracy ε and confidence 1 − δ.

Example 3

When considering this learning strategy in the setting of Example 1, we first perform the Holevo-Helstrom measurements as in Example 2 to obtain classical data. Again, this is followed by a classical learning procedure for the class \(\tilde {{\mathscr{F}}}=\left \{\tilde {f}^{(i)}_R \right \}_{i=0,1,~R\in {\mathscr{R}}}\).

Whereas the sample complexity bound derived for the agnostic case in Section 4.1 applies to any (noise-corrected) classical empirical risk minimization, the procedure leading to the bound in Theorem 2 is a specific one, presented in the proof of Theorem D.2. First, the classical data is processed, using the subsampling algorithm of Hanneke (2016) (see Algorithm 2), to generate a collection of subsamples. For each of those subsamples, we then apply Algorithm 1: We use a first part of the subsample to group the elements of \(\tilde {{\mathscr{F}}}\) into equivalence classes (according to how they act on that part of the subsample), and the remainder is used to test the performance of each equivalence class. Afterwards, we output as hypothesis for that subsample a representative of the equivalence class that performs best in that test, i.e., that minimizes the number of disagreements with the part of the subsample used for testing. Whether and how the grouping into equivalence classes and finding minimum disagreement strategies can be done (efficiently) depends on \(\tilde {{\mathscr{F}}}\), and thus on \({\mathscr{R}}\). Finally, we take a majority vote over all the subsample hypotheses to get the output hypothesis of the classical learning procedure.
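The overall shape of this procedure (minimum disagreement on each subsample, followed by a majority vote) can be sketched as follows. This is a simplified illustration only: the subsamples below are an arbitrary placeholder and not the subsampling scheme of Hanneke (2016), the grouping into equivalence classes is skipped because the hypothetical hypothesis class (thresholds on the real line) is finite, and all names and data are assumptions.

```python
def threshold(t):
    """Hypothetical hypothesis: predicts 1 exactly when x >= t."""
    return lambda x: int(x >= t)

def min_disagreement(hypotheses, sample):
    """Return the hypothesis with the fewest disagreements on `sample`."""
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in sample))

def majority_vote(hypotheses):
    """Combine binary hypotheses by a pointwise majority vote."""
    def voted(x):
        return int(2 * sum(h(x) for h in hypotheses) > len(hypotheses))
    return voted

# Noisy classical data: the true threshold is 0.5, the label at 0.8 is flipped.
sample = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
          (0.6, 1), (0.7, 1), (0.8, 0), (0.9, 1)]
hypotheses = [threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Placeholder subsamples (NOT Hanneke's scheme): each omits a chunk of data.
subsamples = [sample[2:], sample[:3] + sample[5:], sample[:6]]

per_subsample = [min_disagreement(hypotheses, s) for s in subsamples]
final_hypothesis = majority_vote(per_subsample)
```

Because the output is a majority vote, it need not itself belong to the hypothesis class, which is why such a learner is in general improper.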

The dependence of the sample complexity on \(\tilde {{\mathscr{F}}}\) via the VC-dimension of the class of indicator functions of sets from \({\mathscr{R}}\) is analogous to Example 2.

Remark 2

From the description of our noise-corrected Holevo-Helstrom strategy (either in the form of Section 4.1 or that of this subsection), we can directly see that whether it is a proper or an improper learner depends on whether the classical learning algorithm in step (3) is. As the classical learning algorithm used in Section 4.1 is a simple Empirical Risk Minimization, it is in particular proper. So our noise-corrected Holevo-Helstrom strategy for the agnostic case is proper as well. The classical learner used in this subsection, however, is in general improper. So also the noise-corrected Holevo-Helstrom strategy for the realizable case is in general improper.

5 Sample complexity lower bounds

Whereas the goal of the previous section was to give strategies for solving the binary classification problem with classical instances and quantum labels and to prove upper bounds on the sufficient number of classical-quantum examples, we now turn to the complementary question of lower bounds on the number of required examples. In this section, we derive lower bounds that match the respective upper bounds from the previous section, and therefore, we conclude that the procedures described in Section 4 are optimal w.r.t. sample size in terms of the dependence on ε, δ, and d.

5.1 The agnostic case

We prove the sample complexity lower bounds in two parts, the first depending on the confidence parameter δ but not on the VC-dimension of the function class and conversely for the second.

We establish the VC-dimension-independent sample complexity lower bound in the following

Lemma 3

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\), let \(\varepsilon \in (0,\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{2\sqrt {2}})\), δ ∈ (0,1). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class. Suppose \({\mathscr{A}}\) is a learning algorithm that solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples. Then \(m\geq {\varOmega }\left (\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}^{2}\frac {\log {1}/{\delta }}{\varepsilon ^{2}}\right )\).

Proof

(Sketch) As \({\mathscr{F}}\) is non-trivial, there exist concepts \(f, g\in {\mathscr{F}}\) and a point \(x\in {\mathscr{X}}\) s.t. f(x) = σ0 and g(x) = σ1. Let \(\lambda =\frac {\varepsilon }{2\left \|{\sigma _{0} - \sigma _{1}}\right \|_{1}}\in (0,1)\). Define probability distributions μ± on \({\mathscr{X}}\times {\mathscr{D}}\) via

$$ \mu_{\pm}(x,f(x)) = \frac{1\pm \lambda}{2},\quad \mu_{\pm}(x,g(x))=\frac{1\mp\lambda}{2}. $$

By explicitly evaluating the risk R±(h), we see that achieving an excess risk ≤ ε with probability ≥ 1 − δ requires the learner to distinguish between the underlying distributions μ±, and thus the corresponding training data states \(\rho _{\pm }^{\otimes m}\), with probability ≥ 1 − δ.

It is well known (see, e.g., Nielsen and Chuang 2009, chapter 9) that the optimal success probability of this quantum distinguishing task is given by

$$p_{\text{opt}} = \frac{1}{2}(1+\frac{1}{2}\left\|{\rho_{+}^{\otimes m} - \rho_{-}^{\otimes m}}\right\|_{1}).$$

Via the Fuchs-van de Graaf inequalities, which state that

$$ \frac{1}{2}\left\|{\rho_{1}^{\otimes m} - \rho_{2}^{\otimes m}}\right\|_{1} \leq \sqrt{1-F(\rho_{1}^{\otimes m}, \rho_{2}^{\otimes m})^{2}} = \sqrt{1-F(\rho_{1}, \rho_{2})^{2m}}, $$

this can be upper bounded using lower bounds on the fidelity \(F(\rho _{+}^{\otimes m},\rho _{-}^{\otimes m}) = F(\rho _{+},\rho _{-})^{m}\). The fidelity \(F(\rho _{+},\rho _{-})\) can be lower-bounded using its strong concavity and the explicit expressions for ρ±. The result then follows by comparing the obtained upper bound with the required lower bound popt ≥ 1 − δ.
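The two ingredients of this argument, the Fuchs-van de Graaf inequality and the multiplicativity of the fidelity under tensor powers, can be checked numerically on small examples. In the sketch below, the example states and the value of λ are hypothetical, and the form ρ± = ((1 ± λ)/2)σ0 + ((1 ∓ λ)/2)σ1 is an assumed simplification of the training data states for a single instance x.

```python
import numpy as np

def trace_norm(a):
    """Trace norm of a Hermitian matrix via its eigenvalues."""
    return np.abs(np.linalg.eigvalsh(a)).sum()

def psd_sqrt(rho):
    """Square root of a positive semidefinite matrix."""
    evals, evecs = np.linalg.eigh(rho)
    evals = np.clip(evals, 0.0, None)          # remove tiny negative rounding
    return (evecs * np.sqrt(evals)) @ evecs.conj().T

def fidelity(rho1, rho2):
    """F(rho1, rho2) = || sqrt(rho1) sqrt(rho2) ||_1 (sum of singular values)."""
    product = psd_sqrt(rho1) @ psd_sqrt(rho2)
    return np.linalg.svd(product, compute_uv=False).sum()

def tensor_power(rho, m):
    """m-fold tensor power of rho."""
    out = np.array([[1.0]])
    for _ in range(m):
        out = np.kron(out, rho)
    return out

lam = 0.2                                       # hypothetical bias parameter
sigma0 = np.diag([1.0, 0.0])
plus = np.array([1.0, 1.0]) / np.sqrt(2)
sigma1 = np.outer(plus, plus)
rho_plus = (1 + lam) / 2 * sigma0 + (1 - lam) / 2 * sigma1
rho_minus = (1 - lam) / 2 * sigma0 + (1 + lam) / 2 * sigma1

F = fidelity(rho_plus, rho_minus)
checks = []
for m in range(1, 5):
    rp, rm = tensor_power(rho_plus, m), tensor_power(rho_minus, m)
    lhs = 0.5 * trace_norm(rp - rm)             # enters p_opt = (1 + lhs) / 2
    rhs = np.sqrt(1 - F ** (2 * m))             # Fuchs-van de Graaf upper bound
    checks.append((m, lhs, rhs, fidelity(rp, rm)))
```

Each entry of `checks` pairs the trace distance of the m-copy states with its fidelity-based upper bound, so one can see how slowly the distinguishing probability grows with m when λ is small.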

See the Appendix for a detailed proof. □

For the proof of the VC-dimension-dependent part of the lower bound we need a well known observation about the eigenvalues of a statistical mixture of two pure quantum states, which is the content of the following

Lemma 4

Let \(|\psi \rangle ,|\phi \rangle \in \mathbb {C}^n\) be distinct pure quantum states. Let α,β ≥ 0 be real numbers. Then the non-zero eigenvalues of the mixture ρ := α|ψ〉〈ψ| + β|ϕ〉〈ϕ| are given by

$$ \begin{array}{@{}rcl@{}} \lambda_{1/2}(\rho) = \frac{\alpha+\beta\pm\sqrt{(\alpha-\beta)^{2} + 4\alpha\beta |\langle\psi |\phi\rangle|^{2}}}{2}. \end{array} $$
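The eigenvalue formula of Lemma 4 is easy to confirm numerically. The following sketch compares it with a direct diagonalization; the vectors, the weights, and the function name `mixture_eigenvalues` are hypothetical choices made for illustration.

```python
import numpy as np

def mixture_eigenvalues(alpha, beta, overlap):
    """Closed-form non-zero eigenvalues of alpha|psi><psi| + beta|phi><phi|,
    where `overlap` is the inner product <psi|phi>."""
    root = np.sqrt((alpha - beta) ** 2 + 4 * alpha * beta * abs(overlap) ** 2)
    return (alpha + beta + root) / 2, (alpha + beta - root) / 2

# Hypothetical normalized states in C^3 and mixing weights.
psi = np.array([1.0, 0.0, 0.0], dtype=complex)
phi = np.array([0.6, 0.8j, 0.0], dtype=complex)
alpha, beta = 0.3, 0.7

rho = alpha * np.outer(psi, psi.conj()) + beta * np.outer(phi, phi.conj())

# Two largest eigenvalues from direct diagonalization, in descending order.
numeric = np.sort(np.linalg.eigvalsh(rho))[::-1][:2]
closed_form = mixture_eigenvalues(alpha, beta, np.vdot(psi, phi))
```

For these inputs the overlap is 0.6, so the closed form gives (1 ± 0.68)/2, i.e., 0.84 and 0.16, and the remaining eigenvalue of `rho` is zero.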

With this we can now prove a sample complexity lower bound for the case of pure label states.

Theorem 3

Let \(\sigma _{0}=|\psi _{0}\rangle \langle \psi _{0}|,\sigma _{1}=|\psi _{1}\rangle \langle \psi _{1}|\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) pure quantum states, let \(\varepsilon \in (0,\frac {\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}}{8})\), \(\delta \in (0,1-H\left (\frac {1}{4}\right ))\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class s.t. \(\tilde {{\mathscr{F}}}\) has VC-dimension d. Suppose \({\mathscr{A}}\) is a learning algorithm that solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples. Then \(m\geq {\varOmega }\left (\frac {d}{\varepsilon ^{2}}\right )\).

Proof

(Sketch) We follow the information-theoretic proof strategy from Arunachalam and de Wolf (2018). Let \(S=\{s_{1},\ldots ,s_d\}\subset {\mathscr{X}}\) be a set shattered by \(\tilde {{\mathscr{F}}}\). For each \(a\in \lbrace 0,1\rbrace ^d\) define the distribution μa on {1,…,d}×{0,1} via

$$ \mu_{a}(i,b) := \frac{1}{2d}\left( 1 + (-1)^{a_{i} + b} \frac{8\varepsilon}{\left\|\sigma_{0}-\sigma_{1}\right\|_{1}}\right). $$

Note that \(\forall a\in \lbrace 0,1\rbrace ^d\ \exists f_a\in \tilde {{\mathscr{F}}}: f_a(s_i)=a_i\) by shattering and that fa is a minimum error concept w.r.t. μa. By evaluating the excess error of an \(f_{\tilde {a}}\) compared to fa, we see that solving the learning problem with confidence 1 − δ requires the learner to output, with probability ≥ 1 − δ, a hypothesis described by a string whose Hamming distance to the true underlying string is \(\leq \frac {d}{4}\). We can use this observation to obtain the lower bound I(A : B) ≥Ω(d) on the mutual information between underlying string A (drawn uniformly at random) and corresponding quantum training data B.
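The link between excess error and Hamming distance can be made explicit with a small computation. The sketch below works with the classical 0-1 error under μa; treating the quantum excess risk as this quantity multiplied by ½‖σ0 − σ1‖₁ is an assumption made for the illustration, and the function names are hypothetical.

```python
from itertools import product

def mu(a, i, b, eps, trace_norm):
    """Probability mu_a(i, b) for the bit string a (0-indexed position i)."""
    d = len(a)
    return (1 + (-1) ** (a[i] + b) * 8 * eps / trace_norm) / (2 * d)

def classical_error(a, a_tilde, eps, trace_norm):
    """0-1 error under mu_a of the concept that predicts a_tilde[i] at s_i."""
    return sum(mu(a, i, 1 - a_tilde[i], eps, trace_norm)
               for i in range(len(a)))

def excess_error(a, a_tilde, eps, trace_norm):
    """Excess 0-1 error of a_tilde over the optimal string a."""
    return (classical_error(a, a_tilde, eps, trace_norm)
            - classical_error(a, a, eps, trace_norm))
```

Each position where the two strings differ contributes 8ε/(d‖σ0 − σ1‖₁) to the excess 0-1 error. Under the assumed scaling by ½‖σ0 − σ1‖₁, the excess risk is 4ε times the Hamming distance divided by d, which is at most ε exactly when the Hamming distance is at most d/4.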

We can also upper bound the mutual information. A standard argument shows I(A : B) ≤ mI(A : B1), where m is the number of copies of the quantum example state and B1 describes a single quantum example state. Using Lemma 4 and the explicit expression for a quantum example state, we can compute I(A : B1) and use Taylor expansion to see that \(I(A:B_{1})\leq {\mathscr{O}}(\varepsilon ^{2})\). Comparing the lower and upper bounds on I(A : B) now gives \(m\geq {\varOmega }\left (\frac {d}{\varepsilon ^{2}}\right )\).

See the Appendix for a detailed proof. □

If we now combine Lemma 3 and Theorem 3 with the result of Section 4.1 we obtain

Corollary 1

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) pure quantum states, let \(\varepsilon \in (0,\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{8})\), \(\delta \in (0,1-H\left (\frac {1}{4}\right ))\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class s.t. \(\tilde {{\mathscr{F}}}\) has VC-dimension d. Then a sample size of \({{\varTheta }}\left (\frac {d}{\varepsilon ^{2}} + \frac {\log {1}/{\delta }}{\varepsilon ^{2}} \right )\) is necessary and sufficient for solving the binary classification task with classical instances and quantum labels σ0,σ1 and hypothesis class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε.

Therefore, we have shown that the strategy from Section 4.1 is, for pure states, optimal in sample complexity w.r.t. its dependence on the VC-dimension, the accuracy and the confidence. But we do not make a statement on optimality w.r.t. the dependence on the distinguishability of the label states, because the parameter \(\left \|{\sigma _{0} - \sigma _{1}}\right \|_{1}\) is absent from our lower bound.

5.2 The realizable case

We now show analogous lower bounds for the sample complexity in the realizable scenario with the same proof strategy.

Lemma 5

Let \(\sigma _{0},\sigma _{1}\in {\mathscr{S}}(\mathbb {C}^n)\), let \(\varepsilon \in (0,\frac {\left \|\sigma _{0}-\sigma _{1}\right \|_{1}}{2})\), \(\delta \in (0,\frac {1}{2})\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class. Suppose \({\mathscr{A}}\) is a learning algorithm which solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples in the realizable scenario. Then \(m\geq {\varOmega }\left (\frac {\log {1}/{\delta }}{\varepsilon }\right )\).

Proof

This can be proved similarly to Lemma 3. See the Appendix for a detailed proof. □

We now provide the analog of Theorem 3 for the realizable case.

Theorem 4

Let \(\sigma _{0}=|\psi _{0}\rangle \langle \psi _{0}|,\sigma _{1}=|\psi _{1}\rangle \langle \psi _{1}|\in {\mathscr{S}}(\mathbb {C}^n)\) be (distinct) pure quantum states, let \(\varepsilon \in (0,\frac {\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}}{8})\), \(\delta \in (0,\frac {1}{2})\). Let \({\mathscr{F}}\subset {\mathscr{D}}^{{\mathscr{X}}}\) be a non-trivial concept class s.t. \(\tilde {{\mathscr{F}}}\) has VC-dimension d + 1. Suppose \({\mathscr{A}}\) is a learning algorithm which solves the binary classification task with classical instances and (distinct) label states σ0,σ1 and concept class \({\mathscr{F}}\) with confidence 1 − δ and accuracy ε using m = m(ε,δ) examples in the realizable case. Then \(m\geq {\varOmega }\left (\frac {d}{\varepsilon }\right )\).

Proof

This can be proved similarly to Theorem 3. See the Appendix for a detailed proof. □

Thus, we have obtained a sample complexity lower bound that matches the upper bound proved in Section 4.2 in the dependence on the VC-dimension, the confidence and the accuracy, but we do not make a statement about optimality w.r.t. the dependence on \(\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}\).

Remark 3

As already discussed in Section 2.1, in proving the sample complexity lower bounds we resort to the Heisenberg picture, which allows us to absorb the intermediate quantum channels performed by a learner into the measurement. These lower bounds therefore even hold for quantum learning algorithms that perform coherent and adaptive measurements on the training data. In particular, the information-theoretic complexity of our learning problem does not change if we restrict the quantum learner to only performing two-outcome POVMs locally (i.e., on one subsystem only). This is maybe not too much of a surprise, since the optimal measurement for distinguishing states drawn uniformly at random from \(\{\bigotimes _{i=1}^{m} \sigma _{x_{i}}\}_{x\in \{0,1\}^{m}}\) can, using the Holevo-Yuen-Kennedy-Lax optimality criterion (Holevo 1973; Yuen et al. 1975), be seen to be exactly given by local Holevo-Helstrom measurements.

6 Conclusion and outlook

We have proposed a novel way of modifying the classical binary classification problem to obtain a quantum counterpart. The conceptual difference to the framework of quantum PAC learning as discussed in Arunachalam and de Wolf (2017) is that we work with maps whose outputs are themselves quantum states, not classical labels. This naturally gives rise to training data given by quantum states, which is one aspect in which our setting differs from Aaronson (2007).

Using results from classical learning theory on dealing with classification noise in the training data, we exhibited learning strategies (based on the Holevo-Helstrom measurement) for binary classification with classical instances and quantum labels. The learning strategies consist of two main steps: First, classical information is extracted from the training data by performing a (localized) measurement. Second, classical learning strategies are applied. We complemented these procedures by sample complexity lower bounds, thereby establishing the information-theoretic optimality of these strategies for pure label states w.r.t. the dependence on VC-dimension, confidence and accuracy.

We conclude with some questions that we leave open for future research:

  • Can we derive sample complexity lower bounds which explicitly incorporate factors related to the hardness of distinguishing σ0 and σ1, e.g., in terms of \(\left \|{\sigma _{0}-\sigma _{1}}\right \|_{1}\) or \(\max \limits \lbrace \text {tr}[E_{0}\sigma _{1}],\text {tr}[E_{1}\sigma _{0}]\}\)? Or can the corresponding factors in the upper bounds be eliminated? Could this be related to another complexity measure from classical learning theory, the “fat-shattering dimension” of the class

    $$\{\mathscr{X}\times\mathscr{E}(\mathbb{C}^{d}) \ni (x,E)\mapsto \text{tr}[Ef(x)]~|~f\in\mathscr{F}\}?$$

  • Our analysis is focused on the information-theoretic part of the learning problem, i.e., the sample complexity. Can we improve the computational complexity?

  • For deriving our sample complexity upper bounds, we used specific classical learning procedures applied to the post-measurement training data. In the agnostic case, we use empirical risk minimization; in the realizable case, we use a combination of a minimum disagreement approach with a subsampling procedure. In both cases, we chose these algorithms because they achieve the (essentially) optimal sample complexity characterized via the VC-dimension.

    However, we could use other classical learning procedures for “post-processing”. Can we identify situations in which procedures like structural risk minimization, compression schemes, or stable learning procedures yield useful sample complexity bounds?

  • We considered the case of classical instances. Can this be extended to a scenario of quantum instances with classical (or even quantum) labels? Whereas we were able to study the case of classical instances and quantum labels with methods from learning with label noise, once the instances themselves are quantum, we might have to employ ideas from learning models with restricted access to the instances such as that of “learning with restricted focus of attention” proposed in Ben-David and Dichterman (1998).

  • Our strategy uses the Holevo-Helstrom measurement, which can be understood as inducing the minimum amount of noise. However, in classical learning theory it is well known that adding noise to the training data can be helpful in preventing overfitting. In this spirit, can we justify measurements other than the Holevo-Helstrom measurement?

  • We assumed throughout our analysis that the learning algorithm has to output a hypothesis that maps into {σ0,σ1}. What if we allow for hypotheses that map into \(\text {conv}\left (\lbrace \sigma _{0},\sigma _{1}\rbrace \right )\) or \({\mathscr{S}}(\mathbb {C}^d)\)?

  • Finally, we assume throughout that the label states σ0, σ1 are known in advance. Can this assumption be removed? Here, it might be helpful that Theorem 6 does not need explicit knowledge of the error rates η0, η1, but merely of an upper bound ηb on them.