Universal discriminative quantum neural networks

Quantum mechanics fundamentally forbids deterministic discrimination of quantum states and processes. However, the ability to optimally distinguish various classes of quantum data is an important primitive in quantum information science. In this work, we train near-term quantum circuits to classify data represented by non-orthogonal quantum probability distributions using the Adam stochastic optimization algorithm. This is achieved by iterative interactions of a classical device with a quantum processor to discover the parameters of an unknown non-unitary quantum circuit. This circuit learns to simulate the unknown structure of a generalized quantum measurement, or positive-operator-valued measure (POVM), that is required to optimally distinguish possible distributions of quantum inputs. Notably, we use universal circuit topologies, with a theoretically motivated circuit design, which guarantees that our circuits can in principle learn to perform arbitrary input-output mappings. Our numerical simulations show that shallow quantum circuits can be trained to discriminate among various pure and mixed quantum states, exhibiting a trade-off between minimizing erroneous and inconclusive outcomes, with performance comparable to theoretically optimal POVMs. We train the circuit on different classes of quantum data and evaluate the generalization error on unseen mixed quantum states. This generalization power distinguishes our work from standard circuit optimization and provides an example of quantum machine learning for a task that inherently has no classical analogue.


Introduction
The interface of quantum physics and machine learning has recently received a considerable amount of interest. Two complementary methodologies have been developed to address the question whether quantum mechanics can help with solving machine learning tasks or similarly whether machine learning techniques could help solving problems in quantum computation and many-body condensed matter systems more efficiently [1,2,3].
The circuit model of universal fault-tolerant quantum computers has been shown to offer a range of machine learning algorithms [4,5,6,7] which, under certain assumptions, could lead to considerable speed-ups over classical counterparts. The underlying property which results in such quantum advantage is the ability of quantum computers to perform certain linear algebra operations faster than classical machines [8,9,10,11,12]. The algorithmic complexity for some of these quantum learning schemes is in principle $O(\mathrm{polylog}(N))$ for input dimension $N$ instead of the classical $O(\mathrm{poly}(N))$, which can manifest itself as an exponential speed-up when applied to quantum data [4].
However, these algorithms have been shown to come with certain caveats when applied to classical data. The most relevant is the preparation of quantum states that encode classical information, which is believed to have, in the worst case, a polynomial scaling in the input dimension of the data [13,14], diminishing the advantage of these algorithms in the first place. Here we take a different approach and utilize a hybrid quantum-classical setup to directly learn the design of shallow quantum circuits for an inherently quantum-mechanical task, which has no classical counterpart because it is based on the non-orthogonal nature of quantum states. Hence we solve a problem where a comparison with classical algorithms is not appropriate, unlike for the above-mentioned algorithms. Recent works have focused on using quantum-classical hybrid methods for training quantum circuits for a range of tasks [15,16,17,18,19,20,21,22]. However, prior works on quantum circuit training rarely motivate the underlying circuit structure or provide a concrete quantum application.
In this work, we show that universal shallow quantum circuits can be used as a universal discriminator for classification of quantum data from various different probability distributions, approaching the optimal theoretical performance when such bounds are available. Our circuit topologies comprise gates from a universal gate set consisting of C-NOT and single-qubit gates, motivated by the fact that implementations of such gates are known for the currently most used experimental architectures. Furthermore, our decompositions are nearly optimal in terms of the number of C-NOT gates, which is an important feature for an implementation on near-term devices. The quantum circuits we apply here can be viewed as a form of quantum neural networks with non-unitary layers, i.e., the generalized measurements, leading to sufficient non-linearities. A high-level comparison of our quantum circuit learning with a neural network structure is given in Fig. 1. Recently, unitary architectures have also been considered for the training of classical neural networks [23,24]. Specifically, we develop our quantum neural networks based on optimal circuit decompositions for quantum channels, and directly adapt these channel decompositions to POVMs, conjecturing that our adapted decompositions for POVMs are also nearly optimal. We then train these circuits with a classical-quantum hybrid algorithm for the task of classifying different families of quantum data, based on the theoretical foundations of an important primitive known as quantum state discrimination. Notably, unlike previous works, we focus on the generalisation ability of our circuit, i.e., we train the circuit on a specific range of the parameters with the goal of maximising its generalisation performance, hence considering a learning framework. This distinguishes our work from the pure optimisation problem for the state discrimination task, in which the circuit is optimised to distinguish only a concrete set of states.
We show here with numerical simulations that the hybrid machine can learn a discrimination strategy either in a minimal error setting or in an unambiguous one, and we observe a clear trade-off between the two. One important feature of our discriminator quantum circuit model is that it uses a number of parameters that is polynomial in the size of the quantum states that are input into the circuit. In contrast, classical deep learning commonly uses a number of weights that is linear in the dimension of the input; for analysing quantum data, however, this number is exponential in the number of qubits.
The discriminative quantum neural networks introduced here could have a broad range of applications. Quantum state discrimination by itself plays a key role in quantum information processing protocols and is used in quantum cryptography [25], quantum cloning [26], quantum state separation and entanglement concentration [27,27]. Our shallow quantum circuit learning could further be used to construct quantum repeaters and state purification units within quantum communication networks. It could also have a wide range of applications in quantum metrology [28], quantum sensing [29], quantum illumination [30], and quantum imaging [31] as a systematic way of engineering structured receivers. In general, our discriminative networks could be used as general quantum circuit verification units to examine the outputs of other shallow quantum circuits, with possible applications to quantum versions of Generative Adversarial Networks (GANs) [32].
The paper is organized as follows. In section 2, we first describe the family of quantum states that we aim to discriminate, and introduce the optimization procedure that we use in the numerical experiments. In section 3, we describe various numerical experiments for training a quantum circuit where we use the exact probabilities of the state measurements for our classical optimization of the parameters. We demonstrate that there exists a trade-off between the error probability $P_{\mathrm{err}}$ and the probability of inconclusive outcomes $P_{\mathrm{inc}}$ when optimising our quantum circuits. In section 4, we discuss the number of measurement repetitions that we need in order to obtain a good estimate of the probabilities for the optimization task, i.e., optimal quantum state discrimination. In Appendix A, we describe the parametrization we used for our quantum circuit, while Appendix B provides a brief introduction to the Adam stochastic optimization algorithm.

Quantum state discrimination and classification
Quantum state discrimination (QSD) is defined as follows: one is given a quantum device in an unknown state $\rho$ which is believed to be from a family of non-orthogonal states $\{\rho_i\}$; the task is to design a measurement to optimally identify $\rho$ [33,34]. Our fundamental inability to perfectly discriminate non-orthogonal quantum states is one of the key features of quantum information processing and quantum communication protocols [33]. For example, consider a simple scenario in the context of quantum repeaters in which Alice wants to send a message to Bob, from a previously agreed set, through a noisy quantum channel. Bob needs to discriminate the different messages within the background noise of the channel; that is, he has to keep distilling the messages when they start overlapping with each other. Here we show that one could design a state discrimination measurement by training quantum circuits to accomplish this distillation. Ideally, the training should be done in a secure and controlled environment to avoid tampering by malicious eavesdroppers.
Generally, there are two different strategies to cope with our inherent inability to perform perfect QSD, based on two different figures of merit. (a) Minimum-error state discrimination: This is a deterministic strategy in which we always make a judgment about the nature of the unknown state. The task is to minimize the inevitable errors in our classification. In this method the number of outcomes is equal to the number of possible states, and we can always envision an optimal projective measurement according to the Helstrom bound [35]. However, such optimal quantum measurements are very hard to find. (b) Unambiguous state discrimination: This is a non-deterministic strategy. Here, we want to make sure with 100% certainty that we have identified the correct state. Thus we need to introduce entropy sinks, here referred to as "inconclusive outcomes", to which we attribute our lack of knowledge. Consequently, in this approach we need to rotate the states via unitary gates in a higher-dimensional Hilbert space and then perform projective measurements. The number of outcomes will always be larger than the number of possible input states. In other words, we always need to implement a POVM measurement. The maximum amount of information that can be extracted here is given by the Holevo bound [36]. As we will show here, a pure unambiguous QSD is too costly with respect to inconclusive outcomes, and one needs to allow for some small but non-zero errors to happen. We rely here on the formulation of the QSD problem as a numerical optimization problem, where one attempts to optimize one of the figures of merit [37,38]. However, rather than performing solely the optimisation, we use a machine learning approach to build a hybrid of the above two alternative strategies. To evaluate a specific strategy, we denote the probability of making a successful judgment by $P_{\mathrm{suc}}$, the probability of making a wrong judgment by $P_{\mathrm{err}}$, and the probability of making an inconclusive judgment by $P_{\mathrm{inc}}$. When the probability of an inconclusive judgment is zero ($P_{\mathrm{inc}} = 0$) and the error probability $P_{\mathrm{err}}$ is minimized, the strategy becomes the standard minimal error discrimination. When the error probability is zero and the inconclusive rate $P_{\mathrm{inc}}$ is minimized, the strategy becomes the standard unambiguous state discrimination.
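To give a feeling for the two limiting strategies, the following minimal sketch evaluates the well-known closed-form bounds for the textbook case of two pure states: the Helstrom minimum error probability and, for equal priors, the minimum inconclusive probability of unambiguous discrimination. This two-pure-state setting is only an illustration and is not the pure-versus-mixed problem treated in this paper; the function names are hypothetical.

```python
import math

def helstrom_min_error(overlap, p1=0.5):
    """Minimum error probability for two pure states with |<psi1|psi2>| = overlap
    and prior probabilities p1 and 1 - p1 (Helstrom bound)."""
    p2 = 1.0 - p1
    inside = max(0.0, 1.0 - 4.0 * p1 * p2 * overlap**2)  # guard against rounding
    return 0.5 * (1.0 - math.sqrt(inside))

def unambiguous_min_inconclusive(overlap):
    """Minimum inconclusive probability for unambiguous discrimination of two
    pure states with equal priors (illustrative textbook case only)."""
    return overlap

overlap = 1.0 / math.sqrt(2.0)
# Minimal error strategy: P_inc = 0, P_err > 0.
print("minimum error:   P_err =", helstrom_min_error(overlap), " P_inc = 0")
# Unambiguous strategy: P_err = 0, P_inc > 0.
print("unambiguous:     P_err = 0  P_inc =", unambiguous_min_inconclusive(overlap))
```

For a fixed overlap, the zero-error strategy pays with a much larger inconclusive probability than the error incurred by the minimum-error strategy, which is the kind of trade-off explored numerically below.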
In practice, we propose the use of a classical computer to optimize POVMs, implemented as parametrised quantum circuits, for state discrimination by changing the input quantum states and parameters of a digital quantum processor. This quantum-classical hybrid method has the advantage of delegating the expensive part of the optimization process, which is the evolution of quantum states, to a quantum processor, and the resulting circuits can be considered as quantum neural networks. The reason is the close mathematical relation between unitary circuit structures and classical unitary-weight neural network structures [22,23,24]. It has been shown that unitarity of the layers is indeed optimal, since it can avoid the exploding or vanishing gradient problem if the activation function is chosen adequately, e.g., a ReLU [23]. Since quantum circuits do not include any activation functions between the layers, unless a projective measurement is performed, this advantage is immediately transferable to the case of quantum circuit training. As already pointed out in [22], it has further been shown that unitary weight matrices allow gradient descent methods to converge independently of the circuit depth [39], which is important for the training of circuits with larger circuit depth.
Here we treat the general problem of state discrimination for an ensemble of families of pure states. In this setting, each state in the ensemble is drawn from a family of states, each parametrised by a specific distribution. Concretely, the input is an ensemble of parametrised pure states $|\psi_i(a_i)\rangle$ with $a_i \sim \alpha_i$, where any $\alpha_k$ can be a distinct discrete classical probability distribution. We use the short notation $a \sim \alpha_i$ to mean that $a$ is drawn from the distribution $\alpha_i$, and $\beta(a)$ indicates some distribution for the amplitudes parametrised by $a$. In particular, this means that we perform state discrimination on the ensemble of parametrised pure states where the parameters $a_i$ that characterise the amplitudes of $|\psi_i(a_i)\rangle$ are distributed according to $p(a_i) \sim \alpha_i$. We hence draw the $a_i$ from the corresponding probability distribution $\alpha_i$ and then use $a_i$ to specify the amplitude distribution, i.e., $|\psi_i\rangle = \sum_{j=1}^{N} \beta_i(a)\,|i\rangle$, where the family of states $|\psi_i(\alpha_i)\rangle$ is then used as input into the circuit with some probability $\lambda_i$.
We now address a canonical example of discriminating between two families of non-orthogonal quantum states over the Hilbert space of a two-qubit system. One of our inputs belongs to a family of pure states $\psi_1(a)$ parametrised by a real number $a \in [0, 1]$. The second input class consists of a family of mixed states represented by two pure states $\psi_{2/3}(b)$, equal up to a sign, which are parametrised by a real number $b \in [0, 1]$. For simplicity, we first focus on the task of supervised learning of states that are parametrised by $a \in [0, 1]$, i.e., the first family of quantum states, using a fixed parameter $b = 1/\sqrt{2}$ for the second set. This allows a direct comparison to the theoretical treatment and prior optical realization of quantum state discrimination and the related task of quantum state filtering [40]. We therefore work here with states of the form $|\psi_1(a)\rangle = (\sqrt{1 - a^2}, 0, a, 0)$, together with $|\psi_{2/3}\rangle$ at $b = 1/\sqrt{2}$. For comparison with [40] we set $\lambda_1 = 1/3$ and $\lambda_2 = 2/3$. The overlap (fidelity) between $\psi_1(a)$ and $\psi_{2/3}$ is given by $a/\sqrt{2}$. Roughly speaking, this means that the task of discriminating $\psi_1(a)$ and $\psi_{2/3}$ becomes increasingly harder with increasing $a$. The unambiguous discrimination of these two families has in particular been demonstrated experimentally for the values $a = 0.25$ and $a = 0.5$ in Ref. [40], which we will use to benchmark our own results.
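As a small numerical companion to this example, the sketch below builds the state vectors and checks the stated overlap $a/\sqrt{2}$. The amplitude vector for $\psi_1(a)$ is taken from the text. The explicit form used for $\psi_{2/3}(b)$ is an assumption chosen only so that the two states differ by a sign and reproduce the stated overlap at $b = 1/\sqrt{2}$; it should not be read as the definition used in the paper. All function names are hypothetical.

```python
import numpy as np

def psi1(a):
    """Pure state psi_1(a) with amplitudes (sqrt(1 - a^2), 0, a, 0), as in the text."""
    return np.array([np.sqrt(1.0 - a**2), 0.0, a, 0.0])

def psi23(b, sign=+1):
    """ASSUMED form of psi_{2/3}(b): (0, +/- sqrt(1 - b^2), b, 0).
    Chosen so that <psi_1(a)|psi_{2/3}(b)> = a * b; not taken from the source."""
    return np.array([0.0, sign * np.sqrt(1.0 - b**2), b, 0.0])

def rho_mixed(b):
    """Equal mixture of the two sign-related pure states."""
    p2, p3 = psi23(b, +1), psi23(b, -1)
    return 0.5 * (np.outer(p2, p2) + np.outer(p3, p3))

b = 1.0 / np.sqrt(2.0)
for a in (0.25, 0.5):
    overlap = psi1(a) @ psi23(b)
    print(f"a = {a}: overlap = {overlap:.4f}, a/sqrt(2) = {a / np.sqrt(2.0):.4f}")
```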
In this paper, we use a quantum-classical hybrid scheme, in which the (simulated) quantum computer is used as a subroutine for the optimisation task and is called from a classical device. The classical machine thereby controls the input states to the quantum circuit, assesses the output and finally optimizes the parameters of the quantum circuit depending on the results of this subroutine. Our quantum circuit is implemented as a general POVM with 4 possible measurement outcomes (Eq. 2.5). The parametrisation of the circuit is described in detail in Appendix A. We hence use 2 (ancilla) qubits for the measurements, and denote the measurement outcomes as $m_{i_2 i_1}$, where $i_1, i_2 \in \{0, 1\}$ are the measurement outcomes of the first and second qubit respectively. Since we have four possible outcomes but only three different input signals, i.e., $\psi_1(a)$, $\psi_{2/3}$, and the inconclusive result, we arbitrarily choose a correspondence between outcomes and signals. Here in particular we assign $m_{00}$ or $m_{10}$ to the input $\psi_1(a)$, while a measurement outcome of $m_{01}$ is assigned to the input $\psi_{2/3}$, and $m_{11}$ to the inconclusive measurement. With these assumptions, the various probabilities mentioned before ($P_{\mathrm{suc}}$, $P_{\mathrm{err}}$, and $P_{\mathrm{inc}}$) can be defined in a natural way. A simplified version of the circuit we use for the discrimination task is given in Eq. 2.5.
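The outcome assignment described above can be sketched in a few lines. The snippet maps the four ancilla outcomes to the three judgments and accumulates the success, error and inconclusive probabilities for a given input class. The string labels and the function name are hypothetical, and the outcome probabilities are assumed to be supplied by a simulator or by measurement statistics.

```python
# Outcome keys are the bit strings "i2i1" of the two ancilla measurements.
ASSIGNMENT = {
    "00": "psi1",          # m_00 -> psi_1(a)
    "10": "psi1",          # m_10 -> psi_1(a)
    "01": "psi23",         # m_01 -> psi_{2/3}
    "11": "inconclusive",  # m_11 -> inconclusive result
}

def judgment_probabilities(outcome_probs, true_label):
    """Return (P_suc, P_err, P_inc) for one input state.

    outcome_probs: dict mapping "00", "01", "10", "11" to probabilities.
    true_label:    "psi1" or "psi23", the class the input was drawn from.
    """
    p_suc = p_err = p_inc = 0.0
    for outcome, prob in outcome_probs.items():
        guess = ASSIGNMENT[outcome]
        if guess == "inconclusive":
            p_inc += prob
        elif guess == true_label:
            p_suc += prob
        else:
            p_err += prob
    return p_suc, p_err, p_inc

example = {"00": 0.5, "10": 0.2, "01": 0.1, "11": 0.2}
print(judgment_probabilities(example, "psi1"))
```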
In construction of shallow circuits, since C-NOT gates are usually much more prone to errors than single-qubit gates, an ideal decomposition minimizes the number of C-NOTs required for a particular task. The problem of finding near optimal (in terms of the required number of C-NOT gates) circuit topologies for quantum channels from m to n qubits was considered in [41,42]. We adapt here the construction given therein to find circuit topologies for POVMs. The procedure is as follows. Since all possible POVMs on m input qubits can be represented by a quantum channel from m to n qubits, we can find the low-cost circuit topology for POVMs, which is universal, by applying a QR decomposition and a Cosine-Sine decomposition of the quantum channel, similarly to [41,42] (Appendix A). The high level idea for proving the near optimality of the channel decomposition is then based on a parameter counting argument. One inspects the manifold of the quantum channel and determines its dimensionality, which gives a lower bound for the number of parameters and hence for the number of single-qubit gates which need to be introduced in the circuit topology. If this number of parameters is not introduced, one can only construct a set of measure zero in the manifold of all channels with the given circuit topology.
Next, we define our theoretically motivated cost function that is used for our optimisation task throughout the paper. Denote by $S_i$ the sets of quantum states for the different families of distributions acquired to train the circuit, which can be viewed as a specific sampling of the distributions $\alpha_i$. The cost function (Eq. 2.6) is built from the following quantities: $|S_i|$ is the number of samples in the training set $S_i$, $\alpha_{\mathrm{err}}$ is the penalty for making errors, and $\alpha_{\mathrm{inc}}$ is the penalty for giving inconclusive outcomes. $P_{\mathrm{suc}}(\psi)$, $P_{\mathrm{err}}(\psi)$ and $P_{\mathrm{inc}}(\psi)$ are the probabilities of giving a correct, erroneous or inconclusive measurement outcome for an input wave function $\psi$. This cost function reflects three things: the reward given for a successful discrimination of the states, the penalty (controlled by $\alpha_{\mathrm{err}}$) for a wrong discrimination, and the penalty (controlled by $\alpha_{\mathrm{inc}}$) for an inconclusive result. The probabilities $P_{\mathrm{suc/err/inc}}(\psi)$ are directly accessible in the numerical simulations, and can in practical experiments be inferred through repeated measurements on $\psi_1$, up to some precision. Eq. 2.6 hence measures the performance of the quantum circuit of Eq. 2.5, which we use in the following for the state discrimination task.
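To make the role of the two penalties concrete, the following sketch implements one plausible cost of this type. The functional form (reward for success minus weighted penalties, averaged over each training set and summed over the sets, with the sign chosen so that the cost is minimised) is an assumption made for illustration and is not claimed to reproduce Eq. 2.6. The function name and data layout are likewise hypothetical.

```python
def penalised_cost(training_sets, alpha_err, alpha_inc):
    """ASSUMED illustrative cost, to be minimised.

    training_sets: list of sets S_i; each S_i is a list of
                   (P_suc, P_err, P_inc) triples, one per sampled state.
    alpha_err:     penalty weight for erroneous outcomes.
    alpha_inc:     penalty weight for inconclusive outcomes.
    """
    total = 0.0
    for samples in training_sets:
        set_sum = 0.0
        for p_suc, p_err, p_inc in samples:
            # Success lowers the cost; errors and inconclusive results raise it.
            set_sum += -p_suc + alpha_err * p_err + alpha_inc * p_inc
        total += set_sum / len(samples)  # average over |S_i| samples
    return total

# A large alpha_err pushes the optimum towards unambiguous discrimination,
# a large alpha_inc towards minimal error discrimination.
s1 = [(0.6, 0.0, 0.4), (0.5, 0.1, 0.4)]
s2 = [(0.7, 0.05, 0.25)]
print(penalised_cost([s1, s2], alpha_err=25, alpha_inc=2))
```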
Since the families of quantum states we aim to discriminate consist of a set of infinitely many wave functions ($a$ is a real number), we adopt a strategy which is common in machine learning. We sample a set of values $a$ from the range $[0, 1]$ and divide the resulting set into a training and a test set. The training set (denoted $S_{\mathrm{train}}$) is then used to calculate the cost function and to optimise the quantum circuit during the training phase, while the test set (denoted $S_{\mathrm{test}}$) is used for the verification of the performance of the trained model, i.e., in our case the quantum discrimination circuit. The concrete optimisation of the circuit with the classical machine was done by minimising our cost function for the training set using Adam, where the gradients are calculated using forward differences.
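The optimisation loop just described can be sketched as follows: a forward-difference gradient feeds a standard Adam update. This is a minimal illustration on a toy cost function and not the training code used for the experiments. The Adam hyperparameters are the ones quoted in the footnotes of this paper; the sub-sampling of the training set is omitted, and the function names and the toy cost are assumptions.

```python
import numpy as np

def forward_difference_gradient(cost, theta, step):
    """Approximate the gradient of cost at theta with forward differences."""
    base = cost(theta)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        shifted = theta.copy()
        shifted[j] += step
        grad[j] = (cost(shifted) - base) / step
    return grad

def adam_minimise(cost, theta0, n_iter=5000, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8, step=1e-6):
    """Minimise cost with Adam, using forward-difference gradients."""
    theta = np.array(theta0, dtype=float)  # private copy of the parameters
    m = np.zeros_like(theta)               # first-moment estimate
    v = np.zeros_like(theta)               # second-moment estimate
    for t in range(1, n_iter + 1):
        g = forward_difference_gradient(cost, theta, step)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1**t)       # bias-corrected moments
        v_hat = v / (1.0 - beta2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy stand-in for the circuit cost: a quadratic bowl with a known minimum.
target = np.array([0.3, -0.2])
toy_cost = lambda th: float(np.sum((th - target) ** 2))
print(adam_minimise(toy_cost, [0.0, 0.0], n_iter=3000))
```

In an actual experiment, `toy_cost` would be replaced by a function that evaluates the circuit on (a sub-sample of) the training set and returns the cost.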

Training on simulated quantum computers
In this section, we describe several numerical experiments performed on classical hardware using the exact probabilities as defined above. The extension of this approach to probabilities estimated from repeated measurements is treated in Section 4. In all the numerical experiments, we used the three values $P_{\mathrm{suc}}$, $P_{\mathrm{err}}$ and $P_{\mathrm{inc}}$ as defined in equation 3.1, where $S_i$ is the training or test set, to summarise and compare the training results. We found empirically that saddle points often exist in training and hence stochastic gradient descent does not perform well. This motivates the use of Adam over stochastic gradient descent throughout the experiments. We will later compare the performances of various different stochastic optimization techniques.
Discriminate an input from a normal distribution centred at $a_0 \in [0, 1]$. In this experiment, we optimise the cost function $J_1$ with $a \in S_{\mathrm{train}}$ centred around $a_0$ (footnote 1), which allows for a direct comparison of our trained circuit with the results for unambiguous state discrimination of [40] (footnote 2). We train our model using Adam (footnote 3) and test it on the test set $S_{\mathrm{test}} = \{a_0\}$. In repeated experiments, we did not reach on average the theoretical minimum inconclusiveness $P_{\mathrm{inc}}$ when we used a high penalty to force the error probability $P_{\mathrm{err}}$ to be close to zero. However, since the training used data with $a$ centred at, but not fixed at, the point $a_0$, we should not expect the trained circuit to obtain exactly the same minimal value. Moreover, we are able to obtain a $P_{\mathrm{inc}}$ which is close to the theoretical minimum if we allow a small but non-zero error probability. This demonstrates that a less strict constraint on $P_{\mathrm{err}}$ allows for a much lower inconclusiveness rate, i.e., there is a trade-off between the error rate ($P_{\mathrm{err}}$) and the inconclusive rate ($P_{\mathrm{inc}}$) for this problem during the training process (c.f. Fig. 3). To allow a visual comparison with Fig. 2 in [40], we first show the averaged performance of our circuit for $(\alpha_{\mathrm{err}}, \alpha_{\mathrm{inc}}) = (25, 2)$ in Fig. 2 and then the resulting trade-off in Fig. 3.

Footnote 1: The centred data was chosen from 100 points that were sampled i.i.d. from a normal distribution with mean $\mu = a_0$ and standard deviation $\sigma = 0.01$. The normal distribution was truncated so that $0 \le a \le 1$.

Footnote 2: Note that this problem is commonly denoted as state filtering (SF).

Footnote 3: Adam used a stochastic gradient taking 50 sub-samples of $a$ in the training set. The parameters for Adam are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ and learning rate 0.001, as in Appendix B. The gradient was calculated using the forward differences formula with step size $10^{-6}$.

Figure 2: Performance for unambiguous discrimination with training data centred at $a_0$ with standard deviation 0.01. The penalty parameters are set to $(\alpha_{\mathrm{err}}, \alpha_{\mathrm{inc}}) = (25, 2)$. The trained circuit unambiguously discriminates the pure states $\rho_{\mathrm{pure}} = |\psi_1(a_0)\rangle\langle\psi_1(a_0)|$ from the mixed states $\rho_{\mathrm{mixed}} = \frac{1}{2}(|\psi_2\rangle\langle\psi_2| + |\psi_3\rangle\langle\psi_3|)$. The values shown are the exact probabilities for the given specific measurement outcome, and are averaged over 50 runs. For reference, we provide the theoretically optimal success rate of 0.833 for $a_0 = 0.25$ and 0.666 for $a_0 = 0.50$ from Mohseni et al. [40].

Figure 3: Trade-off between the error probability and the inconclusiveness for training data centred at $a_0$ with standard deviation 0.01. The lines represent averaged quantities of $P_{\mathrm{err}}$ and $P_{\mathrm{suc}}$ which we obtain by increasing $\alpha_{\mathrm{err}}$ through 10, 15, 20, ..., 40, with $\alpha_{\mathrm{inc}}$ fixed at 2. Bars represent the standard deviation. The dashed vertical line represents the theoretical minimal inconclusive rate for unambiguous state discrimination [40]. Although the training does not reach the minimal inconclusiveness rate when we require $P_{\mathrm{err}} \approx 0$, we are able to obtain a much smaller $P_{\mathrm{inc}}$ by accepting a small non-zero $P_{\mathrm{err}}$. This trade-off between $P_{\mathrm{err}}$ and $P_{\mathrm{inc}}$ can be useful in realistic applications. The standard deviations are estimated to be 0.001 to 0.01 for $P_{\mathrm{err}}$ and 0.04 to 0.12 for $P_{\mathrm{inc}}$.

Discriminate among all data $a \in [0, 1]$. Here we aim to discriminate the inputs from two classes of states where the second class are wave functions coming from the family $\psi_1(a)$ with $a \in [0, 1]$. We train and test the circuits with $S_{\mathrm{train}}$ and $S_{\mathrm{test}}$, which are two random subsets sampled from the range $[0, 1]$ (footnote 4). Inspired by the trade-off found in the previous results for discriminating a single data point, we train the cost function $J_1$ for different choices of the parameters $(\alpha_{\mathrm{err}}, \alpha_{\mathrm{inc}})$ using Adam (footnote 5). Our results show that, in comparison with the case of zero penalties ($\alpha_{\mathrm{err}} = \alpha_{\mathrm{inc}} = 0$), the penalties act as a form of regularisation and can be adjusted to give a higher success probability or a lower inconclusiveness rate for the final model. Furthermore, we observe a gradual transition from unambiguous discrimination (characterised by near-zero error probability) to minimal error discrimination (characterised by near-zero inconclusiveness) with varying penalties (c.f. Fig. 5). The results for the trained circuit for specific pairs of $\alpha_{\mathrm{err}}$ and $\alpha_{\mathrm{inc}}$, after being fine-tuned to closely resemble both the unambiguous discrimination strategy and the minimal error discrimination strategy, are shown in Fig. 4.

Footnote 4: $J_1$ was optimised with $S_{\mathrm{train}}$ chosen to be 100 evenly spaced points from the range $[0, 1]$. The test set $S_{\mathrm{test}}$ was chosen to be 150 points sampled i.i.d. at random in $[0, 1]$.

Footnote 5: We use as input parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ and a learning rate of 0.001 for Adam (c.f. Appendix B). The gradient is calculated stochastically using 50 sub-samples chosen i.i.d. at random from $S_{\mathrm{train}}$. The step size for the numerical calculation of the gradient is chosen to be $10^{-3}$ for the forward differences formula. All data is obtained after 5000 iterations of Adam.

Generalisability of training data. On the other hand, we hypothesize that the fidelity is a good indicator of the generalisation ability of the trained model. To test this hypothesis, we train the circuit with $a$ restricted to a subset of $[0, 1]$, where $a$ is chosen close to 1 (recalling that the fidelity between $\psi_1(a)$ and $\psi_{2/3}$ is proportional to $a$), and then test the performance of the trained circuit on the full range of $a \in [0, 1]$. The parameters for the training are the same as in the previous experiment with centred data. We find for the unambiguous discrimination case (i.e., large penalty for errors) that the training result is dominated by wave functions $\psi_1(a)$ with $a$ close to 1 (c.f. Fig. 6). This reflects the difficulty of distinguishing $\psi_1(a)$ from $\psi_{2/3}$ if $a$ is close to 1.
Distinguishing data from different probability distributions. Here we show that our quantum circuit has the power to unambiguously classify data which was not seen during the optimisation process. We attempt to show this by allowing both parameters $a$ and $b$ to be drawn from some probability distribution. That is, we use training data which is sampled from some distribution, and test its performance with data sampled from the same distribution for both families of states. We show that the trained circuit can successfully classify data with $a$ and $b$ drawn from the normal distribution, the uniform distribution, or a mixture of the two. The resulting success rate was inversely correlated with the averaged fidelity between the quantum states. The data distribution is shown in Fig. 7.
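A minimal sketch of how such parameter samples could be generated is given below. It draws values in $[0, 1]$ from a truncated normal distribution, a uniform distribution, or a mixture of the two. The means, widths, mixture weight and function names are illustrative assumptions and are not the settings behind Fig. 7. The averaged overlap uses the product $a \cdot b$, which is an assumed generalisation of the overlap $a/\sqrt{2}$ stated for $b = 1/\sqrt{2}$.

```python
import numpy as np

def sample_truncated_normal(mean, std, n, rng):
    """Rejection-sample n values from a normal distribution restricted to [0, 1]."""
    samples = np.empty(0)
    while samples.size < n:
        draws = rng.normal(mean, std, size=n)
        kept = draws[(draws >= 0.0) & (draws <= 1.0)]
        samples = np.concatenate([samples, kept])
    return samples[:n]

def sample_uniform(n, rng):
    """Sample n values uniformly from [0, 1]."""
    return rng.uniform(0.0, 1.0, size=n)

def sample_mixture(n, rng, mean=0.5, std=0.1, weight=0.5):
    """Each sample comes from the truncated normal with probability `weight`,
    and from the uniform distribution otherwise."""
    from_normal = rng.random(n) < weight
    normal_part = sample_truncated_normal(mean, std, n, rng)
    uniform_part = sample_uniform(n, rng)
    return np.where(from_normal, normal_part, uniform_part)

def mean_overlap(a_samples, b_samples):
    """ASSUMPTION: overlap of psi_1(a) and psi_{2/3}(b) taken as a * b."""
    return float(np.mean(a_samples * b_samples))

rng = np.random.default_rng(0)
a_train = sample_truncated_normal(0.5, 0.1, 100, rng)
b_train = sample_mixture(100, rng)
print("averaged overlap on the training sample:", mean_overlap(a_train, b_train))
```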

Learning convergence from ensemble measurements
Here we simulate the process that a classical-quantum hybrid scheme would implement when utilizing a quantum device, and analyse its performance. These numerical simulations can in principle be validated in a physical experiment, where the measurement outcomes are used to infer the different probabilities for the cost function. To obtain a good estimate of the probabilities, and hence of the cost function, one has to make repeated measurements to train the model; we note in particular that better methods to evaluate the analytical gradient are available on a shallow quantum device [17]. We first give a brief discussion of the estimated number of repeated measurements required to approximate the gradient, which follows [18, Section 3]. The gradients are calculated using the forward difference formula

$$\frac{\partial f}{\partial x} = \frac{f(x + \epsilon) - f(x)}{\epsilon} + O(\epsilon).$$

The error in the calculation of $f$ must therefore be at most of the order of $O(\epsilon^2)$, in order to prevent it from dominating the total error. To achieve this ideally with a 99% probability, one requires the number of repeated measurements to be of the order $1/(\epsilon^2)^2 = 1/\epsilon^4$ (footnote 6). For example, when $\epsilon = 10^{-3}$, the ideal number of repetitions is $10^{12}$.

Footnote 6: This assumes that the cost function follows a normal distribution with standard deviation of the order $1/\sqrt{N}$, where $N$ is the number of measurements made in each run in order to calculate the cost function.

In practice, we do not use $1/\epsilon^4$ measurements, since Adam is designed with the stochasticity of the cost function taken into account. To give an estimate of the number of repeated measurements required for convergence of the optimisation process, we perform two numerical experiments. We first look at the case when the number of repeated measurements is $\geq 10^3$ and $\epsilon = 10^{-2}$. We find that $10^5$ repeated measurements for each iteration are a robust configuration for successful convergence. Second, we vary the learning rate and increase the maximal number of iterations for Adam, setting $\epsilon = 10^{-2}$ and taking only 100 repeated measurements. We observe that the optimisation is successful with a large number of iterations. In both experiments, the penalties were set to $\alpha_{\mathrm{inc}} = 5$ and $\alpha_{\mathrm{err}} = 40$.

Large number of repetitions. Our results show that for a fixed maximum number of iterations (5000) for Adam, a combination of $\epsilon = 10^{-2}$ and $10^5$ repeated measurements gives robust results, i.e., the final cost function is close to the value obtained with the exact probability (with an error within 3%) and is stable (with a relative standard deviation of 13%). A more detailed description of the trade-off between repeated measurements and the stability of the cost function is shown in Fig. 8.

Figure 4: Performance of learned quantum circuits. The values shown are the exact probabilities for a given specific measurement outcome, averaged over 50 repeated runs. The probability for $\psi_1(a)$ is averaged over all $a \in S_{\mathrm{test}}$. Here we find two types of discrimination strategy. When $\alpha_{\mathrm{err}} > \alpha_{\mathrm{inc}}$ (Fig. (a) and Fig. (b), where $(\alpha_{\mathrm{err}}, \alpha_{\mathrm{inc}}) = (20, 2)$), we obtain unambiguous discrimination with zero error rate. When $\alpha_{\mathrm{inc}} > \alpha_{\mathrm{err}}$ (Fig. (c) and Fig. (d), where $(\alpha_{\mathrm{err}}, \alpha_{\mathrm{inc}}) = (5, 20)$), we obtain a minimal error discrimination with zero inconclusiveness rate. By comparing Fig. (a) with Fig. (b) (and (c) with (d)), one notices the general feature that the degree of non-orthogonality of the wave function determines the hardness.

Figure 5: Fig. (a)-(c) show the gradual transition from unambiguous discrimination (near-zero error rate) to minimal error discrimination (near-zero inconclusiveness) with different parameters $\alpha_{\mathrm{err}}$ and $\alpha_{\mathrm{inc}}$. The bottom right regions in all plots are mostly uniform, which shows that it is much harder to obtain a small inconclusiveness rate than a small error rate. Compared with the point $\alpha_{\mathrm{err}} = \alpha_{\mathrm{inc}} = 0$, the added penalties improve the success probability or the inconclusiveness respectively. The data was tested on $a \in [0, 1]$, and averaged over 50 runs. The standard deviation for $P_{\mathrm{suc}}$ is shown in Fig. (d). With an increasing standard deviation (closer to the diagonal line), the result becomes increasingly unstable when the two penalties ($\alpha_{\mathrm{err}}$ and $\alpha_{\mathrm{inc}}$) are comparable in magnitude. The standard deviation for other values shows the same pattern as for $P_{\mathrm{suc}}$.
Small learning rates and high number of iterations. Our numerical experiments further showed that, in the case of a small number of repetitions, a lowered learning rate could effectively counter the noise introduced by the insufficient sampling, although in this case the optimisation required more iterations to finish. For example, with only 100 repeated measurements, the variance of the cost function $J_1$ after 20000 iterations decreased as we lowered the learning rate (Fig. 9(a)). The optimisation process, in which the cost function $J_1$ slowly approaches the optimal value, is shown in Fig. 9(b). Here, the gradient step was taken as $\epsilon = 10^{-2}$.
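The scaling argument of this section can be illustrated numerically. The sketch below estimates an outcome probability from a finite number of simulated measurements and compares the empirical spread of the estimate with the expected $1/\sqrt{N}$ behaviour; it also evaluates the ideal repetition count $1/\epsilon^4$ quoted above. The true probability 0.3, the number of repeats and the function names are arbitrary illustrative choices.

```python
import numpy as np

def required_shots(eps):
    """Ideal number of repetitions, 1 / eps^4, for forward-difference step eps."""
    return 1.0 / eps**4

def estimate_probability(p_true, n_shots, rng):
    """Estimate an outcome probability from n_shots simulated measurements."""
    return rng.binomial(n_shots, p_true) / n_shots

def spread_of_estimates(p_true, n_shots, n_repeats, rng):
    """Empirical standard deviation of the estimate over n_repeats runs."""
    estimates = rng.binomial(n_shots, p_true, size=n_repeats) / n_shots
    return float(np.std(estimates))

rng = np.random.default_rng(1)
p_true = 0.3
print("one estimate from 100 shots:", estimate_probability(p_true, 100, rng))
for n_shots in (10**2, 10**3, 10**5):
    empirical = spread_of_estimates(p_true, n_shots, 2000, rng)
    expected = np.sqrt(p_true * (1.0 - p_true) / n_shots)  # binomial standard error
    print(f"N = {n_shots:>6}: empirical std = {empirical:.5f}, expected = {expected:.5f}")

print("ideal repetitions for eps = 1e-3:", required_shots(1e-3))
```

The spread shrinks only as $1/\sqrt{N}$, which is why reaching an error of order $\epsilon^2$ in the cost function is so expensive and why a noise-tolerant optimiser with a lowered learning rate is attractive.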

Conclusions
We have developed a universal quantum circuit learning approach for discrimination and classification of quantum data. In particular, we have designed a theoretically motivated cost function and then used the stochastic optimisation algorithm Adam in a quantum-classical hybrid scheme to train a circuit to perform quantum state discrimination. The training was performed over a previously specified range of input states, without, however, training the circuit on the whole range. This training process generalised well for the discrimination task on new data, i.e., states from the parameter range which had not been seen during the training process. This in particular distinguishes our work from previous results on quantum circuit learning, e.g. the very recent study in [43], which only optimise circuits for specific inputs. Note that prior work hence does not consider the generalisation ability and therefore does not treat the actual learning problem, which aims at optimisation as well as generalisation.
Figure caption: The circuits trained on a small range of a close to 1 show a performance that is similar to that of circuits trained on the whole range a ∈ [0, 1]. We therefore conclude that the training is highly dominated by the wave functions ψ_1(a) with a close to 1. This reflects the fact that with increasing fidelity (F(ψ_1(a), ψ_{2/3}) ∝ a), two quantum states are generally harder to discriminate. The test data was averaged over 10 repeated runs with parameters α_err = 40 and α_inc = 4. The bars indicate the standard deviations.
We observed a trade-off between error and inconclusiveness rates when we penalised them differently in the cost function. Although this experiment was done on a simulated quantum computer, i.e., classical hardware, where exact measurement probabilities are available, we showed that this optimisation could be performed experimentally with a finite number of repeated measurements of the wavefunction. Finally, we note that recent quantum methods for estimating the analytical gradient via variations in the unitaries [17] can be directly applied for training our circuits, and therefore one can perform the optimisation efficiently on near-term quantum devices.
The discriminative quantum neural networks of the form trained here could potentially be used as shallow quantum circuits for verifying or certifying other shallow or deep quantum circuits within machine learning or quantum simulation applications. They could also be used to verify the output of other generative models, such as restricted Boltzmann machines or GANs [32]. Small-scale discriminative quantum circuit learning could be used for constructing non-trivial (e.g., POVM-based) receivers in quantum metrology [28], sensing [29], and imaging [31].

A Quantum Circuits for POVMs
This section describes the parametrisation of the circuit 2.5.

A.1 Cosine-Sine Decomposition
Here we mention the Cosine-Sine Decomposition of a unitary matrix, which will be used frequently in the following. Every unitary matrix U ∈ C^{2^n × 2^n} can be decomposed as
$$U = \begin{pmatrix} A_0 & 0 \\ 0 & A_1 \end{pmatrix} \begin{pmatrix} C & -S \\ S & C \end{pmatrix} \begin{pmatrix} B_0 & 0 \\ 0 & B_1 \end{pmatrix},$$
where A_0, A_1, B_0, B_1 are unitary matrices of size 2^{n−1} × 2^{n−1}, and C and S are real diagonal matrices of size 2^{n−1} × 2^{n−1} satisfying C^2 + S^2 = 1. It can be written as the following circuit equivalence:
[circuit diagram]
where a box represents the control part of a uniformly controlled gate; see section IV of [41] for details. In circuit 2.5, the first qubit is initialised to |0⟩, so we have
[circuit diagram]

A.2 Decomposition of circuit 2.5
For a general measurement giving at most 4 measurement outcomes, we have the following circuit representation:
[circuit diagram]
The first V can be decomposed using the circuit equivalence on page 5 of [42] into:
[circuit diagram]
where the R gate does not act on the second qubit. Applying the Cosine-Sine Decomposition gives
[circuit diagram]
The uniformly controlled V and U can be merged and put after the measurement of M_1 as:
[circuit diagram]
The first line of the circuit can be merged with the second line as follows:
[circuit diagram]
We then apply the Cosine-Sine Decomposition to V and, discarding the last gate on the third and fourth qubits, obtain:
[circuit diagram]
The uniformly controlled rotations and the remaining two-qubit unitary gates can easily be parametrised by CNOTs and single-qubit rotations; see, for example, [44] and [45].
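The Cosine-Sine Decomposition of Section A.1 can be checked numerically. The sketch below is only an illustration and not part of the construction above: it uses `scipy.linalg.cossin` on a random 2-qubit unitary, splits the factors into the blocks A_0, A_1, B_0, B_1, C and S, and verifies the reconstruction. It assumes the block sign convention returned by SciPy (−S in the upper-right block) and that the first qubit corresponds to the most significant index of the state vector; the random seeds and variable names are arbitrary.

```python
import numpy as np
from scipy.linalg import cossin
from scipy.stats import unitary_group

n = 2                 # number of qubits (illustrative choice)
dim = 2**n
half = dim // 2

# Random unitary to decompose (fixed seed for reproducibility).
U = unitary_group.rvs(dim, random_state=0)

# U = u @ cs @ vdh with u, vdh block diagonal and cs = [[C, -S], [S, C]].
u, cs, vdh = cossin(U, p=half, q=half)
A0, A1 = u[:half, :half], u[half:, half:]
B0, B1 = vdh[:half, :half], vdh[half:, half:]
C, S = cs[:half, :half], cs[half:, :half]

reconstructs = np.allclose(U, u @ cs @ vdh)
unit_norm = np.allclose(C @ C + S @ S, np.eye(half))

# With the first qubit in |0>, only the first block column of U acts:
# U (|0> x |psi>) = |0> x (A0 C B0 |psi>) + |1> x (A1 S B0 |psi>).
rng = np.random.default_rng(1)
psi = rng.normal(size=half) + 1j * rng.normal(size=half)
psi /= np.linalg.norm(psi)
lhs = U @ np.concatenate([psi, np.zeros(half)])
rhs = np.concatenate([A0 @ C @ B0 @ psi, A1 @ S @ B0 @ psi])
first_qubit_zero = np.allclose(lhs, rhs)

print(reconstructs, unit_norm, first_qubit_zero)
```

The last check shows why B_1 drops out when the first qubit starts in |0⟩: only the blocks A_0 C B_0 and A_1 S B_0 ever act on the input state.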

B Adam algorithm for stochastic optimisation
Here we provide a brief introduction to the Adam algorithm for stochastic optimisation. Adam is based on the gradient descent algorithm for function optimisation. The gradient descent algorithm starts with an initial guess θ_1 of the minimising parameter and updates this parameter iteratively, until a minimal value is obtained, by the following rule:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta_t).$$
Here J is the cost function, which could be stochastic, and α is called the learning rate; its value requires empirical tuning. We used two improvements on gradient descent to optimise our circuit.
Stochastic calculation of gradient. In practice, the cost function J often has the following form:
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} J_i(\theta).$$
Here each J_i is usually associated with a single datum for optimisation. The calculation of the gradient may be computationally expensive when N is large, in which case we calculated it in a stochastic manner. Specifically, at each calculation of the gradient, we sampled a mini-batch B = {i_1, i_2, ..., i_{|B|}}, drawn uniformly from the training data, and calculated an estimate of the gradient on this mini-batch:
$$\nabla_\theta J(\theta) \approx \frac{1}{|B|} \sum_{i \in B} \nabla_\theta J_i(\theta).$$
The mini-batch size |B| was held fixed throughout the training.
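The mini-batch estimate described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in this work: the per-datum cost J_i(θ) = (θ − x_i)^2 / 2 is a hypothetical toy example, the mini-batch is drawn without replacement (the text only states that it is drawn uniformly), and the function and variable names are invented for the example.

```python
import numpy as np


def minibatch_gradient(grad_i, theta, n_data, batch_size, rng):
    """Average the per-datum gradients over a uniformly drawn mini-batch."""
    # Assumption: indices are drawn without replacement.
    batch = rng.choice(n_data, size=batch_size, replace=False)
    return np.mean([grad_i(theta, i) for i in batch], axis=0)


rng = np.random.default_rng(1)
data = rng.normal(size=1000)  # hypothetical training data x_i


def grad_i(theta, i):
    """Gradient of the toy cost J_i(theta) = 0.5 * (theta - x_i)**2."""
    return theta - data[i]


theta = np.array([2.0])

# Full gradient: average over all N data points.
full_gradient = np.mean([grad_i(theta, i) for i in range(len(data))], axis=0)

# Stochastic estimate from a mini-batch of fixed size 50.
estimate = minibatch_gradient(grad_i, theta, len(data), 50, rng)

print(full_gradient, estimate)
```

The estimate fluctuates around the full gradient from one mini-batch to the next, which is the source of stochasticity that Adam's moving averages are designed to smooth.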
Adaptive moment estimation (Adam). Adam improves on gradient descent by incorporating two pieces of information:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.$$
Here g_t = ∇_θ J(θ_t), and g_t^2 = g_t ⊙ g_t is the vector of element-wise squares of the gradient. The first term m_t is an exponential moving average of the gradient controlled by the parameter β_1, and the second term v_t is an exponential moving average of g_t^2 controlled by the parameter β_2. These two values are combined in updating the parameter θ in the following manner:
$$\theta_{t+1} = \theta_t - \alpha \frac{m_t}{\sqrt{v_t} + \varepsilon}. \tag{B.6}$$
Here ε is a small number to avoid division by zero when v_t is initialised to be 0.
With m_t, the parameter update in Eq. B.6 favours directions in which the gradient consistently points the same way, and disfavours directions in which the gradient oscillates back and forth. Intuitively, m_t makes the optimisation of the cost function J "accelerate" by accumulating "momentum" from g_t.
With v_t in Eq. B.6, a moving average of the magnitude of the gradient in each direction is included, and directions with smaller gradients are amplified in the update. Intuitively, this amplifies the influence of rarely seen features (which contribute small gradients) on the training.
In practice, we initialised m_t and v_t to be zero vectors, which made the moving averages biased towards zero. This problem was corrected by the following adjusted updating rules (as suggested in the original paper [46])$^{7}$:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}.$$
Generally, Adam is very robust to the choice of parameters, and a good choice of parameters suggested by its authors is: α = 0.001, β_1 = 0.9, β_2 = 0.999, and ε = 10^−8.
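The bias-corrected Adam update described in this appendix can be written compactly as follows. This is a minimal NumPy sketch for illustration, not the implementation used for the results in this work: the function name `adam_minimise`, the toy quadratic cost, and the choice α = 0.01 in the usage example are assumptions, while the default values follow the parameters quoted above.

```python
import numpy as np


def adam_minimise(grad, theta0, n_steps=5000, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a cost function with Adam, given a function returning its gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # moving average of the gradient
    v = np.zeros_like(theta)  # moving average of the squared gradient
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction for the zero initialisation of m and v.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta


# Toy example (assumption): J(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])


def toy_gradient(theta):
    return theta - target


theta_opt = adam_minimise(toy_gradient, np.zeros(2), n_steps=5000, alpha=0.01)
print(theta_opt)
```

With the bias correction, the very first step has magnitude close to α in every coordinate regardless of the gradient scale, which is one reason the method is comparatively insensitive to the scaling of the cost function.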

C Discussions on stochastic optimization strategies
In our work we observed poor performance for stochastic gradient descent (SGD). However, when replacing SGD with the Adam algorithm, we were able to recover a good solution, i.e., nearly optimal results. Recent results by [47] imply that classical-quantum hybrid methods might not perform well in practice, based on a proof involving random unitaries. However, we showed here that the optimisation procedure works well for the distributions of qubits studied if we replace stochastic gradient descent with improved methods like Adam or RMSProp [48] (cf. Fig. 10). Since it is in general hard to determine the concrete reasons for the failure of an optimisation process from the resulting performance of different optimisation algorithms, we can only hypothesise about the origins of the observed behaviour. One explanation is the use of non-optimal training parameters. A solution to this would be to optimise these parameters, which we did not do in this work.
Another explanation is the saddle-point hypothesis. Since methods like Adam and RMSProp have been shown to perform particularly well in high-dimensional landscapes, we suggest that the widely believed hypothesis of the proliferation of saddle points [49,50,51,52,53] might also apply to quantum circuit training. However, in many practical applications stochastic gradient descent has been shown to perform well and to be able to escape saddle points due to its stochastic component, with only certain cases restricting its usage.
In comparison with our structure, the authors of Ref. [47] also assume a certain type of unitary circuit, i.e., i.i.d. randomly sampled circuits which form a random unitary matrix. Referring to Spielman and Teng's monograph on smoothed analysis [54], we conjecture that the special case of randomly sampled matrices does not apply to our highly structured problem, and support this hypothesis with the observation that, up to the number of qubits considered, we obtain a stable optimisation process. However, we leave open whether this holds for larger numbers of qubits.
Figure caption: The shading represents one standard deviation computed across 10 runs from random initial parameters.