1 Introduction

Quantum computation has been shown to provide speedups over classical computation in several applications, notably in the query model. Besides the famous Shor's algorithm for integer factorization, quantum computers can also produce statistical patterns that are hard to produce on classical devices. This raises the possibility that quantum computers can also recognize patterns that are hard to recognize for classical computers, or, more generally, that quantum computers can help solve classical machine learning problems more efficiently. Recently, the intersection of quantum computation and machine learning has received considerable attention. Using the circuit model of computation, several quantum algorithms have been designed that in principle provide quadratic to exponential speedups on classical data (Biamonte et al. 2017; Ciliberto et al. 2018).

A related area is concerned with developing novel machine learning methods that operate on quantum data. In general, any set of quantum states that encodes meaningful information can be considered quantum data. To motivate this direction, we emphasize that using quantum states as a storage medium for information has been demonstrated to provide advantages in several ways. For example, by coupling a quantum state to another target system, we can obtain information about the target system with increased sensitivity. Quantum metrology, for example, allows a quadratic improvement over classical methods in terms of the statistical sampling error, i.e., the scaling of the standard deviations of estimates obtained through repeated measurements. Another example is quantum sensing, which provides much higher sensitivity for tasks like target detection with microwaves, i.e., quantum radar (Barzanjeh et al. 2015), and, more generally, the sensing of electric or magnetic fields (Degen et al. 2017). A practical application of these methods is the reduction of damage to pictures that are sensitive to light exposure (Schaller and Schützhold 2006).

Certain types of datasets are inherently quantum mechanical. Such data could, for example, be the output of quantum information processing procedures such as the simulation of quantum materials or, more generally, quantum chemistry. For such datasets, we conjecture an inherent advantage of quantum computers in performing recognition and classification tasks. For example, topological materials prepared in an exotic topological phase have non-classical electronic properties and are promising materials for building fault-tolerant quantum computers (Qi and Zhang 2011; Karzig et al. 2017). Predicting the phase of topological materials has been a very challenging problem for classical approaches. However, it has recently been shown that quantum neural networks can be used to recognize the phase of a quantum state (Cong et al. 2019) and hence to predict this phase. In addition, the promised security of quantum communication protocols and a surge of ideas in quantum communication networks (Kimble 2008; Ren et al. 2017) further stimulate research into areas dealing with inherently quantum data.

In this work, we explore the general problem of classifying quantum data. This problem can be seen as an extension of the established field of quantum state discrimination, which identifies a quantum state among a set of a priori completely known candidate states. A key challenge in the discrimination of quantum states is that deterministic discrimination is impossible when the complex vectors representing the input states are non-orthogonal, i.e., when their overlaps are non-zero. Quantum state discrimination then allows finding the measurement that optimally discriminates these states. Note that in the following we will use the terms input data and (quantum) states interchangeably.

However, it is not possible to directly apply quantum state discrimination to classify states, i.e., quantum data. First, it is inappropriate to assume complete a priori knowledge of the input data, which are often only samples generated by a data-collecting process. Also, even with all the input data available as quantum states, performing quantum tomography on them is prohibitively expensive. In addition, quantum state discrimination often fails to give the optimal discriminative measurement in analytically closed form, unless the quantum states are already orthogonal or possess certain symmetry properties (Barnett and Croke 2009). In such cases, one may resort to numerical optimization to find the optimal measurement. However, the exponential growth of the dimensionality of the density matrices renders numerical optimization inefficient if performed on a classical device.

Due to the limitations of quantum state discrimination, it is natural to ask whether we can use a quantum computer to help with the optimization procedure. Since fully error-corrected quantum computers are not available yet, a recent stream of works proposed various applications of circuit learning (Banchi et al. 2016; Wan et al. 2017; Innocenti et al. 2018; Romero et al. 2017; Mitarai et al. 2018; Farhi and Neven 2018; Verdon et al. 2017; Li and Benjamin 2017; Grant et al. 2018; Schuld et al. 2018; Xu et al. 2019; Khatri et al. 2019), which constitute a form of quantum-classical hybrid neural network that has been shown to be less prone to the inherent errors of early-stage quantum hardware. In this work, we similarly utilize a hybrid approach to learn the design of a shallow quantum circuit for the classification of quantum states. Concretely, this hybrid scheme consists of a classical computer which interactively changes the parameters of a quantum circuit in order to optimize the output of the quantum computation. In other words, we train a quantum circuit to classify the states correctly.

The approach we take is novel in two ways. First, we use a quantum circuit ansatz that is designed for implementation on near-term devices (details in Appendix 1). This ansatz allows for a shallow circuit but is still universal; i.e., it can perform any unitary transformation allowed by quantum mechanics. It comprises gates from a universal gate set consisting of C-NOT and single-qubit gates, motivated by the fact that their implementations are known for the current mainstream experimental architectures. It is furthermore nearly optimal in terms of the number of C-NOT gates, which is an important feature for an implementation on near-term devices. Second, unlike previous works on quantum state discrimination, we focus on the generalization ability of our circuit; i.e., we train the circuit on a specific range of the parameters with the goal of maximizing its generalization performance, and hence work in a learning setting. This distinguishes our work from the pure optimization problem of the state discrimination task, i.e., optimizing the circuit to distinguish only a concrete set of states. We show that this universal quantum circuit can be trained as a discriminator for the classification of non-orthogonal quantum data sampled from various probability distributions. Our discriminator can achieve a near-zero error rate by producing inconclusive signals.

2 Dataset and framework

In this work, we propose a novel approach for training a universal quantum circuit to classify quantum data, which is stored in qubits. In this section, we first introduce the mathematical notation and description of quantum data. We then specify the quantum data we use in this work for classification. Next, we outline the approach we take to optimize a universal quantum circuit which is used to classify the quantum data. We defer the detailed decomposition of this quantum circuit to Appendix 1.

Mathematical descriptions

Quantum data are collections of quantum states which store useful information. For these data, we may assume that their density matrices ρ are parameterized by parameters a which follow a probability distribution α specific to the information carried. Then, for classification, we are normally presented with an unknown quantum state ρx, which belongs to a family of quantum states, each described mathematically by:

$$ \begin{array}{@{}rcl@{}} \rho_{i}(a_{i}), a_{i}\sim \alpha_{i}, \end{array} $$
(1)

where i is the label of the corresponding family, ρi(ai) is the density matrix describing a quantum state in family i, parametrized by ai, and the parameters ai are assumed to follow the probability distribution αi. The purpose of a classifier is to identify the family label x. Note that to train the classifier, we draw samples of ρi(ai) according to the distribution αi.

Transformations of quantum states are described by a unitary matrix U, which transforms a quantum state ρ according to the rule ρ ↦ UρU†.

A measurement on the quantum state ρ is described by a set of matrices {Mj}, which are Hermitian, positive semi-definite, and sum to the identity. Here, j labels the possible measurement outcomes, and the probability pj of measurement outcome j is given by pj = Tr(Mjρ). Such a collection of matrices Mj is commonly called a positive-operator valued measure (POVM). A common example of a POVM is a projection-valued measure (PVM), in which each Mj is a projector onto some linear subspace and different Mj are orthogonal to each other, i.e., MjMi = δijMj. With the help of ancilla qubits, any POVM can be realized by a quantum circuit consisting of a series of unitary matrices (transformations) and measurements in the computational basis. Conversely, a quantum circuit which consists of a parameterized set of gates and measurements can also represent a range of different POVMs. There exists a quantum circuit which can represent any POVM with a fixed number of possible measurement outcomes. Such a circuit is called a universal discriminator in this paper, and the specific one we use here is discussed in Appendix 1.
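
For concreteness, the following numpy sketch (our own illustration; the single-qubit POVM elements chosen here are arbitrary and not taken from the paper) checks the defining POVM properties and evaluates the outcome probabilities pj = Tr(Mjρ):

```python
import numpy as np

# A hypothetical single-qubit POVM with three outcomes: two weighted rank-1
# projectors plus the remainder element that completes the sum to identity.
ket0 = np.array([1.0, 0.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2)
M = [0.5 * np.outer(ket0, ket0),
     0.5 * np.outer(ket_plus, ket_plus)]
M.append(np.eye(2) - M[0] - M[1])  # remainder element

# Verify: Hermitian, positive semi-definite, and summing to the identity.
assert np.allclose(sum(M), np.eye(2))
assert all(np.allclose(Mj, Mj.conj().T) for Mj in M)
assert all(np.linalg.eigvalsh(Mj).min() >= -1e-12 for Mj in M)

# Outcome probabilities p_j = Tr(M_j rho) for a pure input state rho.
psi = np.array([np.sqrt(0.7), np.sqrt(0.3)])
rho = np.outer(psi, psi.conj())
p = np.array([np.trace(Mj @ rho).real for Mj in M])
print(p, p.sum())  # non-negative and summing to 1
```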

Dataset

For this work, we restrict our attention to the classification of two families of quantum states stored in a 2-qubit system. Our first family consists of pure states, parametrized by a real number a ∈ [0, 1]:

$$ \begin{array}{@{}rcl@{}} \psi_{1}(a) = \left( \sqrt{1-a^{2}}, 0, a, 0 \right), \rho_{1} =| {\psi_{1}(a)}\rangle \langle{\psi_{1}(a)}|. \end{array} $$
(2)

The second family consists of mixed states ρ2(b) where b ∈ [0, 1]. Specifically,

$$ \begin{array}{@{}rcl@{}} &\psi_{2/3} = \left( 0, \pm \sqrt{1-b^{2}}, b, 0 \right), \\ &\rho_{2}(b) = \frac{1}{2}|{\psi_{2}}\rangle\langle{\psi_{2}}| + \frac{1}{2} |{\psi_{3}}\rangle\langle {\psi_{3}}|. \end{array} $$
(3)

The overlap between ψ1 and ψ2/3 is ab, indicating that the two families of states are non-orthogonal. For the case of a fixed a and \(b=\frac {1}{\sqrt {2}}\), the maximal success rate for unambiguously discriminating between ρ1 and ρ2 has been studied theoretically and demonstrated experimentally (Mohseni et al. 2004). The specific distributions we tested in our experiments are summarized in Table 1. To generate the data for the training, validation, and testing of our circuits, we randomly and independently sample points from the corresponding distributions.

Table 1 A summary of different test cases we classify in this work
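
As a concrete illustration of Eqs. 2 and 3, the short sketch below (our own, not part of the original text) constructs the two families of density matrices and verifies that the overlap between ψ1(a) and ψ2(b) equals ab; drawing a and b uniformly from [0, 1] is only one illustrative choice, as the concrete distributions of Table 1 are not reproduced here:

```python
import numpy as np

def rho1(a):
    """Pure-state family of Eq. (2), written as a 4-dimensional (2-qubit) vector."""
    psi1 = np.array([np.sqrt(1 - a**2), 0.0, a, 0.0])
    return np.outer(psi1, psi1)

def rho2(b):
    """Equal mixture of psi_2 and psi_3 from Eq. (3)."""
    psi2 = np.array([0.0,  np.sqrt(1 - b**2), b, 0.0])
    psi3 = np.array([0.0, -np.sqrt(1 - b**2), b, 0.0])
    return 0.5 * np.outer(psi2, psi2) + 0.5 * np.outer(psi3, psi3)

# Illustrative sampling: a and b drawn uniformly and independently from [0, 1].
rng = np.random.default_rng(0)
train_set_1 = [rho1(a) for a in rng.uniform(0.0, 1.0, size=100)]
train_set_2 = [rho2(b) for b in rng.uniform(0.0, 1.0, size=100)]

# Sanity check: <psi_1(a)|psi_2(b)> = a*b, so the families are non-orthogonal.
a, b = 0.6, 0.8
psi1 = np.array([np.sqrt(1 - a**2), 0.0, a, 0.0])
psi2 = np.array([0.0, np.sqrt(1 - b**2), b, 0.0])
assert np.isclose(psi1 @ psi2, a * b)
```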

Approach

Overall, there are two major strategies to cope with our inherent inability to perform deterministic discrimination of quantum states: (a) Minimum-error discrimination: here, the task is to minimize the probability of the inevitable classification errors. (b) Unambiguous discrimination: here, the discriminator has one more output prediction than the number of classes, the inconclusive outcome, and the task is to eliminate the error rate of the discriminator while minimizing the probability of this inevitable inconclusive outcome. A pure unambiguous discrimination with a strictly zero error rate is not guaranteed to be possible for arbitrary quantum data; from the perspective of numerical optimization, one hence needs to allow for some small but non-zero errors.

In this work, we use a machine learning approach to train a universal quantum circuit capable of realizing any quantum measurement with four possible measurement outcomes \(m_{i_{2}i_{1}}\), where i1, i2 ∈ {0, 1} are the measurement outcomes of the first and the second qubits, respectively. The parameterization of this circuit is discussed in Appendix 1. By assigning the outputs m00 and m10 to input ρ1(a), the output m01 to input ρ2, and treating m11 as the inconclusive output, this circuit acts as a discriminator for our experimental datasets. We can then directly define the various probabilities (success probability Psuc, error probability Perr, and inconclusive probability Pinc) with respect to the input (training) data with known class label. For example, when ρ1 is the input, the probability of detecting m01 is Perr, and the probability of detecting m11 is Pinc. In this work, we perform experiments on simulated quantum computers, where these probabilities are directly available since the whole state is stored and processed on a classical computer. We note that on real quantum computers, these probabilities would need to be estimated through repeated measurements up to some precision, which requires repeatedly preparing the input data.
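
The following sketch (our own naming; the actual circuit parameterization is deferred to Appendix 1) shows how the four computational-basis outcome probabilities of a 2-qubit circuit are mapped to Psuc, Perr, and Pinc for an input of known class, assuming the basis ordering |i2i1⟩ = |00⟩, |01⟩, |10⟩, |11⟩:

```python
import numpy as np

def outcome_probs(U, rho):
    """Computational-basis outcome probabilities after applying the (trained)
    2-qubit circuit unitary U to the input state rho.
    Returned order (assumed basis ordering): m00, m01, m10, m11."""
    rho_out = U @ rho @ U.conj().T
    return np.real(np.diag(rho_out))

def classify_probs(p, true_class):
    """Map the four outcome probabilities to (P_suc, P_err, P_inc) using the
    assignment from the text: m00/m10 -> class 1, m01 -> class 2, m11 -> inconclusive."""
    p00, p01, p10, p11 = p
    p_class1, p_class2, p_inc = p00 + p10, p01, p11
    if true_class == 1:
        return p_class1, p_class2, p_inc
    return p_class2, p_class1, p_inc

# Tiny usage example: the identity circuit applied to the |00> input state.
rho00 = np.zeros((4, 4)); rho00[0, 0] = 1.0
print(classify_probs(outcome_probs(np.eye(4), rho00), true_class=1))  # (1.0, 0.0, 0.0)
```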

To train the circuit, we use a heuristically motivated loss function defined in Eq. 4, which is the averaged absolute difference between the desired and the measured probabilities. It contains hyperparameters αerr and αinc that balance the erroneous against the inconclusive outcomes:

$$ \begin{array}{@{}rcl@{}} J &=& \underset{i}{\sum} \frac{1}{|S_{i}|} \underset{a_{i}\in S_{i}}{\sum} \left| P_{\text{suc}}(\rho_{i}(a_{i})) - 1 \right| \\ && + \alpha_{\text{err}} \underset{i}{\sum} \frac{1}{|S_{i}|} \underset{a_{i}\in S_{i}}{\sum} \left| P_{\text{err}}(\rho_{i}(a_{i})) - 0 \right| \\ && + \alpha_{\text{inc}} \underset{i}{\sum} \frac{1}{|S_{i}|} \underset{a_{i}\in S_{i}}{\sum} \left| P_{\text{inc}}(\rho_{i}(a_{i})) - 0 \right| . \end{array} $$
(4)

Here, we assume that for each family of quantum states, labelled by i, we are given a set Si of training samples. We denote by |Si| the cardinality of this set, i.e., the number of samples in the training set Si; αerr is the penalty for making errors, and αinc is the penalty for giving inconclusive outcomes. Psuc(ρi)/Perr(ρi)/Pinc(ρi) are the probabilities of obtaining a correct/erroneous/inconclusive measurement outcome for the specific input quantum data ρi. This loss function measures the performance of our quantum circuit as a minimal-error discriminator (when αerr < αinc) or as an unambiguous discriminator (when αerr > αinc).
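
A minimal sketch of the loss in Eq. 4 (the function names are ours, not a paper-defined API) could look as follows, assuming a callable that returns the per-sample triple (Psuc, Perr, Pinc):

```python
import numpy as np

def loss(prob_fn, samples_by_class, alpha_err, alpha_inc):
    """Loss of Eq. (4). `prob_fn(rho, i)` returns (P_suc, P_err, P_inc) for one
    input state of class i; `samples_by_class` maps a class label i to the
    list of training states S_i."""
    J = 0.0
    for i, S_i in samples_by_class.items():
        # rows: (P_suc, P_err, P_inc) for each sample in S_i
        terms = np.asarray([prob_fn(rho, i) for rho in S_i])
        J += np.mean(np.abs(terms[:, 0] - 1.0))            # |P_suc - 1|
        J += alpha_err * np.mean(np.abs(terms[:, 1]))      # |P_err - 0|
        J += alpha_inc * np.mean(np.abs(terms[:, 2]))      # |P_inc - 0|
    return J
```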

To train this circuit, we use the Adam optimization algorithm (Kingma and Ba 2014), and we calculate the gradients using the forward difference formula.
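
For concreteness, a sketch of the forward-difference gradient and a single Adam update is given below; the step size and the Adam hyperparameters are common defaults and not necessarily the values used in our experiments:

```python
import numpy as np

def forward_diff_grad(J, theta, eps=1e-3):
    """Forward-difference estimate of the gradient of the loss J at theta."""
    J0 = J(theta)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        shifted = theta.copy()
        shifted[k] += eps
        grad[k] = (J(shifted) - J0) / eps
    return grad

def adam_step(theta, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma and Ba 2014); initialize `state` as
    (np.zeros_like(theta), np.zeros_like(theta), 0)."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)
```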

For our specific problem of classifying ρ1 and ρ2 as defined in Eqs. 2 and 3, we define an extra set of success/erroneous/inconclusive rates in Eq. 5 to summarize and compare the performance of different instances of the training process:

$$ \begin{array}{@{}rcl@{}} P_{s} &=& \frac{1}{3} P_{s}(\rho_{1})_{\text{avg}} + \frac{2}{3} P_{s}(\rho_{2})_{\text{avg}} \\ &=&\frac{1}{3} P_{s}(\psi_{1})_{\text{avg}} + \frac{1}{3} P_{s}(\psi_{2})_{\text{avg}} + \frac{1}{3} P_{s}(\psi_{3})_{\text{avg}}, \end{array} $$
(5)

where s stands for suc (successful), err (erroneous), or inc (inconclusive). The subscript avg means that the probabilities are computed as the average over all samples of either the training set or the test set (but not both). The choice of weights (\(\frac {1}{3}\) and \(\frac {2}{3}\)) in Eq. 5 was made to be consistent with the results in Mohseni et al. (2004).
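
The identity in Eq. 5 can be checked directly, since P_s(ρ2(b)) = ½P_s(ψ2(b)) + ½P_s(ψ3(b)) by linearity of the trace; the short sketch below (our own illustration) assumes the ψ2 and ψ3 rates are evaluated at the same samples of b:

```python
import numpy as np

def summary_rate(rates_rho1, rates_psi2, rates_psi3):
    """Eq. (5): weight the per-family averages by 1/3 and 2/3; this equals the
    uniform average over the three pure-state families."""
    rates_rho2 = 0.5 * (np.asarray(rates_psi2) + np.asarray(rates_psi3))
    weighted = np.mean(rates_rho1) / 3.0 + 2.0 * np.mean(rates_rho2) / 3.0
    uniform = (np.mean(rates_rho1) + np.mean(rates_psi2) + np.mean(rates_psi3)) / 3.0
    assert np.isclose(weighted, uniform)
    return weighted
```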

3 Theoretical analysis

Here, we describe a theoretical result to which we will compare our numerical results. In the general case, assume we have a family (or class) of quantum data ρ(a), each element parameterized by a and occurring with probability P(a). Assume in addition that we have a quantum measurement described by a POVM with elements \(\{{\varPi }_{i}\}_{i\in \mathbb {N}}\), where i labels the different measurement outcomes. Then, the probability of detecting measurement outcome i, averaged over the input data ρ(a), is:

$$ \begin{array}{@{}rcl@{}} \int \text{Tr}({\varPi}_{i} \rho(a)) P(a)\mathrm{d}a &=& \text{Tr}\left[\int {\varPi}_{i}\rho(a) P(a)\mathrm{d}a \right] \\ &=& \text{Tr}\left[{\varPi}_{i}\int\rho(a) P(a)\mathrm{d}a\right] \\ &=& \text{Tr}\left[{\varPi}_{i} \rho\right], \end{array} $$
(6)

where \(\rho ={\int \limits } \rho (a) P(a)\mathrm {d} a\), and the integration of the matrix is performed element-wise. Therefore, if Tr(Πiρ) = 0 for some i, then \({\int \limits }_{D} \text {Tr}({\varPi }_{i} \rho (a)) P(a)\mathrm{d}a = 0\) for any subset D with non-zero measure in the parameter space of a. This follows from the fact that Tr[Πiρ(a)]P(a) ≥ 0 for any parameter a.

The analysis above shows that the problem of unambiguously discriminating \(\rho _1={\int \limits }_a \rho _1(a) P_1(a)\mathrm {d} a\) and \(\rho _2={\int \limits }_b \rho _2(b) P_2(b) \mathrm {d} b\) is equivalent to the problem of unambiguously discriminating the family ρ1(a),∀a, from the family ρ2(b),∀b, where P1(a)/P2(b) is the probability of occurrence of ρ1(a)/ρ2(b). That is, if {Π1, Π2, Πinc} is a POVM that unambiguously classifies all members of the two families ρ1(a) and ρ2(b), for all possible parameters, where Πinc corresponds to the inconclusive outcome, i.e.:

$$ \begin{array}{@{}rcl@{}} &\text{Tr}({\varPi}_{2}\rho_{1}(a)) = 0, \forall a,\\ &\text{Tr}({\varPi}_{1}\rho_{2}(b)) = 0, \forall b, \end{array} $$

then Tr(Π1ρ2) = Tr(Π2ρ1) = 0, and vice versa. Using this formalism, we can theoretically analyze the different cases described in Table 1 based on the works of Raynal et al. (2003) and Barnett and Croke (2009); the results are displayed in Table 2. Note that these are average-case success probabilities.

Table 2 A summary of maximal success rate when the error rate is exactly 0 for the different test cases classified in this work
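
The linearity argument behind Eq. 6 can also be checked numerically. In the sketch below (our own illustration, with an arbitrary positive semi-definite matrix standing in for a POVM element, the ρ1 family of Eq. 2, and a uniform distribution for a), the sample average of Tr(Πiρ(a)) coincides with Tr(Πiρ) evaluated on the averaged state ρ:

```python
import numpy as np

def rho1(a):
    """State family of Eq. (2)."""
    psi = np.array([np.sqrt(1 - a**2), 0.0, a, 0.0])
    return np.outer(psi, psi)

rng = np.random.default_rng(1)
a_samples = rng.uniform(0.0, 1.0, size=20000)   # illustrative choice of P(a)

Pi = np.diag([0.2, 0.5, 0.1, 0.2])              # stand-in for one POVM element

lhs = np.mean([np.trace(Pi @ rho1(a)).real for a in a_samples])
rho_bar = np.mean([rho1(a) for a in a_samples], axis=0)
rhs = np.trace(Pi @ rho_bar).real
assert np.isclose(lhs, rhs)                     # equal by linearity of the trace
```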

4 Numerical results

In this work, we aim to train a universal discriminator to discriminate different families of quantum data. Here, we present the results of training the universal discriminator on the different distributions summarized in Table 1, using a simulated quantum computer: the training is done by simulating the evolution of the quantum system under the parametrized circuits on a classical computer. To balance eliminating the error rate (Perr) against minimizing the inconclusive rate (Pinc), we use the following training strategy. We first prioritize a small inconclusive rate by starting with a zero penalty for erroneous outcomes (αinc > αerr = 0), and then increase αerr in a step-wise manner until a certain target error rate is achieved (a sketch of this schedule is given below). Similar optimization procedures have been used in the context of variational auto-encoders both in classical machine learning (Sønderby et al. 2016) and in quantum machine learning applications (Rocchetto et al. 2018). Using this scheme, we train our circuit to unambiguously discriminate the two families of quantum states and observe convergence toward the theoretical success rates of Section 3 as the amount of training data increases. Notably, we do not observe any signs of overfitting for any of the training dataset sizes considered (Fig. 1a).
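
A sketch of this step-wise penalty schedule could look as follows; `train_epochs` and `validation_error_rate` are placeholders for the training loop and validation evaluation of Section 2 (not paper-defined functions), and the numerical values are illustrative:

```python
def train_with_schedule(theta, train_epochs, validation_error_rate,
                        target_err=0.01, alpha_inc=1.0,
                        alpha_err_step=0.5, max_rounds=20):
    """Start with alpha_err = 0 (alpha_inc > 0) and raise alpha_err step-wise
    until the validation error rate falls below `target_err`."""
    alpha_err = 0.0
    for _ in range(max_rounds):
        theta = train_epochs(theta, alpha_err, alpha_inc)
        if validation_error_rate(theta) < target_err:
            break
        alpha_err += alpha_err_step    # tighten the error penalty
    return theta
```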

Fig. 1

Unambiguous classification of non-orthogonal quantum data sampled from different probability distributions. The data are averaged over 10 repeated trials starting from random initializations, and the bars indicate the standard deviations. The training, validation, and test datasets are sampled from the corresponding distributions. The dataset size indicates the size of the training and the validation datasets. The test dataset is fixed and contains $10^4$ samples for each family and each distribution

Trade-off between the error rate and the inconclusive rate

Here, we show that our model is able to obtain a much higher success rate (Psuc) than in the previous results if we allow a slightly higher error rate. This hints at a trade-off between the error rate (Perr) and the inconclusive rate (Pinc) which can be exploited in real-world applications.

Specifically, for the dataset "Case 4" in Table 1, we fix the two penalties, αerr and αinc, during the training and observe a gradual transition from unambiguous-like classification (characterized by a near-zero error probability) to minimal-error-like classification (characterized by a near-zero inconclusive rate) when we use varying penalties (Fig. 2a–c) across the different trainings with random initializations. Allowing a small error rate then results in a much higher success rate, which has not been predicted theoretically. We note that introducing the penalty terms αerr and αinc also makes the training process more stable (Fig. 2a). Therefore, the hyperparameters αerr and αinc act as a form of regularization and can be adjusted to give a higher success probability or a lower inconclusive rate for the final model (Fig. 2).

Fig. 2

With different penalties, we observe a trade-off between the error rate and the inconclusive rate. Compared with the point αerr = αinc = 0 (bottom left corner), the added penalties improve the success probability or the inconclusive rate, respectively. a–c The gradual transition from unambiguous classification (near-zero error rate, top left corner) to minimal-error classification (near-zero inconclusive rate, bottom right corner) with changes in the error penalty αerr and the inconclusiveness penalty αinc. We observe a gain in the success rate of around 0.32 at a sacrifice of only 0.1 in the error rate. We let a ∈ [0, 1] and average over 50 repeated trials with random initializations. d Standard deviation of Psuc. The result becomes increasingly unstable, with a larger standard deviation, when the two penalties (αerr and αinc) are closer in value (closer to the diagonal line). The standard deviations of Perr and Pinc show the same pattern as for Psuc (not shown)

Furthermore, similar trade-off effects are exhibited in all datasets listed in Table 1. If we stop the training once the error rate drops below 0.01, we achieve a much higher success rate than in the theoretical case of exactly zero error rate (Fig. 3).

Fig. 3

Unambiguous discrimination of data sampled from different probability distributions with a higher success rate. a Trained quantum circuits are capable of classifying quantum data sampled from a variety of different mixed probability distributions for ρ1(a) and ρ2(b). The classification is done in an unambiguous manner (with error rate < 0.01). b For comparison, we include the theoretical results from Table 2

5 Learning convergence from ensemble measurements

We additionally perform experiments in which we estimate the probabilities from repeated measurements on the (simulated) quantum device. We find that the noise in the gradient calculation caused by these estimated probabilities can be effectively countered by increasing the number of repeated measurements, using a lower error rate, and adjusting the step size in the forward difference formula. The detailed discussion is available in the Appendix. Therefore, our scheme appears feasible to run on error-corrected quantum devices. We leave the effects of machine noise (the noise caused by imperfect quantum devices) and an actual hardware implementation as future work.
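
To illustrate the role of the number of repeated measurements, the sketch below (our own illustration with made-up outcome probabilities) estimates the four outcome probabilities from a finite number of measurement shots; the statistical error shrinks roughly as 1/√shots, which is the noise entering the finite-difference gradients:

```python
import numpy as np

def estimate_probs(p_exact, shots, rng):
    """Estimate outcome probabilities from `shots` repeated measurements."""
    counts = rng.multinomial(shots, p_exact)
    return counts / shots

rng = np.random.default_rng(0)
p_exact = np.array([0.55, 0.25, 0.15, 0.05])   # illustrative exact probabilities
for shots in (100, 10_000, 1_000_000):
    errs = [np.abs(estimate_probs(p_exact, shots, rng) - p_exact).max()
            for _ in range(50)]
    print(shots, np.mean(errs))                # error decreases ~ 1/sqrt(shots)
```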

6 Conclusions

We have developed a quantum circuit learning approach for the classification of quantum data. Specifically, we have designed a heuristically motivated loss function and used the stochastic optimization algorithm Adam in a quantum-classical hybrid scheme to train a circuit to perform quantum state discrimination. The trained circuit generalizes well to discrimination tasks on new data, i.e., states from the parameter range that were not seen during the training process. This distinguishes our work from previous results on quantum circuit learning, in particular the very recent study by Fanizza et al. (2018), which only optimizes circuits for specific inputs. This prior work hence does not consider generalization ability and therefore does not treat the actual learning problem, which aims at optimization as well as generalization.

In our work, we observe a trade-off between the error rate and the inconclusive rate when we penalize them differently in the loss function. Although our experiments are performed on simulated quantum computers, where exact measurement probabilities are available, we show that this optimization can also be performed experimentally with repeated measurements of the quantum states. We note that recent quantum methods for estimating the analytical gradient via variations in the unitaries (Mitarai et al. 2018) can be directly applied to training our circuits; therefore, the optimization can be performed efficiently on near-term quantum devices. Also, although the Adam optimization algorithm is shown to be sufficient for the experiments conducted in this paper, several optimization algorithms specific to variational hybrid quantum-classical algorithms have been proposed and may provide improvements in more complicated cases (see, for example, Kübler et al. (2019)).
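
As a sketch of how such an analytical gradient could be plugged into the training loop, the parameter-shift rule takes the following form, assuming each parameter enters through a gate of the form exp(−iθP/2) with a Pauli-type generator P, and that `expectation(theta)` returns the measured cost term (the names are ours):

```python
import numpy as np

def parameter_shift_grad(expectation, theta, shift=np.pi / 2):
    """Analytic gradient via the parameter-shift rule (Mitarai et al. 2018),
    valid when each parameter theta_k generates a gate exp(-i*theta_k*P/2)."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[k] += shift
        minus[k] -= shift
        grad[k] = 0.5 * (expectation(plus) - expectation(minus))
    return grad
```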

In this work, we have not addressed the issue of scalability of classifying quantum states. However, we expect most kinds of quantum data of interest to require only polynomial-depth circuits for their classification. For example, it is likely that an ansatz based on the idea of tensor networks (e.g., Grant et al. (2018) and Cong et al. (2019)) can classify the different phases of ground states of quantum many-body systems in polynomial depth. Also, a scheme in which one systematically increases the depth of the ansatz circuit would help explore the circuit depth required for classifying quantum data. A similar idea has been explored in the context of the variational quantum eigensolver (Ostaszewski et al. 2019).

We believe that with progress on technologies for the preservation and transportation of quantum states, we will see many applications of the trained discriminative quantum circuits introduced here. Quantum state discrimination by itself plays a key role in quantum information processing protocols and is used in quantum cryptography (Bennett 1992a), quantum cloning (Duan and Guo 1998), quantum state separation, and entanglement concentration (Chefles 2000). Our work can provide improvements in these traditional areas by producing a classifier that is resilient to the statistical noise found in actual communication. For example, we can consider an improved version of the B92 quantum key distribution protocol (Bennett 1992b) by including the noise-induced randomness in its two quantum keys and classifying them with our discriminative circuit. Furthermore, we can consider training a discriminative quantum circuit to be used in the construction of quantum repeaters and state purification units within quantum communication networks. The training can take quantum data with noise specific to the communication network and thereby produce a discriminator that can recognize and filter that noise to provide better performance. Our discriminator can also be used to verify the output of other generative models, such as the quantum version of Boltzmann machines (Amin et al. 2018) or generative artificial neural networks (Goodfellow et al. 2014; Lloyd and Weedbrook 2018).