1 Introduction

Viewed from a cognitive perspective, an individual who has learned from a substantial set of unlabeled samples (the U set) and gained a comprehensive understanding of global class characteristics can effectively identify samples whose patterns resemble the provided labeled positive data, even when no labeled negative samples are available for reference. In this context, the set of labeled positive samples (the P set) serves as a guiding reference rather than a strict supervisor.

The learning problems that involve only the P set and the U set for training are commonly known as PU learning problems. Solving the PU problem is of great importance due to its widespread occurrence in practical applications such as web data mining, product recommendation and medical diagnosis, among others. However, traditional classification methods are ill-suited for handling PU problems because they typically assume the presence of explicit negative samples. To address this challenge, two main learning strategies have been proposed for PU problems: the likely-negative-examples-based strategy and the class-prior-based strategy. These strategies do not consistently adhere to the roles of the P set and U set outlined in the cognitive process mentioned earlier, but rather treat them equally. They typically examine the statistical divergence between the two datasets in a preprocessing step to obtain additional information such as class priors or likely negative samples, and then utilize the obtained knowledge to compensate for the absence of the set of labeled negative samples (the N set) in the second training step [1,2,3,4,5,6,7]. However, when the labeling mechanism is unclear or there is a limited number of positively labeled samples, overreliance on the statistical information from the P set can lead to inaccurate prior estimation and render the model sensitive to even minor changes in the samples [2]. In fact, in comparison to the P set, the statistical information provided by the U set tends to be more stable and comprehensive.

In contrast to the traditional methods that employ a two-step learning strategy, our approach utilizes a direct discriminant strategy for PU data, circumventing the possible issues caused by the initial estimation of, e.g., class priors. Here, we treat the U set as the primary source of global clustering information [8], while the P set acts as a guiding reference. However, in the absence of additional constraints imposed by prior knowledge, a direct learning strategy may encounter the issue of conflicting objectives between acquiring clustering information and obtaining label information. In fact, the strong coupling relationship between the bits in the label, for example, (1, 0) of the positive training samples, may be captured as a form of knowledge, which could be repeatedly reinforced during training and potentially overwrite the clustering information obtained from the U set, leading to a deviation in the classification plane and significant misjudgments.

The key to addressing this issue lies in ensuring independent representations for the positive and negative classes, respectively, thereby preventing the training on the P set from eroding the training on the U set. Simultaneously, it is also crucial to maintain a certain degree of separation between the representations of the two classes to satisfy the clustering requirement for differentiating samples from distinct classes. In fact, it has been theoretically demonstrated that by minimizing the overlap between the positive and negative classes, particularly by enhancing their mutual exclusivity when the representations are directly considered as the bits in labels (see Eq. (4)), one can obtain a reliable estimate of the negative class conditional probability density function (PDF) [9].

However, setting the mutual exclusivity of learning results, i.e., the bits in the estimated labels, as one of the learning goals will compromise the independence between the representations of different classes due to the fundamental fact that mutual exclusion generally implies correlation. Classical learning methods, including component analysis methods such as independent component analysis (ICA) [10] and linear discriminant analysis (LDA) [11], are not suitable for the learning setup in this paper. Fortunately, quantum theory provides concepts and principles like density operator, entanglement, and measurement to concurrently model multiple potential states of positive and negative classes to address this issue. In fact, for a two-qubit system in a product state, there is neither quantum correlation nor classical correlation between the qubits [12], yet their measurement results as output of learning method can be mutually exclusive.

Utilizing neural networks such as fully connected neural networks and convolutional neural networks as backends, we establish a mapping from the samples in both the P and U sets to quantum systems comprising two qubits: the positive qubit and the negative qubit. By setting the quantum product state as the learning objective and using fidelity as the overlap measure between the positive and negative classes, we propose a direct learning strategy and develop a quantum-inspired PU learning method, named qPU (see Fig. 1). Compared to existing methods, qPU eliminates the need for the traditional two-step strategy in PU learning; additionally, qPU employs neural networks as backends, offering the advantages of easy implementation and a theoretical training time equivalent to that of general neural networks. The experimental results on various datasets validate the superiority of qPU. Furthermore, we find that entanglement can be employed as an effective measure of the separability between the positive and negative classes, and that reducing the entanglement between the positive and negative qubits can enhance this separability.

The rest of this paper is organized as follows. Section 2 provides a comprehensive discussion on related work in the fields of PU learning and quantum machine learning. Section 3 reviews the necessary background knowledge on PU learning and basic quantum theory for better understanding of this paper. Section 4 presents the details of the proposed qPU method, including its theoretical motivation, model framework, and loss function. Section 5 verifies the superiority of qPU through experiments and explores the relationship between entanglement and separability. Finally, Sect. 6 summarizes the findings and contributions of the research presented in this paper, and also discusses potential directions for future research and improvements.

Fig. 1

An illustration of the proposed qPU. Classical bits are replaced by quantum bits, and the mapping from samples to quantum states is denoted by \(\mathcal {T}\), where u and v represent two quantum bits, and R represents the relationship between the two quantum bits

2 Related Work

PU methods have numerous practical applications, such as radar false-target recognition [13], disease recognition [14], product recommendation [15] and link prediction in the biological domain [16]. In these applications, false targets, recommended products and diseases, among others, can be considered labeled positive instances, while labeled negative instances are often missing owing to a lack of interest in them. Due to the scarcity of labeled positive instances, applications relying on positive class priors may be affected by the accuracy of prior estimation, which consequently impacts their practical effectiveness.

PU learning models can be categorized into two groups: the first group, including Self-training Expectation-Maximization (S-EM) [17], Positive-Example and Positive-Unlabeled Classification (PE-PUC) [18], and Positive-Example-Based Learning (PEBL) [19], adopts a two-step strategy. In this strategy, a reliable set of negative samples is initially determined from the unlabeled data, and then (semi-)supervised learning methods are applied in the second step. Recently, graphs have been utilized to measure the similarity of samples [15, 20], aiming to obtain reliable sets of positive and negative samples. For instance, Luo et al. introduced Positive-Unlabeled Learning via Neural Selection (PULNS) [21], which leveraged reinforcement learning to obtain effective negative-sample selectors. The common drawback of methods in this category lies in the challenge of determining the appropriate size of the extracted negative set. This can potentially lead to overfitting or underfitting of the final classifier, especially when there is a significant overlap between the classes.

The second group of methods typically assumes that class priors are known and focuses on designing loss functions that fully utilize this prior information. A milestone method of this type is the unbiased risk estimator for the PU learning problem, called Unbiased PU (uPU) [3], which employed a non-convex loss function. Later, it was discovered that convex surrogate loss functions [4] can reduce computational costs with similar accuracy. To address the problem that the empirical risk on training data may become negative, Kiryo et al. proposed a PU method with a non-negative risk estimator, called non-negative PU (nnPU) [5], which has a relatively strong theoretical foundation. In addition, Chen et al. proposed Self-PU, which conducts self-supervised learning on top of nnPU through auxiliary tasks [22]; Su extended nnPU to imbalanced data by oversampling positive samples to construct a balanced dataset when the set of positive samples is small and amplifying the weight of the minority class [2]; Hsieh proposed Positive-unlabeled binary Classification using Neural networks (PUbN) [23], which uses an nnPU preprocessing model to identify the impact of the negative class and then combines positive risk, negative risk, and unlabeled risk to learn the final classifier; Zhao proposed Distributional Positive-Unlabeled Learning (distPU), which utilizes label consistency between the predicted and ground-truth label distributions [1]. The above methods require known class priors, and thus the estimation of class priors has also garnered extensive attention [24,25,26,27]. There are also PU learning methods that do not need to estimate class priors. For example, Variational PU (vPU) [28] proposed a variational principle for PU learning with mixup regularization and learned a classifier with minimum KL divergence.

In recent years, the intersection of quantum technology and machine learning has sparked widespread interest among researchers. In the early stages, the main focus was on providing quantum acceleration for machine learning methods to address challenges related to large data volumes and slow training processes. Notable examples of such approaches include Shor’s algorithm [29], Grover’s algorithm [29], the Harrow–Hassidim–Lloyd (HHL) algorithm [30], Quantum Principal Component Analysis [31], Quantum Linear Discriminant Analysis [32], and the Quantum Support Vector Machine [33]. More recently, the realms of quantum many-body physics and deep learning have begun to intertwine. For instance, Gao et al. rigorously demonstrated that the Restricted Boltzmann Machine can represent a wide range of quantum many-body states [34]. Stoudenmire [35] and Liu [36] developed classification models based on matrix product states and the multi-scale entanglement renormalization ansatz, respectively, and explored quantum characteristics of the models such as quantum entanglement and fidelity. Wang [37] introduced the quantum density matrix into the Recurrent Neural Network model, showcasing further integration of quantum effects into machine learning.

3 Preliminaries

3.1 PU Learning and the Learning Strategy by Minimizing Overlap

PU learning. Let \(\mathcal {Y}=\{+1,-1\}\ \) be the set of possible labels, and \(\mathcal {X}=\{{{x}_{1}},{{x}_{2}},...,{{x}_{n}}\}\ \) be the collective set of training samples. Without loss of generality, we suppose that only \( {{n}_{L}}\) samples in \(\mathcal {X}\ \) are labeled with the positive label +1, while the remaining \({{n}_{U}}\) samples are unlabeled. Let P be the set of labeled positive samples, and U be the set of unlabeled samples. The goal of PU learning is to distinguish the negative samples from the positive samples in the U set or in the testing set.

Learning strategy by minimizing overlap. A PU learning method can be obtained by minimizing the overlap between the positive and negative classes when the positive class conditional PDF and the mixture PDF of both classes are given [9]. Specifically, using Bhattacharyya coefficient \(B\left( \theta \right) =\int _{{{\mathbb {R}}^{d}}}{\sqrt{p(x|y=1;\theta )p(x|y=-1;\theta )}}\textrm{d}x \) as the overlap measure between positive and negative conditional PDFs, the learning strategy can be formalized as \({\min }_{\theta }\, \left\{ B\left( \theta \right) \right\} \), where \(\theta \) is the vector of model parameters. However, estimating the PDFs from PU data remains a challenging problem.
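As a concrete illustration of the Bhattacharyya coefficient as an overlap measure, the NumPy sketch below (function names are ours) compares its closed form for two 1-D Gaussians against direct numerical integration of \(\int \sqrt{p_1(x)p_2(x)}\,\textrm{d}x\); the two agree to high precision, and the coefficient equals 1 only when the two class-conditional PDFs coincide.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, s1, mu2, s2):
    """Closed-form Bhattacharyya coefficient between two 1-D Gaussians."""
    d = 0.25 * (mu1 - mu2) ** 2 / (s1 ** 2 + s2 ** 2) \
        + 0.5 * np.log((s1 ** 2 + s2 ** 2) / (2.0 * s1 * s2))
    return np.exp(-d)  # B = exp(-Bhattacharyya distance)

def bhattacharyya_numeric(mu1, s1, mu2, s2):
    """Directly integrate sqrt(p1(x) * p2(x)) on a fine grid."""
    x, dx = np.linspace(-20.0, 20.0, 400001, retstep=True)
    p1 = np.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
    p2 = np.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))
    return np.sqrt(p1 * p2).sum() * dx

closed = bhattacharyya_gaussian(0.0, 1.0, 3.0, 2.0)
numeric = bhattacharyya_numeric(0.0, 1.0, 3.0, 2.0)
```

Minimizing this quantity over model parameters is exactly the strategy \({\min }_{\theta }\,\{B(\theta )\}\) above; the difficulty in the PU setting is that the two class-conditional PDFs are not directly available.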

3.2 Mathematical Formalization of Quantum States

Quantum theory is based on the complex Hilbert space [29]. Quantum pure states are vectors in this space, which can alternatively be represented in matrix form or as Bloch representations. A product state specifically refers to a composite quantum state whose constituent subsystems are statistically independent.

Pure states. For a single-qubit system, when given a set of basis states \(|{0}\rangle =\left( \begin{array}{c} 1\\ 0\\ \end{array} \right) \) and \(|{1}\rangle =\left( \begin{array}{c} 0\\ 1\\ \end{array} \right) \), a pure state \(|{\psi }\rangle \) can be represented as \(|{\psi }\rangle =\alpha |{0}\rangle +\beta |{1}\rangle \), where \(\alpha \) and \(\beta \) are complex numbers satisfying \({{\left| \alpha \right| }^{2}}+{{\left| \beta \right| }^{2}}\equiv 1 \). With respect to this basis, we express \(|{\psi }\rangle \) as the column vector \(|{\psi }\rangle ={{\left( \alpha ,\beta \right) }^{T}}\), and denote its conjugate transpose as \(\langle {\psi }|\). Accordingly, when provided with a set of basis states \(|{00}\rangle ,|{01}\rangle ,|{10}\rangle \), and \(|{11}\rangle \), the state \(|{\psi }\rangle \) in a two-qubit composite system can be expressed as \(|{\psi }\rangle ={{\alpha }_{1}}|{00}\rangle +{{\alpha }_{2}}|{01}\rangle +{{\alpha }_{3}}|{10}\rangle +{{\alpha }_{4}}|{11}\rangle \), where \({{\alpha }_{i}}\,(i=1,2,3,4)\) are complex numbers subject to \(\sum \nolimits _{i=1}^{4}{{{\left| {{\alpha }_{i}} \right| }^{2}}}\equiv 1\), and \(|{ij}\rangle =|{i}\rangle \otimes |{j}\rangle \) with \(\otimes \) representing the tensor product.
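For readers less familiar with the formalism, these definitions translate directly into NumPy: states are complex vectors, and composite basis states are built with the Kronecker product (all names below are ours).

```python
import numpy as np

ket0 = np.array([1.0, 0.0], dtype=complex)
ket1 = np.array([0.0, 1.0], dtype=complex)

# Single-qubit pure state |psi> = alpha|0> + beta|1>, |alpha|^2 + |beta|^2 = 1.
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)
psi = alpha * ket0 + beta * ket1

# Two-qubit basis state |01> = |0> (x) |1> via the Kronecker product.
ket01 = np.kron(ket0, ket1)

# A normalized two-qubit state over the basis |00>, |01>, |10>, |11>.
coeffs = np.array([0.5, 0.5j, -0.5, 0.5j])
basis = [np.kron(a, b) for a in (ket0, ket1) for b in (ket0, ket1)]
phi = sum(c * b for c, b in zip(coeffs, basis))
```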

Density operator. The density operator, typically denoted by the symbol \(\rho \), is a matrix representation of quantum state. For a pure state, its density operator can be obtained by the outer product of its own state vector, namely, \(\rho = |{\psi }\rangle \langle {\psi }|\).

Bloch representation. Given the Pauli basis \(\{{{\sigma }_{i}}\,|\,i=0,1,2,3\}=\{I,{{\sigma }_{x}},{{\sigma }_{y}},{{\sigma }_{z}}\}\), the density operator \(\rho \) of any single qubit can be uniquely expressed as \(\rho =\frac{1}{2}\left( {{\sigma }_{0}}+\sum \nolimits _{i=1}^{3}{{{u}_{i}}}{{\sigma }_{i}} \right) \), where \({{u}_{i}}=tr(\rho {{\sigma }_{i}})\) represents the measurement result obtained by observing \(\rho \) using the observable \(\sigma _i\). The vector \(u=\left( {{u}_{1}},{{u}_{2}},{{u}_{3}} \right) \) is known as the Bloch vector of \(\rho \). Similarly, the density operator \(\rho \) of any two-qubit state can be uniquely decomposed into a linear combination of Dirac matrices, i.e., tensor products of Pauli matrices: \(\rho =\frac{1}{4}\left( {{\sigma }_{0}}\otimes {{\sigma }_{0}}+\sum \nolimits _{i=1}^{3}{{{u}_{i}}}{{\sigma }_{i}}\otimes {{\sigma }_{0}}+\sum \nolimits _{j=1}^{3}{{{v}_{j}}}{{\sigma }_{0}}\otimes {{\sigma }_{j}} \right. \left. +\sum \nolimits _{i,j=1}^{3}{{{R}_{ij}}}{{\sigma }_{i}}\otimes {{\sigma }_{j}} \right) \), where \({{u}_{i}}=tr(\rho {{\sigma }_{i}}\otimes {{\sigma }_{0}})\), \({{v}_{j}}=tr(\rho {{\sigma }_{0}}\otimes {{\sigma }_{j}})\) and \({{R}_{ij}}=tr(\rho {{\sigma }_{i}}\otimes {{\sigma }_{j}})\) are the measurement results obtained by treating the Dirac matrices as observables. The Bloch representation of \(\rho \) is denoted as \((u, v, R)\), where \(u=\left( {{u}_{1}},{{u}_{2}},{{u}_{3}} \right) \) and \(v=\left( {{v}_{1}},{{v}_{2}},{{v}_{3}} \right) \) are the local Bloch vectors of the reduced states of the qubits, and \(R=({{R}_{ij}})\) is a 3 \(\times \) 3 matrix encoding the correlations.
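The measurement of a two-qubit density operator into its Bloch representation \((u, v, R)\) can be written directly from these trace formulas; the sketch below (names are ours) uses the pure state \(|01\rangle \), for which \(u=(0,0,1)\) and \(v=(0,0,-1)\).

```python
import numpy as np

# Pauli basis: sigma_0 = I, sigma_x, sigma_y, sigma_z.
s0 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = [sx, sy, sz]

def bloch_two_qubit(rho):
    """Measure a 4x4 density operator into its Bloch representation (u, v, R)."""
    u = np.array([np.trace(rho @ np.kron(s, s0)).real for s in paulis])
    v = np.array([np.trace(rho @ np.kron(s0, s)).real for s in paulis])
    R = np.array([[np.trace(rho @ np.kron(si, sj)).real for sj in paulis]
                  for si in paulis])
    return u, v, R

# Example: the pure product state |01>.
ket01 = np.zeros(4, dtype=complex)
ket01[1] = 1.0
rho = np.outer(ket01, ket01.conj())
u, v, R = bloch_two_qubit(rho)
```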

Product state. A product state \(|{\psi }\rangle \) refers to a quantum state that can be expressed as a tensor product of individual states \(|{\psi }\rangle _{1},|{\psi }\rangle _{2},...,|{\psi }\rangle _{n}\) of multiple subsystems, i.e., \(|{\psi }\rangle =|{\psi }\rangle _{1}\otimes |{\psi }\rangle _{2}\otimes \cdots \otimes |{\psi }\rangle _{n}\). Specifically, for the Bloch representation \((u, v, R)\) of a product state, we have \(R=u{{v}^{T}}\) [12]. Moreover, it is easy to verify that \(\left\| u \right\| _{2}^{2}+\left\| v \right\| _{2}^{2}+\left\| R \right\| _{F}^{2}=3\) holds for pure states, where \(\left\| \cdot \right\| _{F}\) denotes the Frobenius norm. Consequently, since \(\left\| u \right\| _{2}^{2}\le 1\) and \(\left\| v \right\| _{2}^{2}\le 1\), we can conclude that \(\left\| R \right\| _{F}^{2}\ge 1\), and that \(\left\| R \right\| _{F}^{2}\) reaches its minimum value of 1 if and only if the system is in a product state. Finally, it is important to highlight that one of the main learning objectives of our strategy is to map samples to product states of the individual qubits u and v. In this context, \((u, v)\), functioning akin to classical bits, serves as the quantum representation of samples for class labels.
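The product-state identities \(R=u{{v}^{T}}\), \(\left\| R \right\| _{F}^{2}=1\), and the pure-state norm constraint \(\left\| u \right\| _{2}^{2}+\left\| v \right\| _{2}^{2}+\left\| R \right\| _{F}^{2}=3\) can be checked numerically on random states; the helper names below are ours.

```python
import numpy as np

s0 = np.eye(2, dtype=complex)
paulis = [np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]], dtype=complex),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def bloch(rho):
    """Bloch representation (u, v, R) of a 4x4 density operator."""
    u = np.array([np.trace(rho @ np.kron(s, s0)).real for s in paulis])
    v = np.array([np.trace(rho @ np.kron(s0, s)).real for s in paulis])
    R = np.array([[np.trace(rho @ np.kron(a, b)).real for b in paulis]
                  for a in paulis])
    return u, v, R

rng = np.random.default_rng(0)

def random_pure(dim):
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return psi / np.linalg.norm(psi)

# Product state |a> (x) |b>: R = u v^T and ||R||_F^2 = 1.
prod = np.kron(random_pure(2), random_pure(2))
u, v, R = bloch(np.outer(prod, prod.conj()))

# Generic (usually entangled) pure state: still obeys the norm identity.
ent = random_pure(4)
ue, ve, Re = bloch(np.outer(ent, ent.conj()))
```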

4 Quantum-Inspired PU Learning Method

In this section, we present the theoretical motivation, model framework and loss function of the proposed qPU method.

4.1 Theoretical Motivation

As stated in the introduction, this paper aims to propose a quantum-inspired PU method in which, based on traditional neural networks, a mapping \(\mathcal {T}\) (see Fig. 2) is first trained to obtain the quantum representations \(\rho \) of samples x, i.e., \(\rho =\mathcal {T}\left( x \right) \). Then, after measuring \(\rho \) to obtain the Bloch representations \((u, v, R)\) of the samples, a classification method is established based on the Bloch representations, where \((u, v, R)\) is often abbreviated as \((u, v)\) if the system is in a product state.

In order to coordinate the training on the P and U sets, \(\mathcal {T}\) is required to output quantum product states. Owing to the independence between the positive and negative qubits in a product state, the clustering information conveyed from the samples to the quantum system is maximized, and learning on the P set, which mainly influences the positive qubit, will not compromise the acquisition of clustering information from the U set. As a result, it can be roughly assumed that the two-qubit system preserves sufficient information about the positive class and about the mixture of the positive and negative classes. In this case, minimizing the overlap between the positive and negative qubits is expected to differentiate the representations of samples according to their underlying classes, based on which a suitable discriminant rule can be specified. Concretely, \(\mathcal {T}\) is expected to possess the following properties:

Property 1

The mapping \(\mathcal {T}\) should be trained to encourage the evolution of the quantum state \(\rho \) towards a product state, which can be achieved by minimizing \(\left\| R \right\| _{F}^{2}\) as introduced in the preliminaries. The loss term associated with this objective is denoted as \({{I}_{P\cup U}}\), since it applies to the \(P\cup U\) set:

$$\begin{aligned} {{I}_{P\cup U}}(\mathcal {T})=\underset{x\in P\cup U}{\mathop {E}}\,\left( \sqrt{\left\| R \right\| _{F}^{2}} \right) =\underset{x\in P\cup U}{\mathop {E}}\,\left( \sqrt{\sum \limits _{i,j}{R_{ij}^{2}}} \right) . \end{aligned}$$
(1)

Property 2

Labeled positive samples should be mapped to similar quantum states to ensure a high level of intra-class cohesion. In this regard, we introduce \(B=(u, v)\) as the outcome of mapping data by \(\mathcal {T}\) and \(B^+=(u^+, v^+)\) as the learning objective for positive labeled samples, which can, to some extent, be considered a quantum analogue of positive label in classical scenarios. However, unlike the classical label, \(B^+\) might contain multiple values, where \({{u}^{+}}\) is fixed as (1, 0, 0), while \({v}^{+}\) is determined not by its label but rather by the v part in the quantum representations \(B=(u, v)\) of the samples:

$$\begin{aligned} v_{i}^{+}=\left\{ \begin{matrix} 1, &{} i=\arg \max \{{{v}_{k}}\,|\,k=1,2,3\} \\ 0, &{} \text {otherwise} \\ \end{matrix} \right. . \end{aligned}$$
(2)

The multi-value mechanism of \(v^+\) makes the training on the P set predominantly affect the positive qubit rather than the negative qubit, analogous to the effect of local unitary transformations on the 2-qubit quantum system. The loss term meeting this property is defined as the mean value of the cross-entropy between \({{B}^{+}}\) and B:

$$\begin{aligned} {{M}_{P}}(\mathcal {T})=\frac{1}{6}\underset{x\in P}{\mathop {E}}\,\left( \sum \limits _{i}{C(u_{i}^{+},u_{i})}+\sum \limits _{j}{C(v_{j}^{+},v_{j})} \right) . \end{aligned}$$
(3)

Correspondingly, \({{B}^{-}}\) can be introduced for the negative samples. However, due to the absence of negative training samples, \({{B}^{-}}\) is not explicitly used in this paper.
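Eqs. (2) and (3) can be sketched compactly in NumPy. The sketch below assumes the Bloch components have already been squashed into (0, 1) so that cross-entropy is well defined (as discussed later, the model does not distinguish signs, and value ranges can be adapted); all function names are ours.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Elementwise binary cross-entropy; predictions clipped away from 0 and 1."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def m_p_loss(u_batch, v_batch):
    """Eq. (3): mean cross-entropy between B+ = (u+, v+) and B = (u, v).

    u+ is fixed at (1, 0, 0); v+ is a one-hot vector at argmax of v (Eq. (2)),
    so training on P mostly constrains the positive qubit."""
    n = len(u_batch)
    u_plus = np.tile([1.0, 0.0, 0.0], (n, 1))
    v_plus = np.zeros_like(v_batch)
    v_plus[np.arange(n), np.argmax(v_batch, axis=1)] = 1.0
    per_sample = (cross_entropy(u_plus, u_batch).sum(axis=1)
                  + cross_entropy(v_plus, v_batch).sum(axis=1))
    return per_sample.mean() / 6.0

u = np.array([[0.9, 0.1, 0.1], [0.8, 0.2, 0.1]])
v = np.array([[0.2, 0.7, 0.1], [0.1, 0.1, 0.8]])
loss = m_p_loss(u, v)
```

Note that the target \(v^{+}\) tracks the sample's own \(v\), so a labeled positive sample is never forced toward a fixed negative-qubit value.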

Property 3

The mapping \(\mathcal {T}\) should be trained to minimize the overlap between classes to differentiate the representations of samples, where the overlap is measured by the Bhattacharyya coefficient [9]. Specifically, by

$$\begin{aligned} B\left( \mathcal {T} \right)&=\int _{{{\mathbb {R}}^{d}}}{\sqrt{p(x|y=1;\mathcal {T})p(x|y=-1;\mathcal {T})}}\,\textrm{d}x \\&\propto \int _{{{\mathbb {R}}^{d}}}{\sqrt{p(x,y=1;\mathcal {T})p(x,y=-1;\mathcal {T})}}\,\textrm{d}x \\&=\int _{{{\mathbb {R}}^{d}}}{\sqrt{p(y=1|x;\mathcal {T})p(y=-1|x;\mathcal {T})}}\,p(x)\,\textrm{d}x \\&=\underset{x}{\mathop {E}}\,\left( \sqrt{p(y=1|x;\mathcal {T})p(y=-1|x;\mathcal {T})} \right) , \end{aligned}$$
(4)

the Bhattacharyya coefficient in the data space can be converted to one in the label space. As fidelity is a natural extension of the Bhattacharyya coefficient to the quantum setting, using it as the overlap measure, the corresponding loss term is defined by \({{F}_{P\cup U}}\left( \mathcal {T} \right) =\underset{x}{\mathop {E}}\,\left( F(u,v\,|\,x;\mathcal {T}) \right) \), where \(y=1\) and \(y=-1\) are substituted with the positive and negative qubits, respectively, and \(F(u,v\,|\,x;\mathcal {T})\) denotes the fidelity between the qubits.

Nonetheless, fidelity is typically defined in terms of density operators rather than the Bloch representation. Here, we derive a simplified computational method based on the Bloch representation. Given \({{\rho }_{1}}\) and \({{\rho }_{2}}\) as the density operators of the positive and negative qubits, respectively, and with \({{\lambda }_{1}}\) and \({{\lambda }_{2}}\) as the eigenvalues of \(\rho _{1}^{1/2}{{\rho }_{2}}\rho _{1}^{1/2}\), the fidelity between the two qubits [38] is given, and upper bounded, by

$$\begin{aligned} F({{\rho }_{1}},{{\rho }_{2}})&={{\left( Tr\sqrt{\rho _{1}^{1/2}{{\rho }_{2}}\rho _{1}^{1/2}} \right) }^{2}}={{\left( \sqrt{{{\lambda }_{1}}}+\sqrt{{{\lambda }_{2}}} \right) }^{2}} \\&={{\lambda }_{1}}+{{\lambda }_{2}}+2\sqrt{{{\lambda }_{1}}{{\lambda }_{2}}}=Tr\,{{\rho }_{1}}{{\rho }_{2}}+2\sqrt{\det {{\rho }_{1}}\det {{\rho }_{2}}} \\&\le Tr\,{{\rho }_{1}}{{\rho }_{2}}+\left| \det {{\rho }_{1}} \right| +\left| \det {{\rho }_{2}} \right| . \end{aligned}$$
(5)
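The chain of equalities in Eq. (5) can be verified numerically. The sketch below (plain NumPy; `sqrtm_psd`, `random_density` and `fidelity` are our own helper names) compares the eigenvalue form of the fidelity with the closed form \(Tr\,\rho _{1}\rho _{2}+2\sqrt{\det \rho _{1}\det \rho _{2}}\) for random single-qubit density operators, and checks the upper bound.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_density(dim=2):
    """Random Hermitian positive semidefinite matrix with unit trace."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real

def sqrtm_psd(m):
    """Matrix square root of a Hermitian PSD matrix via eigendecomposition."""
    w, q = np.linalg.eigh(m)
    return q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ q.conj().T

def fidelity(rho1, rho2):
    """F = (Tr sqrt(rho1^{1/2} rho2 rho1^{1/2}))^2 = (sum_i sqrt(lambda_i))^2."""
    s = sqrtm_psd(rho1)
    lam = np.linalg.eigvalsh(s @ rho2 @ s)
    return np.sum(np.sqrt(np.clip(lam, 0, None))) ** 2

rho1, rho2 = random_density(), random_density()
f = fidelity(rho1, rho2)
# Closed form and upper bound from Eq. (5):
d1, d2 = np.linalg.det(rho1).real, np.linalg.det(rho2).real
f_closed = np.trace(rho1 @ rho2).real + 2 * np.sqrt(max(d1 * d2, 0.0))
bound = np.trace(rho1 @ rho2).real + abs(d1) + abs(d2)
```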

Replacing the density operators with their Bloch representations, the first term on the right-hand side of the inequality can be rewritten as follows:

$$\begin{aligned} Tr\,{{\rho }_{1}}{{\rho }_{2}}&=Tr\left[ \frac{1}{4}\left( I+\sum \limits _{i}{{{u}_{i}}{{\sigma }_{i}}} \right) \left( I+\sum \limits _{i}{{{v}_{i}}{{\sigma }_{i}}} \right) \right] \\&=\frac{1}{4}Tr\left[ I+\sum \limits _{i}{{{u}_{i}}{{\sigma }_{i}}}+\sum \limits _{i}{{{v}_{i}}{{\sigma }_{i}}}+\sum \limits _{l,k}{{{u}_{l}}{{v}_{k}}\left( {{\delta }_{lk}}I+j{{\varepsilon }_{lkn}}{{\sigma }_{n}} \right) } \right] \\&=\frac{1}{2}\sum \limits _{l}{{{u}_{l}}{{v}_{l}}}+\frac{1}{2}, \end{aligned}$$
(6)

where \({{\delta }_{lk}}\) is the Kronecker delta and \({{\varepsilon }_{lkn}}\) is the Levi-Civita symbol, equal to \(+1\) when \((l,k,n)\) is an even permutation of (1, 2, 3), \(-1\) when it is an odd permutation, and 0 otherwise; the last line follows because \(Tr\,I=2\) and the Pauli matrices are traceless. For the remaining two terms, it holds that \(\det {{\rho }_{1}}=\frac{1}{4}\det \left( \begin{matrix} 1+{{u}_{3}} &{} {{u}_{1}}-j{{u}_{2}} \\ {{u}_{1}}+j{{u}_{2}} &{} 1-{{u}_{3}} \\ \end{matrix} \right) =\frac{1}{4}\left( 1-u_{1}^{2}-u_{2}^{2}-u_{3}^{2} \right) \), and \(\det {{\rho }_{2}}=\frac{1}{4}\left( 1-v_{1}^{2}-v_{2}^{2}-v_{3}^{2} \right) \). Therefore,

$$\begin{aligned} F(u,v)\le Tr\,{{\rho }_{1}}{{\rho }_{2}}+\left| \det {{\rho }_{1}} \right| +\left| \det {{\rho }_{2}} \right| =1+\frac{1}{4}\left( 2\sum \limits _{l}{{{u}_{l}}{{v}_{l}}}-\sum \limits _{i}{u_{i}^{2}}-\sum \limits _{i}{v_{i}^{2}} \right) . \end{aligned}$$
(7)

By taking the upper bound of \(F(u,v)\) and ignoring the constant term and the coefficient, \({{F}_{P\cup U}}\left( \mathcal {T} \right) \) is simplified as

$$\begin{aligned} {{F}_{P\cup U}}\left( \mathcal {T} \right) =\underset{x\in P\cup U}{\mathop {E}}\,\left( 2\sum \limits _{l}{{{u}_{l}}{{v}_{l}}}-\sum \limits _{i}{u_{i}^{2}}-\sum \limits _{i}{v_{i}^{2}} \right) =-\underset{x\in P\cup U}{\mathop {E}}\,\left( \left\| u-v \right\| _{2}^{2} \right) , \end{aligned}$$
(8)

so minimizing this term amounts to pushing the Bloch vectors of the positive and negative qubits apart.
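Since only the Kronecker-delta terms survive the trace, the trace term reduces to \(Tr\,\rho _{1}\rho _{2}=\frac{1}{2}\left( 1+\sum \nolimits _{l}u_{l}v_{l} \right) \). A minimal NumPy check of this value, with illustrative Bloch vectors (names are ours):

```python
import numpy as np

paulis = [np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]], dtype=complex),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def density_from_bloch(b):
    """rho = (I + sum_i b_i sigma_i) / 2 for a Bloch vector with ||b||_2 <= 1."""
    rho = np.eye(2, dtype=complex)
    for bi, s in zip(b, paulis):
        rho = rho + bi * s
    return rho / 2.0

u = np.array([0.3, -0.4, 0.5])
v = np.array([-0.1, 0.2, 0.6])
rho1, rho2 = density_from_bloch(u), density_from_bloch(v)

trace_term = np.trace(rho1 @ rho2).real  # value from the density operators
bloch_term = 0.5 * (1.0 + u @ v)         # value from the Bloch vectors
```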

Property 4

Just like other discriminant methods, the mapping \(\mathcal {T}\) should be trained to minimize the discriminant loss under a specified discriminant rule. According to the setting of \({{B}^{+}}\) and \({{B}^{-}}\), a simple discriminant rule can be specified as follows:

$$\begin{aligned} {{y}^{*}}(x)=\left\{ \begin{matrix} 1, &{} {{u}_{1}}\ge {{v}_{1}} \\ -1, &{} {{u}_{1}}<{{v}_{1}} \\ \end{matrix} \right. , \end{aligned}$$
(9)

where \({{y}^{*}}(x) \) is the predicted label of x.

Then the discriminant loss can be defined correspondingly as

$$\begin{aligned} {{D}_{P\cup U}}(\mathcal {T})=\underset{x\in P\cup U}{\mathop {E}}\,\left( C\left( \sigma \left( u_1-v_1 \right) ,y(x) \right) \right) , \end{aligned}$$
(10)

where \(y(x)=\left\{ \begin{matrix} 1, &{} x\in P \\ 0, &{} x\in U \\ \end{matrix} \right. \), and \(C(\cdot )\) is the cross-entropy. However, it is worth emphasizing that, in sharp contrast to the discriminant loss in other PU learning methods, this loss term is expected to play a minor role rather than serving as the primary learning objective. In this sense, we not only restrict its weight to a relatively small value but also slow down its learning speed. Specifically, in each iteration of the training process, only one random sample is taken from the \(P\cup U\) set, and the discriminant loss of this sample is used to replace the expectation in Eq. (10). Hence, the loss term is finally defined as

$$\begin{aligned} {{D}_{P\cup U}}(\mathcal {T})={{\left. C\left( \sigma \left( {{u}_{1}}-{{v}_{1}} \right) ,y(x) \right) \right| }_{x\sim P\cup U}}. \end{aligned}$$
(11)
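Together, Eqs. (9) and (11) amount to a logistic discriminant on the first Bloch components \(u_1\) and \(v_1\); a minimal sketch (function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminant_loss(u1, v1, is_positive):
    """Eq. (11): cross-entropy between sigma(u1 - v1) and the PU label y(x),
    evaluated on a single sample drawn from P (y=1) or U (y=0)."""
    p = np.clip(sigmoid(u1 - v1), 1e-12, 1 - 1e-12)
    y = 1.0 if is_positive else 0.0
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def predict(u1, v1):
    """Eq. (9): predict +1 when u1 >= v1, otherwise -1."""
    return 1 if u1 >= v1 else -1

loss_p = discriminant_loss(0.9, 0.1, True)   # positive sample, u1 > v1
loss_u = discriminant_loss(0.9, 0.1, False)  # unlabeled sample treated as y=0
```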

4.2 Model Framework and Loss Function

Based on the theoretical motivation, the qPU model framework is designed to include five components: input layer, network layer, conversion layer, measurement layer, and discriminant layer, as illustrated in Fig. 2.

Fig. 2

The model framework of qPU. After a sample is input, the quantum pure state \(|{\psi }\rangle \) is obtained through the network layer and the conversion layer, and its Bloch representation \((u, v, R)\) is obtained through the measurement layer. Finally, the predicted label of the sample is obtained through the discriminant layer

First, the training samples are fed into the input layer and subsequently processed by the network layer to extract their abstract features. Generally, networks with stronger expressive power are preferable. However, in order to analyze the strengths and weaknesses of the qPU method itself more effectively and to avoid potential advantages that may arise from complex network structures, this paper employs only simple network structures, such as fully connected neural networks and convolutional neural networks, as backends.

Following the network layer is the conversion layer, which uses the real numbers output by the network layer to form complex coefficients and construct the corresponding quantum pure state in a given basis. For instance, in the context of a 2-qubit composite system, given the basis \(|{00}\rangle ,|{01}\rangle ,|{10}\rangle \) and \(|{11}\rangle \), the eight real numbers \(({{r}_{1}},{{r}_{2}},{{r}_{3}},{{r}_{4}},{{m}_{1}},{{m}_{2}},{{m}_{3}},{{m}_{4}})\) output by the network layer uniquely determine four complex numbers \(({{\alpha }_{1}},{{\alpha }_{2}},{{\alpha }_{3}},{{\alpha }_{4}})\) via \({{\alpha }_{k}}={{r}_{k}}+i{{m}_{k}}\), \(k=1,2,3,4\), and thus a single corresponding quantum pure state \(|{\psi }\rangle ={{\alpha }_{1}}|{00}\rangle +{{\alpha }_{2}}|{01}\rangle +{{\alpha }_{3}}|{10}\rangle +{{\alpha }_{4}}|{11}\rangle \). Note that a valid quantum pure state requires normalized coefficients; hence the regularizer

$$\begin{aligned} reg(\mathcal {T})=\underset{x\in P\cup U}{\mathop {E}}\,\left( \left| {{\left| {{\alpha }_{1}} \right| }^{2}}+{{\left| {{\alpha }_{2}} \right| }^{2}}+{{\left| {{\alpha }_{3}} \right| }^{2}}+{{\left| {{\alpha }_{4}} \right| }^{2}}-1 \right| \right) \end{aligned}$$
(12)

is added to the final loss function.
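The conversion step and the regularizer of Eq. (12) are straightforward; the sketch below (function names are ours) builds the coefficients \({{\alpha }_{k}}={{r}_{k}}+i{{m}_{k}}\) from eight network outputs and evaluates the normalization penalty.

```python
import numpy as np

def to_coeffs(reals):
    """Combine eight network outputs (r1..r4, m1..m4) into the coefficients
    alpha_k = r_k + i*m_k of a two-qubit state in the |00>,|01>,|10>,|11> basis."""
    r, m = np.asarray(reals[:4], dtype=float), np.asarray(reals[4:], dtype=float)
    return r + 1j * m

def reg_term(alphas):
    """Eq. (12) for one sample: penalize deviation of sum_k |alpha_k|^2 from 1."""
    return abs(np.sum(np.abs(alphas) ** 2) - 1.0)

# An already-normalized output incurs no penalty ...
good = to_coeffs([0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
# ... while an unnormalized one is penalized by its deviation from 1.
bad = to_coeffs([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```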

The fourth component is the measurement layer, which measures the obtained quantum state \(|{\psi }\rangle \) to yield its Bloch representation \((u, v, R)\). On the one hand, R is related to the objective that preserves global clustering information, and the related calculations involve only the squares of its elements, independent of their signs. On the other hand, \((u, v)\) is related to the discriminant rule, which can typically be adapted to the range of output values without affecting the discriminant results. For example, if the value range [0, 1] is expanded to \([-1, 1]\), a threshold of 0.5 can be changed to a threshold of 0. For these two reasons, this paper does not distinguish between the signs of the elements in \((u, v, R)\): elements with the same absolute values are considered to represent the same quantum state even if their signs differ.

The final component is the discriminant layer, which predicts the labels of samples using the discriminant rule specified by Eq. (9). Finally, by combining the four loss terms presented in the previous subsection with the regularization term in Eq. (12), the loss function can be defined as

$$\begin{aligned} L(\mathcal {T})={{\gamma }_{1}}{{I}_{P\cup U}}(\mathcal {T})+{{\gamma }_{2}}{{M}_{P}}(\mathcal {T})+{{\gamma }_{3}}{{F}_{P\cup U}}(\mathcal {T})+{{\gamma }_{4}}{{D}_{P\cup U}}(\mathcal {T})+{{\gamma }_{5}}\,reg(\mathcal {T}), \end{aligned}$$
(13)

where \({{\gamma }_{i}}\,(1\le i\le 5)\) are constant coefficients. Notice that the five terms on the right-hand side of Eq. (13) can be roughly divided into two parts. The first part consists of \({{I}_{P\cup U}}(\mathcal {T})\), \({{F}_{P\cup U}}(\mathcal {T})\) and the regularization term \(reg(\mathcal {T})\); these are all unsupervised terms acting on the quantum representations of samples. The second part includes \({{M}_{P}}(\mathcal {T})\) and \({{D}_{P\cup U}}(\mathcal {T})\), reflecting the relationship between the samples and their labels. By assigning identical coefficients to the loss terms within each part, i.e., \({{\gamma }_{1}}={{\gamma }_{3}}={{\gamma }_{5}}\) and \({{\gamma }_{2}}={{\gamma }_{4}}\), the loss function can be further simplified as

$$\begin{aligned} L(\mathcal {T}) = {}&{{\gamma }_{1}}({{I}_{P\cup U}}(\mathcal {T})+{{F}_{P\cup U}}(\mathcal {T})+reg(\mathcal {T}))\nonumber \\ &+{{\gamma }_{2}}({{M}_{P}}(\mathcal {T})+{{D}_{P\cup U}}(\mathcal {T})). \end{aligned}$$
(14)
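As a minimal illustration of this two-group weighting, the combination in Eq. (14) can be sketched as below, assuming each loss term has already been evaluated to a scalar; the dictionary keys and the helper name `total_loss` are invented here for clarity and are not the paper's API:

```python
def total_loss(terms, gamma1=0.1, gamma2=0.5):
    """Combine the five loss terms as in Eq. (14).

    `terms` maps 'I', 'F', 'reg' (unsupervised group) and
    'M', 'D' (label-dependent group) to scalar loss values.
    The default coefficients follow the setting used in the experiments.
    """
    unsupervised = terms['I'] + terms['F'] + terms['reg']
    supervised = terms['M'] + terms['D']
    return gamma1 * unsupervised + gamma2 * supervised
```

For example, `total_loss({'I': 1.0, 'F': 2.0, 'reg': 3.0, 'M': 4.0, 'D': 6.0})` weights the unsupervised sum 6.0 by 0.1 and the supervised sum 10.0 by 0.5.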
Table 1 Summary of the datasets used and the corresponding model architectures

5 Experiment

In this section, we provide a detailed description of the experiments, including the datasets used, experimental settings, the methods compared, evaluation metrics and result analysis.

5.1 Datasets

Image datasets. The image datasets used in this paper include CIFAR-10 [39], MNIST [40] and F-MNIST [41]. These datasets originally consist of 10 classes. To ensure a fair cross-comparison, we followed an approach similar to nnPU [5] in constructing the positive and negative classes. Specifically, for MNIST and F-MNIST, digits ‘0’, ‘2’, ‘4’, ‘6’ and ‘8’ formed the positive class, while digits ‘1’, ‘3’, ‘5’, ‘7’ and ‘9’ formed the negative class. For CIFAR-10, the positive class consisted of the categories ‘airplane’, ‘automobile’, ‘ship’ and ‘truck’, while the negative class consisted of ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’ and ‘horse’.
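A minimal sketch of this class construction (the helper names are ours, and the CIFAR-10 class indices assume the standard label ordering in which 0 = airplane, 1 = automobile, 8 = ship and 9 = truck):

```python
import numpy as np

def binarize_mnist_labels(y):
    """Map MNIST/F-MNIST digit labels to +1 (even digits) / -1 (odd digits)."""
    y = np.asarray(y)
    return np.where(y % 2 == 0, 1, -1)

# Vehicle classes form the positive class (standard CIFAR-10 label indices).
CIFAR_POSITIVE = {0, 1, 8, 9}  # airplane, automobile, ship, truck

def binarize_cifar_labels(y):
    """Map CIFAR-10 labels to +1 (vehicles) / -1 (animals)."""
    return np.array([1 if c in CIFAR_POSITIVE else -1 for c in y])
```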

Benchmark datasets. In addition to the image datasets, the experiments were also conducted on 10 benchmark datasets obtained from the UCI machine learning repository [42]. For the multiclass datasets, the samples of two classes were selected from each dataset to form a new dataset, with the first class as the positive class and the second class as the negative class. For the datasets with only two categories, we assigned the category with fewer samples as the positive class and the other category as the negative class. Each new dataset was then split in an 8:2 ratio, with 80% of the samples used for training and the remaining 20% used for testing.

For each image dataset and benchmark dataset, 25% of the positive training samples were randomly selected to form the P set, while the remaining samples formed the U set. It is important to note that no special preprocessing techniques were applied to any of the datasets, except for the normalization of pixel values in the case of image datasets.
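The P/U split described above can be sketched as follows; `make_pu_split` is a hypothetical helper, since the paper does not specify its sampling routine:

```python
import numpy as np

def make_pu_split(X, y, labeled_frac=0.25, seed=0):
    """Randomly move `labeled_frac` of the positive training samples into
    the labeled P set; all remaining samples form the unlabeled U set."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    n_labeled = int(round(labeled_frac * len(pos_idx)))
    p_idx = rng.choice(pos_idx, size=n_labeled, replace=False)
    mask = np.zeros(len(y), dtype=bool)
    mask[p_idx] = True
    return X[mask], X[~mask]  # P set, U set
```

Note that the U set retains the unchosen positives alongside all negatives, matching the setting where 25% of the positive training samples are labeled.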

Table 2 Performance comparison of methods on the image datasets

5.2 Experimental Settings

qPU. This paper implements two specific qPU methods, whose backend network layers consist of convolutional neural networks and dense neural networks, respectively. Specifically, the model architecture for CIFAR-10 is set as follows: input-C(3*3,32)-C(3*3,64)-P(2*2,2)-D(128)-N-M-1, where the input is a 32*32 RGB image, C(3*3,32) denotes a convolutional layer with 32 channels of 3*3 kernels followed by ReLU activation, C(3*3,64) denotes a similar convolution but with 64 channels, P(2*2,2) represents a 2*2 max pooling operation with a stride of 2, and D(128) denotes a dense layer of 128 neurons followed by ReLU activation; the network layers are followed by the conversion layer N, the measurement layer M and the discriminant layer with 1 neuron, as described in the previous section on the model framework. For the remaining datasets, the model is structured as follows: input-D(300)-D(300)-D(300)-D(300)-N-M-1, where the convolutional/pooling layers of the previous model are replaced by dense layers D(300). Please refer to Table 1 for the correspondence between the datasets and the model architectures.
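As a quick sanity check on the layer sizes, the spatial dimensions implied by the CIFAR-10 architecture can be traced with the standard convolution output formula. This is a sketch assuming unpadded, stride-1 convolutions, a padding choice the text leaves open:

```python
def conv2d_out(n, k, stride=1, pad=0):
    """Output spatial size of a square convolution/pooling window."""
    return (n + 2 * pad - k) // stride + 1

# Trace input-C(3*3,32)-C(3*3,64)-P(2*2,2)-D(128) for a 32*32 input.
n = 32                   # 32*32 RGB input
n = conv2d_out(n, 3)     # C(3*3,32): 32 -> 30
n = conv2d_out(n, 3)     # C(3*3,64): 30 -> 28
n = conv2d_out(n, 2, 2)  # P(2*2,2):  28 -> 14
flat = n * n * 64        # features flattened before the D(128) dense layer
```

Under these assumptions, D(128) receives a 14*14*64 = 12544-dimensional feature vector; with 'same' padding the flattened size would differ.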

In subsequent experiments, unless otherwise specified, the two hyperparameters \(({{\gamma }_{1}},\text { }{{\gamma }_{2}})\) of qPU were fixed at (0.1, 0.5) (see Eq. (14)), the learning rate was set to 1e-3, and the models were trained with the Adam optimizer [43] for 1000 iterations.

Compared Methods. The performance of qPU was compared with that of uPU [3], nnPU [5], distPU [1] and vPU [28]. To achieve their best possible performance, the hyperparameters of the compared methods were set to the optimal values reported in the papers that introduced them. The exception was vPU on the MNIST dataset, whose two hyperparameters were fine-tuned through multiple experiments, yielding \(\alpha =0.3\) and \(\lambda =0.1\).

Moreover, since uPU, nnPU and distPU are class-prior-based methods, the true class prior was provided to them as an additional input. It is crucial to emphasize that providing the true class prior acts as a “God’s oracle” and renders the comparison between these methods and our approach unfair to ours. Nevertheless, comparing against these methods under such an idealized setting, which might not be practically achievable, can either highlight the strengths of our method or, conversely, reveal potential room for improvement.

5.3 Evaluation Metrics

For each PU learning method, we report five metrics on the training and test sets for a more comprehensive comparison, including G-mean (Gm), F1-score (F1), Precision (Pre), Accuracy (Acc), and Recall (Rec). Every experiment was repeated six times with random sampling, and the experimental results were represented as the mean and standard deviation of the metrics.
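For reference, the five reported metrics can be computed from confusion-matrix counts as below; this is a standard-definition sketch, and `pu_metrics` is our own helper name:

```python
import math

def pu_metrics(tp, fp, tn, fn):
    """Gm, F1, Pre, Acc and Rec from confusion-matrix counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)        # true positive rate
    tnr = tn / (tn + fp)        # true negative rate
    return {
        'Gm': math.sqrt(rec * tnr),          # geometric mean of TPR and TNR
        'F1': 2 * pre * rec / (pre + rec),   # harmonic mean of Pre and Rec
        'Pre': pre,
        'Acc': (tp + tn) / (tp + fp + tn + fn),
        'Rec': rec,
    }
```

G-mean balances performance on both classes, which is why the text treats it (with F1) as an overall performance metric.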

5.4 Experiment Result

5.4.1 Image Datasets

Fig. 3

The experimental results for the evaluation metrics (Gm, F1, Pre, Acc, and Rec) of all methods on CIFAR-10, MNIST, and F-MNIST

Fig. 4

The changes of accuracy and loss during training on CIFAR-10 and MNIST datasets. The first and the third column are accuracies on CIFAR-10 and MNIST, respectively, and the second and the fourth column are losses on CIFAR-10 and MNIST, respectively

We first analyzed the performance of all the methods on the image datasets. The average performance and standard deviation of each metric on the image datasets are reported in Table 2 and Fig. 3. As seen there, qPU provides the best performance in terms of G-mean, F1-score and accuracy on almost all datasets, the exception being CIFAR-10, where distPU performs better than qPU. Nonetheless, it is worth noting that the impressive performance of methods such as distPU relies on accurate class-prior values as additional input, and accurately estimating the class prior in practical applications remains a research challenge. For instance, when the class prior is estimated with the KM2 method [26], distPU based on the estimated value (referred to as distPU-) shows a significant performance decline compared to the variant using the true class prior, as reported in Table 2, likely due to estimation bias in the class prior. Besides qPU, vPU is the only other method that does not require the class prior; however, qPU outperforms vPU by more than 2 points in almost all metrics.

We also examined the changes in the loss function and accuracy of each method during training, which are reported in Fig. 4. From Fig. 4, we can see that all the methods tend to stabilize after approximately 250 epochs, exhibiting rapid learning. Furthermore, qPU, nnPU and distPU demonstrate consistent changes in accuracy on both the training and test sets, while the accuracy of uPU decreases slightly on the test set despite increasing on the training set, a typical overfitting phenomenon [5]. A possible reason is that the loss function of uPU can take negative values, which erroneously drives the optimization of the loss toward negative infinity.

Table 3 Performance comparison of methods on the benchmark datasets

5.4.2 Benchmark datasets

Similarly, we evaluated all the methods on the ten benchmark datasets, and the experimental results are summarized in Table 3. From Table 3, it is evident that qPU is the best or second-best learning method on every dataset, and even where qPU is not the best, the performance gap between qPU and the best method is almost negligible. To provide a comprehensive assessment of overall performance, we report the average metrics of each method across the ten datasets in Table 4 and Fig. 5. qPU outperforms the other methods in almost all metrics, particularly excelling in the overall performance metrics, G-mean and F1-score, where it surpasses the other methods by more than 3 points. This strongly indicates the superiority of qPU in the context of PU learning.

Fig. 5

The average performance across benchmark datasets of various methods based on comprehensive metrics

In addition, it is worth reiterating that nnPU, distPU and uPU require the class prior as input. In this study, instead of estimated class priors, the true class priors were provided for these methods, which benefits their performance [28].

Table 4 The average results of each method on the ten benchmark datasets

5.4.3 Hyperparameter Analysis

The hyperparameters \({{\gamma }_{1}}\) and \({{\gamma }_{2}}\) of qPU were fixed at 0.1 and 0.5 in the previous experiments. To further analyze their influence, additional experiments were conducted on CIFAR-10 and MNIST over a grid of hyperparameter values, with both \({{\gamma }_{1}}\) and \({{\gamma }_{2}}\) ranging over [0.1, 1] in steps of 0.1, resulting in a 10x10 grid. For each combination of \({{\gamma }_{1}}\) and \({{\gamma }_{2}}\) in the grid, six experiments were performed under the same setting as the experiments above. The average results are reported in Fig. 6.

Fig. 6

The influence of hyperparameters on CIFAR-10 and MNIST datasets

The experimental results show that the hyperparameters have a noticeable impact on the performance of qPU. For instance, on the MNIST dataset, when \({{\gamma }_{2}}\) is much smaller than \({{\gamma }_{1}}\), the learning performance of qPU degrades. However, a common trend is observed on both datasets: as the hyperparameters move from the bottom-right corner to the top-left corner of the grid, the performance of qPU stabilizes at a high level. This suggests that good performance can be obtained consistently when \({{\gamma }_{1}}<{{\gamma }_{2}}\). Based on this observation, we simply fixed \({{\gamma }_{1}}\) and \({{\gamma }_{2}}\) of qPU at 0.1 and 0.5, respectively.

5.4.4 Entanglement and Separability

Fig. 7

The changes of entanglement and separability during training

We experimentally tested whether the training process of qPU reduces the entanglement between the positive and negative qubits. To measure entanglement, we used the value of \(\left| \det (R) \right| \), since the eigenvalues of R directly reflect the global entanglement [12]. The experiments were conducted on CIFAR-10 and MNIST, and every experiment was repeated six times. The average results are presented in Fig. 7. As shown in the figure, on both datasets the entanglement decreases as the number of epochs increases, indicating that qPU encourages the reduction of entanglement during training.
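A small numpy sketch of this entanglement measure, using the standard Pauli-correlation definition of R: a product state such as \(|00\rangle \) gives \(\left| \det (R)\right| =0\), while a maximally entangled Bell state gives 1 (how qPU obtains its states is not reproduced here):

```python
import numpy as np

PAULIS = [np.array([[0, 1], [1, 0]], dtype=complex),
          np.array([[0, -1j], [1j, 0]], dtype=complex),
          np.array([[1, 0], [0, -1]], dtype=complex)]

def abs_det_R(psi):
    """|det(R)| for a two-qubit state vector, with R_ij = <sigma_i (x) sigma_j>."""
    psi = psi / np.linalg.norm(psi)
    R = np.array([[np.vdot(psi, np.kron(si, sj) @ psi).real
                   for sj in PAULIS] for si in PAULIS])
    return abs(np.linalg.det(R))

# Product state |00>: R = diag(0, 0, 1), so |det(R)| = 0.
# Bell state (|00> + |11>)/sqrt(2): R = diag(1, -1, 1), so |det(R)| = 1.
```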

Further, it is hypothesized that the reduction of entanglement is associated with the preservation of clustering information. To investigate this, we examined the relationship between entanglement and the separability of sample features. Treating the positive and negative qubits as two distinct views of the samples, in the spirit of Fisher discriminant analysis [11], the separability is quantified as the ratio of inter-qubit correlation to intra-qubit correlation, \(J=\frac{{{s}_{b}}}{{{s}_{w}}}\), where

$$\begin{aligned} {{s}_{w}}=\sum _{i\ne j}{\frac{\text {cov}(\widehat{u}_{i},\widehat{u}_{j})}{\sqrt{\text {cov}(\widehat{u}_{i},\widehat{u}_{i})\,\text {cov}(\widehat{u}_{j},\widehat{u}_{j})}}}+\sum _{i\ne j}{\frac{\text {cov}(\widehat{v}_{i},\widehat{v}_{j})}{\sqrt{\text {cov}(\widehat{v}_{i},\widehat{v}_{i})\,\text {cov}(\widehat{v}_{j},\widehat{v}_{j})}}},\qquad {{s}_{b}}=\sum _{i,j=1}^{3}{\frac{\text {cov}(\widehat{u}_{i},\widehat{v}_{j})}{\sqrt{\text {cov}(\widehat{u}_{i},\widehat{u}_{i})\,\text {cov}(\widehat{v}_{j},\widehat{v}_{j})}}}, \end{aligned}$$

\(\text {cov}(\cdot ,\cdot )\) denotes the sample covariance, \(\widehat{u}_{i}={{u}_{i}}-\overline{u}_{i}\) is the deviation of \({{u}_{i}}\) from its mean \(\overline{u}_{i}\), and \(\widehat{v}_{i}\) is defined analogously.
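Under the definitions above, J can be computed from per-sample Bloch components as follows. Note that the normalized covariances are exactly Pearson correlation coefficients, so `numpy.corrcoef` is used, and a small `eps` guards against division by zero, an implementation detail the text does not specify:

```python
import numpy as np

def separability(U, V, eps=1e-12):
    """J = s_b / s_w from n x 3 arrays of Bloch components.

    Rows are samples; columns are the three Bloch components of the
    positive-qubit (U) and negative-qubit (V) representations.
    """
    def corr(a, b):
        # Normalized covariance = Pearson correlation coefficient.
        return np.corrcoef(a, b)[0, 1]
    # Intra-qubit correlations (i != j) within each qubit's components.
    s_w = sum(corr(U[:, i], U[:, j]) for i in range(3) for j in range(3) if i != j)
    s_w += sum(corr(V[:, i], V[:, j]) for i in range(3) for j in range(3) if i != j)
    # Inter-qubit correlations over all component pairs.
    s_b = sum(corr(U[:, i], V[:, j]) for i in range(3) for j in range(3))
    return s_b / (s_w + eps)
```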

To facilitate comparison with the entanglement, the changes in separability, plotted against the right y-axis, are also included in Fig. 7. As observed, in the initial stages, as entanglement decreases rapidly, separability also declines rapidly; after a certain number of iterations, as the changes in entanglement become more gradual, the changes in separability slow down correspondingly. This consistent correlation between entanglement and separability implies that reducing entanglement helps retain the underlying clustering information in the training data.

6 Conclusion

This paper proposes a direct PU learning strategy. The strategy eliminates the initial estimation step and, by introducing a quantum formalization, reconciles the conflict between acquiring clustering information from unlabeled data and obtaining label information from positive data. By mapping samples into two-qubit composite systems and formulating an appropriate discriminant rule after measurement, our strategy yields a PU classifier that can be trained directly on PU data. Experimental results on 13 datasets demonstrate its superior performance, even when the true class priors are provided to the competing PU methods. Moreover, the implementation of our method is straightforward: from a technical perspective, it simply replaces classical bits with quantum bits in the output of traditional networks. Finally, this paper identifies that entanglement can serve as an effective measure of the separability between the positive and negative classes.

In many application scenarios, positively labeled samples are scarce, which makes estimating the prior probability of positive instances challenging. Because our strategy requires no such estimation, methods based on it are applicable to a broader range of scenarios than other PU methods. In future research, the focus will be on enhancing qPU by employing backend neural networks with greater expressive capability.