Semi-supervised AUC optimization based on positive-unlabeled learning

Sakai, Tomoya; Niu, Gang; Sugiyama, Masashi

doi:10.1007/s10994-017-5678-9

Semi-supervised AUC optimization based on positive-unlabeled learning

Published: 26 October 2017

Volume 107, pages 767–794, (2018)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

Semi-supervised AUC optimization based on positive-unlabeled learning

Download PDF

Tomoya Sakai^1,2,
Gang Niu^1,2 &
Masashi Sugiyama^1,2

3967 Accesses
33 Citations
16 Altmetric
Explore all metrics

A Correction to this article was published on 23 January 2018

This article has been updated

Abstract

Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced classification. So far, various supervised AUC optimization methods have been developed and they are also extended to semi-supervised scenarios to cope with small sample problems. However, existing semi-supervised AUC optimization methods rely on strong distributional assumptions, which are rarely satisfied in real-world problems. In this paper, we propose a novel semi-supervised AUC optimization method that does not require such restrictive assumptions. We first develop an AUC optimization method based only on positive and unlabeled data and then extend it to semi-supervised learning by combining it with a supervised AUC optimization method. We theoretically prove that, without the restrictive distributional assumptions, unlabeled data contribute to improving the generalization performance in PU and semi-supervised AUC optimization methods. Finally, we demonstrate the practical usefulness of the proposed methods through experiments.

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Article Open access 15 November 2019

Learning from positive and unlabeled data: a survey

Article 02 April 2020

1 Introduction

Maximizing the area under the receiver operating characteristic curve (AUC) (Hanley and McNeil 1982) is a standard approach to imbalanced classification (Cortes and Mohri 2004). While the misclassification rate relies on the sign of the score of a single sample, AUC is governed by the ranking of the scores of two samples. Based on this principle, various supervised methods for directly optimizing AUC have been developed so far and demonstrated to be useful (Herschtal and Raskutti 2004; Zhao et al. 2011; Rakhlin et al. 2012; Kotlowski et al. 2011; Ying et al. 2016).

However, collecting labeled samples is often expensive and laborious in practice. To mitigate this problem, semi-supervised AUC optimization methods have been developed that can utilize unlabeled samples (Amini et al. 2008; Fujino and Ueda 2016). These semi-supervised methods solely rely on the assumption that an unlabeled sample that is “similar” to a labeled sample shares the same label. However, such a restrictive distributional assumption (which is often referred to as the cluster or the entropy minimization principle) is rarely satisfied in practice and thus the practical usefulness of these semi-supervised methods is limited (Cozman et al. 2003; Sokolovska et al. 2008; Li and Zhou 2015; Krijthe and Loog 2017).

On the other hand, it has been recently shown that unlabeled data can be effectively utilized without such restrictive distributional assumptions in the context of classification from positive and unlabeled data (PU classification) (du Plessis et al. 2014). Furthermore, based on recent advances in PU classification (du Plessis et al. 2014, 2015; Niu et al. 2016), a novel semi-supervised classification approach has been developed that combines supervised classification with PU classification (Sakai et al. 2017). This approach inherits the advances of PU classification that the restrictive distributional assumptions are not necessary and is demonstrated to perform excellently in experiments.

Following this line of research, we first develop an AUC optimization method from positive and unlabeled data (PU-AUC) in this paper. Previously, a pairwise ranking method for PU data has been developed (Sundararajan et al. 2011), which can be regarded as an AUC optimization method for PU data. However, it merely regards unlabeled data as negative data and thus the obtained classifier is biased. On the other hand, our PU-AUC method is unbiased and we theoretically prove that unlabeled data contribute to reducing an upper bound on the generalization error with the optimal parametric convergence rate without the restrictive distributional assumptions.

Then we extend our PU-AUC method to the semi-supervised setup by combining it with a supervised AUC optimization method. Theoretically, we again prove that unlabeled data contribute to reducing an upper bound on the generalization error with the optimal parametric convergence rate without the restrictive distributional assumptions, and further we prove that the variance of the empirical risk of our semi-supervised AUC optimization method can be smaller than that of the plain supervised counterpart. The latter claim suggests that the proposed semi-supervised empirical risk is also useful in the cross-validation phase. Finally, we experimentally demonstrate the usefulness of the proposed PU and semi-supervised AUC optimization methods.

2 Preliminary

We first describe our problem setting and review an existing supervised AUC optimization method.

Let covariate $\varvec{x}\in \mathbb {R}^d$ and its corresponding label $y\in \{\pm 1\}$ be equipped with probability density $p(\varvec{x},y)$, where d is a positive integer. Suppose we have sets of positive and negative samples:

$$\begin{aligned} \mathcal {X}_{\mathrm {P}}&:=\{\varvec{x}^{\mathrm {P}}_i\}^{n_{{\mathrm {P}}}}_{i=1} {\mathop {\sim }\limits ^{\mathrm {i.i.d.}}} p_{{\mathrm {P}}}(\varvec{x}):=p(\varvec{x}\mid y=+1) , \mathrm {\;and} \\ \mathcal {X}_{\mathrm {N}}&:=\{\varvec{x}^{\mathrm {N}}_j\}^{n_{{\mathrm {N}}}}_{j=1} {\mathop {\sim }\limits ^{\mathrm {i.i.d.}}} p_{{\mathrm {N}}}(\varvec{x}):=p(\varvec{x}\mid y=-1) . \end{aligned}$$

Furthermore, let $g:\mathbb {R}^d\rightarrow \mathbb {R}$ be a decision function and classification is carried out based on its sign: $\widehat{y}={{\mathrm{sign}}}(g(\varvec{x}))$.

The goal is to train a classifier g by maximizing the AUC (Hanley and McNeil 1982; Cortes and Mohri 2004) defined and expressed as

$$\begin{aligned} \mathrm {AUC}(g)&:={{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[I(g(\varvec{x}^{\mathrm {P}})\ge g(\varvec{x}^{\mathrm {N}}))]\right] \nonumber \\&=1-{{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {N}}}\left[ I(g(\varvec{x}^{\mathrm {P}})<g(\varvec{x}^{\mathrm {N}}))]\right] \nonumber \\&=1-{{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {N}}}\left[ \ell _{0\text {-}1}(g(\varvec{x}^{\mathrm {P}})-g(\varvec{x}^{\mathrm {N}}))]\right] , \end{aligned}$$

(1)

where ${{\mathrm{E}}}_{{\mathrm {P}}}$ and ${{\mathrm{E}}}_{{\mathrm {N}}}$ be the expectations over $p_{{\mathrm {P}}}(\varvec{x})$ and $p_{{\mathrm {N}}}(\varvec{x})$, respectively. $I(\cdot )$ is the indicator function, which is replaced with the zero-one loss, $\ell _{0\text {-}1}(m)=(1-{{\mathrm{sign}}}(m))/2$, to obtain the last equation. Let

$$\begin{aligned} f(\varvec{x},\varvec{x}'):=g(\varvec{x})-g(\varvec{x}') \end{aligned}$$

be a composite classifier. Maximizing the AUC corresponds to minimizing the second term in Eq. (1). Practically, to avoid the discrete nature of the zero-one loss, we replace the zero-one loss with a surrogate loss $\ell (m)$ and consider the following PN-AUC risk (Herschtal and Raskutti 2004; Kotlowski et al. 2011; Rakhlin et al. 2012):

$$\begin{aligned} R_\mathrm {PN}(f):={{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}}))]\right] . \end{aligned}$$

(2)

In practice, we train a classifier by minimizing the empirical PN-AUC risk defined as

$$\begin{aligned} \widehat{R}_\mathrm {PN}(f):=\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {N}}}}_{j=1} \ell \left( f\left( \varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {N}}_j\right) \right) . \end{aligned}$$

Similarly to the classification-calibrated loss (Bartlett et al. 2006) in misclassification rate minimization, the consistency of AUC optimization in terms of loss functions has been studied recently (Gao and Zhou 2015; Gao et al. 2016). They showed that minimization of the AUC risk with a consistent loss function is asymptotically equivalent to that with the zero-one loss function. The squared loss $\ell _{\mathrm {S}}(m):=(1-m)^2$, the exponential loss $\ell _{\mathrm {E}}(m):=\exp (-m)$, and the logistic loss $\ell _{\mathrm {L}}(m):=\log (1+\exp (-m))$ are shown to be consistent, while the hinge loss $\ell _{\mathrm {H}}(m):=\max (0,1-m)$ and the absolute loss $\ell _{\mathrm {A}}(m):=|1-m|$ are not consistent.

3 Proposed method

In this section, we first propose an AUC optimization method from positive and unlabeled data and then extend it to a semi-supervised AUC optimization method.

3.1 PU-AUC optimization

In PU learning, we do not have negative data while we can use unlabeled data drawn from marginal density $p(\varvec{x})$ in addition to positive data:

$$\begin{aligned} \mathcal {X}_{\mathrm {U}}&:=\{\varvec{x}^{\mathrm {U}}_k\}^{n_{{\mathrm {U}}}}_{k=1} {\mathop {\sim }\limits ^{\mathrm {i.i.d.}}} p(\varvec{x})=\theta _{{\mathrm {P}}}p_{{\mathrm {P}}}(\varvec{x})+\theta _{{\mathrm {N}}}p_{{\mathrm {N}}}(\varvec{x}) , \end{aligned}$$

(3)

where

$$\begin{aligned} \theta _{{\mathrm {P}}}:=p(y=+1) \mathrm {\;\;and\;\;} \theta _{{\mathrm {N}}}:=p(y=-1). \end{aligned}$$

We derive an equivalent expression to the PN-AUC risk that depends only on positive and unlabeled data distributions without the negative data distribution. In our derivation and theoretical analysis, we assume that $\theta _{{\mathrm {P}}}$ and $\theta _{{\mathrm {N}}}$ are known. In practice, they are replaced by their estimate obtained, e.g., by du Plessis et al. (2017), Kawakubo et al. (2016), and references therein.

From the definition of the marginal density in Eq. (3), we have

$$\begin{aligned} {{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}}, \varvec{x}^{\mathrm {U}}))]\right]&=\theta _{{\mathrm {P}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}}, \varvec{\bar{x}}^{\mathrm {P}}))]\right] +\theta _{{\mathrm {N}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {P}}, \varvec{x}^{\mathrm {N}}))]\right] \\&=\theta _{{\mathrm {P}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}}, \varvec{\bar{x}}^{\mathrm {P}}))]\right] +\theta _{{\mathrm {N}}}R_\mathrm {PN}(f), \end{aligned}$$

where ${{\mathrm{E}}}_{{\bar{\mathrm {P}}}}$ denotes the expectation over $p_{{\mathrm {P}}}(\varvec{\bar{x}}^{\mathrm {P}})$. Dividing the above equation by $\theta _{{\mathrm {N}}}$ and rearranging it, we can express the PN-AUC risk in Eq. (2) based on PU data (the PU-AUC risk) as

$$\begin{aligned} R_\mathrm {PN}(f)=\frac{1}{\theta _{{\mathrm {N}}}} {{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}}, \varvec{x}^{\mathrm {U}}))]\right] -\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}} {{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}\left[ \ell (f(\varvec{x}^{\mathrm {P}}, \varvec{\bar{x}}^{\mathrm {P}}))]\right] := R_\mathrm {PU}(f) . \end{aligned}$$

(4)

We refer to the method minimizing the PU-AUC risk as PU-AUC optimization. We will theoretically investigate the superiority of $R_\mathrm {PU}$ in Sect. 4.1.

To develop a semi-supervised AUC optimization method later, we also consider AUC optimization form negative and unlabeled data, which can be regarded as a mirror of PU-AUC optimization. From the definition of the marginal density in Eq. (3), we have

$$\begin{aligned} {{\mathrm{E}}}_{{\mathrm {U}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {U}}, \varvec{x}^{\mathrm {N}}))]\right]&=\theta _{{\mathrm {P}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {P}}, \varvec{x}^{\mathrm {N}}))]\right] +\theta _{{\mathrm {N}}}{{\mathrm{E}}}_{{\mathrm {N}}}\left[ {{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}[\ell (f(\varvec{x}^{\mathrm {N}}, \varvec{\bar{x}}^{\mathrm {N}}))]\right] \\&=\theta _{{\mathrm {P}}}R_\mathrm {PN}(f) +\theta _{{\mathrm {N}}}{{\mathrm{E}}}_{{\mathrm {N}}}\left[ {{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}[\ell (f(\varvec{x}^{\mathrm {N}}, \varvec{\bar{x}}^{\mathrm {N}}))]\right] , \end{aligned}$$

where ${{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}$ denotes the expectation over $p_{{\mathrm {N}}}(\varvec{\bar{x}}^{\mathrm {N}})$. Rearranging the above equation, we can obtain the PN-AUC risk in Eq. (2) based on negative and unlabeled data (the NU-AUC risk):

$$\begin{aligned} R_\mathrm {PN}(f)=\frac{1}{\theta _{{\mathrm {P}}}} {{\mathrm{E}}}_{{\mathrm {U}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {U}}, \varvec{x}^{\mathrm {N}}))]\right] -\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}} {{\mathrm{E}}}_{{\mathrm {N}}}\left[ {{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}[\ell (f(\varvec{x}^{\mathrm {N}}, \varvec{\bar{x}}^{\mathrm {N}}))]\right] := R_\mathrm {NU}(f) . \end{aligned}$$

(5)

We refer to the method minimizing the NU-AUC risk as NU-AUC optimization.

3.2 Semi-supervised AUC optimization

Next, we propose a novel semi-supervised AUC optimization method based on positive-unlabeled learning. The idea is to combine the PN-AUC risk with the PU-AUC/NU-AUC risks, similarly to Sakai et al. (2017).^{Footnote 1}

First of all, let us define the PNPU-AUC and PNNU-AUC risks as

$$\begin{aligned} R_\mathrm {PNPU}^\gamma (f)&:=(1-\gamma )R_\mathrm {PN}(f)+\gamma R_\mathrm {PU}(f), \\ R_\mathrm {PNNU}^\gamma (f)&:=(1-\gamma )R_\mathrm {PN}(f)+\gamma R_\mathrm {NU}(f) , \end{aligned}$$

where $\gamma \in [0, 1]$ is the combination parameter. We then define the PNU-AUC risk as

$$\begin{aligned} R_\mathrm {PNU}^\eta (f):= {\left\{ \begin{array}{ll} R_\mathrm {PNPU}^\eta (f) &{} (\eta \ge 0), \\ R_\mathrm {PNNU}^{-\eta }(f) &{} (\eta <0), \end{array}\right. } \end{aligned}$$

(6)

where $\eta \in [-1, 1]$ is the combination parameter. We refer to the method minimizing the PNU-AUC risk as PNU-AUC optimization. We will theoretically discuss the superiority of $R_\mathrm {PNPU}^\gamma $ and $R_\mathrm {PNNU}^\gamma $ in Sect. 4.1.

3.3 Discussion about related work

Sundararajan et al. (2011) proposed a pairwise ranking method for PU data, which can be regarded as an AUC optimization method for PU data. Their approach simply regards unlabeled data as negative data and the ranking SVM (Joachims 2002) is applied to PU data so that the score of positive data tends to be higher than that of unlabeled data. Although this approach is simple and shown computationally efficient in experiments, the obtained classifier is biased. From the mathematical viewpoint, the existing method ignores the second term in Eq. (4) and maximizes only the first term with the hinge loss function. However, the effect of ignoring the second term is not negligible when the class prior, $\theta _{{\mathrm {P}}}$, is not sufficiently small. In contrast, our proposed PU-AUC risk includes the second term so that the PU-AUC risk is equivalent to the PN-AUC risk.

Our semi-supervised AUC optimization method can be regarded as an extension of the work by Sakai et al. (2017). They considered the misclassification rate as a measure to train a classifier and proposed a semi-supervised classification method based on the recently proposed PU classification method (du Plessis et al. 2014, 2015). On the other hand, we train a classifier by maximizing the AUC, which is a standard approach for imbalanced classification. To this end, we first developed an AUC optimization method for PU data, and then extended it to a semi-supervised AUC optimization method. Thanks to the AUC maximization formulation, our proposed method is expected to perform better than the method proposed by Sakai et al. (2017) for imbalanced data sets.

4 Theoretical analyses

In this section, we theoretically analyze the proposed risk functions. We first derive generalization error bounds of our methods and then discuss variance reduction.

4.1 Generalization error bounds

Recall the composite classifier $f(\varvec{x},\varvec{x}')=g(\varvec{x})-g(\varvec{x}')$. As the classifier g, we assume the linear-in-parameter model given by

$$\begin{aligned} g(\varvec{x})=\sum ^b_{\ell =1}w_\ell \phi (\varvec{x})=\varvec{w}^\top \varvec{\phi }(\varvec{x}) , \end{aligned}$$

where ${}^\top $ denotes the transpose of vectors and matrices, b is the number of basis functions, $\varvec{w}=(w_1,\ldots ,w_b)^\top $ is a parameter vector, and $\varvec{\phi }(\varvec{x})=(\phi _1(\varvec{x}),\ldots ,\phi _b(\varvec{x}))^\top $ is a basis function vector. Let $\mathcal {F}$ be a function class of bounded hyperplanes:

$$\begin{aligned} \mathcal {F}:=\{ f(\varvec{x},\varvec{x}')= \varvec{w}^\top (\varvec{\phi }(\varvec{x})-\varvec{\phi }(\varvec{x}')) \mid \Vert \varvec{w}\Vert \le C_{\varvec{w}}; \; \forall \varvec{x}:\Vert \varvec{\phi }(\varvec{x})\Vert \le C_{\varvec{\phi }} \}, \end{aligned}$$

where $C_{\varvec{w}}>0$ and $C_{\varvec{\phi }}>0$ are certain positive constants. This assumption is reasonable because the $\ell _2$-regularizer included in training and the use of bounded basis functions, e.g., the Gaussian kernel basis, ensure that the minimizer of the empirical AUC risk belongs to such the function class $\mathcal {F}$. We assume that a surrogate loss is bounded from above by $C_\ell $ and denote the Lipschitz constant by L. For simplicity,^{Footnote 2} we focus on a surrogate loss satisfying $\ell _{0\text {-}1}(m)\le \ell (m)$. For example, the squared loss and the exponential loss satisfy the condition.^{Footnote 3}

Let

$$\begin{aligned} I(f)={{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell _{0\text {-}1}(f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}}))]\right] \end{aligned}$$

be the generalization error of f in AUC optimization. For convenience, we define

$$\begin{aligned} h(\delta )&:=2\sqrt{2}LC_\ell C_{\varvec{w}}C_{\varvec{\phi }} +\frac{3}{2}\sqrt{2\log (2/\delta )} . \end{aligned}$$

In the following, we prove the generalization error bounds of both PU and semi-supervised AUC optimization methods.

For the PU-AUC/NU-AUC risks, we prove the following generalization error bounds (its proof is available in Appendix 1):

Theorem 1

For any $\delta >0$, the following inequalities hold separately with probability at least $1-\delta $ for all $f\in \mathcal {F}$:

$$\begin{aligned} I(f)\le \widehat{R}_\mathrm {PU}(f) + h(\delta /2) \left( \frac{1}{\theta _{{\mathrm {N}}}\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}\sqrt{{n_{{\mathrm {P}}}}}} \right) , \\ I(f)\le \widehat{R}_\mathrm {NU}(f) + h(\delta /2) \left( \frac{1}{\theta _{{\mathrm {P}}}\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}\sqrt{{n_{{\mathrm {N}}}}}} \right) , \\ \end{aligned}$$

where $\widehat{R}_\mathrm {PU}$ and $\widehat{R}_\mathrm {NU}$ are unbiased empirical risk estimators corresponding to $R_\mathrm {PU}$ and $R_\mathrm {NU}$, respectively.

Theorem 1 guarantees that I(f) can be bounded from above by the empirical risk, $\widehat{R}(f)$, plus the confidence terms of order

$$\begin{aligned} \mathcal {O}_p\left( \frac{1}{\sqrt{{n_{{\mathrm {P}}}}}}+\frac{1}{\sqrt{{n_{{\mathrm {U}}}}}}\right) ~~\mathrm {and}~~ \mathcal {O}_p\left( \frac{1}{\sqrt{{n_{{\mathrm {N}}}}}}+\frac{1}{\sqrt{{n_{{\mathrm {U}}}}}}\right) . \end{aligned}$$

Since ${n_{{\mathrm {P}}}}$ (${n_{{\mathrm {N}}}}$) and ${n_{{\mathrm {U}}}}$ can increase independently in our setting, this is the optimal convergence rate without any additional assumptions (Vapnik 1998; Mendelson 2008).

For the PNPU-AUC and PNNU-AUC risks, we prove the following generalization error bounds (its proof is also available in Appendix 1):

Theorem 2

For any $\delta >0$, the following inequalities hold separately with probability at least $1-\delta $ for all $f\in \mathcal {F}$:

$$\begin{aligned} I(f)&\le \widehat{R}_\mathrm {PNPU}^\gamma (f)+h(\delta /3) \left( \frac{1-\gamma }{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})}} +\frac{\gamma }{\theta _{{\mathrm {N}}}\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} +\frac{\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}\sqrt{{n_{{\mathrm {P}}}}}} \right) , \\ I(f)&\le \widehat{R}_\mathrm {PNNU}^\gamma (f)+h(\delta /3) \left( \frac{1-\gamma }{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})}} +\frac{\gamma }{\theta _{{\mathrm {P}}}\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} +\frac{\gamma \theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}\sqrt{{n_{{\mathrm {N}}}}}} \right) . \end{aligned}$$

where $\widehat{R}_\mathrm {PNPU}^\gamma $ and $\widehat{R}_\mathrm {PNNU}^\gamma $ are unbiased empirical risk estimators corresponding to $R_\mathrm {PNPU}^\gamma $ and $R_\mathrm {PNNU}^\gamma $, respectively.

Theorem 2 guarantees that I(f) can be bounded from above by the empirical risk, $\widehat{R}(f)$, plus the confidence terms of order

$$\begin{aligned} \mathcal {O}_p\left( \frac{1}{\sqrt{{n_{{\mathrm {P}}}}}}+\frac{1}{\sqrt{{n_{{\mathrm {N}}}}}} +\frac{1}{\sqrt{{n_{{\mathrm {U}}}}}}\right) . \end{aligned}$$

Again, since ${n_{{\mathrm {P}}}}$, ${n_{{\mathrm {N}}}}$, and ${n_{{\mathrm {U}}}}$ can increase independently in our setting, this is the optimal convergence rate without any additional assumptions.

4.2 Variance reduction

In the existing semi-supervised classification method based on PU learning, the variance of the empirical risk was proved to be smaller than the supervised counterpart under certain conditions (Sakai et al. 2017). Similarly, we here investigate if the proposed semi-supervised risk estimators have smaller variance than its supervised counterpart.

Let us introduce the following variances and covariances:^{Footnote 4}

$$\begin{aligned} \sigma _\mathrm {PN}^2(f)&={{\mathrm{Var}}}_\mathrm {PN} \left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}}))\right] , \\ \sigma _\mathrm {PP}^2(f)&={{\mathrm{Var}}}_{\mathrm {P}{\bar{\mathrm{P}}}} \left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))\right] , \\ \sigma _\mathrm {NN}^2(f)&={{\mathrm{Var}}}_{\mathrm{N}\bar{\mathrm{N}}} \left[ \ell (f(\varvec{x}^{\mathrm {N}},\varvec{\bar{x}}^{\mathrm {N}}))\right] , \\ \tau _\mathrm {PN,PP}(f)&={{\mathrm {Cov}}}_{\mathrm {PN},\mathrm {P}\bar{\mathrm {P}}} \left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}})), \ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))\right] , \\ \tau _\mathrm {PN,NN}(f)&={{\mathrm {Cov}}}_{\mathrm {PN},\mathrm {N}\bar{\mathrm {N}}} \left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}})), \ell (f(\varvec{x}^{\mathrm {N}},\varvec{\bar{x}}^{\mathrm {N}}))\right] , \\ \tau _\mathrm {PU,PP}(f)&={{\mathrm {Cov}}}_{\mathrm {PU},\mathrm {P}\bar{\mathrm {P}}} \left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}})), \ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))\right] , \\ \tau _\mathrm {NU,NN}(f)&={{\mathrm {Cov}}}_{\mathrm {NU},\mathrm {N}\bar{\mathrm {N}}} \left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}})), \ell (f(\varvec{x}^{\mathrm {N}},\varvec{\bar{x}}^{\mathrm {N}}))\right] . \end{aligned}$$

Then, we have the following theorem (its proof is available in Appendix 1):

Theorem 3

Assume ${n_{{\mathrm {U}}}}\rightarrow \infty $. For any fixed f, the minimizers of the variance of the empirical PNPU-AUC and PNNU-AUC risks are respectively obtained by

$$\begin{aligned} \gamma _\mathrm {PNPU}&={{\mathrm{argmin}}}_{\gamma } {{\mathrm{Var}}}\left[ \widehat{R}_\mathrm {PNPU}^\gamma (f)\right] =\frac{\psi _\mathrm {PN}-\psi _\mathrm {PP}/2}{\psi _\mathrm {PN}+\psi _\mathrm {PU}-\psi _\mathrm {PP}} , \end{aligned}$$

(7)

$$\begin{aligned} \gamma _\mathrm {PNNU}&={{\mathrm{argmin}}}_{\gamma } {{\mathrm{Var}}}\left[ \widehat{R}_\mathrm {PNNU}^\gamma (f)\right] =\frac{\psi _\mathrm {PN}-\psi _\mathrm {NN}/2}{\psi _\mathrm {PN}+\psi _\mathrm {NU}-\psi _\mathrm {NN}} , \end{aligned}$$

(8)

where

$$\begin{aligned} \psi _\mathrm {PN}&=\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\sigma _\mathrm {PN}^2(f), \\ \psi _\mathrm {PU}&=\frac{\theta _{{\mathrm {P}}}^2}{\theta _{{\mathrm {N}}}^2{n_{{\mathrm {P}}}}^2}\sigma _\mathrm {PP}^2(f) -\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}^2{n_{{\mathrm {P}}}}}\tau _\mathrm {PU,PP}(f) , \\ \psi _\mathrm {PP}&=\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}}\tau _\mathrm {PN,PU}(f) -\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}}\tau _\mathrm {PN,PP}(f), \\ \psi _\mathrm {NU}&=\frac{\theta _{{\mathrm {N}}}^2}{\theta _{{\mathrm {P}}}^2{n_{{\mathrm {N}}}}^2}\sigma _\mathrm {NN}^2(f) -\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}^2{n_{{\mathrm {N}}}}}\tau _\mathrm {NU,NN}(f) , \\ \psi _\mathrm {NN}&=\frac{1}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}}\tau _\mathrm {PN,NU}(f) -\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}}\tau _\mathrm {PN,NN}(f) . \end{aligned}$$

Additionally, we have ${{\mathrm{Var}}}[\widehat{R}_\mathrm {PNPU}^\gamma (f)]<{{\mathrm{Var}}}[\widehat{R}_\mathrm {PN}(f)]$ for any $\gamma \in (0,2\gamma _\mathrm {PNPU})$ if $\psi _\mathrm {PN}+\psi _\mathrm {PU}>\psi _\mathrm {PP}$ and $2\psi _\mathrm {PN}>\psi _\mathrm {PP}$. Similarly, we have ${{\mathrm{Var}}}[\widehat{R}_\mathrm {PNNU}^\gamma (f)]<{{\mathrm{Var}}}[\widehat{R}_\mathrm {PN}(f)]$ for any $\gamma \in (0,2\gamma _\mathrm {PNNU})$ if $\psi _\mathrm {PN}+\psi _\mathrm {NU}>\psi _\mathrm {NN}$ and $2\psi _\mathrm {PN}>\psi _\mathrm {NN}$.

This theorem means that, if $\gamma $ is chosen appropriately, our proposed risk estimators, $\widehat{R}_\mathrm {PNPU}^\gamma $ and $\widehat{R}_\mathrm {PNNU}^\gamma $, have smaller variance than the standard supervised risk estimator $\widehat{R}_\mathrm {PN}$. A practical consequence of Theorem 3 is that when we conduct cross-validation for hyperparameter selection, we may use our proposed risk estimators $\widehat{R}_\mathrm {PNPU}^\gamma $ and $\widehat{R}_\mathrm {PNNU}^\gamma $ instead of the standard supervised risk estimator $\widehat{R}_\mathrm {PN}$ since they are more stable (see Sect. 5.3 for details).

5 Practical implementation

In this section, we explain the implementation details of our proposed methods.

5.1 General case

In practice, the AUC risks R introduced above are replaced with their empirical version $\widehat{R}$, where the expectations in R are replaced with the corresponding sample averages.

Here, we focus on the linear-in-parameter model given by

$$\begin{aligned} g(\varvec{x})=\sum ^b_{\ell =1}w_\ell \phi (\varvec{x})=\varvec{w}^\top \varvec{\phi }(\varvec{x}) , \end{aligned}$$

where ${}^\top $ denotes the transpose of vectors and matrices, b is the number of basis functions, $\varvec{w}=(w_1,\ldots ,w_b)^\top $ is a parameter vector, and $\varvec{\phi }(\varvec{x})=(\phi _1(\varvec{x}),\ldots ,\phi _b(\varvec{x}))^\top $ is a basis function vector. The linear-in-parameter model allows us to express the composite classifier as

$$\begin{aligned} f(\varvec{x},\varvec{x}')=\varvec{w}^\top \bar{\varvec{\phi }}(\varvec{x},\varvec{x}') , \end{aligned}$$

where

$$\begin{aligned} \bar{\varvec{\phi }}(\varvec{x},\varvec{x}'):=\varvec{\phi }(\varvec{x})-\varvec{\phi }(\varvec{x}') \end{aligned}$$

is a composite basis function vector. We train the classifier by minimizing the $\ell _2$-regularized empirical AUC risk:

$$\begin{aligned} \min _{\varvec{w}} \widehat{R}(f) + \lambda \Vert \varvec{w}\Vert ^2 , \end{aligned}$$

where $\lambda \ge 0$ is the regularization parameter.

5.2 Analytical solution for squared loss

For the squared loss $\ell _S(m):=(1-m)^2$, the empirical PU-AUC risk^{Footnote 5} can be expressed as

$$\begin{aligned} \widehat{R}_\mathrm {PU}(f)&=\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1} \ell _{\mathrm {S}}\left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k)\right) \\&-\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1}\ell _{\mathrm {S}}\left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'})\right) + \frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}({n_{{\mathrm {P}}}}-1)}\\&=1-2\varvec{w}^\top \widehat{\varvec{h}}_\mathrm {PU} + \varvec{w}^\top \widehat{\varvec{H}}_\mathrm {PU}\varvec{w}- \varvec{w}^\top \widehat{\varvec{H}}_\mathrm {PP}\varvec{w}, \end{aligned}$$

where

$$\begin{aligned} \widehat{\varvec{h}}_\mathrm {PU}&:=\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{1}_{n_{{\mathrm {P}}}}- \frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {U}}}^\top \varvec{1}_{n_{{\mathrm {U}}}}, \\ \widehat{\varvec{H}}_\mathrm {PU}&:=\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{\varPhi }_{{\mathrm {P}}}-\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {U}}}^\top \varvec{1}_{n_{{\mathrm {U}}}}\varvec{1}_{n_{{\mathrm {P}}}}^\top \varvec{\varPhi }_{{\mathrm {P}}}\\&-\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{1}_{n_{{\mathrm {P}}}}\varvec{1}_{n_{{\mathrm {U}}}}^\top \varvec{\varPhi }_{{\mathrm {U}}}+\frac{1}{\theta _{{\mathrm {N}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {U}}}^\top \varvec{\varPhi }_{{\mathrm {U}}}, \\ \widehat{\varvec{H}}_\mathrm {PP}&:=\frac{2\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}({n_{{\mathrm {P}}}}-1)}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{\varPhi }_{{\mathrm {P}}}- \frac{2\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)} \varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{1}_{n_{{\mathrm {P}}}}\varvec{1}_{n_{{\mathrm {P}}}}^\top \varvec{\varPhi }_{{\mathrm {P}}}, \\ \varvec{\varPhi }_{{\mathrm {P}}}&:=\left( \varvec{\phi }(\varvec{x}^{\mathrm {P}}_1), \ldots , \varvec{\phi }(\varvec{x}^{\mathrm {P}}_{n_{{\mathrm {P}}}})\right) ^\top , \\ \varvec{\varPhi }_{{\mathrm {U}}}&:=\left( \varvec{\phi }(\varvec{x}^{\mathrm {U}}_1), \ldots , \varvec{\phi }(\varvec{x}^{\mathrm {U}}_{n_{{\mathrm {U}}}})\right) ^\top , \end{aligned}$$

and $\varvec{1}_b$ is the b-dimensional vector whose elements are all one. With the $\ell _2$-regularizer, we can analytically obtain the solution by

$$\begin{aligned} \widehat{\varvec{w}}_\mathrm {PU}:=(\widehat{\varvec{H}}_\mathrm {PU}-\widehat{\varvec{H}}_\mathrm {PP}+\lambda \varvec{I}_b)^{-1}\widehat{\varvec{h}}_\mathrm {PU}, \end{aligned}$$

where $\varvec{I}_b$ is the b-dimensional identity matrix.

The computational complexity of computing $\widehat{\varvec{h}}_\mathrm {PU}$, $\widehat{\varvec{H}}_\mathrm {PU}$, and $\widehat{\varvec{H}}_\mathrm {PP}$ are $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {U}}}})b)$, $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {U}}}})b^2)$, and $\mathcal {O}({n_{{\mathrm {P}}}}b^2)$, respectively. Then, solving a system of linear equations to obtain the solution $\widehat{\varvec{w}}_\mathrm {PU}$ requires the computational complexity of $\mathcal {O}(b^3)$. In total, the computational complexity of this PU-AUC optimization method is $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {U}}}})b^2+b^3)$.

As given by Eq. (6), our PNU-AUC optimization method consists of the PNPU-AUC risk and the PNNU-AUC risk. For the squared loss $\ell _S(m):=(1-m)^2$, the empirical PNPU-AUC risk can be expressed as

$$\begin{aligned} \widehat{R}_\mathrm {PNPU}^\gamma (f)&=\frac{1-\gamma }{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {N}}}}_{j=1} \ell _{\mathrm {S}}\left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {N}}_j)\right) +\frac{\gamma }{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1} \ell _{\mathrm {S}}\left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k)\right) \\&\qquad -\frac{\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell _{\mathrm {S}}\left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'})\right) + \frac{\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}({n_{{\mathrm {P}}}}-1)} \\&=(1-\gamma )-2(1-\gamma )\varvec{w}^\top \widehat{\varvec{h}}_\mathrm {PN} + (1-\gamma )\varvec{w}^\top \widehat{\varvec{H}}_\mathrm {PN}\varvec{w}\\&\qquad +\gamma - 2\gamma \varvec{w}^\top \widehat{\varvec{h}}_\mathrm {PU} + \gamma \varvec{w}^\top \widehat{\varvec{H}}_\mathrm {PU}\varvec{w}- \gamma \varvec{w}^\top \widehat{\varvec{H}}_\mathrm {PP}\varvec{w}, \end{aligned}$$

where

$$\begin{aligned} \widehat{\varvec{h}}_\mathrm {PN}&:=\frac{1}{{n_{{\mathrm {P}}}}}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{1}_{n_{{\mathrm {P}}}}- \frac{1}{{n_{{\mathrm {N}}}}}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{1}_{n_{{\mathrm {N}}}}, \\ \widehat{\varvec{H}}_\mathrm {PN}&:=\frac{1}{{n_{{\mathrm {P}}}}}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{\varPhi }_{{\mathrm {P}}}-\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\varvec{\varPhi }_{{\mathrm {P}}}^\top \varvec{1}_{n_{{\mathrm {P}}}}\varvec{1}_{n_{{\mathrm {N}}}}^\top \varvec{\varPhi }_{{\mathrm {N}}}\\&-\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{1}_{n_{{\mathrm {N}}}}\varvec{1}_{n_{{\mathrm {P}}}}^\top \varvec{\varPhi }_{{\mathrm {P}}}+\frac{1}{{n_{{\mathrm {N}}}}}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{\varPhi }_{{\mathrm {N}}}, \\ \varvec{\varPhi }_{{\mathrm {N}}}&:=\left( \varvec{\phi }(\varvec{x}^{\mathrm {N}}_1), \ldots , \varvec{\phi }(\varvec{x}^{\mathrm {N}}_{n_{{\mathrm {N}}}})\right) ^\top . \\ \end{aligned}$$

The solution for the $\ell _2$-regularized PNPU-AUC optimization can be analytically obtained by

$$\begin{aligned} \widehat{\varvec{w}}_\mathrm {PNPU}^\gamma :=\Big ((1-\gamma )\widehat{\varvec{H}}_\mathrm {PN} + \gamma \widehat{\varvec{H}}_\mathrm {PU} - \gamma \widehat{\varvec{H}}_\mathrm {PP} +\lambda \varvec{I}_b \Big )^{-1} \Big ((1-\gamma )\widehat{\varvec{h}}_\mathrm {PN} + \gamma \widehat{\varvec{h}}_\mathrm {PU} \Big ) . \end{aligned}$$

Similarly, the solution for the $\ell _2$-regularized PNNU-AUC optimization can be obtained by

$$\begin{aligned} \widehat{\varvec{w}}_\mathrm {PNNU}^\gamma :=\Big ((1-\gamma )\widehat{\varvec{H}}_\mathrm {PN} + \gamma \widehat{\varvec{H}}_\mathrm {NU} - \gamma \widehat{\varvec{H}}_\mathrm {NN} +\lambda \varvec{I}_b \Big )^{-1} \Big ((1-\gamma )\widehat{\varvec{h}}_\mathrm {PN} + \gamma \widehat{\varvec{h}}_\mathrm {NU} \Big ) . \end{aligned}$$

where

$$\begin{aligned} \widehat{\varvec{h}}_\mathrm {NU}&:=\frac{1}{\theta _{{\mathrm {P}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {U}}}^\top \varvec{1}_{n_{{\mathrm {U}}}}-\frac{1}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{1}_{n_{{\mathrm {N}}}}, \\ \widehat{\varvec{H}}_\mathrm {NU}&:=\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{\varPhi }_{{\mathrm {N}}}-\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {U}}}^\top \varvec{1}_{n_{{\mathrm {U}}}}\varvec{1}_{n_{{\mathrm {N}}}}^\top \varvec{\varPhi }_{{\mathrm {N}}}\\&-\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{1}_{n_{{\mathrm {N}}}}\varvec{1}_{n_{{\mathrm {U}}}}^\top \varvec{\varPhi }_{{\mathrm {U}}}+\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {U}}}}}\varvec{\varPhi }_{{\mathrm {U}}}^\top \varvec{\varPhi }_{{\mathrm {U}}}, \\ \widehat{\varvec{H}}_\mathrm {NN}&:=\frac{2\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}({n_{{\mathrm {N}}}}-1)}\varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{\varPhi }_{{\mathrm {N}}}- \frac{2\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}({n_{{\mathrm {N}}}}-1)} \varvec{\varPhi }_{{\mathrm {N}}}^\top \varvec{1}_{n_{{\mathrm {N}}}}\varvec{1}_{n_{{\mathrm {N}}}}^\top \varvec{\varPhi }_{{\mathrm {N}}}. \\ \end{aligned}$$

The computational complexity of computing $\widehat{\varvec{h}}_\mathrm {PN}$ and $\widehat{\varvec{H}}_\mathrm {PN}$ are $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {N}}}})b)$ and $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {N}}}})b^2)$, respectively. Then, obtaining the solution $\widehat{\varvec{w}}_\mathrm {PNPU}$ ($\widehat{\varvec{w}}_\mathrm {PNNU}$) requires the computational complexity of $\mathcal {O}(b^3)$. Including the computational complexity of computing $\widehat{\varvec{h}}_\mathrm {PU}$, $\widehat{\varvec{H}}_\mathrm {PU}$, and $\widehat{\varvec{H}}_\mathrm {PP}$, the total computational complexity of the PNPU-AUC optimization method is $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {N}}}}+{n_{{\mathrm {U}}}})b^2+b^3)$. Similarly, the total computational complexity of the PNNU-AUC optimization method is $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {N}}}}+{n_{{\mathrm {U}}}})b^2+b^3)$. Thus, the computational complexity of the PNU-AUC optimization method is $\mathcal {O}(({n_{{\mathrm {P}}}}+{n_{{\mathrm {N}}}}+{n_{{\mathrm {U}}}})b^2+b^3)$.

From the viewpoint of computational complexity, the squared loss and the exponential loss are more efficient than the logistic loss because these loss functions reduce the nested summations to individual ones. More specifically, for example, in the PU-AUC optimization method, the logistic loss requires $\mathcal {O}({n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}})$ operations for evaluating the first term in the PU-AUC risk, i.e., the loss over positive and unlabeled samples. In contrast, the squared loss and exponential loss reduce the number of operations for loss evaluation to $\mathcal {O}({n_{{\mathrm {P}}}}+{n_{{\mathrm {U}}}})$.^{Footnote 6} This property is beneficial especially when we handle large scale data sets.

5.3 Cross-validation

To tune the hyperparameters such as the regularization parameter $\lambda $, we use the cross-validation.

For the PU-AUC optimization method, we use the PU-AUC risk in Eq. (4) with the zero-one loss as the cross-validation score.

For the PNU-AUC optimization method, we use the PNU-AUC risk in Eq. (6) with the zero-one loss as the score. To this end, however, we need to fix the combination parameter $\eta $ in the cross-validation score in advance and then, we tune the hyperparameters including the combination parameter. More specifically, let $\bar{\eta }\in [-1,1]$ be the predefined combination parameter. We conduct cross-validation with respect to $R_\mathrm {PNU}^{\bar{\eta }}(f)$ for tuning the hyperparameters. Since the PNU-AUC risk is equivalent to the PN-AUC risk for any $\bar{\eta }$, we can choose any $\bar{\eta }$ in principle. However, when the empirical PNU-AUC risk is used in practice, choice of $\bar{\eta }$ may affect the performance of cross-validation.

Here, based on the theoretical result of variance reduction given in Sect. 4.2, we give a practical method to determine $\bar{\eta }$. Assuming the covariances, e.g., $\tau _\mathrm {PN,PP}(f)$, are small enough to be neglected and $\sigma _\mathrm {PN}(f)=\sigma _\mathrm {PP}(f)=\sigma _\mathrm {NN}(f)$, we can obtain a simpler form of Eqs. (7) and (8) as

$$\begin{aligned} \bar{\gamma }_\mathrm {PNPU}&= \frac{1}{1+\theta _{{\mathrm {P}}}^2{n_{{\mathrm {N}}}}/\left( \theta _{{\mathrm {N}}}^2{n_{{\mathrm {P}}}}\right) } , \\\bar{\gamma }_\mathrm {PNNU}&= \frac{1}{1+\theta _{{\mathrm {N}}}^2{n_{{\mathrm {P}}}}/\left( \theta _{{\mathrm {P}}}^2{n_{{\mathrm {N}}}}\right) } . \end{aligned}$$

They can be computed simply from the number of samples and the (estimated) class-prior. Finally, to select the combination parameter $\eta $, we use $\widehat{R}_\mathrm {PNPU}^{\bar{\gamma }_\mathrm {PNPU}}$ for $\eta \ge 0$, and $\widehat{R}_\mathrm {PNNU}^{\bar{\gamma }_\mathrm {PNNU}}$ for $\eta <0$.

6 Experiments

In this section, we numerically investigate the behavior of the proposed methods and evaluate their performance on various data sets. All experiments were carried out using a PC equipped with two 2.60 GHz Intel^® Xeon^® E5-2640 v3 CPUs.

As the classifier, we used the linear-in-parameter model. In all experiments except text classification tasks, we used the Gaussian kernel basis function expressed as

$$\begin{aligned} \phi _\ell (\varvec{x})=\exp \left( -\frac{\Vert \varvec{x}-\varvec{x}_\ell \Vert ^2}{2\sigma ^2}\right) , \end{aligned}$$

where $\sigma >0$ is the Gaussian bandwidth, $\{\varvec{x}_\ell \}^b_{\ell =1}$ are the samples randomly selected from training samples $\{\varvec{x}_i\}^n_{i=1}$ and n is the number of training samples. In text classification tasks, we used the linear kernel basis function:

$$\begin{aligned} \phi _\ell (\varvec{x})=\varvec{x}^\top \varvec{x}_\ell . \end{aligned}$$

The number of basis functions was set at $b=\min (n, 200)$. The candidates of the Gaussian bandwidth were $\mathrm {median}(\{\Vert \varvec{x}_i-\varvec{x}_j\Vert \}^n_{i,j=1}) \times \{1/8,1/4,1/2,1,2\}$ and that of the regularization parameter were $\{10^{-3},10^{-2},10^{-1},10^0,10^1\}$. All hyper-parameters were determined by five-fold cross-validation. As the loss function, we used the squared loss function $\ell _{\mathrm {S}}(m)=(1-m)^2$.

6.1 Effect of variance reduction

First, we numerically confirm the effect of variance reduction. We compare the variance of the empirical PNU-AUC risk against the variance of the empirical PN-AUC risk, ${{\mathrm{Var}}}[\widehat{R}_\mathrm {PNU}^\eta (f)]$ vs. ${{\mathrm{Var}}}[\widehat{R}_\mathrm {PN}(f)]$, under a fixed classifier f.

As the fixed classifier, we used the minimizer of the empirical PN-AUC risk, denoted by $\widehat{f}_\mathrm {PN}$. The number of positive and negative samples for training varied as $({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})=(2,8)$, (10, 10), and (18, 2). We then computed the variance of the empirical PN-AUC and PNU-AUC risks with additional 10 positive, 10 negative, and 300 unlabeled samples. As the data set, we used the Banana data set (Rätsch et al. 2001). In this experiment, the class-prior was set at $\theta _{{\mathrm {P}}}=0.1$ and assumed to be known.

Figure 1 plots the value of the variance of the empirical PNU-AUC risk divided by that of the PN-AUC risk,

$$\begin{aligned} r:=\frac{{{\mathrm{Var}}}\left[ \widehat{R}_\mathrm {PNU}^\eta (\widehat{f}_\mathrm {PN})\right] }{{{\mathrm{Var}}}\left[ \widehat{R}_\mathrm {PN}(\widehat{f}_\mathrm {PN})\right] } , \end{aligned}$$

as a function of the combination parameter $\eta $ under different numbers of positive and negative samples. The results show that $r<1$ can be achieved by an appropriate choice of $\eta $, meaning that the variance of the empirical PNU-AUC risk can be smaller than that of the PN-AUC risk.

We then investigate how the class-prior affects the variance reduction. In this experiment, the number of positive and negative samples for $\widehat{f}_\mathrm {PN}$ are ${n_{{\mathrm {P}}}}=10$ and ${n_{{\mathrm {N}}}}=10$, respectively. Figure 2 showed the values of r as a function of the combination parameter $\eta $ under different class-priors. When the class-prior, $\theta _{{\mathrm {P}}}$, is 0.1 and 0.2, the variance can be reduced for $\eta >0$. When the class-prior is 0.3, the range of the value of $\eta $ that yields variance reduction becomes smaller. However, this may not be that problematic in practice, because AUC optimization is effective when two classes are highly imbalanced, i.e., the class-prior is far from 0.5; when the class-prior is close to 0.5, we may simply use the standard misclassification rate minimization approach.

6.2 Benchmark data sets

Next, we report the classification performance of the proposed PU-AUC and PNU-AUC optimization methods, respectively. We used 15 benchmark data sets from the IDA Benchmark Repository (Rätsch et al. 2001), the Semi-Supervised Learning Book (Chapelle et al. 2006), the LIBSVM (Chang et al. 2011), and the UCI Machine Learning Repository (Lichman 2013). The detailed statistics of the data sets are summarized in Appendix 1.

6.2.1 AUC optimization from positive and unlabeled data

We compared the proposed PU-AUC optimization method against the existing AUC optimization method based on the ranking SVM (PU-RSVM) (Sundararajan et al. 2011). We trained a classifier with samples of size ${n_{{\mathrm {P}}}}=100$ and ${n_{{\mathrm {U}}}}=1000$ under the different class-priors $\theta _{{\mathrm {P}}}=0.1$ and 0.2. For the PU-AUC optimization method, the squared loss function was used and the class-prior was estimated by the distribution matching method (du Plessis et al. 2017). The results of the estimated class-prior are summarized in Table 1.

Table 1 Average and standard error of the estimated class-prior over 50 trials on benchmark data sets in PU learning setting

Full size table

Table 2 lists the average with standard error of the AUC over 50 trials, showing that the proposed PU-AUC optimization method achieves better performance than the existing method. In particular, when $\theta _{{\mathrm {P}}}=0.2$, the difference between PU-RSVM and our method becomes larger compared with the difference when $\theta _{{\mathrm {P}}}=0.1$. Since PU-RSVM can be interpreted as regarding unlabeled data as negative, the bias caused by this becomes larger when $\theta _{{\mathrm {P}}}=0.2$.

Figure 3 summarizes the average computation time over 50 trials. The computation time of the PU-AUC optimization method includes both the class-prior estimation and the empirical risk minimization. The results show that the PU-AUC optimization method requires almost twice computation time as that of PU-RSVM, but it would be acceptable in practice to obtain better performance.

Table 2 Average and standard error of the AUC over 50 trials on benchmark data sets

Full size table

6.2.2 Semi-supervised AUC optimization

Here, we compare the proposed PNU-AUC optimization method against existing AUC optimization approaches: the semi-supervised rankboost (SSRankboost) (Amini et al. 2008),^{Footnote 7} the semi-supervised AUC-optimized logistic sigmoid (sAUC-LS) (Fujino and Ueda 2016),^{Footnote 8} and the optimum AUC with a generative model (OptAG) (Fujino and Ueda 2016).

We trained the classifier with samples of size ${n_{{\mathrm {P}}}}=\theta _{{\mathrm {P}}}\cdot {n_\mathrm {L}}$, ${n_{{\mathrm {N}}}}={n_\mathrm {L}}-{n_{{\mathrm {P}}}}$, and ${n_{{\mathrm {U}}}}=1000$, where ${n_\mathrm {L}}$ is the number of labeled samples. For the PNU-AUC optimization method, the squared loss function was used and the candidates of the combination parameter $\eta $ were $\{-0.9, -0.8, \ldots , 0.9\}$. For the class-prior estimation, we used the energy distance minimization method (Kawakubo et al. 2016). The results of the estimated class-prior are summarized in Table 3.

Table 3 Average and standard error of the estimated class-prior over 50 trials on benchmark data sets in semi-supervised learning setting

Full size table

Table 4 Average and standard error of the AUC over 50 trials on benchmark data sets

Full size table

For SSRankboost, the discount factor and the number of neighbors were chosen from $\{10^{-3},10^{-2},10^{-1}\}$ and $\{2,3,\ldots ,7\}$, respectively. For sAUC-LS and OptAG, the regularization parameter for the entropy regularizer was chosen from $\{1, 10\}$. Furthermore, as the generative model of OptAG, we adapted the Gaussian distribution for the data distribution and the Gaussian and Gamma distributions for the prior of the data distribution.^{Footnote 9}

Table 4 lists the average with standard error of the AUC over 50 trials, showing that the proposed PNU-AUC optimization method achieves better performance than or comparable performance to the existing methods on many data sets. Figure 4 summarizes the average computation time over 50 trials. The computation time of the PNU-AUC optimization method includes both the class-prior estimation and the empirical risk minimization. The results show that even though our proposed method involves the class-prior estimation, the computation time is relatively faster than SSRankboost and much faster than sAUC-LS and OptAG. The reason for longer computation time of sAUC-LS and OptAG is that their implementation is based on the logistic loss in which the number of operations for loss evaluation is $\mathcal {O}({n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}+{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}+{n_{{\mathrm {N}}}}{n_{{\mathrm {U}}}})$, unlike the PNU-AUC optimization method with the squared loss in which the number of operations for loss evaluation is $\mathcal {O}({n_{{\mathrm {P}}}}+{n_{{\mathrm {N}}}}+{n_{{\mathrm {U}}}})$ (cf. the discussion about the computational complexity in Sect. 5).

Table 5 Average with standard error of the AUC over 20 trials on the text classification data sets

Full size table

6.3 Text classification

Next, we apply our proposed PNU-AUC optimization method to a text classification task. We used the Reuters Corpus Volume I data set (Lewis et al. 2004), the Amazon Review data set (Dredze et al. 2008), and the 20 Newsgroups data set (Lang 1995). More specifically, we used the data set processed for a binary classification task: the rcv1, amazon2, and news20 data sets. The rcv1 and news20 data sets are available at the website of LIBSVM (Chang et al. 2011), and the amazon2 is designed by ourselves, which consists of the product reviews of books and music from the Amazon$\textit{7}$ data set (Blondel et al. 2013). The dimension of a feature vector of the rcv1 data set is 47, 236, that of the amazon2 data set is 262,144, and that of the news20 data set is 1, 355, 191.

We trained a classifier with samples of size ${n_{{\mathrm {P}}}}=20$, ${n_{{\mathrm {N}}}}=80$, and ${n_{{\mathrm {U}}}}=10,000$. The true class-prior was set at $\theta _{{\mathrm {P}}}=0.2$ and estimated by the method based on energy distance minimization (Kawakubo et al. 2016). For the generative model of OptAG, we employed naive Bayes (NB) multinomial models and a Dirichlet prior for the prior distribution of the NB model as described in Fujino and Ueda (2016).

Table 5 lists the average with standard error of the AUC over 20 trials, showing that the proposed method outperforms the existing methods. Figure 5 summarizes the average computation time of each method. These results show that the proposed method achieves better performance with short computation time.

6.4 Sensitivity analysis

Here, we investigate the effect of the estimation accuracy of the class-prior for the PNU-AUC optimization method. Specifically, we added noise $\rho \in \{-0.09,-0.08,\ldots ,0.09\}$ to the true class-prior $\theta _{{\mathrm {P}}}$ and used $\widehat{\theta }_{\mathrm {P}}=\theta _{{\mathrm {P}}}+\rho $ as the estimated class-prior for the PNU-AUC optimization method. Under the different values of the class-prior $\theta _{{\mathrm {P}}}=0.1, 0.2$, and 0.3, we trained a classifier with samples of size ${n_{{\mathrm {P}}}}=\theta _{{\mathrm {P}}}\times 50$, ${n_{{\mathrm {N}}}}=\theta _{{\mathrm {N}}}\times 50$, and ${n_{{\mathrm {U}}}}=1000$.

Figure 6 summarizes the average with standard error of the AUC as a function of the noise. The plots show that when $\theta _{{\mathrm {P}}}=0.2$ and 0.3, the performance of the PNU-AUC optimization method is stable even when the estimated class-prior has some noise. On the other hand, when $\theta _{{\mathrm {P}}}=0.1$, as the noise is close to $\rho =-\,0.09$, the performance largely decreases. Since the true class-prior is small, it is sensitive to the negative bias. In particular, when $\rho =-\,0.09$, the gap between the estimated and true class-priors is larger than other values. For instance, when $\rho =-\,0.09$ and $\theta _{{\mathrm {P}}}=0.2$, $\theta _{{\mathrm {P}}}/\widehat{\theta }_\mathrm {P}\approx 1.8$, but when $\rho =-0.09$ and $\theta _{{\mathrm {P}}}=0.1$, $\theta _{{\mathrm {P}}}/\widehat{\theta }_\mathrm {P}\approx 10$. In contrast, the positive bias does not heavily affect the performance even when $\theta _{{\mathrm {P}}}=0.1$.

6.5 Scalability

Finally, we report the scalability of our proposed PNU-AUC optimization method. Specifically, we evaluated the AUC and computation time while increasing the number of unlabeled samples. We picked two large data sets: the SUSY and amazon2 data sets. The number of positive and negative samples were ${n_{{\mathrm {P}}}}=40$ and ${n_{{\mathrm {N}}}}=160$, respectively.

Figure 7 summarizes the average with standard error of the AUC and computation time as a function of the number of unlabeled samples. The AUC on the SUSY data set slightly increased at ${n_{{\mathrm {U}}}}=$1,000,000, but the improvement on the amazon2 data set was not noticeable or the performance decreased slightly. In this experiment, the increase of the size of unlabeled data did not significantly improve the performance of the classifier, but it did not affect adversely, i.e., it did not cause significant performance degeneration.

The result of computation time shows that the proposed PNU-AUC optimization method can handle approximately 1,000,000 samples within reasonable computation time in this experiment. The longer computation time on the SUSY data set before ${n_{{\mathrm {U}}}}=$10,000 is because we need to choose one additional hyperparameter, i.e., the bandwidth of the Gaussian kernel basis function, compared with the linear kernel basis function. However, the effect gradually decreases; after ${n_{{\mathrm {U}}}}=$10,000, the matrix multiplication of the high dimensional matrix on the amazon2 data set $(d=262,144)$ requires more computation time than the SUSY data set $(d=18)$.

7 Conclusions

In this paper, we proposed a novel AUC optimization method from positive and unlabeled data and extend it to a novel semi-supervised AUC optimization method. Unlike the existing approach, our approach does not rely on strong distributional assumptions on the data distributions such as the cluster and the entropy minimization principle. Without the distributional assumptions, we theoretically derived the generalization error bounds of our PU and semi-supervised AUC optimization methods. Moreover, for our semi-supervised AUC optimization method, we showed that the variance of the empirical risk can be smaller than that of the supervised counterpart. Through numerical experiments, we demonstrated the practical usefulness of the proposed PU and semi-supervised AUC optimization methods.

Change history

23 January 2018
On page 7, 4.2 Variance reduction, the third line of the equation needs to be corrected as follows:

Notes

In Sakai et al. (2017), the combination of the PU and NU risks has also considered and found to be less favorable than the combination of the PN and PU/NU risks. For this reason, we focus on the latter in this paper.
Our theoretical analysis can be easily extended to the loss satisfying $\ell _{0\text {-}1}(m)\le M\ell (m)$ with a certain $M>0$.
These losses are bounded in our setting, since the input to $\ell (m)$, i.e., f is bounded.
${{\mathrm{Var}}}_\mathrm {PN}$, ${{\mathrm{Var}}}_{\mathrm{P}\bar{\mathrm{P}}}$, and ${{\mathrm{Var}}}_{\mathrm{N}\bar{\mathrm{N}}}$ are the variances over $p_{{\mathrm {P}}}(\varvec{x}^{\mathrm {P}})p_{{\mathrm {N}}}(\varvec{x}^{\mathrm {N}})$, $p_{{\mathrm {P}}}(\varvec{x}^{\mathrm {P}})p_{{\mathrm {P}}}(\varvec{\bar{x}}^{\mathrm {P}})$, and $p_{{\mathrm {N}}}(\varvec{x}^{\mathrm {N}})p_{{\mathrm {N}}}(\varvec{\bar{x}}^{\mathrm {N}})$, respectively. ${{\mathrm {Cov}}}_{\mathrm {PN},\mathrm {P}\bar{\mathrm {P}}}$, ${{\mathrm {Cov}}}_{\mathrm {PN},\mathrm {N}\bar{\mathrm {N}}}$, ${{\mathrm {Cov}}}_{\mathrm {PU},\mathrm {P}\bar{\mathrm {P}}}$, and ${{\mathrm {Cov}}}_{\mathrm {NU},\mathrm {N}\bar{\mathrm {N}}}$ are the covariances over $p_{{\mathrm {P}}}(\varvec{x}^{\mathrm {P}})p_{{\mathrm {N}}}(\varvec{x}^{\mathrm {N}})p_{{\mathrm {P}}}(\varvec{\bar{x}}^{\mathrm {P}}), p_{{\mathrm {P}}}(\varvec{x}^{\mathrm {P}})p_{{\mathrm {N}}}(\varvec{x}^{\mathrm {N}})p_{{\mathrm {N}}}(\varvec{\bar{x}}^{\mathrm {N}})$, $p_{{\mathrm {P}}}(\varvec{x}^{\mathrm {P}})p(\varvec{x}^{\mathrm {U}})p_{{\mathrm {P}}}(\varvec{\bar{x}}^{\mathrm {P}})$, and $p_{{\mathrm {N}}}(\varvec{x}^{\mathrm {N}})p(\varvec{x}^{\mathrm {U}})p_{{\mathrm {N}}}(\varvec{\bar{x}}^{\mathrm {N}})$, respectively.
We discuss the way of estimating the PU-AUC risk in Appendix 1.
For example, the exponential loss over positive and unlabeled data can be computed as follows:
$$\begin{aligned} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{{n_{{\mathrm {U}}}}}_{k=1}\ell _{\mathrm {E}}\left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k)\right)&=\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{{n_{{\mathrm {U}}}}}_{k=1}\exp (-g(\varvec{x}^{\mathrm {P}}_i) +g(\varvec{x}^{\mathrm {U}}_k)) \\&=\sum ^{n_{{\mathrm {P}}}}_{i=1}\exp \left( -g(\varvec{x}^{\mathrm {P}}_i)\right) \sum ^{{n_{{\mathrm {U}}}}}_{k=1}\exp \left( g(\varvec{x}^{\mathrm {U}}_k)\right) . \end{aligned}$$
Thus, the number of operations for loss evaluation is reduced to ${n_{{\mathrm {P}}}}+{n_{{\mathrm {U}}}}+1$ rather than ${n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}$.
We used the code available at http://ama.liglab.fr/~amini/SSRankBoost/.
This method is equivalent to OptAG without a generative model, which only employs a discriminative model with the entropy minimization principle. To eliminate the adverse effect of the wrongly chosen generative model, we added this method for comparison.
As the generative model, we used the Gaussian distributions for positive and negative classes:
$$\begin{aligned} p_g\Big (\varvec{x}^{\mathrm {P}}; \varvec{\mu }_\mathrm {P}\Big )&\propto \tau _\mathrm {P}^\frac{d}{2} \exp \Big (-\frac{\tau _\mathrm {P}}{2}\Vert \varvec{x}^{\mathrm {P}}-\varvec{\mu }_\mathrm {P}\Vert ^2\Big ) , \\ p_g\Big (\varvec{x}^{\mathrm {P}}; \varvec{\mu }_\mathrm {N}\Big )&\propto \tau _\mathrm {N}^\frac{d}{2} \exp \Big (-\frac{\tau _\mathrm {N}}{2}\Vert \varvec{x}^{\mathrm {N}}-\varvec{\mu }_\mathrm {N}\Vert ^2\Big ) , \end{aligned}$$
where $\tau _\mathrm {P}$ and $\tau _\mathrm {N}$ denote the precisions and $\varvec{\mu }_\mathrm {P}$ and $\varvec{\mu }_\mathrm {N}$ are the means. As the prior of $\varvec{\mu }_\mathrm {P}$, $\varvec{\mu }_\mathrm {N}$, $\tau _\mathrm {P}$, and $\tau _\mathrm {N}$, we used the Gaussian and Gamma distributions:
$$\begin{aligned} p\left( \varvec{\mu }_\mathrm {P};\varvec{\mu }_\mathrm {P}^0\right)&\propto \tau _\mathrm {P}^\frac{d}{2} \exp \left( -\frac{\rho _\mathrm {P}^0\tau _\mathrm {P}}{2}\Vert \varvec{\mu }_\mathrm {P}-\varvec{\mu }_\mathrm {P}^0\Vert ^2\right) , \\ p\left( \varvec{\mu }_\mathrm {N};\varvec{\mu }_\mathrm {N}^0\right)&\propto \tau _\mathrm {N}^\frac{d}{2} \exp \left( -\frac{\rho _\mathrm {N}^0\tau _\mathrm {N}}{2}\Vert \varvec{\mu }_\mathrm {N}-\varvec{\mu }_\mathrm {N}^0\Vert ^2\right) , \\ p\left( \tau _\mathrm {P}; a_\mathrm {P}^0,b_\mathrm {P}^0\right)&\propto \tau _\mathrm {P}^{a_\mathrm {P}^0-1}\exp \left( -b_\mathrm {P}^0\tau _\mathrm {P}\right) , \\ p\left( \tau _\mathrm {N}; a_\mathrm {N}^0,b_\mathrm {N}^0\right)&\propto \tau _\mathrm {N}^{a_\mathrm {N}^0-1}\exp \left( -b_\mathrm {N}^0\tau _\mathrm {N}\right) , \\ \end{aligned}$$
where $\varvec{\mu }_\mathrm {P}^0, \varvec{\mu }^0, a_\mathrm {P}^0, b_\mathrm {P}^0, a_\mathrm {N}^0, b_\mathrm {N}^0, \rho _\mathrm {P}^0, \hbox {and} \rho _\mathrm {N}^0$ are the hyperparameters.

References

Amini, M. R., Truong, T. V., & Goutte, C. (2008). A boosting algorithm for learning bipartite ranking functions with partially labeled data. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 99–106).
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
Article MathSciNet MATH Google Scholar
Blondel, M., Seki, K., & Uehara, K. (2013). Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning, 93(1), 31–52.
Article MathSciNet MATH Google Scholar
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27.
Article Google Scholar
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge: MIT Press.
Google Scholar
Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16, 313–320.
Google Scholar
Cozman, F. G., Cohen, I., & Cirelo, M. C. (2003). Semi-supervised learning of mixture models. In Proceedings of the 20th international conference on machine learning (pp. 99–106).
Dredze, M., Crammer, K., & Pereira, F. (2008). Confidence-weighted linear classification. In Proceedings of the 25th international conference on machine learning (pp. 264–271).
du Plessis, M. C., Niu, G., & Sugiyama, M. (2014). Analysis of learning from positive and unlabeled data. Advances in Neural Information Processing Systems, 27, 703–711.
Google Scholar
du Plessis, M. C., Niu, G., & Sugiyama, M. (2017). Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4), 463–492.
Article MathSciNet MATH Google Scholar
du Plessis, M. C., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In Proceedings of 32nd international conference on machine learning, JMLR workshop and conference proceedings (Vol. 37, pp. 1386–1394).
Fujino, A., Ueda, N. (2016). A semi-supervised AUC optimization method with generative models. In IEEE 16th international conference on data mining (pp. 883–888).
Gao, W., & Zhou, Z. H. (2015). On the consistency of AUC pairwise optimization. In International joint conference on artificial intelligence (pp. 939–945).
Gao, W., Wang, L., Jin, R., Zhu, S., & Zhou, Z. H. (2016). One-pass AUC optimization. Artificial Intelligence, 236(C), 1–29.
Article MathSciNet MATH Google Scholar
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Article Google Scholar
Herschtal, A., & Raskutti, B. (2004). Optimising area under the ROC curve using gradient descent. In Proceedings of the 21st international conference on machine learning.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 133–142).
Kawakubo, H., du Plessis, M. C., & Sugiyama, M. (2016). Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Transactions on Information and Systems, E99–D(1), 176–186.
Article Google Scholar
Kotlowski, W., Dembczynski, K. J, & Huellermeier, E. (2011). Bipartite ranking through minimization of univariate loss. In Proceedings of the 28th international conference on machine learning (pp. 1113–1120).
Krijthe, J. H., & Loog, M. (2017). Robust semi-supervised least squares classification by implicit constraints. Pattern Recognition, 63, 115–126.
Article MATH Google Scholar
Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the 12th international machine learning conference.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Google Scholar
Li, Y. F., & Zhou, Z. H. (2015). Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1), 175–188.
Article Google Scholar
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
Mendelson, S. (2008). Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54(8), 3797–3803.
Article MathSciNet MATH Google Scholar
Niu, G., du Plessis, M. C., Sakai, T., Ma Y., & Sugiyama, M. (2016). Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., & Garnett, R. (Eds.) Advances in neural information processing systems (Vol. 29, pp. 1199–1207)
Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th international conference on machine learning (pp. 449–456).
Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for adaboost. Machine Learning, 42(3), 287–320.
Article MATH Google Scholar
Sakai, T., du Plessis, M. C., Niu, G., & Sugiyama, M. (2017). Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th international conference on machine learning.
Sokolovska, N., Cappé, O., & Yvon, F. (2008). The asymptotics of semi-supervised learning in discriminative probabilistic models. In Proceedings of the 25th international conference on machine learning (pp. 984–991).
Sundararajan, S., Priyanka, G., & Selvaraj, S. S. K. (2011). A pairwise ranking based approach to learning with positive and unlabeled examples. In Proceedings of the 20th ACM international conference on information and knowledge management (pp. 663–672).
Usunier, N., Amini, M., & Patrick, G. (2006). Generalization error bounds for classifiers trained with interdependent data. In: Weiss, Y., Schölkopf, P. B., & Platt J. C. (Eds.) Advances in neural information processing systems (Vol. 18, pp. 1369–1376). La Jolla, CA: Neural Information Processing Systems Foundation Inc.
Vapnik, V. N. (1998). Statistical learning theory. London: Wiley.
MATH Google Scholar
Ying, Y., Wen, L., & Lyu, S. (2016). Stochastic online AUC maximization. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., & Garnett, R. (Eds.) Advances in neural information processing systems (Vol. 29, pp. 451–459). La Jolla, CA: Neural Information Processing Systems Foundation Inc.
Zhao, P., Jin, R., Yang, T., & Hoi, S. C. (2011). Online AUC maximization. In Proceedings of the 28th international conference on machine learning (pp. 233–240).

Download references

Acknowledgements

TS was supported by KAKENHI 15J09111. GN was supported by the JST CREST program and Microsoft Research Asia. MS was supported by JST CREST JPMJCR1403. We thank Han Bao for his comments.

Author information

Authors and Affiliations

Center for Advanced Intelligence Project, RIKEN, Nihonbashi, Chuo-ku, Tokyo, Japan
Tomoya Sakai, Gang Niu & Masashi Sugiyama
Graduate School of Frontier Sciences, The University of Tokyo, Kashiwanoha, Kashiwa-shi, Chiba, Japan
Tomoya Sakai, Gang Niu & Masashi Sugiyama

Authors

Tomoya Sakai
View author publications
You can also search for this author in PubMed Google Scholar
Gang Niu
View author publications
You can also search for this author in PubMed Google Scholar
Masashi Sugiyama
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tomoya Sakai.

Additional information

Editors: Wee Sun Lee and Robert Durrant.

The original version of this article was revised: Corrections made to title, author affiliations, and equations on pp. 7 and 26.

A correction to this article is available online at https://doi.org/10.1007/s10994-017-5695-8.

Appendices

Appendix A: PU-AUC risk estimator

In this section, we discuss the way of estimating the proposed PU-AUC risk. Recall that the PU-AUC risk in Eq. (4) is defined as

$$\begin{aligned} R_\mathrm {PU}(f)=\frac{1}{\theta _{{\mathrm {N}}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]\right] -\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]\right] . \end{aligned}$$

If one additional set of positive samples $\{\varvec{\bar{x}}^{\mathrm {P}}_i\}^{n_{{\mathrm {P}}}}_{i=1}$ is available, we obtain the unbiased PU-AUC risk estimator by

$$\begin{aligned} \widehat{R}_\mathrm {PU}(f)=\frac{1}{\theta _{{\mathrm {N}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1} \ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k)\right) - \frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}^2}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{\bar{x}}^{\mathrm {P}}_{i'})\right) . \end{aligned}$$

We used this estimator in our theoretical analyses because learning is not involved. However, obtaining one additional set of samples is not always possible in practice. Thus, instead of the above risk estimator, we use the following risk estimator in our implementation:

$$\begin{aligned} \widehat{R}_\mathrm {PU}(f)=\frac{1}{\theta _{{\mathrm {N}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1} \ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k)\right) - \frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}} \left( \frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'})\right) - \frac{\ell (0)}{{n_{{\mathrm {P}}}}-1}\right) . \end{aligned}$$

This estimator is also unbiased. To show unbiasedness of this estimator, let us rewrite the second term of the PU-AUC risk without coefficient in Eq. (4) as

$$\begin{aligned} {{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}\left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]\right] = {{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}}\left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))\right] . \end{aligned}$$

The unbiased estimator can be expressed as

$$\begin{aligned} \frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'})) - \frac{\ell (0)}{{n_{{\mathrm {P}}}}-1} , \end{aligned}$$

because the expectation of the above estimator can be computed as follows:

$$\begin{aligned}&{{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}}_1, \ldots , \varvec{x}^{\mathrm {P}}_{n_{{\mathrm {P}}}}} \Big [\frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'}))-\frac{\ell (0)}{{n_{{\mathrm {P}}}}-1}\Big ] \\&={{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}}_1, \ldots , \varvec{x}^{\mathrm {P}}_{n_{{\mathrm {P}}}}} \Big [\frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)} \Big (\sum ^{n_{{\mathrm {P}}}}_{i=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_i)) +\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'\ne i}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'}))\Big ) -\frac{\ell (0)}{{n_{{\mathrm {P}}}}-1}\Big ] \\&={{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}}_1, \ldots , \varvec{x}^{\mathrm {P}}_{n_{{\mathrm {P}}}}} \Big [\frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\Big (\sum ^{n_{{\mathrm {P}}}}_{i=1}\ell (0) +\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'\ne i}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'}))\Big ) -\frac{\ell (0)}{{n_{{\mathrm {P}}}}-1}\Big ] \\&=\frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'\ne i} {{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'}} \Big [\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {P}}_{i'}))\Big ] \\&=\frac{1}{{n_{{\mathrm {P}}}}({n_{{\mathrm {P}}}}-1)}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'\ne i} {{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}} \Big [\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))\Big ] \\&={{\mathrm{E}}}_{\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}}\Big [\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))\Big ] , \end{aligned}$$

where we used $f(\varvec{x},\varvec{x})=\varvec{w}^\top (\varvec{\phi }(\varvec{x})-\varvec{\phi }(\varvec{x}))=0$ from the second to third lines. If the squared loss function $\ell (m)=(1-m)^2$ is used, $\ell (0)=1$ (cf. the implementation with the squared loss in Sect. 5). Therefore, the proposed PU-AUC risk estimator is unbiased.

Appendix B: Proof of generalization error bounds

Here, we give the proofs of generalization error bounds in Sect. 4.1. The proofs are based on Usunier et al. (2006).

Let $\{\varvec{x}_i\}^m_{i=1}$ and $\{\varvec{x}'_j\}^n_{j=1}$ be two sets of samples drawn from the distribution equipped with densities $q(\varvec{x})$ and $q'(\varvec{x})$, respectively. Recall $\mathcal {F}$ be a function class of bounded hyperplanes:

$$\begin{aligned} \mathcal {F}:=\{ f(\varvec{x})=\langle \varvec{w}, \varvec{\phi }(\varvec{x})-\varvec{\phi }(\varvec{x}')\rangle \mid \Vert \varvec{w}\Vert \le C_{\varvec{w}}; ~~ \forall \varvec{x}:\Vert \varvec{\phi }(\varvec{x})\Vert \le C_{\varvec{\phi }} \}, \end{aligned}$$

where $C_{\varvec{w}}>0$ and $C_{\varvec{\phi }}>0$ are certain positive constants. Then, the AUC risk over distributions q and $q'$ and its empirical version can be expressed as

$$\begin{aligned} R(f)&:={{\mathrm{E}}}_{\varvec{x}\sim q}\left[ {{\mathrm{E}}}_{\varvec{x}'\sim q'}[\ell (f(\varvec{x},\varvec{x}'))]\right] , \\ \widehat{R}(f)&:=\frac{1}{mn}\sum ^m_{i=1}\sum ^n_{j=1} \ell \left( f(\varvec{x}_i,\varvec{x}'_j)\right) . \end{aligned}$$

For convenience, we define

$$\begin{aligned} h(\delta ):=2\sqrt{2}LC_\ell C_{\varvec{w}}C_{\varvec{\phi }} +\frac{3}{2}\sqrt{2\log (2/\delta )} . \end{aligned}$$

We first have the following theorem:

Theorem 4

For any $\delta >0$, the following inequality holds with probability at least $1-\delta $ for any $f\in \mathcal {F}$:

$$\begin{aligned} R(f)-\widehat{R}(f)&\le h(\delta )\frac{1}{\sqrt{\min (n,n')}} . \end{aligned}$$

Proof

By slightly modifying Theorem 7 in Usunier et al. (2006) to fit our setting, for any $\delta >0$, with probability at least $1-\delta $ for any $f\in \mathcal {F}$, we have

$$\begin{aligned} R(f)-\widehat{R}(f)&\le \frac{2LC_\ell C_{\varvec{w}}\sqrt{\max (n,n')}}{nn'} \sqrt{\sum ^n_{i=1}\sum ^{n'}_{j=1} \Vert \varvec{\phi }(\varvec{x}_i)-\varvec{\phi }(\varvec{x}_j)\Vert ^2} \nonumber \\&+ 3\sqrt{\frac{\log (2/\delta )}{2\min (n,n')}} . \end{aligned}$$

(9)

Applying the inequality

$$\begin{aligned} \sum ^n_{i=1}\sum ^{n'}_{j=1} \Vert \varvec{\phi }(\varvec{x}_i)-\varvec{\phi }(\varvec{x}_j)\Vert ^2&\le n'\sum ^n_{i=1}\Vert \varvec{\phi }(\varvec{x}_i)\Vert ^2 +n\sum ^{n'}_{j=1}\Vert \varvec{\phi }(\varvec{x}_j)\Vert ^2 \\&\le 2nn' C_{\varvec{\phi }}^2 , \end{aligned}$$

to the first term in Eq. (9), we obtain the theorem. $\square $

By using Theorem 4, we prove the risk bounds of the PU-AUC and NU-AUC risks:

Lemma 1

For any $\delta >0$, the following inequalities hold separately with probability at least $1-\delta $ for any $f\in \mathcal {F}$:

$$\begin{aligned} R_\mathrm {PU}(f)-\widehat{R}_\mathrm {PU}(f)&\le h(\delta /2) \left( \frac{1}{\theta _{{\mathrm {N}}}\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}\sqrt{{n_{{\mathrm {P}}}}}} \right) , \\ R_\mathrm {NU}(f)-\widehat{R}_\mathrm {NU}(f)&\le h(\delta /2) \left( \frac{1}{\theta _{{\mathrm {P}}}\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}\sqrt{{n_{{\mathrm {N}}}}}} \right) . \end{aligned}$$

Proof

Recall that the PU-AUC and NU-AUC risks are expressed as

$$\begin{aligned} R_\mathrm {PU}(f)&=\frac{1}{\theta _{{\mathrm {N}}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]\right] -\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}}{{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}\left[ \ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]\right] , \\ R_\mathrm {NU}(f)&=\frac{1}{\theta _{{\mathrm {P}}}}{{\mathrm{E}}}_{{\mathrm {U}}}\left[ {{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {U}},\varvec{x}^{\mathrm {N}}))]\right] -\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}}{{\mathrm{E}}}_{{\mathrm {N}}}\left[ {{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}[\ell (f(\varvec{x}^{\mathrm {N}},\varvec{\bar{x}}^{\mathrm {N}}))]\right] . \end{aligned}$$

Based on Theorem 4, for any $\delta >0$, we have these uniform deviation bounds with probability at least $1-\delta /2$:

$$\begin{aligned} \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]] -\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k))\Big )&\le h(\delta /2)\frac{1}{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {U}}}[{{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {U}},\varvec{x}^{\mathrm {N}}))]] -\frac{1}{{n_{{\mathrm {N}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {U}}}}_{k=1}\sum ^{n_{{\mathrm {N}}}}_{j=1}\ell (f(\varvec{x}^{\mathrm {U}}_k,\varvec{x}^{\mathrm {N}}_j))\Big )&\le h(\delta /2)\frac{1}{\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]] -\frac{1}{{n_{{\mathrm {P}}}}^2}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{\bar{x}}^{\mathrm {P}}_{i'}))\Big )&\le h(\delta /2)\frac{1}{\sqrt{{n_{{\mathrm {P}}}}}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {N}}}[{{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}[\ell (f(\varvec{x}^{\mathrm {N}},\varvec{\bar{x}}^{\mathrm {N}}))]] -\frac{1}{{n_{{\mathrm {N}}}}^2}\sum ^{n_{{\mathrm {N}}}}_{j=1}\sum ^{n_{{\mathrm {N}}}}_{j'=1} \ell (f(\varvec{x}^{\mathrm {N}}_j,\varvec{\bar{x}}^{\mathrm {N}}_{j'}))\Big )&\le h(\delta /2)\frac{1}{\sqrt{{n_{{\mathrm {N}}}}}} , \end{aligned}$$

Simple calculation showed that for any $\delta >0$, with probability $1-\delta $, we have

$$\begin{aligned} \sup _{f\in \mathcal {F}}\left( R_\mathrm {PU}(f) - \widehat{R}_\mathrm {PU}(f)\right)&\le \frac{1}{\theta _{{\mathrm {N}}}} \sup _{f\in \mathcal {F}}\left( {{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]] -\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k))\right) \nonumber \\&+\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}}\sup _{f\in \mathcal {F}}\left( {{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]] -\frac{1}{{n_{{\mathrm {P}}}}^2}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{\bar{x}}^{\mathrm {P}}_{i'}))\right) \nonumber \\&\le h(\delta /2) \left( \frac{1}{\theta _{{\mathrm {N}}}\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}\sqrt{{n_{{\mathrm {P}}}}}} \right) , \end{aligned}$$

(10)

where we used

$$\begin{aligned} \sup (x+y)&\le \sup (x)+\sup (y) , \\ R_\mathrm {PU}(f)&\le \frac{1}{\theta _{{\mathrm {N}}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]\right] +\frac{\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}}{{\mathrm{E}}}_{{\mathrm {P}}}\left[ {{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]\right] . \end{aligned}$$

Similarly, for the NU-AUC risk, we have

$$\begin{aligned} \sup _{f\in \mathcal {F}}\left( R_\mathrm {NU}(f) - \widehat{R}_\mathrm {NU}(f)\right)&\le h(\delta /2) \left( \frac{1}{\theta _{{\mathrm {P}}}\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}\sqrt{{n_{{\mathrm {N}}}}}} \right) . \end{aligned}$$

(11)

Equations (10) and (11) conclude the lemma. $\square $

Finally, we give the proof of Theorem 1.

Proof

Assume the loss satisfying $\ell _{0\text {-}1}(m)\le M\ell (m)$. We have $I(f)\le M R(f)$. When $M=1$ such as $\ell _{\mathrm {S}}(m)$ and $\ell _{\mathrm {E}}(m)$, $I(f)\le R(f)$ holds. This observation yields Theorem 1. $\square $

Next, we prove the generalization error bounds of the PNPU-AUC and PNNU-AUC risks in Theorem 2. We first prove the following risk bounds:

Lemma 2

For any $\delta >0$, the following inequalities hold separately with probability at least $1-\delta $ for all $f\in \mathcal {F}$:

$$\begin{aligned} R_\mathrm {PNPU}^\gamma (f) - \widehat{R}_\mathrm {PNPU}^\gamma (f)&\le h(\delta /3) \Big ( \frac{1-\gamma }{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})}} +\frac{\gamma }{\theta _{{\mathrm {N}}}\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {P}}}\gamma }{\theta _{{\mathrm {N}}}\sqrt{{n_{{\mathrm {P}}}}}} \Big ) , \\ R_\mathrm {PNNU}^\gamma (f) - \widehat{R}_\mathrm {PNNU}^\gamma (f)&\le h(\delta /3) \Big ( \frac{1-\gamma }{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})}} +\frac{\gamma }{\theta _{{\mathrm {P}}}\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} +\frac{\theta _{{\mathrm {N}}}\gamma }{\theta _{{\mathrm {P}}}\sqrt{{n_{{\mathrm {N}}}}}} \Big ) . \end{aligned}$$

Proof

Recall the PNPU-AUC and PNNU-AUC risks:

$$\begin{aligned} R_\mathrm {PNPU}^\gamma (f)&:=(1-\gamma )R_\mathrm {PN}(f)+\gamma R_\mathrm {PU}(f), \\ R_\mathrm {PNNU}^\gamma (f)&:=(1-\gamma )R_\mathrm {PN}(f)+\gamma R_\mathrm {NU}(f) . \end{aligned}$$

Based on Theorem 4, for any $\delta >0$, we have these uniform deviation bounds with probability at least $1-\delta /3$:

$$\begin{aligned} \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}}))]] -\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {N}}}}_{j=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {N}}_j))\Big )&\le h(\delta /3)\frac{1}{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]] -\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k))\Big )&\le h(\delta /3)\frac{1}{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {U}}}[{{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {U}},\varvec{x}^{\mathrm {N}}))]] -\frac{1}{{n_{{\mathrm {N}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {U}}}}_{k=1}\sum ^{n_{{\mathrm {N}}}}_{j=1}\ell (f(\varvec{x}^{\mathrm {U}}_k,\varvec{x}^{\mathrm {N}}_j))\Big )&\le h(\delta /3)\frac{1}{\sqrt{\min ({n_{{\mathrm {N}}}},{n_{{\mathrm {U}}}})}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]] -\frac{1}{{n_{{\mathrm {P}}}}^2}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{\bar{x}}^{\mathrm {P}}_{i'}))\Big )&\le h(\delta /3)\frac{1}{\sqrt{{n_{{\mathrm {P}}}}}} , \\ \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {N}}}[{{\mathrm{E}}}_{{{\bar{\mathrm{N}}}}}[\ell (f(\varvec{x}^{\mathrm {N}},\varvec{\bar{x}}^{\mathrm {N}}))]] -\frac{1}{{n_{{\mathrm {N}}}}^2} \sum ^{n_{{\mathrm {N}}}}_{j=1}\sum ^{n_{{\mathrm {N}}}}_{j'=1}\ell (f(\varvec{x}^{\mathrm {N}}_j,\varvec{\bar{x}}^{\mathrm {N}}_{j'}))\Big )&\le h(\delta /3)\frac{1}{\sqrt{{n_{{\mathrm {N}}}}}} . \end{aligned}$$

Combining three bounds from the above, for any $\delta >0$, with probability $1-\delta $, we have

$$\begin{aligned}&\sup _{f\in \mathcal {F}}\Big (R_\mathrm {PNPU}^\gamma (f) - \widehat{R}_\mathrm {PNPU}^\gamma (f)\Big ) \\&\le (1-\gamma ) \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {N}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {N}}))]] -\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {N}}}}_{j=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {N}}_j))\Big ) \\&\quad +\frac{\gamma }{\theta _{{\mathrm {N}}}} \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\mathrm {U}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{x}^{\mathrm {U}}))]]-\frac{1}{{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1}\ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k))\Big ) \\&\quad +\frac{\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}} \sup _{f\in \mathcal {F}}\Big ({{\mathrm{E}}}_{{\mathrm {P}}}[{{\mathrm{E}}}_{{\bar{\mathrm {P}}}}[\ell (f(\varvec{x}^{\mathrm {P}},\varvec{\bar{x}}^{\mathrm {P}}))]] -\frac{1}{{n_{{\mathrm {P}}}}^2}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1} \ell (f(\varvec{x}^{\mathrm {P}}_i,\varvec{\bar{x}}^{\mathrm {P}}_{i'})) \Big ) \\&\le h(\delta /3) \Big ( \frac{1-\gamma }{\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {N}}}})}} +\frac{\gamma }{\theta _{{\mathrm {N}}}\sqrt{\min ({n_{{\mathrm {P}}}},{n_{{\mathrm {U}}}})}} +\frac{\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}\sqrt{{n_{{\mathrm {P}}}}}} \Big ) . \end{aligned}$$

This concludes the risk bounds of the PNPU-AUC risk.

Similarly, we prove the risk bounds of the PNNU-AUC risk. $\square $

Again, $I(f)\le R(f)$ holds in our setting. This leads to Theorem 2.

Appendix C: Proof of variance reduction

Here, we give the proof of Theorem 3.

Proof

The empirical PNPU-AUC risk can be expressed as

$$\begin{aligned} \widehat{R}_\mathrm {PNPU}^\gamma (f)&=(1-\gamma )\widehat{R}_\mathrm {PN}(f)+\gamma \widehat{R}_\mathrm {PU}(f) \\&=\frac{1-\gamma }{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {N}}}}_{j=1} \ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {N}}_j)\right) +\frac{\gamma }{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}{n_{{\mathrm {U}}}}} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {U}}}}_{k=1}\ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{x}^{\mathrm {U}}_k)\right) \\&\quad -\frac{\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}^2} \sum ^{n_{{\mathrm {P}}}}_{i=1}\sum ^{n_{{\mathrm {P}}}}_{i'=1}\ell \left( f(\varvec{x}^{\mathrm {P}}_i,\varvec{\bar{x}}^{\mathrm {P}}_{i'})\right) . \end{aligned}$$

Assume ${n_{{\mathrm {U}}}}\rightarrow \infty $, we obtain

$$\begin{aligned} {{\mathrm{Var}}}\left[ \widehat{R}_\mathrm {PNPU}^\gamma (f)\right]&= \frac{(1-\gamma )^2}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\sigma _\mathrm {PN}^2(f) +\frac{\gamma ^2\theta _{{\mathrm {P}}}^2}{{n_{{\mathrm {P}}}}^2\theta _{{\mathrm {N}}}^2}\sigma _\mathrm {PP}^2(f) +\frac{(1-\gamma )\gamma }{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}}\tau _\mathrm {PN,PU}(f) \\&\quad - \frac{\gamma ^2\theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}^2{n_{{\mathrm {P}}}}}\tau _\mathrm {PU,PP}(f) -\frac{(1-\gamma )\gamma \theta _{{\mathrm {P}}}}{\theta _{{\mathrm {N}}}{n_{{\mathrm {P}}}}} \tau _\mathrm {PN,PP}(f) \\&=(1-\gamma )^2 \psi _\mathrm {PN}+ \gamma ^2\psi _\mathrm {PU}+ (1-\gamma )\gamma \psi _\mathrm {PP}, \end{aligned}$$

where the terms divided by ${n_{{\mathrm {U}}}}$ are disappeared. Setting the derivative with respect to $\gamma $ at zero, we obtain the minimizer in Eq. (7).

For the empirical PNNU-AUC risk, when ${n_{{\mathrm {U}}}}\rightarrow \infty $, we obtain

$$\begin{aligned} {{\mathrm{Var}}}\left[ \widehat{R}_\mathrm {PNNU}^\gamma (g)\right]&= \frac{(1-\gamma )^2}{{n_{{\mathrm {P}}}}{n_{{\mathrm {N}}}}}\sigma _\mathrm {PN}^2(g) +\frac{\gamma ^2\theta _{{\mathrm {N}}}^2}{{n_{{\mathrm {N}}}}^2\theta _{{\mathrm {P}}}^2}\sigma _\mathrm {NN}^2(g) +\frac{(1-\gamma )\gamma }{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}}\tau _\mathrm {PN,NU}(g) \\&\quad - \frac{\gamma ^2\theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}^2{n_{{\mathrm {N}}}}}\tau _\mathrm {NU,NN}(g) -\frac{(1-\gamma )\gamma \theta _{{\mathrm {N}}}}{\theta _{{\mathrm {P}}}{n_{{\mathrm {N}}}}} \tau _\mathrm {PN,NN}(g) \\&=(1-\gamma )^2 \psi _\mathrm {PN}+ \gamma ^2 \psi _\mathrm {NU}+ (1-\gamma )\gamma \psi _\mathrm {NN}. \end{aligned}$$

Setting the derivative with respect to $\gamma $ at zero, we obtain the minimizer in Eq. (8). $\square $

Table 6 The statistics of the data sets

Full size table

Appendix D: Statistics of data sets

Table 6 summarizes the statistics of the data sets used in our experiments. The class balance is the number of positive samples divided by that of total samples. The sources of data sets are as follows: the IDA Benchmark Repository (IDA) (Rätsch et al. 2001), the UCI Machine Learning Repository (UCI) (Lichman 2013), the LIBSVM data sets (LIBSVM) (Chang et al. 2011), the Semi-Supervised Learning Book (SSL) (Chapelle et al. 2006), and the Amazon Review (Amazon7) (Blondel et al. 2013).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Sakai, T., Niu, G. & Sugiyama, M. Semi-supervised AUC optimization based on positive-unlabeled learning. Mach Learn 107, 767–794 (2018). https://doi.org/10.1007/s10994-017-5678-9

Download citation

Received: 29 April 2017
Accepted: 16 September 2017
Published: 26 October 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s10994-017-5678-9

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Semi-supervised AUC optimization based on positive-unlabeled learning

Abstract

Similar content being viewed by others

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Learning from positive and unlabeled data: a survey

1 Introduction

2 Preliminary

3 Proposed method

3.1 PU-AUC optimization

3.2 Semi-supervised AUC optimization

3.3 Discussion about related work

4 Theoretical analyses

4.1 Generalization error bounds

Theorem 1

Theorem 2

4.2 Variance reduction

Theorem 3

5 Practical implementation

5.1 General case

5.2 Analytical solution for squared loss

5.3 Cross-validation

6 Experiments

6.1 Effect of variance reduction

6.2 Benchmark data sets

6.2.1 AUC optimization from positive and unlabeled data

6.2.2 Semi-supervised AUC optimization

6.3 Text classification

6.4 Sensitivity analysis

6.5 Scalability

7 Conclusions

Change history

23 January 2018

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendices

Appendix A: PU-AUC risk estimator

Appendix B: Proof of generalization error bounds

Theorem 4

Proof

Lemma 1

Proof

Proof

Lemma 2

Proof

Appendix C: Proof of variance reduction

Proof

Appendix D: Statistics of data sets

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation