1 Introduction

Partial label learning deals with the problem where each training example is associated with a set of candidate labels, among which only one is valid (Cour et al. 2011; Zhang 2014). In recent years, partial label learning techniques have proven useful in many real-world scenarios such as web mining (Jie and Orabona 2010), multimedia content analysis (Cour et al. 2009; Zeng et al. 2013), and ecoinformatics (Liu and Dietterich 2012).

Formally speaking, let \({\mathcal {X}}={\mathbb {R}}^d\) be the d-dimensional instance space and \({\mathcal {Y}}=\{1,2,\ldots ,q\}\) be the label space with q class labels. Given the partial label training set \({\mathcal {D}}=\{(\mathbf {x}_i,S_i)\mid 1\le i\le m \}\), the task of partial label learning is to induce a multi-class classifier \(f:{\mathcal {X}}\mapsto {\mathcal {Y}}\) from \({\mathcal {D}}\). Here, \(\mathbf {x}_i\in {\mathcal {X}}\) is a d-dimensional feature vector \((x_{i1},x_{i2},\ldots ,x_{id})^\top \) and \(S_i\subseteq {\mathcal {Y}}\) is the associated candidate label set. Partial label learning makes the core assumption that the ground-truth label \(y_i\) of \(\mathbf{x}_i\) resides in its candidate label set \(S_i\) but is not directly accessible to the learning algorithm.

Intuitively, the basic strategy for handling the partial label learning problem is disambiguation, i.e. trying to identify the ground-truth label from the candidate label set associated with each training example. As one of the popular machine learning techniques, the maximum margin criterion has been applied to learn from partial label examples. Specifically, existing attempts disambiguate each partial label training example by optimizing the margin between the maximum modeling output from its candidate labels and that from its non-candidate labels (Nguyen and Caruana 2008). In other words, given the parametric model with parameters \({{\varvec{\Theta }}}\) and \({{\varvec{x}}}_i\)’s modeling output \(F(\mathbf{x}_i,y;{{\varvec{\Theta }}})\) on each class label \(y\in {\mathcal {Y}}\), the existing formulation works by maximizing the following predictive difference over \(\mathbf{x}_i\): \(\max _{y_j\in S_i}F(\mathbf{x}_i,y_j;{{\varvec{\Theta }}})-\max _{y_k\notin S_i}F(\mathbf{x}_i,y_k;{{\varvec{\Theta }}})\). Nonetheless, this formulation fails to consider the predictive difference between the ground-truth label (i.e. \(y_i\)) and the other labels in the candidate label set (i.e. \(S_i{\setminus }\{y_i\}\)). Because such discriminative information is ignored, the generalization performance of the resulting maximum margin partial label learning approach might be suboptimal.

Essentially, the task of partial label learning is to induce a multi-class classifier \(f:{\mathcal {X}}\mapsto {\mathcal {Y}}\). Therefore, the canonical multi-class margin, i.e. \(F(\mathbf{x}_i,y_i;{{\varvec{\Theta }}})-\max _{{\tilde{y}}_i\ne y_i}F(\mathbf{x}_i,{\tilde{y}}_i;{{\varvec{\Theta }}})\), is a natural choice for learning from partial label examples. In this way, the modeling output from the ground-truth label is distinguished from those of all the other labels. In view of this observation, a new maximum margin partial label learning approach named M3PL, i.e. MaxiMum Margin Partial Label learning, is proposed in this paper. Evidently, the major challenge in making use of the multi-class margin for partial label training examples lies in the fact that the ground-truth labeling information is not accessible to the learning algorithm. To overcome this difficulty, M3PL employs an iterative optimization procedure which alternates between identifying the ground-truth label and maximizing the multi-class margin. Comprehensive comparative studies against state-of-the-art partial label learning approaches clearly validate the effectiveness of the proposed formulation.
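The two margins can be contrasted on a toy score vector; the numbers below are hypothetical and serve only to illustrate the difference:

```python
import numpy as np

# Hypothetical modeling outputs F(x_i, y) for a 3-class problem, labels {0, 1, 2}
scores = np.array([2.0, 1.5, 0.2])
S = {0, 1}          # candidate label set; suppose label 0 is the ground truth

# Existing formulation: best candidate output minus best non-candidate output
margin_existing = max(scores[y] for y in S) - max(scores[y] for y in {0, 1, 2} - S)

# Canonical multi-class margin: ground-truth output minus best of all other labels
margin_multiclass = scores[0] - np.delete(scores, 0).max()

# margin_existing = 1.8 even though candidate labels 0 and 1 are barely
# separated; the multi-class margin exposes exactly that gap: 0.5
```

The existing margin stays large even when two candidate labels are nearly tied, which is precisely the discriminative information the multi-class margin retains.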

The remainder of this paper is organized as follows. Section 2 briefly discusses related work on partial label learning. Section 3 introduces technical details of the proposed M3PL approach. Section 4 reports experimental results across a broad range of datasets. Finally, Sect. 5 summarizes the paper and indicates several future research issues.

2 Related work

As the labeling information conveyed by each partial label training example is ambiguous, partial label learning can be regarded as one of the weakly-supervised learning frameworks. Conceptually speaking, it lies between the two ends of the supervision spectrum, i.e. standard supervised learning with explicit supervision and unsupervised learning with no supervision. Learning with weak supervision has found wide application in various learning tasks, as explicit and sufficient supervision information is generally hard to obtain in real-world scenarios (Pfahringer 2012). In particular, partial label learning is related to several well-studied weakly-supervised learning frameworks, including semi-supervised learning, multi-instance learning and multi-label learning, although the weak supervision scenario it considers differs from those of these counterpart frameworks.

Semi-supervised learning (Chapelle et al. 2006; Zhu and Goldberg 2009) aims to induce a classifier \(f:{\mathcal {X}}\mapsto {\mathcal {Y}}\) from few labeled training examples along with abundant unlabeled training examples. For an unlabeled example the ground-truth label may be any label in the label space, while for a partial label example the ground-truth label is confined within its candidate label set. Multi-instance learning (Dietterich et al. 1997; Amores 2013) aims to induce a classifier \(f:2^{{\mathcal {X}}}\mapsto {\mathcal {Y}}\) from training examples each represented as a labeled bag of instances. For a multi-instance example the label is assigned to a bag of instances, while for a partial label example the label is assigned to a single instance. Multi-label learning (Zhang and Zhou 2014; Gibaja and Ventura 2015) aims to learn a classifier \(f:{\mathcal {X}}\mapsto 2^{{\mathcal {Y}}}\) from training examples each associated with multiple labels. For a multi-label example the associated labels are all valid ones, while for a partial label example the associated labels are only candidate ones.

In recent years, a number of partial label learning approaches have been proposed by adapting major machine learning techniques. Maximum likelihood techniques are introduced to learn from partial label examples by maximizing the likelihood function \(\sum _{i=1}^{m}\log \left( \sum _{y\in S_i}F(\mathbf{x}_i,y;\varvec{\Theta })\right) \), where EM-based optimization is performed by treating the ground-truth label as a latent variable (Jin and Ghahramani 2003; Liu and Dietterich 2012). To enable convex optimization for partial label learning, a relaxed formulation is proposed which discriminates the average output from all candidate labels, i.e. \(\frac{1}{|S_i|}\sum _{y\in S_i}F({{\varvec{x}}}_i,y;{{\varvec{\Theta }}})\), against the outputs from non-candidate labels, i.e. \(F({{\varvec{x}}}_i,y;{{\varvec{\Theta }}})~(y\notin S_i)\) (Cour et al. 2011). For instance-based approaches, the labeling information from neighboring training examples is combined by weighted voting to make predictions for unseen instances (Hüllermeier and Beringer 2006; Zhang and Yu 2015). There are also approaches which transform the partial label learning problem into a binary classification problem via error-correcting output codes (ECOC) (Zhang 2014), a sparse coding problem via dictionary learning (Chen et al. 2014), or a multi-output regression problem via manifold analysis (Zhang et al. 2016).

Specifically, maximum margin techniques have also been employed to design partial label learning approaches (Nguyen and Caruana 2008). Given the parametric model \({{\varvec{\Theta }}}=\{(\mathbf{w}_p,b_p)\mid 1\le p\le q\}\) with one linear classifier \((\mathbf{w}_p,b_p)\) for each class label, the existing maximum margin partial label formulation aims to solve the following optimization problem (OP):

OP 1: Existing Maximum Margin Formulation

$$\begin{aligned} \begin{aligned}&\min _{{{\varvec{\Theta }}},{{\varvec{\xi }}}}~~~\frac{1}{2}\sum _{p=1}^q ||\mathbf{w}_p||^2+C\sum _{i=1}^m \xi _i\\ \mathrm{s.t.:}&\max _{y_j\in S_i}(\mathbf{w}_{y_j}^\top \cdot \mathbf{x}_i+b_{y_j})-\max _{y_k\notin S_i}(\mathbf{w}_{y_k}^\top \cdot \mathbf{x}_i+b_{y_k})\ge 1-\xi _i\\&\xi _i\ge 0\qquad \forall i\in \{1,2,\ldots ,m\} \end{aligned} \end{aligned}$$

Here, \({{\varvec{\xi }}}=\{\xi _1,\xi _2,\ldots ,\xi _m\}\) represents the set of slack variables and C is the regularization parameter. As shown in OP 1, the existing formulation focuses on distinguishing the maximum output from candidate labels, i.e. \(\max _{y_j\in S_i}(\mathbf{w}_{y_j}^\top \cdot \mathbf{x}_i+b_{y_j})\), from the maximum output from non-candidate labels, i.e. \(\max _{y_k\notin S_i}(\mathbf{w}_{y_k}^\top \cdot \mathbf{x}_i+b_{y_k})\). One potential drawback of this formulation lies in the fact that the predictive difference between the ground-truth label and the other candidate labels is not taken into account, which may lead to suboptimal performance for the resulting partial label learning approach.

In the next section, a new maximum margin formulation for partial label learning is proposed, which aims to maximize the canonical multi-class margin between the ground-truth label and all other labels in the label space.

3 The M3PL approach

3.1 Proposed formulation

Based on the notation given in Sect. 1, the training set \({\mathcal {D}}\) is composed of m partial label examples \(({{\varvec{x}}}_i,S_i)~(1\le i\le m)\) with \(\mathbf{x}_i\in {\mathcal {X}}\) and \(S_i\subseteq {\mathcal {Y}}\). In addition, let \(\mathbf{y}=(y_1,y_2,\ldots ,y_m)\) be the (unknown) ground-truth label assignments for the training examples. Following the partial label learning assumption, the ground-truth label of each instance \(\mathbf{x}_i\) should reside in its candidate label set \(S_i\). Therefore, the feasible solution space of \(\mathbf{y}\) corresponds to \({\mathcal {S}}=S_1\times S_2\times \cdots \times S_m\).

Following common practice, M3PL assumes a maximum margin learning system with q linear classifiers \({{\varvec{\Theta }}}=\{(\mathbf{w}_p,b_p)\mid 1\le p\le q\}\), one for each class label. Once the ground-truth label assignments \(\mathbf{y}=(y_1,y_2,\ldots ,y_m)\) are fixed, M3PL proceeds to maximize the canonical multi-class margin over each instance \(\mathbf{x}_i\), i.e.: \((\mathbf{w}_{y_i}^\top \cdot \mathbf{x}_i+b_{y_i})-\max _{{\tilde{y}}_i\ne y_i}(\mathbf{w}_{{\tilde{y}}_i}^\top \cdot \mathbf{x}_i+b_{{\tilde{y}}_i})\). By introducing slack variables \({{\varvec{\xi }}}=\{\xi _1,\xi _2,\ldots ,\xi _m\}\) to accommodate margin relaxations, the maximum margin problem considered by M3PL can be formulated as follows:

OP 2: Proposed Maximum Margin Formulation

$$\begin{aligned}&\min _{\mathbf{y},{{\varvec{\Theta }}},{{\varvec{\xi }}}}~~~\frac{1}{2}\sum _{p=1}^q ||\mathbf{w}_p||^2+C\sum _{i=1}^m \xi _i\\ \mathrm{s.t.:}&(\mathbf{w}_{y_i}^\top \cdot \mathbf{x}_i+b_{y_i})-\max _{{\tilde{y}}_i\ne y_i}(\mathbf{w}_{{\tilde{y}}_i}^\top \cdot \mathbf{x}_i+b_{{\tilde{y}}_i})\ge 1-\xi _i\\&\xi _i\ge 0\qquad \forall i\in \{1,2,\ldots ,m\}\\&\mathbf{y}\in {\mathcal {S}}\\&\sum _{i=1}^m {\mathbb {I}}(y_i=p)=n_p\qquad \forall p\in \{1,2,\ldots ,q\} \end{aligned}$$

As shown in OP 2, the first two constraints enforce the maximum margin criterion over each training example. In addition, the third constraint requires that the ground-truth label assignment \(\mathbf{y}\) take values within the feasible solution space \({\mathcal {S}}\). The fourth constraint, i.e. \(\sum _{i=1}^m {\mathbb {I}}(y_i=p)=n_p\), imposes an additional requirement on \(\mathbf{y}\) reflecting its compatibility with the prior class distribution. Intuitively, \(n_p\) represents the prior number of examples which take the p-th class label in \({\mathcal {Y}}\) as their ground-truth label.

By sharing the labeling confidence equally, i.e. \(\frac{1}{|S_i|}\), among the candidate labels in \(S_i\), the prior number can be roughly estimated as:

$$\begin{aligned} {\hat{n}}_p=\sum _{i=1}^m {\mathbb {I}}(p\in S_i)\cdot \frac{1}{|S_i|} \end{aligned}$$
(1)

Obviously, \(\sum _{p=1}^q {\hat{n}}_p=m\) holds. Furthermore, let \(\lfloor {\hat{n}}_p \rfloor \) be the integer part of \({\hat{n}}_p\) and \(r=m-\sum _{p=1}^q \lfloor {\hat{n}}_p \rfloor \) be the corresponding residual number w.r.t. the rounding operation. Then, the integer value \(n_p\) for the fourth constraint is set as:

$$\begin{aligned} n_p={\left\{ \begin{array}{ll} \lfloor {\hat{n}}_p \rfloor +1&{} \text{if } p \text{ is among the } r \text{ class labels with least } {\hat{n}}_p \text{ values}\\ \lfloor {\hat{n}}_p \rfloor &{} \text{otherwise} \end{array}\right. } \end{aligned}$$
(2)

Accordingly, \(\sum _{p=1}^q n_p=m\) still holds.
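Eqs. (1) and (2) can be sketched in a few lines (a sketch with 0-indexed labels; the function name is illustrative):

```python
import numpy as np

def estimate_class_counts(candidate_sets, q):
    """Estimate the per-class prior counts n_p of Eq. (2) from the
    rough estimates n_hat_p of Eq. (1)."""
    m = len(candidate_sets)
    # Eq. (1): share confidence 1/|S_i| equally among the candidate labels
    n_hat = np.zeros(q)
    for S in candidate_sets:
        for p in S:
            n_hat[p] += 1.0 / len(S)
    # Eq. (2): round down, then add the residual r to the r classes
    # with the least n_hat_p values
    n = np.floor(n_hat).astype(int)
    r = m - n.sum()
    if r > 0:
        n[np.argsort(n_hat)[:r]] += 1
    return n

counts = estimate_class_counts([{0, 1}, {0, 1}, {1, 2}], q=3)  # sums to m = 3
```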

Note that OP 2 is an optimization problem over mixed-type variables (i.e. the integer variables \(\mathbf{y}\) and the real-valued variables \({{\varvec{\Theta }}}\)), which are difficult to optimize simultaneously. In the following subsection, an alternating optimization procedure is employed to update \(\mathbf{y}\) and \({{\varvec{\Theta }}}\) in an iterative manner.

3.2 Alternating optimization

3.2.1 Fix \(\mathbf{y}\), update \({{\varvec{\Theta }}}\)

By fixing the ground-truth label assignments \(\mathbf{y}=(y_1,y_2,\ldots ,y_m)\), OP 2 reduces to the following optimization problem:

OP 3: Classification Model Optimization

$$\begin{aligned}&\min _{{{\varvec{\Theta }}},{{\varvec{\xi }}}}~~~\frac{1}{2}\sum _{p=1}^q ||\mathbf{w}_p||^2+C\sum _{i=1}^m \xi _i\\ \mathrm{s.t.:}&(\mathbf{w}_{y_i}^\top \cdot \mathbf{x}_i+b_{y_i})-\max _{{\tilde{y}}_i\ne y_i}(\mathbf{w}_{{\tilde{y}}_i}^\top \cdot \mathbf{x}_i+b_{{\tilde{y}}_i})\ge 1-\xi _i\\&\xi _i\ge 0\qquad \forall i\in \{1,2,\ldots ,m\} \end{aligned}$$

As shown in OP 3, the resulting optimization problem coincides with the well-studied single-label multi-class maximum margin formulation (Crammer and Singer 2001; Hsu and Lin 2002). Therefore, OP 3 can be readily solved by any off-the-shelf implementation of multi-class SVM (Fan et al. 2008).
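Assuming scikit-learn is available, this step can be sketched with `LinearSVC` under the Crammer–Singer multi-class strategy; the data below are synthetic placeholders standing in for \({\mathcal {D}}\) with the current label assignment \(\mathbf{y}\):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))          # 60 synthetic instances, d = 5
y = rng.integers(0, 3, size=60)       # current ground-truth label assignment

# Crammer-Singer multi-class SVM, i.e. the formulation of OP 3
clf = LinearSVC(multi_class='crammer_singer', C=1.0, max_iter=20000)
clf.fit(X, y)
outputs = clf.decision_function(X)    # per-label outputs, one column per class
```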

3.2.2 Fix \({{\varvec{\Theta }}}\), update \(\mathbf{y}\)

By fixing the classification model \({{\varvec{\Theta }}}=\{(\mathbf{w}_p,b_p)\mid 1\le p\le q\}\), OP 2 reduces to the following optimization problem:

OP 4: Ground-truth Label Assignment Optimization  (Version 1)

$$\begin{aligned}&\min _{\mathbf{y},{{\varvec{\xi }}}}~~~\sum _{i=1}^m \xi _i\\ \mathrm{s.t.:}&\xi _i\ge 1-\eta _i^{y_i}\\&\xi _i\ge 0\qquad \forall i\in \{1,2,\ldots ,m\}\\&\mathbf{y}\in {\mathcal {S}}\\&\sum _{i=1}^m {\mathbb {I}}(y_i=p)=n_p\qquad \forall p\in \{1,2,\ldots ,q\} \end{aligned}$$

Here, \(\eta _i^{y_i}\) represents the multi-class margin on \(\mathbf{x}_i\) by taking \(y_i\) as its ground-truth label, i.e.:

$$\begin{aligned} \eta _i^{y_i}=(\mathbf{w}_{y_i}^\top \cdot \mathbf{x}_i+b_{y_i})-\max _{{\tilde{y}}_i\ne y_i}(\mathbf{w}_{{\tilde{y}}_i}^\top \cdot \mathbf{x}_i+b_{{\tilde{y}}_i}) \end{aligned}$$
(3)

By setting \(\xi _i=\max (0,1-\eta _i^{y_i})\) according to the first two constraints, OP 4 can be re-written in the following form:

OP 5: Ground-truth Label Assignment Optimization  (Version 2)

$$\begin{aligned}&\min _{\mathbf{y}}~~~\sum _{i=1}^m \max (0,1-\eta _i^{y_i})\\ \mathrm{s.t.:}&\mathbf{y}\in {\mathcal {S}}\\&\sum _{i=1}^m {\mathbb {I}}(y_i=p)=n_p\qquad \forall p\in \{1,2,\ldots ,q\} \end{aligned}$$

Let \(\mathbf{Z}=[z_{pi}]_{q\times m}\) be the binary-valued labeling matrix for training examples, where \(z_{pi}=1\) indicates that the p-th class label in \({\mathcal {Y}}\) is the ground-truth label for \(\mathbf{x}_i\). Accordingly, set the coefficient matrix \(\mathbf{C}=[c_{pi}]_{q\times m}\) as follows:

$$\begin{aligned} \forall 1\le p\le q,~1\le i\le m:~~c_{pi}={\left\{ \begin{array}{ll} \max (0,1-\eta _i^p)&{} if~~p\in S_i\\ M &{} otherwise \end{array}\right. } \end{aligned}$$
(4)

Here, M is a user-specified large constant which prevents the learning algorithm from assigning ground-truth labels outside the candidate label set. Based on the above definitions, OP 5 can be re-written in the following form:

OP 6: Ground-truth Label Assignment Optimization  (Version 3)

$$\begin{aligned}&\min _{\mathbf{Z}}~~~\sum _{p=1}^q\sum _{i=1}^m c_{pi}\cdot z_{pi}\\ \mathrm{s.t.:}&\sum _{p=1}^q z_{pi}=1\qquad \forall i\in \{1,2,\ldots ,m\}\\&\sum _{i=1}^m z_{pi}=n_p\qquad \forall p\in \{1,2,\ldots ,q\}\\&z_{pi}\in \{0,1\} \end{aligned}$$

Here, the first constraint \(\sum _{p=1}^q z_{pi}=1\) ensures that each training example has a unique ground-truth label. In addition, the second constraint \(\sum _{i=1}^m z_{pi}=n_p\) enforces the constraint w.r.t. the prior class distribution.
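The margin of Eq. (3) and the coefficient matrix of Eq. (4) can be sketched as follows (0-indexed labels; `F[p, i]` holds the output of classifier p on instance i, names illustrative):

```python
import numpy as np

def coefficient_matrix(F, candidate_sets, M=1e4):
    """Build the q x m coefficient matrix of Eq. (4) from the q x m
    output matrix F."""
    q, m = F.shape
    C = np.full((q, m), float(M))     # non-candidate labels get the large M
    for i, S in enumerate(candidate_sets):
        for p in S:
            eta = F[p, i] - np.delete(F[:, i], p).max()   # margin, Eq. (3)
            C[p, i] = max(0.0, 1.0 - eta)                 # hinge loss
    return C

F = np.array([[2.0, 0.0],
              [0.0, 1.0]])            # toy outputs with q = m = 2
C = coefficient_matrix(F, [{0}, {0, 1}], M=100)
```

In the toy case, the non-candidate entry receives the cost M = 100, so the subsequent assignment step never selects it.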

Note that OP 6 is a binary integer programming (BIP) problem, which is NP-hard in general. Interestingly, OP 6 falls into a special case of BIP where the constraint matrix is totally unimodular (TU) and the right-hand sides of the constraints are integers. To show this, let \(\mathbf{z}=[z_{11},\ldots ,z_{q1},\ldots ,z_{1m},\ldots ,z_{qm}]^\top \) denote the vector formed by sequentially concatenating the columns of \(\mathbf{Z}\). According to OP 6, the set of constraints \(\sum _{p=1}^q z_{pi}=1~(\forall i\in \{1,2,\ldots ,m\})\) and \(\sum _{i=1}^m z_{pi}=n_p~(\forall p\in \{1,2,\ldots ,q\})\) can be expressed in the following form:

$$\begin{aligned} \mathbf{A}{} \mathbf{z}=\mathbf{s} \end{aligned}$$
(5)

Here, \(\mathbf{A}\in {\mathbb {R}}^{(m+q)\times mq}\) is the constraint matrix which corresponds to the concatenation of two matrices \(\mathbf{B}\in {\mathbb {R}}^{m\times mq}\) and \(\mathbf{C}\in {\mathbb {R}}^{q\times mq}\), i.e. \(\mathbf{A}=[\mathbf{B}^\top ,\mathbf{C}^\top ]^\top \). Specifically, entries of the matrices \(\mathbf{B}=[b_{ij}]_{m\times mq}\), \(\mathbf{C}=[c_{ij}]_{q\times mq}\) and the right-hand side vector \(\mathbf{s}=[s_1,s_2,\ldots ,s_{m+q}]^\top \) are set as:

$$\begin{aligned} \forall 1\le i\le m,~1\le j\le mq:&b_{ij}={\left\{ \begin{array}{ll} 1,&{}\mathrm{if}~~j\in [(i-1)\cdot q+1,i\cdot q]\\ 0,&{}\mathrm{otherwise} \end{array}\right. }\\ \forall 1\le i\le q,~1\le j\le mq:&c_{ij}={\left\{ \begin{array}{ll} 1,&{}\mathrm{if}~(j-1) \mod q=i-1\\ 0,&{}\mathrm{otherwise} \end{array}\right. }\nonumber \\ \forall 1\le i\le m+q:&s_i={\left\{ \begin{array}{ll} 1,&{}\mathrm{if}~i\in [1,m]\\ n_{i-m},&{}\mathrm{if}~i\in [m+1,m+q] \end{array}\right. }\nonumber \end{aligned}$$
(6)

To show that the constraint matrix \(\mathbf{A}\) is TU, it suffices to show that \(\mathbf{A}\) satisfies the following four conditions (Heller and Tompkins 1956):

  1. Each column of \(\mathbf{A}\) contains at most two non-zero entries;

  2. Every entry in \(\mathbf{A}\) takes a value of 0, \(+1\) or \(-1\);

  3. If two non-zero entries in a column of \(\mathbf{A}\) have the same sign, then the row of one entry is in \(\mathbf{B}\) and the row of the other is in \(\mathbf{C}\);

  4. If two non-zero entries in a column of \(\mathbf{A}\) have opposite signs, then the rows of both entries are either in \(\mathbf{B}\) or in \(\mathbf{C}\).

As defined in Eq. (6), for both matrices \(\mathbf{B}\) and \(\mathbf{C}\), every entry takes a value of 0 or 1 and each column contains exactly one non-zero entry. Therefore, it is not difficult to verify that all four TU conditions hold for the constraint matrix \(\mathbf{A}\). Furthermore, the right-hand-side vector of Eq. (5) contains integer entries according to the definition in Eq. (6).
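The structure underlying the TU argument can be checked numerically for small m and q (a sketch following the column-wise stacking of \(\mathbf{Z}\), 0-indexed):

```python
import numpy as np

def constraint_matrix(m, q):
    """Build A = [B; C] for the equality constraints of OP 6, with z
    obtained by stacking the columns of Z (each column has q entries)."""
    B = np.zeros((m, m * q))    # row i: sum_p z_pi = 1
    C = np.zeros((q, m * q))    # row p: sum_i z_pi = n_p
    for i in range(m):
        B[i, i * q:(i + 1) * q] = 1
    for p in range(q):
        C[p, p::q] = 1
    return np.vstack([B, C])

A = constraint_matrix(3, 2)
# Every column of A holds exactly two ones: one in a B-row and one in a
# C-row, matching the Heller-Tompkins conditions.
```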

Based on the properties of A being TU and s being integer-valued, the original BIP problem of OP 6 can be equivalently solved in its linear programming (LP) relaxation form by replacing the integer constraint \(z_{pi}\in \{0,1\}\) with the weaker interval constraint \(z_{pi}\in [0,1]\) (Papadimitriou and Steiglitz 1998):

OP 7: Ground-truth Label Assignment Optimization  (Version 4)

$$\begin{aligned}&\min _{\mathbf{Z}}~~~\sum _{p=1}^q\sum _{i=1}^m c_{pi}\cdot z_{pi}\\ \mathrm{s.t.:}&\sum _{p=1}^q z_{pi}=1\qquad \forall i\in \{1,2,\ldots ,m\}\\&\sum _{i=1}^m z_{pi}=n_p\qquad \forall p\in \{1,2,\ldots ,q\}\\&0\le z_{pi}\le 1 \end{aligned}$$

Thereafter, a solution to the relaxation problem OP 7 can be efficiently found by employing standard LP solvers such as the simplex algorithm or the interior point algorithm (Boyd and Vandenberghe 2004).
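Assuming SciPy is available, the LP relaxation OP 7 can be sketched with `scipy.optimize.linprog`; the coefficient values below are toy placeholders:

```python
import numpy as np
from scipy.optimize import linprog

def solve_assignment(C, n):
    """Solve OP 7 for the q x m coefficient matrix C and class counts n;
    z stacks the columns of Z, so z[i*q:(i+1)*q] is column i."""
    q, m = C.shape
    A_eq, b_eq = [], []
    for i in range(m):                       # each instance gets one label
        row = np.zeros(q * m); row[i * q:(i + 1) * q] = 1
        A_eq.append(row); b_eq.append(1)
    for p in range(q):                       # class p is assigned n_p times
        row = np.zeros(q * m); row[p::q] = 1
        A_eq.append(row); b_eq.append(n[p])
    res = linprog(C.flatten(order='F'), A_eq=np.array(A_eq),
                  b_eq=np.array(b_eq), bounds=(0, 1))
    return res.x.reshape((q, m), order='F')

# Toy case: the large coefficient 100 keeps the non-candidate label unassigned
Z = solve_assignment(np.array([[0.0, 2.0], [100.0, 0.0]]), [1, 1])
```

Because the constraint matrix is TU, the returned Z is integral up to solver tolerance, so the relaxation loses nothing.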

3.3 Iterative implementation

To initialize the alternating optimization procedure, M3PL sets the initial coefficient matrix \(\mathbf{C}\) by consulting the candidate label sets:

$$\begin{aligned} \forall 1\le p\le q,~1\le i\le m:~~c_{pi}={\left\{ \begin{array}{ll} \frac{1}{|S_i|}&{} if~~p\in S_i\\ M &{} otherwise \end{array}\right. } \end{aligned}$$
(7)

By solving OP 7 with the initialized coefficients, the ground-truth label assignment \(\mathbf{y}=(y_1,y_2,\ldots ,y_m)\) is obtained as \(y_i=\arg \max _{1\le p\le q} z_{pi}\). Then, the classification model \({{\varvec{\Theta }}}\) is updated by solving OP 3 and the alternating optimization procedure iterates. After every round of alternating updates, the procedure terminates once the objective function value of OP 2 stops decreasing.

Rather than fixing the value of the regularization parameter C, M3PL gradually increases the value of C within an outer annealing loop. A similar strategy has been used in solving other weakly-supervised learning problems (Joachims 1999; Chapelle et al. 2008) to reduce the risk of getting stuck in a poor local minimum.

[Algorithm 1: pseudo-code of the M3PL approach]

Algorithm 1 summarizes the complete procedure of M3PL. Given the partial label training set, M3PL first initializes the regularization parameter C and the ground-truth label assignment (Steps 1-3). After that, the classification model and the ground-truth label assignment are alternately optimized until convergence (Steps 7-13). An outer loop is used to gradually increase the value of C by a factor of \(1+\varDelta \) (Step 5). Finally, the unseen instance is classified based on the learned classification model (Step 15). In Step 9, by introducing the kernel trick to solve the multi-class maximum margin problem OP 3 (Crammer and Singer 2001), the resulting kernelized version of M3PL is denoted as M3PL-kernel.

Within each outer loop, it is not difficult to show that the objective function of OP 2 converges as the inner alternating optimization procedure proceeds (Steps 7-13). Letting \(f({{\varvec{\Theta }}}^{(t)},\mathbf{y}^{(t)})\) denote the value of the objective function at the t-th iteration, it suffices to show that \(f(\cdot ,\cdot )\) is bounded below and non-increasing as t increases. On the one hand, as shown in OP 2, \(f({{\varvec{\Theta }}},\mathbf{y})=\frac{1}{2}\sum _{p=1}^q ||\mathbf{w}_p||^2+C\sum _{i=1}^m \max (0,1-\eta _{i}^{y_i})\) with \(\eta _i^{y_i}=(\mathbf{w}_{y_i}^\top \cdot \mathbf{x}_i+b_{y_i})-\max _{{\tilde{y}}_i\ne y_i}(\mathbf{w}_{{\tilde{y}}_i}^\top \cdot \mathbf{x}_i+b_{{\tilde{y}}_i})\), so the bounded-below property holds with \(f({{\varvec{\Theta }}},\mathbf{y})\ge 0\). On the other hand, solving the first alternating optimization problem (OP 3; Step 9) yields \(f({{\varvec{\Theta }}}^{(t)},\mathbf{y}^{(t)})\ge f({{\varvec{\Theta }}}^{(t+1)},\mathbf{y}^{(t)})\), and solving the second alternating optimization problem (OP 7; Steps 10-11) yields \(f({{\varvec{\Theta }}}^{(t+1)},\mathbf{y}^{(t)})\ge f({{\varvec{\Theta }}}^{(t+1)},\mathbf{y}^{(t+1)})\). Therefore, the non-increasing property holds with \(f({{\varvec{\Theta }}}^{(t)},\mathbf{y}^{(t)})\ge f({{\varvec{\Theta }}}^{(t+1)},\mathbf{y}^{(t)})\ge f({{\varvec{\Theta }}}^{(t+1)},\mathbf{y}^{(t+1)})\).

Note that the proposed M3PL approach coincides with the existing maximum margin formulation (Nguyen and Caruana 2008) when the size of each candidate label set shrinks to 1. Correspondingly, iterative optimization has also been employed by a number of partial label learning approaches for disambiguating the candidate label set (Jin and Ghahramani 2003; Nguyen and Caruana 2008; Liu and Dietterich 2012; Chen et al. 2014). As shown in Eq. (4), the large constant M ensures that OP 6 (or equivalently OP 5) assigns labels only from the candidate label set. This restriction guarantees the validity of the ground-truth label assignments \(\mathbf{y}\), which are then fixed as constants for solving OP 3.

4 Experiment

4.1 Experimental setup

In this section, two series of experiments are conducted to evaluate the performance of M3PL, with one series on controlled UCI datasets (Bache and Lichman 2013) and the other on real-world partial label datasets. Table 1 summarizes characteristics of the employed datasets.

Table 1 Characteristics of the experimental datasets

Following the widely-used controlling protocol over multi-class UCI datasets (Cour et al. 2011; Chen et al. 2014; Liu and Dietterich 2012; Zhang 2014), an artificial partial label dataset can be generated under different configurations of three controlling parameters p, r and \(\epsilon \). Here, p controls the proportion of examples which are partially labeled (i.e. \(|S_i|>1\)), r controls the number of false positive labels in the candidate label set (i.e. \(|S_i|=r+1\)), and \(\epsilon \) controls the co-occurring probability between one coupling candidate label and the ground-truth label. As shown in Table 1, a total of 28 (4 \(\times \) 7) configurations are considered for each of the six UCI datasets.

The real-world partial label datasets are collected from several task domains: facial age estimation, including FG-NET (Panis and Lanitis 2015); automatic face naming, including Lost (Cour et al. 2011), Soccer Player (Zeng et al. 2013) and Yahoo! News (Guillaumin et al. 2010); bird song classification, including BirdSong (Briggs et al. 2012); and object classification, including MSRCv2 (Liu and Dietterich 2012). For the task of facial age estimation, human faces with landmarks are represented as instances, while ages annotated by ten crowdsourced labelers together with the ground-truth age are regarded as candidate labels. For the task of automatic face naming, faces cropped from an image or video frame are represented as instances, while names extracted from the associated captions or subtitles are regarded as candidate labels. For the task of bird song classification, singing syllables of the birds are represented as instances, while bird species jointly singing during a 10-second period are regarded as candidate labels. For the task of object classification, image segments are represented as instances, while objects appearing within the same image are regarded as candidate labels. As shown in Table 1, the average number of candidate labels (avg. #CLs) for each real-world partial label dataset is also recorded.

Four well-established partial label learning approaches are employed for comparative studies, each implemented with parameter setup as suggested in the respective literature:

  • An existing maximum margin partial label learning approach named PL-SVM (Nguyen and Caruana 2008) [suggested setup: regularization parameter pool with \(\{10^{-3},\ldots ,10^{3}\}\)] as well as its kernelized version named PL-SVM-kernel [suggested setup: polynomial kernel and degree pool with \(\{1,\ldots ,5\}\)].

  • The k-nearest neighbor partial label learning approach named PL-KNN (Hüllermeier and Beringer 2006) [suggested setup: k=10].

  • The convex optimization partial label learning approach named CLPL (Cour et al. 2011) [suggested setup: SVM with squared hinge loss].

  • The maximum likelihood partial label learning algorithm named LSB-CMM (Liu and Dietterich 2012) [suggested setup: q mixture components].

For PL-SVM and LSB-CMM, both algorithms conduct disambiguation by treating the ground-truth label as a latent variable to be iteratively refined. Specifically, PL-SVM (and its kernelized version) works by maximizing the margin between the largest output from candidate labels and that from non-candidate labels, while LSB-CMM works by maximizing the likelihood function over partial label training examples with EM-based optimization over a conditional multinomial model. For PL-KNN and CLPL, both algorithms conduct disambiguation by treating all candidate labels equally and aggregating their contributions. Specifically, PL-KNN works by voting among the candidate labels of each neighboring example, whose voting weight is inversely proportional to its distance from the test instance, while CLPL works by transforming the original partial label learning problem into a binary learning problem which is then solved by conventional SVM classification.

For M3PL, the parameter \(C_\mathrm{max}\) is chosen from \(\{10^{-2},\ldots ,10^2\}\) via cross-validation. In addition, a Gaussian kernel with width parameter \(\frac{1}{d}\) is used to instantiate M3PL-kernel. Ten-fold cross-validation is performed on each artificial and real-world dataset, where the mean predictive accuracies and standard deviations are recorded for all comparing approaches.

4.2 Experimental result

4.2.1 Controlled UCI datasets

Figures 1, 2 and 3 illustrate the classification accuracy of each comparing algorithm as p increases from 0.1 to 0.7 with step-size 0.1 (\(r=1,2,3\)). For any partial label example, its candidate label set contains the ground-truth label along with r additional labels randomly chosen from \({\mathcal {Y}}\). Figure 4 illustrates the classification accuracy of each comparing algorithm as \(\epsilon \) increases from 0.1 to 0.7 with step-size 0.1 (\(p=1,r=1\)). For any label \(y\in {\mathcal {Y}}\), one extra label \(y'\in {\mathcal {Y}}\) is designated as the coupling label, which co-occurs with y in the candidate label set with probability \(\epsilon \); otherwise, one of the remaining class labels is randomly chosen to co-occur with y.
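The generation protocol just described can be sketched as follows (0-indexed labels; the `eps`/`coupling` branch implements the Fig. 4 setting, names illustrative):

```python
import random

def make_candidate_set(y, q, r=1, eps=None, coupling=None):
    """Candidate set for ground-truth label y: with the coupling protocol,
    the designated coupling label joins with probability eps; any remaining
    slots are filled with labels drawn at random from the q classes."""
    S = {y}
    if eps is not None and coupling is not None and random.random() < eps:
        S.add(coupling[y])
    while len(S) < r + 1:
        S.add(random.randrange(q))
    return S

S = make_candidate_set(2, q=6, r=2)                       # ground truth + 2 extras
S_coupled = make_candidate_set(0, q=6, eps=1.0, coupling={0: 3})
```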

Fig. 1 Classification accuracy of each comparing algorithm as p (proportion of partially labeled examples) increases from 0.1 to 0.7 (\(r=1\)). a Glass, b ecoli, c dermatology, d vehicle, e segment, f satimage

Fig. 2 Classification accuracy of each comparing algorithm as p (proportion of partially labeled examples) increases from 0.1 to 0.7 (\(r=2\)). a Glass, b ecoli, c dermatology, d vehicle, e segment, f satimage

Fig. 3 Classification accuracy of each comparing algorithm as p (proportion of partially labeled examples) increases from 0.1 to 0.7 (\(r=3\)). a Glass, b ecoli, c dermatology, d vehicle, e segment, f satimage

Fig. 4 Classification accuracy of each comparing algorithm as \(\epsilon \) (co-occurring probability of the coupling label) increases from 0.1 to 0.7 (\(p=1\), \(r=1\)). a Glass, b ecoli, c dermatology, d vehicle, e segment, f satimage

As shown in Figs. 1, 2, 3 and 4, in most cases, M3PL and its kernelized version achieve competitive performance against the comparing algorithms. Based on pairwise t test at 0.05 significance level, Tables 2 and 3 summarize the win/tie/loss counts of M3PL and M3PL-kernel against the comparing algorithms respectively. Out of the 168 statistical comparisons (28 configurations \(\times \) 6 datasets), the following observations can be made:

  • Compared to its maximum margin counterpart PL-SVM (Nguyen and Caruana 2008), M3PL achieves superior or at least comparable performance in 77.9% of the cases. Although M3PL does not perform favorably against PL-SVM-kernel, its kernelized version M3PL-kernel achieves superior or at least comparable performance against PL-SVM and PL-SVM-kernel in 74.4% and 85.7% of the cases respectively. These results indicate the advantage of the proposed formulation over the existing maximum margin partial label formulation;

  • Compared to PL-KNN (Hüllermeier and Beringer 2006), CLPL (Cour et al. 2011) and LSB-CMM (Liu and Dietterich 2012), M3PL achieves superior or at least comparable performance in 79.8%, 85.1% and 61.3% of the cases respectively, while M3PL-kernel achieves superior or at least comparable performance in 65.4%, 81.5% and 82.1% of the cases respectively. These results validate the ability of M3PL to achieve state-of-the-art generalization performance for the partial label learning problem.
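
The win/tie/loss statistics above come from pairwise t tests over repeated runs. A minimal sketch of such a tally is given below, assuming ten trials per configuration (so 9 degrees of freedom; the hard-coded 2.262 is the two-tailed critical value at the 0.05 level for that case) and hypothetical function names:

```python
import math

def win_tie_loss(acc_a, acc_b, t_crit=2.262):
    """Paired t-test on per-trial accuracies of two algorithms over the same
    splits; returns 'win', 'tie' or 'loss' from algorithm A's perspective.

    t_crit defaults to the two-tailed 0.05 critical value for 9 degrees of
    freedom, i.e. ten paired trials.
    """
    n = len(acc_a)
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:  # identical differences: decide by sign of the mean
        return "tie" if mean == 0 else ("win" if mean > 0 else "loss")
    t = mean / math.sqrt(var / n)
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"
```

Summing these outcomes over all configurations and datasets yields the win/tie/loss counts of the kind reported in Tables 2 and 3.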

Table 2 Win/tie/loss counts (pairwise t test at 0.05 significance level) on the classification performance of M3PL against each comparing algorithm
Table 3 Win/tie/loss counts (pairwise t test at 0.05 significance level) on the classification performance of M3PL-kernel against each comparing algorithm

It is worth noting that on some controlled UCI datasets (e.g. segment and satimage in Fig. 1), the performance of CLPL is much inferior to that of the comparing algorithms. One potential reason lies in the procedure employed by CLPL to transform the partial label learning problem into a binary learning problem. Specifically, each partial label training example \((\mathbf{x}_i,S_i)\in {\mathcal {D}}\) is transformed into one positive example by aggregating all candidate labels, and \(q-|S_i|\) negative examples, one for each non-candidate label. For the resulting binary training set, the ratio between the number of negative and positive examples is \(q-q'\), where \(q'=\frac{\sum _{i=1}^m |S_i|}{m}\) is the average number of candidate labels in \({\mathcal {D}}\). The corresponding binary learning problem becomes highly class-imbalanced when q is much larger than \(q'\), which can lead to performance deterioration for the binary learning algorithm.
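
The negative-to-positive ratio described above follows directly from the candidate set sizes; a small sketch (function name hypothetical):

```python
def clpl_imbalance_ratio(candidate_sets, q):
    """Negative-to-positive ratio of the binary training set produced by
    CLPL's transformation: each example yields one positive example and
    q - |S_i| negatives, so the ratio equals q - q', with q' the average
    candidate-set size.
    """
    m = len(candidate_sets)
    q_prime = sum(len(S) for S in candidate_sets) / m
    return q - q_prime
```

For the 7-class segment dataset with \(r=1\) (candidate sets of size 2), the ratio is already 5:1, and it grows with q; this is consistent with the degradation observed on the datasets with larger label spaces.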

In addition, the multi-class SVM (MSVM) trained with ground-truth labels and its kernelized version (MSVM-kernel) are also employed for comparative studies; they serve as upper-bound baselines for the partial label learning algorithms.Footnote 7 As shown in Table 2, the performance of M3PL is comparable to MSVM and MSVM-kernel in 39.8% and 22.6% of the cases respectively, and inferior to them in the rest. As shown in Table 3, it is interesting that the performance of M3PL-kernel is even superior to MSVM in a few (12.5%) of the cases, and comparable or inferior to MSVM and MSVM-kernel in the rest.

4.2.2 Real-world datasets

Table 4 reports the performance of each comparing algorithm on the real-world partial label datasets. Based on the results of ten-fold cross-validation, pairwise t tests at 0.05 significance level between M3PL and the comparing algorithms are recorded as well. Note that the average number of candidate labels (avg. #CLs) for the FG-NET dataset (i.e. 7.48, as shown in Table 1) is quite large, which makes the task of facial age estimation (based on training examples with partial labels) rather challenging. Indeed, the state-of-the-art performance on this dataset (based on training examples with ground-truth labels) corresponds to a mean absolute error (MAE) of more than 3 years between the predicted age and the ground-truth age (Panis and Lanitis 2015). Therefore, one extra classification accuracy is reported for FG-NET in Table 4, where an unseen example is regarded as correctly classified if the difference between the predicted age and the ground-truth age is less than 3 years (MAE3).
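
The MAE3 criterion just described amounts to a tolerance-based accuracy; a minimal sketch (names hypothetical):

```python
def mae_tolerant_accuracy(pred_ages, true_ages, tol=3):
    """Fraction of examples whose predicted age deviates from the
    ground-truth age by less than tol years (the MAE3 criterion at tol=3)."""
    hits = sum(abs(p - t) < tol for p, t in zip(pred_ages, true_ages))
    return hits / len(pred_ages)
```

With `tol=1` this reduces to ordinary exact-match accuracy on integer-valued ages, which is the other figure reported for FG-NET.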

Table 4 Classification accuracy (mean±std) of each comparing algorithm on the real-world partial label datasets

As shown in Table 4, it is impressive to observe that:

  • M3PL significantly outperforms its maximum margin counterpart on all real-world datasets except FG-NET and its MAE3 variant, on which both algorithms achieve comparable performance. In terms of the kernelized version, M3PL-kernel significantly outperforms PL-SVM-kernel on the MSRCv2, BirdSong, Soccer Player and Yahoo! News datasets, and performs comparably on the rest of the datasets;

  • Both M3PL and its kernelized version significantly outperform PL-KNN on all datasets, except on Soccer Player where the performance of M3PL is inferior to that of PL-KNN;

  • The performance of M3PL is inferior only to CLPL on FG-NET (MAE3) and to LSB-CMM on Soccer Player, while the performance of M3PL-kernel is inferior only to CLPL on FG-NET (MAE3). In all other cases, both M3PL and M3PL-kernel achieve superior or at least comparable performance against CLPL and LSB-CMM.

In addition, both M3PL and its kernelized version achieve comparable performance to MSVM and MSVM-kernel on FG-NET and its MAE3 variant, and are inferior to them in the rest of the cases. It is worth noting that, for both M3PL and PL-SVM, although employing the kernel trick is expected to improve performance, there are still cases where the kernelized version achieves lower classification accuracy. These observations indicate the necessity of careful kernel function selection when learning from partial label examples.

In addition to inductive performance on unseen examples, it is also interesting to study the transductive performance of each comparing algorithm on classifying training examples (Cour et al. 2011). Here, for each training example \((\mathbf{x}_i,S_i)\), its ground-truth label is predicted by consulting the candidate label set, i.e. \(y_i=\arg \max _{y\in S_i} F(\mathbf{x}_i,y;{{\varvec{\Theta }}})\). In other words, the transductive performance of a partial label learning algorithm reflects its disambiguation ability, i.e. how well it recovers the ground-truth labeling information from the candidate label set. Similar to Table 4, Table 5 reports the transductive accuracy of each comparing algorithm together with the outcomes of pairwise t tests at 0.05 significance level. Furthermore, once the training procedure of M3PL terminates, the identified ground-truth label assignment \(\mathbf{y}\) can also be used as the disambiguation prediction on the training examples. The resulting transductive performance is reported in Table 5 as well (denoted as M3PL\(^\dag \) and M3PL-kernel\(^\dag \)) for reference purposes.
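
The transductive prediction rule simply restricts the argmax to the candidate label set rather than the full label space; a minimal sketch with a stand-in scoring function (all names hypothetical):

```python
def transductive_predict(x, candidates, F, theta):
    """Disambiguate one training example by picking the candidate label with
    the highest modeling output, i.e. argmax over y in S_i of F(x_i, y; theta)."""
    return max(candidates, key=lambda y: F(x, y, theta))

# Usage with a toy scoring function whose outputs are read directly from theta.
# The unrestricted argmax would be label 1, but label 1 is not a candidate here,
# so the prediction falls to the best-scoring candidate, label 3:
F = lambda x, y, theta: theta[y]
theta = [0.1, 0.9, 0.4, 0.6]
label = transductive_predict(None, {0, 2, 3}, F, theta)  # label == 3
```

Note that this differs from the inductive prediction rule, which takes the argmax over all of \({\mathcal {Y}}\).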

Table 5 Transductive accuracy (mean±std) of each comparing algorithm on the real-world partial label datasets

As shown in Table 5, M3PL significantly outperforms all the other comparing algorithms on the MSRCv2, BirdSong and Soccer Player datasets, and achieves superior or at least comparable performance on the Lost and Yahoo! News datasets. Although the transductive performance of M3PL is not satisfactory on FG-NET and its MAE3 variant, its inductive performance there is competitive with the comparing algorithms in most cases. On the other hand, out of the 35 statistical tests (7 datasets \(\times \) 5 comparing algorithms), the transductive performance of M3PL-kernel\(^\dag \) is superior to the comparing algorithms in 9 cases, comparable in 17 cases, and inferior in 9 cases. As expected, M3PL and M3PL\(^\dag \) (and likewise M3PL-kernel and M3PL-kernel\(^\dag \)) show similar transductive performance on each real-world dataset.

5 Conclusion

This paper extends our earlier research on maximum margin partial label learning (Yu and Zhang 2015), in which a new formulation of the maximum margin criterion is proposed for learning from partial label examples. Specifically, the canonical multi-class margin is directly optimized by the proposed M3PL approach via an alternating optimization procedure. Comprehensive comparative studies on controlled as well as real-world partial label datasets clearly validate the effectiveness of M3PL.

In the future, it would be interesting to investigate ways other than alternating optimization to solve the proposed maximum margin formulation OP 2. Furthermore, domain knowledge (e.g. the ordinal relationship among class labels) could be incorporated into partial label learning algorithms to improve their performance on specific tasks such as facial age estimation.