
Metric learning-guided k nearest neighbor multilabel classifier

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Multilabel classification deals with the problem where each instance may belong to multiple labels simultaneously. The algorithm based on large margin loss with k nearest neighbor constraints (LM-kNN) is one of the most prominent multilabel classification algorithms. However, due to its use of the squared hinge loss, LM-kNN must iteratively solve a constrained quadratic program at high computational cost. To address this issue, we propose a novel metric learning-guided k nearest neighbor approach (MLG-kNN) for multilabel classification. Specifically, we first transform the original instances into the label space by least squares regression. Then, we learn a metric matrix in the label space that makes the predictions for an instance in the learned metric space close to its true label values and far from the others. Since MLG-kNN can be formulated as an unconstrained, strictly (geodesically) convex optimization problem and yields a closed-form solution, its computational complexity is reduced. An analysis of the generalization error bound indicates that MLG-kNN converges to the optimal solution. Experimental results verify that the proposed approach is more effective than existing ones for multilabel classification across many benchmark datasets.
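The following is a minimal, illustrative sketch of the pipeline described in the abstract, written in Python/NumPy under stated assumptions: the projection W is obtained by a ridge-regularized least-squares fit (the regularizer lam is assumed), the learned SPD metric matrix M is taken as given rather than computed by the paper's closed-form metric learning step, and the averaging-and-thresholding label rule is a placeholder, not the authors' implementation.

```python
import numpy as np

def fit_projection(X, Y, lam=1.0):
    """Least-squares regression of labels on features: W maps instances into the label space.
    A ridge-regularized sketch; lam is an assumed hyperparameter."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # d x c

def project(X, W, M):
    """Map instances into the learned metric space: rows are (P^T W^T x)^T with M = P P^T."""
    P = np.linalg.cholesky(M)     # valid because M is symmetric positive definite
    return X @ W @ P              # n x c

def knn_predict(X_train, Y_train, X_test, W, M, k=10):
    """Predict labels from the k nearest neighbors in the projected space."""
    Z_train, Z_test = project(X_train, W, M), project(X_test, W, M)
    preds = np.empty((X_test.shape[0], Y_train.shape[1]), dtype=int)
    for i, z in enumerate(Z_test):
        dist = np.linalg.norm(Z_train - z, axis=1)    # projected distances d_pro
        nn = np.argsort(dist)[:k]                     # indices of the k nearest training points
        preds[i] = (Y_train[nn].mean(axis=0) >= 0.5).astype(int)  # placeholder thresholding rule
    return preds
```

Here X is an n x d feature matrix and Y an n x c binary label matrix; the sketch only illustrates how the learned W and M enter the k nearest neighbor prediction.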

Notes

  1. http://palm.seu.edu.cn/zhangml/.

  2. http://mulan.sourceforge.net/datasets-mlc.html.

References

  1. Tsoumakas G, Katakis I, Vlahavas I (2009) Mining multi-label data. In: Data mining and knowledge discovery handbook, Springer, pp 667–685

  2. Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837

  3. Zheng W, Qian Y, Lu H (2013) Text categorization based on regularization extreme learning machine. Neural Comput Appl 22(3):447–456

  4. Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771

  5. Barutcuoglu Z, Schapire R, Troyanskaya O (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22(7):830–836

  6. Montañes E, Senge R, Barranquero J, Quevedo JR, del Coz JJ, Hüllermeier E (2014) Dependent binary relevance models for multi-label classification. Pattern Recognit 47(3):1494–1508

  7. Hsu D, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: International conference on neural information processing systems, pp 772–780

  8. Tai F, Lin HT (2012) Multilabel classification with principal label space transformation. Neural Comput 24(9):2508

  9. Zhang Y, Schneider J (2011) Multi-label output codes using canonical correlation analysis. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 873–882

  10. Zhang Y, Schneider J (2012) Maximum margin output coding. In: International conference on machine learning, pp 379–386

  11. Zhang ML, Zhou ZH (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048

  12. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inform Theory 13(1):21–27

  13. Liu W, Tsang IW (2015) Large margin metric learning for multi-label prediction. In: AAAI conference on artificial intelligence, pp 2800–2806

  14. Liu W, Xu D, Tsang IW, Zhang W (2018) Metric learning for multi-output tasks. IEEE Trans Pattern Anal Mach Intell 41(2):408–422

  15. Zadeh P, Hosseini R, Sra S (2016) Geometric mean metric learning. In: International conference on machine learning, pp 2464–2471

  16. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning, ACM, pp 209–216

  17. Arsigny V, Fillard P, Pennec X, Ayache N (2007) Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J Matrix Anal Appl 29(1):328–347

  18. Bhatia R (2009) Positive definite matrices, vol 24. Princeton University Press, Princeton

  19. Iannazzo B (2016) The geometric mean of two matrices from a computational viewpoint. Numer Linear Algebra Appl 23(2):208–229

  20. Kontorovich A, Weiss R (2014) Maximum margin multiclass nearest neighbors. In: International conference on machine learning, pp 892–900. arXiv:1401.7898

  21. Krauthgamer R, Lee JR (2004) Navigating nets: simple algorithms for proximity search. In: ACM-SIAM symposium on discrete algorithms, pp 798–807

  22. Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge

  23. Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, Vlahavas I (2011) Mulan: a java library for multi-label learning. J Mach Learn Res 12:2411–2414

  24. Datta BN (2010) Numerical linear algebra and applications, vol 116. SIAM, Philadelphia

  25. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  26. Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39(2):135–168

  27. Luaces O, Diez J, Barranquero J, Coz JJD, Bahamonde A (2012) Binary relevance efficacy for multilabel classification. Prog Artif Intell 1(4):303–313

  28. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  29. Li Y, Long PM (2006) Learnability and the doubling dimension. In: Advances in neural information processing systems, pp 889–896

  30. Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61772020).

Author information

Corresponding author

Correspondence to Shuisheng Zhou.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest regarding this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proofs of the main theorems

Before giving the proofs, we first introduce several related definitions and lemmas.

Definition 1

[20, 21] Given a metric space \(({\mathcal {X}}, d)\), let m be the smallest value such that every ball in \({\mathcal {X}}\) can be covered by \(2^m\) balls of half the radius. The doubling dimension of \({\mathcal {X}}\) is defined as \(ddim({\mathcal {X}})=m\).

Lemma 1

[29] Given a metric space \(({\mathcal {X}}, d )\), an \(\varepsilon\)-cover for \({\mathcal {X}}\) is a set \({\mathcal {B}}\subseteq {\mathcal {X}}\) such that every element of \({\mathcal {X}}\) has a counterpart in \({\mathcal {B}}\) that is at a distance at most \(\varepsilon\) (with respect to d). Denote the size of the smallest \(\varepsilon\)-cover by \(N(\varepsilon , {\mathcal {X}}, d)\), then we have

$$\begin{aligned} N(\varepsilon , {\mathcal {X}}, d)\le \left( \frac{2diam({\mathcal {X}})}{\varepsilon }\right) ^{ddim({\mathcal {X}})} \end{aligned}$$
(20)

where \(diam({\mathcal {X}})=\mathop {\sup }\limits _{\varvec{x},\varvec{x}^{'}\in {\mathcal {X}}}d(\varvec{x},\varvec{x}^{'})\).
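As a simple illustration (not part of the original lemma), take \({\mathcal {X}}=[0,1]\) with the Euclidean distance, so that \(diam({\mathcal {X}})=1\) and \(ddim({\mathcal {X}})=1\); the bound then reads

$$\begin{aligned} N(\varepsilon , [0,1], |\cdot |)\le \frac{2}{\varepsilon }, \end{aligned}$$

which is consistent with the fact that \(\lceil 1/(2\varepsilon )\rceil\) intervals of radius \(\varepsilon\) suffice to cover \([0,1]\).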

1.2 Proof of Theorem 1

Proof

From the linearity of expectation and the definition of \(L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\), we can rewrite:

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \sum _{q=1}^cL(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&=\sum _{q=1}^c\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right]. \end{aligned} \end{aligned}$$
(21)

Let’s start with \({\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\). For any two examples \((\varvec{x}, \varvec{y})\) and \((\varvec{x}^{'},\varvec{y}^{'})\) drawn from the distribution \({\mathcal {D}}\), we have

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}[\varvec{y}(q)\ne \varvec{y}^{'}(q)|\varvec{x},\varvec{x}^{'}]\\&=\eta _q(\varvec{x})(1-\eta _q(\varvec{x}^{'}))+(1-\eta _q(\varvec{x}))\eta _q(\varvec{x}^{'})\\&=\eta _q(\varvec{x})(1-\eta _q(\varvec{x})+\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'}))\\&\quad+(1-\eta _q(\varvec{x}))(\eta _q(\varvec{x})-\eta _q(\varvec{x})+\eta _q(\varvec{x}^{'}))\\&=2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))+(2\eta _q(\varvec{x})-1)(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'}))\\&\quad\le 2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))+(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'})). \end{aligned} \end{aligned}$$
(22)

Since \(\eta _q\) is l-Lipschitz, we have \(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'})\le ld_{pro}(\varvec{x},\varvec{x}^{'})\). Since \(\mathbf{M }_{\alpha }\) is a symmetric positive definite matrix, it can be factorized as \(\mathbf{M }_{\alpha }=\mathbf{P }\mathbf{P }^{\top }\) [30]. Then, we have

$$\begin{aligned} \begin{aligned} d_{pro}(\varvec{x},\varvec{x}^{'})&=\Vert \mathbf{P }^{\top }\mathbf{W }^{\top }\varvec{x}-\mathbf{P }^{\top }\mathbf{W }^{\top }\varvec{x}^{'}\Vert _2\\&\quad\le \Vert \mathbf{W }\Vert _F\Vert \mathbf{P }\Vert _F\Vert \varvec{x}-\varvec{x}^{'}\Vert _2\\&=\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{P }\mathbf{P }^{\top }))^{1/2}\Vert \varvec{x}-\varvec{x}^{'}\Vert _2\\&=\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\Vert \varvec{x}-\varvec{x}^{'}\Vert _2. \end{aligned} \end{aligned}$$
(23)

Actually, the regularization term in MLG-kNN ensures that \(\mathrm {tr}(\mathbf{M }_{\alpha })\) is bounded.
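As an aside, the projected distance in (23) can be computed directly from a Cholesky factor of \(\mathbf{M }_{\alpha }\). The following is a small illustrative sketch in Python/NumPy, not the authors' implementation:

```python
import numpy as np

def d_pro(x, x_prime, W, M_alpha):
    """Projected distance ||P^T W^T x - P^T W^T x'||_2 with M_alpha = P P^T."""
    P = np.linalg.cholesky(M_alpha)   # exists because M_alpha is symmetric positive definite
    diff = W.T @ (x - x_prime)        # difference mapped into the label space
    return np.linalg.norm(P.T @ diff)
```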

Let \((\varvec{x}^{'},\varvec{y}^{'})\) be the nearest neighbor of \((\varvec{x},\varvec{y})\) in \({\mathbb {S}}\) with respect to \(d_{pro}\). Then, \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right]\) is equivalent to \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne \varvec{y}^{'}(q)]\right]\). By combining the preceding with (22), we have

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right] \\&\quad\le \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ 2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))\right] \\&\quad+l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2} \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}} \left[ d(\varvec{x}, \varvec{x}^{'})\right]. \end{aligned} \end{aligned}$$
(24)

The loss of the Bayes optimal classifier for the q-th label is

$$\begin{aligned} \begin{aligned} L(\varvec{y}(q),h^*_q(\varvec{x}))&=\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \min \{\eta _q(\varvec{x}), 1-\eta _q(\varvec{x})\}\right] \\&\quad\ge \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \eta _q(\varvec{x})(1-\eta _q(\varvec{x}))\right]. \end{aligned} \end{aligned}$$
(25)

Combining (24) and (25) concludes our proof. \(\square\)
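Explicitly, combining (24) with (25) (the latter implying \(\mathop {{\mathbb {E}}}\nolimits \left[ 2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))\right] \le 2L(\varvec{y}(q),h^*_q(\varvec{x}))\)) gives, for each label q,

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right] \\&\quad\le 2L(\varvec{y}(q),h^*_q(\varvec{x}))+l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ d(\varvec{x},\varvec{x}^{'})\right] , \end{aligned} \end{aligned}$$

which, summed over the c labels, is presumably the bound stated in Theorem 1 (the theorem statements are not reproduced in this appendix).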

Lemma 2

[14] Let \(({\mathcal {X}}, d)\) be a metric space with \(diam({\mathcal {X}})=1\) and \(ddim({\mathcal {X}})=m<\infty\). For \(\varvec{x}\in {\mathcal {X}}\), let \(\varvec{x}^{'}\) denote its nearest neighbor in the sample \({\mathbb {S}}\) with respect to d. Then we have

$$\begin{aligned} \begin{aligned} \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ d(\varvec{x}, \varvec{x}^{'})\right] \le \frac{3}{n^{1/(m+1)}}. \end{aligned} \end{aligned}$$
(26)

1.3 Proof of Theorem 2

Proof

Combining Theorem 1 with Lemma 2 concludes our proof. \(\square\)

1.4 Proof of Theorem 3

Proof

We adapt the proof of Theorem 19.5 in [22] to prove Theorem 3.

Let’s first consider \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right]\) for the q-th label. For the training set \({\mathbb {S}}=\{(\varvec{x}_1,\varvec{y}_1),\ldots ,(\varvec{x}_n,\varvec{y}_n)\}\) and any \(\varvec{x}\in {\mathcal {X}}\), let \(\pi _1(\varvec{x}), \ldots , \pi _n(\varvec{x})\) be a reordering of \(\{1, \ldots , n\}\) according to the distances \(d_{pro}(\varvec{x},\varvec{x}_i)\) to \(\varvec{x}\). That is, for all \(i<n\), \(d_{pro}(\varvec{x}, \varvec{x}_{\pi _i(\varvec{x})})\le d_{pro}(\varvec{x}, \varvec{x}_{\pi _{i+1}(\varvec{x})})\). Let \({\mathcal {B}}=\{B_1,\ldots , B_N\}\) be an \(\varepsilon\)-cover of \({\mathcal {X}}\) of cardinality N. From Eq. (19.3) in [22], we have

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&\quad\le \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n}\left[ \sum _{i:|B_i\bigcap {\mathbb {S}}|<k}{\mathbb {P}}[B_i]\right] \\&\qquad+\max _i\mathop {{\mathbb {P}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})\,\big |\, \forall j\in [k],\ d_{pro}(\varvec{x}, \varvec{x}_{\pi _j(\varvec{x})})\le \varepsilon \Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\right]. \end{aligned} \end{aligned}$$
(27)

According to [22, Lemma 19.6], the first term on the right-hand side of (27) is bounded by \(\frac{2Nk}{en}\). The second term on the right-hand side of (27) is bounded by \(\left(1+\sqrt{\tfrac{8}{k}}\right)L(\varvec{y}(q),h^*_q(\varvec{x}))+3l\varepsilon \Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\). Finally, setting \(\varepsilon =2n^{-\tfrac{1}{m+1}}\) and applying Lemma 1, we have

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&\quad\le \left( 1+\sqrt{\frac{8}{k}}\right) L(\varvec{y}(q),h^*_q(\varvec{x})) +\frac{(6l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}+k)}{n^{1/(m+1)}}. \end{aligned} \end{aligned}$$
(28)

Applying Eq. (28) to each label and summing over the c labels completes the proof. \(\square\)
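For completeness, summing (28) over the c labels gives the multilabel form of the bound:

$$\begin{aligned} \begin{aligned}&\sum _{q=1}^c\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&\quad\le \left( 1+\sqrt{\frac{8}{k}}\right) \sum _{q=1}^cL(\varvec{y}(q),h^*_q(\varvec{x}))+\frac{c\left( 6l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}+k\right) }{n^{1/(m+1)}}. \end{aligned} \end{aligned}$$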

About this article

Cite this article

Ma, J., Zhou, S. Metric learning-guided k nearest neighbor multilabel classifier. Neural Comput & Applic 33, 2411–2425 (2021). https://doi.org/10.1007/s00521-020-05134-9
