
Metric learning-guided k nearest neighbor multilabel classifier

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Multilabel classification deals with the problem where each instance may belong to multiple labels simultaneously. The algorithm based on large margin loss with k nearest neighbor constraints (LM-kNN) is one of the most prominent multilabel classification algorithms. However, due to its use of the squared hinge loss, LM-kNN must iteratively solve a constrained quadratic program at high computational cost. To address this issue, we propose a novel metric learning-guided k nearest neighbor approach (MLG-kNN) for multilabel classification. Specifically, we first transform the original instances into the label space by least squares regression. Then, we learn a metric matrix in the label space that makes the predictions for an instance in the learned metric space close to its true label values and far from the others. Since MLG-kNN can be formulated as an unconstrained, strictly (geodesically) convex optimization problem and yields a closed-form solution, its computational complexity is reduced. An analysis of the generalization error bound indicates that MLG-kNN converges to the optimal solution. Experimental results verify that the proposed approach is more effective than existing ones for multilabel classification across many benchmark datasets.
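The following is a minimal, illustrative sketch of the pipeline described in the abstract, written in Python/NumPy under stated assumptions: the projection W is obtained by a ridge-regularized least-squares fit (the regularizer lam is assumed), the learned SPD metric matrix M is taken as given rather than computed by the paper's closed-form metric learning step, and the averaging-and-thresholding label rule is a placeholder, not the authors' implementation.

```python
import numpy as np

def fit_projection(X, Y, lam=1.0):
    """Least-squares regression of labels on features: W maps instances into the label space.
    A ridge-regularized sketch; lam is an assumed hyperparameter."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # d x c

def project(X, W, M):
    """Map instances into the learned metric space: rows are (P^T W^T x)^T with M = P P^T."""
    P = np.linalg.cholesky(M)     # valid because M is symmetric positive definite
    return X @ W @ P              # n x c

def knn_predict(X_train, Y_train, X_test, W, M, k=10):
    """Predict labels from the k nearest neighbors in the projected space."""
    Z_train, Z_test = project(X_train, W, M), project(X_test, W, M)
    preds = np.empty((X_test.shape[0], Y_train.shape[1]), dtype=int)
    for i, z in enumerate(Z_test):
        dist = np.linalg.norm(Z_train - z, axis=1)    # projected distances d_pro
        nn = np.argsort(dist)[:k]                     # indices of the k nearest training points
        preds[i] = (Y_train[nn].mean(axis=0) >= 0.5).astype(int)  # placeholder thresholding rule
    return preds
```

Here X is an n x d feature matrix and Y an n x c binary label matrix; the sketch only illustrates how the learned W and M enter the k nearest neighbor prediction.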

Notes

  1. http://palm.seu.edu.cn/zhangml/.

  2. http://mulan.sourceforge.net/datasets-mlc.html.

References

  1. Tsoumakas G, Katakis I, Vlahavas I (2009) Mining multi-label data. In: Data mining and knowledge discovery handbook, Springer, pp 667–685

  2. Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837

  3. Zheng W, Qian Y, Lu H (2013) Text categorization based on regularization extreme learning machine. Neural Comput Appl 22(3):447–456

  4. Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771

  5. Barutcuoglu Z, Schapire R, Troyanskaya O (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22(7):830–836

  6. Montañes E, Senge R, Barranquero J, Quevedo JR, del Coz JJ, Hüllermeier E (2014) Dependent binary relevance models for multi-label classification. Pattern Recognit 47(3):1494–1508

  7. Hsu D, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: International conference on neural information processing systems, pp 772–780

  8. Tai F, Lin HT (2012) Multilabel classification with principal label space transformation. Neural Comput 24(9):2508

  9. Zhang Y, Schneider J (2011) Multi-label output codes using canonical correlation analysis. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 873–882

  10. Zhang Y, Schneider J (2012) Maximum margin output coding. In: International conference on machine learning, pp 379–386

  11. Zhang ML, Zhou ZH (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048

  12. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inform Theory 13(1):21–27

  13. Liu W, Tsang IW (2015) Large margin metric learning for multi-label prediction. In: AAAI conference on artificial intelligence, pp 2800–2806

  14. Liu W, Xu D, Tsang IW, Zhang W (2018) Metric learning for multi-output tasks. IEEE Trans Pattern Anal Mach Intell 41(2):408–422

  15. Zadeh P, Hosseini R, Sra S (2016) Geometric mean metric learning. In: International conference on machine learning, pp 2464–2471

  16. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning, ACM, pp 209–216

  17. Arsigny V, Fillard P, Pennec X, Ayache N (2007) Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J Matrix Anal Appl 29(1):328–347

  18. Bhatia R (2009) Positive definite matrices, vol 24. Princeton University Press, Princeton

  19. Iannazzo B (2016) The geometric mean of two matrices from a computational viewpoint. Numer Linear Algebra Appl 23(2):208–229

  20. Kontorovich A, Weiss R (2014) Maximum margin multiclass nearest neighbors. In: International conference on machine learning, pp 892–900. arXiv:1401.7898

  21. Krauthgamer R, Lee JR (2004) Navigating nets: simple algorithms for proximity search. In: ACM-SIAM symposium on discrete algorithms, pp 798–807

  22. Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge

  23. Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, Vlahavas I (2011) Mulan: a java library for multi-label learning. J Mach Learn Res 12:2411–2414

  24. Datta BN (2010) Numerical linear algebra and applications, vol 116. SIAM, Philadelphia

  25. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  26. Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39(2):135–168

  27. Luaces O, Diez J, Barranquero J, Coz JJD, Bahamonde A (2012) Binary relevance efficacy for multilabel classification. Prog Artif Intell 1(4):303–313

  28. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  29. Li Y, Long PM (2006) Learnability and the doubling dimension. In: Advances in neural information processing systems, pp 889–896

  30. Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61772020).

Author information

Corresponding author

Correspondence to Shuisheng Zhou.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest regarding this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proofs of the main theorems

Before giving the proofs, we first introduce several related definitions and lemmas.

Definition 1

[20, 21] Given a metric space \(({\mathcal {X}}, d)\), let m be the smallest value such that every ball in \({\mathcal {X}}\) can be covered by \(2^m\) balls of half the radius. The doubling dimension of \({\mathcal {X}}\) is defined as \(ddim({\mathcal {X}})=m\).

Lemma 1

[29] Given a metric space \(({\mathcal {X}}, d )\), an \(\varepsilon\)-cover for \({\mathcal {X}}\) is a set \({\mathcal {B}}\subseteq {\mathcal {X}}\) such that every element of \({\mathcal {X}}\) has a counterpart in \({\mathcal {B}}\) that is at a distance at most \(\varepsilon\) (with respect to d). Denote the size of the smallest \(\varepsilon\)-cover by \(N(\varepsilon , {\mathcal {X}}, d)\), then we have

$$\begin{aligned} N(\varepsilon , {\mathcal {X}}, d)\le \left( \frac{2diam({\mathcal {X}})}{\varepsilon }\right) ^{ddim({\mathcal {X}})} \end{aligned}$$
(20)

where \(diam({\mathcal {X}})=\mathop {\sup }\limits _{\varvec{x},\varvec{x}^{'}\in {\mathcal {X}}}d(\varvec{x},\varvec{x}^{'})\).
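As a simple illustration (not part of the original lemma), take \({\mathcal {X}}=[0,1]\) with the Euclidean distance, so that \(diam({\mathcal {X}})=1\) and \(ddim({\mathcal {X}})=1\); the bound then reads

$$\begin{aligned} N(\varepsilon , [0,1], |\cdot |)\le \frac{2}{\varepsilon }, \end{aligned}$$

which is consistent with the fact that \(\lceil 1/(2\varepsilon )\rceil\) intervals of radius \(\varepsilon\) suffice to cover \([0,1]\).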

1.2 Proof of Theorem 1

Proof

From the linearity of expectation and the definition of \(L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\), we can rewrite:

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \sum _{q=1}^cL(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&=\sum _{q=1}^c\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right]. \end{aligned} \end{aligned}$$
(21)

Let’s start with \({\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\). For any two examples \((\varvec{x}, \varvec{y})\) and \((\varvec{x}^{'},\varvec{y}^{'})\) drawn from the distribution \({\mathcal {D}}\), we have

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}[\varvec{y}(q)\ne \varvec{y}^{'}(q)|\varvec{x},\varvec{x}^{'}]\\&=\eta _q(\varvec{x})(1-\eta _q(\varvec{x}^{'}))+(1-\eta _q(\varvec{x}))\eta _q(\varvec{x}^{'})\\&=\eta _q(\varvec{x})(1-\eta _q(\varvec{x})+\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'}))\\&\quad+(1-\eta _q(\varvec{x}))(\eta _q(\varvec{x})-\eta _q(\varvec{x})+\eta _q(\varvec{x}^{'}))\\&=2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))+(2\eta _q(\varvec{x})-1)(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'}))\\&\quad\le 2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))+(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'})). \end{aligned} \end{aligned}$$
(22)

Since \(\eta _q\) is l-Lipschitz, we have \(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'})\le ld_{pro}(\varvec{x},\varvec{x}^{'})\). Since \(\mathbf{M }_{\alpha }\) is a symmetric positive definite matrix, it can be factorized as \(\mathbf{M }_{\alpha }=\mathbf{P }\mathbf{P }^{\top }\) [30]. Then, we have

$$\begin{aligned} \begin{aligned} d_{pro}(\varvec{x},\varvec{x}^{'})&=\Vert \mathbf{P }^{\top }\mathbf{W }^{\top }\varvec{x}-\mathbf{P }^{\top }\mathbf{W }^{\top }\varvec{x}^{'}\Vert _2\\&\quad\le \Vert \mathbf{W }\Vert _F\Vert \mathbf{P }\Vert _F\Vert \varvec{x}-\varvec{x}^{'}\Vert _2\\&=\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{P }\mathbf{P }^{\top }))^{1/2}\Vert \varvec{x}-\varvec{x}^{'}\Vert _2\\&=\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\Vert \varvec{x}-\varvec{x}^{'}\Vert _2. \end{aligned} \end{aligned}$$
(23)

Actually, the regularization term in MLG-kNN ensures that \(\mathrm {tr}(\mathbf{M }_{\alpha })\) is bounded.
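As an aside, the projected distance in (23) can be computed directly from a Cholesky factor of \(\mathbf{M }_{\alpha }\). The following is a small illustrative sketch in Python/NumPy, not the authors' implementation:

```python
import numpy as np

def d_pro(x, x_prime, W, M_alpha):
    """Projected distance ||P^T W^T x - P^T W^T x'||_2 with M_alpha = P P^T."""
    P = np.linalg.cholesky(M_alpha)   # exists because M_alpha is symmetric positive definite
    diff = W.T @ (x - x_prime)        # difference mapped into the label space
    return np.linalg.norm(P.T @ diff)
```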

Let \((\varvec{x}^{'},\varvec{y}^{'})\) be the nearest neighbor of \((\varvec{x},\varvec{y})\) in \({\mathbb {S}}\) with respect to \(d_{pro}\). Then, \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right]\) is equivalent to \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne \varvec{y}^{'}(q)]\right]\). By combining the preceding with (22), we have

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right] \\&\quad\le \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ 2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))\right] \\&\quad+l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2} \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}} \left[ d(\varvec{x}, \varvec{x}^{'})\right]. \end{aligned} \end{aligned}$$
(24)

The loss of the Bayes optimal classifier for the q-th label is

$$\begin{aligned} \begin{aligned} L(\varvec{y}(q),h^*_q(\varvec{x}))&=\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \min \{\eta _q(\varvec{x}), 1-\eta _q(\varvec{x})\}\right] \\&\quad\ge \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \eta _q(\varvec{x})(1-\eta _q(\varvec{x}))\right]. \end{aligned} \end{aligned}$$
(25)

Combining (24) and (25) concludes our proof. \(\square\)
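Explicitly, combining (24) with (25) (the latter implying \(\mathop {{\mathbb {E}}}\nolimits \left[ 2\eta _q(\varvec{x})(1-\eta _q(\varvec{x}))\right] \le 2L(\varvec{y}(q),h^*_q(\varvec{x}))\)) gives, for each label q,

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right] \\&\quad\le 2L(\varvec{y}(q),h^*_q(\varvec{x}))+l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ d(\varvec{x},\varvec{x}^{'})\right] , \end{aligned} \end{aligned}$$

which, summed over the c labels, is presumably the bound stated in Theorem 1 (the theorem statements are not reproduced in this appendix).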

Lemma 2

[14] Let \(({\mathcal {X}}, d)\) be a metric space with \(diam({\mathcal {X}})=1\) and \(ddim({\mathcal {X}})=m<\infty\). For \(\varvec{x}\in {\mathcal {X}}\), let \(\varvec{x}^{'}\) denote its nearest neighbor in the sample \({\mathbb {S}}\) with respect to d. Then we have

$$\begin{aligned} \begin{aligned} \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ d(\varvec{x}, \varvec{x}^{'})\right] \le \frac{3}{n^{1/(m+1)}}. \end{aligned} \end{aligned}$$
(26)

1.3 Proof of Theorem 2

Proof

Combining Theorem 1 with Lemma 2 concludes our proof. \(\square\)

1.4 Proof of Theorem 3

Proof

We adapt the proof of Theorem 19.5 in [22] to prove Theorem 3.

Let’s first consider \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right]\) for the q-th label. For the training set \({\mathbb {S}}=\{(\varvec{x}_1,\varvec{y}_1),\ldots ,(\varvec{x}_n,\varvec{y}_n)\}\) and any \(\varvec{x}\in {\mathcal {X}}\), let \(\pi _1(\varvec{x}), \ldots , \pi _n(\varvec{x})\) be a reordering of \(\{1, \ldots , n\}\) according to the distances \(d_{pro}(\varvec{x},\varvec{x}_i)\) to \(\varvec{x}\). That is, for all \(i<n\), \(d_{pro}(\varvec{x}, \varvec{x}_{\pi _i(\varvec{x})})\le d_{pro}(\varvec{x}, \varvec{x}_{\pi _{i+1}(\varvec{x})})\). Let \({\mathcal {B}}=\{B_1,\ldots , B_N\}\) be an \(\varepsilon\)-cover of \({\mathcal {X}}\) of cardinality N. From Eq. (19.3) in [22], we have

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&\quad\le \mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n}\left[ \sum _{i:|B_i\bigcap {\mathbb {S}}|<k}{\mathbb {P}}[B_i]\right] \\&\qquad+\max _i\mathop {{\mathbb {P}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ \varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})\,\big |\, \forall j\in [k],\ d_{pro}(\varvec{x}, \varvec{x}_{\pi _j(\varvec{x})})\le \varepsilon \Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\right]. \end{aligned} \end{aligned}$$
(27)

According to [22, Lemma 19.6], the first term on the right-hand side of (27) is bounded by \(\frac{2Nk}{en}\). The second term on the right-hand side of (27) is bounded by \(\left(1+\sqrt{\tfrac{8}{k}}\right)L(\varvec{y}(q),h^*_q(\varvec{x}))+3l\varepsilon \Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\). Finally, setting \(\varepsilon =2n^{-\tfrac{1}{m+1}}\) and applying Lemma 1, we have

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&\quad\le \left( 1+\sqrt{\frac{8}{k}}\right) L(\varvec{y}(q),h^*_q(\varvec{x})) +\frac{(6l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}+k)}{n^{1/(m+1)}}. \end{aligned} \end{aligned}$$
(28)

Applying Eq. (28) to each label and summing over the c labels completes the proof. \(\square\)
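For completeness, summing (28) over the c labels gives the multilabel form of the bound:

$$\begin{aligned} \begin{aligned}&\sum _{q=1}^c\mathop {{\mathbb {E}}}\limits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right] \\&\quad\le \left( 1+\sqrt{\frac{8}{k}}\right) \sum _{q=1}^cL(\varvec{y}(q),h^*_q(\varvec{x}))+\frac{c\left( 6l\Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}+k\right) }{n^{1/(m+1)}}. \end{aligned} \end{aligned}$$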

About this article

Cite this article

Ma, J., Zhou, S. Metric learning-guided k nearest neighbor multilabel classifier. Neural Comput & Applic 33, 2411–2425 (2021). https://doi.org/10.1007/s00521-020-05134-9
