Abstract
Multilabel classification deals with the problem where each instance belongs to multiple labels simultaneously. The algorithm based on large margin loss with k nearest neighbor constraints (LM-kNN) is one of the most prominent multilabel classification algorithms. However, due to its use of the squared hinge loss, LM-kNN needs to iteratively solve a constrained quadratic programming problem at a high computational cost. To address this issue, we propose a novel metric learning-guided k nearest neighbor approach (MLG-kNN) for multilabel classification. Specifically, we first transform the original instance into the label space by least squares regression. Then, we learn a metric matrix in the label space, which makes the prediction for an instance in the learned metric space close to its true class values and far away from the others. Since MLG-kNN can be formulated as an unconstrained, strictly (geodesically) convex optimization problem and yields a closed-form solution, the computational complexity is reduced. An analysis of the generalization error bound indicates that MLG-kNN converges to the optimal solution. Experimental results verify that the proposed approach is more effective than the existing ones for multilabel classification across many benchmark datasets.
References
Tsoumakas G, Katakis I, Vlahavas I (2009) Mining multi-label data. In: Data mining and knowledge discovery handbook, Springer, pp 667–685
Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837
Zheng W, Qian Y, Lu H (2013) Text categorization based on regularization extreme learning machine. Neural Comput Appl 22(3):447–456
Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771
Barutcuoglu Z, Schapire R, Troyanskaya O (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22(7):830–836
Montañes E, Senge R, Barranquero J, Quevedo JR, del Coz JJ, Hüllermeier E (2014) Dependent binary relevance models for multi-label classification. Pattern Recognit 47(3):1494–1508
Hsu D, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: International conference on neural information processing systems, pp 772–780
Tai F, Lin HT (2012) Multilabel classification with principal label space transformation. Neural Comput 24(9):2508
Zhang Y, Schneider J (2011) Multi-label output codes using canonical correlation analysis. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 873–882
Zhang Y, Schneider J (2012) Maximum margin output coding. In: International conference on machine learning, pp 379–386
Zhang ML, Zhou ZH (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048
Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inform Theory 13(1):21–27
Liu W, Tsang IW (2015) Large margin metric learning for multi-label prediction. In: AAAI conference on artificial intelligence, pp 2800–2806
Liu W, Xu D, Tsang IW, Zhang W (2018) Metric learning for multi-output tasks. IEEE Trans Pattern Anal Mach Intell 41(2):408–422
Zadeh P, Hosseini R, Sra S (2016) Geometric mean metric learning. In: International conference on machine learning, pp 2464–2471
Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th international conference on machine learning, ACM, pp 209–216
Arsigny V, Fillard P, Pennec X, Ayache N (2007) Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J Matrix Anal Appl 29(1):328–347
Bhatia R (2009) Positive definite matrices, vol 24. Princeton University Press, Princeton
Iannazzo B (2016) The geometric mean of two matrices from a computational viewpoint. Numer Linear Algebra Appl 23(2):208–229
Kontorovich A, Weiss R (2014) Maximum margin multiclass nearest neighbors. In: International conference on machine learning, pp 892–900. arXiv:1401.7898
Krauthgamer R, Lee JR (2004) Navigating nets: simple algorithms for proximity search. In: ACM-SIAM symposium on discrete algorithms, pp 798–807
Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: From theory to algorithms. Cambridge University Press, Cambridge
Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, Vlahavas I (2011) Mulan: a java library for multi-label learning. J Mach Learn Res 12:2411–2414
Datta BN (2010) Numerical linear algebra and applications, vol 116. SIAM
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9(9):1871–1874
Schapire RE, Singer Y (2000) Boostexter: a boosting-based system for text categorization. Mach Learn 39(2):135–168
Luaces O, Diez J, Barranquero J, Coz JJD, Bahamonde A (2012) Binary relevance efficacy for multilabel classification. Prog Artif Intell 1(4):303–313
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Li Y, Long PM (2006) Learnability and the doubling dimension. In: Advances in neural information processing systems, pp 889–896
Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61772020).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest regarding this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Proofs of the main theorems
Before giving the proofs, we introduce several related definitions and lemmas.
Definition 1
[20, 21] Given a metric space \(({\mathcal {X}}, d)\), let m be the smallest value such that every ball in \({\mathcal {X}}\) can be covered by \(2^m\) balls of half the radius. The doubling dimension of \({\mathcal {X}}\) is defined as \(ddim({\mathcal {X}})=m\).
Lemma 1
[29] Given a metric space \(({\mathcal {X}}, d )\), an \(\varepsilon\)-cover for \({\mathcal {X}}\) is a set \({\mathcal {B}}\subseteq {\mathcal {X}}\) such that every element of \({\mathcal {X}}\) has a counterpart in \({\mathcal {B}}\) at a distance of at most \(\varepsilon\) (with respect to d). Denote the size of the smallest \(\varepsilon\)-cover by \(N(\varepsilon , {\mathcal {X}}, d)\). Then we have
where \(diam({\mathcal {X}})=\mathop {\sup }\limits _{\varvec{x},\varvec{x}^{'}\in {\mathcal {X}}}d(\varvec{x},\varvec{x}^{'})\).
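The notions of an \(\varepsilon\)-cover and of \(diam({\mathcal {X}})\) can be illustrated on a finite point set. The following is a minimal sketch, not part of the original analysis: it assumes the Euclidean distance as the metric d, and the function names `greedy_epsilon_cover` and `diameter` are hypothetical. The greedy procedure returns a valid \(\varepsilon\)-cover that is not necessarily the smallest one, so its size only upper-bounds \(N(\varepsilon , {\mathcal {X}}, d)\) for the sample.

```python
import numpy as np

def greedy_epsilon_cover(points, eps):
    """Return indices of a subset B such that every point is within eps of some member of B.

    Greedy construction: the result is a valid eps-cover of the sample,
    but not necessarily a smallest one.
    """
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        i = int(np.flatnonzero(uncovered)[0])  # first still-uncovered point becomes a center
        centers.append(i)
        dist = np.linalg.norm(points - points[i], axis=1)
        uncovered &= dist > eps                # points within eps of the new center are covered
    return centers

def diameter(points):
    """Largest pairwise Euclidean distance in the sample (the sample analogue of diam)."""
    diffs = points[:, None, :] - points[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).max())

rng = np.random.default_rng(0)
X = rng.random((200, 2))   # illustrative sample in the unit square
eps = 0.25
cover = greedy_epsilon_cover(X, eps)

# Distance from every point to its nearest cover element.
nearest = np.linalg.norm(X[:, None, :] - X[cover][None, :, :], axis=2).min(axis=1)
print("cover size:", len(cover), "sample diameter:", diameter(X))
```

Because each new center is chosen among the points not yet covered, the centers are pairwise more than \(\varepsilon\) apart, which is why such greedy sets are often used to reason about covering numbers.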
1.2 Proof of Theorem 1
Proof
From the linearity of expectation and the definition of \(L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\), we can rewrite:
We start with \({\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\). For any two examples \((\varvec{x}, \varvec{y})\) and \((\varvec{x}^{'},\varvec{y}^{'})\) drawn from the distribution \({\mathcal {D}}\), we have
Since \(\eta _q\) is l-Lipschitz, we have \(\eta _q(\varvec{x})-\eta _q(\varvec{x}^{'})\le ld_{pro}(\varvec{x},\varvec{x}^{'})\). Since \(\mathbf{M }_{\alpha }\) is a symmetric positive definite matrix, we can factorize it as \(\mathbf{M }_{\alpha }=\mathbf{P }\mathbf{P }^{\top }\) [30]. Then, we have
Actually, the regularization term in MLG-kNN ensures that \(\mathrm {tr}(\mathbf{M }_{\alpha })\) is bounded.
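The factorization step can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not confirmed by the text reproduced here: it takes \(d_{pro}(\varvec{x},\varvec{x}^{'})\) to be the distance \(\sqrt{(\mathbf{W }\varvec{x}-\mathbf{W }\varvec{x}^{'})^{\top }\mathbf{M }_{\alpha }(\mathbf{W }\varvec{x}-\mathbf{W }\varvec{x}^{'})}\) between regressed instances, uses random stand-ins for \(\mathbf{W }\) and \(\mathbf{M }_{\alpha }\), and obtains \(\mathbf{P }\) from a Cholesky factorization. Under these assumptions, submultiplicativity of the Frobenius norm together with \(\Vert \mathbf{P }\Vert _F^2=\mathrm {tr}(\mathbf{M }_{\alpha })\) gives \(d_{pro}(\varvec{x},\varvec{x}^{'})\le \Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\Vert \varvec{x}-\varvec{x}^{'}\Vert _2\), which is the kind of bound for which a bounded trace matters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 5, 3                        # hypothetical feature and label dimensions
W = rng.standard_normal((q, d))    # random stand-in for the regression matrix W
A = rng.standard_normal((q, q))
M = A @ A.T + 1e-3 * np.eye(q)     # symmetric positive definite stand-in for M_alpha

# Cholesky factorization: lower-triangular P with M = P P^T.
P = np.linalg.cholesky(M)
assert np.allclose(P @ P.T, M)

def d_pro(x, x2):
    """Assumed form of the distance: Mahalanobis distance between W x and W x2 under M."""
    diff = W @ (x - x2)
    return float(np.sqrt(diff @ M @ diff))

x, x2 = rng.standard_normal(d), rng.standard_normal(d)

lhs = d_pro(x, x2)
# The same quantity written through the factor P: ||P^T W (x - x2)||_2.
via_P = float(np.linalg.norm(P.T @ W @ (x - x2)))
# Upper bound ||W||_F * sqrt(tr(M)) * ||x - x2||_2.
bound = float(np.linalg.norm(W, "fro") * np.sqrt(np.trace(M)) * np.linalg.norm(x - x2))

print(lhs, via_P, bound)
```

Any factor with \(\mathbf{M }_{\alpha }=\mathbf{P }\mathbf{P }^{\top }\) would serve equally well here; Cholesky is chosen only because it is readily available.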
Let \((\varvec{x}^{'},\varvec{y}^{'})\) be the nearest neighbor of \((\varvec{x},\varvec{y})\) in \({\mathbb {S}}\) with respect to \(d_{pro}\). Then, \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne h^{{\mathbb {S}}}_q(\varvec{x})]\right]\) is equivalent to \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ {\mathbb {P}}[\varvec{y}(q)\ne \varvec{y}^{'}(q)]\right]\). By combining the preceding with (22), we have
The loss of the Bayes optimal classifier for the q-th label is
Combining (24) and (25) concludes our proof. \(\square\)
Lemma 2
[14] Let \(({\mathcal {X}}, d)\) be a metric space with \(diam({\mathcal {X}})=1\) and \(ddim({\mathcal {X}})=m<\infty\). For any \(\varvec{x}\), \(\varvec{x}^{'}\in {\mathcal {X}}\), we have
1.3 Proof of Theorem 2
Proof
Combining Theorem 1 with Lemma 2 concludes our proof. \(\square\)
1.4 Proof of Theorem 3
Proof
We follow the proof of Theorem 19.5 in [22] to prove Theorem 3.
We first consider \(\mathop {{\mathbb {E}}}\nolimits _{{\mathbb {S}}\sim {\mathcal {D}}^n,(\varvec{x},\varvec{y})\sim {\mathcal {D}}}\left[ L(\varvec{y}(q),h^{{\mathbb {S}}}_q(\varvec{x}))\right]\) for the q-th label. For the training set \({\mathbb {S}}=\{(\varvec{x}_1,\varvec{y}_1),\ldots ,(\varvec{x}_n,\varvec{y}_n)\}\) and any \(\varvec{x}\in {\mathcal {X}}\), let \(\pi _1(\varvec{x}), \ldots , \pi _n(\varvec{x})\) be a reordering of \(\{1, \ldots , n\}\) according to the distances \(d_{pro}(\varvec{x},\varvec{x}_i)\) of the training instances to \(\varvec{x}\). That is, for all \(i<n\), \(d_{pro}(\varvec{x}, \varvec{x}_{\pi _i(\varvec{x})})\le d_{pro}(\varvec{x}, \varvec{x}_{\pi _{i+1}(\varvec{x})})\). Let \({\mathcal {B}}=\{B_1,\ldots , B_N\}\) be an \(\varepsilon\)-cover for \({\mathcal {X}}\) of cardinality N. From Eq. (19.3) in [22], we have
According to [22, Lemma 19.6], the first term on the right-hand side of (27) is bounded by \(\frac{2Nk}{en}\). The second term on the right-hand side of (27) is bounded by \(\left(1+\sqrt{\tfrac{8}{k}}\right)L(\varvec{y}(q),h^*_q(\varvec{x}))+3l\varepsilon \Vert \mathbf{W }\Vert _F(\mathrm {tr}(\mathbf{M }_{\alpha }))^{1/2}\). Finally, setting \(\varepsilon =2n^{-\tfrac{1}{m+1}}\) and applying Lemma 1, we have
Applying Eq. (28) to each label and summing over the labels completes the proof. \(\square\)
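The reordering \(\pi _1(\varvec{x}), \ldots , \pi _n(\varvec{x})\) used in this proof amounts to sorting the training instances by their distance to the query. The following is a minimal sketch under stated assumptions: the Euclidean distance stands in for \(d_{pro}\), labels are binary in {0, 1}, the k-NN rule for the q-th label is taken to be a majority vote over the k nearest neighbors (as in the setting of [22]), and the function name `knn_predict_label` is hypothetical.

```python
import numpy as np

def knn_predict_label(X_train, y_train_q, x, k, dist=None):
    """Predict one binary label of x by majority vote among its k nearest training points.

    Returns the prediction and the index ordering; order[i - 1] plays the
    role of pi_i(x), with indices counted from 0.
    """
    if dist is None:
        # Stand-in for d_pro: plain Euclidean distance (an assumption of this sketch).
        dist = lambda a, b: np.linalg.norm(a - b)
    distances = np.array([dist(x, xi) for xi in X_train])
    order = np.argsort(distances, kind="stable")   # nondecreasing distance to x
    votes = y_train_q[order[:k]]
    prediction = int(2 * votes.sum() > k)          # strict majority of ones among k neighbors
    return prediction, order

# Tiny illustrative data set: four one-dimensional instances and one label column.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train_q = np.array([1, 1, 0, 0])
x = np.array([0.4])

prediction, order = knn_predict_label(X_train, y_train_q, x, k=3)
print(prediction, order)
```

A learned metric can be supplied through the `dist` argument without changing the rest of the procedure.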
Cite this article
Ma, J., Zhou, S. Metric learning-guided k nearest neighbor multilabel classifier. Neural Comput & Applic 33, 2411–2425 (2021). https://doi.org/10.1007/s00521-020-05134-9