Machine Learning

, Volume 88, Issue 1–2, pp 69–126 | Cite as

Compressed labeling on distilled labelsets for multi-label learning



Directly applying single-label classification methods to the multi-label learning problems substantially limits both the performance and speed due to the imbalance, dependence and high dimensionality of the given label matrix. Existing methods either ignore these three problems or reduce one with the price of aggravating another. In this paper, we propose a {0,1} label matrix compression and recovery method termed “compressed labeling (CL)” to simultaneously solve or at least reduce these three problems. CL first compresses the original label matrix to improve balance and independence by preserving the signs of its Gaussian random projections. Afterward, we directly utilize popular binary classification methods (e.g., support vector machines) for each new label. A fast recovery algorithm is developed to recover the original labels from the predicted new labels. In the recovery algorithm, a “labelset distilling method” is designed to extract distilled labelsets (DLs), i.e., the frequently appeared label subsets from the original labels via recursive clustering and subtraction. Given a distilled and an original label vector, we discover that the signs of their random projections have an explicit joint distribution that can be quickly computed from a geometric inference. Based on this observation, the original label vector is exactly determined after performing a series of Kullback-Leibler divergence based hypothesis tests on the distribution about the new labels. CL significantly improves the balance of the training samples and reduces the dependence between different labels. Moreover, it accelerates the learning process by training fewer binary classifiers for compressed labels, and makes use of label dependence via DLs based tests. Theoretically, we prove the recovery bounds of CL which verifies the effectiveness of CL for label compression and multi-label classification performance improvement brought by label correlations preserved in DLs. We show the effectiveness, efficiency and robustness of CL via 5 groups of experiments on 21 datasets from text classification, image annotation, scene classification, music categorization, genomics and web page classification.


Multi-label prediction Labelset selection Compressed sensing Random projection Label compression Distilled labelsets Binary matrix decomposition KL divergence Hypothesis test of distribution Support vector machines 


  1. Bian, W., & Tao, D. (2011). Max-min distance analysis by using sequential sdp relaxation for dimension reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 1037–1050. CrossRefGoogle Scholar
  2. Bianchi, N.C., Gentile, C., & Zaniboni, L. (2006). Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7, 31–54. MATHGoogle Scholar
  3. Boutell, M.R., Luo, J., Shen, X., & Brown, C.M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771. CrossRefGoogle Scholar
  4. Breiman, L., & Friedman, J.H. (1997). Predicting multivariate responses in multiple linear regression (with discussion). The Journal of the Royal Statistical Society Series B, 54, 5–54. Google Scholar
  5. Candès, E.J., Romberg, J.K., & Tao, T. (2006). Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2), 489–509. MATHCrossRefGoogle Scholar
  6. Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at
  7. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. MATHGoogle Scholar
  8. Chen, J., Liu, J., & Ye, J. (2010). Learning incoherent sparse and low-rank patterns from multiple tasks. In SIGKDD’10: The 16th ACM SIGKDD international conference on knowledge discovery and data mining. Google Scholar
  9. Cheng, W., & Hüllermeier, E. (2009). Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2–3), 211–225. CrossRefGoogle Scholar
  10. Clare, A., & King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In PKDD’01: Proceedings of the 5th European conference on principles of data mining and knowledge discovery (pp. 42–53). London: Springer. CrossRefGoogle Scholar
  11. Clarkson, K.L. (2008). Tighter bounds for random projections of manifolds. In SCG’08: Proceedings of the 24 annual symposium on computational geometry (pp. 39–48). Google Scholar
  12. Crammer, K., & Singer, Y. (2003). A family of additive online algorithms for category ranking. Journal of Machine Learning Research, 3, 1025–1058. MathSciNetMATHGoogle Scholar
  13. Dasgupta, S. (2000). Experiments with random projection. In UAI’00: Proceedings of the 16th conference on uncertainty in artificial intelligence, San Francisco, CA, USA (pp. 143–151). Google Scholar
  14. Dasgupta, S., & Freund, Y. (2008). Random projection trees and low dimensional manifolds. In STOC’08: Proceedings of the 40th annual ACM symposium on theory of computing (pp. 537–546). Google Scholar
  15. Dembczyński, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In The 27th international conference on machine learning (ICML 2010). Google Scholar
  16. Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2010). On label dependence in multi-label classification. In ICML 2010 workshop on learning from multi-label data (MLD 10) (pp. 5–13). Google Scholar
  17. Dietterich, T.G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–282. MATHGoogle Scholar
  18. Diplaris, S., Tsoumakas, G., Mitkas, P.A., & Vlahavas, I. (2005). Protein classification with multiple algorithms. In Proceedings of the 10th Panhellenic conference on informatics (PCI 2005) (pp. 448–456). Google Scholar
  19. Donoho, D.L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289–1306. MathSciNetCrossRefGoogle Scholar
  20. Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV’02: Proceedings of the 7th European conference on computer vision-part IV (pp. 97–112). London: Springer. Google Scholar
  21. Efron, B., Hastie, T., Johnstone, L., & Tibshirani, R. (2002). Least angle regression. Annals of Statistics, 32, 407–499. MathSciNetGoogle Scholar
  22. Escalera, S., Pujol, O., & Radeva, P. (2010). On the decoding process in ternary error-correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 120–134. CrossRefGoogle Scholar
  23. Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In KDD’04: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 109–117). Google Scholar
  24. Fürnkranz, J., Hüllermeier, E., Mencía, E.L., & Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2), 133–153. CrossRefGoogle Scholar
  25. Ghamrawi, N., & McCallum, A. (2005). Collective multi-label classification. In CIKM’05: Proceedings of the 14th ACM international conference on information and knowledge management (pp. 195–200). CrossRefGoogle Scholar
  26. Goemans, M.X., & Williamson, D.P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6), 1115–1145. MathSciNetMATHCrossRefGoogle Scholar
  27. Gomez, J., Boiy, E., & Moens, M.-F. (2011). Highly discriminative statistical features for email classification. Knowledge and Information Systems. doi:10.1007/s10115-011-0403-7.
  28. Gretton, A., Bousquet, O., Smola, E., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings algorithmic learning theory (pp. 63–77). Berlin: Springer. CrossRefGoogle Scholar
  29. Guan, N., Tao, D., Luo, Z., & Yuan, B. (2011). Non-negative patch alignment framework. IEEE Transactions on Neural Networks, 22(8), 1218–1230. CrossRefGoogle Scholar
  30. Gupta, A., Nowak, R., & Recht, B. (2010). Sample complexity for 1-bit compressed sensing and sparse classification. In Proceedings of the IEEE international symposium on information theory (ISIT). Google Scholar
  31. Halko, N., Martinsson, P.-G., & Tropp, J.A. (2009). Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv:0909.4061.
  32. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). Springer series in statistics. Berlin: Springer. Corr. 3rd printing edition. MATHGoogle Scholar
  33. Hsu, D., Kakade, S.M., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In Advances in neural information processing systems 23. Google Scholar
  34. Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16–17), 1897–1916. MathSciNetMATHCrossRefGoogle Scholar
  35. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC’98: Proceedings of the thirtieth annual ACM symposium on theory of computing (pp. 604–613). CrossRefGoogle Scholar
  36. Jia, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. In The 26th international conference on machine learning (ICML) (pp. 457–464). Google Scholar
  37. Jia, S., Tang, L., Yu, S., & Ye, J. (2010). A shared-subspace learning framework for multi-label classification. ACM Transactions on Knowledge Discovery from Data, 2(1). Google Scholar
  38. Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. In Contemporary Mathematics: Vol. 26. Conference in modern analysis and probability, New Haven, Conn., 1982 (pp. 189–206). Providence: Am. Math. Soc. CrossRefGoogle Scholar
  39. Katakis, I., Tsoumakas, G., & Vlahavas, I. (2008). Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD 2008 discovery challenge. Google Scholar
  40. Kong, X., & Yu, P. (2011). gmlc: a multi-label feature selection framework for graph classification. Knowledge and Information Systems. doi:10.1007/s10115-011-0407-3.
  41. Koufakou, A., Secretan, J., & Georgiopoulos, M. (2011). Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowledge and Information Systems, 29, 697–725. CrossRefGoogle Scholar
  42. Kullback, S., & Leibler, R.A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. MathSciNetMATHCrossRefGoogle Scholar
  43. Langford, J., & Beygelzimer, A. (2005). Sensitive error correcting output codes. In COLT’05: annual conference on learning theory (Vol. 3559, pp. 158–172). Google Scholar
  44. Li, P. (2008). Estimators and tail bounds for dimension reduction in α(0<α≤2) using stable random projections. In SODA’08: Proceedings of the nineteenth annual ACM-SIAM symposium on discrete algorithms (pp. 10–19). Philadelphia: Society for Industrial and Applied Mathematics. Google Scholar
  45. Li, P. (2010). Approximating higher-order distances using random projections. In The 26th conference on uncertainty in artificial intelligence (UAI 2010). Google Scholar
  46. Liu, L., & Liang, Q. (2011). A high-performing comprehensive learning algorithm for text classification without pre-labeled training set. Knowledge and Information Systems, 29, 727–738. CrossRefGoogle Scholar
  47. Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. MathSciNetCrossRefGoogle Scholar
  48. MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proceeding of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–297). Google Scholar
  49. Masud, M., Woolam, C., Gao, J., Khan, L., Han, J., Hamlen, K., & Oza, N. (2011). Facing the reality of data stream classification: coping with scarcity of labeled data. Knowledge and Information Systems. doi:10.1007/s10115-011-0447-8.
  50. Mencía, L., & Fürnkranz, J. (2008). Pairwise learning of multilabel classifications with perceptrons. In IEEE international joint conference on neural networks (IJCNN-08) (pp. 995–1000). Google Scholar
  51. Ng, A.Y., Jordan, M.I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS’01: advances in neural information processing systems 14 (Vol. 2, pp. 849–856). Google Scholar
  52. Osuna, E., Freund, R., & Girosi, F. (1997). Support vector machines: Training and applications (Technical report). Massachusetts Institute of Technology. Google Scholar
  53. Raginsky, M., & Lazebnik, S. (2009). Locality-sensitive binary codes from shift-invariant kernels. In The 23rd annual conference on neural information processing systems (NIPS 2009). Google Scholar
  54. Read, J. (2010). Meka softwares. Google Scholar
  55. Read, J., Pfahringer, B., & Holmes, G. (2008). Multi-label classification using ensembles of pruned sets. In ICDM’08: Proceedings of the 2008 eighth IEEE international conference on data mining, Washington, DC, USA (pp. 995–1000). CrossRefGoogle Scholar
  56. Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359. CrossRefGoogle Scholar
  57. Schapire, R.E., & Singer, Y. (2000). Boostexter: a boosting-based system for text categorization. Machine Learning, 39(2/3), 135–168. MATHCrossRefGoogle Scholar
  58. Si, S., Tao, D., & Geng, B. (2010). Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7), 929–942. CrossRefGoogle Scholar
  59. Snoek, C. G. M., Worring, M., van Gemert, J. C., Geusebroek, J. M., & Smeulders, A. W. M. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In MULTIMEDIA ’06: Proceedings of the 14th annual ACM international conference on multimedia (pp. 421–430). New York: ACM. CrossRefGoogle Scholar
  60. Tao, D., Li, X., Wu, X., & Maybank, S.J. (2009). Geometric mean for subspace selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 260–274. CrossRefGoogle Scholar
  61. Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multilabel classification of music into emotions. In Proc. 9th international conference on music information retrieval (ISMIR 2008). Philadelphia, PA, USA. Google Scholar
  62. Tsoumakas, G. (2010). Mulan: A java library for multi-label learning.
  63. Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 1–13. CrossRefGoogle Scholar
  64. Tsoumakas, G., Katakis, I., & Vlahavas, I. (2008). Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 workshop on mining multidimensional data (MMD’08 2007). Google Scholar
  65. Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In Data mining and knowledge discovery handbook. Google Scholar
  66. Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European conference on machine learning (ECML 2007), Warsaw, Poland, 2007 (pp. 406–417) Google Scholar
  67. Ueda, N., & Saito, K. (2002). Parametric mixture models for multi-labeled text. In Advances in neural information processing systems 15. Neural information processing systems, NIPS 2002. Google Scholar
  68. Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer. MATHGoogle Scholar
  69. Vempala, S.S. (2004). The random projection method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science: Vol. 65. Providence: Am. Math. Soc. MATHGoogle Scholar
  70. Yates, F. (1934). Contingency tables involving small numbers and the χ 2 test. Journal of the Royal Statistical Society, 1, 217–235. Google Scholar
  71. Zhang, M., & Wang, Z. (2009). Mimlrbf: Rbf neural networks for multi-instance multi-label learning. Neurocomputing, 72(16–18), 3951–3956. CrossRefGoogle Scholar
  72. Zhang, M., & Zhou, Z. (2006). Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338–1351. CrossRefGoogle Scholar
  73. Zhang, M., & Zhou, Z. (2007). Ml-knn: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038–2048. MATHCrossRefGoogle Scholar
  74. Zhang, Y., & Zhou, Z. (2008). Multi-label dimensionality reduction via dependence maximization. In AAAI’08: Proceedings of the 23rd national conference on artificial intelligence (pp. 1503–1505). Google Scholar
  75. Zhou, T., Tao, D., & Wu, X. (2010). Nesvm: A fast gradient method for support vector machines. In ICDM’10: Proceedings of the 2010 IEEE international conference on data mining (pp. 679–688). CrossRefGoogle Scholar
  76. Zhou, T., Tao, D., & Wu, X. (2011). Manifold elastic net: a unified framework for sparse dimension reduction. Data Mining and Knowledge Discovery (Springer), 22(3), 340–371. MathSciNetMATHCrossRefGoogle Scholar

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. 1.Centre for Quantum Computation & Intelligent Systems (QCIS), Faculty of Engineering & ITUniversity of Technology, Sydney (UTS)BroadwayAustralia
  2. 2.Department of Computer ScienceHefei University of TechnologyHefeiChina
  3. 3.Department of Computer ScienceUniversity of VermontBurlingtonUSA

Personalised recommendations