Abstract
Many problem settings in machine learning are concerned with the simultaneous prediction of multiple target variables of diverse type. Amongst others, such problem settings arise in multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. These subfields of machine learning are typically studied in isolation, without highlighting or exploring important relationships. In this paper, we present a unifying view on what we call multi-target prediction (MTP) problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research.
Notes
Let us remark that our notion of “representation-constructing” differs substantially from the notion of “representation learning” as commonly used in the area of deep neural networks. Here, we consider the construction of vector representations for targets. This is something that is not commonly done in multi-target prediction extensions of deep architectures.
References
Abernethy J, Bach F, Evgeniou T, Vert JP (2008) A new approach to collaborative filtering: operator estimation with spectral regularization. J Mach Learn Res 10:803–826
Adams RP, Dahl GE, Murray I (2010) Incorporating side information into probabilistic matrix factorization using Gaussian processes. In: Grünwald P, Spirtes P (eds) The 26th conference on uncertainty in artificial intelligence, pp 1–9
Aho T, Ženko B, Džeroski S (2009) Rule ensembles for multi-target regression. In: Proceedings of the IEEE international conference on data mining, pp 21–30
Aho T, Ženko B, Džeroski S, Elomaa T (2012) Multi-target regression with rule ensembles. J Mach Learn Res 13(1):2367–2407
Akata Z, Reed SE, Walter D, Lee H, Schiele B (2015) Evaluation of output embeddings for fine-grained image classification. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015, pp 2927–2936
Akata Z, Perronnin F, Harchaoui Z, Schmid C (2016) Label-embedding for image classification. IEEE Trans Pattern Anal Mach Intell 38(7):1425–1438
Álvarez M, Rosasco L, Lawrence N (2012) Kernels for vector-valued functions: a review. Found Trends Mach Learn 4(3):195–266
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99
Balasubramanian K, Lebanon G (2012) The landmark selection method for multiple output prediction. In: International conference on machine learning
Baldassarre L, Rosasco L, Barla A, Verri A (2012) Multi-output learning via spectral filtering. Mach Learn 87(3):259–301
Barutcuoglu Z, Schapire RE, Troyanskaya OG (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22(7):830–836
Basilico J, Hofmann T (2004) Unifying collaborative and content-based filtering. In: Proceedings of the 21st international conference on machine learning, pp 9–16
Ben-Hur A, Noble W (2005) Kernel methods for predicting protein–protein interactions. Bioinformatics 21(Suppl 1):38–46
Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Advances in neural information processing systems 28: annual conference on neural information processing systems 2015, 7–12 December 2015, Montreal, Quebec, Canada, pp 730–738
Bi W, Kwok J (2012) Mandatory leaf node prediction in hierarchical multilabel classification. Adv Neural Inf Process Syst 25:153–161
Bi W, Kwok JT (2013) Efficient multi-label classification with many labels. In: Proceedings of the 30th international conference on machine learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, pp 405–413
Bielza C, Li G, Larrañaga P (2011) Multi-dimensional classification with Bayesian networks. Int J Approx Reason 52(6):705–727
Bonilla EV, Agakov F, Williams C (2007) Kernel multi-task learning using task-specific features. In: The 11th international conference on artificial intelligence and statistics AISTATS’07, pp 43–50
Breiman L, Friedman J (1997) Predicting multivariate responses in multiple linear regression. J R Stat Soc B 59:3–54
Candes E, Recht B (2008) Exact low-rank matrix completion via convex optimization. Found Comput Math 9:717–772
Caponnetto A, Micchelli CA, Pontil M, Ying Y (2008) Universal multi-task kernels. J Mach Learn Res 9:1615–1646
Caruana R (1997) Multitask learning: a knowledge-based source of inductive bias. Mach Learn 28:41–75
Chen J, Tang L, Liu J, Ye J (2009) A convex formulation for learning shared structures from multiple tasks. In: Proceedings of the 26th annual international conference on machine learning, ACM, New York, NY, USA, ICML’09, pp 137–144
Cheng W, Hüllermeier E (2009) Combining instance-based learning and logistic regression for multilabel classification. Mach Learn 76(2–3):211–225
Cissé M, Usunier N, Artières T, Gallinari P (2013) Robust bloom filters for large multilabel classification tasks. In: Advances in neural information processing systems, vol 26. Lake Tahoe, Nevada, United States, pp 1851–1859
Dembczyński K, Waegeman W, Cheng W, Hüllermeier E (2012) On label dependence and loss minimization in multi-label classification. Mach Learn 88:5–45
Dembczyński K, Kotłowski W, Gawel P, Szarecki A, Jaszkiewicz A (2013) Matrix factorization for travel time estimation in large traffic networks. In: Artificial intelligence and soft computing—12th international conference (ICAISC 2013). Lecture notes in computer science, vol 7895. Springer, pp 500–510
Dembczyński K, Kotłowski W, Waegeman W, Busa-Fekete R, Hüllermeier E (2016) Consistency of probabilistic classifier trees. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II, pp 511–526
Deng J, Ding N, Jia Y, Frome A, Murphy K, Bengio S, Li Y, Neven H, Adam H (2014) Large-scale object classification using label relation graphs. In: European conference on computer vision. Lecture notes in computer science vol 8689. Springer, pp 48–64
Dinuzzo F (2013) Learning output kernels for multi-task problems. Neurocomputing 118:119–126
Dinuzzo F, Ong CS, Gehler P, Pillonetto G (2011) Learning output kernels with block coordinate descent. In: Proceedings of the international conference on machine learning
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st international conference on machine learning, ICML 2014, Beijing, China, 21–26 June 2014, pp 647–655
Evgeniou T (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Evgeniou T, Pontil M (2004) Regularized multi-task learning. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, pp 109–117
Fang Y, Si L (2011) Matrix co-factorization for recommendation with rich side information and implicit feedback. In: The 2nd international workshop on information heterogeneity and fusion in recommender systems, ACM, pp 65–69
Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, Mikolov T (2013) DeViSE: a deep visual-semantic embedding model. In: Advances in neural information processing systems, pp 2121–2129
Fu Y, Hospedales T, Xiang T, Gong S (2013) Learning multimodal latent attributes. IEEE Trans Pattern Anal Mach Intell 36(2):303–316
Gaujoux R, Seoighe C (2010) A flexible R package for nonnegative matrix factorization. BMC Bioinform 11:367
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 580–587
Godbole S, Sarawagi S (2004) Discriminative methods for multi-labeled classification. PAKDD 2004:22–30
Gönen M (2012) Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28(18):2304–10
Gong P, Ye J, Zhang C (2012) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’12, pp 895–903
Gong Y, Wang L, Guo R, Lazebnik S (2014) Multi-scale orderless pooling of deep convolutional activation features. In: European conference on computer vision. Lecture notes in computer science, vol 8695. Springer, pp 392–407
Gopal S, Yang Y (2013) Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, pp 257–265
Gopal S, Yang Y, Bai B, Niculescu-Mizil A (2012) Bayesian models for large-scale hierarchical classification. In: Proceedings of the 25th international conference on neural information processing systems, USA, NIPS’12, pp 2411–2419
Gu Q, Li Z, Han J (2011) Correlated multi-label feature selection. In: Proceedings of the 20th ACM international conference on information and knowledge management, ACM, New York, NY, USA, CIKM’11, pp 1087–1096
Guo Y, Gu S (2011) Multi-label classification using conditional dependency networks. In: Proceedings of the twenty-second international joint conference on artificial intelligence, vol 2, AAAI Press, IJCAI’11, pp 1300–1305
Hariharan B, Zelnik-Manor L, Vishwanathan S, Varma M (2010) Large scale max-margin multi-label classification with priors. In: International conference on machine learning. Omni Press
Hastie T, Tibshirani R, Friedman JH (2007) Elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, Heidelberg
Hayashi K, Takenouchi T, Tomioka R, Kashima H (2012) Self-measuring similarity for multi-task Gaussian process. In: Guyon I, Dror G, Lemaire V, Taylor GW, Silver DL (eds) ICML workshop on unsupervised and transfer learning, JMLR proceedings, vol 27, pp 145–154
Hsu D, Kakade S, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: NIPS 22, pp 772–780
Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artif Intell 172(16–17):1897–1916
Izenman A (1975) Reduced-rank regression for the multivariate linear model. J Multivar Anal 5:248–262
Jacob L, Vert J (2008) Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24(19):2149–2156
Jacob L, Bach F, Vert JP (2008) Clustered multi-task learning: a convex formulation. In: Advances in neural information processing systems
Jain P, Netrapalli P, Sanghavi S (2013) Low-rank matrix completion using alternating minimization. In: Proceedings of the forty-fifth annual ACM symposium on theory of computing, ACM, New York, NY, USA, pp 665–674
Jalali A, Sanghavi S, Ravikumar P, Ruan C (2010) A dirty model for multi-task learning. In: Neural information processing systems, pp 964–972
James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, pp 361–379
Jawanpuria P, Lapin M, Hein M, Schiele B (2015) Efficient output kernel learning for multiple tasks. In: Advances in neural information processing systems, vol 28, pp 1189–1197
Kashima H, Kato T, Yamanishi Y, Sugiyama M, Tsuda K (2009) Link propagation: a fast semi-supervised learning algorithm for link prediction. In: SIAM international conference on data mining (SDM’09), SIAM, pp 1099–1110
Kong X, Yu PS (2012) gMLC: a multi-label feature selection framework for graph classification. Knowl Inf Syst 31(2):281–305
Krichene W, Mayoraz N, Rendle S, Zhang L, Yi X, Hong L, Chi E, Anderson J (2018) Efficient training on very large corpora via gramian estimation. ArXiv e-prints
Kula M (2015) Metadata embeddings for user and item cold-start recommendations. In: Proceedings of the 2nd workshop on new trends on content-based recommender systems co-located with 9th ACM conference on recommender systems, pp 14–21
Lampert CH, Nickisch H, Harmeling S (2009) Learning to detect unseen object classes by between class attribute transfer. In: Conference on computer vision and pattern recognition
Larochelle H, Erhan D, Bengio Y (2008) Zero-data learning of new tasks. In: 23rd national conference on artificial intelligence (AAAI’08). AAAI Press, pp 646–651
Lawrence N, Urtasun R (2009) Non-linear matrix factorization with Gaussian processes. In: Proceedings of the 26th annual international conference on machine learning
Lee G, Yang E, Hwang SJ (2016) Asymmetric multi-task learning based on task relatedness and confidence. In: Proceedings of the 33rd international conference on machine learning, ICML 2016, New York City, NY, USA, 19–24 June 2016, pp 230–238
Liu W, Johnson D (2009) Clustering and its application in multi-target prediction. Curr Opin Drug Discov Develop 12(1):98–107
Liu J, Kuipers B, Savarese S (2011) Recognizing human actions by attributes. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 3337–3344
Liu H, Sun J, Guan J, Zheng J, Zhou S (2015) Improving compound-protein interaction prediction by building up highly credible negative samples. Bioinformatics 31(12):i221–i229
Loza Mencía E, Janssen F (2016) Learning rules for multi-label classification: a stacking and a separate-and-conquer approach. Mach Learn 105(1):77–126
Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11:2287–2322
Menon A, Elkan C (2010) A log-linear model with latent features for dyadic prediction. In: The 10th IEEE international conference on data mining (ICDM), pp 364–373
Menon A, Elkan C (2011) Link prediction via matrix factorization. Mach Learn Knowl Discov Databases 6912:437–452
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR arXiv:1301.3781
Nam J, Loza Mencía E, Kim HJ, Fürnkranz J (2015) Predicting unseen labels using label hierarchies in large-scale multi-label learning. In: European conference on machine learning. Lecture notes in computer science, vol 9284. Springer, pp 102–118
Nam J, Loza Mencía E, Fürnkranz J (2016) All-in text: learning document, label, and word representations jointly. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, 12–17 February 2016, Phoenix, Arizona, USA, pp 1948–1954
Obozinski G, Taskar B, Jordan MI (2010) Joint covariate selection and joint subspace selection for multiple classification problems. Stat Comput 20(2):231–252
Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 1717–1724
Oyama S, Manning C (2004) Using feature conjunctions across examples for learning pairwise classifiers. In: European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 3201. Springer, pp 322–333
Pahikkala T, Waegeman W, Tsivtsivadze E, Salakoski T, De Baets B (2010) Learning intransitive reciprocal relations with kernel methods. Eur J Oper Res 206(3):676–685
Pahikkala T, Airola A, Stock M, Baets BD, Waegeman W (2013) Efficient regularized least-squares algorithms for conditional ranking on relational data. Mach Learn 93(2–3):321–356
Pahikkala T, Stock M, Airola A, Aittokallio T, De Baets B, Waegeman W (2014) A two-step learning approach for solving full and almost full cold start problems in dyadic prediction. In: Lecture notes in computer science, vol 8725, pp 517–532
Palatucci M, Hinton G, Pomerleau D, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: Advances in neural information processing systems, pp 1410–1418
Papagiannopoulou C, Tsoumakas G, Tsamardinos I (2015) Discovering and exploiting deterministic label relationships in multi-label learning. In: Cao L, Zhang C, Joachims T, Webb GI, Margineantu DD, Williams G (eds) KDD, ACM, pp 915–924
Papagiannopoulou C, Miralles DG, Demuzere M, Verhoest N, Waegeman W (2018) Global hydro-climatic biomes identified via multi-task learning. Geosci Model Dev 11:4139–4153
Park SH, Fürnkranz J (2008) Multi-label classification with label constraints. In: ECML PKDD 2008 workshop on preference learning (PL-08, Antwerp, Belgium)
Park ST, Chu W (2009) Pairwise preference regression for cold-start recommendation. In: The third ACM conference on recommender systems, ACM, pp 21–28
Park Y, Marcotte EM (2012) Flaws in evaluation schemes for pair-input computational predictions. Nat Methods 9(12):1134–1136
Pelossof R, Singh I, Yang JL, Weirauch MT, Hughes TR, Leslie CS (2015) Affinity regression predicts the recognition code of nucleic acid-binding proteins. Nat Biotechnol 33(12):1242–1249
Prabhu Y, Kag A, Harsola S, Agrawal R, Varma M (2018) Parabel: partitioned label trees for extreme classification with application to dynamic search advertising. In: The web conference (WWW), pp 993–1002
Rai P, Daumé III H (2009) Multi-label prediction via sparse infinite CCA. In: Proceedings of the conference on neural information processing systems (NIPS)
Rangwala H, Naik A (2017) Large scale hierarchical classification: foundations, algorithms and applications. KDD Tutorial, Halifax
Raymond R, Kashima H (2010) Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 6323. Springer, pp 131–147
Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 512–519
Read J (2013) Multi-dimensional classification with super-classes. IEEE Trans Knowl Data Eng 99:1
Rohrbach M, Stark M, Schiele B (2011) Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1641–1648
Romera-Paredes B, Torr P (2015) An embarrassingly simple approach to zero-shot learning. In: Proceedings of the 32nd international conference on machine learning, vol 37, pp 2152–2161
Rousu J, Saunders C, Szedmak S, Shawe-Taylor J (2006) Kernel-based learning of hierarchical multilabel classification models. J Mach Learn Res 7:1601–1626
Schäfer D, Hüllermeier E (2015) Dyad ranking using a bilinear Plackett–Luce model. In: Proceedings ECML/PKDD–2015, European conference on machine learning and knowledge discovery in databases, Porto, Portugal
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2014) OverFeat: integrated recognition, localization and detection using convolutional networks
Shan H, Banerjee A (2010) Generalized probabilistic matrix factorizations for collaborative filtering. In: Webb GI, Liu B, Zhang C, Gunopulos D, Wu X (eds) The 10th IEEE international conference on data mining (ICDM). IEEE Computer Society, pp 1025–1030
Silla CN, Freitas AA (2010) A survey of hierarchical classification across different application domains. Data Min Knowl Discov 22(1–2):31–72
Socher R, Ganjoo M, Sridhar H, Bastani O, Manning CD, Ng AY (2013) Zero-shot learning through cross-modal transfer. In: Advances in neural information processing systems 26, pp 935–943
Spolaôr N, Monard MC, Tsoumakas G, Lee HD (2016) A systematic review of multi-label feature selection and a new method based on label construction. Neurocomputing 180(C):3–15
Spyromitros-Xioufis E, Tsoumakas G, Groves W, Vlahavas I (2016) Multi-target regression via input space expansion: treating targets as inputs. Mach Learn 104(1):55–98
Stock M, Fober T, Hüllermeier E, Glinca S, Klebe G, Pahikkala T, Airola A, De Baets B, Waegeman W (2014) Identification of functionally related enzymes by learning-to-rank methods. IEEE Trans Comput Biol Bioinform 11(6):1157–1169
Stock M, Pahikkala T, Airola A, Baets BD, Waegeman W (2016) Efficient pairwise learning using kernel ridge regression: an exact two-step method. arXiv:1606.04275
Tai F, Lin HT (2010) Multi-label classification with principle label space transformation. In: Second international workshop on learning from multi-label data (MLD 2010), in conjunction with ICML/COLT 2010
Tai F, Lin HT (2012) Multilabel classification with principal label space transformation. Neural Comput 24(9):2508–2542
Takács G, Pilászy I, Németh B, Tikk D (2008) Matrix factorization and neighbor based algorithms for the Netflix prize problem. In: Proceedings of the 2008 ACM conference on recommender systems. ACM Press, New York, pp 267–274
Todorovski L, Blockeel H, Džeroski S (2002) Ranking with predictive clustering trees. In: Proceedings of the European conference on machine learning
Tsoumakas G, Katakis I (2007) Multi label classification: an overview. Int J Data Warehous Min 3(3):1–13
Van der Merwe A, Zidek J (1980) Multivariate regression analysis and canonical variates. Can J Stat 8:27–39
Van Loan CF (2000) The ubiquitous Kronecker product. J Comput Appl Math 123(1–2):85–100
Van Peer G, Paepe AD, Stock M, Anckaert J, Volders PJ, Vandesompele J, Baets BD, Waegeman W (2017) miSTAR: miRNA target prediction through modeling quantitative and qualitative miRNA binding site information in a stacked model structure. Nucl Acids Res 45:e51
Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H (2008) Decision trees for hierarchical multi-label classification. Mach Learn 73(2):185–214
Vert JP, Qiu J, Noble WS (2007) A new pairwise kernel for biological network inference with support vector machines. BMC Bioinform 8(S–10):1–10
Volkovs M, Zemel RS (2012) Collaborative ranking with 17 parameters. In: Advances in neural information processing systems, pp 2303–2311
Waegeman W, Pahikkala T, Airola A, Salakoski T, Stock M, De Baets B (2012) A kernel-based framework for learning graded relations from data. IEEE Trans Fuzzy Syst 20(6):1090–1101
Waegeman W, Dembczyński K, Jachnik A, Cheng W, Hüllermeier E (2014) On the Bayes-optimality of F-measure maximizers. J Mach Learn Res 15:3333–3388
Wang F, Wang X, Li T (2009) Semi-supervised multi-task learning with task regularizations. In: IEEE international conference on data mining, pp 562–568
Wei Y, Xia W, Lin M, Huang J, Ni B, Dong J, Zhao Y, Yan S (2016) HCP: a flexible CNN framework for multi-label image classification. IEEE Trans Pattern Anal Mach Intell 38(9):1901–1907
Weston J, Chapelle O, Elisseeff A, Schölkopf B, Vapnik V (2002) Kernel dependency estimation. In: Advances in neural information processing systems, UK pp 873–880
Wicker J, Tyukin A, Kramer S (2016) A nonlinear label compression and transformation method for multi-label classification using autoencoders. In: Advances in knowledge discovery and data mining: 20th Pacific-Asia conference, PAKDD 2016, Auckland, New Zealand
Wolpert DH (1992) Original contribution: stacked generalization. Neural Netw 5(2):241–259
Wu L, Fisch A, Chopra S, Adams K, Bordes A, Weston J (2018) StarSpace: embed all the things! In: AAAI conference on artificial intelligence
Xian Y, Akata Z, Sharma G, Nguyen QN, Hein M, Schiele B (2016) Latent embeddings for zero-shot classification. In: IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, pp 69–77
Xian Y, Lampert C, Schiele B, Akata Z (2018) Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2018.2857768
Xue Y, Liao X, Carin L, Krishnapuram B (2007) Multi-task learning for classification with Dirichlet process priors. J Mach Learn Res 8:35–63
Yen IE, Huang X, Ravikumar P, Zhong K, Dhillon IS (2016) PD-Sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In: Proceedings of the 33rd international conference on machine learning, New York City, NY, USA, pp 3069–3077
Zhang Y, Schneider J (2011) Multi-label output codes using canonical correlation analysis. In: Uncertainty in artificial intelligence
Zhang D, Shen D (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2):895–907
Zhang Y, Yeung D (2010) A convex formulation for learning task relationships in multi-task learning. In: Proceedings of the 26th conference on uncertainty in artificial intelligence (UAI), pp 733–742
Zhou J, Chen J, Ye J (2011a) Clustered multi-task learning via alternating structure optimization. In: Advances in neural information processing systems
Zhou J, Yuan L, Liu J, Ye J (2011b) A multi-task learning formulation for predicting disease progression. In: Apté C, Ghosh J, Smyth P (eds) Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 814–822
Zhou J, Liu J, Narayan VA, Ye J (2012a) Modeling disease progression via fused sparse group lasso. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’12, pp 1095–1103
Zhou T, Shan H, Banerjee A, Sapiro G (2012b) Kernelized probabilistic matrix factorization: exploiting graphs and side information. In: 12th SIAM international conference on data mining, SIAM, pp 403–414
Zhou Z, Zhang M (2007) Multi-instance multilabel learning with application to scene classification. In: Advances in neural information processing systems, vol 19
Acknowledgements
We would like to thank the editor and anonymous reviewers for useful suggestions that improved the structure and completeness of this survey. The work of Krzysztof Dembczyński was supported by the Polish National Science Centre under Grant No. 2013/09/D/ST6/03917.
Additional information
Communicated by Johannes Fürnkranz.
Appendix A: James–Stein Estimation
In the early sixties, James and Stein discovered that the best estimator of the mean of a multivariate Gaussian distribution is not necessarily the maximum likelihood estimator. More formally, assume that \(\theta \) is the unknown mean of a multivariate Gaussian distribution with dimension \(m>2\) and a diagonal covariance matrix \(\sigma ^2 \mathbf {I}\). Consider a single observation \(\mathbf {y}\) randomly drawn from that distribution:
$$\mathbf {y} \sim \mathcal {N}(\theta , \sigma ^2 \mathbf {I}).$$
Using only this observation, the maximum likelihood estimator for \(\theta \) would be \({\hat{\theta }}_{ML} = \mathbf {y}\). James and Stein discovered that the maximum likelihood estimator is suboptimal in terms of the mean squared error of an estimator \({\hat{\theta }}\),
$$\mathbb {E}\left[ \Vert \theta - {\hat{\theta }}\Vert ^2 \right] ,$$
where the expectation is over the distribution of \(\mathbf {y}\). (In general, the expectation is taken over all samples that contain a single observation \(\mathbf {y}\). Later on we will briefly discuss a situation in which we draw more than one observation to compute the value of the estimator.) An estimator with lower squared error can be obtained by applying a regularizer to the maximum likelihood estimator. In case \(\sigma ^2\) is known, the James–Stein estimator is defined as follows:
$$ {\hat{\theta }}_{JS} = \left( 1 - \frac{(m-2)\,\sigma ^2}{\Vert \mathbf {y}\Vert ^2} \right) \mathbf {y}. $$
From a machine learning perspective, a regularizer is introduced that shrinks the estimate towards the zero vector, and hence reduces variance at the cost of introducing a bias. It has been shown that this biased estimator outperforms the maximum likelihood estimator in terms of mean squared error. The result even holds when the covariance matrix is non-diagonal, but in view of the discussion concerning target dependence, it is most remarkable for diagonal covariance matrices. In fact, in the latter case, it means that joint target regularization will be beneficial even if targets are intrinsically independent. This is somewhat in contradiction with what is commonly assumed in the machine learning literature.
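The claim can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis, that assumes the standard form of the James–Stein estimator for known \(\sigma ^2\), namely \((1 - (m-2)\sigma ^2/\Vert \mathbf {y}\Vert ^2)\,\mathbf {y}\). The function names, the choice \(m=10\), the true mean, the number of trials, and the random seed are all illustrative assumptions.

```python
import random


def james_stein(y, sigma2):
    """Shrink a single observation y towards the zero vector (needs len(y) > 2)."""
    m = len(y)
    sq_norm = sum(v * v for v in y)
    shrinkage = 1.0 - (m - 2) * sigma2 / sq_norm
    return [shrinkage * v for v in y]


def squared_error(estimate, theta):
    """Squared Euclidean distance between an estimate and the true mean."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


def compare_risks(theta, sigma2=1.0, n_trials=20000, seed=0):
    """Monte Carlo estimate of the mean squared error of both estimators."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    ml_total, js_total = 0.0, 0.0
    for _ in range(n_trials):
        # One observation y ~ N(theta, sigma^2 I): independent components.
        y = [rng.gauss(t, sigma) for t in theta]
        # The maximum likelihood estimate is the observation itself.
        ml_total += squared_error(y, theta)
        js_total += squared_error(james_stein(y, sigma2), theta)
    return ml_total / n_trials, js_total / n_trials


# Hypothetical true mean with m = 10 intrinsically independent targets.
theta = [0.5] * 10
ml_risk, js_risk = compare_risks(theta)
print(f"ML risk: {ml_risk:.3f}, James-Stein risk: {js_risk:.3f}")
```

In this setting the risk of the maximum likelihood estimator is \(m\sigma ^2\), so the first number should be close to 10, and the James–Stein risk should come out lower even though the components are sampled independently. The gap depends on how far the true mean lies from the origin: shrinkage towards zero helps most when \(\Vert \theta \Vert \) is small relative to the noise level.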
Note, however, that the advantage of the James–Stein estimate over the maximum likelihood estimate vanishes for larger samples (of more than one observation). In the second term in parentheses, \(\sigma ^2\) is then divided by the size of the sample, so that the James–Stein estimate converges to the maximum likelihood estimate when the sample size grows to infinity.
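To make the dependence on the sample size explicit, the estimator can be written out for a sample of several observations. The symbols \(n\) (sample size) and \(\bar{\mathbf {y}}\) (sample mean) are introduced here for illustration and are assumptions of this sketch. Since the mean of \(n\) independent observations follows a Gaussian distribution with mean \(\theta \) and covariance \((\sigma ^2/n)\,\mathbf {I}\), applying the single-observation estimator to \(\bar{\mathbf {y}}\) gives:

```latex
\hat{\theta}_{JS}
  = \left( 1 - \frac{(m-2)\,\sigma^2 / n}{\Vert \bar{\mathbf{y}} \Vert^2} \right) \bar{\mathbf{y}},
\qquad
\bar{\mathbf{y}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{y}_i ,
\qquad
\hat{\theta}_{ML} = \bar{\mathbf{y}} .
```

For fixed \(\theta \ne 0\), the norm \(\Vert \bar{\mathbf {y}}\Vert ^2\) approaches \(\Vert \theta \Vert ^2\) as \(n\) grows while the numerator \((m-2)\sigma ^2/n\) tends to zero, so the shrinkage factor tends to one and both estimators coincide in the limit.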
The James–Stein paradox analyzes a very simple estimation setting, for which suboptimality of the maximum likelihood estimator can be proved analytically, but the principle extends to various multi-target prediction settings. By interpreting each component of \(\theta \) as an individual target (and omitting the instance space, or reducing it to a single point), the maximum likelihood estimator coincides with independent model fitting, whereas the James–Stein estimator adopts a regularization mechanism that is very similar to most of the regularization techniques used in the machine learning literature. For some specific multivariate regression models, connections of that kind have been discussed in the statistical literature (Breiman and Friedman 1997). As long as mean squared error is considered as a loss function and errors follow a Gaussian distribution, one can immediately extend the James–Stein paradox to multivariate regression settings by assuming that target vectors \(\mathbf {y}\) are generated according to the following statistical model:
$$\mathbf {y} \sim \mathcal {N}(\theta (\mathbf {x}), \sigma ^2 \mathbf {I}),$$
where the mean is now conditioned on the input space. For other loss functions, we are not aware of any formal analysis of that kind, but it might be expected that similar conclusions can be drawn.
Cite this article
Waegeman, W., Dembczyński, K. & Hüllermeier, E. Multi-target prediction: a unifying view on problems and methods. Data Min Knowl Disc 33, 293–324 (2019). https://doi.org/10.1007/s10618-018-0595-5
DOI: https://doi.org/10.1007/s10618-018-0595-5