Multi-target prediction: a unifying view on problems and methods

Abstract

Many problem settings in machine learning are concerned with the simultaneous prediction of multiple target variables of diverse type. Amongst others, such problem settings arise in multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. These subfields of machine learning are typically studied in isolation, without highlighting or exploring important relationships. In this paper, we present a unifying view on what we call multi-target prediction (MTP) problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research.

Notes

  1. Let us remark that our notion of “representation-constructing” differs substantially from the notion of “representation learning” as commonly used in the area of deep neural networks: here, we consider the construction of vector representations for targets, something that is not commonly done in multi-target prediction extensions of deep architectures.

References

  • Abernethy J, Bach F, Evgeniou T, Vert JP (2008) A new approach to collaborative filtering: operator estimation with spectral regularization. J Mach Learn Res 10:803–826

  • Adams RP, Dahl GE, Murray I (2010) Incorporating side information into probabilistic matrix factorization using Gaussian processes. In: Grünwald P, Spirtes P (eds) The 26th conference on uncertainty in artificial intelligence, pp 1–9

  • Aho T, Ženko B, Džeroski S (2009) Rule ensembles for multi-target regression. In: Proceedings of the IEEE international conference on data mining, pp 21–30

  • Aho T, Ženko B, Džeroski S, Elomaa T (2012) Multi-target regression with rule ensembles. J Mach Learn Res 13(1):2367–2407

  • Akata Z, Reed SE, Walter D, Lee H, Schiele B (2015) Evaluation of output embeddings for fine-grained image classification. In: IEEE conference on computer vision and pattern recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015, pp 2927–2936

  • Akata Z, Perronnin F, Harchaoui Z, Schmid C (2016) Label-embedding for image classification. IEEE Trans Pattern Anal Mach Intell 38(7):1425–1438

  • Álvarez M, Rosasco L, Lawrence N (2012) Kernels for vector-valued functions: a review. Found Trends Mach Learn 4(3):195–266

  • Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853

  • Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99

  • Balasubramanian K, Lebanon G (2012) The landmark selection method for multiple output prediction. In: International conference on machine learning

  • Baldassarre L, Rosasco L, Barla A, Verri A (2012) Multi-output learning via spectral filtering. Mach Learn 87(3):259–301

  • Barutcuoglu Z, Schapire RE, Troyanskaya OG (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22(7):830–836

  • Basilico J, Hofmann T (2004) Unifying collaborative and content-based filtering. In: Proceedings of the 21st international conference on machine learning, pp 9–16

  • Ben-Hur A, Noble W (2005) Kernel methods for predicting protein–protein interactions. Bioinformatics 21(Suppl 1):38–46

  • Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Advances in neural information processing systems 28: annual conference on neural information processing systems 2015, 7–12 December 2015, Montreal, Quebec, Canada, pp 730–738

  • Bi W, Kwok J (2012) Mandatory leaf node prediction in hierarchical multilabel classification. Adv Neural Inf Process Syst 25:153–161

  • Bi W, Kwok JT (2013) Efficient multi-label classification with many labels. In: Proceedings of the 30th international conference on machine learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, pp 405–413

  • Bielza C, Li G, Larrañaga P (2011) Multi-dimensional classification with Bayesian networks. Int J Approx Reason 52(6):705–727

  • Bonilla EV, Agakov F, Williams C (2007) Kernel multi-task learning using task-specific features. In: The 11th international conference on artificial intelligence and statistics AISTATS’07, pp 43–50

  • Breiman L, Friedman J (1997) Predicting multivariate responses in multiple linear regression. J R Stat Soc B 59(1):3–54

  • Candes E, Recht B (2008) Exact low-rank matrix completion via convex optimization. Found Comput Math 9:717–772

  • Caponnetto A, Micchelli CA, Pontil M, Ying Y (2008) Universal multi-task kernels. J Mach Learn Res 9:1615–1646

  • Caruana R (1997) Multitask learning: a knowledge-based source of inductive bias. Mach Learn 28:41–75

  • Chen J, Tang L, Liu J, Ye J (2009) A convex formulation for learning shared structures from multiple tasks. In: Proceedings of the 26th annual international conference on machine learning, ACM, New York, NY, USA, ICML’09, pp 137–144

  • Cheng W, Hüllermeier E (2009) Combining instance-based learning and logistic regression for multilabel classification. Mach Learn 76(2–3):211–225

  • Cissé M, Usunier N, Artières T, Gallinari P (2013) Robust Bloom filters for large multilabel classification tasks. In: Advances in neural information processing systems, vol 26. Lake Tahoe, Nevada, United States, pp 1851–1859

  • Dembczyński K, Waegeman W, Cheng W, Hüllermeier E (2012) On label dependence and loss minimization in multi-label classification. Mach Learn 88:5–45

  • Dembczyński K, Kotłowski W, Gawel P, Szarecki A, Jaszkiewicz A (2013) Matrix factorization for travel time estimation in large traffic networks. In: Artificial intelligence and soft computing—12th international conference (ICAISC 2013). Lecture notes in computer science, vol 7895. Springer, pp 500–510

  • Dembczyński K, Kotłowski W, Waegeman W, Busa-Fekete R, Hüllermeier E (2016) Consistency of probabilistic classifier trees. In: Machine learning and knowledge discovery in databases, European conference, ECML PKDD 2016, Riva del Garda, Italy, 19–23 September 2016, proceedings, part II, pp 511–526

  • Deng J, Ding N, Jia Y, Frome A, Murphy K, Bengio S, Li Y, Neven H, Adam H (2014) Large-scale object classification using label relation graphs. In: European conference on computer vision. Lecture notes in computer science vol 8689. Springer, pp 48–64

  • Dinuzzo F (2013) Learning output kernels for multi-task problems. Neurocomputing 118:119–126

  • Dinuzzo F, Ong CS, Gehler P, Pillonetto G (2011) Learning output kernels with block coordinate descent. In: Proceedings of the international conference on machine learning

  • Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st international conference on machine learning, ICML 2014, Beijing, China, 21–26 June 2014, pp 647–655

  • Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637

  • Evgeniou T, Pontil M (2004) Regularized multi-task learning. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, pp 109–117

  • Fang Y, Si L (2011) Matrix co-factorization for recommendation with rich side information and implicit feedback. In: The 2nd international workshop on information heterogeneity and fusion in recommender systems, ACM, pp 65–69

  • Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, Mikolov T (2013) DeViSE: a deep visual-semantic embedding model. In: Advances in neural information processing systems, pp 2121–2129

  • Fu Y, Hospedales T, Xiang T, Gong S (2013) Learning multimodal latent attributes. IEEE Trans Pattern Anal Mach Intell 36(2):303–316

  • Gaujoux R, Seoighe C (2010) A flexible R package for nonnegative matrix factorization. BMC Bioinform 11:367

  • Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 580–587

  • Godbole S, Sarawagi S (2004) Discriminative methods for multi-labeled classification. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2004), pp 22–30

  • Gönen M (2012) Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28(18):2304–2310

  • Gong P, Ye J, Zhang C (2012) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’12, pp 895–903

  • Gong Y, Wang L, Guo R, Lazebnik S (2014) Multi-scale orderless pooling of deep convolutional activation features. In: European conference on computer vision. Lecture notes in computer science, vol 8695. Springer, pp 392–407

  • Gopal S, Yang Y (2013) Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, pp 257–265

  • Gopal S, Yang Y, Bai B, Niculescu-Mizil A (2012) Bayesian models for large-scale hierarchical classification. In: Proceedings of the 25th international conference on neural information processing systems, USA, NIPS’12, pp 2411–2419

  • Gu Q, Li Z, Han J (2011) Correlated multi-label feature selection. In: Proceedings of the 20th ACM international conference on information and knowledge management, ACM, New York, NY, USA, CIKM’11, pp 1087–1096

  • Guo Y, Gu S (2011) Multi-label classification using conditional dependency networks. In: Proceedings of the twenty-second international joint conference on artificial intelligence, vol 2, AAAI Press, IJCAI’11, pp 1300–1305

  • Hariharan B, Zelnik-Manor L, Vishwanathan S, Varma M (2010) Large scale max-margin multi-label classification with priors. In: International conference on machine learning. Omni Press

  • Hastie T, Tibshirani R, Friedman JH (2007) Elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, Heidelberg

  • Hayashi K, Takenouchi T, Tomioka R, Kashima H (2012) Self-measuring similarity for multi-task Gaussian process. In: Guyon I, Dror G, Lemaire V, Taylor GW, Silver DL (eds) ICML workshop on unsupervised and transfer learning, JMLR proceedings, vol 27, pp 145–154

  • Hsu D, Kakade S, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: Advances in neural information processing systems 22, pp 772–780

  • Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artif Intell 172(16–17):1897–1916

  • Izenman A (1975) Reduced-rank regression for the multivariate linear model. J Multivar Anal 5:248–262

  • Jacob L, Vert J (2008) Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24(19):2149–2156

  • Jacob L, Bach F, Vert JP (2008) Clustered multi-task learning: a convex formulation. In: Advances in neural information processing systems

  • Jain P, Netrapalli P, Sanghavi S (2013) Low-rank matrix completion using alternating minimization. In: Proceedings of the forty-fifth annual ACM symposium on theory of computing, ACM, New York, NY, USA, pp 665–674

  • Jalali A, Sanghavi S, Ravikumar P, Ruan C (2010) A dirty model for multi-task learning. In: Neural information processing systems, pp 964–972

  • James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the fourth Berkeley symposium on mathematics, statistics and probability theory, pp 361–379

  • Jawanpuria P, Lapin M, Hein M, Schiele B (2015) Efficient output kernel learning for multiple tasks. In: Advances in neural information processing systems, vol 28, pp 1189–1197

  • Kashima H, Kato T, Yamanishi Y, Sugiyama M, Tsuda K (2009) Link propagation: a fast semi-supervised learning algorithm for link prediction. In: SIAM international conference on data mining (SDM’09), SIAM, pp 1099–1110

  • Kong X, Yu PS (2012) gMLC: a multi-label feature selection framework for graph classification. Knowl Inf Syst 31(2):281–305

  • Krichene W, Mayoraz N, Rendle S, Zhang L, Yi X, Hong L, Chi E, Anderson J (2018) Efficient training on very large corpora via Gramian estimation. arXiv e-prints

  • Kula M (2015) Metadata embeddings for user and item cold-start recommendations. In: Proceedings of the 2nd workshop on new trends on content-based recommender systems co-located with 9th ACM conference on recommender systems, pp 14–21

  • Lampert CH, Nickisch H, Harmeling S (2009) Learning to detect unseen object classes by between-class attribute transfer. In: Conference on computer vision and pattern recognition

  • Larochelle H, Erhan D, Bengio Y (2008) Zero-data learning of new tasks. In: 23rd national conference on artificial intelligence (AAAI’08). AAAI Press, pp 646–651

  • Lawrence N, Urtasun R (2009) Non-linear matrix factorization with Gaussian processes. In: Proceedings of the 26th annual international conference on machine learning

  • Lee G, Yang E, Hwang SJ (2016) Asymmetric multi-task learning based on task relatedness and confidence. In: Proceedings of the 33rd international conference on machine learning, ICML 2016, New York City, NY, USA, 19–24 June 2016, pp 230–238

  • Liu W, Johnson D (2009) Clustering and its application in multi-target prediction. Curr Opin Drug Discov Develop 12(1):98–107

  • Liu J, Kuipers B, Savarese S (2011) Recognizing human actions by attributes. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 3337–3344

  • Liu H, Sun J, Guan J, Zheng J, Zhou S (2015) Improving compound-protein interaction prediction by building up highly credible negative samples. Bioinformatics 31(12):i221–i229

  • Loza Mencía E, Janssen F (2016) Learning rules for multi-label classification: a stacking and a separate-and-conquer approach. Mach Learn 105(1):77–126

  • Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11:2287–2322

  • Menon A, Elkan C (2010) A log-linear model with latent features for dyadic prediction. In: The 10th IEEE international conference on data mining (ICDM), pp 364–373

  • Menon A, Elkan C (2011) Link prediction via matrix factorization. In: Machine learning and knowledge discovery in databases (ECML PKDD 2011). Lecture notes in computer science, vol 6912. Springer, pp 437–452

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR arXiv:1301.3781

  • Nam J, Loza Mencía E, Kim HJ, Fürnkranz J (2015) Predicting unseen labels using label hierarchies in large-scale multi-label learning. In: European conference on machine learning. Lecture notes in computer science, vol 9284. Springer, pp 102–118

  • Nam J, Loza Mencía E, Fürnkranz J (2016) All-in text: learning document, label, and word representations jointly. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, 12–17 February 2016, Phoenix, Arizona, USA, pp 1948–1954

  • Obozinski G, Taskar B, Jordan MI (2010) Joint covariate selection and joint subspace selection for multiple classification problems. Stat Comput 20(2):231–252

  • Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 1717–1724

  • Oyama S, Manning C (2004) Using feature conjunctions across examples for learning pairwise classifiers. In: European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 3201. Springer, pp 322–333

  • Pahikkala T, Waegeman W, Tsivtsivadze E, Salakoski T, De Baets B (2010) Learning intransitive reciprocal relations with kernel methods. Eur J Oper Res 206(3):676–685

  • Pahikkala T, Airola A, Stock M, Baets BD, Waegeman W (2013) Efficient regularized least-squares algorithms for conditional ranking on relational data. Mach Learn 93(2–3):321–356

  • Pahikkala T, Stock M, Airola A, Aittokallio T, De Baets B, Waegeman W (2014) A two-step learning approach for solving full and almost full cold start problems in dyadic prediction. In: Machine learning and knowledge discovery in databases (ECML PKDD 2014). Lecture notes in computer science, vol 8725. Springer, pp 517–532

  • Palatucci M, Hinton G, Pomerleau D, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: Advances in neural information processing systems, pp 1410–1418

  • Papagiannopoulou C, Tsoumakas G, Tsamardinos I (2015) Discovering and exploiting deterministic label relationships in multi-label learning. In: Cao L, Zhang C, Joachims T, Webb GI, Margineantu DD, Williams G (eds) KDD, ACM, pp 915–924

  • Papagiannopoulou C, Miralles DG, Demuzere M, Verhoest N, Waegeman W (2018) Global hydro-climatic biomes identified via multi-task learning. Geosci Model Dev 11:4139–4153

  • Park SH, Fürnkranz J (2008) Multi-label classification with label constraints. In: ECML PKDD 2008 workshop on preference learning (PL-08, Antwerp, Belgium)

  • Park ST, Chu W (2009) Pairwise preference regression for cold-start recommendation. In: The third ACM conference on recommender systems, ACM, pp 21–28

  • Park Y, Marcotte EM (2012) Flaws in evaluation schemes for pair-input computational predictions. Nat Methods 9(12):1134–1136

  • Pelossof R, Singh I, Yang JL, Weirauch MT, Hughes TR, Leslie CS (2015) Affinity regression predicts the recognition code of nucleic acid-binding proteins. Nat Biotechnol 33(12):1242–1249

  • Prabhu Y, Kag A, Harsola S, Agrawal R, Varma M (2018) Parabel: partitioned label trees for extreme classification with application to dynamic search advertising. In: The web conference (WWW), pp 993–1002

  • Rai P, Daumé III H (2009) Multi-label prediction via sparse infinite CCA. In: Proceedings of the conference on neural information processing systems (NIPS)

  • Rangwala H, Naik A (2017) Large scale hierarchical classification: foundations, algorithms and applications. KDD Tutorial, Halifax

  • Raymond R, Kashima H (2010) Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) European conference on machine learning and knowledge discovery in databases. Lecture notes in computer science, vol 6323. Springer, pp 131–147

  • Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, Washington, DC, USA, pp 512–519

  • Read J (2013) Multi-dimensional classification with super-classes. IEEE Trans Knowl Data Eng 99:1

  • Rohrbach M, Stark M, Schiele B (2011) Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1641–1648

  • Romera-Paredes B, Torr P (2015) An embarrassingly simple approach to zero-shot learning. In: Proceedings of the 32nd international conference on machine learning, vol 37, pp 2152–2161

  • Rousu J, Saunders C, Szedmak S, Shawe-Taylor J (2006) Kernel-based learning of hierarchical multilabel classification models. J Mach Learn Res 7:1601–1626

  • Schäfer D, Hüllermeier E (2015) Dyad ranking using a bilinear Plackett–Luce model. In: Proceedings ECML/PKDD–2015, European conference on machine learning and knowledge discovery in databases, Porto, Portugal

  • Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In: International conference on learning representations (ICLR)

  • Shan H, Banerjee A (2010) Generalized probabilistic matrix factorizations for collaborative filtering. In: Webb GI, Liu B, Zhang C, Gunopulos D, Wu X (eds) The 10th IEEE international conference on data mining (ICDM). IEEE Computer Society, pp 1025–1030

  • Silla CN, Freitas AA (2010) A survey of hierarchical classification across different application domains. Data Min Knowl Discov 22(1–2):31–72

  • Socher R, Ganjoo M, Sridhar H, Bastani O, Manning CD, Ng AY (2013) Zero-shot learning through cross-modal transfer. In: Advances in neural information processing systems 26, pp 935–943

  • Spolaôr N, Monard MC, Tsoumakas G, Lee HD (2016) A systematic review of multi-label feature selection and a new method based on label construction. Neurocomputing 180(C):3–15

  • Spyromitros-Xioufis E, Tsoumakas G, Groves W, Vlahavas I (2016) Multi-target regression via input space expansion: treating targets as inputs. Mach Learn 104(1):55–98

  • Stock M, Fober T, Hüllermeier E, Glinca S, Klebe G, Pahikkala T, Airola A, De Baets B, Waegeman W (2014) Identification of functionally related enzymes by learning-to-rank methods. IEEE Trans Comput Biol Bioinform 11(6):1157–1169

  • Stock M, Pahikkala T, Airola A, Baets BD, Waegeman W (2016) Efficient pairwise learning using kernel ridge regression: an exact two-step method. arXiv:1606.04275

  • Tai F, Lin HT (2010) Multi-label classification with principle label space transformation. In: Second international workshop on learning from multi-label data (MLD 2010), in conjunction with ICML/COLT 2010

  • Tai F, Lin HT (2012) Multilabel classification with principal label space transformation. Neural Comput 24(9):2508–2542

  • Takács G, Pilászy I, Németh B, Tikk D (2008) Matrix factorization and neighbor based algorithms for the Netflix Prize problem. In: Proceedings of the 2008 ACM conference on recommender systems. ACM Press, New York, pp 267–274

  • Todorovski L, Blockeel H, Džeroski S (2002) Ranking with predictive clustering trees. In: Proceedings of the European conference on machine learning

  • Tsoumakas G, Katakis I (2007) Multi label classification: an overview. Int J Data Warehous Min 3(3):1–13

  • Van der Merwe A, Zidek J (1980) Multivariate regression analysis and canonical variates. Can J Stat 8:27–39

  • Van Loan CF (2000) The ubiquitous Kronecker product. J Comput Appl Math 123(1–2):85–100

  • Van Peer G, Paepe AD, Stock M, Anckaert J, Volders PJ, Vandesompele J, Baets BD, Waegeman W (2017) miSTAR: miRNA target prediction through modeling quantitative and qualitative miRNA binding site information in a stacked model structure. Nucl Acids Res 45:e51

  • Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H (2008) Decision trees for hierarchical multi-label classification. Mach Learn 73(2):185–214

  • Vert JP, Qiu J, Noble WS (2007) A new pairwise kernel for biological network inference with support vector machines. BMC Bioinform 8(S–10):1–10

  • Volkovs M, Zemel RS (2012) Collaborative ranking with 17 parameters. In: Advances in neural information processing systems, pp 2303–2311

  • Waegeman W, Pahikkala T, Airola A, Salakoski T, Stock M, De Baets B (2012) A kernel-based framework for learning graded relations from data. IEEE Trans Fuzzy Syst 20(6):1090–1101

  • Waegeman W, Dembczyński K, Jachnik A, Cheng W, Hüllermeier E (2014) On the Bayes-optimality of F-measure maximizers. J Mach Learn Res 15:3333–3388

  • Wang F, Wang X, Li T (2009) Semi-supervised multi-task learning with task regularizations. In: IEEE international conference on data mining, pp 562–568

  • Wei Y, Xia W, Lin M, Huang J, Ni B, Dong J, Zhao Y, Yan S (2016) HCP: a flexible CNN framework for multi-label image classification. IEEE Trans Pattern Anal Mach Intell 38(9):1901–1907

  • Weston J, Chapelle O, Elisseeff A, Schölkopf B, Vapnik V (2002) Kernel dependency estimation. In: Advances in neural information processing systems, pp 873–880

  • Wicker J, Tyukin A, Kramer S (2016) A nonlinear label compression and transformation method for multi-label classification using autoencoders. In: Advances in knowledge discovery and data mining: 20th Pacific-Asia conference, PAKDD 2016, Auckland, New Zealand

  • Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–259

  • Wu L, Fisch A, Chopra S, Adams K, Bordes A, Weston J (2018) StarSpace: embed all the things! In: AAAI conference on artificial intelligence

  • Xian Y, Akata Z, Sharma G, Nguyen QN, Hein M, Schiele B (2016) Latent embeddings for zero-shot classification. In: IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, pp 69–77

  • Xian Y, Lampert C, Schiele B, Akata Z (2018) Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2018.2857768

  • Xue Y, Liao X, Carin L, Krishnapuram B (2007) Multi-task learning for classification with Dirichlet process priors. J Mach Learn Res 8:35–63

  • Yen IE, Huang X, Ravikumar P, Zhong K, Dhillon IS (2016) PD-Sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In: Proceedings of the 33rd international conference on machine learning, New York City, NY, USA, pp 3069–3077

  • Zhang Y, Schneider J (2011) Multi-label output codes using canonical correlation analysis. In: Uncertainty in artificial intelligence

  • Zhang D, Shen D (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2):895–907

  • Zhang Y, Yeung D (2010) A convex formulation for learning task relationships in multi-task learning. In: Proceedings of the 26th conference on uncertainty in artificial intelligence (UAI), pp 733–742

  • Zhou J, Chen J, Ye J (2011a) Clustered multi-task learning via alternating structure optimization. In: Advances in neural information processing systems

  • Zhou J, Yuan L, Liu J, Ye J (2011b) A multi-task learning formulation for predicting disease progression. In: Apté C, Ghosh J, Smyth P (eds) Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 814–822

  • Zhou J, Liu J, Narayan VA, Ye J (2012a) Modeling disease progression via fused sparse group lasso. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’12, pp 1095–1103

  • Zhou T, Shan H, Banerjee A, Sapiro G (2012b) Kernelized probabilistic matrix factorization: exploiting graphs and side information. In: 12th SIAM international conference on data mining, SIAM, pp 403–414

  • Zhou Z, Zhang M (2007) Multi-instance multilabel learning with application to scene classification. In: Advances in neural information processing systems, vol 19

Acknowledgements

We would like to thank the editor and anonymous reviewers for useful suggestions that improved the structure and completeness of this survey. The work of Krzysztof Dembczynski was supported by the Polish National Science Centre under Grant No. 2013/09/D/ST6/03917.

Author information

Correspondence to Willem Waegeman.

Additional information

Communicated by Johannes Fürnkranz.

Appendix A: James–Stein Estimation

In the early sixties, James and Stein discovered that the best estimator of the mean of a multivariate Gaussian distribution is not necessarily the maximum likelihood estimator. More formally, assume that \(\theta \) is the unknown mean of a multivariate Gaussian distribution of dimension \(m>2\) with a diagonal covariance matrix, and consider a single observation \(\mathbf {y}\) randomly drawn from that distribution:

$$\begin{aligned} \mathbf {y} \sim {\mathscr {N}}(\theta , \sigma ^2 I) \,. \end{aligned}$$

Using only this observation, the maximum likelihood estimator for \(\theta \) is \({\hat{\theta }}_{ML} = \mathbf {y}\). James and Stein showed that this estimator is suboptimal in terms of mean squared error

$$\begin{aligned} {\mathbb {E}} \big [||\theta - {\hat{\theta }}||^2 \big ] \,, \end{aligned}$$

where the expectation is over the distribution of \(\mathbf {y}\). (More generally, the expectation is taken over samples; here, every sample consists of a single observation \(\mathbf {y}\). Below we briefly discuss the case in which more than one observation is used to compute the estimator.) An estimator with lower mean squared error can be obtained by applying a regularizer to the maximum likelihood estimator. If \(\sigma ^2\) is known, the James–Stein estimator is defined as follows:

$$\begin{aligned} {\hat{\theta }}_{JS} = \left( 1 - \frac{(m-2)\sigma ^2}{||\mathbf {y}||^2} \right) \mathbf {y} \,. \end{aligned}$$

From a machine learning perspective, this introduces a regularizer that shrinks the estimate towards the zero vector, and hence reduces variance at the cost of introducing a bias. It has been shown that this biased estimator outperforms the maximum likelihood estimator in terms of mean squared error. The result even holds when the covariance matrix is non-diagonal, but in view of the discussion concerning target dependence, it is most remarkable for diagonal covariance matrices: in that case, it implies that joint regularization over targets is beneficial even if the targets are intrinsically independent. This somewhat contradicts what is commonly assumed in the machine learning literature.
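
This dominance is easy to verify numerically. The following minimal sketch is our illustration, not part of the original analysis; it assumes Python with NumPy, repeatedly draws a single observation from the model above, and compares the empirical mean squared errors of the two estimators:

```python
# Monte Carlo comparison of the maximum likelihood and James-Stein
# estimators for the mean of an m-dimensional Gaussian with diagonal
# covariance, estimated from a single observation per trial.
import numpy as np

rng = np.random.default_rng(0)
m, sigma2, n_trials = 10, 1.0, 100_000
theta = rng.normal(size=m)  # arbitrary true mean; components are independent

mse_ml = mse_js = 0.0
for _ in range(n_trials):
    y = theta + np.sqrt(sigma2) * rng.normal(size=m)  # single observation
    shrink = 1.0 - (m - 2) * sigma2 / np.dot(y, y)    # James-Stein factor
    mse_ml += np.sum((theta - y) ** 2)                # ML estimate is y itself
    mse_js += np.sum((theta - shrink * y) ** 2)       # shrunk estimate

print(f"MSE, maximum likelihood: {mse_ml / n_trials:.3f}")
print(f"MSE, James-Stein:        {mse_js / n_trials:.3f}")
```

In runs of this sketch, the James–Stein error comes out clearly below the maximum likelihood error, even though the components of \(\theta \) are estimated from independent noise.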

Note, however, that the advantage of the James–Stein estimate over the maximum likelihood estimate vanishes for larger samples (of more than one observation). In the shrinkage factor, \(\sigma ^2\) is then divided by the sample size, so that the James–Stein estimate converges to the maximum likelihood estimate as the sample size grows to infinity.
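
Making this explicit, for a sample of \(n\) observations with sample mean \(\bar{\mathbf {y}}\), so that \(\bar{\mathbf {y}} \sim {\mathscr {N}}(\theta , \sigma ^2 I / n)\), the estimator takes the standard form

$$\begin{aligned} {\hat{\theta }}_{JS} = \left( 1 - \frac{(m-2)\,\sigma ^2 / n}{||\bar{\mathbf {y}}||^2} \right) \bar{\mathbf {y}} \,, \end{aligned}$$

in which the shrinkage factor tends to one as \(n\) grows.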

The James–Stein paradox analyzes a very simple estimation setting, for which suboptimality of the maximum likelihood estimator can be proved analytically, but the principle extends to various multi-target prediction settings. By interpreting each component of \(\theta \) as an individual target (and omitting the instance space, or reducing it to a single point), the maximum likelihood estimator coincides with independent model fitting, whereas the James–Stein estimator adopts a regularization mechanism that is very similar to most of the regularization techniques used in the machine learning literature. For some specific multivariate regression models, connections of that kind have been discussed in the statistical literature (Breiman and Friedman 1997). As long as mean squared error is considered as a loss function and errors follow a Gaussian distribution, one can immediately extend the James–Stein paradox to multivariate regression settings by assuming that target vectors \(\mathbf {y}\) are generated according to the following statistical model:

$$\begin{aligned} \mathbf {y} \sim {\mathscr {N}}(\theta (\mathbf {x}), \sigma ^2 I) \, , \end{aligned}$$

where the mean is now conditioned on the input space. For other loss functions, we are not aware of any formal analysis of that kind, but it might be expected that similar conclusions can be drawn.
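
To make this extension tangible, the following sketch, again our illustration under the stated squared-error and Gaussian assumptions rather than a method from the literature, fits \(m\) intrinsically independent regression targets separately by least squares and then shrinks the stacked coefficient vector across targets in James–Stein fashion; the input is normalized so that the coefficient estimates are exactly Gaussian with known variance:

```python
# Toy multivariate regression with m independent targets y_j = beta_j * x + noise.
# Each target is fitted separately by least squares; the m coefficient
# estimates are then shrunk jointly, mirroring the James-Stein construction.
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma2, n_trials = 10, 50, 1.0, 20_000
beta = rng.normal(size=m)  # true per-target coefficients, drawn independently

err_ols = err_js = 0.0
for _ in range(n_trials):
    x = rng.normal(size=n)
    x /= np.linalg.norm(x)  # normalize so each OLS coefficient has variance sigma2
    Y = np.outer(x, beta) + np.sqrt(sigma2) * rng.normal(size=(n, m))
    beta_ols = x @ Y  # per-target least squares (since sum(x**2) = 1)
    shrink = 1.0 - (m - 2) * sigma2 / np.dot(beta_ols, beta_ols)
    err_ols += np.sum((beta - beta_ols) ** 2)
    err_js += np.sum((beta - shrink * beta_ols) ** 2)

print(f"coefficient MSE, independent least squares: {err_ols / n_trials:.3f}")
print(f"coefficient MSE, jointly shrunk:            {err_js / n_trials:.3f}")
```

Even though the targets share nothing but the regularizer, the jointly shrunk estimates achieve a lower coefficient error, which is exactly the point the James–Stein paradox makes for multi-target prediction.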

Cite this article

Waegeman, W., Dembczyński, K. & Hüllermeier, E. Multi-target prediction: a unifying view on problems and methods. Data Min Knowl Disc 33, 293–324 (2019). https://doi.org/10.1007/s10618-018-0595-5
