## Abstract

Multitask learning (MTL) can improve performance on one task by sharing representations with one or more related auxiliary tasks. Usually, MTL networks are trained on a composite loss function formed by a fixed weighted combination of separate task losses. In practice, however, static loss weights lead to poor results for two reasons. First, the relevance of the auxiliary tasks gradually drifts throughout the learning process. Second, for minibatch-based optimization, the optimal task weights vary significantly from one update to the next depending on the minibatch sample composition. Here, we introduce HydaLearn, an intelligent weighting algorithm that connects the main-task gain to the individual task gradients, to inform dynamic loss weighting at the minibatch level, addressing the two above shortcomings. We demonstrate significant performance increases on synthetic data and two real-world data sets.

## Data Availability

All datasets used are open source, and the simulations are described in sufficient detail to be reproduced.

## References

Caruana R (1997) Multitask learning. Machine learning 28(1):41–75

Du Y, Czarnecki W M, Jayakumar S M, Pascanu R, Lakshminarayanan B (2018) Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224

Sener O, Koltun V (2018) Multi-task learning as multi-objective optimization. In: Advances in Neural Information Processing Systems, pp 527–538

Chen Z, Badrinarayanan V, Lee C-Y, Rabinovich A (2017) Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257

Ruder S, Bingel J, Augenstein I, Søgaard A (2019) Latent multi-task architecture learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 4822–4829

Liu S, Johns E, Davison A J (2019) End-to-end multi-task learning with attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1871–1880

Guo M, Haque A, Huang D-A, Yeung S, Fei-Fei L (2018) Dynamic task prioritization for multitask learning. In: Proceedings of the European conference on computer vision (ECCV), pp 270–287

Lin X, Baweja H, Kantor G, Held D (2019) Adaptive auxiliary task weighting for reinforcement learning. In: Advances in Neural Information Processing Systems, pp 4773–4784

Hochreiter S, Schmidhuber J (1997) Flat minima. Neural Comput 9(1):1–42

Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: Stochastic gradient descent with momentum and difference. Appl Intell 52(4):3939–3953

Vandenhende S, Georgoulis S, De Brabandere B, Van Gool L (2020) Branched multi-task networks: deciding what layers to share. BMVC

Bruggemann D, Kanakis M, Georgoulis S, Van Gool L (2020) Automated search for resource-efficient branched multi-task networks. arXiv:2008.10292

Johnson A E W, Pollard T J, Shen L, Lehman L-W H, Feng M, Ghassemi M, Moody B, Szolovits P, Celi L A, Mark R G (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3:160035

Wei T, Wang S, Zhong J, Liu D, Zhang J (2021) A review on evolutionary multi-task optimization: Trends and challenges. IEEE Trans Evol Comput

Maurer A, Pontil M, Romera-Paredes B (2016) The benefit of multitask representation learning. J Mach Learn Res 17(1):2853–2884

Caruana R (1993) Multitask learning: A knowledge-based source of inductive bias. In: ICML

Vandenhende S, Georgoulis S, Van Gansbeke W, Proesmans M, Dai D, Van Gool L (2021) Multi-task learning for dense prediction tasks: A survey. IEEE Trans Pattern Anal Mach Intell

Harutyunyan H, Khachatrian H, Kale D C, Ver Steeg G, Galstyan A (2019) Multitask learning and benchmarking with clinical time series data. Scientific Data 6(1):1–18

Guo H, Pasunuru R, Bansal M (2019) Autosem: Automatic task selection and mixing in multi-task learning. arXiv preprint arXiv:1904.04153

Zhang Z, Luo P, Loy C C, Tang X (2014) Facial landmark detection by deep multi-task learning. In: European conference on computer vision. Springer, pp 94–108

Bingel J, Søgaard A (2017) Identifying beneficial task relations for multi-task learning in deep neural networks. arXiv preprint arXiv:1702.0830

Rai P, Daumé III H (2010) Infinite predictor subspace models for multitask learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 613–620

Sun G, Probst T, Paudel D P, Popović N, Kanakis M, Patel J, Dai D, Van Gool L (2021) Task switching network for multi-task learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8291–8300

Romera-Paredes B, Argyriou A, Berthouze N, Pontil M (2012) Exploiting unrelated tasks in multi-task learning. In: International conference on artificial intelligence and statistics, pp 951–959

Fifty C, Amid E, Zhao Z, Yu T, Anil R, Finn C (2021) Efficiently identifying task groupings for multi-task learning. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J (eds) Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2021/file/e77910ebb93b511588557806310f78f1-Paper.pdf, vol 34. Curran Associates, Inc., pp 27503–27516

Wu Y, Song Y, Huang H, Ye F, Xie X, Jin H (2021) Enhancing graph neural networks via auxiliary training for semi-supervised node classification. Knowl-Based Syst 220:106884

Yang Z, Zhang Y, Yu J, Cai J, Luo J (2018) End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perceptions. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, pp 2289–2294

Zhou Y, Chen H, Li Y, Liu Q, Xu X, Wang S, Yap P-T, Shen D (2021) Multi-task learning for segmentation and classification of tumors in 3d automated breast ultrasound images. Med Image Anal 70:101918

Verboven S, Martin N (2022) Combining the clinical and operational perspectives in heterogeneous treatment effect inference in healthcare processes. In: International Conference on Process Mining. Springer, pp 327–339

Dabre R, Chu C, Kunchukuttan A (2020) A survey of multilingual neural machine translation. ACM Computing Surveys (CSUR) 53(5):1–38

Caruana R, Baluja S, Mitchell T (1996) Using the future to "sort out" the present: Rankprop and multitask learning for medical risk evaluation. In: Advances in neural information processing systems, pp 959–965

Baesens B, Van Vlasselaer V, Verbeke W (2015) Fraud analytics using descriptive, predictive, and social network techniques: a guide to data science for fraud detection. Wiley, New York

Baesens B, Van Gestel T, Stepanova M, Van den Poel D, Vanthienen J (2005) Neural network survival analysis for personal loan data. J Oper Res Soc 56(9):1089–1098

Yu T, Kumar S, Gupta A, Levine S, Hausman K, Finn C (2020) Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782

Kendall A, Gal Y, Cipolla R (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7482–7491

Jean S, Firat O, Johnson M (2019) Adaptive scheduling for multi-task learning. arXiv preprint arXiv:1909.06434

Sutton R S (1992) Adapting bias by gradient descent: An incremental version of delta-bar-delta. In: AAAI, pp 171–176

Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer, pp 177–186

Kingma D P, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

Smith S L, Le Q V (2017) A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451

McCandlish S, Kaplan J, Amodei D, Team OpenAI Dota (2018) An empirical model of large-batch training. arXiv preprint arXiv:1812.06162

Keskar N S, Mudigere D, Nocedal J, Smelyanskiy M, Tang P T P (2016) On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836

Zhang C, Liao Q, Rakhlin A, Miranda B, Golowich N, Poggio T (2018) Theory of deep learning iib: Optimization properties of sgd. arXiv preprint arXiv:1801.02254

Golmant N, Vemuri N, Yao Z, Feinberg V, Gholami A, Rothauge K, Mahoney M W, Gonzalez J (2018) On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941

Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, Tulloch A, Jia Y, He K (2017) Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677

Hoffer E, Hubara I, Soudry D (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In: Advances in Neural Information Processing Systems, pp 1731–1741

Pang J, Chen K, Shi J, Feng H, Ouyang W, Lin D (2019) Libra r-cnn: Towards balanced learning for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 821–830

Li B, Liu Y, Wang X (2019) Gradient harmonized single-stage detector. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 8577–8584

Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988

Oksuz K, Cam B C, Kalkan S, Akbas E (2020) Imbalance problems in object detection: A review. IEEE Trans Pattern Anal Mach Intell

Ren M, Zeng W, Yang B, Urtasun R (2018) Learning to reweight examples for robust deep learning. In: International conference on machine learning. PMLR, pp 4334–4343

Dong T N, Brogden G, Gerold G, Khosla M (2021) A multitask transfer learning framework for the prediction of virus-human protein–protein interactions. BMC bioinformatics 22(1):1–24

Caruana R (2000) Learning from imbalanced data: Rank metrics and extra tasks. In: Proc. Am. Assoc. for Artificial Intelligence (AAAI) Conf, pp 51–57

Saeed M, Villarroel M, Reisner A T, Clifford G, Lehman L-W, Moody G, Heldt T, Kyaw T H, Moody B, Mark R G (2011) Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical Care Med 39(5):952

Purushotham S, Meng C, Che Z, Liu Y (2017) Benchmark of deep learning models on large healthcare mimic datasets. arXiv preprint arXiv:1710.08531

Johnson A E W, Pollard T J, Mark R G (2017) Reproducibility in critical care: a mortality prediction case study. In: Machine Learning for Healthcare Conference, pp 361–376

Gentimis T, Ala’J A, Durante A, Cook K, Steele R (2017) Predicting hospital length of stay using neural networks on mimic iii data. In: 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, pp 1194–1201

Nemati S, Holder A, Razmi F, Stanley M D, Clifford G D, Buchman T G (2018) An interpretable machine learning model for accurate prediction of sepsis in the icu. Critical care medicine 46(4):547

Suresh H, Gong J J, Guttag J V (2018) Learning tasks for multitask learning: Heterogenous patient populations in the icu. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 802–810

Harutyunyan H, Khachatrian H, Kale D C, Steeg G V, Galstyan A (2017) Multitask learning and benchmarking with clinical time series data. arXiv:1703.07771

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

## Additional information

### Code availability

Code will be released upon acceptance.

### Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendix A: Data and Implementation Details

This appendix describes the data and implementation details. The HydaLearn code will be released upon acceptance of the paper. Table 4 describes the features of the datasets, and Table 5 reports the final hyperparameters for all experiments.

### A.1 Toy Example Details

Following [4], we sample two regression tasks from the following functions:

$$f_i(\mathbf{x}) = \sigma_i \tanh\big((\mathbf{B} + \boldsymbol{\epsilon}_i)\,\mathbf{x}\big),$$

where **x** is the input vector, and **𝜖**_{i} and **B** are constant matrices representing a task-dependent (with *i* as task indicator) and a shared component, respectively. The *σ*_{i} linearly affect the scale of the tasks, and the hyperbolic tangent function (tanh) imposes a nonlinear transformation.

For both experiments, **B** was sampled i.i.d. from a Gaussian with mean **0** and covariance 10*I*, and **𝜖**_{i} from a Gaussian with mean **0** and covariance 3.5*I*. To represent common scale differences between tasks, the scaling parameters *σ*_{m} and *σ*_{a} were set to 1 and 10, respectively.
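Under these settings, the toy data generation can be sketched in a few lines of numpy (the dimensions `d_in` and `d_out` and the sample count are our own assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 250, 100, 512   # assumed dimensions, not from the paper

# Shared component B ~ N(0, 10 I); task-specific eps_i ~ N(0, 3.5 I)
B = rng.normal(0.0, np.sqrt(10.0), size=(d_out, d_in))
eps = {i: rng.normal(0.0, np.sqrt(3.5), size=(d_out, d_in)) for i in ("m", "a")}
sigma = {"m": 1.0, "a": 10.0}    # scaling parameters from the appendix

def task_targets(x, i):
    """y_i = sigma_i * tanh((B + eps_i) x)."""
    return sigma[i] * np.tanh((B + eps[i]) @ x)

X = rng.normal(size=(n, d_in))
y_main = np.array([task_targets(x, "m") for x in X])
y_aux = np.array([task_targets(x, "a") for x in X])
```

Because tanh is bounded, the targets of the auxiliary task live on a scale roughly ten times that of the main task, which is exactly the imbalance the weighting algorithms must handle.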

Further toy example dataset details are provided in Table 3.

For GradNorm and Olaux, we take the recommended values [4, 8]. We note that for the experiments with real data, we perform hyperparameter optimization.

### A.2 Preprocessing and Implementation: MIMIC

The MIMIC-III database [13, 54] comprises deidentified data from over 60,000 intensive care unit (ICU) stays. MIMIC-III is a popular resource for machine learning research on a variety of tasks, such as mortality prediction [55, 56], length-of-stay prediction [55, 57], and sepsis prediction [58]. Since such tasks are often related, the database is also commonly used as a benchmark for multitask learning algorithms [18, 59]. Clinical data are often very noisy; in the MIMIC dataset, the base features are recorded only sparsely and at irregular intervals, requiring heavy imputation.

We predict in-hospital mortality (classification) as our main task, using features collected in the first 48 h of stay. For the auxiliary task, we use length of stay (regression), which ends with either death or discharge from the hospital. This experiment thus combines a classification loss (evaluated by the area under the curve (AUC)) with a regression loss (MSE), so the two losses operate on different scales, causing imbalances during learning. Furthermore, the dataset is high-dimensional relative to its sample size, which can also impede learning but can be mitigated through the regularizing effect of MTL. Further dataset details are given below.
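To make the scale mismatch concrete, here is a minimal sketch of a statically weighted composite loss for one classification and one regression task (our own illustrative code, not the paper's implementation):

```python
import numpy as np

def bce(y_true, p):
    """Binary cross-entropy for the mortality (classification) task."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(np.mean(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p)))

def mse(y_true, y_pred):
    """Mean squared error for the length-of-stay (regression) task."""
    return float(np.mean((y_true - y_pred) ** 2))

def composite_loss(w_main, w_aux, cls, reg):
    """Weighted sum L = w_m * L_m + w_a * L_a; static weights are the
    baseline that dynamic schemes such as HydaLearn replace."""
    return w_main * bce(*cls) + w_aux * mse(*reg)

# Length of stay is measured in days, so its squared error can dwarf the
# bounded cross-entropy term unless the weights compensate.
cls = (np.array([0, 1, 1, 0]), np.array([0.2, 0.7, 0.9, 0.4]))
reg = (np.array([3.0, 12.0, 30.0, 5.0]), np.array([4.0, 10.0, 20.0, 6.0]))
print(composite_loss(0.5, 0.5, cls, reg))
```

With equal static weights, the regression term dominates the gradient signal in this toy example, which is the kind of imbalance that motivates dynamic reweighting.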

Only episodes lasting longer than 48 hours are considered. For in-hospital mortality preprocessing, we follow the same approach as the logistic regression baseline in [60]^{Footnote 1} to enrich the 17-base-feature dataset. First, a given sequence is divided into 7 subsequences. Next, features are extracted for each subsequence from the statistical characteristics of the original time-series variables, specifically mean, standard deviation, minimum, maximum, skewness, and number of measurements [60]. This procedure yields 714 features (7 subsequences × 6 statistics × 17 base features).
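The per-subsequence statistic extraction can be sketched as follows (an illustrative re-implementation of the procedure from [60], assuming hourly rows and ignoring the imputation step):

```python
import numpy as np

def subsequence_features(series, n_sub=7):
    """Split one time series into n_sub chunks and extract six statistics
    per chunk: mean, std, min, max, skewness, and measurement count."""
    feats = []
    for chunk in np.array_split(np.asarray(series, dtype=float), n_sub):
        m, s = chunk.mean(), chunk.std()
        # Sample skewness; defined as zero when the chunk is constant.
        skew = 0.0 if s == 0 else float(np.mean(((chunk - m) / s) ** 3))
        feats.extend([m, s, chunk.min(), chunk.max(), skew, len(chunk)])
    return np.array(feats)

# 17 base features x 7 subsequences x 6 statistics = 714 features per stay.
episode = np.random.default_rng(1).normal(size=(48, 17))  # 48 hourly rows (assumed layout)
features = np.concatenate([subsequence_features(episode[:, j]) for j in range(17)])
print(features.shape)
```

Running this on one 48-row episode yields the 714-dimensional feature vector described above.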

For all models, we used the same encoder/decoder setup, with a random sweep over a range of possible configurations of the 'static' baseline to determine the backbone and batch-size parameters. The shared layers and both task-specific heads of the network consist of 4 layers with 48 neurons and 2 layers with 24 neurons, respectively. The learning rate, batch size, and algorithm-specific hyperparameters were determined by grid search on validation-set performance over the following ranges: LR = [1i, 2.5i, 5i, 7.5i] for i in [E-2, E-3, E-4]; BatchSize = [4, 8, 16, 32, 64, 128, 256]. For the algorithm-specific hyperparameters: HydaLearn = [1, 2, 3, 4, 5, 6, 7]; DWA_{t} (the so-called temperature) = [1, 2, 3]; DWA_{b} (the number of batches over which losses are averaged for weight calculations) = [4, 16, 32]; Olaux = [1, 2, 3, 4, 5, 6, 7]; and Static = [0.5 to 1.9] in 0.1 increments.
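An exhaustive grid search over the learning-rate and batch-size ranges above can be sketched with `itertools.product`; the scoring callback here is a placeholder, not the authors' validation routine:

```python
from itertools import product

# Hypothetical search space mirroring the ranges above.
learning_rates = [c * 10 ** e for e in (-2, -3, -4) for c in (1.0, 2.5, 5.0, 7.5)]
batch_sizes = [4, 8, 16, 32, 64, 128, 256]

def grid_search(evaluate):
    """Score every (lr, batch_size) pair on the validation set and keep
    the best; `evaluate` is a user-supplied scoring callback."""
    return max(product(learning_rates, batch_sizes),
               key=lambda cfg: evaluate(*cfg))

# Dummy objective just to show the call pattern; it peaks at (1e-3, 64).
best_lr, best_bs = grid_search(lambda lr, bs: -abs(lr - 1e-3) - abs(bs - 64) / 1e4)
print(best_lr, best_bs)
```

In practice, `evaluate` would train one model per configuration and return its validation-set score, so the full sweep costs one training run per grid point.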

### A.3 Preprocessing and Implementation: Fannie Mae

Data on mortgage default typically exhibit an extreme class imbalance: the large majority of mortgage holders never default. Multitask learning is one approach to amplifying the signal of the minority class [53]. As an auxiliary task, we propose prepayment prediction. By jointly learning both tasks, we can incorporate signals from future prepayments into the default model, a trick known as 'using the future to predict the present' [31].

We use a slice of the Fannie Mae dataset^{Footnote 2}, which includes data on over one million mortgages. Because improved sample efficiency through MTL has diminishing returns, it makes sense to subsample the dataset to a reasonable size. We therefore take a uniformly sampled, i.i.d. slice of 10,000 data points containing mortgages accepted between 2000 and 2009. For prediction, we use the status at the start of 2010 to predict the occurrence of default and prepayment over the following twelve months.

The continuous features were standardized and the categorical features one-hot encoded, yielding the 138 features used in our experiments. The basic backbone architecture is the same for all models in the Fannie Mae experiments: two shared layers of 24 neurons each and two task-specific heads of two 12-neuron layers each. Again, the learning rate, batch size, and algorithm-specific hyperparameters were determined for each baseline separately by grid search over the same space as defined for the MIMIC dataset.
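As an illustration of this hard-parameter-sharing backbone, a forward pass with random weights might look like the following numpy sketch (the weight scale, ReLU activations, and the omission of the final output layers are our assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_stack(sizes):
    """Random weight matrices for a stack of fully connected layers."""
    return [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes, sizes[1:])]

def forward(x, layers):
    for W in layers:
        x = np.maximum(x @ W, 0.0)          # ReLU activations (assumed)
    return x

# Hard-parameter sharing: 138 inputs -> two shared 24-neuron layers,
# then two task-specific heads of two 12-neuron layers each.
shared = dense_stack([138, 24, 24])
head_default = dense_stack([24, 12, 12])
head_prepay = dense_stack([24, 12, 12])

x = rng.normal(size=(32, 138))              # one minibatch of 32 mortgages
h = forward(x, shared)                      # shared representation
out_default = forward(h, head_default)      # default-task head
out_prepay = forward(h, head_prepay)        # prepayment-task head
```

Both task heads read the same shared representation `h`, which is what lets gradients from the prepayment task shape the features used by the default task.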

## About this article

### Cite this article

Verboven, S., Chaudhary, M.H., Berrevoets, J. *et al.* HydaLearn.
*Appl Intell* **53**, 5808–5822 (2023). https://doi.org/10.1007/s10489-022-03695-x
