1 Introduction

A major limitation of state-of-the-art image interpretation systems is their narrow scope. Separate models are trained to recognize faces (Taigman et al., 2014; Parkhi et al., 2015; Schroff et al., 2015; Zhong et al., 2016), textures (Cimpoi et al., 2014), sketches (Eitz et al., 2012) and drawings (Lake et al., 2015), various fine-grained flower (Nilsback & Zisserman, 2008), bird (Wah et al., 2011) and fungi categories (Brigit & Yin, 2018), to detect (Ren et al., 2015; Liu et al., 2016) and segment (Dai et al., 2016) object categories, and to perform various low- and mid-level tasks such as depth (Eigen et al., 2014) and surface normal estimation (Wang et al., 2015).

In contrast, humans develop powerful internal visual representations early in life that are subject only to small refinements in response to later visual experience (Atkinson, 2002; Maurer & Lewis, 2001; Lewis & Maurer, 2005). Once formed, these visual representations are universal and are later employed in many diverse vision tasks, from reading text and recognizing faces to interpreting visual art forms.

The presence of universal representations in computer vision (Bilen & Vedaldi, 2017) has important implications. First, it means that vision has limited complexity: a growing number of visual domains and tasks (see Footnote 1) can be modeled with a bounded number of representations. As a result, one can use a compact set of representations for learning multiple domains and tasks, and efficiently share features and computations across them, which is crucial on platforms with limited computational resources such as mobile devices and autonomous cars. Second, as we obtain more complete universal representations, new domains and tasks can be learned more easily and efficiently from only a few samples via transfer learning.

In practice, learning universal representations requires to address several challenges. First, modelling diverse visual data demands deep network architectures that can simultaneously learn representations while selectively sharing only the relevant representations across multiple tasks and domains. To this end, previous multi-task works proposed controlling representation sharing across tasks through latent connections (Misra et al., 2016; Ruder et al., 2019), constructing branched deep neural networks based on task affinities (Vandenhende et al., 2020), custom attention mechanisms (Liu et al., 2019), neural architecture search (Liang et al., 2018; Bruggemann et al., 2020; Guo et al., 2020), developing progressive communication across multiple tasks through recurrent networks (Bilen & Vedaldi, 2016; Zhang et al., 2018; Vandenhende et al., 2020), multi-scale feature sharing (Xu et al., 2018a; Vandenhende et al., 2020). Other previous works assume that features extracted from pretrained deep networks on ImageNet provide basis for universal representations and adapt them with a set of compact adapters to various domains (Rebuffi et al., 2017, 2018; Rosenfeld & Tsotsos, 2018; Deecke et al., 2022). In cross-domain few-shot classification, where the goal is to generalize to unseen tasks and domains from few samples, features from multiple domain-specific networks are considered as universal features and transferred to previously unseen domains and tasks after a post selection step (Dvornik et al., 2020; Liu et al., 2021b).

Fig. 1

We propose the Universal Representation Learning framework (shown in purple in the figure), which learns multiple task-specific neural networks and distills their knowledge into a universal one by aligning the universal network's features with the task-specific ones (a). Our method is generic and can be successfully applied to a wide range of problems including multi-task dense prediction (b), multi-domain many-shot learning (c), and cross-domain few-shot learning (d)

The second challenge is to develop training algorithms that learn representations which achieve good performance not only on one of the tasks or domains but on all of them. This problem is especially visible when training involves jointly minimizing a set of loss functions (i.e. one for each task) with significantly different difficulty levels, magnitudes, and characteristics. A naive strategy of uniformly weighing the losses can thus lead to sub-optimal performance, while searching for optimal weights in a continuous hyperparameter space can be prohibitively expensive. Previous works (Chen et al., 2018; Sener & Koltun, 2018; Kendall et al., 2018; Guo et al., 2018; Liu et al., 2019) address the unbalanced loss optimization problem by weighing the loss functions based on the task-dependent uncertainty of the model at training time (Kendall et al., 2018), seeking a Pareto-optimal solution (Sener & Koltun, 2018), or eliminating conflicting gradient components between the tasks (Yu et al., 2020). Although these methods are shown to improve over the uniform loss weighing strategy in some benchmarks, they do not consistently outperform the baseline that simply weighs each loss function with a constant scalar (Vandenhende et al., 2021).

In this work, we focus on the second challenge. Inspired by knowledge distillation (Romero et al., 2015; Hinton et al., 2014), we approach the problem from a different perspective and propose a general methodology for universal representation learning that can be applied to a diverse set of problems including multi-task and multi-domain learning in few- and many-shot settings. We propose a two-stage procedure for universal representation learning: we first train a set of task- or domain-specific models and freeze their parameters, and then distill their knowledge to a universal representation network while simultaneously training it over multiple tasks/domains. In contrast to standard knowledge distillation, in our setting each “teacher” network is trained for a significantly different task (e.g. semantic segmentation, depth estimation) and/or domain (e.g. flowers, handwritten characters), and encodes significantly different representations. Hence naively distilling their representations into a single network would result in poor performance. To this end, we propose aligning the universal network with the individual ones via small task-specific adapters before the distillation, and using loss functions that are invariant to certain transformations between representations. Our method has multiple key advantages over previous work. First, in contrast to relying solely on weighing the individual loss functions (e.g. Kendall et al., 2018) or modifying the direction of their gradients (e.g. Yu et al., 2020), which are limited in preventing one task from dominating or interfering with the rest, we propose more explicit control over the model parameters through knowledge distillation such that representations from all tasks/domains are included in the universal representations. Second, unlike task/domain-specific loss functions with different characteristics which are difficult to balance, the distillation loss function is the same for all tasks/domains and hence provides balanced optimization by design. Third, unlike (Dvornik et al., 2020; Liu et al., 2021b) that employ multiple feature extractors, our model learns a single set of universal representations (a single feature extractor) over multiple domains, which has a fixed computational cost at inference regardless of the number of domains. Finally, our method can be successfully incorporated into various state-of-the-art multi-task/domain customized network architectures (Vandenhende et al., 2020; Rebuffi et al., 2018) and loss balancing strategies (Kendall et al., 2018; Liu et al., 2021c, a).

We illustrate our universal representation learning method and its applications to three standard vision problems in Fig. 1. The common step for all the applications is to first train task- or domain-specific models and then distill their knowledge to a single universal network (see Fig. 1a). We show that the universal representations (depicted by the green network) can successfully be employed in jointly learning (i) multiple dense vision problems such as semantic segmentation and depth estimation (see Fig. 1b), (ii) multiple image classification problems from diverse datasets such as ImageNet (Deng et al., 2009), Omniglot (Lake et al., 2015) and FGVC Aircraft (Maji et al., 2013) (see Fig. 1c), and (iii) classifying images from few training samples of unseen tasks and domains (see Fig. 1d). In all applications, the computations and representations are largely shared across tasks and domains through the universal network, while lightweight task- or domain-specific heads map the universal representations to the task output space to obtain the predictions.

In summary, our core contribution is a generic framework for universal representation learning that can be employed in very diverse problems including dense multi-task prediction, and multi-domain many-shot and few-shot image classification tasks. We show that the learned universal representations generalize well not only to unseen samples in previously seen domains and tasks but also to unseen tasks and domains in the few-shot setting. We rigorously evaluate our method and show that it outperforms the state-of-the-art multi-task dense prediction methods on NYU-v2 (Silberman et al., 2012) and Cityscapes (Cordts et al., 2016), cross-domain few-shot image classification methods on MetaDataset (Triantafillou et al., 2020), and multi-domain many-shot image classification methods on the Visual Decathlon benchmark (Rebuffi et al., 2017). We also propose an efficient learning strategy that enables learning representations from subsets of tasks in parallel and finally merging them into universal representations. We extensively analyze the performance of our method over various design choices including different deep network architectures, adapter types, and loss functions for knowledge distillation.

This work is an extended version of our prior contributions (Li & Bilen, 2020; Li et al., 2021) that focus on dense multi-task prediction and cross-domain few-shot learning problems. The new contributions are:

  • a unified look at universal representations for a diverse set of problems in multi-domain and multi-task learning,

  • a hierarchical distillation strategy that allows for learning representations in parallel,

  • evaluation in a new problem, multi-domain many-shot learning,

  • evaluation of our method with the state-of-the-art multi-task dense prediction architectures (Xu et al., 2018a; Liu et al., 2019; Vandenhende et al., 2020),

  • more extensive analysis on adapter types and loss functions,

  • more extensive and recent literature review.

The rest of the paper is organized as follows: Sect. 2 provides an extensive overview of related work in multi-task learning, multi-domain learning and knowledge distillation. Section 3 reviews the background formulation for single- and multiple-task learning. Section 4 introduces universal representation learning for multi-task dense prediction, multi-domain classification and cross-domain few-shot classification. Section 5 provides a rigorous analysis of the design choices and evaluates the performance of the proposed models on multiple standard benchmarks for the three problems. Section 6 concludes the paper with remarks on future work and limitations.

2 Related Work

2.1 Multi-task Learning

Multi-task Learning (MTL) (Caruana, 1997) aims at learning a single model that can infer all desired task outputs given an input. We refer to Ruder (2017), Zhang and Yang (2017), Vandenhende et al. (2021) for a more comprehensive literature review. As discussed above, prior works can be broadly divided into two groups. The first one focuses on improving the network architecture via more effective information sharing across tasks (Misra et al., 2016; Kokkinos, 2017; Ruder et al., 2019; Liu et al., 2019; Vandenhende et al., 2020; Liang et al., 2018; Bruggemann et al., 2020; Guo et al., 2020; Bragman et al., 2019; Strezoski et al., 2019; Xu et al., 2018a; Zhang et al., 2019; Bruggemann et al., 2021; Bilen & Vedaldi, 2016; Zhang et al., 2018; Vandenhende et al., 2020; Xu et al., 2018a, 2022; Sun et al., 2021). The second group aims to address the unbalanced optimization caused by jointly optimizing multiple task loss functions with varying characteristics, either by actively changing the weight of each loss term (Kendall et al., 2018; Liu et al., 2019; Guo et al., 2018; Chen et al., 2018; Lin et al., 2019; Sener & Koltun, 2018; Liu et al., 2021c) and/or by modifying the gradients of the loss functions w.r.t. the shared network weights to alleviate conflicts among tasks (Yu et al., 2020; Liu et al., 2021a; Chen et al., 2020; Chennupati et al., 2019; Suteu & Guo, 2019).

Our method is complementary to the first line of work. In fact, we show in Sect. 5 that our method can be used to boost the state-of-the-art multi-task architectures in dense prediction problems. While our goal is aligned with that of the second group, we propose a significantly different strategy based on knowledge distillation to solve the unbalanced loss optimization problem. To this end, we first train a task-specific model for each task in an offline stage and freeze its parameters. We then train the universal representation (multi-task) network to minimize the task-specific losses and also to produce the same features as the task-specific networks. As each task-specific network encodes different features, we introduce small task-specific adapters that project the universal features to the task-specific spaces and then minimize the discrepancy between task-specific and universal features. In contrast to prior works that either rely solely on weighing the individual loss functions (e.g. Kendall et al., 2018) or modify their gradients for parameter updates (e.g. Yu et al., 2020), which are limited in preventing one task from dominating or interfering with the rest, our method provides more direct control over the model parameters through knowledge distillation such that representations from all tasks are included in the universal representations. Second, unlike task-specific loss functions with different characteristics which are difficult to balance, the distillation loss function is the same for all tasks and hence provides a balanced optimization by design. In addition, in this paper, we show that our method also generalizes to learning multiple diverse visual domains for standard image classification (Rebuffi et al., 2017) and few-shot classification (Triantafillou et al., 2020).

Multi-task Self-supervised Learning. Our work is also loosely related to a recent line of self-supervised learning methods that learn representations from unlabelled data via multiple pretext tasks (Doersch & Zisserman, 2017), e.g. predicting rotations, or from unlabelled data sampled from multiple domains (datasets) (Zoph et al., 2020; Ghiasi et al., 2021). Unlike these works that focus on pretraining representations from unlabelled data and transferring them to one downstream task at a time (one model for one downstream task), we focus on learning models that can jointly perform multiple tasks.

Continual Learning. Our method is also related to the regularization-based methods in continual learning, which aim at sequentially learning a model for a large number of tasks without forgetting knowledge learned from the preceding tasks, whose data are not available when learning the model for the new ones. Regularization-based methods prevent forgetting by adding a term that regularizes the important parameters between the model learned from the preceding tasks and the one for the new tasks (e.g. EWC (Kirkpatrick et al., 2017), MAS (Aljundi et al., 2018), REWC (Liu et al., 2018), SI (Zenke et al., 2017), and RWalk (Chaudhry et al., 2018)), or by constraining features or predictions, such as LwF (Li & Hoiem, 2017), LwM (Dhar et al., 2019) and BiC (Wu et al., 2019). Unlike these methods that are designed for continual learning, our goal is to learn a single set of universal representations for multiple tasks or domains when the data for all tasks and domains are available during training, rather than learning a model from sequential data. Also, we propose to distill the knowledge from multiple task/domain-specific models into a single universal neural network by aligning features, with the help of task-specific adapters, and predictions.

2.2 Multi-domain Learning (MDL)

A parallel line of research is learning representations jointly on multiple domains (Bilen & Vedaldi, 2017; Rosenfeld & Tsotsos, 2018; Rebuffi et al., 2018; Deecke et al., 2022). Unlike (Ganin et al., 2016; Tzeng et al., 2017; Hoffman et al., 2018; Xu et al., 2018b; Peng et al., 2019; Sun et al., 2019) that focus on domain adaptation, this line of work aims at learning a single set of universal representations over multiple tasks and visual domains. Bilen and Vedaldi (2017) proposed to learn a compact multi-domain representation for standard image classification in multiple visual domains using domain-specific scaling parameters. This idea was later extended to domain-specific adapters (Rebuffi et al., 2017; Rosenfeld & Tsotsos, 2018; Rebuffi et al., 2018) and to latent domain learning without access to domain annotations, by learning gating functions that select domain-specific adapters for the given images (Deecke et al., 2022). In this paper, we also learn a single set of universal (multi-domain) representations by sharing most of the computation across domains (e.g. the feature encoder is shared across all domains and followed by multiple domain-specific classifiers). However, unlike (Rebuffi et al., 2017; Rosenfeld & Tsotsos, 2018; Rebuffi et al., 2018; Deecke et al., 2022) that use representations pretrained only on ImageNet (Deng et al., 2009) as the universal ones and then learn additional domain-specific representations, our method learns a single set of universal representations from multiple domains, which requires fewer parameters and is also a significantly harder task due to the challenges of multi-loss optimization. In addition, those works do not scale up to multi-task learning in a single domain, as they require running the network with the corresponding task-specific adapters for each task separately, while ours requires only a single forward computation for all tasks.

2.3 Cross-Domain Few-Shot Learning

Our work is also related to few-shot classification scenarios that aim at adapting a classifier to previously unseen tasks and domains from few labeled samples. Earlier works (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017; Nichol et al., 2018) focus on evaluating their methods on homogeneous learning tasks, e.g. Omniglot (Lake et al., 2015) and miniImageNet (Vinyals et al., 2016), where both the meta-train and meta-test examples are sampled from a single data distribution (or dataset), and perform poorly on the more challenging cross-domain few-shot tasks, where test data is sampled from an unknown or previously unseen domain (Triantafillou et al., 2020). We refer to Wang et al. (2020), Hospedales et al. (2020) for a comprehensive review of early works.

Recent few-shot techniques (Dvornik et al., 2020; Liu et al., 2021b; Bateni et al., 2020; Requeima et al., 2019) leverage powerful representations learned over multiple domains and focus on few-shot learning from multiple domains that generalizes to unseen domains at test time in the recently proposed MetaDataset (Triantafillou et al., 2020). CNAPS (Requeima et al., 2019) consists of an adaptation network that modulates the parameters of both a feature extractor and classifier for new categories by encoding the data distribution of few training samples. Simple CNAPS (Bateni et al., 2020) extends CNAPS by replacing its parametric classifier with a non-parametric classifier based on Mahalanobis distance and shows that adapting the classifier from few samples is not necessary for good performance. SUR (Dvornik et al., 2020) and URT (Liu et al., 2021b) further show that adaptation for the feature extractor can also be replaced by a feature selection mechanism. In particular, both (Dvornik et al., 2020; Liu et al., 2021b) learn a separate deep network for each training dataset in an offline stage, employ them to extract multiple features for each image, and then select the optimal set of features either based on a similarity measure (Dvornik et al., 2020) or on an attention mechanism (Liu et al., 2021b). Despite their good performance, SUR and URT are computationally expensive and require multiple forward passes through multiple networks during inference time. Our method also uses multi-domain features but in a more efficient way, by learning a single network over multiple domains. Our method requires significantly less network capacity and compute load than theirs.

2.4 Knowledge Distillation

Our work is related to knowledge distillation (KD) methods (Hinton et al., 2014; Li & Bilen, 2020; Ma & Mei, 2019; Phuong & Lampert, 2019; Romero et al., 2015; Tian et al., 2020a) that distill the knowledge of an ensemble of large teacher models into a small student neural network at the classifier (Hinton et al., 2014) and intermediate layers (Romero et al., 2015). Born-Again Neural Networks (Furlanello et al., 2018) use KD to consecutively distill knowledge from an identical teacher network to a student network, an idea that is further applied to few-shot learning in Tian et al. (2020b) and multi-task learning in Clark et al. (2019).

While our method can also be seen as a distillation method, our goal differs significantly. In contrast to standard KD that aims to learn a single-task/domain network from multiple teachers, our goal is to learn a multi-task/domain network. The difference is subtle. Multi-task/domain learning typically involves solving an unbalanced optimization, and aligning the predictions of the multi-task/domain (student) network with the task-specific (teacher) networks does not necessarily alleviate this issue, as this alignment leads to another unbalanced optimization problem due to the varying dimensionality of task outputs and the different loss functions required for matching different tasks’ predictions (Clark et al., 2019), e.g. a KL-divergence loss for classification and an l2-norm loss for regression. While the alignment of intermediate representations is studied for KD in Romero et al. (2015), such an alignment is substantially harder when the representations vary significantly across different teacher networks. We demonstrate that mapping the student (or universal) representations to the teacher’s representation space before the alignment is crucial. Finally, we also show that intermediate representation matching across very diverse domains can indeed be improved by using a loss function that is invariant to linear transformations, inspired by the Centered Kernel Alignment (CKA) similarity (Kornblith et al., 2019).

3 Background

Fig. 2

Illustration of universal representation learning. In the first stage (a), we learn a task-specific deep network for each task. In the second stage (b), our goal is to learn a multi-task network that shares the feature encoder across all tasks and builds multiple task-specific decoders on top of the feature encoder, such that it performs well on these tasks compared to the task-specific models trained in (a). To achieve this, we train the multi-task network by jointly minimizing task-specific losses and aligning the features and predictions between the multi-task network and the task-specific networks. For the aligned feature distillation, we introduce a set of task-specific adapters to transform the features of the multi-task network to the task-specific spaces before the alignment with the task-specific features

In this section, we review the problem setting for single-task and multi-task learning to provide the required background for universal representation learning. Let \(\mathcal {D}\) be a training set consisting of N RGB training images and their respective labels. We consider two general label settings, multi-task learning (MTL) and multi-domain learning (MDL). In MTL, we assume that training images are sampled from a single distribution, i.e. dataset, and each training image \(\varvec{x}\) is associated with labels for T tasks, \(y=\{\varvec{y}^1,\varvec{y}^2,\dots ,\varvec{y}^T\}\). This is a common setting for dense prediction problems where multiple tasks such as semantic segmentation and depth estimation are performed on the same image. In MDL, the training set contains samples from T different domains, where each image is associated only with a single domain and its domain-specific task. The standard MDL benchmarks (Rebuffi et al., 2017; Triantafillou et al., 2020) contain images from diverse datasets (e.g. ImageNet, Omniglot, VGG Flowers), each with a classification task over a mutually exclusive set of categories. Hence, an image \(\varvec{x}\) associated with domain t is labelled only for the domain-specific task, \(\varvec{y}^t\). In both the MTL and MDL settings, our goal is to learn a function \(\hat{y}^{t}\) for each task t in MTL and for each domain t in MDL that accurately predicts the ground-truth label \(\varvec{y}^{t}\) of previously unseen images.

Note that while in MTL different tasks involve solving different problems such as semantic segmentation and depth estimation, in MDL they involve solving the same problem, e.g.image classification, however over a different set of categories. We do not focus on other scenarios where images from different domains are associated with the same task (e.g.digit recognition from hand-written notes and real street-numbers).

3.1 Single-task Learning

Single-task learning (STL) involves learning a task-specific function \(\hat{y}^t_s\)Footnote 2 independently for each task by optimizing a task-specific loss function \(\ell ^{t}(\hat{y}^t_s,\varvec{y}^t)\) (e.g. cross-entropy for classification), which measures the mismatch between the ground-truth label and the prediction, as follows:

$$\begin{aligned} \min _{\phi _s^t,\psi _s^t}\frac{1}{N}\sum _{n=1}^{N}\ell ^{t}(\hat{y}^{t}_s(\varvec{x}_{n}), \varvec{y}^{t}_{n}), \quad \forall t \in \{1,\dots ,T\} \end{aligned}$$
(1)

where each \(\hat{y}^t_s\) is composed of i) a feature encoder \(f_{\phi _s^t}: \mathbb {R}^{3 \times H\times W} \rightarrow \mathbb {R}^{C \times H'\times W'}\) parameterized by \(\phi _s^t\) that takes in an \(H\times W\) dimensional RGB image and outputs a \(H'\times W'\) dimensional feature map with C channels, where \(C>3\), \(H'<H\) and \(W'<W\); ii) a decoder \(h_{\psi _s^{t}}: \mathbb {R}^{C \times H'\times W'} \rightarrow \mathbb {R}^{O^{t} \times H^{t} \times W^{t}}\) that decodes the extracted feature to predict the label for the task t, i.e.\(\hat{y}_s^{t}(\varvec{x})=h_{\psi _s^{t}}\circ f_{\phi _s^t}(\varvec{x})\) where \(O^{t}\), \(H^{t}\), \(W^{t}\) are the dimensions of the output space for task t and \(\psi _s^{t}\) are its parameters.

3.2 Multi-task Learning

A more efficient design is to share a significant portion of the computations and parameters across the tasks via a common feature encoder \(f_{\phi }: \mathbb {R}^{3 \times H\times W} \rightarrow \mathbb {R}^{C \times H'\times W'}\), i.e. a convolutional neural network parameterized by \(\phi \) that takes in an image \(\varvec{x}\) and produces C feature maps of dimension \(H'\times W'\). The parameters \(\phi \) are shared across all the tasks. In this setting, \(f_{\phi }\) is followed by T task-specific decoders \(h_{\psi ^{t}}: \mathbb {R}^{C \times H'\times W'} \rightarrow \mathbb {R}^{O^{t} \times H^{t} \times W^{t}}\), each with its own task-specific weights \(\psi ^{t}\), that decode the extracted feature to predict the label for task t, i.e. \(\hat{y}^{t}(\varvec{x}^{t})=h_{\psi ^{t}}\circ f_{\phi }(\varvec{x}^{t})\). A common way of learning \(\hat{y}^{t}\) for all tasks is to jointly optimize the shared and task-specific parameters as follows:

$$\begin{aligned} \min _{\phi , \{\psi ^t\}\vert _{t=1}^T}\frac{1}{N}\sum _{n=1}^{N}\sum _{\varvec{y}^t_n \in y_n}\lambda ^t\ell ^{t}(\hat{y}^{t}(\varvec{x}_{n}), \varvec{y}^{t}_{n}), \end{aligned}$$
(2)

where \(\lambda ^t\) is a scaling hyperparameter for task t used to balance the loss functions among the tasks. However, obtaining good universal representations by solving Eq. (2) is a challenging problem, as it requires leveraging commonalities between the tasks while balancing their loss functions and minimizing interference (negative transfer (Chen et al., 2018; Yu et al., 2020)) between them. Hence solving Eq. (2) often yields worse results than those of task-specific models, where each task is learned independently.
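To make the optimization in Eq. (2) concrete, the following is a minimal PyTorch-style sketch of one training step of the vanilla multi-task network; the module and variable names (encoder, decoders, etc.) are illustrative placeholders, not the actual implementation.

```python
import torch

def vanilla_mtl_step(encoder, decoders, task_losses, lambdas, x, targets, optimizer):
    """One step of Eq. (2): a shared encoder f_phi feeds T task-specific decoders
    h_psi^t, and a weighted sum of the task losses is minimized jointly."""
    optimizer.zero_grad()
    feat = encoder(x)                              # f_phi(x): shared representation
    total = 0.0
    for t, decoder in enumerate(decoders):
        pred = decoder(feat)                       # hat{y}^t(x) = h_psi^t o f_phi(x)
        total = total + lambdas[t] * task_losses[t](pred, targets[t])
    total.backward()                               # gradients w.r.t. phi and all psi^t
    optimizer.step()
    return float(total)
```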

4 Universal Representation Learning

Motivated by these challenges, previous methods mainly focus on dynamically balancing the loss functions through the loss weights (\(\lambda ^t\)) or manipulating the gradients of each task w.r.t. the shared encoder to alleviate the conflicts between them while optimizing Eq. (2). However, as reported in Vandenhende et al. (2021), existing solutions fail to improve over carefully tuning these hyperparameters. We hypothesize that modifying hyperparameters and/or gradients provides only limited control over the learned representations, and propose a different view on this problem: a two-stage procedure inspired by knowledge distillation methods (Romero et al., 2015; Hinton et al., 2014). Assuming that single-task learning often performs well when sufficient training data is available, we argue that single-task representations are powerful and hence provide good approximations to universal representations.

4.1 Unified Learning

To this end, we first train task-specific deep networks \(\{\hat{y}_s^t\}_{t=1}^{T}\), each with an encoder followed by a decoder, as described in Sect. 3.1. Next we freeze their weights and distill their knowledge to a universal network by minimizing the distance between the task-specific and universal representations over training samples (see Fig. 2).

As minimizing the distance between a single set of universal representations and multiple single-task representations would not yield a satisfactory solution (its minimizer is merely the average of the single-task representations), we instead first map the universal representations to each task-specific representation space through small task-specific adapters (a), and then compute the distance in this space. We denote this knowledge transfer term ‘aligned feature distillation’. In addition, we consider minimizing the distance between the outputs of the universal and single-task networks as in Hinton et al. (2014), which we denote ‘prediction distillation’. Equation (2) can then be rewritten as:

$$\begin{aligned} \min _{\phi , \{\psi ^t, \theta ^t\}\vert _{t=1}^T} \frac{1}{N}\sum _{n=1}^{N}\sum _{\varvec{y}^t_n \in y_n}\biggl (\lambda ^t\ell ^{t}(\hat{y}^{t}(\varvec{x}_{n}), \varvec{y}^{t}_{n}) +\underbrace{\lambda _f^t d_{f}(a_{\theta ^{t}} \circ f_{\phi }(\varvec{x}^{t}_n), f_{\phi _s^t}(\varvec{x}^{t}_n))}_{\text {aligned feature distillation}} + \underbrace{\lambda _{p}^{t}d_{p}(\hat{y}^t(\varvec{x}_{n}^{t}), \hat{y}_s^t(\varvec{x}_{n}^{t}))}_{\text {prediction distillation}}\biggr ), \end{aligned}$$
(3)

where \(\lambda _{f}^{t}\) and \(\lambda _{p}^{t}\) are task-specific hyperparameters for each additional term, respectively. \(a_{\theta ^{t}}\) is the task-specific adapter network that aligns the features of the universal network \(f_{\phi }\) with those of the task-specific teacher network \(f_{\phi _s^t}\). Each adapter is parameterized by \(\theta ^{t}\), which is jointly trained along with the network parameters. \({d_{f}}\) and \({d_{p}}\) are loss functions that measure discrepancies between features and predictions, respectively. While a single loss function is used for distilling features, distilling predictions may require a different treatment for each task due to the varying output characteristics per task (e.g. multi-class probability, depth). The optimal choice for \(d_f\) and \(d_p\) varies when applied to different problems. We detail the \(d_f\) and \(d_p\) choices for the different applications in Sects. 4.3–4.5 and compare different choices in Sect. 5.

Note that adapters are also commonly used in NLP for adapting large pretrained language models to new tasks or domains (e.g. Houlsby et al., 2019). Unlike them, we use adapters for aligning features between the task-specific and universal networks, discard them after training, and use only the universal network without the adapters to predict the labels of unseen images. Hence, the inference time of our method is fixed and does not depend on the number of tasks/domains.
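For concreteness, the sketch below shows one training step of Eq. (3) in PyTorch-style pseudocode; the teacher container, adapter modules and hyperparameter names are illustrative assumptions, not the actual implementation.

```python
import torch

def universal_step(encoder, decoders, adapters, teachers, task_losses,
                   d_f, d_p, lam, lam_f, lam_p, x, targets, optimizer):
    """One step of Eq. (3): the universal network minimizes the task losses plus the
    aligned feature distillation and prediction distillation terms against frozen
    task-specific teachers. `teachers[t]` is assumed to expose a frozen encoder and
    decoder; `adapters[t]` is the small task-specific adapter a_theta^t."""
    optimizer.zero_grad()
    feat = encoder(x)                                 # universal features f_phi(x)
    total = 0.0
    for t, (decoder, adapter) in enumerate(zip(decoders, adapters)):
        pred = decoder(feat)                          # universal prediction for task t
        with torch.no_grad():                         # teachers are frozen
            t_feat = teachers[t].encoder(x)           # f_{phi_s^t}(x)
            t_pred = teachers[t].decoder(t_feat)      # hat{y}_s^t(x)
        total = total + lam[t] * task_losses[t](pred, targets[t])
        total = total + lam_f[t] * d_f(adapter(feat), t_feat)   # aligned feature distillation
        total = total + lam_p[t] * d_p(pred, t_pred)            # prediction distillation
    total.backward()
    optimizer.step()
    return float(total)
```

At test time the adapters and teachers are dropped and only the universal encoder and decoders are kept, matching the fixed inference cost discussed above.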

4.2 Hierarchical Distillation with Task Grouping

As the number of tasks (T) grows, solving Eq. (3) and computing the task-specific representations simultaneously can become memory- and compute-intensive. Hence we propose a hierarchical strategy that decomposes Eq. (3) into multiple independent smaller problems. To this end, we first assign each task to a random group. Grouping could also be done using task similarity; however, we explore only random selection, as identifying task similarity is a challenging problem by itself and an actively studied one (Zamir et al., 2018). For each group g, we learn a single network by distilling the representations of the single-task networks in group g as in Eq. (3). Importantly, each group can be trained in parallel to accelerate training. We then further distill their representations into the final universal network, again as in Eq. (3). Note that it is also possible to use a deeper hierarchy with nested groupings. We evaluate this strategy on the Visual Decathlon benchmark (Rebuffi et al., 2017) (see Sect. 5.5), as it involves learning representations over a large set of domains.
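A minimal sketch of this hierarchical strategy is given below; `distill_fn` stands for one round of training with Eq. (3) that distills a list of frozen models into a new network, and the group size and helper names are illustrative assumptions.

```python
import random

def hierarchical_distillation(task_models, group_size, distill_fn, seed=0):
    """Two-level hierarchy of Sect. 4.2: tasks are assigned to random groups, each
    group is distilled into an intermediate network (these runs can be executed in
    parallel), and the intermediate networks are finally distilled into one
    universal network."""
    rng = random.Random(seed)
    models = list(task_models)
    rng.shuffle(models)                               # random group assignment
    groups = [models[i:i + group_size] for i in range(0, len(models), group_size)]
    group_models = [distill_fn(g) for g in groups]    # stage 1 (parallelizable)
    return distill_fn(group_models)                   # stage 2: final universal network
```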

Next we describe how the universal representations are learned for different scenarios including multi-task dense prediction, multi-domain classification and cross-domain few-shot classification problems.

4.3 Application to Multi-task Dense Prediction

In the multi-task dense prediction setting, each image \(\varvec{x}\) is associated with labels for all tasks. The spatial dimensions of the labels for each task are equal to the image size – hence it is called dense or pixelwise prediction – and are the same for all tasks. We consider semantic segmentation, monocular depth estimation and surface normal prediction in our experiments.

In this setting, we only minimize the difference between intermediate representations of the universal network and the task-specific ones, and do not employ prediction distillation nor align the predictions of the universal network with the single-task ones as in Clark et al. (2019), Hinton et al. (2014), Romero et al. (2015). In our preliminary experiments, we observed that matching the predictions leads to a significant drop in the final performance, which we further analyze in Sect. 5.1.1. As each task prediction has different magnitude ranges and characteristics, we argue that jointly minimizing these distances along with the other loss terms in Eq. (3) leads to a challenging unbalanced optimization.

Let \(\varvec{m}^t=a_{\theta ^{t}} \circ f_{\phi }(\varvec{x}) \in \mathbb {R}^{C \times H'\times W'}\) and \(\varvec{s}^t=f_{\phi _s^t}(\varvec{x}) \in \mathbb {R}^{C \times H'\times W'}\) denote the representations obtained from the universal and single-task encoders for a given image \(\varvec{x}_n\) and task t, respectively. We normalize the two feature maps channel-wise with the L2 norm: \(\tilde{\varvec{m}}_{chw}={\varvec{m}_{chw}} / ||\varvec{m}_{\cdot hw}||_2\) and \(\tilde{\varvec{s}}_{chw}={\varvec{s}_{chw}} / ||\varvec{s}_{\cdot hw}||_2\), where \(\varvec{m}_{chw}\) denotes the hidden unit at position (c, h, w) in \(\varvec{m}\). We then measure the distance between the two normalized feature maps using the Euclidean distance function for \({d_{f}}\):

$$\begin{aligned} {d}_{f}(\varvec{m},\varvec{s})=\sum _{w=1}^{W}\sum _{h=1}^{H}\sum _{c=1}^{C}\left\Vert \tilde{\varvec{m}}_{chw}-\tilde{\varvec{s}}_{chw}\right\Vert _2^2. \end{aligned}$$
(4)

We investigate different designs for \(a_{\theta }\) and \({d_{f}}\) in Sect. 5.5 and show that linear adapters for \(a_{\theta }\) with the Euclidean distance function for \({d_{f}}\) obtain the best performance.
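A minimal implementation of the feature distillation loss of Eq. (4) could look as follows (PyTorch-style sketch); averaging the per-image sums over the minibatch is our assumption here, as is the small epsilon for numerical stability.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(m, s, eps=1e-8):
    """Eq. (4): squared Euclidean distance between channel-wise L2-normalized
    feature maps of shape (B, C, H', W'). `m` is the adapted universal feature
    a_theta^t o f_phi(x) and `s` is the frozen single-task feature f_{phi_s^t}(x)."""
    m = F.normalize(m, p=2, dim=1, eps=eps)           # normalize over channels at each (h, w)
    s = F.normalize(s, p=2, dim=1, eps=eps)
    per_image = ((m - s) ** 2).sum(dim=(1, 2, 3))     # sum over C, H', W' as in Eq. (4)
    return per_image.mean()                           # average over the minibatch (assumption)
```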

4.4 Application to Multi-domain Classification

We also consider the multi-domain scenario in Rebuffi et al. (2017), where the training set \(\mathcal {D}\) contains T subdatasets, each sampled from a different domain. Each image is associated with only one domain and hence one task. The associated task is known at both train and test time as in Rebuffi et al. (2017). As in the multi-task dense prediction problem, our goal is to learn a single network with a feature encoder \(f_{\phi }\) shared across the domains, and thus across the tasks. Unlike in multi-task dense prediction, both the output of the feature encoder and the predictions are vectors (i.e. \(H'=1\), \(W'=1\), \(H^t=1\) and \(W^t=1\)). \(f_{\phi }\) is a convolutional neural network followed by a global average pooling layer as in He et al. (2016).

Following the two-stage procedure, we first independently train a set of domain-specific deep networks \(\{\hat{y}_{s}^{t}\}_{t=1}^{T}\) as in Eq. (1), where each consists of a specific feature encoder \(f_{\phi _{s}^{t}}\) and classifier \(h_{\psi _{s}^{t}}\). Unlike in multi-task dense prediction, this setting involves training each domain network on a different training set from a different domain. In the second stage, we learn the universal network over training images of multiple domains using Eq. (3). Here we use a KL divergence loss for \({d}_p\) in Eq. (3) to align the predictions of the single-task and universal networks.

Though we use domain-specific adapters to map the universal features to the domain-specific spaces, learning a single set of representations over substantially diverse domains remains challenging; it requires modelling complex non-linear relations between them and hence a more elaborate distance function (\({d}_f\)) than the Euclidean one. To this end, we propose to adopt the Centered Kernel Alignment (CKA) (Kornblith et al., 2019) similarity index with the Radial Basis Function (RBF) kernel, which was originally proposed as an analysis tool to measure similarities between neural network representations and shown to be invariant to various transformations and capable of capturing meaningful non-linear similarities between representations of higher dimension than the number of data points. Differently from its original purpose, we use CKA as a loss function to minimize the distance between universal and domain-specific representations rather than as an analysis tool.

Next we briefly describe CKA. Given a set of images \(\{\varvec{x}^{t}_1,\dots ,\varvec{x}^{t}_B\}\), let \(\textbf{M}=[a_{\theta ^{t}}\circ f_{\phi }(\varvec{x}^{t}_1), \dots , a_{\theta ^{t}}\circ f_{\phi }(\varvec{x}^{t}_B)]^\top \in \textrm{R}^{B \times C}\) and \(\textbf{S}=[f_{\phi _{s}^{t}}(\varvec{x}^{t}_1), \dots , f_{\phi _{s}^{t}}(\varvec{x}^{t}_B)]^\top \in \textrm{R}^{B \times C}\) denote the features computed by the multi-domain network adapted by \(a_{\theta ^{t}}\) and by the domain-specific network, respectively. We first compute the RBF kernel matrices \(\textbf{P}\) and \(\textbf{T}\) of \(\textbf{M}\) and \(\textbf{S}\), respectively, and then use the two kernel matrices to measure the CKA similarity between \(\textbf{M}\) and \(\textbf{S}\):

$$\begin{aligned} \text {CKA}(\textbf{M}, \textbf{S}) = \text {tr}(\textbf{P}\textbf{H}\textbf{T}\textbf{H})/\sqrt{\text {tr}(\textbf{P}\textbf{H}\textbf{P}\textbf{H})\text {tr}(\textbf{T}\textbf{H}\textbf{T}\textbf{H})}, \end{aligned}$$
(5)

where \(\text {tr}(\cdot )\) and \(\textbf{H}\) denote the trace of a matrix and the centering matrix \(\textbf{H}_n=\textbf{I}_n-\frac{1}{n}\textbf{1}\textbf{1}^\top \), respectively. The loss \({d}_{f}(\textbf{M}, \textbf{S})\) is then defined as \({d}_{f}(\textbf{M}, \textbf{S})=1-\text {CKA}(\textbf{M}, \textbf{S})\), i.e. the dissimilarity between the multi-domain and domain-specific features. As the original CKA similarity requires computing the kernel matrices over the whole dataset, which is not scalable to large datasets, we follow (Nguyen et al., 2021) and compute them over each minibatch during training. We refer to Kornblith et al. (2019), Nguyen et al. (2021) for more details.
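The minibatch CKA loss of Eq. (5) can be sketched as below (PyTorch-style); the RBF bandwidth heuristic (a fraction of the median pairwise distance) is an assumption following common practice, and the exact choice in our experiments may differ.

```python
import torch

def rbf_kernel(x, sigma_frac=0.5):
    """RBF Gram matrix for rows of x (shape B x C). The bandwidth is set to a
    fraction of the median pairwise distance (a common heuristic, see Kornblith
    et al., 2019); this choice is illustrative."""
    d2 = torch.cdist(x, x) ** 2                               # pairwise squared distances
    sigma = sigma_frac * d2[d2 > 0].sqrt().median()
    return torch.exp(-d2 / (2 * sigma ** 2 + 1e-8))

def cka_loss(m, s, sigma_frac=0.5):
    """d_f(M, S) = 1 - CKA(M, S) from Eq. (5), computed over a minibatch.
    m: adapted universal features (B x C); s: frozen domain-specific features (B x C)."""
    b = m.shape[0]
    h = torch.eye(b, device=m.device) - torch.full((b, b), 1.0 / b, device=m.device)
    p, t = rbf_kernel(m, sigma_frac), rbf_kernel(s, sigma_frac)
    ph, th = p @ h, t @ h                                     # P H and T H
    cka = torch.trace(ph @ th) / torch.sqrt(torch.trace(ph @ ph) * torch.trace(th @ th))
    return 1.0 - cka
```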

4.5 Application to Cross-Domain Few-shot Classification

We also apply our method to cross-domain few-shot classification, which aims at learning to classify samples from only a few images per class from an unknown visual domain. As in the previous setting in Sect. 4.4, we are given a large training set \(\mathcal {D}\) that consists of images from T subdatasets, each focusing on a classification task over a set of mutually exclusive categories. Unlike in the previous case, the goal is not to accurately classify an unseen image from one of the previously seen categories and domains, but to classify images from previously unseen categories, drawn from either a previously seen or unseen domain; hence it is substantially more challenging. In particular, for the unseen task, we are given a support set \(\mathcal {S}\) that contains a few image and label pairs, and a query set \(\mathcal {Q}\) that contains samples to be classified. In other words, we would like to learn a classifier on the support set that can accurately predict the labels of the query set.

As in Dvornik et al. (2020), Liu et al. (2021b), we solve this problem in two stages, meta-training and meta-testing. In meta-training, we learn our universal representation network on \(\mathcal {D}\) as in Sect. 4.4 by following the two-step procedure, where we also use the CKA loss (i.e. Eq. (5)) for \({d}_f\) and a KL divergence loss for \({d}_p\) in Eq. (3). In meta-testing, we transfer the learned universal network, further adapt its representations, and learn a classifier for the new task on \(\mathcal {S}\). In contrast to Dvornik et al. (2020), Liu et al. (2021b), which learn multiple domain-specific networks in an offline stage, employ them to extract multiple features for each image, and then select the most relevant ones at test time, we learn a single universal network that performs well on T domains by distilling the knowledge of the domain-specific networks at training time, and use only the universal network at test time. This has two key advantages over (Dvornik et al., 2020; Liu et al., 2021b). First, using a single feature extractor, which has the same capacity as each domain-specific one, is significantly more efficient in terms of run-time and number of parameters in the meta-test stage. Second, learning to find the most relevant features for a given support and query set in Liu et al. (2021b) is not trivial and may also suffer from overfitting to the small number of datasets in the training set, while the multi-domain representations, by definition, automatically contain the required information from the relevant domains.

In particular, after the universal representation learning, we freeze and use the universal network to extract features, and learn a task-specific linear mapping \(m_{\varphi }:\textrm{R}^C\rightarrow \textrm{R}^C\), which is parameterized by \(\varphi \), on the support set \(\mathcal {S}\) with the non-parametric nearest centroid classifier (NCC) (Mensink et al., 2013; Snell et al., 2017):

$$\begin{aligned} \min _{\varphi }\frac{1}{\vert \mathcal {S}\vert }\sum _{(\varvec{x}, \varvec{y}) \in \mathcal {S}} \ell _{\text {ce}}(\text {NCC} \circ m_{\varphi } \circ f_{\phi }(\varvec{x}),\varvec{y}) \end{aligned}$$
(6)

where \(\ell _{\text {ce}}\) is the cross-entropy loss. NCC computes the average of the mapped features of the support samples belonging to each category to obtain the class centroids, and measures the class probability of each sample by applying a softmax over its negative cosine distances to the class centroids.
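The meta-test adaptation of Eq. (6) can be sketched as follows (PyTorch-style); the number of gradient steps, learning rate and optimizer are illustrative assumptions, not the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def ncc_logits(query_feats, support_feats, support_labels, num_classes):
    """Nearest centroid classifier: centroids are the mean support features per
    class; logits are cosine similarities (equivalent to negative cosine distances
    up to a constant, which the softmax ignores)."""
    centroids = torch.stack([support_feats[support_labels == c].mean(0)
                             for c in range(num_classes)])
    return F.normalize(query_feats, dim=-1) @ F.normalize(centroids, dim=-1).t()

def adapt_on_support(encoder, support_x, support_y, num_classes, steps=40, lr=0.05):
    """Eq. (6): the universal encoder f_phi stays frozen; only the linear map
    m_varphi is optimized on the support set with an NCC head and cross-entropy."""
    with torch.no_grad():
        feats = encoder(support_x)                               # frozen universal features
    mapping = torch.nn.Linear(feats.shape[-1], feats.shape[-1], bias=False)
    opt = torch.optim.Adam(mapping.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mapped = mapping(feats)
        loss = F.cross_entropy(ncc_logits(mapped, mapped, support_y, num_classes), support_y)
        loss.backward()
        opt.step()
    return mapping                                               # used with f_phi on the query set
```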

Discussion. In contrast to the previous setting where the universal representations are learned over multiple domains from sufficiently large data, cross-domain few-shot learning requires obtaining the domain-specific knowledge from only few samples in an unseen domain, which is extremely challenging. Hence, our hypothesis is that transferring universal representations should yield more effective learning of new domains, as the base assumption is that there is a bounded number of representations for vision problems.

5 Experiments

In this section, we analyze and evaluate our method on three problems: i) learning multiple dense prediction tasks on two popular benchmarks, NYU-v2 (Silberman et al., 2012) and Cityscapes (Cordts et al., 2016), in Sect. 5.1, ii) learning multiple diverse visual domains on the Visual Domain Decathlon (Rebuffi et al., 2017) in Sect. 5.2, and iii) few-shot classification on MetaDataset (Triantafillou et al., 2020) in Sect. 5.3. Finally, we conduct an extensive analysis of various design choices in Sect. 5.5.Footnote 3 Please refer to the supplementary for more detailed results and for a training cost analysis (Sect. B.1.1).

5.1 Learning Multiple Dense Prediction Tasks

Here, we evaluate our method on learning universal representations for performing multiple dense prediction tasks on two standard multi-task learning benchmarks NYU-v2 (Silberman et al., 2012) and Cityscapes (Cordts et al., 2016) as in Liu et al. (2019, 2021a).

Datasets and Experimental Setting. We follow the training and evaluation settings in Liu et al. (2019, 2021a) for both single-task and multi-task learning on both datasets. More specifically, NYU-v2 (Silberman et al., 2012) contains RGB-D indoor scene images, on which we evaluate performance on three tasks: 13-class semantic segmentation, depth estimation, and surface normal estimation. We use the true depth data recorded by the Microsoft Kinect and the surface normals provided in Eigen and Fergus (2015) for depth and surface normal estimation, as in Liu et al. (2019). All images are resized to \(288 \times 384\) resolution as in Liu et al. (2019). We follow the default setting in Silberman et al. (2012); Liu et al. (2019), where 795 and 654 images are used for training and testing, respectively. Cityscapes (Cordts et al., 2016) consists of street-view images, which are labeled for two tasks: 7-class semantic segmentationFootnote 4 and depth estimation. We resize the images to \(128 \times 256\) to speed up training, as in Liu et al. (2019).

On both NYU-v2 and Cityscapes, we follow the training and evaluation protocol in Liu et al. (2019). We apply our method and all the baseline methods to two common types of multi-task architectures, encoder-based and decoder-based ones. The encoder-based methods only share information in the encoder before decoding each task with an independent task-specific decoder, while the decoder-based approaches also exchange information during the decoding stage (Vandenhende et al., 2021). For the encoder-based methods, we use SegNet (Badrinarayanan et al., 2017) as the backbone. As in Liu et al. (2019), we use a cross-entropy loss for semantic segmentation, an l1-norm loss for depth estimation in Cityscapes, and a cosine similarity loss for surface normal estimation in NYU-v2. We use exactly the same hyper-parameters, including learning rate and optimizer, and the same evaluation metrics as in Liu et al. (2019): mean intersection over union (mIoU) for semantic segmentation, absolute error (aErr) for depth estimation, and mean error (mErr) in the predicted angles for surface normal estimation.
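For illustration, the per-task losses mentioned above could be written as below (PyTorch-style sketch); the handling of invalid pixels (ignore index for segmentation, zero depth treated as missing) is our assumption, and the exact masking in Liu et al. (2019) may differ.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, labels, ignore_index=-1):
    """Pixel-wise cross-entropy; unlabeled pixels use `ignore_index` (assumption)."""
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)

def depth_loss(pred, target):
    """L1 loss over valid pixels; zero depth is treated as missing (assumption)."""
    mask = (target > 0).float()
    return (torch.abs(pred - target) * mask).sum() / mask.sum().clamp(min=1.0)

def normal_loss(pred, target):
    """Cosine similarity loss between predicted and ground-truth surface normals."""
    pred = F.normalize(pred, dim=1)
    target = F.normalize(target, dim=1)
    return 1.0 - (pred * target).sum(dim=1).mean()
```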

For the decoder-based methods, we build our method on PAD-Net (Xu et al., 2018a) and MTI-Net (Vandenhende et al., 2020), which use a multi-scale feature extractor (encoder) based on HRNet-18 (Sun et al., 2019) initialized with ImageNet pretrained weights as the feature encoder. We use the same loss functions, evaluation metrics, and training and evaluation protocol as for the SegNet backbone. For our method, we use uniform loss weights (i.e. \(\lambda ^t=1\) for all tasks) for the task-specific losses, unless stated otherwise. As we do not minimize the difference between the predictions of the universal and single-task networks, we set \(\lambda ^t_p\) in Eq. (3) to zero. We first split the training set into train and validation subsets to search \(\lambda _f^t \in \{1, 2\}\) by cross-validation, and then train our network on the whole training set. We set \(\lambda _f^t\) to 1 for semantic segmentation and depth and to 2 for surface normal estimation. Please refer to the supplementary (Sect. A.1) for more details.

Multi-task Performance. In addition to the abovementioned evaluation metrics for each task, following prior work (Vandenhende et al., 2021; Li et al., 2022), we also report the multi-task performance \(\bigtriangleup \)MTL, which measures the average per-task drop in performance w.r.t. the single-task baseline:

$$\begin{aligned} \bigtriangleup \text {MTL}=\frac{1}{T}\sum _{t=1}^{T}(-1)^{\gamma ^t}(P^{t}-P_{s}^{t})/P_{s}^{t}, \end{aligned}$$
(7)

where \(\gamma ^t=1\) if a lower value of \(P^{t}\) means better performance for the metric of task t, and 0 otherwise. \(P^{t}\) and \(P_s^{t}\) are the performance (e.g. mIoU for semantic segmentation) of the universal (multi-task) network and the single-task network, respectively.
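For clarity, a small sketch of the \(\bigtriangleup \)MTL computation in Eq. (7) is given below; expressing the result as a percentage follows the tables, and the example numbers in the comment are hypothetical.

```python
def delta_mtl(multi_task_perf, single_task_perf, lower_is_better):
    """Eq. (7): average relative difference (in %) between the multi-task model and
    its single-task counterparts; the sign is flipped for metrics where lower is
    better (e.g. depth aErr, normal mErr)."""
    deltas = []
    for p, p_s, lower in zip(multi_task_perf, single_task_perf, lower_is_better):
        sign = -1.0 if lower else 1.0
        deltas.append(sign * (p - p_s) / p_s)
    return 100.0 * sum(deltas) / len(deltas)

# Hypothetical example: mIoU (higher is better) and depth aErr (lower is better):
# delta_mtl([40.0, 0.60], [38.0, 0.55], [False, True]) -> about -1.9 (%)
```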

Table 1 Test performance on NYU-v2

5.1.1 Encoder-Based Architecture

Compared Methods. We consider two encoder-based architectures: the vanilla MTL network using SegNet (Badrinarayanan et al., 2017), which shares the whole feature encoder across all tasks and has task-specific decoders, and MTAN (Liu et al., 2019), which extends the vanilla MTL baseline by sharing the SegNet across tasks and using task-specific attention modules in each layer to extract task-specific features. We compare our method to the single-task learning (STL) baseline, i.e. training an individual network per task, the vanilla multi-task learning network with uniform loss weights (MTL), and balanced optimization strategies, including Uncertainty (Kendall et al., 2018), GradNorm (Chen et al., 2018), MGDA (Sener & Koltun, 2018), DWA (Liu et al., 2019), PCGrad (Yu et al., 2020), GradDrop (Chen et al., 2020), IMTL (Liu et al., 2021c) and CAGrad (Liu et al., 2021a). We also consider BAM (Clark et al., 2019), which was originally designed for natural language processing, and adapt it to visual dense prediction tasks by aligning dense predictions. Importantly, this model performs knowledge distillation on predictions when learning the multi-task network, and hence comparing against it sheds light on the importance of matching intermediate representations in the task-specific spaces. Here we use a KL-divergence loss, an l1-norm loss and a cosine similarity loss as the knowledge distillation losses on predictions for semantic segmentation, depth estimation and surface normal estimation, respectively. We reproduce all methods in the same settings for a fair comparison, and the results of the compared methods are similar to or better than the ones reported in the corresponding papers.

Table 2 Testing results on Cityscapes

Results on NYU-v2. Table 1 shows the results of our method and the compared approaches on NYU-v2. We see that the vanilla MTL (Uniform) using SegNet achieves better performance in depth estimation, however its performance drops in surface normal estimation compared to STL. This indicates that joint optimization of multiple tasks with uniform loss weights leads to unbalanced results, and overall worse performance than the STL models in the \(\varDelta \)MTL metric. While only a few balanced optimization algorithms help to improve the MTL performance, IMTL-H and CAGrad obtain the best results when applied to SegNet and MTAN. IMTL-H balances the pace at which tasks are learned by looking at the projection of the average gradient w.r.t. the shared parameters onto the individual tasks, while CAGrad modifies the parameter update such that it not only minimizes the average of the task-specific losses but also decreases each task-specific loss. While BAM optimizes the multi-task network through knowledge distillation, it performs worse than the Uniform baseline, as it aligns the predictions, which requires different loss functions (e.g. cross-entropy for segmentation) and hence solving another unbalanced optimization problem. This shows that simply distilling the predictions of the multiple single-task networks leads to poor performance. Finally, our method significantly outperforms these methods with either the SegNet or MTAN backbone, e.g. a +7.84% improvement in MTL performance over IMTL-H, the best baseline. The results suggest that distilling features from multiple single-task networks enables more effective learning of shared representations.

Results on Cityscapes. We also evaluate all methods on Cityscapes and report the results in Table 2. Similar to the results on NYU-v2, BAM obtains worse results than the STL models (e.g. \(-\)0.58% MTL performance when using the vanilla MTL method with SegNet). Among the loss balancing methods, Uncertainty and IMTL-H obtain the best MTL performance with both backbones (i.e. SegNet and MTAN), but their performance is lower than the STL models on both tasks. This shows the difficulty of optimizing the MTL network in a balanced way on this problem. Our method obtains significant gains on both tasks over all the compared methods and also achieves better results than the STL models. The results again demonstrate that our method is able to optimize the MTL model in a more balanced way and to achieve better overall results. In addition, our method has far fewer parameters (one network) than the STL models (two networks) on Cityscapes.

Incorporating Loss Balancing into Ours. While our method achieves consistent improvements over all the target tasks, solving Eq. (3) also involves minimizing a weighted sum of multiple loss terms. Hence we investigate whether our method can also benefit from dynamically setting the weights of the individual loss terms on NYU-v2. In particular, we use SegNet as the backbone and dynamically update the weights of the task-specific losses (i.e. \(\lambda ^t\)) while keeping the weight of the distillation loss (\(\lambda _f\)) fixed. We evaluate our method with each of the three best performing loss balancing methods, i.e. Uncertainty, IMTL-H and CAGrad, and report the results in Table 3. We see that our method is complementary to these loss balancing methods and significantly improves their performance (e.g. Ours (Uncertainty) obtains about +12 improvement in MTL performance over Uncertainty). Also, applying our method to the loss balancing methods obtains better performance than using our method with the Uniform MTL baseline.

Table 3 Testing results on NYU-v2

5.1.2 Decoder-Based Architectures

We also apply our method to the decoder-based methods, PAD-Net (Xu et al., 2018a) and MTI-Net (Vandenhende et al., 2020), which are specifically designed for MTL by exchanging information during the decoding stage and achieve state-of-the-art performance in MTL (Vandenhende et al., 2021). We incorporate our method into the decoder-based methods by only modifying the loss functions as in Eq. (3). In addition, we include the results of the vanilla MTL method using the same backbone as PAD-Net and MTI-Net (HRNet-18 (Sun et al., 2019)) as a baseline, and we report all results on NYU-v2 and Cityscapes in Tables 4 and 5.

Table 4 Results on NYU-v2
Table 5 Results on Cityscapes

First, we see that the decoder-based methods achieve better performance than the encoder-based ones, as they use more powerful customized architectures and are initialized with pretrained ImageNet weights. Similar to the encoder-based methods, the vanilla MTL method obtains better performance in segmentation than STL, while it performs worse than the STL models in depth and surface normal estimation on NYU-v2 due to the unbalanced optimization. On Cityscapes, it performs worse than the STL baselines on both tasks. Our method, when applied to the vanilla MTL method using the HRNet-18 backbone, improves over the vanilla MTL method and achieves a balanced MTL performance (i.e. better or comparable results than STL) on both datasets.

We see in Tables 4 and 5 that the decoder-based methods (PAD-Net and MTI-Net) obtain better performance than the vanilla MTL method by first employing a multi-task network to make initial task predictions, and then leveraging features from these initial predictions to improve each task output (MTI-Net obtains +1.72 MTL performance on NYU-v2). Here, PAD-Net improves over the vanilla MTL by aggregating information from the initial predictions of the other tasks via spatial attention when estimating the final task output, while MTI-Net extends PAD-Net to a multi-scale procedure by making initial task predictions and distilling information at each individual feature scale. However, they still suffer from the unbalanced optimization problem (e.g. MTI-Net obtains worse performance in surface normal estimation on NYU-v2 and in semantic segmentation on Cityscapes, respectively). Building our method on these decoder-based methods helps to boost their performance (Ours (MTI-Net) vs MTI-Net: +4.11% vs +1.72%). These results indicate that our method can be used with various architectures, enables more balanced performance over multiple tasks, and boosts their overall performance.

Table 6 Universal Representation Learning on Visual Decathlon

5.2 Multi-domain Learning

Here we evaluate our method on learning universal representations for multiple image classification tasks over multiple diverse domains on the Visual Decathlon Benchmark (Rebuffi et al., 2017).

Dataset. The Visual Decathlon Benchmark (Rebuffi et al., 2017) consists of 10 well-known datasets: ILSVRC_2012 (ImNet) (Russakovsky et al., 2015), FGVC-Aircraft (Airc.) (Maji et al., 2013), CIFAR-100 (C100) (Krizhevsky et al., 2009), Daimler Mono Pedestrian Classification Benchmark (DPed) (Munder & Gavrila, 2006), Describable Texture Dataset (DTD) (Cimpoi et al., 2014), German Traffic Sign Recognition (GTSR) (Houben et al., 2013), Flowers102 (Flwr) (Nilsback & Zisserman, 2008), Omniglot (OGlt) (Lake et al., 2015), Street View House Numbers (SVHN) (Netzer et al., 2011), and UCF101 (UCF) (Soomro et al., 2012). In this benchmark (Rebuffi et al., 2017), images are resized by the organizers to a common resolution of roughly 72 pixels to accelerate training and evaluation.

Implementation Details. We follow Rebuffi et al. (2017, 2018), using the official train/val/test splits and evaluation protocol, and use ResNet-26 (He et al., 2016) as the backbone for both the domain-specific networks and the universal network. In our universal network, the backbone (i.e. ResNet-26) is shared across all domains and followed by domain-specific linear classifiers. We use the same data augmentation (random crop, flipping) and SGD as the optimizer, and train the domain-specific networks and our universal network for 120 epochs as in Rebuffi et al. (2017, 2018). Here we set the task loss weights to 1 (i.e. \(\lambda ^t=1\) for all tasks) and perform cross-validation to search the loss weights (\(\lambda ^t_f,\lambda ^t_p\)) for knowledge distillation on features and predictions in \(\{0.1, 1, 10\}\), setting \(\lambda _f\) and \(\lambda _p\) to 10 for ImageNet, 0.1 for DPed, and 1 for the other datasets. Please refer to the supplementary (Sect. A.2) for more details.

Results. On Visual Decathlon, we compare our method to Feature, i.e. a feature extractor trained on ImageNet with classifiers learned on top for the other domains, and to single-domain learning models that are either learned from Scratch or Finetuned from the ImageNet pretrained feature extractor. We also compare our method with existing approaches, including Serial Residual Adapters (RA) (Rebuffi et al., 2017), Parallel RA and Parallel RA SVD (Rebuffi et al., 2018), DAN (Rosenfeld & Tsotsos, 2018) and Piggyback (Mallya et al., 2018). Their results are taken from the corresponding papers.

We report results on the test split of each domain obtained through the official online evaluation (Rebuffi et al., 2018) in Table 6, including test accuracy on the individual datasets, average accuracy over the 10 datasets (avg), the decathlon evaluation score (S) (Rebuffi et al., 2018), and the number of parameters (#params) relative to a single task network. We also report the multi-domain performance (i.e. \(\bigtriangleup \text {MDL}\)) as described in Eq. (7). First, while using ImageNet features requires only 1\(\times \) parameters, they do not generalize well to other datasets when a large domain gap is present (e.g. SVHN). In contrast, the single-task learning models obtained by either learning from scratch or finetuning achieve significantly better performance (e.g. Finetune obtains 76.51 average accuracy and a score of 2500) at the expense of 10 times more parameters.
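For illustration, a possible implementation of such a relative multi-domain metric is sketched below. We stress that the actual definition of \(\bigtriangleup \text {MDL}\) is given in Eq. (7); this snippet merely assumes it is the average per-domain relative accuracy change (in %) with respect to the baseline.

```python
def delta_mdl(accuracies, baseline_accuracies):
    """Illustrative only: assumes Delta-MDL is the mean per-domain relative
    accuracy change (in %) w.r.t. the Finetune baseline; see Eq. (7) for the
    actual definition."""
    assert len(accuracies) == len(baseline_accuracies)
    relative = [(a - b) / b * 100.0 for a, b in zip(accuracies, baseline_accuracies)]
    return sum(relative) / len(relative)
```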

We use Finetune as the baseline, as in Rebuffi et al. (2017, 2018), and compute the \(\bigtriangleup \text {MDL}\) metric for the existing methods and ours. Serial RA, which learns a set of domain-specific residual adapters on top of an ImageNet pretrained feature extractor, greatly reduces the number of parameters to 2\(\times \) while obtaining slightly worse performance than Finetune (73.88 vs 76.51 average accuracy). The performance is further improved by DAN, which constrains newly learned filters to be linear combinations of existing ones when adapting a pretrained model to other domains (DAN obtains 77.01 average accuracy and +0.60 MDL performance while requiring only 2.17\(\times \) parameters). Piggyback learns binary domain-specific masks to select effective filters when adapting a pretrained model to each domain (it obtains 76.60 average accuracy and +0.00 MDL performance, and further reduces the number of parameters to 1.28\(\times \)). Connecting the RAs in parallel to the backbone (Parallel RA) boosts the performance of the serial configuration while keeping the same computational cost (78.07 average accuracy and +2.20 MDL performance). The authors show that the performance can be further improved by decomposing the residual adapters into low-rank adapters (Parallel RA SVD: 78.36 average accuracy, +2.70 MDL performance and 1.5\(\times \) parameters).

Finally, we show that our method (Ours) successfully learns a single feature extractor shared across all domains with only 1\(\times \) parameters, the same number of parameters as a single-domain network. Our model obtains better results than the Finetune baseline and the existing methods in most domains (Ours obtains 79.25 average accuracy and +4.00 MDL performance). This clearly shows that learning representations from all domains jointly produces more general features than ImageNet representations. However, this is challenging due to optimization issues: the difference between our model and the vanilla MDL model shows that simply optimizing over multiple domain-specific loss functions is not sufficient to obtain good representations, and that representation distillation is crucial. We also show that RAs can be incorporated into our universal network. Jointly learning a shared ResNet-26 backbone with residual adapters (i.e. Ours (RA)) boosts the performance further, e.g. Ours (RA) obtains the best or second best performance in most datasets (e.g. Airc., DTD), the best average accuracy (80.52), the best MDL performance (+5.96) and the best score (4005), while requiring only 2\(\times \) parameters.

5.3 Cross-domain Few-shot Learning

Here, we evaluate our method on few-shot learning on the recent MetaDataset (Triantafillou et al., 2020).

Dataset. MetaDataset (Triantafillou et al., 2020) is a few-shot classification benchmark that initially consisted of ten datasets: ILSVRC_2012 (ImageNet) (Russakovsky et al., 2015), Omniglot (Lake et al., 2015), FGVC-Aircraft (Aircraft) (Maji et al., 2013), CUB-200-2011 (Birds) (Wah et al., 2011), Describable Textures (DTD) (Cimpoi et al., 2014), QuickDraw (Jongejan et al., 2016), FGVCx Fungi (Fungi) (Brigit & Yin, 2018), VGG Flower (Flower) (Nilsback & Zisserman, 2008), Traffic Signs (Houben et al., 2013) and MSCOCO (Lin et al., 2014); it was later expanded to 13 datasets with the addition of MNIST (LeCun et al., 1998) and CIFAR-10/-100 (Krizhevsky et al., 2009).

We follow the standard procedure in Triantafillou et al. (2020) and use the first eight datasets for meta-training, where each dataset is further divided into train, validation and test sets with disjoint classes. While evaluation on these datasets measures the generalization ability in the seen domains, the remaining five datasets are reserved as unseen domains in meta-test for measuring the cross-domain generalization ability.

Table 7 Comparison to baselines and state-of-the-art methods on MetaDataset

Implementation Details. In all experiments we build our method on a ResNet-18 (He et al., 2016) backbone for both the single-domain and multi-domain networks. In the multi-domain network, we share all layers except the last classifier across domains. For training the single-domain models, we strictly follow the training protocol in Dvornik et al. (2020), using an SGD optimizer with momentum and a cosine annealing learning rate scheduler with the same hyperparameters. For our multi-domain network, we use the same optimizer and scheduler as before and train it for 240,000 iterations. We set \(\lambda _f\) and \(\lambda _p\) in Eq. (3) to 4 for ImageNet and 1 for the other datasets, and use early stopping based on cross-validation over the validation sets of the 8 training datasets. We refer to the supplementary (Sect. A.3) for more details.
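The following PyTorch sketch illustrates the multi-domain architecture described above: a single shared backbone (torchvision's ResNet-18 as a stand-in) followed by one linear classifier per domain. The class counts and input resolution are placeholders rather than the exact values from our training setup.

```python
import torch
import torch.nn as nn
import torchvision

class MultiDomainNet(nn.Module):
    """Sketch of the multi-domain network: one shared backbone and one linear
    classifier per domain. Class counts below are hypothetical placeholders."""

    def __init__(self, num_classes_per_domain):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features      # 512 for ResNet-18
        backbone.fc = nn.Identity()             # keep only the feature extractor
        self.backbone = backbone                # shared across all domains
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in num_classes_per_domain]
        )

    def forward(self, x, domain_id):
        features = self.backbone(x)             # universal features
        return self.classifiers[domain_id](features)

# eight training domains with hypothetical class counts
net = MultiDomainNet([712, 883, 70, 140, 33, 241, 994, 71])
logits = net(torch.randn(4, 3, 84, 84), domain_id=2)
```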

Baselines and Compared Methods. We first compare our method to our own baselines: (i) the best single-domain model (Best SDL), where we use each single-domain network as a feature extractor, test it for few-shot classification on each dataset, and pick the best performing model (see supplementary Sect. B.2.1 for the complete results). This involves evaluating 8 single-domain networks on 13 datasets and serves as a very competitive baseline; (ii) the vanilla multi-domain learning baseline (MDL), learned by optimizing Eq. (2) without the proposed distillation method. As additional baselines, we include the best performing method in Triantafillou et al. (2020), i.e. Proto-MAML, as well as the state-of-the-art methods BOHB-E (Saikia et al., 2020), CNAPS (Requeima et al., 2019), SUR (Dvornik et al., 2020), URT (Liu et al., 2021b) and Simple CNAPS (Bateni et al., 2020).Footnote 5 For evaluation, we follow the standard protocol in Triantafillou et al. (2020), randomly sample 600 tasks for each dataset, and report the average accuracy and 95% confidence interval in all experiments. For a fair comparison, we reproduce the results of SUR (Dvornik et al., 2020), URT (Liu et al., 2021b) and Simple CNAPS (Bateni et al., 2020) by training and evaluating them with their code, as recommended by MetaDataset.

Results on MetaDataset. As described in Triantafillou et al. (2020), we sample each task with a varying number of ways and shots and report the results in Table 7. Our method outperforms the state-of-the-art methods in seven out of eight seen datasets and four out of five unseen datasets. We also compute the average rank as recommended in Triantafillou et al. (2020): our method ranks 1.3 on average, while the state-of-the-art SUR and URT rank 5.0 and 4.4, respectively. In detail, we obtain significantly better results than the second best approach on Aircraft (+2.8), Birds (+2.1), Texture (+4.2) and VGG Flower (+1.5) for seen domains, and on Traffic Sign (+6.1)Footnote 6 and MSCOCO (+3.8) for unseen domains. The results show that jointly learning a single set of representations provides better generalization than fusing representations from multiple single-domain feature extractors as done in SUR and URT. Notably, our method requires fewer parameters and computations at inference than SUR and URT, as it runs only one universal network to extract features, while both SUR and URT need to pass the query set through multiple single-domain networks.

Table 8 Global retrieval performance on MetaDataset

We also see that our method outperforms the two strong baselines, Best SDL and MDL, in all datasets except QuickDraw. This indicates that (i) universal representations are superior to single-domain ones when generalizing to new tasks in both seen and unseen domains, while requiring significantly fewer parameters (1 vs 8 neural networks), and (ii) our distillation strategy is essential for obtaining good multi-domain representations. While MDL outperforms the best SDL in certain domains by transferring representations across them, its performance is lower than SDL in other domains, possibly due to negative transfer across the significantly diverse domains. Surprisingly, MDL still achieves the third best average rank, indicating the benefit of multi-domain representations.

Global Retrieval. Here we go beyond the few-shot classification experiments and evaluate the generalization ability of the representations learned by our multi-domain network in a retrieval task, inspired by the metric learning literature (Oh Song et al., 2016; Yu et al., 2019). To this end, for each test image, we find the nearest images in the entire test set in the feature space and test whether they correspond to the same category. As the evaluation metric, we use Recall@k, which counts a query as positive if at least one of its k closest neighbors has the same label. In Table 8, we compare our method with Simple CNAPS in Recall@1 and Recall@2 (see supplementary Sect. B.2.7 for more results). Since URT and SUR require adaptation using a support set and no such adaptation is possible in the retrieval task, we replace them with two baselines that concatenate or sum the features from multiple domain-specific networks. Our method achieves the best performance in ten out of thirteen domains, with significant gains in Aircraft, Birds, Textures and Fungi. This strongly suggests that our multi-domain representations are key to the success of our method in the previous few-shot classification tasks. We also provide additional experiments in the supplementary (Sect. B.2).
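A minimal sketch of this evaluation is given below; it assumes cosine similarity in feature space and excludes the query image itself, which may differ in detail from the exact protocol used for Table 8.

```python
import torch
import torch.nn.functional as F

def recall_at_k(features, labels, ks=(1, 2)):
    """Sketch of the retrieval evaluation: cosine similarity (an assumption),
    query excluded from its own neighbour list; a query counts as positive if
    any of its k nearest neighbours shares its label."""
    features = F.normalize(features, dim=1)
    sims = features @ features.t()
    sims.fill_diagonal_(float('-inf'))                   # never retrieve the query itself
    knn = sims.topk(max(ks), dim=1).indices              # (N, max_k) neighbour indices
    neighbour_labels = labels[knn]                       # labels of the neighbours
    results = {}
    for k in ks:
        hit = (neighbour_labels[:, :k] == labels.unsqueeze(1)).any(dim=1)
        results[f"Recall@{k}"] = hit.float().mean().item()
    return results
```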

5.4 Hierarchical Distillation with Task Grouping

As introduced in Sect. 4, here we first randomly group tasks and learn a single network per group, and then distill their knowledge to learn universal representations. In addition to the networks per task, the networks per group can be trained in parallel. This strategy allows our method to scale better to scenarios with a large number of tasks and domains by requiring only a few group-specific networks in the final stage. Note that this strategy does not necessarily reduce the total training time, but it reduces the number of feature extractors needed for training the universal representations. Here, we evaluate this strategy on Visual Decathlon under three different random groupings and report the performance of the obtained universal representations in Table 9. As ImageNet is a large, diverse dataset, we treat it as a single group and randomly assign the remaining datasets to three other groups, dividing the 10 domains/tasks into 4 groups in each grouping.Footnote 7 The results show that training the universal network with task grouping obtains comparable, or even slightly better, results than learning it without task grouping. In addition, the different groupings achieve similar average performance to each other, with the first grouping obtaining the best result.
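For concreteness, the sketch below illustrates one such random grouping, assuming ImageNet forms its own group and the remaining nine Decathlon datasets are shuffled into three further groups of three; a group-specific network is then trained per group before the final distillation.

```python
import random

# Hypothetical sketch of one random grouping on Visual Decathlon.
others = ['Airc.', 'C100', 'DPed', 'DTD', 'GTSR', 'Flwr', 'OGlt', 'SVHN', 'UCF']
random.shuffle(others)
groups = [['ImNet']] + [others[i:i + 3] for i in range(0, 9, 3)]
# one group-specific network is then trained per group, and their features are
# distilled into the universal network in the final stage
```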

Table 9 Universal Representation Learning with task grouping on Visual Decathlon Benchmark

5.5 Further Analysis

In this section, we provide an extensive analysis of various adapter types and loss functions for knowledge distillation in multi-task and multi-domain learning.

Effect of Adapters. As explained in Sect. 4, we employ adapters to align the universal representations with each task-specific representation (see \(\alpha _{\theta ^{t}}\) in Eq. (3)). Here, we evaluate our method without any adapters (by directly matching universal representations with task-specific ones), and with two different adapter parameterizations: linear adapters (each adapter is a single linear \(1 \times 1\) convolutional layer; this is the default setting in Sects. 5.1, 5.2 and 5.3) and nonlinear adapters (each adapter consists of two linear convolutional layers with a ReLU layer between them). We report the results on NYU-v2 in Table 10. We can see that, although directly aligning features without adapters improves performance on all tasks over the vanilla MTL baseline (Uniform), it still performs significantly worse than using either linear or nonlinear adapters. This verifies that the adapters help align the features of the multi-task network with those of the different single-task networks. We also observe that linear and nonlinear adapters obtain comparable results, so linear adapters are sufficient. We hypothesize that there is a tradeoff between the complexity of the adapters and the informativeness of the aligned features: for instance, deep multi-layer adapters would overfit to the data and align the pairs very accurately, leading to inferior representation transfer. Thus we argue that linear adapters provide a good complexity/performance tradeoff.
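A minimal sketch of the two adapter parameterizations is given below; the channel count is a placeholder and the exact layer configuration is an assumption consistent with the description above.

```python
import torch.nn as nn

def make_adapter(channels, nonlinear=False):
    """Sketch of the adapters compared in Table 10: a linear adapter is a
    single 1x1 convolution; the nonlinear variant adds a ReLU between two 1x1
    convolutions. The channel count below is a placeholder."""
    if nonlinear:
        return nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
    return nn.Conv2d(channels, channels, kernel_size=1)

# the adapter maps universal features into a task-specific space before the
# distillation distance d_f is computed, e.g. d_f(adapter(f_universal), f_task)
adapter = make_adapter(512)
```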

Table 10 Testing results on NYU-v2

Loss Functions for Knowledge Distillation. Here we evaluate various loss functions for distilling intermediate representations (i.e. \(d_{f}(\cdot )\) in Eq. (3)), including standard ones such as L2, L1 and cosine distance, as well as Attention Transfer (AT) (Komodakis & Zagoruyko, 2017), which aligns spatial attention maps computed by averaging the feature maps along the channel dimension, and CKA. In all cases, we use linear adapters to align the features of the multi-task and single-task networks before measuring their discrepancy with these loss functions.

Table 11 Testing results on NYU-v2

We first evaluate these loss functions for the multiple dense prediction problem on NYU-v2 and report the results in Table 11, where our default loss function is L2. Here, we apply knowledge distillation with the different loss functions to the vanilla MTL method with SegNet (Badrinarayanan et al., 2017) as the backbone (the CKA loss function is not included, as it incurs too large a memory cost to operate on dense feature maps; we refer readers to Sect. B.1.1 in the supplementary for more details on the cost analysis). We can see that AT obtains the worst performance among these loss functions, as it aligns channel-averaged features where some information is lost, but it still outperforms the vanilla MTL with uniform loss weights. While the L1 and cosine loss functions perform better than AT, the L2 loss function obtains the best results.
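For reference, the sketch below shows a plausible implementation of the default L2 feature-distillation loss and of Attention Transfer on channel-averaged attention maps; the exact normalization used in our experiments may differ.

```python
import torch.nn.functional as F

def l2_distill(f_uni, f_task):
    # default feature-distillation loss in the dense prediction experiments
    return F.mse_loss(f_uni, f_task)

def attention_transfer(f_uni, f_task):
    """Sketch of Attention Transfer: feature maps are collapsed to spatial
    attention maps by averaging squared activations over channels and the
    normalised maps are matched; the exact normalisation is an assumption."""
    def attention(f):                                   # (B, C, H, W) -> (B, H*W)
        return F.normalize(f.pow(2).mean(dim=1).flatten(1), dim=1)
    return (attention(f_uni) - attention(f_task)).pow(2).mean()
```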

Compared to learning multiple dense prediction tasks in a single domain, learning universal representations from multiple visually diverse domains is a more challenging problem. Hence we use CKA as the loss function for representation distillation, i.e. \(d_f\). Here we evaluate the effect of CKA on MetaDataset, compare it to different distillation loss functions, and report the results in Table 12. In this study, we set \(\lambda _p\) to zero and do not match the predictions of the universal network with those of the domain-specific ones. Among these loss functions, the best results are obtained with the CKA loss in all domains. Although the universal representations are first mapped to the domain-specific spaces via adapters, the L2 and cosine loss functions are not sufficient to match features from very diverse domains, and further aligning the features with CKA is significantly beneficial.
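A sketch of a linear-CKA distillation loss is given below, computed from batch-level Gram matrices so that the cost scales with batch size rather than feature dimensionality; this is an illustrative variant and not necessarily the exact formulation used in our experiments.

```python
import torch

def cka_loss(f_uni, f_task, eps=1e-8):
    """Sketch of a linear-CKA distillation loss computed from batch Gram
    matrices; an illustrative variant only."""
    x = f_uni.flatten(1)
    y = f_task.flatten(1)
    x = x - x.mean(dim=0, keepdim=True)                 # centre each feature
    y = y - y.mean(dim=0, keepdim=True)
    gram_x, gram_y = x @ x.t(), y @ y.t()               # N x N Gram matrices
    hsic_xy = (gram_x * gram_y).sum()                   # tr(gram_x gram_y)
    hsic_xx = (gram_x * gram_x).sum()
    hsic_yy = (gram_y * gram_y).sum()
    cka = hsic_xy / (hsic_xx.sqrt() * hsic_yy.sqrt() + eps)
    return 1.0 - cka
```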

We then evaluate the individual contributions of distillation through representations and predictions, using CKA and KL divergence respectively, in Table 13. Compared to applying only the KL loss on predictions (‘Ours w/o \({d}_f\)’), aligning only the representations with the CKA loss (‘Ours w/o \({d}_p\)’) performs better in most domains. Finally, combining \({d}_f\) (CKA) with \({d}_p\) (KL divergence), i.e. ‘Ours (\({d}_f+{d}_p\))’, gives the best performance over the multi-domain models trained with the individual loss functions.
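The prediction-level term \({d}_p\) can be sketched as a standard KL-based distillation between the universal and domain-specific predictions; the temperature below is an assumption, and the combination with the feature term is shown only schematically.

```python
import torch.nn.functional as F

def prediction_distill(universal_logits, specific_logits, T=1.0):
    """Sketch of the prediction-level term d_p: KL divergence between the
    universal network's predictions and the domain-specific ones; the
    temperature T is an assumption."""
    target = F.softmax(specific_logits / T, dim=1)
    log_pred = F.log_softmax(universal_logits / T, dim=1)
    return F.kl_div(log_pred, target, reduction='batchmean') * (T * T)

# the full distillation objective then combines both terms, e.g.
# loss = lambda_f * cka_loss(f_uni, f_task) + lambda_p * prediction_distill(z_uni, z_task)
```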

Table 12 Quantitative analysis of knowledge distillation loss functions for \({d}_f\)
Table 13 Quantitative analysis of knowledge distillation loss functions on representations and predictions
Fig. 3

Qualitative results on NYU-v2. The first column shows the RGB image, the second column plots the ground truth or predictions with the IoU (\(\uparrow \)) score of all methods for semantic segmentation, the third column presents the ground truth or predictions with the absolute error (\(\downarrow \)) for depth estimation, and the last column shows the surface normal predictions with the mean error (\(\downarrow \))

5.6 Qualitative Results

Here, we analyze our method qualitatively and compare it to STL, MTL with uniform loss weights (Uniform) and the best compared method, i.e. IMTL-H (Liu et al., 2021c), for the multiple dense prediction problem on NYU-v2 with the SegNet backbone (see Fig. 3; see supplementary Sect. B.1 for more examples). We can see that the Uniform baseline improves over STL on segmentation and depth estimation, while it performs worse on surface normal estimation. Although dynamically balancing the loss values with IMTL-H improves the overall performance, it still performs worse on surface normal estimation. Finally, by distilling representations from the single-task models into the universal network, our method produces better or comparable results to STL, i.e. it produces similar outputs to STL for surface normals and more accurate predictions for segmentation and depth estimation, as it enables a balanced optimization of the universal network and allows one task to benefit from another. This indicates the effectiveness of our method in learning shared representations for multiple dense prediction tasks.

We also qualitatively analyze our method and compare it to the state-of-the-art URT (Liu et al., 2021b) in Fig. 4 for cross-domain few-shot learning on MetaDataset. In particular, we illustrate the nearest neighbors in two different datasets given a query image (see supplementary Sect. B.2.6 for more examples). While URT retrieves images with similar colors, shapes and backgrounds, our method retrieves semantically similar images and finds more correct neighbors than URT. This again suggests that our method learns more semantically meaningful and general representations.

Fig. 4

Qualitative analysis of our method in two datasets. Green and red colors indicate correct and false predictions respectively (Color figure online)

6 Conclusion

We showed that learning general features from multiple tasks and domains is an important step towards better generalization in various computer vision problems, including multiple dense prediction, multi-domain image classification and cross-domain few-shot learning. By distilling representations from multiple task-specific or domain-specific networks, we can successfully learn a single set of universal representations after aligning them via small task/domain-specific adapters.

Obtaining complete universal representations capable of solving all vision problems is too ambitious and beyond the scope of this manuscript. Our focus here is to learn a single set of representations, contained in a single deep neural network, over a limited set of tasks/domains. We use ‘universal’ to emphasize a compact set of representations that can be employed to solve various vision tasks from diverse domains. We demonstrate that as more complete universal representations are obtained, learning new domains and tasks becomes easier and can be performed efficiently from only a few samples by transfer learning.

So far, we have focused on either multi-task or multi-domain learning. We would like to extend the proposed URL method to problems that involve multiple tasks and multiple domains at the same time, such as semantic segmentation and depth estimation across different cities, times of day (e.g. day or night) and weather conditions (e.g. clear, rainy, snowy).