Universal Representations: A Unified Look at Multiple Task and Domain Learning

We propose a unified look at jointly learning multiple vision tasks and visual domains through universal representations, a single deep neural network. Learning multiple problems simultaneously involves minimizing a weighted sum of multiple loss functions with different magnitudes and characteristics, and thus often results in an unbalanced state in which one loss dominates the optimization, yielding poor results compared to learning a separate model for each problem. To this end, we propose distilling the knowledge of multiple task/domain-specific networks into a single deep neural network after aligning its representations with the task/domain-specific ones through small-capacity adapters. We rigorously show that universal representations achieve state-of-the-art performance in learning multiple dense prediction problems in NYU-v2 and Cityscapes, multiple image classification problems from diverse domains in the Visual Decathlon Dataset, and cross-domain few-shot learning in MetaDataset. Finally, we also conduct multiple analyses through ablation and qualitative studies.

In contrast, humans in the early years of their development develop powerful internal visual representations that are subject to small refinements in response to later visual experience [1,63,48]. Once these visual representations are formed, they are universal and later employed in many diverse vision tasks, from reading text and recognizing faces to interpreting visual art forms.
The presence of universal representations in computer vision [6] has important implications. First, it means that vision has limited complexity. A growing number of visual domains and tasks¹ can be modeled with a bounded number of representations. As a result, one can use a compact set of representations for learning multiple domains and tasks, and efficiently share features and computations across them, which is crucial on platforms with limited computational resources such as mobile devices and autonomous cars. Second, as we obtain more complete universal representations, learning new domains and tasks becomes easier and can be performed efficiently from only a few samples by transfer learning.
In practice, learning universal representations requires addressing several challenges. First, modelling diverse visual data demands deep network architectures that can simultaneously learn representations while selectively sharing only the relevant representations across multiple tasks and domains. To this end, previous multi-task works proposed controlling representation sharing across tasks through latent connections [65,83], constructing branched deep neural networks based on task affinities [100], custom attention mechanisms [58], neural architecture search [52,9,32], developing progressive communication across multiple tasks through recurrent networks [5,114,102], and multi-scale feature sharing [107,102]. Other previous works assume that features extracted from deep networks pretrained on ImageNet provide a basis for universal representations and adapt them with a set of compact adapters to various domains [75,76,81,19]. In cross-domain few-shot classification, where the goal is to generalize to unseen tasks and domains from few samples, features from multiple domain-specific networks are considered as universal features and transferred to previously unseen domains and tasks after a post selection step [23,56].
¹ While domain and task definitions vary in previous work and are used interchangeably, in our experiments each domain denotes a data domain, i.e. a dataset such as ImageNet or Omniglot, and each task denotes either a different prediction task, such as semantic segmentation and depth estimation, or the same prediction task, such as image classification, albeit over a different set of categories. A subtle difference between the two settings is that multiple tasks can be defined for a single image, but only a single domain.
The second challenge is to develop training algorithms to learn representations that achieve good performance not only in one of the tasks or domains but in all of them. This problem is especially visible when the training involves jointly minimizing a set of loss functions (i.e. one for each task) with significantly different difficulty levels, magnitudes, and characteristics. Thus a naive strategy of uniformly weighting multiple losses can lead to sub-optimal performance, and searching for optimal weights in a continuous hyperparameter space can be prohibitively expensive. Previous works [12,87,40,31,58] address the unbalanced loss optimization problem by weighting loss functions based on the task-dependent uncertainty of the model at training time [40], proposing a Pareto optimal solution [87], or eliminating conflicting gradient components between the tasks [110]. Although these methods are shown to improve over the uniform loss weighting strategy in some benchmarks, they do not consistently outperform the baseline that simply weights each loss function with a constant scalar [101].
In this work, we focus on the second challenge. Inspired by knowledge distillation [80,34], we approach the problem from a different perspective and propose a general methodology for universal representation learning that can be applied to a diverse set of problems, including multi-task and multi-domain learning in few- and many-shot settings. We propose a two-stage procedure for universal representation learning, where we first train a set of task- or domain-specific models and freeze their parameters, and then distill their knowledge into a universal representation network while simultaneously training it over multiple tasks/domains. In contrast to standard knowledge distillation, in our setting each "teacher" network is trained for a significantly different task (e.g. semantic segmentation, depth estimation) and/or domain (e.g. flowers, handwritten characters), and encodes significantly different representations. Hence naively distilling their representations into a single network would result in poor performance. To this end, we propose aligning the universal network with the individual ones via small task-specific adapters before the distillation, and using specific loss functions that are invariant to certain transformations between representations.
Our method has multiple key advantages over previous work. First, in contrast to relying solely on weighting the individual loss functions (e.g. [40]) or modifying the direction of their gradients (e.g. [110]), which are limited in their ability to prevent one task from dominating or interfering with the rest, we propose more explicit control over the model parameters through knowledge distillation such that representations from all tasks/domains are included in the universal representations. Second, unlike task/domain-specific loss functions with different characteristics, which are difficult to balance, the distillation loss function is the same for all tasks/domains and hence provides balanced optimization by design. Third, unlike [23,56] that employ multiple feature extractors, our model learns a single set of universal representations (a single feature extractor) over multiple domains, which has a fixed computational cost at inference regardless of the number of domains. Finally, our method can be successfully incorporated into various state-of-the-art network architectures customized for multi-task/domain learning [102,76] and loss balancing strategies [40,57,55].
We illustrate our universal representation learning method and its applications to three standard vision problems in Fig. 1. The common step for all the applications is to first train task- or domain-specific models and then distill their knowledge into a single universal network (see Fig. 1(a)). We show that the universal representations (depicted by the green network) can successfully be employed in (i) jointly learning multiple dense vision problems such as semantic segmentation and depth estimation (see Fig. 1(b)), (ii) jointly learning multiple image classification problems from diverse datasets such as ImageNet [20], Omniglot [46] and FGVC Aircraft [61] (see Fig. 1(c)), and (iii) learning to classify images from few training samples of unseen tasks and domains (see Fig. 1(d)). In all applications, the computations and representations are largely shared across tasks and domains through the universal network, while light-weight task- or domain-specific heads are used to obtain the predictions by mapping the universal representations to the task output space.
In summary, our core contribution is a generic framework for universal representation learning that can be employed in very diverse problems, including dense multi-task prediction and multi-domain many-shot and few-shot image classification. We show that the learned universal representations generalize well not only to unseen samples in previously seen domains and tasks but also to unseen tasks and domains in the few-shot setting. We rigorously evaluate our method and show that it outperforms the state-of-the-art multi-task dense prediction methods in NYU-v2 [88] and CityScapes [17], cross-domain few-shot image classification methods in Meta-Dataset [98], and multi-domain many-shot image classification methods in the Visual Decathlon benchmark [75]. We also propose an efficient learning strategy that enables learning representations from subsets of tasks in parallel and finally merging them into universal representations. We extensively analyze the performance of our method over various design choices, including different deep network architectures, adaptor types and loss functions for knowledge distillation.
This work is an extended version of our prior contributions [49,50] that focus on dense multi-task prediction and cross-domain few-shot learning problems. The new contributions are: a unified look at universal representations for a diverse set of problems in multi-domain and multi-task learning; a hierarchical distillation strategy that allows for learning representations in parallel; evaluation on a new problem, multi-domain many-shot learning; evaluation of our method with the state-of-the-art multi-task dense prediction architectures [107,58,102]; a more extensive analysis of adaptor types and loss functions; and a more extensive and recent literature review.
The rest of the paper is organized as follows: Section 2 provides an extensive overview of related work in multi-task learning, multi-domain learning and knowledge distillation. Section 3 reviews the background learning formulation for single and multiple task learning. Section 4 introduces universal representation learning for multi-task dense prediction, multi-domain classification and cross-domain few-shot classification. Section 5 provides a rigorous analysis of the design choices and evaluates the performance of the proposed models in multiple standard benchmarks for the three problems. Section 6 concludes the paper with future remarks and limitations.

Multi-task Learning
Multi-task Learning (MTL) [11] aims at learning a single model that can infer all desired task outputs given an input. We refer to [82,113,101] for a more comprehensive literature review. As discussed above, the prior works can be broadly divided into two groups. The first one focuses on improving the network architecture via more effective information sharing across tasks [65,42,83,58,100,52,9,32,7,91,107,115,10,5,114,102]. The second group aims to address the unbalanced optimization that is caused by jointly optimizing multiple task loss functions with varying characteristics, through either actively changing the weight of each loss term [40,58,31,12,54,87,57] and/or modifying the gradients of the loss functions w.r.t. the shared network weights to alleviate the conflicts among tasks [110,55,13,14,94].
Our method is complementary to the first line of work. In fact, we show in Sec. 5 that our method can be used to boost the state-of-the-art multi-task architectures in dense prediction problems. While our goal is aligned with that of the second group, we propose a significantly different strategy based on knowledge distillation to solve the unbalanced loss optimization problem. To this end, we first train a task-specific model for each task in an offline stage and freeze their parameters. We then train the universal representation (multi-task) network to minimize the task-specific losses and also to produce the same features as the task-specific networks. As each task-specific network encodes different features, we introduce small task-specific adapters to project the universal features to the task-specific features and then minimize the discrepancy between task-specific and universal features. In contrast to prior works that rely solely on either weighting the individual loss functions (e.g. [40]) or modifying their gradients for parameter updates (e.g. [110]), which are limited in their ability to prevent one task from dominating or interfering with the rest, our method provides more direct control over the model parameters through knowledge distillation such that representations from all tasks are included in the universal representations. Moreover, unlike task-specific loss functions with different characteristics, which are difficult to balance, the distillation loss function is the same for all tasks and hence provides balanced optimization by design. In addition, in this paper, we show that our method can also generalize to learning multiple diverse visual domains for standard image classification [75] and few-shot classification [98].
Multi-task Self-supervised Learning. Our work is also loosely related to a recent line of self-supervised learning methods that learn representations from unlabelled data through multiple pretext tasks [22], e.g. predicting rotation, or learn them from unlabelled data sampled from multiple domains (datasets) [117,30]. Unlike these works, which focus on pretraining representations from unlabelled data and transferring them to one downstream task at a time (one model for one downstream task), we focus on learning models that can jointly perform multiple tasks.

Multi-domain learning (MDL)
A parallel line of research is learning representations jointly on multiple domains [6,81,76,19]. Unlike [29,99,35,108,73,93] that focus on domain adaptation, this line of work aims at learning a single set of universal representations over multiple tasks and visual domains. Bilen and Vedaldi [6] proposed to learn a compact multi-domain representation for standard image classification in multiple visual domains using domain-specific scaling parameters. This idea was later extended to the use of domain-specific adapters [75,81,76] and to latent domain learning without access to domain annotations, by learning gating functions that select domain-specific adapters for the given images [19]. In this paper, we also learn a single set of universal (multi-domain) representations by sharing most of the computation across domains (e.g. the feature encoder is shared across all domains, followed by multiple domain-specific classifiers). However, unlike [75,81,76,19] that use representations pretrained only on ImageNet [20] as the universal ones and then learn additional domain-specific representations, our method is capable of learning a single set of universal representations from multiple domains, which requires fewer parameters and is also a significantly harder task due to the challenges in multi-loss optimization. In addition, those works do not scale up to multi-task learning in a single domain, as they require running the network with the corresponding task-specific adaptors for each task separately, while ours requires only a single forward computation for all tasks.

Cross-Domain Few-shot learning
Our work is also related to the few-shot classification scenarios that aim at adapting a classifier to previously unseen tasks and domains from few labeled samples. Earlier works [41,103,89,27,69] focus on evaluating their methods in homogeneous learning tasks, e.g. Omniglot [46] and miniImageNet [103], where both the meta-train and meta-test examples are sampled from a single data distribution (or dataset), and perform poorly in the more challenging cross-domain few-shot tasks, where test data is sampled from an unknown or previously unseen domain [98]. We refer to [106,36] for a comprehensive review of early works.
Recent few-shot techniques [23,56,4,79] leverage powerful representations learned over multiple domains and focus on few-shot learning from multiple domains that generalizes to unseen domains at test time in the recently proposed MetaDataset [98]. CNAPS [79] consists of an adaptation network that modulates the parameters of both a feature extractor and a classifier for new categories by encoding the data distribution of the few training samples. Simple CNAPS [4] extends CNAPS by replacing its parametric classifier with a non-parametric classifier based on the Mahalanobis distance and shows that adapting the classifier from few samples is not necessary for good performance. SUR [23] and URT [56] further show that adaptation of the feature extractor can also be replaced by a feature selection mechanism. In particular, both [23,56] learn a separate deep network for each training dataset in an offline stage, employ them to extract multiple features for each image, and then select the optimal set of features based on either a similarity measure [23] or an attention mechanism [56]. Despite their good performance, SUR and URT are computationally expensive and require multiple forward passes through multiple networks at inference time. Our method also uses multi-domain features but in a more efficient way, by learning a single network over multiple domains. Our method requires significantly less network capacity and compute than theirs.

Knowledge distillation
Our work is related to knowledge distillation (KD) methods [34,49,60,74,80,96] that distill the knowledge of an ensemble of large teacher models into a small student neural network at the classifier [34] and intermediate layers [80]. Born-Again Neural Networks [28] propose to consecutively distill knowledge from a teacher network to a student network with an identical architecture, an idea that is further applied to few-shot learning in [97] and multi-task learning in [16].
While our method can also be seen as a distillation method, our goal differs significantly. In contrast to standard KD, which aims to learn a single task/domain network from multiple teachers, our goal is to learn a multi-task/domain network. The difference is subtle. Multi-task/domain learning typically involves solving an unbalanced optimization, and aligning the predictions of the multi-task/domain (student) network with those of the task-specific (teacher) networks does not necessarily alleviate this issue, as this alignment leads to another unbalanced optimization problem due to the varying dimensionality of task outputs and the different loss functions required for matching different tasks' predictions [16], e.g. a KL-divergence loss for classification and an L2-norm loss for regression. While alignment of intermediate representations is studied for KD in [80], such an alignment is substantially harder when the representations vary significantly across different teacher networks. We demonstrate that mapping the student (or universal) representations to the teacher's representation space before the alignment is crucial. Finally, we also show that intermediate representation matching across very diverse domains can indeed be improved by using a loss function that is invariant to linear transformations, inspired by the Centered Kernel Alignment (CKA) similarity [44].

Background
In this section, we review the problem setting for single-task and multi-task learning to provide the required background for universal representation learning. Let $D$ be a training set consisting of $N$ RGB training images and their respective labels. We consider two general label settings, multi-task learning (MTL) and multi-domain learning (MDL). In MTL, we assume that training images are sampled from a single distribution, i.e. dataset, and each training image $x$ is associated with labels for $T$ tasks, $y = \{y^1, y^2, \ldots, y^T\}$. This is a common setting for dense prediction problems where multiple tasks such as semantic segmentation and depth estimation are performed on the same image. In MDL, the training set contains samples from $T$ different domains, where each image is associated only with a single domain and its domain-specific task. The standard MDL benchmarks [75,98] contain images from diverse datasets (e.g. ImageNet, Omniglot, VGG Flowers), each with a classification task over a mutually exclusive set of categories. Hence, an image $x$ associated with domain $t$ is labelled only with the domain-specific task $y^t$. In both MTL and MDL settings, our goal is to learn a function $\hat{y}^t$ for each task $t$ in MTL and for each domain $t$ in MDL that accurately predicts the ground-truth label $y^t$ of previously unseen images.
Note that while in MTL different tasks involve solving different problems such as semantic segmentation and depth estimation, in MDL they involve solving the same problem, e.g. image classification, but over a different set of categories. We do not focus on other scenarios where images from different domains are associated with the same task (e.g. digit recognition from hand-written notes and real street numbers).

Single-task Learning
Single-task learning (STL) involves learning a task-specific function $\hat{y}_s^t$ independently for each task by optimizing a task-specific loss function $\ell^t(\hat{y}_s^t, y^t)$ (e.g. cross-entropy for classification), which measures the mismatch between the ground-truth label and the prediction, as follows:

$$\min_{\phi_s^t, \psi_s^t} \frac{1}{N} \sum_{(x, y) \in D} \ell^t\big(\hat{y}_s^t(x), y^t\big), \tag{1}$$

where each $\hat{y}_s^t$ is composed of i) a feature encoder $f_{\phi_s^t}: \mathbb{R}^{3 \times H \times W} \to \mathbb{R}^{C \times H' \times W'}$, parameterized by $\phi_s^t$, that takes in an $H \times W$ dimensional RGB image and outputs an $H' \times W'$ dimensional feature map with $C$ channels, where $C > 3$, $H' < H$ and $W' < W$; and ii) a decoder $h_{\psi_s^t}: \mathbb{R}^{C \times H' \times W'} \to \mathbb{R}^{O^t \times H^t \times W^t}$ that decodes the extracted feature to predict the label for task $t$, i.e. $\hat{y}_s^t(x) = h_{\psi_s^t} \circ f_{\phi_s^t}(x)$, where $O^t$, $H^t$, $W^t$ are the dimensions of the output space for task $t$ and $\psi_s^t$ are the decoder parameters.

Multi-task Learning
Fig. 2: Illustration of universal representation learning. In the first stage (a), we learn a task-specific deep network for each task. In the second stage (b), our goal is to learn a multi-task network that shares the feature encoder across all tasks and builds multiple task-specific decoders on top of the feature encoder, such that it performs well on these tasks compared to the task-specific models trained in (a). To achieve this, we train the multi-task network by jointly minimizing the task-specific losses and aligning the features of the multi-task network with those of the task-specific networks. For the feature alignment, we introduce a set of task-specific adapters that transform the features of the multi-task network to the task-specific space before the alignment with the task-specific features.

A more efficient design is to share a significant portion of the computations and parameters across the tasks via a common feature encoder $f_\phi: \mathbb{R}^{3 \times H \times W} \to \mathbb{R}^{C \times H' \times W'}$, i.e. a convolutional neural network parameterized by $\phi$ that takes in an image $x$ and produces $C$ feature maps of size $H' \times W'$. The parameters $\phi$ are shared across all the tasks. In this setting, $f_\phi$ is followed by $T$ task-specific decoders $h_{\psi^t}$, each with its own task-specific weights $\psi^t$, that decode the extracted feature to predict the label for task $t$, i.e. $\hat{y}^t(x) = h_{\psi^t} \circ f_\phi(x)$. A common way of learning $\hat{y}^t$ for all tasks is to jointly optimize the shared and task-specific parameters as follows:

$$\min_{\phi, \{\psi^t\}_{t=1}^{T}} \frac{1}{N} \sum_{(x, y) \in D} \sum_{t=1}^{T} \lambda^t \ell^t\big(\hat{y}^t(x), y^t\big), \tag{2}$$

where $\lambda^t$ is a scaling hyperparameter for task $t$ that balances the loss functions among the tasks. However, obtaining good universal representations by solving Eq. (2) is a challenging problem, as it requires leveraging commonalities between the tasks while balancing their loss functions and minimizing interference (negative transfer [12,110]) between them. Hence solving Eq. (2) often leads to lower results than those of task-specific models, where each task is learned independently.
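The imbalance that the weighted sum in Eq. (2) can suffer from is easy to see with numbers. The following is a minimal Python sketch, not the authors' implementation: the function name `weighted_multitask_loss`, the task names and the loss values are all hypothetical and chosen only to illustrate how a loss with a larger magnitude dominates under uniform weights.

```python
from typing import Dict


def weighted_multitask_loss(task_losses: Dict[str, float],
                            weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses, i.e. the inner sum over t in Eq. (2)."""
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[t] * loss for t, loss in task_losses.items())


# Hypothetical per-task loss values with very different magnitudes.
losses = {"segmentation": 2.0, "depth": 0.05}

# Uniform weights: the segmentation term makes up almost all of the objective.
uniform = weighted_multitask_loss(losses, {"segmentation": 1.0, "depth": 1.0})
seg_share = losses["segmentation"] / uniform

# Hand-tuned weights that bring both terms to the same scale.
rebalanced = weighted_multitask_loss(losses, {"segmentation": 1.0, "depth": 40.0})

print(f"uniform={uniform:.2f}, segmentation share={seg_share:.1%}, "
      f"rebalanced={rebalanced:.2f}")
```

With these illustrative values the segmentation term accounts for roughly 98% of the uniformly weighted objective, which is the kind of dominance that loss weighting schemes try to correct by tuning $\lambda^t$.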

Universal Representation Learning
Motivated by these challenges, previous methods mainly focus on dynamically balancing the loss functions through the loss weights ($\lambda^t$) or manipulating the gradients of each task w.r.t. the shared encoder to alleviate the conflicts between them while optimizing Eq. (2). However, as reported in [101], existing solutions fail to improve over carefully tuning these hyperparameters. We hypothesize that modifying hyperparameters and/or gradients provides only limited control over the learned representations, and propose a different view of this problem: a two-stage procedure inspired by knowledge distillation methods [80,34]. Assuming that single-task learning often performs well when sufficient training data is available, we argue that single-task representations are powerful and hence provide good approximations to universal representations.
To this end, we first train task-specific deep networks $\{\hat{y}_s^t\}_{t=1}^{T}$ as described in Sec. 3.1, where each network consists of a task-specific feature encoder $f_{\phi_s^t}$ and decoder $h_{\psi_s^t}$ with parameters $\phi_s^t$ and $\psi_s^t$ respectively. In the second stage, we freeze their weights, i.e. the task-specific feature extractors $f_{\phi_s^t}$ and decoders $h_{\psi_s^t}$, and transfer their knowledge to learn a single set of universal representations by minimizing the distance between the task-specific and universal representations for given training samples (see Fig. 2).
As directly minimizing the distance between a single set of universal representations and multiple single-task representations would yield an unsatisfactory solution, i.e. the average of the single-task representations, we instead first map the universal representations to each task-specific representation space through small task-specific adapters, and then compute the distance in this space. In addition, we consider minimizing the distance between the outputs of the universal and single-task networks as in [34]. With the introduction of these two distillation terms, Eq. (2) can be rewritten as:

$$\min_{\phi, \{\psi^t, \theta^t\}_{t=1}^{T}} \frac{1}{N} \sum_{(x, y) \in D} \sum_{t=1}^{T} \Big[ \lambda^t \ell^t\big(\hat{y}^t(x), y^t\big) + \lambda_f^t \ell_f\big(a_{\theta^t} \circ f_\phi(x), f_{\phi_s^t}(x)\big) + \lambda_p^t \ell_p^t\big(\hat{y}^t(x), \hat{y}_s^t(x)\big) \Big], \tag{3}$$

where $\lambda_f^t$ and $\lambda_p^t$ are task-specific hyperparameters for distilling representations and predictions respectively, $a_{\theta^t}: \mathbb{R}^{C \times H' \times W'} \to \mathbb{R}^{C \times H' \times W'}$ is the adapter for task $t$, parameterized by $\theta^t$, and $\ell_f$ and $\ell_p^t$ are distance functions in the representation and prediction spaces of task $t$ respectively. While a single distance function is used for distilling representations (i.e. $\ell_f$), distilling predictions may require a task-specific distance function (i.e. $\ell_p^t$). We provide these details in Sec. 5. The adapters are trained jointly with the network parameters. Note that we discard the task-specific networks and adapters at test time and use only the universal network to predict the labels of unseen images. Hence, the inference time of our method is fixed and does not depend on the number of tasks/domains.

Hierarchical Distillation with Task Grouping. Solving Eq. (3) requires aligning the universal representations with multiple single-task representations. Although obtaining single-task representations involves only a forward pass of their input, running multiple single-task networks during training can become memory and compute intensive as the number of tasks ($T$) grows. Here we propose a hierarchical strategy that enables solving Eq. (3) in multiple stages by decomposing it into independent optimization problems. To this end, we first divide all tasks into different task groups, which can be done randomly or in accordance with task similarity. Here we only explore random selection, as identifying task similarity is a challenging task by itself and an actively studied problem [111]. For each group $g$, we learn a single network by distilling the single-task representations of the tasks that belong to group $g$ as in Eq. (3). Importantly, the group trainings can be run in parallel to accelerate training. Once the group-specific networks are learned, we employ them as teachers to obtain the final universal representations as in Eq. (3). Note that it is also possible to use a deeper hierarchy with nested groupings. However, we focus on only a single level of grouping to validate the idea. We evaluate this strategy in the Visual Decathlon benchmark [75] (see Sec. 5.5), as it involves learning representations over a large set of domains.
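The random grouping step of the hierarchical strategy can be sketched in a few lines of Python. This is an illustrative sketch only: the helper `random_task_groups`, the round-robin assignment, the fixed seed and the placeholder domain names are assumptions made here, and the text does not specify how group sizes are chosen.

```python
import random
from typing import List, Sequence


def random_task_groups(tasks: Sequence[str], num_groups: int,
                       seed: int = 0) -> List[List[str]]:
    """Randomly partition tasks into num_groups groups of near-equal size."""
    if not 1 <= num_groups <= len(tasks):
        raise ValueError("num_groups must be between 1 and len(tasks)")
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slicing keeps group sizes within one task of each other.
    return [shuffled[g::num_groups] for g in range(num_groups)]


# Placeholder domain names; each would have its own frozen single-domain teacher.
domains = [f"domain_{i}" for i in range(10)]
groups = random_task_groups(domains, num_groups=3)

# Level 1: each group is distilled into one group network (can run in parallel).
# Level 2: the group networks then act as teachers for the universal network.
for g, members in enumerate(groups):
    print(f"group {g}: {members}")
```

Each group only needs the teachers of its own members in memory during the first level, which is where the memory saving described above would come from.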
Next we describe how the universal representations are learned for different scenarios, including multi-task dense prediction, multi-domain classification and cross-domain few-shot classification problems.

Multi-Task Dense Prediction
In the multi-task dense prediction setting in a single domain, each image $x$ is associated with labels for all tasks. The spatial dimensions of the labels for each task are equal to the image size (hence it is called dense or pixelwise prediction) and are the same for all tasks. We consider semantic segmentation, monocular depth estimation and surface normal prediction in our experiments.
In this setting, we only minimize the difference between the intermediate representations of the universal network and the task-specific ones. We do not minimize the difference between the predictions of the universal and single-task networks as done in [16,34,80]; that is, we set $\lambda_p^t$ to zero. In our preliminary experiments, we observed that matching the predictions leads to a significant drop in the final performance. As each task prediction has a different magnitude range and different characteristics, we argue that jointly minimizing these distances along with the other loss terms in Eq. (3) leads to a challenging unbalanced optimization. Let $m$ and $s$ denote the representations obtained from the universal and single-task encoders for a given image $x_n$ and task $t$, respectively. We normalize the two feature maps with the L2 norm, with $m_{chw}$ indicating a hidden unit at position $(c, h, w)$ in $m$, and then measure the distance between the two normalized feature maps by using the Euclidean distance function for $\ell_f$. We investigate different designs for $a_\theta$ and $\ell_f$ and show in Sec. 5.5 that using linear adapters for $a_\theta$ with the Euclidean distance function for $\ell_f$ obtains the best performance.
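To make the feature distillation term concrete, here is a minimal NumPy sketch of a linear adapter followed by L2 normalization and a Euclidean distance. It is a sketch under stated assumptions rather than the paper's exact formulation: the adapter is modeled as a per-pixel linear map (equivalent to a 1x1 convolution), the norm is taken over the whole $C \times H' \times W'$ feature map, and the distance is squared. All function names are hypothetical.

```python
import numpy as np


def linear_adapter(features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Apply a per-pixel linear map (like a 1x1 convolution) to a C x H x W map."""
    # weight has shape (C_out, C_in); contract over the input-channel axis.
    return np.einsum("oc,chw->ohw", weight, features)


def l2_normalize(features: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a feature map to unit L2 norm (assumption: norm over all entries)."""
    return features / (np.linalg.norm(features) + eps)


def feature_distance(universal: np.ndarray, teacher: np.ndarray,
                     weight: np.ndarray) -> float:
    """Squared Euclidean distance between normalized adapted and teacher features."""
    m = l2_normalize(linear_adapter(universal, weight))
    s = l2_normalize(teacher)
    return float(np.sum((m - s) ** 2))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 3, 3))      # frozen single-task features
same_up_to_scale = 5.0 * teacher          # universal features differing only in scale
unrelated = rng.normal(size=(4, 3, 3))    # universal features with no relation

identity = np.eye(4)
print(feature_distance(same_up_to_scale, teacher, identity))  # close to zero
print(feature_distance(unrelated, teacher, identity))         # between 0 and 4
```

The normalization removes differences in feature magnitude between teachers, so the distance depends only on the direction of the two feature maps. In training, `weight` would be learned jointly with the universal encoder and discarded at test time.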

Multi-Domain Classification
We also consider a multi-domain scenario as in [75], where the training set $D$ contains $T$ subdatasets, each sampled from a different domain. Each image is associated with only one domain and hence one task. The associated task is known at both train and test time as in [75]. As in the multi-task dense prediction problem, our goal is to learn a single network with a feature encoder $f_\phi$ shared across the domains, and thus the tasks. Unlike in the multi-task dense prediction problem, both the output of the feature encoder (i.e. $H' = 1$ and $W' = 1$) and the predictions (i.e. $H^t = 1$ and $W^t = 1$) are vectors. In particular, the feature encoder $f_\phi$ is a convolutional neural network followed by a global average pooling layer as in [33].
Following the two-stage procedure, in the first stage we independently train a set of domain-specific deep networks $\{\hat{y}_s^t\}_{t=1}^{T}$ by Eq. (1), where each consists of a domain-specific feature encoder $f_{\phi_s^t}$ and classifier $h_{\psi_s^t}$ with parameters $\phi_s^t$ and $\psi_s^t$ respectively. Unlike the previous setting, which trains multiple task-specific networks on the same training set, this setting involves training each domain network on a different training set from a different domain. In the second stage, we then learn the universal network over training images of multiple domains using Eq. (3). Rather than setting $\lambda_p^t$ in Eq. (3) to zero as in Sec. 4.1, here we use the KL divergence loss for $\ell_p^t$ in Eq. (3) to align the predictions of the single-task and universal networks.
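The prediction alignment term can be illustrated with a small, dependency-free Python sketch of a KL divergence between the class distributions predicted by a domain-specific teacher and by the universal network. The direction of the divergence (teacher relative to student) and the absence of a softmax temperature are assumptions made for this sketch; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum for z in logits]


def kl_divergence(teacher_logits: Sequence[float],
                  student_logits: Sequence[float]) -> float:
    """KL(teacher || student) between the two predicted class distributions."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


teacher = [2.0, 0.5, -1.0]       # logits of the frozen domain-specific network
agreeing = [12.0, 10.5, 9.0]     # same distribution, logits shifted by a constant
disagreeing = [-1.0, 0.5, 2.0]   # puts most mass on a different class

print(kl_divergence(teacher, agreeing))     # approximately 0
print(kl_divergence(teacher, disagreeing))  # clearly positive
```

Because every domain in this setting is a classification problem, the same divergence can be used for all domains, only over a different number of classes.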
Though we use domain-specific adapters to map the universal features to the domain-specific space, learning a single set of representations over substantially diverse domains still remains challenging: it requires modeling complex non-linear relations between them, and hence a more elaborate distance function ($\ell_f$) than the Euclidean one. To this end, we propose to adopt the Centered Kernel Alignment (CKA) [44] similarity index with the Radial Basis Function (RBF) kernel, which was originally proposed as an analysis tool to measure similarities between neural network representations and has been shown to be invariant to various transformations and capable of capturing meaningful non-linear similarities between representations of higher dimension than the number of data points. In contrast to its original purpose as an analysis tool, we use CKA as a loss function to minimize the distance between universal and domain-specific representations.
Next we briefly describe CKA. Given a minibatch of $B$ images from domain $t$, let $M \in \mathbb{R}^{B \times C}$ and $S \in \mathbb{R}^{B \times C}$ denote the features that are computed by the multi-domain network adapted by $a_{\theta^t}$ and by the domain-specific network, respectively. We first compute the RBF kernel matrices $P$ and $T$ of $M$ and $S$, respectively, and then use the two kernel matrices to measure the CKA similarity between $M$ and $S$:
$$\mathrm{CKA}(M, S) = \frac{\mathrm{tr}(PHTH)}{\sqrt{\mathrm{tr}(PHPH)\,\mathrm{tr}(THTH)}}, \qquad (5)$$
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and $H_n = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix (here $n = B$). The loss is then defined as the dissimilarity between the multi-domain and domain-specific features, $\ell^f(M, S) = 1 - \mathrm{CKA}(M, S)$. As the original CKA similarity requires computing the kernel matrices over the whole dataset, which does not scale to large datasets, we follow [68] and compute them over each minibatch during training. We refer to [44,68] for more details.
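As a hedged illustration of Eq. (5), the following plain-Python sketch computes the minibatch CKA loss with an RBF kernel. The bandwidth parameter `sigma` and its default value are assumptions made for the example (the text does not state how the bandwidth is chosen), and a practical implementation would use a tensor library with automatic differentiation.

```python
import math

def rbf_kernel(X, sigma):
    """RBF kernel matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return K

def center(K):
    """Double centering: returns H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]

def frobenius_inner(A, B):
    """sum_ij A_ij * B_ij, which equals tr(A B) for symmetric matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def cka(M, S, sigma=1.0):
    """CKA similarity between two feature batches (lists of B vectors)."""
    P = center(rbf_kernel(M, sigma))
    T = center(rbf_kernel(S, sigma))
    # tr(PHTH) = <HPH, HTH> because H is symmetric and idempotent.
    return frobenius_inner(P, T) / math.sqrt(
        frobenius_inner(P, P) * frobenius_inner(T, T))

def cka_loss(M, S, sigma=1.0):
    """Dissimilarity 1 - CKA(M, S), used as the feature distillation loss."""
    return 1.0 - cka(M, S, sigma)
```

Because the RBF kernel depends only on pairwise distances, the loss is unchanged when one batch of features is rotated, which is one of the invariances mentioned above. The sketch does not guard against a degenerate batch in which all features coincide.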

Cross-Domain Few-shot Classification
We also apply our method to the cross-domain few-shot classification problem, which aims at learning to classify samples from an unknown visual domain given a small training set with only a few images per class. As in the previous setting (Sec. 4.2), we are given a large training set $D$ that consists of images from $T$ subdatasets, each focusing on a classification task over a set of mutually exclusive categories. Unlike the previous case, the goal is not to accurately classify an unseen image that belongs to one of the previously seen categories and domains, but one that belongs to a previously unseen category from either a previously seen or an unseen domain, which is substantially more challenging. In particular, for the unseen task, we are given a support set $S$ that contains a few image and label pairs, and a query set $Q$ that contains the samples to be classified. In other words, we would like to learn a classifier on the support set that can accurately predict the labels of the query set.
As in [23,56], we solve this problem in two stages, meta-training and meta-test. In meta-training, we learn our universal representation network on $D$ as in Sec. 4.2 by following the two-step procedure, where we also use the CKA loss function (i.e. Eq. (5)) for $\ell^f$ and the KL divergence loss for $\ell^p$ in Eq. (3). In meta-test, we transfer the learned universal network, further adapt its representations, and learn a classifier for the new task on $S$. In contrast to [23,56], which learn multiple domain-specific networks in an offline stage, then employ them to extract multiple features for each image and select the most relevant ones at test time, we learn a single universal network that performs well in all $T$ domains by distilling the knowledge of the domain-specific networks at training time, and use only the universal network at test time. This has two key advantages over [23,56]. First, using a single feature extractor, which has the same capacity as each domain-specific one, is significantly more efficient in terms of run-time and number of parameters in the meta-test stage. Second, learning to find the most relevant features for a given support and query set as in [56] is not trivial and may also suffer from overfitting to the small number of datasets in the training set, whereas the multi-domain representations, by definition, automatically contain the required information from the relevant domains.
In particular, after learning the universal representations, we freeze the universal network and use it to extract features. We then learn a task-specific linear mapping $m_\varphi : \mathbb{R}^C \to \mathbb{R}^C$, parameterized by $\varphi$, on the support set $S$ by minimizing the cross-entropy loss $\ell_{ce}$ of the predictions of the non-parametric nearest centroid classifier (NCC) [64,89]. NCC obtains each class centroid by averaging the mapped features of the support samples that belong to that class, and measures the class probability of each sample by applying a softmax over its negative cosine distances to the class centroids.
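The NCC step can be sketched in plain Python as follows. The sketch takes already mapped features as input (the learned linear mapping is omitted), and the logit scale `scale` is a hypothetical parameter introduced for the example, not a value given in the text.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

def class_centroids(features, labels):
    """Average the support features of each class to get its centroid."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + a for s, a in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def ncc_probabilities(query, centroids, scale=10.0):
    """Softmax over negative cosine distances to the class centroids.

    The negative cosine distance is (cosine similarity - 1); the constant
    shift does not change the softmax, so similarities are used directly.
    `scale` is an assumed temperature, not taken from the paper.
    """
    classes = sorted(centroids)
    logits = [scale * cosine_similarity(query, centroids[c]) for c in classes]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {c: e / total for c, e in zip(classes, exps)}
```

In the setting described above, the cross-entropy between these probabilities and the support labels would be the quantity minimized with respect to the linear mapping.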
Discussion. In contrast to the previous setting, where the universal representations are learned over multiple domains from sufficiently large data, cross-domain few-shot learning requires acquiring domain-specific knowledge from only a few samples in an unseen domain, which is extremely challenging. Our hypothesis is that, under the base assumption that there is a bounded number of representations for vision problems, transferring universal representations should yield more effective learning of new domains.

Experiments
In this section, we analyze and evaluate our method on three problems: i) learning multiple dense prediction tasks on two popular benchmarks, NYU-v2 [88] and Cityscapes [17], in Sec. 5.1, ii) learning multiple diverse visual domains on the Visual Domain Decathlon [75] in Sec. 5.2, and iii) few-shot classification on MetaDataset [98] in Sec. 5.3. Finally, we conduct an extensive analysis over various design choices in Sec. 5.5.$^3$

Learning multiple dense prediction tasks
Here, we evaluate our method on learning universal representations for performing multiple dense prediction tasks on two standard multi-task learning benchmarks, NYU-v2 [88] and Cityscapes [17], as in [58,55].
Datasets and experimental setting. We follow the training and evaluation settings in [58,55] for both single-task and multi-task learning on both datasets. More specifically, NYU-v2 [88] contains RGB-D indoor scene images, on which we evaluate performance on three tasks: 13-class semantic segmentation, depth estimation, and surface normal estimation. We use the true depth data recorded by the Microsoft Kinect and the surface normals provided in [24] for depth and surface normal estimation, as in [58]. All images are resized to 288 × 384 resolution as in [58]. We follow the default setting in [88,58], where 795 and 654 images are used for training and testing, respectively. Cityscapes [17] consists of street-view images, which are labeled for two tasks: 7-class semantic segmentation$^4$ and depth estimation. We resize the images to 128 × 256 to speed up training, as in [58].
In both NYU-v2 and Cityscapes, we follow the training and evaluation protocol in [58]. We apply our method and all the baseline methods to two common families of multi-task architectures, encoder-based and decoder-based ones. Encoder-based methods share information only in the encoder before decoding each task with an independent task-specific decoder, while decoder-based approaches also exchange information during the decoding stage [101]. For the encoder-based architectures, we use SegNet [2] as the backbone. As in [58], we use the cross-entropy loss for semantic segmentation, the l1-norm loss for depth estimation in Cityscapes, and the cosine similarity loss for surface normal estimation in NYU-v2. We use exactly the same hyper-parameters as in [58], including the learning rate and optimizer, and the same evaluation metrics: mean intersection over union (mIoU), absolute error (aErr) and mean error (mErr) in the predicted angles for semantic segmentation, depth estimation and surface normal estimation, respectively.
For the decoder-based architectures, we build our method on PAD-Net [107] and MTI-Net [102], which use a multi-scale feature extractor based on HRNet-18 [92], initialized with ImageNet-pretrained weights, as the feature encoder. We use the same loss functions, evaluation metrics, and training and evaluation protocol as for the SegNet backbone. For our method, we use uniform loss weights (i.e. $\lambda^t = 1$ for all tasks) for the task-specific losses, unless stated otherwise. As we do not minimize the difference between the predictions of the universal and single-task networks, we set $\lambda^t_p$ in Eq. (3) to zero. We first split the training set into training and validation sets to search $\lambda^t_f \in \{1, 2\}$ by cross-validation, and then train our network on the whole training set. We set $\lambda^t_f$ to 1 for semantic segmentation and depth estimation and to 2 for surface normal estimation. Please refer to the supplementary (Sec. A.1) for more details.
Multi-task performance. In addition to the above-mentioned evaluation metric for each task, following prior work [101,51], we also report the multi-task performance $\Delta_{\mathrm{MTL}}$, which measures the average per-task drop in performance w.r.t. the single-task baseline:
$$\Delta_{\mathrm{MTL}} = \frac{1}{T} \sum_{t=1}^{T} (-1)^{\ell_t} \, \frac{P^t - P^t_s}{P^t_s},$$
where $\ell_t = 1$ if a lower value of $P^t$ means better performance for the metric of task $t$, and 0 otherwise. $P^t$ and $P^t_s$ are the performance (e.g. mIoU for semantic segmentation) of the universal (multi-task) network and the single-task network, respectively.
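The multi-task performance metric described above can be computed with a few lines of Python. This is a minimal sketch: the task names and numbers in the usage note are made up for illustration, and reporting the value as a percentage is an assumption based on how the results are quoted in the text.

```python
def delta_mtl(multi_task, single_task, lower_is_better):
    """Average relative per-task change w.r.t. the single-task baseline.

    multi_task, single_task: dicts mapping task name -> metric value.
    lower_is_better: dict mapping task name -> bool (True for error metrics).
    Returns the value in percent (assumed reporting convention).
    """
    terms = []
    for task, p in multi_task.items():
        p_s = single_task[task]
        sign = -1.0 if lower_is_better[task] else 1.0
        terms.append(sign * (p - p_s) / p_s)
    return 100.0 * sum(terms) / len(terms)
```

For instance, with hypothetical values `{"seg": 42.0, "depth": 0.45, "normal": 26.0}` for the multi-task network, `{"seg": 40.0, "depth": 0.50, "normal": 25.0}` for the single-task networks, and only segmentation being a higher-is-better metric, the per-task terms are +5%, +10% and -4%, so the function returns their mean.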

Encoder-based Architecture
Compared methods. We consider two encoder-based architectures: the vanilla MTL network using SegNet [2], which shares the whole feature encoder across all tasks and has task-specific decoders, and MTAN [58], which extends the vanilla MTL baseline by sharing the SegNet across tasks and using task-specific attention modules in each layer to extract task-specific features. We compare our method to the single-task learning (STL) baseline, i.e. training an individual network per task, the vanilla multi-task learning network with uniform loss weights (MTL), and balanced optimization strategies, including Uncertainty [40], GradNorm [12], MGDA [87], DWA [58], PCGrad [110], GradDrop [13], IMTL [57] and CAGrad [55]. We also consider BAM [16], which was originally designed for natural language processing, and adapt it to visual dense prediction tasks by aligning dense predictions. Importantly, this model performs knowledge distillation on predictions when learning the multi-task network, and hence comparing with it sheds light on the importance of matching intermediate representations in the task-specific spaces.
Here we use the KL-divergence loss, l1-norm loss and cosine similarity loss as the knowledge distillation losses on predictions for semantic segmentation, depth estimation and surface normal estimation, respectively. We reproduce all methods in the same setting for a fair comparison, and the results of the compared methods are similar to or better than the ones reported in the corresponding papers.
Results on NYU-v2. Table 1 reports the results of our method and the compared approaches on NYU-v2. We see that the vanilla MTL (Uniform) using SegNet achieves better performance than STL in depth estimation, but its performance drops in surface normal estimation. This indicates that joint optimization of multiple tasks with uniform loss weights leads to unbalanced results and an overall worse performance than the STL models in the ∆MTL metric. Only a few balanced optimization algorithms improve the MTL performance; among them, IMTL-H and CAGrad obtain the best results when applied to SegNet and MTAN. IMTL-H improves by balancing the pace at which tasks are learned, looking at the projection of the average gradient w.r.t. the shared parameters onto the individual tasks, while CAGrad achieves its improvement by modifying the parameter update such that the update not only minimizes the average of the task-specific losses but also decreases each task-specific loss. Although BAM optimizes the multi-task network by knowledge distillation, it performs worse than the Uniform baseline, as it aligns the predictions, which requires different loss functions (e.g. cross-entropy for segmentation) and thus solving another unbalanced optimization problem. This shows that simply distilling the predictions of multiple single-task networks leads to poor performance. Finally, our method significantly outperforms these methods with either the SegNet or the MTAN backbone, with a +7.84% ∆MTL improvement over IMTL-H, the best baseline. The results suggest that distilling features from multiple single-task networks provides a more effective way of learning shared representations.
Results on Cityscapes. We also evaluate all methods on Cityscapes and report the results in Tab. 2. Similar to the results on NYU-v2, BAM obtains worse results than the STL methods (e.g. -0.58% ∆MTL when using the vanilla MTL method with SegNet). Among the loss balancing methods, Uncertainty and IMTL-H obtain the best MTL performance with both backbones (i.e. SegNet and MTAN), but their performance is lower than that of the STL models in both tasks. This shows the difficulty of optimizing the MTL network in a balanced way in this problem. Our method obtains significant gains on both tasks over all the compared methods and also achieves better results than STL. The results again demonstrate that our method is able to optimize the MTL model in a more balanced way and to achieve better overall results. In addition, our method has far fewer parameters (one network) than the STL models (two networks) on Cityscapes.
Incorporating loss balancing to ours. While our method achieves consistent improvements over all the target tasks, solving Eq. (3) also involves minimizing a weighted sum of multiple loss terms. Hence we investigate whether our method can also benefit from dynamically setting the weights of the individual loss terms on NYU-v2. In particular, we use SegNet as the backbone and dynamically update the weights of the task-specific losses (i.e. $\lambda^t$) while keeping the weight of the distillation loss ($\lambda_f$) fixed. We evaluate our method with each of the three best performing loss balancing methods, i.e. Uncertainty, IMTL-H and CAGrad, and report the results in Tab. 3. We see that our method is complementary to these loss balancing methods and significantly improves their performance (e.g. Ours (Uncertainty) obtains an improvement of about +12 in ∆MTL over Uncertainty). We also see that combining our method with loss balancing methods obtains better performance than using our method with the Uniform MTL baseline.
Table 3: Testing results on NYU-v2. We evaluate the single-task learning (STL) method and multi-task learning (MTL) methods on NYU-v2. Mean intersection over union (mIoU) for semantic segmentation, absolute error (aErr) for depth estimation, mean error (mErr) for surface normal estimation and multi-task performance (∆MTL) are reported.

Decoder-based Architectures
We also apply our method to the decoder-based methods PAD-Net [107] and MTI-Net [102], which are specifically designed for MTL, exchange information during the decoding stage, and achieve state-of-the-art performance in MTL [101]. In addition to these results, we include the results of the vanilla MTL method using the same backbone (HRNet-18 [92]) as PAD-Net and MTI-Net as a baseline, and we report all results on NYU-v2 and Cityscapes in Tab. 4 [...] worse in both tasks than STL baselines. When applied to the vanilla MTL method with the HRNet-18 backbone, our method improves over the vanilla MTL method and achieves a balanced MTL performance (i.e. results better than or comparable to STL) on both datasets. We see in Tab. 4 and Tab. 5 that the decoder-based methods (PAD-Net and MTI-Net) obtain better performance than the vanilla MTL method by first employing a multi-task network to make initial task predictions, and then leveraging features from these initial predictions to improve each task output (MTI-Net obtains +1.72 MTL performance on NYU-v2). Here, PAD-Net improves over the vanilla MTL by aggregating information from the initial predictions of the other tasks through spatial attention to estimate the final task output, while MTI-Net extends PAD-Net to a multi-scale procedure by making initial task predictions and distilling information at each individual feature scale. However, they still suffer from the unbalanced optimization problem (e.g. MTI-Net obtains worse performance in surface normal estimation on NYU-v2 and in semantic segmentation on Cityscapes). Building our method on these decoder-based methods boosts their performance (Ours (MTI-Net) vs MTI-Net: +4.11% vs +1.72%). These results indicate that our method can be used with various architectures, enables a more balanced performance over multiple tasks, and boosts their overall performance.
Implementation details. Following [75,76], we use the official train/val/test splits and evaluation protocol, and use ResNet-26 [33] as the backbone for both the domain-specific networks and the universal network. In our universal network, the backbone (i.e. ResNet-26) is shared across all domains and followed by domain-specific linear classifiers. We use the same data augmentation (random crop, flipping) and SGD as the optimizer, and train the domain-specific networks and our universal network for 120 epochs as in [75,76]. Here we set the task loss weights to 1 (i.e. $\lambda^t = 1$ for all tasks) and perform cross-validation to search the loss weights ($\lambda^t_f$, $\lambda^t_p$) in {0.1, 1, 10} for knowledge distillation on features and predictions; we set $\lambda_f$ and $\lambda_p$ to 10 for ImageNet, 0.1 for DPed, and 1 for the other datasets. Please refer to the supplementary (Sec. A.2) for more details.
Results. On Visual Decathlon, we compare our method to Feature, i.e. a feature extractor trained on ImageNet with classifiers learned on top of it for the other domains, and to single-domain models that are either learned from scratch (Scratch) or fine-tuned from the ImageNet-pretrained feature extractor (Finetune). We also compare our method with existing approaches, including Serial Residual Adapters (RA) [75], Parallel RA and Parallel RA SVD [76], DAN [81] and Piggyback [62]. Their results are taken from the corresponding papers.
We report the results on the test split of each domain, obtained from the official online evaluation [76], in Tab. 6, including the test accuracy on the individual datasets, the average accuracy over the 10 datasets (avg), the decathlon evaluation score (S) [76], and the number of parameters (#params) relative to a single-task network. We also consider the multi-domain performance (i.e. MDL) as described in Eq. (7). First, while using ImageNet features requires only 1× parameters, they do not generalize well to other datasets when a large domain gap is present (e.g. SVHN). In contrast, single-task models obtained by either learning from scratch or fine-tuning achieve significantly better performance (e.g. Finetune obtains 76.51 average accuracy and a score of 2500) at the expense of 10 times more parameters.
We use Finetune as the baseline as in [75,76] and compute the MDL metric for the existing methods and ours. Serial RA, which learns a set of domain-specific residual adapters for each task on top of an ImageNet-pretrained feature extractor, greatly reduces the number of parameters to 2×, while obtaining slightly worse performance than Finetune (73.88 vs 76.51 average accuracy). The performance is further improved by DAN, which constrains newly learned filters to be linear combinations of existing ones when adapting a pretrained model to other domains (77.01 average accuracy, +0.60 MDL performance, and 2.17× parameters). Piggyback learns binary domain-specific masks to select effective filters when adapting a pretrained model to each domain (76.60 average accuracy, +0.00 MDL performance) and further reduces the number of parameters to 1.28×. Connecting the RAs in parallel to the backbone (Parallel RA) boosts the performance of the serial configuration while keeping the same computational cost (78.07 average accuracy and +2.20 MDL performance). The authors show that the performance can be further improved by decomposing the residual adapters into low-rank adapters through SVD (Parallel RA SVD: 78.36 average accuracy, +2.70 MDL performance and 1.5× parameters).
Finally, we show that our method (Ours) successfully learns a single feature extractor shared across all domains with only 1× parameters, the same number of parameters as a single-domain network. Our model obtains better results than the Finetune baseline and the existing methods in most domains (Ours obtains 79.25 average accuracy and +4.00 MDL performance). This clearly shows that learning representations from all domains jointly produces more gen-

Cross-domain Few-shot Learning
Here, we evaluate our method on few-shot learning on the recent MetaDataset [98].
We follow the standard procedure in [98] and use the first eight datasets for meta-training, in which each dataset is further divided into train, validation and test set with disjoint classes.While the evaluation within these datasets is used to measure the generalization ability in the seen domains, the remaining five datasets are reserved as unseen domains in meta-test for measuring the cross-domain generalization ability.

Implementation details.
In all experiments, we build our method on a ResNet-18 [33] backbone for both the single-domain and multi-domain networks. In the multi-domain network, we share all layers except the last classifier across the domains. For training the single-domain models, we strictly follow the training protocol in [23] and use an SGD optimizer with momentum and a cosine annealing learning rate scheduler with the same hyperparameters. For our multi-domain network, we use the same optimizer and scheduler and train it for 240,000 iterations. We set $\lambda_f$ and $\lambda_p$ in Eq. (3) to 4 for ImageNet and 1 for the other datasets, and use early stopping based on cross-validation over the validation sets of the 8 training datasets. We refer to the supplementary (Sec. A.3) for more details.
Baselines and compared methods. First, we compare our method to our own baselines: i) the best single-domain model (Best SDL), where we use each single-domain network as the feature extractor, test it for few-shot classification on each dataset, and pick the best performing model (see supplementary Sec. B.2.1 for the complete results); this involves evaluating 8 single-domain networks on 13 datasets and serves as a very competitive baseline; ii) the vanilla multi-domain learning baseline (MDL), which is learned by optimizing Eq. (2) without the proposed distillation method. As additional baselines, we include the best performing method in [98], i.e. Proto-MAML [98], as well as the state-of-the-art methods BOHB-E [85], CNAPS [79], SUR [23], URT [56], and Simple CNAPS [4]$^5$. For evaluation, we follow the standard protocol in [98], randomly sample 600 tasks for each dataset, and report the average accuracy and 95% confidence interval in all experiments. We reproduce the results of SUR [23], URT [56], and Simple CNAPS [4] by training and evaluating them using their code for a fair comparison, as recommended by MetaDataset.
Table 8: Global retrieval performance on MetaDataset. Here we evaluate our method in a non-episodic retrieval task to further compare the generalization ability of our universal representations.
Results on MetaDataset. As described in [98], we sample each task with a varying number of ways and shots and report the results in Tab. 7. Our method outperforms the state-of-the-art methods in seven out of eight seen datasets and four out of five unseen datasets. We also compute the average rank as recommended in [98]: our method ranks 1.3 on average, while the state-of-the-art SUR and URT rank 5.0 and 4.4, respectively. In detail, we obtain significantly better results than the second best approach on Aircraft (+2.8), Birds (+2.1), Texture (+4.2), and VGG Flower (+1.5) for the seen domains, and on Traffic Sign (+6.1)$^6$ and MSCOCO (+3.8). The results show that jointly learning a single set of representations provides better generalization ability than fusing the ones from multiple single-domain feature extractors, as done in SUR and URT. Notably, our method requires fewer parameters and less computation at inference than SUR and URT, as it runs only one universal network to extract features, while both SUR and URT need to pass the query set through multiple single-domain networks. We also see that our method outperforms two strong baselines, Best SDL and MDL, in all datasets except QuickDraw. This indicates that i) universal representations are superior to single-domain ones when generalizing to new tasks in both seen and unseen domains, while requiring significantly fewer parameters (1 vs 8 neural networks), and ii) our distillation strategy is essential for obtaining good multi-domain representations. While MDL outperforms the Best SDL in certain domains by transferring representations across them, its performance is lower than SDL in other domains, possibly due to negative transfer across the significantly diverse domains. Surprisingly, MDL achieves the third best average rank, indicating the benefit of multi-domain representations.
Global retrieval. Here we go beyond the few-shot classification experiments and evaluate the generalization ability of the representations learned by our multi-domain network in a retrieval task, inspired by the metric learning literature [71,109]. To this end, for each test image, we find the nearest images in the entire test set in the feature space and test whether they belong to the same category. As the evaluation metric, we use Recall@k, which counts a prediction as positive if one of the k closest neighbors has the same label. In Tab. 8, we compare our method with Simple CNAPS in Recall@1 and Recall@2 (see supplementary Sec. B.2.7 for more results). As URT and SUR require adaptation using a support set, and no such adaptation is possible in the retrieval task, we replace them with two baselines that concatenate or sum the features from multiple domain-specific networks. Our method achieves the best performance in ten out of thirteen domains, with significant gains in Aircraft, Birds, Textures and Fungi. This strongly suggests that our multi-domain representations are the key to the success of our method in the previous few-shot classification tasks. We also provide additional experiments in the supplementary (Sec. B.2).
$^6$ The accuracy of all methods on Traffic Sign differs from the one in the original papers, as a bug has been fixed in the MetaDataset repository. See https://github.com/google-research/MetaDataset/issues/54 for more details. As mentioned in the MetaDataset repository, we further update the evaluation protocol and report the updated results of all methods in the supplementary (Sec. B.2.4).
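The Recall@k protocol described above can be sketched in plain Python. The sketch assumes squared Euclidean distance in feature space and excludes the query itself from its neighbors; the distance actually used in the experiments is not stated in the text.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recall_at_k(features, labels, k):
    """Fraction of samples with at least one same-label sample among
    their k nearest neighbors (the sample itself is excluded)."""
    n = len(features)
    hits = 0
    for i in range(n):
        ranked = sorted(
            (squared_distance(features[i], features[j]), j)
            for j in range(n) if j != i)
        neighbours = [j for _, j in ranked[:k]]
        if any(labels[j] == labels[i] for j in neighbours):
            hits += 1
    return hits / n
```

This brute-force version costs quadratic time in the number of test images; it is meant to make the metric explicit rather than to be efficient.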

Hierarchical Distillation with Task Grouping
As introduced in Sec. 4, here we first randomly group the tasks and learn a single network per group, and then distill their knowledge to learn universal representations. As with the per-task networks, the per-group networks can be trained in parallel. This strategy allows our method to scale better to scenarios with a large number of tasks and domains, as it requires only a few group-specific networks at the final stage. Note that this strategy does not necessarily reduce the total training time, but it reduces the number of feature extractors used for training the universal representations. We evaluate this strategy on Visual Decathlon under three different random groupings and report the performance of the resulting universal representations in Tab. 9. As ImageNet is a large and diverse dataset, we treat it as a single group and randomly assign the remaining datasets to three other groups, so that in each grouping the 10 domains/tasks are divided into 4 groups$^7$. The results show that training the universal network with task grouping obtains comparable, even slightly better, results than learning it without task grouping. In addition, the different groupings achieve similar average performance, with the first grouping obtaining the best result.
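The grouping step can be sketched as follows. The balanced round-robin assignment, the seed handling and the short dataset names are assumptions made for illustration; the text only states that ImageNet forms its own group and that the remaining datasets are randomly assigned to three other groups.

```python
import random

# Short names for the ten Visual Decathlon domains (illustrative labels).
DECATHLON = ["ImageNet", "Aircraft", "CIFAR-100", "DPed", "DTD",
             "GTSRB", "Flowers", "Omniglot", "SVHN", "UCF101"]

def random_grouping(domains, solo="ImageNet", num_other_groups=3, seed=0):
    """Keep `solo` as its own group and randomly split the other domains.

    Assumption: the remaining domains are shuffled and dealt round-robin,
    so the group sizes differ by at most one.
    """
    rng = random.Random(seed)
    rest = [d for d in domains if d != solo]
    rng.shuffle(rest)
    groups = [[solo]]
    groups += [rest[i::num_other_groups] for i in range(num_other_groups)]
    return groups
```

Calling `random_grouping(DECATHLON, seed=s)` with three different seeds gives three candidate groupings, each with four groups, for which one network per group would then be trained before distillation.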

Further Analysis
In this section, we provide an extensive analysis over various adapter types, loss functions for knowledge distillation for multi-task learning and multi-domain learning.
Effect of adapters. As explained in Sec. 4, we employ adapters to align the universal representations with each task-specific representation (see $a_{\theta^t}$ in Eq. (3)). Here, we evaluate our method without any adapters (i.e. directly matching the universal representations with the task-specific ones) and with two adapter parameterizations: linear adapters (each adapter is a linear 1 × 1 convolutional layer; this is the default setting in Sec. 5.1, Sec. 5.2 and Sec. 5.3) and non-linear adapters (each adapter consists of two linear convolutional layers with a ReLU layer between them). We report the results on NYU-v2 in Tab. 10. Although directly aligning the features without adapters improves the performance on all tasks over the vanilla MTL baseline (Uniform), it still performs significantly worse than using either linear or non-linear adapters. This verifies that the adapters help to align the features of the multi-task network with those of the different single-task networks. We also observe that linear and non-linear adapters obtain comparable results, so linear adapters are sufficient. We hypothesize that there is a trade-off between the complexity of the adapters and the informativeness of the aligned features. For instance, deep multi-layer adapters would overfit to the data and align the pairs very accurately, hence leading to inferior representation transfer. Thus we argue that linear adapters provide a good complexity/performance trade-off.
Loss functions for knowledge distillation. Here we evaluate various loss functions for distilling intermediate representations (i.e. $\ell^f(\cdot)$ in Eq. (3)), including standard ones such as the L2 and cosine distances, Attention Transfer (AT) [43], which aligns the spatial attention maps computed by averaging the feature maps along the channel dimension, and CKA.
Here, we use linear adapters for aligning the features. We first evaluate these loss functions on the multi-task dense prediction problem on NYU-v2 and report the results in Tab. 11, where our default loss function is L2. We apply knowledge distillation with the different loss functions to the vanilla MTL method with SegNet [2] as the backbone (the CKA loss function is not included, as its memory cost is too high when operating on feature maps). From the results, we can see that AT obtains the worst performance among these loss functions, as it aligns the averaged features, in which some information is lost, but it still outperforms the vanilla MTL with uniform loss weights. While the cosine loss function performs better than AT, the L2 loss function obtains the best results.
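To make the difference between these feature-matching losses concrete, the following plain-Python sketch implements them on small feature maps stored as nested lists of shape C × H × W. The squared form of the L2 loss and the unnormalized channel average used for the AT attention map are assumptions that follow the description in the text rather than the exact formulation of [43].

```python
import math

def flatten(fmap):
    """Flatten a C x H x W feature map (nested lists) into one vector."""
    return [x for channel in fmap for row in channel for x in row]

def l2_loss(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_distance(u, v, eps=1e-12):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv + eps)

def attention_map(fmap):
    """Spatial attention map: mean of the feature map over channels."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[sum(fmap[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]

def attention_transfer_loss(fmap_a, fmap_b):
    """Squared distance between the two channel-averaged attention maps."""
    a = [x for row in attention_map(fmap_a) for x in row]
    b = [x for row in attention_map(fmap_b) for x in row]
    return l2_loss(a, b)
```

Two feature maps whose channels are swapped have identical attention maps, so the AT loss between them is zero while the L2 loss on the full features is not, which illustrates the information lost by channel averaging.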
Compared to learning multiple dense prediction tasks in a single domain, learning universal representations from multiple visually diverse domains is a more challenging problem. Hence we use CKA as the loss function for representation distillation, i.e. $\ell^f$. Here we evaluate the effect of CKA on MetaDataset, compare it to different distillation loss functions, and report their performance in Tab. 12. In this study, we set $\lambda_p$ to zero and do not match the predictions of the universal network with those of the domain-specific ones. Among these loss functions, the best results are obtained with the CKA loss in all domains. Although the universal representations are first mapped to the domain-specific spaces via adapters, the L2 and cosine loss functions are not sufficient to match features from very diverse domains, and further aligning the features with CKA is significantly beneficial.
We then evaluate the individual contributions of distillation through representations and through predictions, using CKA and KL divergence respectively, in Tab. 13. Compared to only applying the KL loss on predictions ('Ours w/o $\ell^f$'), only aligning representations with the CKA loss ('Ours w/o $\ell^p$') performs better in most domains. Finally, combining $\ell^f$ (CKA) with $\ell^p$ (KL divergence), i.e. 'Ours ($\ell^f$ + $\ell^p$)', outperforms the multi-domain models that are trained with only one of the two loss functions.

Qualitative results
Here, we qualitatively compare our method to STL, MTL with uniform loss weights (Uniform) and the best compared method, i.e. IMTL-H [57], on the multi-task dense prediction problem on NYU-v2 with the SegNet backbone (see Fig. 3, and supplementary Sec. B.1 for more examples). We can see that the Uniform baseline improves over STL on segmentation and depth estimation, while it performs worse in surface normal estimation. Although dynamically balancing the loss values with IMTL-H improves the overall performance, it still performs worse in surface normal estimation. Finally, by distilling representations from the single-task models into the universal network, our method produces results that are better than or comparable to STL: it produces similar outputs to STL for surface normals and more accurate predictions for segmentation and depth estimation, as it enables a balanced optimization of the universal network in which one task can benefit from another. This indicates the effectiveness of our method in learning shared representations for multiple dense prediction tasks.
We also qualitatively analyze our method and compare it to the state-of-the-art URT [56] in Fig. 4 for cross-domain few-shot learning on MetaDataset. In particular, we illustrate the nearest neighbors in two different datasets given a query image (see supplementary Sec. B.2.6 for more examples). While URT retrieves images with more similar colors, shapes and backgrounds, our method retrieves semantically similar images and finds more correct neighbors than URT. This again suggests that our method is able to learn more semantically meaningful and general representations.

Conclusion
We showed that learning general features from multiple tasks and domains is an important step towards better generalization in various computer vision problems, including multi-task dense prediction, multi-domain image classification and cross-domain few-shot learning. By distilling representations from multiple task-specific or domain-specific networks, we can successfully learn a single set of universal representations after aligning them via small task/domain-specific adapters. These representations are compact and generalize better to unseen samples, tasks and domains in multiple benchmarks.
So far, we focused on multi-task or multi-domain learning. We would like to extend the proposed URL method to problems that involve multiple tasks and multiple domains at the same time, such as semantic segmentation and depth estimation across different cities, times of day (e.g. day or night), weather conditions (e.g. clear, rainy, snowy), or other domains.

A.1 Multi-task Dense Prediction
We evaluate our method on learning universal representations for performing multiple dense prediction tasks on two standard multi-task learning benchmarks, NYU-v2 [88] and Cityscapes [17], as in [58,55]. Here, we provide more details about our implementation.
We follow the training and evaluation settings in [58,55] for both single-task and multi-task learning on both datasets. More specifically, NYU-v2 [88] contains RGB-D indoor scene images, on which we evaluate 3 tasks: 13-class semantic segmentation, depth estimation, and surface normal estimation. We use the true depth data recorded by the Microsoft Kinect and the surface normals provided in [24] for depth estimation and surface normal estimation, as in [58]. All images are resized to 288 × 384 resolution as in [58]. We follow the default setting in [88,58], where 795 and 654 images are used for training and testing, respectively. Cityscapes [17] consists of street-view images, which are labeled for two tasks: 7-class semantic segmentation (see footnote 8) and depth estimation. We resize the images to 128 × 256 to speed up training, as in [58].
In both NYU-v2 and Cityscapes, we follow the training and evaluation protocol in [58]. We consider two backbone cases. For the encoder-based case, we use SegNet [2] as the backbone. As in [58], we use cross-entropy loss for semantic segmentation, l1-norm loss for depth estimation in Cityscapes, and cosine similarity loss for surface normal estimation in NYU-v2. We use exactly the same hyper-parameters as in [58], including learning rate and optimizer, and the same evaluation metrics: mean intersection over union (mIoU), absolute error (aErr) and mean error (mErr) in the predicted angles, to evaluate the semantic segmentation, depth estimation and surface normal estimation tasks, respectively. More specifically, we use Adam as the optimizer and train the model for 200 epochs as in [58], with a learning rate of 0.0001 that is halved at the 100-th epoch. The batch size is set to 2 and 8 for NYU-v2 and Cityscapes, respectively. We use the same augmentations as in [58], such as random cropping and flipping.
For the decoder-based methods, as MTI-Net [102] requires a multi-scale feature extractor (encoder), we follow [102,101] and use the HRNet-18 backbone [92], initialized with ImageNet pretrained weights, as the feature encoder. As in [102,101], the batch size is set to 8 for both NYU-v2 and Cityscapes. We use the same loss functions, evaluation metrics, and training and evaluation protocol as for the encoder-based methods.
Footnote 8: The original version of Cityscapes provides labels for 7-class and 19-class semantic segmentation. We follow the 7-class semantic segmentation evaluation protocol as in [58] to be able to compare to the related works.
In our method, we use uniform loss weights (i.e. λ^t = 1 for all tasks) for the task-specific losses, unless stated otherwise. As we do not minimize the difference between the predictions of the universal and single-task networks, we set λ^t_p in Eq. (3) to zero. We first split the training set into a training and a validation set to search λ^t_f ∈ {1, 2} by cross-validation, and then train our network on the whole training set. We set λ^t_f to 1 for semantic segmentation and depth estimation and to 2 for surface normal estimation. To optimize the adapters, we use Adam with a learning rate of 0.01 and a weight decay of 0.0001, and anneal the learning rate to 0 using a cosine scheduler.
For multi-domain learning on Visual Decathlon, we follow [75,76] and use ResNet-26 [33] as the backbone for both the domain-specific networks and the universal network. In our universal network, the backbone (i.e. ResNet-26) is shared across all domains and followed by domain-specific linear classifiers. We use the same data augmentation as in [75,76], such as random cropping and flipping. We use SGD with weight decay as the optimizer and train the domain-specific networks and our universal network for 120 epochs, as in [75,76].
For domain-specific models, i.e., learning one model per domain, we first train a model on ImageNet with learning rate of 0.1 and weight decay of 0.0005 and we finetune it for other datasets with weight decay of 0.0005.For finetuning on each dataset (except ImageNet), we set learning rate to 0.1 for Airc., Flwr, OGlt, SVHN, UCF and 0.01 for C100, DPed, DTD, GTSR.
Here we set the loss weights to 1 (i.e. λ^t = 1 for all tasks) and perform cross-validation to search the loss weights (λ^t_f, λ^t_p) in {0.1, 1, 10} for knowledge distillation on features and predictions. We set λ_f and λ_p to 10 for ImageNet, 0.1 for DPed, and 1 for the other datasets. We optimize our universal network and the vanilla MDL model using SGD as in [75,76], with a learning rate of 0.01 and a weight decay of 0.0001, for 120 epochs. The learning rate is scaled by 0.1 at the 80-th and 100-th epochs as in [75,76]. We optimize the parameters of the adapters using Adam with a learning rate of 0.01, which is annealed to zero using a cosine scheduler, and a weight decay of 0.0005. We evaluate our method through the official online evaluation provided by [75].
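The two learning-rate schedules described above can be sketched as plain functions. This is a minimal illustration under stated assumptions: epochs are counted from 0, the step decay takes effect from the milestone epoch onward, and the cosine schedule uses the standard half-cosine form over a known total number of steps. The function names are hypothetical.

```python
import math


def multistep_lr(epoch, base_lr=0.01, milestones=(80, 100), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once per milestone reached.

    Assumes 0-indexed epochs and that the decay applies from the milestone
    epoch onward (an assumption; the text only says "at the 80-th and
    100-th epochs").
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def cosine_lr(step, total_steps, base_lr=0.01):
    """Half-cosine annealing from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


if __name__ == "__main__":
    # Universal network: 120 epochs of SGD with step decay.
    for epoch in (0, 79, 80, 100, 119):
        print(epoch, multistep_lr(epoch))
    # Adapters: Adam with cosine annealing to zero.
    for step in (0, 60, 120):
        print(step, cosine_lr(step, total_steps=120))
```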

A.3 Cross-domain Few-shot Learning
In all experiments we build our method on ResNet-18 [33] backbone for both single-domain and multi-domain networks.
Training details of single-domain models. We train one ResNet-18 model for each training dataset. For optimization, we follow the training protocol in [23]. Specifically, we use the SGD optimizer and cosine annealing for all experiments, with a momentum of 0.9 and a weight decay of 7 × 10^−4. The learning rate, batch size, annealing frequency and maximum number of iterations are shown in Tab. 14. To regularize training, we also use exactly the same data augmentations as in [23], e.g. random crops and random color augmentations.
Training details of our method. In the multi-domain network, we share all layers except the last classifier across the domains. To train the multi-domain network, we use the same optimizer, with a weight decay of 7 × 10^−4, and the same scheduler as for the single-domain models, and train for 240,000 iterations. The learning rate is 0.03 and the annealing frequency is 48,000. Similar to [98], where training episodes come from the ImageNet data source with 50% probability, half of each training batch for our multi-domain network comes from ImageNet. In other words, the batch size is 64 × 7 for ImageNet and 64 for each of the other 7 datasets. We first set the loss weight λ^t of the domain-specific losses to 1 for all domains. We set λ_f and λ_p to 4 for ImageNet and to 1 for the other datasets. We linearly anneal λ by λ ← λ × (1 − t/K), where t is the current iteration and K is the total number of iterations over which λ is annealed to zero. Here, K = k × (anneal. freq.), where anneal. freq. is 48,000 in this work. We search k ∈ {1, 2, 3, 4, 5} by cross-validation over the validation sets of the 8 training datasets; k is 5 (i.e. K = 240,000) for ImageNet, 2 for Omniglot, Quick Draw and Fungi, and 1 for the other datasets. For all experiments, early stopping is performed based on cross-validation over the validation sets of the 8 training datasets.
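The linear annealing of the distillation weights can be sketched as follows. This is a minimal illustration, assuming the update λ ← λ × (1 − t/K) is read as rescaling the initial weight at iteration t and that the weight stays at zero once t exceeds K; the function name and the clamping are assumptions.

```python
ANNEAL_FREQ = 48_000  # annealing frequency stated in the text


def annealed_weight(lambda_0, t, k, anneal_freq=ANNEAL_FREQ):
    """Distillation weight at iteration t, decayed linearly to zero at K.

    lambda_0 is the initial weight (e.g. 4 for ImageNet, 1 otherwise) and
    K = k * anneal_freq. Clamping at zero for t > K is an assumption.
    """
    K = k * anneal_freq
    return lambda_0 * max(0.0, 1.0 - t / K)


if __name__ == "__main__":
    # ImageNet: lambda_0 = 4, k = 5, so K = 240,000.
    print(annealed_weight(4.0, 120_000, k=5))
    # A dataset with k = 1: the weight reaches zero after 48,000 iterations.
    print(annealed_weight(1.0, 48_000, k=1))
```

With k = 1 the distillation terms are switched off after the first annealing period, so the remaining iterations optimize only the domain-specific loss for that dataset.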
For the optimization of feature adaptation during the meta-test stage, we initialize ϑ as an identity matrix, which allows the NCC to use the original features produced by our universal network and to optimize ϑ from a good starting point. Similar to the optimization in [23], we optimize ϑ for 40 iterations using Adadelta [112] with a learning rate of 0.1 for the first eight datasets and 1 for the last five datasets.
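To make the role of the identity initialization concrete, here is a minimal sketch of nearest centroid classification on linearly transformed features. It assumes that ϑ acts as a square linear map on the feature vectors and that queries are scored by cosine similarity to the class centroids; the similarity measure and all function names are assumptions, and the Adadelta optimization of ϑ is omitted.

```python
import math


def identity(d):
    """d x d identity matrix used to initialize the feature transform."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def apply_linear(theta, feat):
    """Return theta @ feat for a square matrix and a feature vector."""
    return [sum(w * x for w, x in zip(row, feat)) for row in theta]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ncc_predict(theta, support, support_labels, query):
    """Classify each query feature by its most similar class centroid."""
    sums, counts = {}, {}
    for feat, label in zip(support, support_labels):
        z = apply_linear(theta, feat)
        acc = sums.setdefault(label, [0.0] * len(z))
        for j, v in enumerate(z):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    preds = []
    for feat in query:
        z = apply_linear(theta, feat)
        preds.append(max(centroids, key=lambda c: cosine(z, centroids[c])))
    return preds
```

With `theta = identity(d)`, `apply_linear` returns the input unchanged, so the first predictions are exactly those of NCC on the universal features; any later update of ϑ then refines this starting point.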

B.1 Multi-task Dense Prediction
Here, we qualitatively compare our method to STL, MTL with uniform loss weights (Uniform), the best competing method, IMTL-H [57], and other related methods on the multi-task dense prediction problem on NYU-v2 with the SegNet backbone (see Fig. 5). The Uniform baseline improves over STL on segmentation and depth estimation, but it performs worse on surface normal estimation. Dynamically balancing the loss values with IMTL-H improves the overall performance, yet it still performs worse on surface normal estimation. Finally, by distilling representations from the single-task learning models into the universal network, our method produces results better than or comparable to STL: its surface normal outputs are similar to those of STL, and its segmentation and depth predictions are more accurate, because our method enables a balanced optimization of the universal network in which one task can benefit from another. This indicates the effectiveness of our method in learning shared representations for multiple dense prediction tasks.

B.2 Cross-domain Few-shot learning
In this section, we first evaluate each single-domain model for few-shot classification on each test dataset. We then show complete results in the varying-way five-shot and five-way one-shot settings. We also evaluate the effect of the adaptors for aligning features in knowledge distillation. As the code of MetaDataset has been updated, we report results using the updated evaluation protocol from MetaDataset and compare our method with the CrossTransformer [21] and Transductive CNAPS [38] methods. Finally, more qualitative results and global retrieval results are reported.

B.2.1 Complete results of single domain learning
To study universal representation learning from multiple datasets, we train one network on each training dataset, use each single-domain network as the feature extractor, and test it for few-shot classification on each dataset. This involves evaluating 8 single-domain networks on 13 datasets using the Nearest Centroid Classifier (NCC). Table 15 shows the results of the single-domain learning models, where each column presents the mean accuracy and 95% confidence interval of a single-domain network trained on one dataset (e.g. ImageNet) and evaluated on the 13 test datasets. The average accuracy and 95% confidence intervals are computed over 600 few-shot tasks. The numbers in bold indicate the best accuracy per dataset.
As shown in Tab. 15, the features of the ImageNet model generalize well and achieve the best results on four out of eight seen datasets, namely ImageNet, Birds, Texture and VGG Flower, and on four out of five previously unseen datasets, namely Traffic Sign, MSCOCO, CIFAR-10 and CIFAR-100. The models trained on Omniglot, Aircraft, Quick Draw and Fungi perform best on their corresponding datasets, while the Omniglot model also generalizes well to MNIST, whose images have a similar style to those of Omniglot. We then pick the best performing model for each dataset to form the best single-domain model (Best SDL), which serves as a very competitive baseline for universal representation learning.
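The reported statistic can be sketched as follows. This is a minimal illustration, assuming the 95% confidence interval is the usual normal approximation, 1.96 times the standard error of the mean over the sampled few-shot tasks; the function name is hypothetical and the exact interval used by the evaluation code may differ.

```python
import math
import statistics


def mean_and_ci95(task_accuracies):
    """Return (mean, half_width) of a normal-approximation 95% interval.

    task_accuracies holds one accuracy per few-shot task (e.g. 600 values).
    """
    n = len(task_accuracies)
    mean = statistics.fmean(task_accuracies)
    std_err = statistics.stdev(task_accuracies) / math.sqrt(n)
    return mean, 1.96 * std_err


if __name__ == "__main__":
    # Toy accuracies for 600 tasks, alternating between 50% and 70%.
    accs = [0.5, 0.7] * 300
    mean, half = mean_and_ci95(accs)
    print(f"{100 * mean:.1f} +/- {100 * half:.1f}")
```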

B.2.2 Effect of adaptors in knowledge distillation
In this section, we evaluate our method with and without adaptors for aligning features when we use CKA for knowledge distillation. From Tab. 17, we can see that using adaptors improves the performance, e.g. on Birds (+1.7), VGG Flower (+3.6) and MSCOCO (+1.3). This indicates that the adaptors A_θ help align features between the multi-domain and single-domain learning networks, which are learned from very different domains.

B.2.3 Complete results of varying-way five-shot and five-way one-shot
We further analyze our method in the 5-shot setting with a varying number of categories. To this end, we follow the setting in [21] and compare our method to the three best state-of-the-art methods: Simple CNAPS, SUR and URT. In this setting, we sample a varying number of ways in MetaDataset as in the standard setting, but a fixed number of shots, to form balanced support and query sets. The mean accuracy and 95% confidence interval of our method and the compared approaches are reported in Tab. 16. As shown in Tab. 16, overall performance for all methods decreases in most datasets compared to results in the conventional setting

Table 18 Global retrieval performance on MetaDataset (seen datasets). In addition to few-shot learning experiments, we evaluate our method in a non-episodic retrieval task to further compare the generalization ability of our universal representations.

Table 20 Comparison to baselines and state-of-the-art methods on MetaDataset. Mean accuracy and 95% confidence interval are reported. The first eight datasets are seen during training and the last five datasets are unseen and used for test only. Average rank is computed according to the first 10 datasets, as some methods do not report results on the last three datasets.
MetaDataset (footnote 9) and report the results (footnote 10) in Tab. 20. As shown in Tab. 20, the update has little effect on the results: our method ranks 1.2 on average, while the state-of-the-art methods SUR and URT rank 5.4 and 4.2, respectively. More specifically, we obtain significantly better results than the second best approach on Aircraft (+4.1), Birds (+1.3), Texture (+4.1) and Fungi (+2.9) among the seen domains, and on Traffic Sign (+4.1) and MSCOCO (+4.3) among the unseen domains. The results show that jointly learning a single set of representations provides better generalization ability than fusing the ones from multiple single-domain feature extractors, as done in SUR and URT. Notably, our method requires fewer parameters and less computation at inference than SUR and URT, as it runs only one universal network to extract features, while both SUR and URT need to pass the query set through multiple single-domain networks.
Footnote 9: As mentioned in https://github.com/google-research/MetaDataset/issues/54, we also set the shuffle buffer size to 1000 to evaluate all methods and report the results in Tab. 20. This change has little effect on the results, as the datasets we used were shuffled using the latest data conversion code from MetaDataset.
Footnote 10: Results of Proto-MAML [98], BOHB-E [85] and CNAPS [79] are obtained from MetaDataset. The results of Simple CNAPS [4] are reproduced by the authors and reported at https://github.com/peymanbateni/simple-cnaps. We reproduce the results of SUR [23] and URT [56] with the updated evaluation protocol for fair comparison.

B.2.5 Comparison to Cross-Transformer
Here we compare our method to CTX [21] and TCNAPS [3] (footnote 11) in Tab. 21. Note that TCNAPS and CTX are not directly comparable to our method. TCNAPS extends Simple CNAPS [4] to a more favorable transductive inference setting and exploits the query set at test time, in contrast to the inductive learning in our method. CTX [21] focuses on learning from a single domain (ImageNet), while our method is proposed to learn a single set of universal representations from multiple domains. In addition, CTX is built on a heavier network (ResNet-34) and larger images (224×224) than ours (ResNet-18, 84×84 images). Nevertheless, as shown in Tab. 21, our method still outperforms TCNAPS and CTX on most of the domains (8 out of 10 and 5 out of 10, respectively). Both the transductive learning in TCNAPS and the cross-attention mechanism in CTX are potentially orthogonal to our universal representation learning and could thus be incorporated into ours; we leave this as future work. We will include the results and detailed discussion in the final version.

B.2.6 Qualitative results
We qualitatively analyze our method and compare it to the vanilla multi-domain learning (MDL) baseline, Simple CNAPS [4], SUR [23] and URT [56] in Figs. 6 to 18 by illustrating the nearest neighbors of a query image in all test datasets. Our method clearly retrieves more correct neighbors than the other methods. While the other methods retrieve images with more similar colors, shapes and backgrounds, e.g. in Figs. 14 and 15, our method is able to retrieve semantically similar images. This again suggests that our method learns more useful and general representations.

B.2.7 Complete global retrieval results
Here we go beyond the few-shot classification experiments and evaluate the generalization ability of the representations learned by the multi-domain network in a retrieval task, inspired by the metric learning literature [71,109]. To this end, for each test image, we find its nearest images in the entire test set in the feature space and test whether they belong to the same category. As the evaluation metric, we use Recall@k, which counts a prediction as positive if at least one of the k closest neighbors has the same label. In Tabs. 18 and 19, we compare our method with Simple CNAPS in terms of Recall@1, Recall@2, Recall@4 and Recall@8. Because URT and SUR require adaptation using a support set, and no such adaptation is possible in the retrieval task, we replace them with two baselines that concatenate or sum the features from multiple domain-specific networks. Our method achieves the best performance in ten out of thirteen domains, with significant gains on Aircraft, Birds, Textures and Fungi. This strongly suggests that our multi-domain representations are the key to the success of our method in the previous few-shot classification tasks.
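The Recall@k metric described above can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance in the feature space and that each query image is excluded from its own neighbor list; the distance measure and the function name are assumptions.

```python
def recall_at_k(features, labels, k):
    """Fraction of images with a same-label image among their k nearest neighbors."""
    n = len(features)
    hits = 0
    for i in range(n):
        dists = []
        for j in range(n):
            if j == i:
                continue  # a query is never its own neighbor
            d = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            dists.append((d, j))
        dists.sort()
        if any(labels[j] == labels[i] for _, j in dists[:k]):
            hits += 1
    return hits / n


if __name__ == "__main__":
    # Toy 1-D features: the last point has label 0 but sits near the label-1 cluster.
    feats = [[0.0], [0.2], [1.0], [1.2], [0.75]]
    labs = [0, 0, 1, 1, 0]
    for k in (1, 2, 3):
        print(k, recall_at_k(feats, labs, k))
```

Recall@k is non-decreasing in k, since enlarging the neighbor list can only add candidates with the correct label.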

Fig. 1 We propose a Universal Representation Learning framework in (a) that generalizes over multi-task dense prediction tasks (b), multi-domain many-shot learning (c), cross-domain few-shot learning (d).

Fig. 3 Qualitative results on NYU-v2. The first column shows the RGB image; the second column shows the ground truth or the predictions of all methods for semantic segmentation with the IoU (↑) score; the third column shows the ground truth or predictions for depth estimation with the absolute error (↓); and the last column shows the surface normal predictions with the mean error (↓).

Fig. 4 Qualitative analysis of our method in two datasets.Green and red colors indicate correct and false predictions respectively.

Fig. 5 Qualitative results on NYU-v2. The first column shows the RGB image; the second column shows the ground truth or the predictions of all methods for semantic segmentation with the IoU (↑) score; the third column shows the ground truth or predictions for depth estimation with the absolute error (↓); and the last column shows the surface normal predictions with the mean error (↓).

Table 21
Comparison to CrossTransformer (CTX) and Transductive CNAPS (TCNAPS) on MetaDataset. Mean accuracy and 95% confidence interval are reported. The first eight datasets are seen during training and the last five datasets are unseen and used for test only. Note that TCNAPS and CTX are not directly comparable to our method.

Table 2
Testing results on Cityscapes. We evaluate the single-task learning (STL) method and multi-task learning (MTL) methods on Cityscapes. Mean intersection over union (mIoU) for semantic segmentation, absolute error (aErr) for depth estimation and multi-task performance (Δ_MTL) are reported.

Table 4
Testing results on NYU-v2. We evaluate the single-task learning (STL) method and decoder-based multi-task learning (MTL) methods with HRNet on NYU-v2. Mean intersection over union (mIoU) for semantic segmentation, absolute error (aErr) for depth estimation, mean error (mErr) for surface normal estimation and multi-task performance (Δ_MTL) are reported.
First, we see that decoder-based methods achieve better performance than the encoder-based ones, as they use more powerful customized architectures and are initialized with pretrained ImageNet weights. Similar to the encoder-based methods, the vanilla MTL method obtains better performance in segmentation than STL, while it performs worse in depth and surface normal estimation than the STL models on NYU-v2 due to the unbalanced optimization. In Cityscapes, it performs

Table 5
Testing results on Cityscapes. We evaluate the single-task learning (STL) method and decoder-based multi-task learning (MTL) methods with HRNet on Cityscapes. Mean intersection over union (mIoU) for semantic segmentation, absolute error (aErr) for depth estimation and multi-task performance (Δ_MTL) are reported.

Table 6
Universal Representation Learning on Visual Decathlon. Accuracy on the test sets of the individual datasets, average accuracy over the 10 datasets (avg), evaluation score (S), multi-domain learning performance (Δ_MDL) and the number of parameters (#params) w.r.t. a single-task network are reported.

Table 7
Comparison to baselines and state-of-the-art methods on MetaDataset. Mean accuracy and 95% confidence interval are reported. The first eight datasets are seen during training and the last five datasets are unseen and used for test only. Average rank is computed according to the first 10 datasets, as some methods do not report results on the last three datasets.
general features than ImageNet representations. However, this is challenging due to optimization issues. The difference between our model and the vanilla MDL model shows that simply optimizing over multiple domain-specific loss functions is not sufficient to obtain good representations, and that representation distillation is crucial. We also show that residual adapters (RAs) can be incorporated into our universal network. Jointly learning a shared ResNet-26 backbone with residual adapters, i.e. Ours (RA), boosts the performance: Ours (RA) obtains the best or second best performance in most datasets (e.g. Airc., DTD), the best average accuracy (80.52), the best Δ_MDL performance (+5.96) and the best score (4005), while requiring only 2× the parameters.

Table 9
Universal Representation Learning with task grouping on the Visual Decathlon benchmark. Accuracy on the test sets of the individual datasets, average accuracy over the 10 datasets (avg), evaluation score (S), multi-domain learning performance (Δ_MDL) and the number of parameters (#params) w.r.t. a single-task network are reported.

Table 10
Testing results on NYU-v2.'linear' means we use a linear convolutional layer for adapters and 'nonlinear' means we use nonlinear adapters (i.e. each adapter consists of two linear convolutional layers and a ReLU layer between them).'w/o adapter' means aligning features without any adapters multi-task and single-task networks before measuring their discrepancy with these loss functions.

Table 12
Quantitative analysis of knowledge distillation loss functions for f. Mean accuracy and 95% confidence interval are reported. COSINE denotes negative cosine similarity. All the loss functions are applied to measure the difference between intermediate representations of neural networks. All results are obtained with feature adaptation during the meta-test stage.

Table 13
Quantitative analysis of knowledge distillation loss functions on representations and predictions. Mean accuracy and 95% confidence interval are reported. 'Ours w/o p' and 'Ours w/o f' mean that we apply only the CKA function on representations and only the KL divergence on predictions for knowledge distillation, respectively. 'Ours (f + p)' is our model using both CKA on features and KL on predictions. All results are obtained with feature adaptation during the meta-test stage.

Table 14
Training hyper-parameters of single domain learning.

Table 15
Results of all single domain learning models.Mean accuracy and 95% confidence interval are reported.The first eight datasets are seen during training and the last five datasets are unseen for test only.