Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction

In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated using features extracted at different abstraction levels. We provide the base hierarchical learning mechanism with two techniques for domain adaptation and domain-specific learning. For the former, we encourage the model to unsupervisedly learn hierarchical general features using gradient reversal at multiple scales, to enhance generalization capabilities on datasets for which no annotations are provided during training. As for domain specialization, we employ domain-specific operations (namely, priors, smoothing and batch normalization) by specializing the learned features on individual datasets in order to maximize performance. The results of our experiments show that the proposed model yields state-of-the-art accuracy on supervised saliency prediction. When the base hierarchical model is empowered with domain-specific modules, performance improves, outperforming state-of-the-art models on three out of five metrics on the DHF1K benchmark and reaching the second-best results on the other two. When, instead, we test it in an unsupervised domain adaptation setting, by enabling hierarchical gradient reversal layers, we obtain performance comparable to supervised state-of-the-art.


Introduction
Video saliency detection is the task of predicting human gaze fixation when perceiving dynamic scenes, and it is typically carried out by estimating spatio-temporal saliency maps from an input video sequence. Saliency detection, in general, can be seen as the upstream processing step of multiple applications, including object detection [15], behavior understanding [36,40], video surveillance [35,53,17,70] and video captioning [45,61,7]. Existing video saliency detection methods generally apply single-image saliency estimation on individual frames, and combine the results with recurrent layers to temporally model frame-level features. However, the two separate analysis stages in these models make them unable to capture spatial and temporal features jointly. Recently, 3D fully-convolutional models have addressed this limitation by progressively aggregating spatio-temporal cues, achieving state-of-the-art performance on standard benchmarks. For example, TASED-Net [44] adopts a standard encoder-decoder architecture, as widely used in semantic segmentation tasks [51,1,46], that learns a compact spatio-temporal representation and feeds it to a decoder subnetwork to perform saliency prediction. While these methods perform well, saliency prediction is constrained by the aggregated representation learned at the model's bottleneck. This leads the model to learn representations that are more specific to the training data distribution, consequently limiting generalization.
Fig. 1: HD²S overview. Our proposed model generates multiple intermediate saliency maps by using features extracted at different abstraction levels, and combines them to predict the output map. We refer to the intermediate saliency maps as conspicuity maps.
Following the success of 3D convolutional architectures, in this paper we propose a model based on Hierarchical Decoding for Dynamic Saliency prediction (HD²S) that, instead of using a compact spatio-temporal representation as in [44], generates multiple saliency maps by using features learned at different abstraction levels and then combines them to compute the final output. We refer to the intermediate saliency maps as conspicuity maps, as the employed architecture recalls the multi-scale model proposed in [22]. Using representations extracted at different abstraction levels (from shallow to deeper) allows the model to learn both generic (and more dataset-independent) and dataset-specific features. The resulting advantage is twofold: performance on a specific dataset is enhanced and, at the same time, adaptation capabilities improve. Our approach takes inspiration from DVA [67], but extends it to the video domain by learning spatio-temporal cues for predicting visual saliency. More specifically, HD²S, shown in Fig. 1, is a 3D fully-convolutional network that employs an ensemble of multiple prediction models, each producing a conspicuity-like map at a specific abstraction level, for better saliency estimation.
As an additional contribution, we tackle the problem of generalization for video saliency prediction. Indeed, state-of-the-art methods lack domain adaptation capabilities and require a fine-tuning step to perform well on datasets that they were not trained on. As the deep learning community is moving towards more generalizable models, we argue that this need is even stronger for saliency prediction research, given its fundamental role in an artificial vision pipeline. To address this issue, our saliency prediction network is provided with a multi-scale domain adaptation mechanism, based on gradient reversal [14], that encourages the model to learn domain-independent features. In particular, each abstraction level of HD²S is provided with a gradient reversal layer that prevents the learned representation from becoming dataset-specific.
We also address the opposite problem, i.e., domain-specific learning, by adding to the model some dataset-specific modules whose parameters are learned in order to maximize performance on a given dataset. We carry out extensive experiments testing HD²S on multiple video saliency benchmarks (DHF1K [63], UCF Sports [41,55], Hollywood2 [42]), obtaining state-of-the-art performance and outperforming existing models. Performance is further boosted, as expected, when domain-specific learning is enabled. We also thoroughly test the domain adaptation capabilities of HD²S on datasets for which no annotations are available during training. Our model shows remarkable results, achieving performance comparable to state-of-the-art models that, instead, are trained (or fine-tuned) on those datasets in a standard supervised fashion.

Related work
Saliency detection has long been investigated in AI and computer vision research. In general, saliency models can be categorized into saliency prediction approaches [65], which attempt to predict the fixation points of a human observer during free-viewing (i.e., they aim to predict where people look in a scene), and salient object detection methods [38], which instead focus on assessing the saliency of pixels w.r.t. objects of interest (i.e., they aim to separate the salient objects from the background). Saliency methods can be further categorized according to whether they process still images (static saliency) or videos (dynamic saliency).
Static saliency has been studied for decades. Initial models, biologically inspired [22] and employing handcrafted features, were followed by recent CNN-based attempts [21,48,47,32,67,13,8,6,31,23] that yield superior performance, rapidly becoming the state of the art for static saliency prediction. To overcome the lack of large eye fixation datasets, CNN-based static methods rely mainly on image classification models as backbones, exploiting their capability to extract features useful for other visual tasks. Different encoder-decoder architectures with various strategies to combine the extracted features have then been proposed. The release of larger datasets for saliency, such as MIT300 [27], SALICON [26], and CAT2000 [4], led to a performance gain. DeepGaze II [32] investigated the benefit of employing low- and high-level features in saliency prediction. Similarly, ML-NET [9] proposed to combine low- and high-level features at the bottleneck, while [31] concatenates the outputs from several layers and processes them with multiple convolutional layers with different dilation rates. Another approach is to use a two-stream encoder architecture as in [21], where the image at different spatial scales is fed as input to the model, in order to extract low- and high-resolution information. [13], based on [21], used a similar network, adding after feature extraction a channel weighting subnetwork that encodes contextual information. Differently from the above models, other works exploit adversarial training [16] for saliency prediction, such as SalGAN [47] and GazeGAN [6]. Compared to saliency models for still images, saliency prediction in videos is an even more complex problem, due to the presence of the temporal dimension and to the additional computational effort it requires. Static saliency models have been adapted to dynamic saliency by using them in frame-by-frame mode, but they are outperformed by dynamic models that jointly process the temporal dimension.
In recent years, a common strategy has been to extend static saliency models to the video scenario by incorporating motion features [64,54,57]. For example, [64] proposes a two-module architecture to exploit spatio-temporal features: the first module performs frame-level saliency prediction; the second module, instead, takes pairs of frames with saliency predicted by the first module, and generates a dynamic saliency map. [54] employs essentially the same architecture as [64], augmented with self-attention through non-local operations [66]. SalEMA [37], instead, proposes a 2D encoder-decoder architecture with a recurrent module added to the bottleneck for integrating temporal information provided by the previous frames. Motion cues have also been included in saliency prediction through either recurrent neural networks applied to spatial feature encodings or convolutional recurrent networks. OM-CNN [25] is a dual-stream network that extracts spatial and temporal features using YOLO [50] and FlowNet [11], whose respective objectness and motion features are then combined via a two-layer ConvLSTM. Similarly, ACLNet [63] performs static saliency prediction through an attention module that performs a global spatial operation on learned features. These features are then given to a ConvLSTM to model temporal information. The recent SalSAC model [68], leveraging the success of self-attention for saliency prediction [10,63], proposes an architecture with a shuffled attention mechanism on multi-level features for better modeling of spatial saliency. Correlation features between multi-level features and shuffled attention on the same features are provided to a ConvLSTM for learning temporal cues.
With the recent availability of a large-scale saliency benchmark, i.e., DHF1K [63], 3D fully-convolutional models [3,44], jointly extracting spatial and temporal features, have been proposed. RMDN [3] processes video clips with a 3D convolutional neural network based on C3D [59], and then employs LSTMs to enforce temporal consistency among the segments. TASED-Net [44] is a 3D fully-convolutional network, based on a standard encoder-decoder architecture, for video saliency detection without any additional feature processing steps. Similarly to the above approaches, our HD²S model is a 3D fully-convolutional network extending the multi-abstraction level analysis, proposed in [67] for static saliency, to the video domain by learning spatio-temporal cues.
Multi-level feature learning has already been applied in several application domains, most notably in object detection through the use of feature pyramid networks (FPN) [19]. Most relevant to our approach are the works that carry out salient object detection using multi-level feature hierarchies, such as Amulet [72] and DSS [20]. However, besides targeting static saliency prediction in images (and not in videos), those approaches apply an early-fusion mechanism of multi-level features, which are combined (through different concatenation schemas) before being further processed. Our method, instead, performs a late fusion of features: we encourage each decoding path to independently extract information from a certain abstraction layer, making sure that no inter-branch "contamination" may happen except at the very last layer, and thus pushing it to learn scale-specific and complementary saliency features.
HD²S also performs domain adaptation to generalize across datasets without the need to be fine-tuned. Indeed, in all prediction tasks, shifts between train and test distributions may lead to a significant degradation of the model's performance. Training a predictor capable of handling these shifts is commonly referred to as domain adaptation. Among the different domain adaptation settings, we focus on unsupervised domain adaptation, which is the task of aligning features extracted from the model across source and target domains, without any labelled samples from the latter. Several techniques have been proposed (though not for saliency prediction), such as regularizing the maximum mean discrepancy [39], minimizing correlation [56], or adversarial discriminator accuracy [14,60]. An effective approach to transfer the feature distribution from source to target domains is proposed in [14] through the use of gradient reversal layers, treating domain invariance as a binary classification problem. This approach addresses domain adaptation by adversarially forcing a model to solve a given task while learning features that are non-discriminative across datasets. In HD²S we apply this strategy on multi-level features (unlike typical single-branch usage), in order to support the generalization of the saliency prediction task to datasets for which no annotations are available during training.
While unsupervised domain adaptation has been applied to image classification [14,60], face recognition [28], object detection [58], semantic segmentation [73] and video action recognition [34] (among others), our work is, to our knowledge, the first to deal with unsupervised domain adaptation on video saliency prediction. It is worth noting that this is technically and fundamentally different from the form of domain adaptation proposed in UNISAL [12], which instead learns domain-specific parameters. This means that, at inference time, UNISAL needs to know the source dataset of a given input in order to select the domain-specific learned parameters. Our approach, instead, is domain-agnostic, as it employs the same learned parameters on any tested domain. It is also different from unsupervised salient object detection [71], which attempts to predict saliency by exploiting large sets of unlabelled or weakly-labelled samples. However, we also provide HD²S with domain-specific learning capabilities as in [12], showing how this mechanism improves performance but cannot be applied in unsupervised domain adaptation scenarios.

Architecture overview
The proposed architecture is a fully-convolutional multi-branch encoder-decoder network for saliency prediction, illustrated in Fig. 2. An input sequence of consecutive video frames is first processed by a feature extraction path, which computes spatio-temporal features at different scales and abstraction levels. The extracted features serve as input to separate network branches that estimate a set of conspicuity maps at the corresponding points in the model, while at the same time providing skip paths to ease gradient flow during training. At the output of the model, conspicuity maps are combined to predict the saliency map for the last frame in the input sequence. Our model is trained in a supervised way on a source dataset, for which saliency annotations are available.
Furthermore, the base model is provided with two additional mechanisms (which can both be disabled, or enabled in a mutually exclusive way):
- Domain adaptation modules that aim to make the model learn, in an unsupervised way, generalizable features (see red items in Fig. 2). In particular, each conspicuity subnetwork forks to a domain classification path, which is trained to classify whether an input video sequence (more precisely, the corresponding features at that abstraction level) is taken from the source domain or from a target domain; the latter cannot be employed for training through direct supervision, since annotations are not available. In order to perform this adaptation, we apply the gradient reversal technique: the feature extraction layer, shared by the conspicuity networks and the domain classifiers, is trained in an adversarial way, in order to force the model to learn features that are both predictive (saliency-wise) and domain-invariant, so as to achieve satisfactory results even on the target domain.
- Domain-specific learning mechanism that learns specific parameters to enhance the prediction on a given dataset. More specifically, we add modules (shown as light gray items in Fig. 2), used in a multi-source training scenario (i.e., when multiple datasets are used for training at the same time), whose parameters are optimized on each individual dataset. These modules aim to modulate features shared across multiple datasets based on the test data domain, and include domain-specific priors, batch normalization and prediction smoothing.
At inference time, saliency maps are predicted for each frame by applying the model in a sliding window fashion, as in [44]; the saliency map $S_t$ at time $t$ is predicted from a sequence $V_t = \{I_{t-T+1}, \ldots, I_t\}$, where $I_t$ is the video frame at time $t$. To predict the first $T-1$ frames, we reverse the chronological order of the corresponding input clips: each $S_t$ for $1 \le t \le T-1$ is predicted from the sequence $V_t = \{I_{t+T-1}, \ldots, I_t\}$. As a final post-processing step, we apply a Gaussian filter ($\sigma = 5$) for smoothing the output saliency map.
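The clip selection rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `clip_frame_indices`, the 1-based indexing and the error handling for short videos are assumptions made for the example.

```python
def clip_frame_indices(t, T, num_frames):
    """Return the 1-based frame indices of the clip used to predict frame t.

    For t >= T the clip holds the T frames ending at t, in chronological
    order. For t < T there are not enough past frames, so the clip is taken
    from the following frames and reversed, keeping frame t as last element.
    """
    if not 1 <= t <= num_frames:
        raise ValueError("t must be in [1, num_frames]")
    if t >= T:
        return list(range(t - T + 1, t + 1))
    if t + T - 1 > num_frames:
        raise ValueError("video too short to build a reversed clip")
    return list(range(t + T - 1, t - 1, -1))


if __name__ == "__main__":
    print(clip_frame_indices(20, 16, 100))  # frames 5..20
    print(clip_frame_indices(3, 4, 10))     # frames 6, 5, 4, 3
```

In both cases the target frame is the last element of the clip, which matches the convention that the model predicts saliency for the last input frame.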
In the following, we describe each of the components of our architecture.

Feature extractor
The employed feature extractor performs spatio-temporal encoding of an input video clip (16 frames of size 128×192), using S3D [69] as a backbone. It then progressively reduces the dimensions of the feature maps through 3D max pooling to 2×4×6 (time × height × width), while increasing the number of channels to 1024. However, in order to exploit the full potential of the learned hierarchical representations, we select feature maps at different levels of the extractor, corresponding to different levels of abstraction, in order to build a skip architecture able to capture multi-headed saliency responses. In our implementation, we select feature maps from the S3D backbone at the output of the second, third and fourth pooling layers and at the input of the last average pooling layer.
Fig. 2: HD²S architecture. Our multi-branch decoder predicts four conspicuity maps at different feature abstraction levels, which are then integrated into the final saliency prediction, on which the KL-divergence loss $L_s$ is minimized. For unsupervised domain adaptation, each decoder branch is equipped with a gradient reversal layer (see red items) that encourages the model to learn features that generalize to a target data domain in an unsupervised way, by maximizing the classification error $L_d$ on the prediction of an input sample's domain. Finally, HD²S is also provided with domain-specific priors added to the encoded features after the temporal dimension is removed, and with domain-specific smoothing as the final layer.

Conspicuity networks
After feature encoding, we learn several conspicuity maps from the partial information produced at different levels of the feature extraction stack through multiple decoder networks (referred to as conspicuity networks in Fig. 2).
Each conspicuity network in the model processes one of the spatio-temporal feature blocks coming from the feature extractor and returns a single-channel saliency map, encoding the conspicuity of spatial locations at that level of abstraction. In detail, the temporal dimension of the input feature maps is gradually removed by applying a cascade of spatially point-wise convolutions (i.e., with kernel 3 × 1 × 1 and stride 2 × 1 × 1) that halve the temporal dimension at each step. The number of point-wise convolutions varies for each conspicuity network, depending on the size of the input feature maps.
After that, the (now purely spatial) set of feature maps is processed by a stack of 2D convolutional layers, interleaved with bilinear upsampling blocks, each of which doubles the spatial size of the feature maps until the original resolution of each frame is recovered.
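As a rough illustration of why the number of temporal convolutions depends on the input size, the sketch below counts how many layers with temporal kernel 3 and stride 2 are needed to collapse the temporal dimension to 1. The temporal padding of 1 is an assumption (the text only states that each step halves the temporal dimension), and the function name is hypothetical.

```python
def temporal_reduction_steps(t_in, kernel=3, stride=2, padding=1):
    """Count the stride-2 temporal convolutions needed to reach length 1.

    Uses the standard convolution output-size formula along the temporal
    axis: t_out = floor((t_in + 2 * padding - kernel) / stride) + 1.
    """
    steps = 0
    t = t_in
    while t > 1:
        t = (t + 2 * padding - kernel) // stride + 1
        steps += 1
    return steps


if __name__ == "__main__":
    for t_in in (8, 4, 2):
        print(t_in, "->", temporal_reduction_steps(t_in), "layers")
```

Under these assumptions, a feature block with 8 temporal steps needs three such layers, while a block with 2 temporal steps needs only one.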

Saliency prediction
The four conspicuity maps produced by the above subnetworks are finally fused to predict saliency on the last frame of the input video. The global fusion layer consists of concatenating the four maps and performing a pixel-wise 1×1 convolution followed by logistic activation.
At training time, the whole model (feature extractor, conspicuity networks and saliency predictor) is trained in a supervised way on the source dataset to minimize the Kullback-Leibler (KL) divergence [44,21] between each predicted map (the output saliency map and the conspicuity maps) and the ground truth. More formally, given the predicted output saliency map $S_t$, the four conspicuity maps $C_{t,j}$ with $j = 1, 2, 3, 4$ and the ground-truth map $G_t$ for a given target frame, all normalized over pixels appropriately, our multi-level saliency loss $L_s$ is computed as follows:
$$L_s(S_t, C_t, G_t) = \sum_i G_{t,i} \log \frac{G_{t,i}}{S_{t,i}} + \sum_{j=1}^{4} \sum_i G_{t,i} \log \frac{G_{t,i}}{C_{t,i,j}}$$
where index $i$ iterates over all pixels, index $j$ iterates over the four conspicuity maps, and $G_{t,i}$, $S_{t,i}$ and $C_{t,i,j}$ are corresponding pixels of, respectively, the ground-truth map, the output saliency map and the $j$-th conspicuity map.
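The multi-level loss can be sketched in plain Python as the sum of one KL term for the output map and one for each conspicuity map. This is an illustrative sketch: maps are represented as flat lists of non-negative pixel values, and the sum-to-one normalization and the small `eps` stabilizer are assumptions rather than details given in the text.

```python
import math


def normalize(values, eps=1e-12):
    """Scale a flat list of non-negative pixel values to sum to one."""
    total = sum(values)
    return [v / (total + eps) for v in values]


def kl_divergence(gt, pred, eps=1e-12):
    """KL divergence between a ground-truth map and a predicted map."""
    g = normalize(gt)
    p = normalize(pred)
    return sum(gi * math.log(gi / (pi + eps) + eps) for gi, pi in zip(g, p))


def multi_level_saliency_loss(gt, saliency, conspicuity_maps):
    """KL term for the output map plus one KL term per conspicuity map."""
    loss = kl_divergence(gt, saliency)
    for c in conspicuity_maps:
        loss += kl_divergence(gt, c)
    return loss


if __name__ == "__main__":
    gt = [0.1, 0.2, 0.3, 0.4]
    uniform = [0.25, 0.25, 0.25, 0.25]
    print(multi_level_saliency_loss(gt, uniform, [uniform] * 4))
```

Supervising every conspicuity map with the same ground truth is what makes the supervision hierarchical: each branch receives its own error signal in addition to the one flowing back from the fused output.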

Domain adaptation
In addition to training the model in a supervised way on the source domain, we also encourage the feature extractor to generalize over a target domain, without any supervision. Our unsupervised domain adaptation strategy relies on the gradient reversal layer (GRL) approach.
In particular, we integrate domain adaptation by inserting, in all of the conspicuity subnetworks, a branch with a gradient reversal layer and a domain classifier after the temporal-dimension removal layer (see Fig. 2). More formally and generally, given an input video clip $V_t$ with associated binary domain label $d \in \{0, 1\}$ (source or target, respectively), we compute a set of associated domain classification losses $\{L_{d,1}, \ldots, L_{d,4}\}$ from the 4 domain classifiers attached to the conspicuity networks. If we denote by $\hat{d}_i$ the probability of the input being from the target domain estimated by the $i$-th classifier, the corresponding negative log-likelihood loss is defined as:
$$L_{d,i} = -\left[ d \log \hat{d}_i + (1 - d) \log (1 - \hat{d}_i) \right]$$
The overall domain classification loss is simply computed as the sum of the individual contributions, $L_d = \sum_{i=1}^{4} L_{d,i}$, since the interaction between saliency prediction and domain adaptation is controlled by the $\lambda$ hyperparameter in the gradient reversal layers. As a result, the comprehensive loss for model training with domain adaptation is the following:
$$L = L_s + L_d$$
During training, we alternately pass a batch of videos from the source domain and a batch of videos from the target domain: on the former, we compute and backpropagate both the saliency prediction loss $L_s$ and the domain classification loss $L_d$ (with target $d = 0$); on the latter, we can only compute and backpropagate the domain classification loss $L_d$ (with target $d = 1$), since no saliency annotation is available on the target domain. Minimizing the domain classification loss trains the classifiers to better discriminate between the source and the target domains, while at the same time adversarially training the feature extractor (and the initial temporal-removal layers in the conspicuity networks) to produce features that confuse the classifiers, and hence are domain-independent.
Architecturally, each domain classifier consists of a stack of 1×1 spatial convolutions aimed at reducing the number of features, followed by fully-connected layers, the last of which provides binary classification prediction of the input video's domain.
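The behavior of a gradient reversal layer and of the domain classification loss can be illustrated without a deep learning framework. The sketch below is a hand-written forward/backward pair rather than the authors' implementation; the class and function names are hypothetical, and in practice the same behavior would be obtained with a custom autograd function.

```python
import math


class GradientReversal:
    """Identity in the forward pass, gradient scaled by -lambda in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        # The reversed sign makes the upstream layers maximize the domain loss.
        return [-self.lam * g for g in grad_output]


def domain_nll(d, d_hat, eps=1e-12):
    """Binary negative log-likelihood for a domain label d in {0, 1}."""
    return -(d * math.log(d_hat + eps) + (1 - d) * math.log(1.0 - d_hat + eps))


def total_domain_loss(d, d_hats):
    """Unweighted sum of the per-branch domain classification losses."""
    return sum(domain_nll(d, p) for p in d_hats)


if __name__ == "__main__":
    grl = GradientReversal(lam=0.5)
    print(grl.forward([1.0, 2.0]), grl.backward([0.5, -1.0]))
    print(total_domain_loss(1, [0.5, 0.5, 0.5, 0.5]))
```

The classifiers themselves sit after the layer and receive unmodified gradients, so they keep minimizing the domain loss while the shared features are pushed in the opposite direction.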

Domain-specific learning
In certain multi-source training scenarios (e.g., as done in [12]), one may assume that annotations are available for all employed datasets, thus enabling supervised training on all of them. When applying our saliency prediction model to this scenario, we provide it with domain-specific operations [12], which address the domain shift among different datasets. Unlike the unsupervised domain adaptation setting, where we attempt to learn, without supervision, features that generalize over multiple datasets, we here explicitly tailor the learned features to the specific characteristics of each dataset. In practice, we adopt a set of domain-specific techniques which have been shown to be effective [12]:
- Domain-specific priors. [12] thoroughly analyzed multiple video saliency benchmarks, identifying the sources of data shift among them and encoding these sources into a set of Gaussian prior maps. We employ the same strategy by initializing domain priors as in [12], and then letting the model learn the most suitable filters to weigh the encoded spatio-temporal features depending on the input data domain. Domain priors are used to modulate the encoded features, after removing the temporal dimension (see light gray blocks in Fig. 2).
- Domain-specific smoothing. The optimal way in which the output map should be smoothed varies between different datasets and depends mostly on how the ground truth is created. To address this issue, we learn a different Gaussian kernel (i.e., with a different value of σ) for each input data domain. Unlike [12], our layer is parameterized by σ only, with convolution coefficients computed accordingly to make the filter Gaussian, while [12] initializes domain-specific convolutional filters to be Gaussian, but they may drift to non-Gaussian as the network updates its parameters. This smoothing is applied to the global saliency map (see Fig. 2).
- Domain-specific batch normalization. This aims at mitigating the impact of data distribution shift on the statistics estimated by batch normalization for inference, which may become inaccurate when computed over different benchmarks. Thus, we learn batch normalization statistics for each dataset independently and apply them accordingly at inference time, depending on the input domain.
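To make the σ-only parameterization of the smoothing layer concrete, the sketch below derives the filter coefficients of a 1D Gaussian kernel from σ alone, so the filter stays Gaussian whatever value σ takes. The 3σ truncation radius, the function name and the per-domain σ values are hypothetical choices for illustration; a separable 2D filter would apply the same kernel along both image axes.

```python
import math


def gaussian_kernel_1d(sigma, radius=None):
    """Build a normalized 1D Gaussian kernel whose only parameter is sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if radius is None:
        radius = max(1, int(math.ceil(3 * sigma)))  # assumed truncation rule
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical per-domain values; in the model each sigma would be learned.
domain_sigma = {"DHF1K": 2.0, "Hollywood2": 3.0, "UCF Sports": 2.5}

if __name__ == "__main__":
    for name, sigma in domain_sigma.items():
        kernel = gaussian_kernel_1d(sigma)
        print(name, len(kernel), round(max(kernel), 4))
```

Because the coefficients are recomputed from σ at every step, gradient updates can change the width of the filter but not its shape, which is the difference from learning free convolution weights initialized as a Gaussian.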

Datasets
- DHF1K [63]
- LEDOV [24] includes 538 videos of daily action, sports, social activity and art performance; we employ this dataset only as a target dataset for unsupervised domain adaptation.
Fig. 3 provides statistics on the training splits of the datasets employed for training our model: 1) UCF Sports is the smallest one in terms of available videos and average number of frames per video, thus it seems to be unsuitable for models with high capacity, as they are likely to overfit it; 2) Hollywood2 contains the highest number of videos, but the majority have a very small number of frames (see the right histogram in Fig. 3), thus it may disadvantage methods that model temporal cues; 3) DHF1K is the most balanced in terms of number of videos and number of frames per video.

Training procedure
In our experiments, we pre-train the S3D backbone on the Kinetics-400 [29] dataset; backbone parameters are not frozen, so they are updated during saliency prediction training. After empirically testing different hyperparameter configurations in order to find the best combination, the networks are trained for 2500 iterations, using Adam as optimizer [30] with a learning rate of $10^{-3}$. To reduce overfitting, $L_2$ regularization is applied, with a weight decay factor of $2 \times 10^{-7}$. The $\lambda$ parameter of the gradient reversal layers gradually varies from 0 to 1 during training, as a function of a training-progress variable $p$ that goes linearly from 0 to 1. Gradually increasing $\lambda$ also acts as an additional regularizer, since it prevents the model from focusing too much on the saliency prediction objective as training goes on. During training, sequences of $T = 16$ consecutive frames are randomly sampled from the dataset's videos, and each frame is spatially resized to 128 × 192.
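One schedule that matches this description is the sigmoid-shaped ramp commonly used with gradient reversal layers, λ = 2 / (1 + exp(−γp)) − 1. The sketch below assumes that form, with p taken as the fraction of completed iterations and γ = 10; both are assumptions for illustration, since the exact formula and constants used here are not recoverable from the text.

```python
import math


def grl_lambda(iteration, total_iterations, gamma=10.0):
    """Ramp lambda from 0 towards 1 as training progresses.

    p is the linear training progress in [0, 1]; gamma controls how quickly
    lambda saturates (gamma=10 is an assumed, commonly used value).
    """
    p = min(max(iteration / total_iterations, 0.0), 1.0)
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


if __name__ == "__main__":
    for it in (0, 250, 1250, 2500):
        print(it, round(grl_lambda(it, 2500), 4))
```

With such a ramp the domain classifiers have little influence on the shared features early in training, when their predictions are still noisy, and the adversarial signal grows as p approaches 1.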
We employed a batch size of 200, although due to memory limitations we forward batches of 8 samples at a time, accumulating gradients and updating the model's parameters every 25 such forward steps. When training with domain adaptation, we also forward a batch of samples from the source domain and one of samples from the target domain, and use them to update the domain classifier only.
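The accumulation scheme (25 forward steps of 8 samples for an effective batch of 200) can be checked with a toy example: averaging the mean gradients of equally sized micro-batches gives the same result as averaging over the full batch. The function below is a framework-free sketch with hypothetical names, using scalar per-sample gradients as stand-ins for real parameter gradients.

```python
def accumulated_gradient(per_sample_grads, micro_batch):
    """Average per-sample gradients by accumulating over micro-batches."""
    if len(per_sample_grads) % micro_batch != 0:
        raise ValueError("batch size must be a multiple of the micro-batch size")
    steps = len(per_sample_grads) // micro_batch
    accumulated = 0.0
    for k in range(steps):
        chunk = per_sample_grads[k * micro_batch:(k + 1) * micro_batch]
        # Each micro-batch contributes its mean gradient, scaled by 1 / steps.
        accumulated += (sum(chunk) / micro_batch) / steps
    return accumulated


if __name__ == "__main__":
    grads = [float(i) for i in range(200)]  # toy per-sample gradients
    print(accumulated_gradient(grads, 8), sum(grads) / len(grads))
```

The equivalence holds for the gradients themselves; layers that depend on batch statistics, such as batch normalization, still see only 8 samples per forward pass.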
To evaluate performance, we use each dataset's training/test split when available, with 10% of the training data used as validation split. An exception is represented by DHF1K, since ground-truth annotations for the test set are not provided, to allow blind assessment: in this case, when comparing to state-of-the-art methods (Tab. 1), we report the test accuracy as computed by the dataset curators, while for the ablation study (Tab. 3 and 4) and the domain adaptation analysis (Tab. 5 and 7) we employ the original validation set as test set.
Validation results are used to perform model selection for inference on the test set. When evaluating test performance in single-dataset experiments, the training, validation and test sets all come from the same domain. In domain adaptation experiments (with labeled source and unlabeled target datasets), training and validation splits are from the source domain (whose annotations can be used at training time), while the test set is from either an unseen portion of the target domain or from a different dataset altogether. In multi-dataset experiments, we combine the training splits of the DHF1K, UCF Sports and Hollywood2 datasets into a single training set; as validation set, we employ only DHF1K's validation split (because of its better balance compared to the other datasets, as mentioned in Sect. 4.1), while inference is carried out on each dataset's test split. In this setting, in order to support domain-specific learning and correctly update domain-specific modules, each training mini-batch contains videos from one dataset at a time, alternating between datasets to deal with different dataset sizes.
To compare the results obtained by the models, we use five commonly used evaluation metrics for video saliency prediction [5]: Normalized Scanpath Saliency (NSS), Linear Correlation Coefficient (CC), Area under the Curve by Judd (AUC-J), Shuffled-AUC (s-AUC) and Similarity (SIM). Higher scores on each metric mean better performance.
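Three of these metrics have compact closed forms and can be sketched in a few lines. The code below follows the usual definitions (NSS as the mean standardized saliency at fixated pixels, CC as the Pearson correlation between maps, SIM as the histogram intersection of the normalized maps); maps are flat lists, the population standard deviation is used, and the function names are hypothetical. Benchmark implementations may differ in such details.

```python
import math


def _standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("constant map cannot be standardized")
    return [(v - mean) / std for v in values]


def nss(saliency, fixations):
    """Mean standardized saliency at fixated pixels (fixations is a 0/1 map)."""
    z = _standardize(saliency)
    hits = [zi for zi, f in zip(z, fixations) if f]
    return sum(hits) / len(hits)


def cc(saliency, gt):
    """Pearson linear correlation coefficient between two maps."""
    a, b = _standardize(saliency), _standardize(gt)
    return sum(x * y for x, y in zip(a, b)) / len(a)


def sim(saliency, gt):
    """Sum of pixel-wise minima after normalizing both maps to sum to one."""
    s_total, g_total = sum(saliency), sum(gt)
    return sum(min(x / s_total, y / g_total) for x, y in zip(saliency, gt))


if __name__ == "__main__":
    pred = [0.1, 0.2, 0.3, 0.4]
    print(nss(pred, [0, 0, 0, 1]), cc(pred, pred), sim(pred, pred))
```

NSS is computed against discrete fixation locations, whereas CC and SIM compare the prediction with the continuous ground-truth map, which is one reason a model can rank differently across metrics.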

Video saliency prediction performance
We first test the performance of our base model (without any form of adaptation) in the supervised scenario on the DHF1K test benchmark, to evaluate its capabilities in the video saliency prediction task. We then integrate domain adaptation by means of GRL layers (as shown in Fig. 2), using the LEDOV dataset as a target domain, due to its wider subject variability than Hollywood2 and UCF Sports. Finally, we compute the performance of HD²S when using domain-specific learning, which is the form of adaptation that is most suitable for supervised learning settings and that can leverage all available annotated datasets (DHF1K, Hollywood2, UCF Sports).
Tab. 1 shows the performance of our approach compared to the state of the art. HD²S without domain adaptation (referred to in Tab. 1 simply as HD²S) outperforms all state-of-the-art methods on three out of five metrics (NSS, AUC-J, CC) and ranks second-best on SIM and third-best on s-AUC. Note that this variant also outperforms UNISAL [12], which already employs domain-specific learning, on four out of five metrics. When we also enable the domain-specific learning modules (HD²S_DSL), performance (especially NSS, CC and AUC-J) increases appreciably, and the model outperforms UNISAL on all metrics, demonstrating better representational and specialization capabilities. When using HD²S with the hierarchical gradient reversal mechanism for domain adaptation (HD²S_DA), performance slightly degrades as the model attempts to adapt the learned features to the target datasets (in this case, UCF Sports, Hollywood2 and LEDOV). However, remarkably, despite this adaptation mechanism, the model yields performance comparable with state-of-the-art ones.
Comparing HD²S with TASED-Net [44], which also employs S3D [69] as backbone, it is possible to notice that our method (with and without adaptation) significantly outperforms TASED-Net on four out of five metrics, using only half of the frames employed by TASED-Net (16 versus 32). TASED-Net slightly outperforms HD²S on s-AUC only, a metric that measures performance at the peripheral areas of the image, where a larger temporal context may make it possible to better capture the motion of an object. The generally better performance obtained by our method w.r.t. TASED-Net demonstrates the importance of hierarchical feature learning, with equal backbone features.
While our model yields the highest video saliency performance on DHF1K, and performance comparable to the state of the art on Hollywood2, its performance on UCF Sports is lower than that of UNISAL [12] and SalSAC [68], as reported in Tab. 2. This is explained first by the smaller size of UCF Sports w.r.t. DHF1K and Hollywood2. Indeed, during training, although we use all three datasets, UCF Sports accounts for about 1% of the total number of training video frames (DHF1K: 62%, Hollywood2: 37%, UCF Sports: 1%). This imbalance causes the model to overfit UCF Sports.
Fig. 4: An example of failure, taken from Hollywood2. Despite a good prediction, HD²S fails to match the ground truth, as it is collected in a task-driven experiment (action recognition), thus highlighting actions more than salient objects.
However, the suitability of Hollywood2 and UCF Sports for saliency detection deserves further discussion. Indeed, the saliency annotations of both datasets are collected in task-driven experiments (i.e., action recognition) and, as such, human observers tend to mainly observe specific actions rather than focusing on the salient objects themselves, which defeats the very purpose of saliency detection. An example is given in Fig. 4, where our model fails to match the ground truth: it focuses on the girl's face at the front (correctly, as it is the most salient area), but the ground truth mostly highlights the action of the person behind the girl. Furthermore, both datasets show a huge center bias [12] and have a rather limited variability of spatio-temporal features, especially Hollywood2, where the majority of video clips are very short. Analogously, UCF Sports is significantly smaller in terms of video frames, making it hard to train 3D convolutional models (or deep learning models in general). For all the above reasons, we believe that neither Hollywood2 nor UCF Sports should be used for saliency prediction.
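To make the metrics used in this comparison more concrete, the following is a minimal sketch of three of them (NSS, CC and SIM) following their commonly used definitions; it is not the official benchmark code, and the helper names and the toy 2x2 maps are assumptions introduced only for illustration. AUC-J and s-AUC are omitted for brevity.

```python
import math


def _flatten(grid):
    """Turn a 2D list (rows of pixels) into a flat list."""
    return [v for row in grid for v in row]


def nss(pred, fixations):
    """Mean of the z-scored prediction at fixated pixels (fixations > 0)."""
    p, f = _flatten(pred), _flatten(fixations)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((v - mean) ** 2 for v in p) / len(p))
    # Assumes a non-constant prediction (std > 0) and at least one fixation.
    hits = [(v - mean) / std for v, fix in zip(p, f) if fix > 0]
    return sum(hits) / len(hits)


def cc(pred, gt):
    """Pearson correlation between predicted and ground-truth saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    norm = math.sqrt(sum((a - mp) ** 2 for a in p) * sum((b - mg) ** 2 for b in g))
    return cov / norm


def sim(pred, gt):
    """Histogram intersection of the two maps, each normalized to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))


# Toy 2x2 example (values are illustrative only, not taken from any dataset).
pred = [[0.1, 0.2], [0.3, 0.9]]
gt = [[0.0, 0.1], [0.2, 1.0]]
fixations = [[0, 0], [0, 1]]
print(nss(pred, fixations), cc(pred, gt), sim(pred, gt))
```

NSS is computed against discrete fixation points, while CC and SIM compare against the continuous ground-truth map, which is one reason a model can rank differently across metrics.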

Ablation Studies
To validate the importance and effectiveness of the HD²S architectural design choices, we test some model variants (without any domain adaptation or domain-specific learning) on the validation set of DHF1K: 1. We first investigate the performance of our network, adding the different conspicuity nets one at a time; 2. We quantitatively and qualitatively evaluate the individual contribution of each conspicuity net, testing each of them in a simple encoder-decoder architecture.
For the ablation study, we define as Baseline our network in a simple encoder-decoder configuration, i.e., without the intermediate conspicuity maps and multi-level loss.
More specifically, in the baseline model, the feature extractor remains unchanged, but only the deepest decoder branch (Conspicuity-net 4 ) is used.
The model variants and their performance are reported in Table 3. The results show that: a) each conspicuity net makes its own contribution to improving the final performance; b) the multi-level loss on conspicuity maps enhances saliency prediction too. Overall, these results confirm the effectiveness of the main design choices in HD²S.
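The idea of supervising every conspicuity map in addition to the final prediction can be sketched as follows. This is only an illustration under stated assumptions: the per-map loss is taken to be a KL divergence and the terms are summed without weights, whereas the actual formulation is the one given by Equation 5 of the paper, which is not reproduced here. The function names are hypothetical and maps are represented as flat lists.

```python
import math


def kl_divergence(pred, gt, eps=1e-8):
    """KL(gt || pred) between two maps, each normalized to sum to 1."""
    sp, sg = sum(pred), sum(gt)
    total = 0.0
    for p, g in zip(pred, gt):
        p, g = p / sp, g / sg
        total += g * math.log(g / (p + eps) + eps)
    return total


def multi_level_loss(final_map, conspicuity_maps, gt):
    """Assumed form: loss on the final map plus one loss per conspicuity map."""
    loss = kl_divergence(final_map, gt)
    for consp in conspicuity_maps:
        loss += kl_divergence(consp, gt)
    return loss


# Illustrative values: the ground truth and four uniform intermediate maps.
gt = [0.1, 0.2, 0.7]
uniform = [1.0, 1.0, 1.0]
print(multi_level_loss(gt, [uniform] * 4, gt))
```

In the baseline configuration described above, only the term for the deepest decoder branch would remain, since the intermediate conspicuity maps are not produced.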
In our control experiments, we also evaluate the individual contribution of each conspicuity net by testing the performance of the model when the other decoder streams are ablated. For example, when testing the contribution of the first conspicuity map, we use only Feature 1 (see Fig. 2) from the encoder stream and the related decoder stream (Conspicuity-net 1 in Fig. 2), and so on for the other conspicuity nets. Results, reported in Table 4, indicate that, taken individually, the third conspicuity net performs better than the others. To further elucidate this behavior, Fig. 5 shows the weights learned by the fusion layer of the HD²S model when integrating the four conspicuity maps for the final prediction on the DHF1K dataset. The obtained values confirm that Conspicuity-net 3 contributes the most (see left block in Fig. 5) to the prediction task for our HD²S model. However, it becomes less important when HD²S is provided with domain-specific capabilities (which allowed it to yield the highest performance on DHF1K; see Tab. 1), as shown in the right block of Fig. 5. Furthermore, in the domain adaptation case, the different conspicuity maps contribute almost equally to the prediction, as a consequence of the mechanism that makes the features domain-independent.

A qualitative interpretation of this behavior and of the contribution of each conspicuity map in the hierarchy is shown in Fig. 6. When comparing the behavior of the different decoder branches in the standard, domain adaptation, and domain-specific learning regimes, the following considerations can be drawn: 1) in the standard training case (top line in Fig. 6), Map 4 does not provide additional information w.r.t. Map 3; 2) in the domain adaptation scenario (middle line in Fig. 6), all feature maps appear to contribute equally; 3) in the domain-specific learning case (bottom line in Fig. 6), Map 4 provides additional (motion) information to Map 3, while in the standard learning approach the two maps encode similar information. This provides an interpretation of the parameters learned by the fusion layers, reported in Fig. 5. Analyzing the intermediate maps in the domain-specific learning case (bottom line in Fig. 6), we can observe that the four intermediate maps encode saliency at different levels of detail: Map 1 extracts small background motion, Map 2 focuses mainly on the bull, Map 3 starts highlighting the bullfighter and, finally, Map 4 puts more emphasis on the bullfighter. A standard encoder-decoder architecture would instead use only the last map for saliency, thus missing the bull. This highlights the usefulness of the proposed hierarchical decoding scheme.
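As a rough illustration of how fusion weights can be read as per-map contributions, the sketch below assumes that the fusion layer reduces to one scalar weight per conspicuity map (equivalent to a 1x1 convolution over the stacked maps). This is an assumption made for the example, and the weight values are hypothetical, not those plotted in Fig. 5.

```python
def fuse(conspicuity_maps, weights, bias=0.0):
    """Pixel-wise weighted sum of equally sized maps (given as flat lists)."""
    n_pixels = len(conspicuity_maps[0])
    return [
        bias + sum(w * m[i] for w, m in zip(weights, conspicuity_maps))
        for i in range(n_pixels)
    ]


def relative_contribution(weights):
    """Share of each map in the fusion, measured by absolute weight."""
    total = sum(abs(w) for w in weights)
    return [abs(w) / total for w in weights]


# Hypothetical weights for four conspicuity maps; the third one dominates.
weights = [0.1, 0.2, 0.5, 0.2]
maps = [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1], [0.7, 0.3]]
print(fuse(maps, weights))
print(relative_contribution(weights))
```

Under this reading, near-equal weights correspond to the domain adaptation case described above, where all maps contribute similarly.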

Domain adaptation performance
When testing domain adaptation performance, we distinguish two cases: a) the capability of the model to address domain shift, i.e., to reduce the shift between training and test data; and b) the capability of the model to learn generalizable features that can be employed without any additional tuning.

Domain-shift. To assess the performance of our hierarchical domain adaptation approach in tackling the problem of domain shift, we run a set of experiments by selecting different combinations of datasets to be employed as source domain (used in a supervised way during training) and target domain (used in an unsupervised way during training); as test set, an unseen portion of the target domain is used. The assumption in these experiments is that unsupervised learning is performed on the test domain through our hierarchical gradient reversal approach before testing on it (on the appropriate test split, not used for unsupervised learning).
In particular, we compare the performance of our base model in three scenarios:
- Domain generalization, i.e., the model trained in a supervised way on the source domain and directly tested on the target domain, with no additional information on the test dataset used during training;
- Domain adaptation, i.e., the model trained with unsupervised adaptation on the target domain, enabled through the hierarchy of GRL layers as in our full model in Fig. 2;
- Transfer learning, i.e., the model (with gradient reversal disabled) trained on the source dataset and then fine-tuned (in a supervised way) on the target dataset. This scenario represents the upper bound of the evaluation and is, of course, out of the scope of pure domain adaptation, since target domain labels are available at training time.
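For readers unfamiliar with gradient reversal, the behavior of a single GRL can be sketched as below: it is the identity in the forward pass and flips the sign of the gradient (scaled by a factor) in the backward pass, so that the feature extractor is pushed to make the domain classifier fail. This is a framework-free illustration, not the authors' implementation; the class name, the `lam` scaling parameter and the choice of four levels are assumptions made for the example.

```python
class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical scaling factor for the reversed gradient

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.lam * g for g in grad_output]


# One GRL per feature level, mirroring the idea of a hierarchy of GRL layers.
grls = [GradientReversal(lam=0.5) for _ in range(4)]
features = [0.3, -1.2]
domain_grad = [0.8, -0.4]  # illustrative gradient from a domain classifier
for level, grl in enumerate(grls, start=1):
    print(level, grl.forward(features), grl.backward(domain_grad))
```

In the transfer learning scenario listed above, these layers would simply be disabled, leaving only the supervised saliency loss.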
Tab. 5 shows the results for different combinations of source and target domains. Two main patterns of results can be identified, depending on whether DHF1K is employed as source domain or not. In the former case (top block of Tab. 5), the employment of gradient reversal layers improves performance over all target datasets, compared to simply training on the source dataset. When instead DHF1K is employed as target domain (second and third blocks in Tab. 5), the use of gradient reversal layers degrades performance. This may be due to the specific characteristics of Hollywood2 and UCF Sports, which were collected in task-driven experiments, while DHF1K was collected in a free-viewing one. Furthermore, the limited variability of spatio-temporal features from videos in Hollywood2, as shown in Fig. 3, makes it harder for the model to move clustered features and to learn more general representations. Similarly, when UCF Sports is used as source domain, the small size of the dataset makes it easier for the model to focus on the supervised saliency prediction task (on which it can easily achieve a low training loss), rather than on minimizing the domain adaptation loss. Overall, as expected, the highest performance is obtained in the transfer learning regime.
Learning generalizable features. We also test the capability of the model to learn general features by using, in the domain adaptation stream, a target dataset different from the one used for testing. Specifically, we compute performance when training on DHF1K, adapting the learned features to LEDOV, and testing on never-seen datasets (UCF Sports and Hollywood2). Results are reported in Table 6, which shows that the performance gain of HD²S, when empowered with hierarchical gradient reversal modules, is higher than in the domain-shift experiments (see Table 5). This suggests that our hierarchical domain adaptation mechanism is better at learning saliency features that generalize well across multiple data domains than at addressing the domain shift for a given dataset.

Multi-source training
A recent trend in video saliency prediction [12] proposes multi-source training as a means for improving performance by leveraging the larger input variability of multiple data sources. This setup also allows for the integration of domain-specific learning capabilities, as mentioned in Sect. 3.6, that attempt to tune general features to specific datasets. The idea is to have a model that learns shared features across multiple datasets and then to employ domain-specific modules to adapt such features to a particular data domain. Although these domain-specific approaches do not strictly comply with the standard unsupervised domain adaptation formulation, as they go in the exact opposite direction to learning generalizable features (since they assume that target domain labels are available at training time), it is interesting to evaluate the impact of domain-specific learning on our architecture. In Sect. 4.3 and Tab. 1, we already showed that the integration of domain-specific capabilities into the HD²S model achieves state-of-the-art performance on DHF1K, outperforming [12], which introduced those techniques.

Here, we complete our analysis by assessing the impact of domain-specific layers compared to multi-source domain learning. More specifically, for multi-source domain learning, we use the union of DHF1K, Hollywood2 and UCF Sports as a unified dataset for training and testing our model. As for domain-specific learning, we enable the domain-specific modules (described in Sect. 3.6) and train their parameters using data from each individual dataset; during inference we provide, as an additional input to the model, the dataset on which we want to test it. We also compute performance when using a single-source domain, i.e., training and testing on a single dataset at a time. The results in Tab. 7 show that multi-source training by itself does not provide a much larger boost compared to single-source training, while domain-specific learning of dataset characteristics significantly improves performance, confirming that saliency prediction models benefit from embedding domain-specific layers for multiple datasets at training time.
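The mechanism of passing the dataset as an additional input at inference time can be illustrated with a domain-specific normalization layer that keeps separate statistics and affine parameters per dataset. The sketch below is a simplified, framework-free approximation of this idea (one scalar channel, fixed statistics); the class and method names are hypothetical, and the numeric values are not taken from the paper.

```python
import math


class DomainSpecificBatchNorm:
    """Keeps separate normalization statistics and affine parameters per dataset."""

    def __init__(self, domains, eps=1e-5):
        self.eps = eps
        self.params = {
            d: {"mean": 0.0, "var": 1.0, "gamma": 1.0, "beta": 0.0} for d in domains
        }

    def set_stats(self, domain, mean, var):
        # In a real model these would be running estimates updated during training.
        self.params[domain]["mean"] = mean
        self.params[domain]["var"] = var

    def __call__(self, x, domain):
        # The dataset identifier selects which set of parameters is applied.
        p = self.params[domain]
        scale = p["gamma"] / math.sqrt(p["var"] + self.eps)
        return [(v - p["mean"]) * scale + p["beta"] for v in x]


bn = DomainSpecificBatchNorm(["DHF1K", "Hollywood2", "UCF Sports"])
bn.set_stats("DHF1K", mean=2.0, var=4.0)  # illustrative statistics
features = [4.0, 0.0]
print(bn(features, "DHF1K"))       # normalized with DHF1K statistics
print(bn(features, "UCF Sports"))  # normalized with the default statistics
```

The shared layers see all datasets, while only the small per-domain parameter sets differ, which matches the contrast drawn above between plain multi-source training and domain-specific learning.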

Model size and runtime
From a computing resource perspective, Tab. 8 compares our model with state-of-the-art techniques, in terms of processing time and model size. Reference values for the compared approaches are from [12].

Qualitative analysis
We here report a qualitative analysis of the results obtained by our model. Fig. 7 shows examples of saliency predictions made by our HD²S model with domain-specific learning on the DHF1K benchmark. The model is able to effectively handle object occlusion, multiple objects, fast motion, strong camera motion, stationary objects, saliency shift, camera focus change and low-light conditions. Sample videos of how our model works are also available on the GitHub page of the project. Fig. 8, instead, shows examples of failures, which typically happen in the case of small global motion or small objects. These failures can be caused by the spatial resolution to which input images are scaled before being processed by the model (128×192). Indeed, in the first two cases of Fig. 8 the model is unable to identify the correct salient region (located in a lateral region of the scene), and instead predicts a generic prior-driven center region.
In the last case, the model fails to detect the movement of a golf ball towards the hole (a slow movement of a small object), and erroneously predicts as salient the upper-right region of the scene, where a man with a red shirt significantly stands out from the surroundings.
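A back-of-the-envelope calculation helps to see why small objects suffer at the 128×192 input resolution. The sketch below only rescales a pixel extent proportionally; the source frame height of 360 pixels and the 15-pixel object size are assumptions chosen for illustration (and 128 is assumed to be the input height), not values reported in the paper.

```python
def scaled_extent(size_px, src_dim, dst_dim):
    """Extent of an object (in pixels) after resizing src_dim to dst_dim."""
    return size_px * dst_dim / src_dim


# Hypothetical case: a 15-pixel-tall object in a 360-pixel-tall frame,
# resized so that the frame height becomes 128 pixels.
print(scaled_extent(15, src_dim=360, dst_dim=128))
```

Under these assumptions the object shrinks to roughly five pixels, which is consistent with the prediction being dominated by the prior rather than by the object itself.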

Conclusion
In this work, we propose HD²S, a new fully-convolutional network for video saliency prediction. The key architectural elements of our approach are a multi-branch decoder, which acts at different feature abstraction layers to independently estimate conspicuity maps that are then combined into the final prediction, and an unsupervised domain adaptation mechanism, which enables our model to learn features that reach state-of-the-art performance on supervised saliency prediction while generalizing to domains for which no annotations are provided at training time. Additionally, when employing domain-specific learning techniques, as introduced in [12], our model's performance on the supervised saliency prediction task further improves.
Comparing our approach with state-of-the-art models, we find that our late-fusion mechanism of multi-level saliency features provides a significant boost to performance: our ablation studies show that the gradual integration of multiple abstraction levels positively affects prediction accuracy. This is also confirmed by analyzing the learned weights. Interestingly, the impact of each conspicuity map (and, therefore, of each abstraction level of learned features) seems to vary depending on the employed domain adaptation mechanism. High-level features become predominant when domain-specific learning is applied (possibly due to the larger data distribution variability introduced by multi-source training, which causes shallower features to generalize less). In contrast, all conspicuity maps become similarly important when unsupervised domain adaptation is applied, which can be explained by the action of gradient reversal layers, which actively encourage features to become domain-independent and thus to be equally effective at multiple scales.
While the model performs well in several complex cases (e.g., in the presence of multiple objects, occlusions and appearance changes), there are certain conditions in which we find room for improvement. Most of them involve the presence of small objects and small motion, where the model fails to correctly locate areas of interest, and the prediction is dominated by the prior. These situations could benefit from working at higher resolution, given sufficient computing resources, or in a patch-based fashion, to the detriment of inference times. However, major failures seem to be related to the specific characteristics of datasets: Hollywood2 and UCF Sports, for instance, are annotated with task-driven gaze fixations, rather than with fixations collected while freely viewing the scene. This, of course, negatively affects methods that instead attempt to predict bottom-up saliency. Improved dataset availability and curation for video saliency prediction may be an enabling factor for advancement in the field.

Fig. 3: Statistics of the training sets of DHF1K, Hollywood2 and UCF Sports.

Fig. 5: Weights learned by the fusion layer when integrating the four conspicuity maps on the DHF1K dataset: (left block) full HD²S model, (middle block) HD²S model with domain adaptation, (right block) HD²S model with domain-specific learning. For the HD²S model with domain adaptation, we use LEDOV as the target dataset.

Fig. 6: Qualitative interpretation of the contribution of hierarchical decoding used under different settings. (Top line) HD²S, (middle line) HD²S with domain adaptation and (bottom line) HD²S with domain-specific learning.

Fig. 7: Qualitative evaluation of the proposed model on the DHF1K validation set. Comparison of the saliency predicted by our model with the ground truth on some frames: (left block) saliency with multiple objects, (upper-right block) saliency on an occluded object, (lower-right block) saliency on moving objects whose appearance changes rapidly among consecutive frames.

Fig. 8: Examples of failures. The model struggles with small objects and small motion: in the first two cases, the model misses the salient region and highlights a generic prior; in the third example, the model does not manage to identify the golf ball, focusing instead on a man in a red shirt, who stands out from the surroundings.
- DHF1K [55] consists of 1,000 high-quality videos with a large diversity of scenes, objects, types of motion and complexity of backgrounds. In total, it includes 582,605 frames annotated with fixation points from 17 observers during a free-viewing experiment. The dataset is split into 600/100/300 videos for training, validation and test sets, respectively. The test set is not released and the results are maintained by the dataset curators.
- UCF Sports [41] contains 150 videos taken from the UCF Sport Action Dataset [55]. Fixations are collected from 16 subjects while attempting to identify the action that occurred in the video. The dataset is split into 103 videos for training and the remaining 47 for test, for a total of around 6,500 frames for training and 3,000 frames for test. The length of the videos varies between 20 and 140 frames.

Table 2: Comparison of HD²S and its variants (with domain adaptation: HD²S_DA; with domain-specific learning: HD²S_DSL) with other state-of-the-art methods on the Hollywood2 and UCF Sports datasets. In bold the best results, in italic the second-best results.

Table 3: Comparison of various HD²S (without DA and DSL) configurations. The Consp-net4 configuration refers to the network in a simple encoder-decoder configuration, i.e., with Conspicuity-net 4 only. The full model includes all four conspicuity networks with multi-level loss on conspicuity and saliency maps, defined in Equation 5.

Table 4: Individual contribution of each conspicuity net. Each configuration refers to a simple encoder-decoder architecture with different sets of encoded features.

Table 5: Analysis of domain-shift capabilities. Performance evaluation in the domain generalization (supervised training on source; testing on target) and domain adaptation (supervised training on source; unsupervised training and test on target) scenarios. Upper-bound performance is measured by the transfer learning scenario (supervised training on source and fine-tuning on target).

Table 6: Analysis of generalization capabilities.

Table 7: Performance evaluation on the multi-source and domain-specific learning scenarios.

Table 8: Size (in MB) and processing time (in seconds) for the proposed model and state-of-the-art approaches. Best values in bold.