Unsupervised video object segmentation: an affinity and edge learning approach

This paper presents a new approach, called TMNet, to solve the unsupervised video object segmentation (UVOS) problem. UVOS remains challenging as prior methods suffer from issues like generalization errors on unseen test videos, over-reliance on optic flow, and difficulty capturing fine details at object boundaries. These issues make UVOS an ill-posed problem, particularly in the presence of multiple objects. Our focus is to constrain the problem and improve the segmentation results by fusing multiple available cues such as appearance and motion, as well as image and flow edges. To constrain the problem, instead of predicting the segmentation directly, we predict affinities between neighbouring pixels for being part of the same object and cluster those to obtain a category-agnostic segmentation. To further improve the segmentation, we fuse multiple sources of information through a novel Temporal Motion Attention (TMA) module that uses neural attention to learn powerful spatio-temporal features. In addition, we design an edge refinement module (using image and optic flow edges) to refine and improve the accuracy of object segmentation boundaries. The overall framework segments objects and finds accurate object boundaries without any heuristic post-processing, which enables the method to be applied to unseen videos. Experimental results on the challenging DAVIS16 and multi-object DAVIS17 datasets show that our proposed TMNet performs favourably compared to state-of-the-art methods without post-processing.


Introduction
Video object segmentation (VOS) is a common task in many video analysis and scene understanding applications such as video editing [1], autonomous driving [2], robotics, surveillance and tracking [3]. It involves pixel level segmentation of independent moving objects across frames in a given video sequence.
Traditionally, VOS (also called motion segmentation) is solved via fitting geometric models to matched key-points in adjacent frames [4]. More recently, deep learning-based methods have become prominent. Existing deep learning-based VOS methods can be classified as semi-supervised, interactive and unsupervised methods based on the amount of human involvement in the segmentation process. The semi-supervised methods (SVOS) require annotations of the objects of interest in the first frame. The interactive methods require user interactions, like scribbles, to guide and correct the segmentation. Recently introduced unsupervised VOS methods (UVOS) are expected to identify all moving objects in the scene, with no prior information about the number of objects and no manual annotation of the first frame.
Unsupervised video object segmentation is challenging since neither the segments nor the number of segments are known at the start. Other challenges include a dynamically varying number of objects, camouflaged object motions, occlusions, articulated non-rigid object motions, and background motion.
Prior works follow three main strategies to solve the UVOS problem: (1) using motion and appearance features together; (2) using semantic segmentation of objects in a video frame, followed by tracking of the detected objects using temporal information in subsequent video frames; and (3) using neural attention to leverage temporal information from optic flow to focus on obtaining better features that represent moving objects.
Methods using motion and appearance features together [5][6][7][8] generally rely on two-stream architectures, i.e., processing the image (appearance features) and the optic flow (motion features) independently. The performance of two-stream architectures is strongly dependent on the accuracy of the data correspondences provided by optic flow, which is often poor around object boundaries [9], non-textured regions, and fast-moving objects [10]. In such cases, motion features are not reliable enough to complement the appearance features for accurate segmentation. This leads to an over-reliance of these methods on the appearance features of the objects.
Object detection and tracking methods [11][12][13][14] use state-of-the-art object detection methods like Mask-RCNN [15] to detect foreground objects and track all detected objects using tracking algorithms. Tracking has its own challenges: it often suffers from drift, and it relies heavily on object re-identification to track missed or re-appearing objects [16]. These methods usually suffer from generalization errors when applied to larger test videos containing objects that do not appear in the training data.
To overcome the above challenges, a solution called Motion-Attentive Transition Network (MATNet) [17] introduced a neural attention mechanism similar to how humans perform motion segmentation. Its motion-attentive module uses optic flow to focus on moving objects and obtain better features. This method demonstrated improved performance in tackling the above-mentioned issues, but it has its own limits and room for further improvement. For instance, MATNet's primary focus on foreground/background segmentation limits its applicability in scenarios where multiple moving objects need to be segmented. In addition, the optic flow information employed by MATNet is drawn from only two consecutive frames, which provides limited temporal information. Finally, fine details of objects are not entirely captured by optic flow, leading to poor object boundaries.
To solve the UVOS problem, we propose a novel approach, called TMNet, that combines temporal, edge and affinity information. The integration of the various cues is performed in the sequence shown in Fig. 1 and produces segmentation affinities that can be clustered into segments that are significantly more accurate than those returned by the state of the art.
The design shown in Fig. 1 is developed to achieve several desirable outcomes with significantly enhanced accuracy compared to the state of the art. A major enabling factor is the inclusion of several consecutive optic flows as inputs to TMNet. Intuitively, employing more information should yield better accuracy. However, it is the careful design and implementation of the overall system and its details, namely the choice of the operations (blocks), the sequence in which they operate, and the procedures executed within each block, that together make the resulting accuracy enhancements feasible. The resulting implementation, presented in detail in the following sections, delivers the following advantages.
Firstly, multiple objects can now be segmented in video as affinities are employed instead of segmentation masks. This stems from assessing the relationship between neighboring pixels by assigning them the probability of belonging to either the same object (when pixels lie inside an object) or a different one (when pixels straddle object boundaries).
Fig. 1 Overview of TMNet. The temporal attention encoder block captures spatio-temporal information from the current video frame ( I t ) and τ consecutive optic flows (V t−1 , … , V t−τ ). The affinity learning decoder block predicts neighbour affinities using the features learnt by the encoder. The edge enhancement block refines the object segmentation boundaries by aligning and merging the image edge, temporal flow edge and affinity information using non-linear aggregation functions. The correlation clustering block converts the edge-enhanced affinity graph into the required segmentation output.
Secondly, the proposed design is capable of detecting and segmenting highly complex object motions with continuing and fluid changes in their appearance through time. This is enabled by encoding temporal attention from image and consecutive optic flow information. The idea is inspired by the observation that humans are attracted first to anything that moves before learning to map objects to semantic object classes [18].
Thirdly, TMNet is less prone to inaccuracies at moving object boundaries. This is achieved through the edge enhancement block, which aligns the outputs of the image edge network, the temporal flow edge network and the affinity learning decoder network through a series of simple conv blocks. The conv blocks act as a non-linear aggregation function that merges the image edge, flow edge and affinity information in order to refine the edges of the segmentation. Figure 2 presents instances of motion segmentation executed on the DAVIS17 dataset to demonstrate how the above three areas of improvement are achieved.
The main contributions of this paper include:
• A new Temporal Motion Attention (TMA) module within the Temporal Attention Encoder block to obtain powerful spatio-temporal features for segmenting moving objects.
• Prediction of neighbourhood affinities, instead of the segmentation directly, to obtain category-agnostic segmentation of multiple moving objects in unseen videos.
• An edge refining network, within the Edge Enhancement block, for combining the image and flow edges, and affinity information, to refine the object segmentation boundaries.
This paper is organized as follows: Sect. 2 introduces the related works. Section 3 presents the network architecture of TMNet in more detail, including its implementation and the loss function used. Ablation studies on the DAVIS16 dataset are presented in Sect. 4; they demonstrate the performance improvements achieved. Section 5 concludes the paper and discusses future work.

Definition and categorization
According to Gestalt's "common fate" principle [19], VOS is the grouping of pixels with the same motion. Torr [20] defines VOS as segmenting all objects that move relative to the background. This is ambiguous since in many applications, motion grouping may not always be equivalent to object grouping. For example, this may occur in applications where there are intermittent object motions (objects are static for a few frames in the sequence), articulated objects (only part of the object moves), or similarly moving objects (different objects with the same motion). Bideau et al. [21] analyzed various applications and identified challenging cases similar to the above, then proposed that VOS be defined as grouping moving objects in a way that (i) an entire object is segmented even if only part of it moves, (ii) a temporarily static object is segmented due to its recent movements; and (iii) similarly moving objects are segmented separately unless they are connected in 3D.
Depending on the amount of human involvement, a VOS solution can be classified as semi-supervised, interactive or unsupervised. Unlike semi-supervised methods that require first-frame ground-truth of the objects to be segmented, unsupervised or zero-shot VOS methods must identify all moving object(s) without any prior information. Note that the work reported in this paper addresses the challenging UVOS problem, which requires the model to have generalization capability.

UVOS based on multiple cues
It is well known that UVOS cannot perform accurately if only a single source of visual information is extracted from moving images. For instance, appearance-based methods would fail to segment texture-less objects and would segment static objects in the background. On the other hand, solutions that focus on motion and merely use optic flow information may over-segment non-rigid objects, and may fail to segment occluded or camouflaged objects and those with degenerate motions.
Inspired by the above observations, a number of UVOS solutions have been recently developed based on combining the appearance and motion (optic flow) information [22][23][24]. Due to inaccuracy of optic flow at object boundaries, Koh et al. [25] and Papazoglou and Ferrari [26] proposed to employ edge cues to obtain better performance at object boundaries. Differently, reinforcement learning has also been used to solve the UVOS task [27].
Recent advances in deep learning for object recognition [15,28] have enabled the use of temporal information to track object proposals and generate consistent segmentation for the entire video [11,13,29,30]. Zhao et al. [31] suggested performing UVOS for multiple objects by detecting and tracking objects using human-centric re-identification. In another solution, called unsupervised offline video object segmentation and tracking (UnOVOST) [11], tracklets are generated for object proposal masks and long-term consistent tracklets are merged to perform segmentation. A propose-reduce paradigm [30] extends UnOVOST [11] by improving the merging step using only key-frames selected by some heuristic rules. Object-based detection methods use Siamese re-identification networks for association, which can lead to failures with fast-moving objects, occlusions, and non-rigid motions [16].
In contrast to the above methods, our approach effectively combines all the cues discussed above to produce category-agnostic multi-object UVOS. Most methods perform only binary foreground/background segmentation, thereby limiting their applicability to scenes with only one moving object. In contrast, we perform multi-object UVOS by predicting affinities instead of predicting the segmentation directly. These are explained in Sect. 3.

Attention in neural networks
Inspired by human perception, detection of visual attention has been recently implemented using deep neural networks to improve the performance of machine vision solutions for various applications such as attention guided object segmentation [32], dynamic visual attention prediction [17], depth estimation [33], action recognition [34], and visual question answering [35]. Attention helps the network form effective feature representations from the data. The effectiveness is achieved by focusing only on the relevant informative regions of interest (avoiding unnecessary information). Neural attention has also been used for improving the performance of the UVOS solutions in [17,[36][37][38][39].
In order to avoid the use of computationally expensive optic flow for UVOS, AGNN [36] and AGS [38] first use an attention mechanism to capture the higher-order relationships in a message-passing graph neural network framework. Differently, COSNet [37] solved UVOS using co-attention between frames in a video sequence with a Siamese neural network to learn the global context. MATNet [17] introduced a motion-attentive two-stream interleaved encoder that learns powerful spatio-temporal features for UVOS through an attention mechanism that uses optic flow to focus only on the moving objects. FEM-Net [39] extended MATNet [17] by additionally using optic flow edge information in a flow edge connect module that helps in segmenting salient foreground objects and their boundaries accurately. Our method also extends the motion-based attention mechanism proposed by MATNet [17] to learn powerful spatio-temporal features, by additionally incorporating temporal context from several frames of the video into the attention mechanism. This helps to resolve ambiguities in the optic flow information of complex dynamic scenes computed from only two consecutive frames. Different from the previous methods, our method also predicts affinities instead of predicting the output segmentation directly. This enables our method to be used for category-agnostic multiple object segmentation.

Problem statement
Consider a video sequence represented by its frames, where each video frame I t denotes an RGB image with width W and height H in pixels. We denote the forward optic flow computed from those frames by V t . The objective of VOS is to generate the multi-object segmentation, where S t denotes the segmentation at frame t and is itself a set of K t binary object segmentation masks; each mask is a W × H matrix of object labels in the label space 𝕃 t . Note that both the number of moving objects K t and the space of possible object labels (discovered so far) 𝕃 t depend on the frame number t, as they can dynamically vary in the video sequence due to new objects entering or existing objects leaving the scene.
Notations: The notations used throughout this paper are as follows. ℝ denotes the space of real numbers. W and H represent the width and height of the image, respectively. I t denotes the colour image at time t. V t denotes the optic flow at time t. τ is the hyperparameter that denotes the number of consecutive optic flows used. F i a,t and F i m,t denote the appearance and temporal features extracted at the i-th residual stage at time t, respectively. F t denotes the temporal motion attentive feature obtained from the encoder. A t denotes the segmentation affinity matrix obtained from the decoder. I e and V e represent the image edge and flow edge maps, respectively. Â denotes the refined affinity matrix after edge enhancement. Finally, S t denotes the segmentation output at time t.
Our proposed solution to this problem, TMNet, is an end-to-end deep neural network with its overall architecture presented in Fig. 1. At its core, TMNet predicts affinities by learning powerful spatio-temporal features (extracted from the video and consecutive optic flows) through neural attention, and enhances them around the edges. The internals of the network are described in the following subsections.

Temporal attention encoder
This encoder uses a motion-attentive two-stream interleaved architecture to learn robust feature representations for the moving object(s). We achieve a degree of robustness through our novel temporal motion attention (TMA) block, described in Sect. 3.3, which uses a neural attention mechanism to focus only on the moving object(s) (Fig. 3).
For the appearance stream, we use the image I t ∈ ℝ W×H×3 , and for the motion stream, we use the τ consecutive optic flows V̂ t = (V t−1 , … , V t−τ ) (1). However, to enable the integration of additional temporal information from the previous frames, several innovations are in place.
Firstly, we use the initial five convolutional blocks of a standard ResNet backbone to extract the appearance features F i a,t from the image I t , and the temporal features (F i m1,t , … , F i mτ,t ) from the optic flows V̂ t at the residual stages ( i = 2, … , 5 ) with different spatial resolutions (1/2, 1/4, 1/8 and 1/16 of the original image size). Note that the spatial resolution of the i-th stage is 1∕2 (i−1) of the original image size.
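The multi-resolution extraction above can be sketched as follows. This is a minimal toy stand-in (not the actual ResNet backbone): a stack of stride-2 conv stages, each halving the spatial size, so stage i yields features at 1/2^(i−1) of the input resolution. All channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiStageEncoder(nn.Module):
    """Toy stand-in for the ResNet backbone: emits features at
    1/2, 1/4, 1/8 and 1/16 of the input resolution, mirroring the
    residual stages i = 2..5 (resolution 1/2**(i-1))."""
    def __init__(self, in_ch=3, width=8):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for _ in range(4):  # four residual stages
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, width, 3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            ch = width

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)  # each stage halves the spatial resolution
            feats.append(x)
        return feats

# The appearance stream consumes the RGB frame I_t; each of the tau
# motion streams would consume one optic flow field V_{t-k}.
encoder = MultiStageEncoder()
feats = encoder(torch.randn(1, 3, 64, 64))
print([f.shape[-1] for f in feats])  # spatial sizes 32, 16, 8, 4
```

In the actual network the appearance and the τ motion streams run such a backbone in parallel, and the per-stage features are passed to the TMA blocks.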
Next, we use the temporal features to update the appearance features at all intermediate stages, and obtain the enhanced appearance features F i a,t as the output features that are employed for affinity learning and prediction. This enhancement is performed by a newly designed TMA block at each stage. Figure 4 shows a schematic diagram demonstrating how the appearance and temporal features are combined in the TMA block. Note that in the diagram, signals appear inside boxes, and operations (such as convolutions, fully connected layers and multiplications) appear outside the boxes and on the arrows. In addition, for the sake of simplicity, we have dropped the frame time t and the stage index i from the signal notation.

The TMA block
We add a temporal aggregation block within our TMA block to extend the Motion-Attentive Transition block developed in MATNet [17], so that temporal information from previous frames can be accommodated. Firstly, a soft attention weights each of the feature maps ( F = F a , {F m1 , … , F mτ } ) at every pixel location. This is performed by a 1×1 conv function that learns the probability that a particular region of the feature map is important. The probability is then normalized using a softmax function to obtain the normalized importance weights I ∈ ℝ W×H . The feature maps in F are each converted separately into the spatial attentive features Z using a channel-wise Hadamard product. Next, we find the correlation between the spatial attentive appearance features Z a and each of the spatial attentive motion features Z mi . This correlation S i , or non-linear affinity, is learnt to capture the relationship between the two feature spaces. The affinity is high (both appearance and temporal motion features are similar) in regions of the image where moving object(s) are present. Here, P i and Q are trainable weights learnt during training to compress the size of the model and avoid overfitting; the matrix Q, which is learnt to compress the appearance features, is shared in the calculation of all S i . In the third step, we normalize the affinity matrix S i along the row dimension to ensure that the sum of the contributions of all channels is 1. Finally, we aggregate the normalized affinities S r i over all optic flows to obtain the temporal motion attention factor S.
The aggregation function can be implemented as a simple maximum, minimum, average or median. Experiments in Sect. 4 indicate that averaging performs better than the other functions. Finally, the enhanced appearance features F a ∈ ℝ W×H×Ca are obtained from the temporal motion attention factor S. The enhanced appearance features F a and the motion features F m are concatenated to form the combined features. The combined features of all four residual stages are scaled to the size of the image and combined to obtain the final robust feature representation F t . Finally, the features F t are fed to the affinity learning decoder stage.
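The TMA steps above can be sketched in NumPy. This is a schematic illustration under stated assumptions: the learned 1×1 convolutions are replaced by random projection vectors, all sizes are toy values, and the precise factorization of the affinity through P i and Q is reconstructed from the text rather than copied from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(F, w):
    """Soft attention: a 1x1-conv analogue scores each location; a
    softmax over all W*H locations yields importance weights I,
    applied as a channel-wise Hadamard product."""
    H, W, C = F.shape
    logits = F.reshape(-1, C) @ w            # (H*W,) location scores
    I = softmax(logits).reshape(H, W, 1)     # normalized importance
    return F * I

def tma(F_a, flows_F_m, Q, Ps):
    """TMA sketch: per-flow affinities S_i between attentive appearance
    and motion features are row-normalized, averaged into S, and S
    re-weights the appearance features."""
    H, W, C = F_a.shape
    Z_a = spatial_attention(F_a, rng.normal(size=C)).reshape(-1, C)
    S_list = []
    for F_m, P in zip(flows_F_m, Ps):
        Z_m = spatial_attention(F_m, rng.normal(size=F_m.shape[-1]))
        Z_m = Z_m.reshape(-1, F_m.shape[-1])
        S_i = (Z_a @ Q) @ (Z_m @ P).T        # (H*W, H*W) affinity
        S_list.append(softmax(S_i, axis=1))  # row-normalized S_i^r
    S = np.mean(S_list, axis=0)              # aggregation (average)
    return (S @ Z_a).reshape(H, W, C)        # enhanced appearance feats

H, W, Ca, Cm, d, tau = 4, 4, 6, 6, 3, 3
F_a = rng.normal(size=(H, W, Ca))
flows = [rng.normal(size=(H, W, Cm)) for _ in range(tau)]
Q = rng.normal(size=(Ca, d))
Ps = [rng.normal(size=(Cm, d)) for _ in range(tau)]
out = tma(F_a, flows, Q, Ps)
print(out.shape)  # (4, 4, 6)
```

The shared compression Q and per-flow matrices P i keep S i low-rank, which is the model-size argument made in the text.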

Affinity learning decoder
The affinity learning decoder is designed to take the temporal motion attentive features F t ∈ ℝ W×H×C and predict the affinities A t ∈ ℝ W×H×4 . The segmentation affinity is defined for every pixel u ∈ I t and one of its neighbouring pixels v ∈ N(u) . The affinity describes the probability that the selected pixel and its neighbour belong to the same motion, and its value varies from 0 to 1. The affinity is 1 if the label of the pixel matches that of its neighbour (high affinity for edges between pixels within the same segment), and 0 if the labels do not match (low affinity for edges between pixels that belong to different segments). This representation is permutation-invariant and has a fixed size, making it easy to use for training purposes. Here we face a trade-off between accuracy and memory/time: increasing the neighbourhood size returns more accurate results at the expense of a larger memory footprint for the segmentation affinity, leading to longer prediction times. Hence, we restrict our model to predicting affinities for only the four immediate neighbouring pixels.
The correlation module is used for learning the segmentation affinities from the temporal motion attentive features. It calculates the correlations between the spatio-temporal features F t of the image I t and the warped features F shifted t of the same image. The warping is performed by shifting the features to the four neighbouring nodes (left, right, top and bottom). For calculating the correlations, we utilize the simple cosine similarity function instead of learning from a cost volume constructed from the features (since constructing the full cost volume makes the model large in terms of memory). Figure 5 shows the internal design of the affinity learning decoder.
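The correlation step can be sketched as follows. This is a minimal NumPy illustration, not the trained decoder: edge padding for out-of-image neighbours and the rescaling of the cosine from [−1, 1] to [0, 1] are assumptions, as the paper does not spell out these details.

```python
import numpy as np

def shift(F, dy, dx):
    """Shift an (H, W, C) feature map by (dy, dx), edge-padding
    out-of-range pixels (an assumption)."""
    H, W, _ = F.shape
    ys = np.clip(np.arange(H) + dy, 0, H - 1)
    xs = np.clip(np.arange(W) + dx, 0, W - 1)
    return F[ys][:, xs]

def affinities(F, eps=1e-8):
    """Correlation-module sketch: cosine similarity between each
    pixel's feature and its 4 immediate neighbours (left, right,
    top, bottom) gives an (H, W, 4) affinity map."""
    Fn = F / (np.linalg.norm(F, axis=-1, keepdims=True) + eps)
    dirs = [(0, -1), (0, 1), (-1, 0), (1, 0)]
    A = np.stack([(Fn * shift(Fn, dy, dx)).sum(-1) for dy, dx in dirs],
                 axis=-1)
    return 0.5 * (A + 1.0)  # map cosine from [-1, 1] to [0, 1]

F = np.random.default_rng(1).normal(size=(5, 6, 8))
A = affinities(F)
print(A.shape)  # (5, 6, 4)
```

Because only four shifted copies of the feature map are compared, the cost stays linear in the number of pixels, unlike a full cost volume.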

Edge enhancement network
The edge enhancement network is designed to refine the object boundaries by updating the predicted affinities A t obtained from the decoder. In order to capture fine details and improve the accuracy of the object segmentation boundaries, the edge enhancement network aligns and merges three complementary sources of information: (1) the image edge map I e ∈ ℝ W×H of the image I t , (2) the flow edge map V e ∈ ℝ W×H of the optic flow V t−1 , and (3) the predicted affinities A t ∈ ℝ W×H×4 obtained from the decoder, using a non-linear aggregation function REFINE(). Figure 6 shows the motivation for using both the image edge I e and the flow edge V e information jointly in the edge refinement module of our TMNet model. The first row shows the failure of the optic flow edges V e to detect the swan (inaccurate boundaries, background noise due to moving water). As our proposed method uses additional image edge cues, it overcomes this issue and segments the swan correctly. The second row shows the failure of the image edges I e to segment the moving car (the object has no texture). Again, our proposed method detects the car correctly using the flow edge information. These two cases highlight the fact that image and flow edges act as complementary information, jointly aiding the segmentation and refining the boundaries. The third row also shows an improvement due to the complementary information even though both flow and image edges are accurate. The non-linear aggregation function REFINE() is implemented by a series of conv blocks on concatenated features. Firstly, we concatenate the three sources of input information, the image edge I e ∈ ℝ W×H , the flow edge V e ∈ ℝ W×H , and the predicted affinities A t ∈ ℝ W×H×4 , to form the concatenated edge features R t . Secondly, the concatenated edge features R t form the input to the convolutional modules. The input R t has a spatial dimension of W × H and six feature channels.
We use three conv blocks with output channel dimensions of 32, 32 and 4. We use a stride of one to maintain the aligned spatial dimension of W × H . The spatial dimensions of the input do not change, as we do not use any pooling layers. The convolution blocks perform non-linear mapping and aggregation in a higher-dimensional space to capture the fine details of the boundaries. Figure 7 depicts how the above-mentioned operations take place inside the edge enhancement block. The refined affinity Â output at the end of the block is clustered to obtain the required boundary-aware segmentation.
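The REFINE() stack above can be sketched in PyTorch. The 6-to-32-to-32-to-4 channel progression, stride 1 and absence of pooling follow the text; the 3×3 kernels and the ReLU/sigmoid activations are assumptions.

```python
import torch
import torch.nn as nn

# REFINE() sketch: the 6-channel concatenation of image edge (1),
# flow edge (1) and predicted affinities (4) passes through three
# stride-1 conv blocks with 32, 32 and 4 output channels; with no
# pooling, the W x H spatial size is preserved throughout.
refine = nn.Sequential(
    nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 4, kernel_size=3, padding=1), nn.Sigmoid())

B, H, W = 1, 48, 64
I_e = torch.rand(B, 1, H, W)   # image edge map
V_e = torch.rand(B, 1, H, W)   # flow edge map
A_t = torch.rand(B, 4, H, W)   # decoder affinities
R_t = torch.cat([I_e, V_e, A_t], dim=1)  # concatenated edge features
A_hat = refine(R_t)            # refined affinities, (B, 4, H, W)
print(A_hat.shape)
```

The sigmoid keeps the refined affinities in [0, 1], matching the probabilistic interpretation used by the clustering step.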

Predicted affinity to video object segmentation
The refined affinities are finally clustered to obtain the required segmentation. Our network is not limited to the specific set of object classes present in the training data, since we predict affinities. However, to obtain the required segmentation S from the affinity Â predicted by our TMNet model, we need to perform a clustering step. Unlike other clustering methods, correlation clustering finds the optimal number of clusters automatically. So we apply correlation clustering on a pixel grid graph that uses the predicted affinities.
Firstly, we create a pixel grid graph G = (V, E, W) for the image I t from the predicted affinities Â as follows:
• Nodes V: a set of N = W × H vertices, one for every pixel in the image I t , where W and H denote the width and height of I t .
• Edges E: a set of edges e uv ∈ ℝ N×4 connecting the four neighbouring nodes v (left, right, top and bottom) of every node u, forming the pixel grid.
• Weights W: the affinities Â ∈ ℝ N×4 predicted by our model are used as weights w uv for every edge defined in the graph; w uv is the cost associated with assigning the two nodes u and v of the edge e uv to distinct components.
Next, the segmentation is performed by solving the optimization problem defined on the pixel grid graph created using the predicted affinities as edge weights. The correlation clustering, or graph multicut, optimization problem is solved using the method described in [40]. The output is a unique decomposition of the graph G, which assigns 0/1 labels to all the edges; edges labelled 1 straddle distinct clusters. Finally, once the edges straddling distinct clusters are identified, the clusters can be separated to obtain the output segmentation S t as the set of binary object segmentation masks M 1 , M 2 , … , M k ∈ ℝ W×H . Our method automatically determines the number of independently moving objects k, which can vary dynamically in the video sequence.
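The graph construction and decomposition can be sketched as follows. Note the hedge: instead of the correlation-clustering solver of [40], this toy version simply cuts edges with affinity below a threshold and takes connected components, which is only a simplified stand-in and does not compute the multicut optimum.

```python
import numpy as np
from collections import deque

def segment_from_affinities(A_hat, thresh=0.5):
    """Build the 4-neighbour pixel grid graph weighted by the refined
    affinities A_hat (H x W x 4) and decompose it. Stand-in for the
    multicut solver of [40]: edges with affinity below `thresh` are
    cut and connected components become the segments."""
    H, W, _ = A_hat.shape
    dirs = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # left, right, top, bottom
    labels = -np.ones((H, W), dtype=int)
    k = 0
    for sy in range(H):
        for sx in range(W):
            if labels[sy, sx] != -1:
                continue
            labels[sy, sx] = k
            q = deque([(sy, sx)])
            while q:                     # flood fill one component
                y, x = q.popleft()
                for c, (dy, dx) in enumerate(dirs):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < H and 0 <= nx < W
                            and labels[ny, nx] == -1
                            and A_hat[y, x, c] >= thresh):
                        labels[ny, nx] = k
                        q.append((ny, nx))
            k += 1
    return labels, k

# Two blobs separated by a low-affinity boundary between columns 2 and 3.
A = np.ones((4, 6, 4))
A[:, 2, 1] = 0.0  # right-neighbour affinity cut at column 2
A[:, 3, 0] = 0.0  # matching left-neighbour affinity at column 3
labels, k = segment_from_affinities(A)
print(k)  # 2 segments
```

Each resulting component would then be emitted as one binary mask M_i, and the number of segments k falls out of the decomposition rather than being fixed in advance.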

Loss function
Since our network predicts affinities instead of the segmentation directly, we use a loss function based on the predicted affinities. For the training, we formulate our loss between the predicted affinities and the ground-truth affinity (instead of the usually used binary cross entropy loss between the predicted and the ground-truth segmentation).
Firstly, in order to define a loss term, we need to convert the labelled ground-truth segmentation into ground-truth affinities. The ground-truth affinity is 1 if the label of a pixel in the image and that of its neighbour are the same. Consider an image I t ∈ ℝ W×H×3 and its ground-truth segmentation L t ∈ ℝ W×H . The ground-truth affinity matrix A ∈ ℝ W×H×M is then defined for each pixel u ∈ I t and one of its neighbouring pixels v ∈ N(u) as
A(u, v) = 1 if L t (u) = L t (v), and A(u, v) = 0 otherwise,
where W and H are the width and height of the image, and M is the number of pixels v in the neighbourhood of a pixel u.
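The label-to-affinity conversion can be written directly. A small NumPy sketch, with M = 4 immediate neighbours as used in the paper; treating out-of-image neighbours as same-label is an assumption.

```python
import numpy as np

def ground_truth_affinity(L):
    """Convert a ground-truth label map L (H x W) into the 4-neighbour
    affinity tensor A (H x W x 4): A[u, v] = 1 if the pixel and its
    neighbour carry the same label, 0 otherwise. Out-of-image
    neighbours are treated as same-label (an assumption)."""
    H, W = L.shape
    A = np.ones((H, W, 4), dtype=np.float32)
    A[:, 1:, 0] = (L[:, 1:] == L[:, :-1])   # left neighbour
    A[:, :-1, 1] = (L[:, :-1] == L[:, 1:])  # right neighbour
    A[1:, :, 2] = (L[1:] == L[:-1])         # top neighbour
    A[:-1, :, 3] = (L[:-1] == L[1:])        # bottom neighbour
    return A

L = np.array([[0, 0, 1],
              [0, 0, 1]])
A = ground_truth_affinity(L)
print(A[0, 1, 1], A[0, 2, 0])  # 0.0 0.0 across the 0/1 boundary
```

Only entries straddling the 0/1 label boundary are 0; all within-segment entries are 1, which illustrates the class imbalance discussed next.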
For the loss function, we use the mean square error (MSE) between the ground-truth affinities A defined previously and the affinities Â predicted by our TMNet model. We split the loss into two parts to overcome the imbalance in the ground-truth affinity, as the number of 0's (edges, where the labels of the compared pixels do not match) is far smaller than the number of 1's (non-edges, where the labels match) in A. The first term is a plain mean square error loss over the 0's (edges); the second term is a weighted mean square error loss over the 1's (non-edges):
L = L e + L ne , with L e = (1/N e ) Σ A(u,v)=0 (Â(u, v) − A(u, v)) 2 and L ne = (1/N ne ) Σ A(u,v)=1 w(u) (Â(u, v) − A(u, v)) 2 . (11)
The weights w ∈ ℝ W×H are the edge weights generated using the normalized gradient magnitude of the image. We incorporate this use of geometry to control the importance of non-edge pixels during training. This loss penalizes edges in the image that are non-edges in A (it penalizes static object boundaries that do not appear in the motion boundaries).
The normalization terms N e and N ne count the number of edges and the weighted sum of the non-edges in A, respectively.
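The split loss can be sketched as follows. A minimal NumPy version, with the two-term structure and the per-class normalization reconstructed from the text; the exact weighting scheme of the paper may differ.

```python
import numpy as np

def affinity_loss(A_hat, A, w):
    """Class-balanced MSE sketch: plain MSE over edge entries (A = 0)
    plus an image-gradient-weighted MSE over non-edge entries (A = 1),
    each normalized by its own count N_e / weighted count N_ne."""
    edge = (A == 0)
    nonedge = (A == 1)
    w4 = np.broadcast_to(w[..., None], A.shape)  # per-pixel edge weights
    N_e = max(edge.sum(), 1)
    N_ne = max((w4 * nonedge).sum(), 1e-8)
    L_e = (((A_hat - A) ** 2) * edge).sum() / N_e
    L_ne = (w4 * (A_hat - A) ** 2 * nonedge).sum() / N_ne
    return L_e + L_ne

rng = np.random.default_rng(2)
A = (rng.random((8, 8, 4)) > 0.1).astype(float)  # mostly 1s, as in practice
A_hat = np.clip(A + 0.1 * rng.normal(size=A.shape), 0, 1)
w = rng.random((8, 8))
print(affinity_loss(A_hat, A, w) >= 0)  # True
```

Normalizing each term by its own population keeps the rare edge entries from being drowned out by the abundant non-edges.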

Implementation details
Our TMNet model is end-to-end trainable to predict affinities that are clustered to obtain the required segmentation.
Training: For pre-processing, the images are scaled to 384 × 512 × 3 . We also augment the training data to prevent over-fitting. We use the open-source Flownet2 [41] for optic flow estimation, and RCF [42] for obtaining the image and flow edge maps. For fair comparison, we adopt the same method for generating optic flow and edge maps in all of our experiments. We train the model only on the 30 training-set video sequences of the DAVIS17 dataset [43], without the use of any additional training data. The model is trained from scratch, using the loss term and affinity ground-truth explained in Eq. (11), in a supervised manner with random initial weights. We train the network with a batch size of 2. We use a learning rate of 10 −4 for pre-training and 10 −5 for fine-tuning as the training schedule, with the ADAM optimizer. The number of previous frames τ used for extracting the temporal attention is chosen to be 3. The number of neighbours M for every pixel used to calculate the affinity is chosen to be the 4 immediate neighbours. We choose these values to be as low as possible, since increasing both τ and M results in increased accuracy in the output segmentation at the cost of increased model capacity and run-time.
We have used traditional neural networks with iterative training mechanisms for updating the weights of the proposed model. The use of non-iterative training mechanisms could optimize the time-consuming training process and lead to faster convergence [44]. For instance, Cao et al. [45] proposed a randomized neural network called the Bidirectional Stochastic Configuration Network (BSCN) that can perform effective training for regression problems even in the absence of a GPU. It would be interesting to study the improvement in the efficiency of our TMNet using BSCN, as we also deal with a similar affinity regression problem.
Testing: For testing, we apply our trained TMNet model to the unseen videos. We use the current image and the optic flow of the previous frames to produce the output segmentation. Note that for obtaining the segmentation of the first frames, we create copies of the initial optic flow.
Run-time: We implement our TMNet method in PyTorch on an NVIDIA Titan X GPU with 12 GB memory for both training and testing. For our trained TMNet model, pre-processing steps such as optic flow estimation and edge extraction take around 0.07 s/frame and 0.05 s/frame, respectively, and TMNet affinity prediction takes 0.28 s/frame. Additionally, the clustering and tracking take 2.17 s/frame. The clustering process can be sped up using the algorithm in [40]. To verify this speed-up, we selected three random sequences from the DAVIS16 dataset and applied our model with the algorithm in [40] for efficient clustering: the average time of the clustering step reduced from 2.17 s/frame to 0.33 s/frame, with a slight reduction in accuracy (around 3%). There is often a trade-off between accuracy and computational speed as the amount of additional data used is increased; in this case, the trade-off can be controlled by changing the number of neighbours used in the calculation of the affinity matrix.
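Summing the per-frame timings quoted above gives the end-to-end inference time with the original and the sped-up clustering step (values taken directly from the text):

```python
# Per-frame timings in seconds, as reported above.
STAGES = {"optic_flow": 0.07, "edges": 0.05, "affinity": 0.28}

total_slow = sum(STAGES.values()) + 2.17  # with the original clustering step
total_fast = sum(STAGES.values()) + 0.33  # with the clustering from [40]
```

So the efficient clustering brings the overall pipeline from roughly 2.57 s/frame down to roughly 0.73 s/frame, with clustering remaining the dominant cost in the original setting.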

Dataset and evaluation metrics
We report results on two widely used benchmarks: single object DAVIS16 dataset [46] and multi-object DAVIS17 dataset [43]. The datasets contain many challenging video sequences with multiple objects, occlusion, fast moving objects, background clutter, articulated motion, etc.
DAVIS16 [46] contains 50 HD video sequences with 3455 manual instance segmentation ground-truths. DAVIS17 [43] is a more challenging benchmark extending DAVIS16 [46] to multiple moving objects; it contains 120 HD video sequences (60 for train, 30 for val, 30 for test-dev) and 10K manual instance segmentation ground-truths. The task is more challenging due to the inclusion of multiple objects, which additionally creates occlusions, background clutter, etc.
We use the following performance measures, described in the DAVIS challenge [43], to evaluate our method: the mean of a metric is its average value measured across all objects in all video sequences, and the recall is the fraction of sequences scoring higher than a threshold of 0.5.
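The mean and recall described above can be sketched as a small helper (an illustrative implementation of the protocol, not the official DAVIS evaluation code):

```python
def mean_and_recall(scores, threshold=0.5):
    """Mean of a metric over all objects/sequences, and recall: the fraction
    of scores exceeding the threshold, per the DAVIS evaluation protocol."""
    mean = sum(scores) / len(scores)
    recall = sum(s > threshold for s in scores) / len(scores)
    return mean, recall
```

For instance, per-sequence J scores of 0.8, 0.6 and 0.4 give a mean of 0.6 and a recall of 2/3, since two of the three scores exceed 0.5.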

Ablation study
To examine the effectiveness of the temporal neural attention and edge refinement components of our TMNet model individually, we performed an ablation study of our model on the DAVIS16 dataset [46]. The decrease in performance due to the removal of a specific key component of our method is calculated as Δm = m_full − m_partial, where m is the metric for which the decrease in performance is calculated (J&F, Jmean, Jrecall, Fmean or Frecall), Δm is the performance loss, and m_partial and m_full are the metric values without and with the specific component whose effectiveness is studied. Table 1 shows the results of our key component analysis.
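One plausible reading of this performance-drop computation (assumed here to be the simple difference m_full − m_partial, in percentage points; the metric values below are made up for illustration, not Table 1's):

```python
def perf_drop(m_full, m_partial):
    """Delta-m: the loss in a metric (in percentage points) when a component
    is ablated, assuming the simple-difference form m_full - m_partial."""
    return m_full - m_partial

# Illustrative: a full-model score of 70.0 against an ablated 64.99 would
# correspond to a 5.01-point contribution from the removed component.
```

Removing a helpful component gives a positive Δm, while Δm = 0 means the component contributed nothing on that metric.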
The first row in Table 1 shows the loss in performance of our model without the temporal attention module, compared to the full model performance reported in the third row. There is a significant improvement of 5.01% in the global J&F metric due to the inclusion of the temporal attention module. This shows that the addition of tracking information helps the network's performance by resolving ambiguities in the appearance of objects in challenging scenarios (unseen objects, a dynamically varying number of objects, occlusions, non-rigid motions, and noisy backgrounds). Similarly, the second row in Table 1 shows the loss in performance of our model without the edge refinement module; its inclusion yields a 2.38% improvement. The edge refinement module also improves the boundary metric Fmean by a large margin (3.52%), a gain attributed to its refinement of the segmentation at object boundaries.

DAVIS16
We evaluated our method for UVOS on the single-object DAVIS16 dataset [46]. Table 2 demonstrates that our method performs favourably compared to the state-of-the-art methods: it outperforms them in one metric (Jrecall) and achieves the third-best performance in both boundary metrics (Fmean and Frecall), as highlighted in Table 2. The overall performance indicates the robustness of the appearance features and is attributed to the temporal attention module; the strong performance in the boundary metrics compared to the other methods is attributed to the edge refinement module that refines object boundaries. Figure 8 shows qualitative results of our method. The results for the sequence 'camel' show the capability of our model to handle non-rigid articulated motions. In the sequence 'blackswan', our method robustly segments the object against a noisy background. Beyond the quantitative results, our method has a further advantage: unlike most existing methods, which perform binary foreground/background segmentation, our method is able to perform segmentation and tracking for multiple moving objects, as explained in the next section.

DAVIS17
To show that our method performs accurate segmentation for sequences with multiple moving objects, we applied our method to DAVIS17 dataset [43] and compared our results with existing methods. The performance of the proposed method is compared with several related state-of-the-art approaches (we have selected the top methods that do not use additional training data) including: (1) RVOS [23], (2) STEM-seg [51], (3) MATnet [17], (4) UnOVOST [11]. Table 3 compares the methods using the performance metrics described in the previous section.
The best results are obtained by UnOVOST [11]. This method is computationally expensive, as it uses Mask-RCNN [15] for object proposal generation. It also has many heuristic post-processing steps requiring the tuning of multiple hyper-parameters (for converting instance object mask proposals to short-term tracklets, and for merging short-term tracklets into long-term object trajectories), therefore limiting its use to specific datasets. Similarly, MATnet [17] uses a CRF-based, dataset-specific heuristic post-processing step to convert foreground/background saliency maps into multi-object segmentations. In contrast, our method uses no such post-processing and shows comparable accuracy to other similar methods that perform multi-object segmentation directly, without requiring any heuristic post-processing operations. Overall, our method has the following advantages compared to the Mask-RCNN-based object detection and tracking methods [11,23,51]: (1) Our method performs category-agnostic segmentation (segmentation of objects not seen in the training data) without using any prior knowledge of known object proposals from pre-trained ImageNet models, as we only use affinities to perform UVOS. (2) Our method has none of the heuristic post-processing steps required in Mask-RCNN-based methods, such as the multiple object proposal generation step, the short-term tracklet generation step, and the association step for long-term tracklet generation and tracking, as we use correlation clustering on the learnt affinities to perform UVOS. Hence our model can segment all probable moving object(s) and is capable of generalizing well to unseen object classes. Figure 9 shows qualitative results for performing UVOS on multi-object sequences of the DAVIS17 dataset [43]. The sequence 'pigs' demonstrates that the model can handle multiple similar-looking moving objects.
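A minimal sketch of the affinity-based alternative to proposal pipelines (this is a simplified stand-in, not the paper's exact correlation clustering): threshold the predicted 4-neighbour affinities and take connected components, yielding a category-agnostic, multi-object labelling in one pass with no proposal or tracklet-merging heuristics.

```python
def segment(aff_right, aff_down, h, w, thr=0.5):
    """aff_right[y][x] / aff_down[y][x]: predicted affinity between pixel
    (y, x) and its right / lower neighbour. Returns one label per pixel."""
    labels = [[-1] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != -1:
                continue
            labels[sy][sx] = next_label
            stack = [(sy, sx)]
            while stack:  # flood-fill across high-affinity edges
                y, x = stack.pop()
                nbrs = []
                if x + 1 < w and aff_right[y][x] > thr:
                    nbrs.append((y, x + 1))
                if x - 1 >= 0 and aff_right[y][x - 1] > thr:
                    nbrs.append((y, x - 1))
                if y + 1 < h and aff_down[y][x] > thr:
                    nbrs.append((y + 1, x))
                if y - 1 >= 0 and aff_down[y - 1][x] > thr:
                    nbrs.append((y - 1, x))
                for ny, nx in nbrs:
                    if labels[ny][nx] == -1:
                        labels[ny][nx] = next_label
                        stack.append((ny, nx))
            next_label += 1
    return labels
```

Each low-affinity edge acts as a boundary, so distinct moving objects receive distinct labels without any class prior.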

FBMS
To show that our method is capable of generalizing well to unseen test videos, we train our model using the above-mentioned DAVIS17 dataset [43] but test it on the FBMS dataset [53], which is not used during training. The FBMS dataset is composed of 59 sequences with sparse annotations and contains challenging multiple-moving-object sequences. The performance of the proposed method on the test sequences of the FBMS dataset is compared with several related state-of-the-art approaches in Table 4. The results demonstrate that our method produces the second-best performance in the region similarity metric, as highlighted in Table 4, even though our method does not use the FBMS dataset for training. The sparseness of the ground-truth annotations does not affect our method, unlike other methods (such as LVO [24] and ARP [25]). Our method also produces multiple-object segmentations, different from other saliency-based methods that produce binary segmentation masks (such as PDB [29] and COSNet [37]). Figure 10 shows qualitative results for performing UVOS on challenging multiple-object sequences of the FBMS dataset. The sequence 'goats01' demonstrates that the model can handle multiple moving objects under challenging scenarios like occlusions and similar-looking objects. The results for the sequence 'horses02' show the capability of our model to robustly segment very small objects that are far away from the camera amongst a noisy background. This robustness is attributed to the temporal attention module, which tries to segment anything that moves irrespective of the object class. This experiment also shows that our model does not overfit to the DAVIS16 and DAVIS17 datasets.

Conclusions
In this paper, we propose a new method (TMNet) to solve UVOS. Our model combines appearance, motion and edge cues. The motion cues from consecutive frames of video sequences help to find temporal connections and guide our model to learn powerful object representations, as they resolve ambiguities in appearance features through neural attention towards the moving object(s). The edge cues help to correct errors at object boundaries, where motion cues are inaccurate. Different from previous neural-attention UVOS methods, our method predicts affinities instead of binary segmentation masks, making it capable of handling multiple moving objects in one forward pass. The model is efficiently optimized by a loss function on the predicted affinities using geometric constraints. Our experiments on two popular benchmarks (DAVIS16 and DAVIS17) demonstrate that TMNet effectively handles unseen object categories, multiple moving objects, occlusions, articulated object motions, and cluttered backgrounds. Extensive experiments on these datasets also show that the improvement in performance is due to the addition of the temporal neural attention and edge refinement modules.
Our method fails when objects in the video temporarily stop moving for multiple frames. This failure occurs because we process only a selected number of frames at a time. We therefore plan to extend the work by storing important features of seen objects in a memory, which will allow the model to merge tracklets accurately once the objects are re-identified later. Another area for improvement is to make use of 3D scene flow, incorporating the additional depth-change information (not available in 2D optic flow) to aid the segmentation.