WaterNet: An adaptive matching pipeline for segmenting water with volatile appearance

We develop a novel network to segment water with significant appearance variation in videos. Unlike existing state-of-the-art video segmentation approaches that use a pre-trained feature recognition network and several previous frames to guide segmentation, we accommodate the object’s appearance variation by considering features observed from the current frame. When dealing with segmentation of objects such as water, whose appearance is non-uniform and changing dynamically, our pipeline can produce more reliable and accurate segmentation results than existing algorithms.

However, the performance of VOS algorithms often decreases significantly when objects in the video change in appearance due to illumination changes, motion, or deformation. For example, water often has a volatile appearance: its color and texture can vary between consecutive frames due to specular reflections, ripples, waves, turbulence, sediment concentration, etc. Such rapidly changing appearance often leads to poor water segmentation in videos.
Water is not the only case: appearance variations are common in practice. Examples include buildings with glass windows, along with cars or other objects with shiny paint or reflective surfaces. In this work we focus on segmenting water from videos, as it is a typical and representative object with dynamically changing appearance. In particular we consider water present as lakes, canals, rivers, floods, and so on.
In the semi-supervised VOS task, an annotated segmentation of the first frame is provided as part of the input. Most recent VOS techniques apply image semantic segmentation modules (e.g., fully convolutional networks (FCNs) [1]) to learn the appearance of the object of interest. To tackle the appearance disparity between training and test data, recent semi-supervised VOS algorithms usually adopt one of two architectures. Detection-based schemes, such as Refs. [2-11], compute and then propagate the segmentation of the past few frames to the current frame; many approaches in this category require an online training process that adaptively fine-tunes the pretrained network to the object's specific appearance in the test video. Matching-based schemes [12-15] formulate video object segmentation as pixel-wise classification in a learnt embedding space; such methods achieve promising results without online training.
However, these methods are built upon the assumption that appearance does not change significantly in consecutive frames. If this assumption does not hold and the object in the current frame looks different from previous frames, such approaches become unreliable. In this work, we aim to develop a more reliable VOS pipeline for water (and other objects with changing appearance) in such more challenging scenarios.
We observe that features of water learnt from previous frames may change significantly and may not work well for identifying water pixels in the current frame. Figure 1 illustrates two example frames from one of our testing videos. Between the two consecutive frames (a) and (b), the water's appearance (color, ripples, and certain reflections) clearly changes, and texture from previous frames cannot effectively guide segmentation of the later frame. Indeed, in such scenarios, water in the first frame is also likely to look different and not provide good guidance. In Fig. 1(c), we draw a heatmap of the l2-norm distance between the feature maps extracted from these two frames: the corresponding water regions are quite different.
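For illustration, the heatmap in Fig. 1(c) amounts to a per-region l2-norm distance between the two encoded feature maps. A minimal numpy sketch (the random tensors below merely stand in for the encoder's outputs on f28 and f29; the function name is ours):

```python
import numpy as np

def feature_distance_map(feat_a, feat_b):
    """Per-region l2-norm distance between two (h, w, c) feature maps,
    as visualized in the heatmap of Fig. 1(c)."""
    return np.linalg.norm(feat_a - feat_b, axis=-1)

rng = np.random.default_rng(0)
x28 = rng.normal(size=(6, 8, 16))   # stand-in for E(f28)
x29 = rng.normal(size=(6, 8, 16))   # stand-in for E(f29)
heat = feature_distance_map(x28, x29)   # (6, 8) distance heatmap
```

Large values in `heat` mark regions whose features changed most between the two frames.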
Our main idea is based on this aforementioned observation: the appearance of water (or other specular objects) may change dynamically and be difficult to predict, but its spatial locations and shapes in two consecutive frames are often more predictable and stable. Therefore, certain sub-regions identified in the previous frame, under some appropriately estimated transformations (e.g., obtained by simple tracking), are likely to be still occupied by the object in the current frame. These regions in the current frame provide valuable clues in learning the new appearance of this object. For example, in Figs. 1(d)-1(f), if we take water regions in the center (of the water region detected in the last frame), e.g., the green pixel region, as our reference, and use their feature vectors as templates, then other water regions in the current frame have better similarity to one of these reference regions.

Water segmentation dataset and benchmark
Another challenge in developing effective VOS systems is the lack of pixel-wise annotated training datasets. Specifically, for this water segmentation task, water-related image annotations are rather few, while water's appearance can be very varied; together these make learning water appearance significantly difficult. For this work, we have thus built a water-related image database, which we refer to as the WaterDataset. This training dataset contains 2388 water-related images that come with annotations. It also contains 20 manually labeled water videos for testing. Our model and the comparative methods are all trained and evaluated using this dataset. The WaterDataset and the performance scores are available for use in future comparisons.
Fig. 1 Appearance differences between frames. (a, b) Two consecutive frames, f28 and f29, of a video from which we wish to segment the water region. Using our feature encoder trained on WaterDataset, the feature maps of f28 and f29 are very different; their l2-norm distance is visualized in (c). If a pixel in f29, the green pixel in (d), is picked as a reference, features extracted from other water regions in f29 share better similarity with this reference. (d) color-encodes the l2-norm distance between the green pixel's feature vector and features of other regions. (e) The l2-norm distance when 5 reference pixels are selected. (f) The l2-norm distance when 20 reference pixels are selected. Green pixel regions are selected as references. When appearance changes dramatically, the spatial correlations of features may be stronger than their temporal correlations.

Contributions
The main contributions of this work are:
• a novel video object segmentation network for water, named WaterNet, which can effectively capture variations in water's appearance in video through online learning and updating, and
• a water segmentation database and benchmark to support image and video water segmentation research.
Our experiments demonstrate that our new pipeline clearly outperforms existing state-of-the-art VOS approaches in identifying water undergoing large variations in appearance. Our benchmark, source code, and water segmentation dataset are available at https://github.com/xmlyqing00/WaterNet.

Related work
Video object segmentation (VOS) has been an active research topic for the last decade. Existing approaches can be generally classified as detection-based methods and matching-based methods.

Detection-based methods
Methods in this category segment objects from videos frame by frame. The pipelines of OSVOS [2], OSVOS-S [9], and OnAVOS [5] are similar to that of an FCN. Their models are trained on offline datasets. Given a test video with first frame annotation, they apply data augmentation to the first frame and use that to fine-tune their models. However, without temporal information, these methods may produce jittering segmentations because of object motions or appearance variations.
Recent approaches such as LucidTracker [4] and MSK [3] build neural networks that take the first frame annotation and masks of previous frames as inputs to create the mask for the current frame. Given a test video, most of these approaches heavily rely on online learning to remember the object appearance in this specific video. While these methods achieve strong performance, they require online training to recognize the target object, which takes an extra 10-20 minutes. RGMP [16] takes the first frame and the previous frame as references to predict object masks without online training. However, as shown in Fig. 1, if object appearance changes between frames, previous frames may not be able to effectively guide the segmentation and these methods can fail.

Matching-based methods
While there is a strong interest in semi-supervised video object segmentation by leveraging online training on the first frame annotation to achieve better performance, other approaches aim to obtain better runtime and performance without online training. Recent matching-based methods such as PML [13], VideoMatch [14], FAVOS [8], and FEELVOS [15] formulate the segmentation problem as a pixel-wise assignment task. These algorithms learn pixel-wise embedding spaces and maintain a set of feature templates to explicitly memorize the appearance of the target object in the reference image. At test time, a matching mechanism matches the features of the current frame per pixel. These approaches update the feature templates after the segmentation of each frame. However, when the appearance of the object changes suddenly between consecutive frames, feature templates built upon previous frames may not adapt to changes in the current frame: the outdated templates may not match features of the object. In this work, we specifically design WaterNet to adapt to volatile appearance.

WaterNet segmentation
We now explain the design of our WaterNet, an appearance-adaptive network.

Overview
Given a sequence of N video frames {f_0, f_1, ..., f_{N-1}} and an annotation s_0 of the first frame in the form of a mask indicating the object segmentation, we wish to compute the segmentation masks of the object in the subsequent video frames, denoted {s_1, s_2, ..., s_{N-1}}. The frames f ∈ R^{H×W×3} are in RGB space. The segmentation masks s ∈ [0, 1]^{H×W} are maps in which 0 indicates background and 1 indicates water. Figure 2 illustrates the main pipeline of our proposed WaterNet. It consists of two branches: a parent network (ParentNet) and an appearance-adaptive branch (AA-branch). They share the same feature encoder E, which generates a feature map from an input image. The ParentNet, which is based on standard image semantic segmentation, is trained to learn the appearance of water from static images, and it predicts a binary water mask h_P for a given image frame f_t.
The AA-branch makes the segmentation adaptive to the water's appearance in the current video, which may look different from the training dataset and change from frame to frame. The AA-branch maintains three template sets: initial-reference templates T_I, recent-frame templates T_R, and current-frame templates T_C. Each template set is a list of feature vectors. The feature encoder E extracts pixel features from the first frame, a few previous frames, and the current frame, respectively, and rearranges them into these three template sets. The feature map x_t of the current frame f_t is also extracted by E. The similarity calculator (SC) matches x_t with these three template sets to produce three water segmentations h_I, h_R, and h_C, which are fused to compose the AA-branch segmentation h_A.
Finally, the ParentNet segmentation h P and the AA-branch segmentation h A are combined to give the output segmentation s t .
Note that in recent matching-based VOS algorithms [8, 13-15], features of the current frame are also compared with feature templates (obtained over a few previous frames) to estimate segmentation. However, because water has an inconsistent appearance, when its appearance changes suddenly between consecutive frames, features learnt from the past few frames cannot always effectively guide recognition in the current frame. The proposed current-frame templates in our AA-branch can use regions in the current frame as guidance to better accommodate such sudden appearance changes.

Fig. 2
Overview of WaterNet, which consists of a parent network (ParentNet) and an appearance-adaptive branch (AA-branch). They share the same feature encoder E, which generates features of the input image. In ParentNet (blue background), a feature decoder D uses the current frame's features x_t to predict a water segmentation h_P. In the AA-branch (yellow background), a deterministic similarity calculator matches the features x_t of the current frame with feature templates T_I, T_R, T_C to predict water segmentations h_I, h_R, h_C. Fusion modules merge these segmentations of the current frame to give the final segmentation s_t.
We now consider the components of our system in detail.

Parent network
Our parent network, ParentNet, is based on an FCN and has two components: a feature encoder E and a feature decoder D, as shown in Fig. 3. The encoder E encodes appearance information from RGB space into the embedding space. We use f_t ∈ R^{H×W×3} to represent a frame in RGB space, and x_t ∈ R^{h×w×c} to denote a feature tensor in the embedding space, where t is the time index, H and W are the height and width of the frame, h and w are the height and width of the feature tensor, and c is the number of feature channels. The ratios H : h and W : w depend on the downsampling layers of E. The decoder D consists of a set of deconvolutional layers, which take the feature tensor x_t, together with features from the corresponding stream in E passed through skip connections, and generate a parent segmentation h_P. We build E on ResNet-34 [17], with the final fully connected layers removed; its weights are initialized from the ImageNet pre-trained model. After end-to-end training of ParentNet on the WaterDataset, E and D are used to generate feature tensors and parent segmentations, respectively.

Appearance-adaptive branch
The ParentNet learns the appearance of water from offline static image data. But the appearance of water varies from video to video, and even from frame to frame. Recent VOS approaches use the first frame annotation to fine-tune the parent network so it can recognize the appearance of water in the specific test video. However, information from the first frame still may not accurately reflect the water's appearance in later frames. Moreover, online training often requires 10-20 minutes on modern GPU cards to retrain the network, which rules out, for example, real-time flood monitoring and prediction tasks.
Our appearance-adaptive branch (AA-branch) aims to tackle frame-to-frame appearance changes and provide better runtime efficiency. The AA-branch predicts a water mask h_A, which is later fused with the segmentation from the ParentNet to give the output segmentation. The pipeline of the AA-branch may be summarized as follows:
1. Initialize T_I and T_R using the annotation s_0 of the first frame f_0 and its extracted feature map x_0 (see Section 3.3.1). Then, for each subsequent time step t ≥ 1:
2. Use E to extract the feature map x_t for f_t and get a parent segmentation h_P; then create the current-frame templates T_C by adding a subset of features from x_t (see Section 3.3.1);
3. Compare each region of f_t with T_I, T_R, and T_C (see Section 3.3.2), and then output the AA-branch segmentation h_A;
4. Fuse h_A and h_P to give the final segmentation s_t for f_t (see Section 3.3.3);
5. Update the templates as necessary (see Section 3.3.3).
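The five steps above can be sketched as a toy end-to-end loop. Everything below is a simplified stand-in for illustration only: raw pixels replace the learnt features, a nearest-template rule replaces the learnt similarity calculator, the erosion step and the ParentNet fusion are omitted, and all names are ours.

```python
import numpy as np

def encode(frame):
    # Stand-in for encoder E: use raw pixel channels as "features".
    return frame.astype(float)

def split(feat, mask):
    # Stand-in for the feature splitter FS: object vs. background features.
    flat = feat.reshape(-1, feat.shape[-1])
    m = mask.reshape(-1) > 0.5
    return flat[m], flat[~m]

def match(feat, obj_t, bg_t):
    # Stand-in for the similarity calculator SC: a region is water if its
    # best cosine similarity to an object template beats the background's.
    flat = feat.reshape(-1, feat.shape[-1])
    def best(t):
        sims = flat @ t.T / (np.linalg.norm(flat, axis=1, keepdims=True)
                             * np.linalg.norm(t, axis=1) + 1e-8)
        return sims.max(axis=1)
    return (best(obj_t) > best(bg_t)).reshape(feat.shape[:2]).astype(float)

def segment_video(frames, s0, lam=(0.4, 0.2, 0.4)):
    x0 = encode(frames[0])
    T_I = split(x0, s0)            # step 1: initial-reference templates
    T_R = split(x0, s0)            # recent-frame templates (window of 1 here)
    masks = [s0]
    for f in frames[1:]:
        x = encode(f)              # step 2: features of the current frame
        T_C = split(x, masks[-1])  # current-frame templates (erosion omitted)
        h_I = match(x, *T_I)       # step 3: match against the three sets
        h_R = match(x, *T_R)
        h_C = match(x, *T_C)
        h_A = lam[0]*h_I + lam[1]*h_R + lam[2]*h_C  # step 4: fuse
        s = (h_A > 0.5).astype(float)
        masks.append(s)
        T_R = split(x, s)          # step 5: update recent-frame templates
    return masks
```

On a synthetic clip whose "water" and "background" pixels have distinct colors, this loop propagates the first-frame mask through the video.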

Feature template settings
We maintain three active template sets to remember the water's appearance recently observed in the video. There are two types of templates: object (water) templates and background templates, which are separated using a feature splitter module. Figure 4 shows the pipeline of the splitter module. The feature splitter FS reorganizes a feature map Y generated by encoder E into a list of object templates U_o ∈ R^{L_o×c} and a list of background templates U_b ∈ R^{L_b×c} according to the given template mask MSK, a binary image in which 0 represents the background and 1 represents the object. Therefore
U_o = {Y(i) | MSK(i) = 1},   U_b = {Y(i) | MSK(i) = 0}
where i ∈ [1, hw] enumerates all regions in the feature map Y.
Initial-reference templates. The initial-reference templates T_I remember the initial appearance of the water. We first use the encoder E from the ParentNet to convert f_0 to the feature tensor x_0 ∈ R^{h×w×c}. Using the feature splitter module FS with the first frame mask s_0, we divide the feature map x_0 into object and background templates; together, these form the initial-reference templates T_I.
Recent-frame templates. We maintain recent-frame templates T_R containing features from the previous M frames to track recent water appearance. Like the initial-reference templates, T_R consists of object templates T_R^o and background templates T_R^b. Since we propagate the water segmentation frame by frame, we use the segmentation s_{t-1} of the previous frame to update the recent-frame templates for segmenting the current frame. The mask s_{t-1} separates the feature map x_{t-1} of the previous frame into an object map V_o and a background map V_b. To provide more robust feature templates, we append to the recent-frame templates only those new features from V_o and V_b that both (i) have high segmentation scores (larger than a threshold Thc), and (ii) are far from the object boundary (distance to the boundary larger than a threshold r_1). In addition, to restrict the feature templates to a moderate size for computational efficiency, we remove features that were added M frames ago.
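The sliding-window update can be sketched with a deque whose maximum length is M, so that features added M frames ago drop out automatically. The boundary-distance test (threshold r_1) is omitted in this sketch, and all names are ours:

```python
import numpy as np
from collections import deque

def update_recent_templates(T_R, feat, seg_prob, th_c=0.7):
    """Append object features from the previous frame whose segmentation
    score exceeds th_c. T_R is a deque with maxlen=M, so entries added
    M frames ago drop out automatically. (The additional test that a
    feature lies farther than r_1 from the object boundary is omitted.)"""
    flat = feat.reshape(-1, feat.shape[-1])
    keep = seg_prob.reshape(-1) > th_c
    T_R.append(flat[keep])

M = 2
T_R = deque(maxlen=M)
rng = np.random.default_rng(1)
for t in range(5):                       # simulate 5 frames
    feat = rng.normal(size=(4, 4, 8))    # x_{t-1}: (h, w, c) feature map
    prob = rng.uniform(size=(4, 4))      # s_{t-1}: per-region scores
    update_recent_templates(T_R, feat, prob)
```

After processing the five simulated frames, only the templates of the last M = 2 frames remain in the window.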
Current-frame templates. Unlike recent VOS approaches that only use previous frames to model object appearance features, we further model object appearance from reliable regions of the current frame. For example, in most of the water videos we have observed, the water does not move significantly. Based on the segmentation s_{t-1} of the last frame, the object's central region (its pixels that are far from the changing boundary) is almost always still occupied by the object in the current frame. More generally, if objects are moving but their motion can be estimated by tracking or optical flow algorithms, then the motion of the object's central region can also be estimated. We denote such regions as high-confidence regions. We can then learn the object's up-to-date appearance from texture sampled in these high-confidence regions.
In our current implementation, high-confidence regions of the object and the background are extracted from the current frame f_t. E produces the feature map x_t ∈ R^{h×w×c} of the current frame f_t. Mask s_{t-1} is a binary map in which 0 represents the background and 1 represents the water. In the high-confidence feature extractor module HC, let U_o = s_{t-1} be the water mask and U_b = 1 − s_{t-1} be the background mask. We perform r_0 rounds of erosion on U_o and U_b to obtain high-confidence regions. Then, as in the feature splitter FS, we allocate the features of the high-confidence regions to the object template and the background template. These two templates form the current-frame templates T_C.
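A possible numpy sketch of the high-confidence extraction, using repeated binary erosion with a 3×3 cross structuring element (the function names are ours; the paper uses r_0 = 12 rounds, while the demo below uses one round on a small grid):

```python
import numpy as np

def erode(mask, rounds):
    """Binary erosion with a 3x3 cross structuring element, `rounds` times:
    a pixel survives only if it and its 4 neighbors are all set."""
    m = mask.astype(bool)
    for _ in range(rounds):
        p = np.pad(m, 1, constant_values=False)
        m = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
             & p[1:-1, :-2] & p[1:-1, 2:])
    return m

def high_confidence_templates(feat, prev_mask, r0=2):
    """Erode the previous water mask and its complement r0 times, then keep
    the current-frame features inside the surviving (central) regions."""
    obj_core = erode(prev_mask > 0.5, r0)
    bg_core = erode(prev_mask <= 0.5, r0)
    flat = feat.reshape(-1, feat.shape[-1])
    return flat[obj_core.reshape(-1)], flat[bg_core.reshape(-1)]

prev = np.zeros((8, 8)); prev[:4] = 1.0      # water in the top half
feat = np.ones((8, 8, 4))                    # stand-in feature map x_t
obj_t, bg_t = high_confidence_templates(feat, prev, r0=1)
```

One erosion round shrinks each 4×8 half-mask to its 2×6 interior, so both template lists hold 12 feature vectors.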

Feature matching
We compare the feature map x_t of the current frame f_t with the above three template sets T_I, T_R, and T_C to identify potential object regions. A similarity calculator (SC) provides efficient matching: it takes two inputs, the current frame features x_t and a set of feature templates, and outputs a score map in which higher values indicate regions more likely to be water. Figure 5 shows the details of the similarity calculator module.
Specifically, the object feature templates and background feature templates are initialized to U_o and U_b, of sizes L_o and L_b respectively. Two similarity calculators compute an object score map and a background score map for the given feature map x_t ∈ R^{h×w×c}. The object score map H_o gives each region's likelihood of belonging to the object, and the background score map H_b its likelihood of belonging to the background. Let the feature vector of pixel i in the feature map of the current frame f_t be x_t(i), i ∈ {1, ..., hw}, the feature vector of entry j_1 in the object feature templates be U_o(j_1), j_1 ∈ {1, ..., L_o}, and the feature vector of entry j_2 in the background feature templates be U_b(j_2), j_2 ∈ {1, ..., L_b}. First, we compute cosine similarity between the feature map and the templates using
CS_o(i, j_1) = (x_t(i) · U_o(j_1)) / (‖x_t(i)‖ ‖U_o(j_1)‖),   CS_b(i, j_2) = (x_t(i) · U_b(j_2)) / (‖x_t(i)‖ ‖U_b(j_2)‖)
where i ∈ {1, ..., hw}, CS_o ∈ [−1, 1]^{hw×L_o} is the cosine similarity matrix between the feature map and the object templates, and CS_b ∈ [−1, 1]^{hw×L_b} is the cosine similarity matrix between the feature map and the background templates. We then compute the object score H_o and background score H_b of the feature map x_t from the top K entries of the cosine similarity matrices CS_o and CS_b:
H_o(i) = (1/K) Σ_{j=1}^{K} top_K(CS_o(i), j),   H_b(i) = (1/K) Σ_{j=1}^{K} top_K(CS_b(i), j)
where i ∈ {1, ..., hw}, and top_K(CS(i), j) returns the j-th largest similarity score in the i-th row. K is set to 10 in our experiments. In the AA-branch, we deploy three similarity calculator modules and match the current frame features x_t with the three template sets T_I, T_R, and T_C, to obtain three object segmentations: the initial-reference-based segmentation h_I, the recent-frame-based segmentation h_R, and the current-frame-based segmentation h_C.

Fig. 5 Similarity calculator SC module. For each frame f_t, the encoder E generates a feature tensor x_t. The feature vector corresponding to each region i in f_t is x_t(i). We compute the cosine similarity between each feature vector x_t(i) and the object/background features in the template list. The object/background score map is the average of the top K similarity scores. The fusion module Fuse2 fuses the object/background score maps to give a segmentation mask.
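The per-region scoring can be sketched in numpy as follows. This is our reading of the SC module, with the top-K average implemented by sorting each row of the cosine similarity matrix:

```python
import numpy as np

def score_map(x, templates, K=10):
    """Object (or background) score map: for each of the h*w regions,
    average the top-K cosine similarities to the template list."""
    h, w, c = x.shape
    flat = x.reshape(-1, c)                               # (hw, c)
    cs = (flat @ templates.T) / (
        np.linalg.norm(flat, axis=1, keepdims=True)
        * np.linalg.norm(templates, axis=1) + 1e-8)       # (hw, L)
    K = min(K, templates.shape[0])
    topk = np.sort(cs, axis=1)[:, -K:]                    # K largest per row
    return topk.mean(axis=1).reshape(h, w)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 4, 8))       # current-frame features x_t
U_o = rng.normal(size=(20, 8))       # 20 object templates
H_o = score_map(x, U_o, K=10)        # (4, 4) object score map
```

Running the same function against the background templates gives H_b; in the AA-branch, three such calls (against T_I, T_R, T_C) yield h_I, h_R, h_C.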

Segmentation fusion
The above three segmentations are fused, using a module named Fuse0, to give the current frame's appearance-adaptive segmentation h_A:
h_A = λ_0 h_I + λ_1 h_R + λ_2 h_C
We initialize λ_0 = 0.4, λ_1 = 0.2, λ_2 = 0.4, and gradually decrease λ_0 every 10 frames, since the appearance of the first frame becomes less informative as time goes on:
λ_0 = 0.9 λ_0
The weight for the current-frame segmentation remains unchanged. We fuse the appearance-adaptive segmentation h_A and the ParentNet segmentation h_P using another module, Fuse1, to obtain the final segmentation for the current frame f_t:
s_t = λ_A h_A + (1 − λ_A) h_P
where λ_A is a balancing factor.
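A sketch of the two fusion steps, under the reading that both Fuse0 and Fuse1 are plain weighted sums (an assumption on our part, consistent with the stated weights; function names are ours):

```python
import numpy as np

def fuse(h_I, h_R, h_C, h_P, lam=(0.4, 0.2, 0.4), lam_A=0.5):
    # Fuse0: weighted sum of the three AA-branch score maps.
    h_A = lam[0] * h_I + lam[1] * h_R + lam[2] * h_C
    # Fuse1: blend the AA-branch result with the ParentNet segmentation.
    return lam_A * h_A + (1.0 - lam_A) * h_P

def decay_lambda0(lam0_init, frame_idx):
    # lambda_0 is multiplied by 0.9 once every 10 frames.
    return lam0_init * 0.9 ** (frame_idx // 10)

ones = np.ones((2, 2))
s = fuse(ones, ones, ones, ones)     # all branches agree -> certain water
```

With the initial weights summing to 1 and all branches agreeing, the fused map is exactly 1 everywhere; the decay helper returns 0.4 × 0.9² at frame 25.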

Implementation details
Note that the initial-reference templates are constant during evaluation, while the current-frame templates are updated for each frame. The recent-frame templates track features in the previous M frames.
WaterNet can be trained on a still water image dataset and evaluated on dynamic water videos. Once the ParentNet has been trained, the AA-branch can directly reuse the encoder E and decoder D from the ParentNet to extract feature maps. We use ResNet-34 [17] as the backbone of the encoder E. We set the total number of epochs to 200 and the initial learning rate to 0.1, gradually decreasing it during training. To train the ParentNet, we randomly pick an image and its ground truth from the WaterDataset, and augment the training data following Ref. [4] by randomly adjusting colors and applying affine, flipping, and cropping transformations. During testing, we set K = 10, M = 2, r_0 = 12, r_1 = 8, Thc = 0.7, and λ_A = 0.5, and run the whole WaterNet to predict the water mask.
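For reference, the test-time hyper-parameters quoted above can be grouped in one place (the dict itself is our own convenience, not the authors' code):

```python
# Test-time hyper-parameters as stated in the text.
WATERNET_TEST_CFG = {
    "K": 10,          # top-K similarities averaged per region
    "M": 2,           # recent-frame template window, in frames
    "r0": 12,         # erosion rounds for high-confidence regions
    "r1": 8,          # boundary-distance threshold for recent-frame templates
    "Thc": 0.7,       # segmentation-score threshold for template updates
    "lambda_A": 0.5,  # ParentNet vs. AA-branch balancing factor
}
```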

Experiments
We have compared our proposed WaterNet with several state-of-the-art video object segmentation methods on our new benchmark, WaterDataset.

Dataset and evaluation metrics
Our new benchmark for the water segmentation task, named WaterDataset, includes a training set and an evaluation set. The training set has 2388 water-related still images with annotations; 1888 images are from ADE20K [18] and 300 images are from RiverDataset [19]. These images contain various types of water, including lakes, canals, rivers, oceans, and floods. The evaluation set contains 20 water-related videos:
1. 7 videos recorded on days with heavy rain, when local creeks and ponds were flooded. Frames in these 7 videos were all manually labeled.
2. 10 surveillance videos from Farson Digital Watercams [20] that recorded open waters from 8 a.m. to 6 p.m. Frames in these 10 videos were labeled uniformly, every 50 frames.
3. 3 surveillance videos taken at a beach that recorded changes in sea waves.
We adopt the evaluation measures used by the DAVIS Challenge [21, 22]. In particular, we use region (J) and boundary (F) measures to evaluate segmentation quality. The region measure, also called the Jaccard index, is a widely used evaluation metric in video object segmentation; it calculates the intersection-over-union (IoU) of the estimated mask and the ground-truth mask. We compute the mean IoU across all frames in the test videos. The boundary measure evaluates the accuracy of boundaries, via bipartite matching between the boundary pixels of the two masks. Finally, J&F is the average of J and F.
In addition, we adopt the three error measure statistics from Ref. [23]. Let O = {F_i} be the dataset of video sequences and C be an error measure, either the region (J) or boundary (F) measure. First, the mean is the average score over the dataset:
M(C) = (1/|O|) Σ_{F_i ∈ O} C(F_i)
Second, the recall measures the fraction of sequences scoring higher than a threshold τ:
R(C) = (1/|O|) Σ_{F_i ∈ O} I(C(F_i) > τ)
where τ = 0.5 and I is the indicator function, having the value 1 when the condition is satisfied and the value 0 otherwise. Third, the decay measures how the performance changes over time.
For the mean and the recall measures, higher numbers are better, while for the decay measure, lower numbers are better.
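The region measure and the mean/recall statistics above are straightforward to compute; a short numpy sketch with toy per-sequence scores (function names are ours):

```python
import numpy as np

def jaccard(pred, gt):
    """Region measure J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0

def mean_measure(scores):
    """M(C): average score over all sequences."""
    return float(np.mean(scores))

def recall_measure(scores, tau=0.5):
    """R(C): fraction of sequences scoring above tau."""
    return float(np.mean([s > tau for s in scores]))

per_seq_J = [0.9, 0.6, 0.4]          # toy per-sequence region scores
m = mean_measure(per_seq_J)
r = recall_measure(per_seq_J)
```

Here two of the three toy sequences exceed τ = 0.5, so the recall is 2/3.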

Quantitative comparison
We compared our method with several state-of-the-art methods on the WaterDataset. Recent VOS approaches can be generally classified into three categories.
1. Detection-based methods such as OSVOS [2], OSVOS-S [9], and OnAVOS [5], which segment the video frame by frame without considering temporal consistency. We chose OSVOS as the representative approach in this category.
2. Propagation-based methods such as LucidTracker [4], MSK [3], and RGMP [16], which use the segmentation of the previous frame(s) to predict an object mask for the current frame. We chose RGMP as representative of this category because it outperforms other mask propagation methods.
3. Methods without online training such as PML [13], VideoMatch [14], FAVOS [8], and FEELVOS [15]. We chose FEELVOS [15] as the representative approach in this category, as it significantly outperforms PML [13] and VideoMatch [14].
All our experiments were performed on an Intel Xeon(R) E5-2630 v2 (2.60 GHz × 24) with a GTX 1080Ti GPU card and 32 GB RAM. Table 1 documents the comparison of WaterNet with these state-of-the-art methods. We use the superscript "−" to denote a method for which online training was disabled. For OSVOS [2], we followed the authors' pipeline to fine-tune the model with the first frame annotation. Note that OSVOS requires an extra 10 minutes to segment each video. OSVOS achieves 0.597 for J&F-Mean compared with 0.382 for OSVOS−. Online training does improve segmentation accuracy, but we can see that OSVOS has the worst decay scores, as its segmentation performance decreases over time. We conclude that online training cannot adapt to appearance changes during the video. In terms of the region measure (J), WaterNet outperforms the other methods, as its three feature templates help capture the changing appearance of water. In terms of the boundary measure (F), WaterNet's F-Recall is a little weaker than FEELVOS's, as FEELVOS adopts a strong neighbor filter that only considers features in a small window, which improves the boundary measure but may fail if the object moves dramatically. Note that the decay measures how the segmentation results change over time. Because OSVOS− is an image-based method that ignores temporal information, it achieves good decay scores even though its segmentation results are poor: only 0.382 for J&F-Mean.

Table 1 Comparison of WaterNet with other state-of-the-art methods on WaterDataset. Region (J) calculates the intersection-over-union (IoU) of the estimated mask and the ground-truth mask. Boundary (F) evaluates the accuracy of boundaries. Mean is the average score. Recall measures the fraction of sequences scoring higher than a threshold. Decay measures how performance changes over time. Mean and Recall are the two most important measures.
In terms of overall measure J &F-Mean, WaterNet achieves the highest score 0.645 of the methods compared.

Appearance difference between the first frame and the test frame.
Figures 6 and 7 visualize segmentation results for the tested methods on the test videos "Buffalo0" and "Stream3". "Buffalo0" is a time-lapse video taken near Houston's Buffalo Bayou during Hurricane Harvey in August 2017. The bayou was flooded, and our goal is to track the water elevation at this location during that time. "Stream3" is a video taken near a local creek on campus during heavy rain in August 2018. In Fig. 6, the first frame was captured at 07:55 while the test frame was captured at 13:25; different solar altitudes make the water look distinct. In Fig. 7, different weather conditions (wind and rain) make the appearance of the water dissimilar. Online training based methods (such as OSVOS) and first frame guided methods (such as RGMP) fail in these cases because the appearance of the test frame is very different from the first frame's. Our model outperforms the other methods as it tracks appearance changes during evaluation.

Figure 8 shows segmentation results for the tested methods on the test video "Boston Harbor", taken near Boston Harbor in February 2019. From the 8th frame to the 9th frame, although the camera position is fixed, the appearance of the water quickly changes due to reflections, shadows, and waves. Figure 9 shows results for the test video "Holiday Inn Beach"; the appearance of the sea is highly dynamic in this video. Mask propagation based methods such as RGMP and FEELVOS fail in these cases because they exploit information from previous frames to segment the current frame, a mechanism that works poorly when object appearance changes greatly between consecutive frames. Our model has an appearance-adaptive branch, which captures the appearance of the object from the high-confidence features observed in the current frame. The segmentation results show that our model is more robust to appearance variation in such scenarios as well.

Ablation study
We also analyzed the effectiveness of the key components of our model, through two variants. One removed the module that matches current-frame features against the current-frame templates (see Section 3.3.1). The other removed the entire appearance-adaptive branch, to assess the performance of the ParentNet alone (see Section 3.3).

WaterNet without current-frame templates
When processing each frame, WaterNet compares current frame features with the current-frame templates to identify water regions. We set the weight of the current-frame segmentation λ_2 = 0 and tested our model without current-frame templates. Without this procedure, our model's J&F-Mean decreases from 0.645 to 0.638. WaterNet without current-frame templates still performs better than matching-based approaches such as FEELVOS, mainly for two reasons: (i) the module weights in our AA-branch are adaptive, and we decrease the weight of the initial-reference templates and increase the weight of the recent-frame templates as time goes on, and (ii) our recent-frame templates track features from the past M frames, while FEELVOS only utilizes features from the last frame.

WaterNet without AA-branch
Our WaterNet consists of two components: the ParentNet and the AA-branch. The appearance-adaptive branch maintains a set of feature templates to identify the object in each frame. We removed the AA-branch and ran the ParentNet alone. Because the ParentNet is an image-based segmentation network that does not consider temporal information, the resulting performance is more stable but less accurate. Note that mean and recall are the two most important measures. Without the AA-branch, the J&F-Mean score decreases from 0.645 to 0.479.

Summary
We developed an adaptive matching pipeline, WaterNet, to tackle appearance change in water in video object segmentation. Our main idea is to use the object's appearance as observed in the current frame to help its identification and segmentation. We built an annotated dataset of water images and videos, to facilitate water-related image and video segmentation. Our experiments demonstrated that with our new AA-branch, the accuracy of VOS on appearance-changing objects clearly improves, and our WaterNet outperforms existing state-of-the-art algorithms in video water segmentation.

Limitations
The feature templates are updated based on each frame's segmentation result, without supervision. If the segmentation of some frame is incorrect, the derived feature templates and the estimated high-confidence region could also be incorrect, which would negatively impact subsequent segmentation accuracy. This is also a problem in existing approaches in which segmentations of the past few frames are used to guide the subsequent segmentation. We will study the relationship between appearance change and other information and priors, such as saliency, attention, or tracking information, and explore the possibility of integrating these priors and preprocessing mechanisms to help tackle this issue.