1 Introduction

The automatic tracking and segmentation of individual fish have emerged as pivotal tools in the field of ecological behavioural analysis, with a broad spectrum of applications. This is evidenced by numerous studies in the domain [1,2,3,4,5,6]. The ability to understand and predict animal motion in their natural habitats could yield significant benefits across various research and industry domains [7,8,9,10,11]. However, the inherent complexity of animal movement in the wild presents a great challenge. Factors contributing to this complexity include intermittent visibility of animals in videos and the presence of multiple animals within a single video frame, both of which complicate tracking and segmentation tasks. Addressing these challenges often necessitates the deployment of advanced computational methods.

A large number of studies have attempted to tackle these challenges [12,13,14,15,16,17,18]. These studies predominantly rely on pixel-level annotations to train or enhance their deep neural networks (DNNs). However, obtaining such annotations is both costly and time-consuming, particularly for fish segmentation in the wild. Most current automated methods assume that training data are paired with ground truth derived from videos containing a large number of fish [13, 19,20,21,22]. Despite the high cost of obtaining this ground truth, a substantial number of annotated video sequences is still required, because accurate results are difficult to achieve from only a limited number of sample videos.

Fig. 1

Combining background subtraction and optical flow. The two cues work in concert to preserve object boundaries and temporal coherence throughout the video. Please refer to Sect. 3 for details

This study was motivated by the challenges faced when annotating and segmenting animals in videos captured in the wild. Unlike in controlled conditions, where animals are easily distinguishable from the background, fish are difficult to distinguish in realistic videos [23,24,25], even with domain knowledge. This is due to the large variations in animal appearance, lighting conditions, and background.

Our approach aims to develop an unsupervised method for fish tracking and segmentation that requires no human annotations, by leveraging spatial and temporal variations in video data through the well-established techniques of background subtraction and optical flow, as shown in Fig. 1. Specifically, we propose to generate pseudo-labels from unlabelled video data. Pseudo-labels can benefit various learning-based algorithms, since they significantly reduce labelling costs. The key to the proposed method is to exploit the intrinsic temporal consistency between consecutive frames and to improve the generated labels by refining them with a self-supervised model. We propose training a deep neural network (DNN) to segment individual fish based on the generated pseudo-labels. As long as the pseudo-labels have a structure and appearance similar to real labels, the model can learn the underlying structure from them. In general, the more realistic the pseudo-labels, the better the segmentation accuracy. We include short videos of our model's predictions at https://youtu.be/Z5G7YBoL3eM and https://youtu.be/8LOKsVSiY9U.

The main contributions of this paper are listed as follows:

  • Propose to use pixel-level pseudo-labels generated by an optical flow model and background subtraction to learn the segmentation and tracking of individual fish automatically without manual interaction.

  • Demonstrate that using self-supervised refinement, we can further improve the accuracy of the pseudo-labels for fish tracking and segmentation.

  • Evaluate our method on three public datasets with different image quality.

  • Discuss the limitations of the current model and our future research directions.

The rest of the paper is organised as follows. Section 2 covers related work and provides background information on the novel aspects of our work. Our model’s framework is described in detail in Sect. 3. Section 4 presents our method for training and evaluating our model. The experimental setup and results are presented in Sect. 5, while detailed discussions of our results are presented in Sect. 6. Finally, Sect. 7 concludes the paper.

2 Related work

The field of video object segmentation and animal tracking has witnessed substantial advancements in recent years. A noteworthy contribution is the unsupervised video object segmentation model, UVOSAM, proposed by Zhang et al. [6]. This model, which operates without the need for masks, has introduced new possibilities in the field. In the realm of animal tracking, Dutta et al. [11] have developed a deep learning workflow that holds particular relevance for ecological studies. Complementing this, Javed et al. [12] have provided a comprehensive survey of visual object tracking techniques, significantly enhancing our understanding of the current landscape.

Further enriching the field, Cao et al. [22] proposed a method of dense spatio-temporal position encoding to improve tracking accuracy. Proença et al. [25] introduced TRADE, a method that utilises 3D trajectory and ground depth estimates for UAVs. In terms of segmentation, Jahanbakht et al. [26] explored distributed deep learning and energy-efficient image processing for fish segmentation. Zhang et al. [27] developed MSGNet, which uses multiple sources of information to improve the precision of fish segmentation. This approach has been influential in shaping our own methodology. In the following subsections, we provide a brief review of the research domains that are most relevant to our work.

Video object segmentation is the task of locating and segmenting each target object [28,29,30]. The target object can be either a class of interest in the videos or moving objects of interest. Object segmentation generally falls into two categories: segmentation with instance-level semantics and segmentation without instance-level semantics; the latter is the main objective of this paper. Accordingly, this study focuses on generating labels without human intervention. Some segmentation methods for moving objects have been developed using background subtraction techniques [31,32,33]. Several of these approaches are based on the assumption that the scene is locally constant [34, 35], i.e. the background in one frame is assumed to be similar to the background in the next frame or to shift by only a few pixels. Based on this assumption, they estimate the local background and threshold the difference against a similarity threshold to identify foreground regions. However, this approach is known to be sensitive to illumination changes and may even lose all detail within the image if the local background is overestimated. Another approach to segmentation uses optical flow to define motion boundaries [36,37,38,39].

Optical flow estimates the relative motion of objects between two consecutive frames of a video [38, 39]. It gives a dense correspondence between frames, but at the cost of being largely limited to rigid objects and of being computationally expensive. Additionally, optical flow only works well in scenes where the movement of the camera is significantly smaller than the movement of the objects [36, 37, 40]. This can be seen as a limitation, since background subtraction can be used in a wider range of applications. However, a key property of optical flow is that it can also be used for background subtraction [41, 42]. By tracking the movement of pixels between frames, we can determine the background: if a pixel does not match the static background within a given threshold, it is assigned to an object.

Another segmentation approach is based on the detection of visual motion. It relies on the fact that moving objects in the scene induce consistent changes in the flow of pixels within a region [37, 40]. However, due to substantial displacements or occlusions, the calculated optical flow may contain considerable inaccuracies [43,44,45]. In our method, we address these issues and enhance both the estimated optical flow and the object segmentation simultaneously.

Video object tracking is the task of assigning a consistent label for each individual object in the scene as it moves [46, 47]. This tracking is generally divided into multiple steps, including detection of the object of interest, tracking of the moving object in the scene, and then associating labels between frames. The tracking task, therefore, consists of identifying the bounding box of the object over several video frames and, at the same time, updating the location of the object in the image [48, 49]. This can be done based on a similarity metric between different frames [50, 51]. The idea of such a metric is to find the closest objects in the frame with an overlapping bounding box. This can be performed at either the pixel level or at the region level. The major drawback of this method is the computation time [52, 53] that is needed to compute all the similarities between all the different frames. On the other hand, if the computational resources are available, this method has been proven to be useful when tracking fast-moving objects and when the objects are not occluded in the frame [54, 55]. In our method, we produce the rotating 2D object bounding box from each instance mask of the object over several video frames.

In contrast to our work, Yang et al. [56] use a Siamese network with an anchor-free tracker for general object tracking, simplifying the tracking algorithm by avoiding the anchor-box design and predicting the tracking target with a pair of corners (top-left and bottom-right). While both our work and SiamCorners [56] utilise deep learning techniques for object tracking, there are key differences. Our approach specifically addresses the challenges of fish segmentation and tracking in underwater videos, leveraging optical flow and background subtraction to generate pseudo-ground-truth labels. In contrast, SiamCorners simplifies the tracking algorithm by avoiding the anchor-box design but does not address the unique challenges of fish segmentation and tracking in underwater videos. We believe these distinctions highlight the novelty and significance of our work in this specific domain.

Supervised and unsupervised learning. Supervised learning has been used to build object detection [57,58,59], video object segmentation [29, 30] and video object tracking [47, 48] systems. These methods require extensive human annotation and are therefore not suitable for video annotation in the wild. To reduce labelling costs, unsupervised learning has emerged as a powerful technique for learning from video data. In the traditional image domain, unsupervised methods are expected to outperform their supervised counterparts [13, 14, 16, 60, 61] because of their potential to learn from data without labels.

The idea behind many of the unsupervised DNN models is to learn a feature representation from unlabelled data [62,63,64]. Then, a DNN model can be applied to the learned feature representation to produce the output. For example, in the domain of video segmentation [15, 65,66,67], DNNs have been used to learn a representation from the difference between a pair of unlabelled videos [68,69,70] and from warped frames [71].

In this work, we focus on unsupervised learning. Our proposed method will generate labels referred to as pseudo-labels to train a multi-task supervised DNN for video object segmentation and video object tracking.

Fig. 2

Our proposed framework consists of three main components: pseudo-label generation, unsupervised pseudo-label refinement, and the segmentation network. The segmentation model is trained with the generated pseudo-labels, which are refined through self-supervised training. Please refer to Sect. 3 for details

3 Framework

The overall framework can be divided into three stages, as shown in Fig. 2. The first stage generates pseudo-labels using background subtraction and optical flow for both videos and still images. The second stage trains a self-supervised model to refine the pseudo-labels using their spatial structure. In the last stage, the refined pseudo-labels from the video and still-image pipelines are used jointly to train the segmentation network and to predict the final labels. Because we employ improved pseudo-labels, the segmentation network’s training behaviour closely matches that of supervised training. As a result, the network’s training process is more reliable than that of current unsupervised learning techniques [68,69,70, 72]. In the following subsections, we describe the details of these three components and the corresponding loss functions.

3.1 Background subtraction

As a first step in generating pseudo-labels, background subtraction is performed on the video frames. A clean background image is estimated for every video sequence by computing the per-pixel median of the first 10 frames, i.e. the median along the temporal axis. This averages out any distracting elements that pass in front of the static background. Each video frame is then subtracted from the clean background to create the mask sequence. After subtraction, adaptive Gaussian thresholding [73] assigns a value of 1 to all foreground pixels and 0 to pixels belonging to any background region.

Adaptive Gaussian thresholding is used instead of a single global threshold because it sets each pixel’s threshold based on the local region surrounding it. As a consequence, different areas within the same image receive different thresholds, which produces better results for images with varying illumination.
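For illustration, the following is a minimal sketch of this step using OpenCV and NumPy; the adaptive-threshold block size and offset are assumptions of this sketch rather than values reported in the paper.

```python
import cv2
import numpy as np

def background_and_masks(frames):
    """Estimate a clean background and per-frame foreground masks (Sect. 3.1).

    frames: list of greyscale uint8 frames from one video sequence.
    The background is the per-pixel median of the first 10 frames; the
    adaptive-threshold block size (11) and offset (-2) are assumed values.
    """
    background = np.median(np.stack(frames[:10], axis=0), axis=0).astype(np.uint8)
    masks = []
    for frame in frames:
        diff = cv2.absdiff(frame, background)   # remove the static scene content
        mask = cv2.adaptiveThreshold(
            diff, 1, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, -2)          # local threshold per neighbourhood
        masks.append(mask)                      # foreground = 1, background = 0
    return background, masks
```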

This background subtraction step is crucial in eliminating any stationary elements or shadows from the video sequences that might disturb the next step, optical flow.

3.2 Optical flow

The next step in pseudo-label generation is to calculate the optical flow using recurrent all-pairs field transforms (RAFT) [74]. However, optical flow is frequently inaccurate at object boundaries, and these are exactly where we need the segmentation to be accurate. Therefore, we consider video segmentation from background subtraction and optical flow estimation simultaneously. Using both pixel-level and temporal information sources, the segmentation is improved by removing artefacts induced by background subtraction and optical flow. We demonstrate how both levels work in concert to preserve object boundaries and temporal coherence throughout the video. The key is to remove motion blur while preserving the motion of the fish boundaries.

To generate the pseudo-labels, we first take a pair of video frames, \(x_t\) and \(x_{t+1}\), and estimate masks \(m_t\) and \(m_{t+1}\) with the background subtraction method described in Sect. 3.1. The segmented masks \(m_t\), \(m_{t+1}\) are used to synthesise frames \(\hat{x_t}\) and \(\hat{x_{t+1}}\) by warping \(x_t\) and \(x_{t+1}\) with \(m_t\), \(m_{t+1}\), respectively. The optical flow [74] takes the two frames \(\hat{x_t}\) and \(\hat{x_{t+1}}\) and produces a motion vector \({\hat{v}}\) between them, from which the magnitude and angle of the motion are computed. Specifically, pixels with a motion vector \({\hat{v}}\) outside \(m_t\) (and \(m_{t+1}\)) are assigned the value of the background, and pixels with a motion vector \({\hat{v}}\) inside \(m_t\) (and \(m_{t+1}\)) are reassigned to the object. We denote the reassigned images as \(\hat{x_t^*}\) and \(\hat{x_{t+1}^*}\) and use them as input to our segmentation step, as shown in the top panel of Fig. 2 (Fig. 3).
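The sketch below illustrates this combination using torchvision's RAFT implementation. It pre-masks the frames with the background-subtraction mask, estimates the flow, and keeps only pixels whose flow magnitude exceeds a threshold; the magnitude threshold and the simplified reassignment step are assumptions of this sketch (RAFT also requires frame dimensions divisible by 8).

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

def flow_refined_mask(frame_t, frame_t1, mask_t, mag_thresh=1.0):
    """Refine a background-subtraction mask with RAFT flow magnitude.

    frame_t, frame_t1: (1, 3, H, W) float tensors in [0, 1], H and W divisible by 8.
    mask_t: (H, W) binary tensor from the background subtraction step.
    """
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    # Suppress the static background before estimating flow, as described above.
    x_t, x_t1 = frame_t * mask_t, frame_t1 * mask_t
    with torch.no_grad():
        flow = model(*weights.transforms()(x_t, x_t1))[-1]  # (1, 2, H, W), final RAFT iteration
    magnitude = torch.linalg.norm(flow[0], dim=0)           # per-pixel motion magnitude
    # Keep foreground pixels that actually move; the rest fall back to background.
    return (mask_t.bool() & (magnitude > mag_thresh)).float()
```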

We show the optical flow results for the three video datasets, with and without background subtraction of frames \(x_t\) and \(x_{t+1}\), in Figs. 4, 5, and 6. The mask \(m_{t+1}\) from the optical flow step, which better distinguishes the foreground from the background, is then refined with our proposed unsupervised refinement method described in the next section. A sample video comparing optical flow before and after background subtraction is available at https://youtu.be/8LOKsVSiY9U.

Fig. 3

Sample image from each of the four utilised datasets. From left: Seagrass [23], DeepFish [24], YouTube-VOS [75], and Mediterranean Fish Species [76]

Fig. 4

Sample optical flow results for Seagrass [23]. From left, the original image, optical flow without background subtraction, optical flow with background subtraction, mask overlay

Fig. 5

Sample optical flow results for DeepFish [24]. From left, the original image, optical flow without background subtraction, optical flow with background subtraction, mask overlay

Fig. 6

Sample optical flow results for YouTube-VOS [75]. From left, the original image, optical flow without background subtraction, optical flow with background subtraction, mask overlay

Fig. 7

Sample fish trajectory results. Zoom in for a better view. See also a short video of fish trajectory results at https://youtu.be/Z5G7YBoL3eM

3.3 Unsupervised refinement

The second stage in our method is cumulative pseudo-label refinement through unsupervised historical moving averages (MVA) [77], using the DeepLabv3 [78] network for semantic segmentation and conditional random fields (CRF) [79], by minimising an F-score-based loss until the MVA predictions reach a stable state. The CRF can “sharpen” the initial location predictions, making them more accurate and consistent with edges and with parts of the source image that have a constant colour.

Given the pseudo-labels from the previous step, we train the network for 50 epochs. The number of epochs is kept low to avoid significant over-fitting of the network to the noisy pseudo-labels. The network is then reinitialised with the trained weights and used to predict a new set of pseudo-labels, on which it is trained again.

Let D be the set of training examples and M be the network model. By \(M(x, p)\), we denote the mask prediction of model M on pixel p of the image \(x \in D\). During this stage, a historical moving average (MVA) from the last training stage is computed as follows:

$$\begin{aligned} \text {MVA}(x, p, k)= (1-\alpha ) * CRF(M(x, p)) + \alpha * \text {MVA}(x, p, k-1), \end{aligned}$$

where \(M(x, p)\) is the network mask prediction, k is the epoch number, \(\alpha \) is a positive real factor, and CRF denotes the conditional random field refinement [79].

We use \(L_\beta = 1 - F_\beta \) as an image-level loss function for each training example x. The F-score (\(F_\beta\)) is the weighted harmonic mean of the precision and recall of the predicted mask on image x w.r.t. the pseudo-labels, with a positive real factor \(\beta \):

$$\begin{aligned} F_\beta =\left( 1+\beta ^2\right) \frac{\text {precision} \cdot \text {recall}}{\beta ^2~ \text {precision} + \text {recall}}. \end{aligned}$$

The network is retrained until the MVA reaches a stable state, as shown in the middle panel of Fig. 2. By doing so, the quality of pseudo-labels is improved over time.
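For concreteness, the moving-average update and the \(L_\beta\) loss can be written as follows; the value of \(\alpha\) and the soft (differentiable) formulation of precision and recall are assumptions of this sketch.

```python
import torch

def mva_update(prev_mva, crf_pred, alpha=0.9):
    """Historical moving-average update of the pseudo-labels (the alpha value is assumed).

    prev_mva, crf_pred: (H, W) soft masks in [0, 1]; crf_pred is the CRF-refined prediction.
    """
    return (1 - alpha) * crf_pred + alpha * prev_mva

def f_beta_loss(pred, pseudo_label, beta=1.0, eps=1e-7):
    """Image-level loss L_beta = 1 - F_beta of a predicted mask w.r.t. the pseudo-label."""
    tp = (pred * pseudo_label).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (pseudo_label.sum() + eps)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)
    return 1 - f_beta
```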

3.4 Segmenting objects by locations

Our last stage is training a supervised segmentation model using the refined pseudo-labels from the previous stage. The supervised model is based on segmenting objects by locations (SoloV2) [80]. SoloV2 is an updated version of Solo [81], a previous method for instance segmentation. The idea is to dynamically segment objects by location.

Given an input image, the network generates object masks; mask generation is decoupled into mask kernel prediction and mask feature learning. Furthermore, matrix non-maximum suppression (MNMS) is applied to reduce inference overhead. Specifically, SoloV2 is composed of two modules: (1) dynamic instance segmentation and (2) matrix non-maximum suppression (MNMS). The dynamic instance segmentation scheme segments objects by location by learning the mask kernels and mask features separately. The mask kernels are predicted dynamically by a fully convolutional network (FCN) [82] that classifies pixels into different location categories, while a unified mask feature representation is constructed for instance-aware segmentation. Non-maximum suppression is performed with a parallel matrix operation in one shot to reduce inference overhead and suppress duplicate predictions. Compared to the widely adopted multi-class NMS [83], where sequential and recursive operations result in non-negligible latency, the parallel matrix formulation achieves similar performance with much lower latency. This parallel processing strategy performs MNMS on-the-fly and enables processing at a high frame rate (34 frames per second). For more details, we refer the reader to [80].
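A compact sketch of matrix NMS, adapted from the pseudo-code given in the SoloV2 paper [80], is shown below; the Gaussian decay with \(\sigma = 0.5\) is one of the two variants described there.

```python
import torch

def matrix_nms(scores, masks, method="gauss", sigma=0.5):
    """Rescore overlapping instance masks with one parallel matrix operation.

    scores: (N,) mask confidence scores, sorted in descending order.
    masks:  (N, H, W) binary masks corresponding to the scores.
    Returns decayed scores; near-duplicate masks receive strongly reduced scores.
    """
    n = scores.shape[0]
    flat = masks.reshape(n, -1).float()
    inter = flat @ flat.T                                  # pairwise intersections
    areas = flat.sum(dim=1).expand(n, n)
    ious = (inter / (areas + areas.T - inter)).triu(diagonal=1)
    iou_cmax = ious.max(dim=0).values.expand(n, n).T       # max IoU with any higher-scored mask
    if method == "gauss":
        decay = torch.exp(-(ious ** 2 - iou_cmax ** 2) / sigma)
    else:                                                  # linear decay variant
        decay = (1 - ious) / (1 - iou_cmax)
    return scores * decay.min(dim=0).values
```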

3.5 Rotating bounding box

From each instance mask predicted in the previous stage, we produce a rotated 2D object bounding box. The minimum bounding rectangle (MBR) technique is used to obtain a rotated bounding box from the binary mask of the object. We used OpenCV [84] to find the minimum-area rotated rectangle, which returns a Box2D structure containing (centre (x, y), (width, height), angle of rotation). The output of this step is used to track the objects, as discussed in the following section.
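A minimal OpenCV sketch of this step is given below; extracting the largest contour from the binary mask before fitting the rectangle is an implementation detail assumed here.

```python
import cv2
import numpy as np

def rotated_box_from_mask(mask: np.ndarray):
    """Fit a minimum-area rotated rectangle to the largest blob of a binary mask."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    rect = cv2.minAreaRect(largest)            # Box2D: (centre (x, y), (width, height), angle)
    corners = cv2.boxPoints(rect).astype(int)  # 4 corner points, e.g. for drawing the box
    return rect, corners
```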

3.6 Online tracking

We used simple online and real-time tracking (SORT) [85] as an online tracking framework that focuses on frame-to-frame prediction and association. Only the position and size of the bounding box are used for motion estimation and data association. A Kalman filter [86] handles motion estimation, and the Hungarian method [87] is used for data association.

Motion estimation is used to propagate a target’s identity into the next frame. The inter-frame displacement of each object is approximated with a linear constant-velocity model. The detected bounding box is used to update the target state, where the Kalman filter [86] solves for the velocity components. The state of each target is estimated as:

$$\begin{aligned} x = [h, v, s, r, {\hat{h}}, {\hat{v}}, {\hat{s}}]^T, \end{aligned}$$

where h and v represent the horizontal and vertical pixel location of the centre of the target, while s and r represent the scale and the aspect ratio of the target’s bounding box, respectively. Here, \({\hat{h}}, {\hat{v}}, {\hat{s}}\) denote the respective rates of change (velocities) of h, v, and s.

Data association assigns new detections to existing targets. Each target’s bounding box is estimated by predicting its new location in the current frame. The intersection-over-union (IoU) distance between each detection and each forecasted bounding box of the existing targets is used to build the assignment cost matrix. The assignment cost matrix is then solved using the Hungarian method [87] to produce the fish trajectories shown in Fig. 7.
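The association step can be sketched as follows with SciPy's Hungarian solver; axis-aligned (x1, y1, x2, y2) boxes and the 0.3 IoU gate are assumptions of this sketch rather than details stated above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(dets, trks):
    """Pairwise IoU between detected boxes and predicted track boxes, both (x1, y1, x2, y2)."""
    ious = np.zeros((len(dets), len(trks)))
    for i, d in enumerate(dets):
        for j, t in enumerate(trks):
            x1, y1 = max(d[0], t[0]), max(d[1], t[1])
            x2, y2 = min(d[2], t[2]), min(d[3], t[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((d[2] - d[0]) * (d[3] - d[1])
                     + (t[2] - t[0]) * (t[3] - t[1]) - inter)
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious

def associate(dets, trks, iou_gate=0.3):
    """Hungarian assignment on the negated IoU cost matrix; weak matches are rejected."""
    cost = -iou_matrix(dets, trks)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= iou_gate]
```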

4 Method

This section describes our method in detail. Our method is based on three main components: pseudo-label generation, the unsupervised learning method that refines the generated pseudo-labels, and the DNN for fish tracking and segmentation. Figure 2 shows the algorithm flow diagram of the fish tracking and segmentation framework.

4.1 Datasets

We performed experiments using four publicly available datasets, i.e. Seagrass [23], DeepFish [24], YouTube-VOS [75], and Mediterranean Fish Species [76]. Figure 3 demonstrates a sample image from each dataset.

Seagrass [23] comprises annotated footage of Girella tricuspidata in two estuary systems in south-east Queensland, Australia. The raw data were obtained using submerged action cameras (HD 1080p). The dataset includes 4280 video frames and 9429 annotations. Each annotation includes a segmentation mask that outlines the species as a polygon.

DeepFish [24] consists of a large number of videos collected from 20 different habitats in remote coastal marine environments of tropical Australia. The video clips were captured in full HD resolution (\(1920 \times 1080\) pixels) using a digital camera. In total, the number of video frames taken is about 40k.

YouTube-VOS [75] is a video object segmentation dataset that contains 4453 YouTube video clips and 94 object categories. The videos have pixel-level ground truth annotations for every 5th frame (6fps). For a fair comparison, we extracted only the videos that contained fish, which include 130 video clips and 4349 video frames in total.

Mediterranean Fish Species [76] consists of a large number of images collected from 20 different Mediterranean fish species. In total, the dataset contains about 40k images, split into a training set of 34k images and a test set of 6k images. The image resolution ranges from \(220\times 210\) pixels to \(1920 \times 1080\) pixels. The original images are stored in RGB format, organised in subfolders by class label.

We train our feature extractor on all four datasets and evaluate it only on the video datasets: Seagrass [23], DeepFish [24], and YouTube-VOS [75].

4.2 Pseudo-labelling

To train our supervised model, which is explained in Sect. 3.4, we first generate pseudo-labels for the image dataset, Mediterranean Fish Species [76] and the video datasets, Seagrass [23], DeepFish [24], and YouTube-VOS [75].

4.2.1 Image dataset

Since our image dataset [76] is curated from static images of different fish species, the framework discussed in Sect. 3 was not applicable to it. Therefore, we used DeepUSPS [77], an unsupervised saliency prediction network, for pseudo-label generation. DeepUSPS is trained on the unlabelled MSRA-B dataset [88] to predict salient objects. It is an unsupervised learning method that produces pseudo-labels with high intra-class variation, which is useful for training the supervised model.

However, DeepUSPS only produces reliable pseudo-labels for a single object per image that is not obscured by intricate background detail, which makes it unsuitable for the more challenging video datasets [23, 24, 75].

4.2.2 Video datasets

Unlike our image dataset, our video datasets contain multiple objects in a single frame as well as across multiple frames. Therefore, we adapted our pseudo-label generation framework discussed in Sect. 3, which is capable of predicting multiple salient objects in the same video clip and handling cluttered backgrounds. This framework addresses the limitation of single-image datasets by generating more pseudo-labels with intra-class variation in image space.

The pseudo-label generation framework consists of three steps:

  1. Obtain salient objects by performing background subtraction using adaptive Gaussian thresholding [73], as explained in Sect. 3.1.

  2. Enhance the obtained salient object boundaries from the previous step with optical flow using RAFT [74], as explained in Sect. 3.2.

  3. Apply cumulative pseudo-label refinement via unsupervised historical moving averages (MVA) [77], as explained in Sect. 3.3.

In this way, we can get pseudo-labels for video datasets, Seagrass [23], DeepFish [24], and YouTube-VOS [75], which are used to train the supervised model.

4.3 Model training

Our models were trained with an input resolution of \(256 \times 256\) pixels. We scale the shorter side of the video frames to 256 and then extract random crops of size \(256 \times 256\). We sample two video sets, \(B = 2\) (each of \(T = 5\) frames); therefore, \(B \times T = 2 \times 5 = 10\) frames are used per forward pass.

We found that for this problem set, a learning rate of \(1 \times 10^{-3}\) works best. It took around 300 epochs for all models to train. Our networks were trained on a Linux host with a single NVidia GeForce RTX 2080 Ti GPU with 11 GB of memory, using the PyTorch framework [89]. We used the stochastic gradient descent (SGD) optimiser [90] with an initial learning rate of 0.01, which is then divided by 10 at the 27th and again at the 33rd epoch. We use light augmentation (resizing, greyscale). Following [80, 91], scale jitter is used, where the shorter image side is randomly sampled from 640 to 800 pixels.
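The SGD step schedule stated above corresponds to a standard MultiStepLR setup in PyTorch; the snippet below is a sketch with a placeholder model, and the momentum value is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder for the SoloV2-based segmenter
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[27, 33], gamma=0.1)

for epoch in range(300):     # roughly 300 epochs, as described above
    # ... one pass over the pseudo-labelled training batches would go here ...
    scheduler.step()         # lr: 0.01 -> 0.001 at epoch 27 -> 0.0001 at epoch 33
```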

We applied the same hyperparameter configuration for all of the models. However, the optimum model configuration will depend on the application, hence, these results are not intended to represent a complete search of model configurations.

4.4 Inference

During tracking, we extract frames from the input video, forward each frame through the network, and obtain the fish category score from the classification branch. To filter out predictions with low confidence, we first use a threshold of 0.1 and perform convolution on the mask features using the corresponding predicted mask kernels. Then, after applying a per-pixel sigmoid, we binarise the output of the mask branch at a threshold of 0.5. The final step is matrix NMS, which fits the output masks with min-max boxes.
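A simplified post-processing sketch with the thresholds quoted above is shown below; the tensor names and shapes are assumptions, and the matrix NMS step refers to the sketch in Sect. 3.4.

```python
import torch

def postprocess(cls_scores, mask_logits, score_thresh=0.1, mask_thresh=0.5):
    """Filter and binarise SoloV2-style outputs (thresholds from the text above).

    cls_scores:  (N,) per-instance fish category scores (assumed shape).
    mask_logits: (N, H, W) raw mask outputs before the sigmoid (assumed shape).
    """
    keep = cls_scores > score_thresh            # drop low-confidence predictions
    scores = cls_scores[keep]
    masks = torch.sigmoid(mask_logits[keep])    # per-pixel probabilities
    binary_masks = masks > mask_thresh          # binarise at 0.5
    # Matrix NMS (see the sketch in Sect. 3.4) would then rescore `scores`
    # using `binary_masks` before the final min-max box fitting.
    return scores, binary_masks
```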

Our model operates online without any adaptation to the video sequence. On a single NVidia GeForce RTX 2080 Ti GPU, we measured an average speed of 34 frames per second.

5 Experiments

We report experimental results for our model trained on 50% of the DeepFish, Seagrass, and YouTube-VOS datasets and on the training set of the Mediterranean Fish Species dataset. We then evaluated it on the remaining 50% of the first three datasets. We provide quantitative and qualitative results that demonstrate our model’s generalisation capabilities across a range of different underwater habitats.

5.1 Results

We summarise our main results on Seagrass [23], DeepFish [24] and YouTube-VOS [75] datasets in Table 1. The quantitative results for all datasets were obtained using the COCO dataset [92] evaluation script. The average precision (AP), the average recall (AR), and intersection over union (IoU) were measured for the predicted bounding boxes and segmentation masks in the output images obtained from the trained SoloV2 [80], as explained in Sect. 3.4 in detail.
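The numbers in Table 1 can be reproduced with the standard pycocotools evaluation API, sketched below; the annotation and prediction file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth_annotations.json")       # placeholder annotation file
coco_dt = coco_gt.loadRes("model_predictions.json")   # placeholder detection/mask results

for iou_type in ("bbox", "segm"):                     # detection and segmentation, as in Table 1
    evaluator = COCOeval(coco_gt, coco_dt, iouType=iou_type)
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # reports AP, AP50, AP75, AP_M, AP_L, AR_M, AR_L, etc.
```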

The average precision (AP) and average recall (AR) metrics provide a comprehensive view of the model’s performance. The values \(AP^{.50}\) and \(AP^{.75}\) indicate that the model has a high precision rate at intersection over union (IoU) thresholds of 0.5 and 0.75, respectively. This means that the model accurately predicts the bounding boxes and segmentation masks for a majority of the objects in the images. The values \(AP^{M}\) and \(AR^{M}\) show that the model maintains its precision and recall for medium-sized objects across a range of IoU thresholds, indicating its robustness to variations in object size and shape. The \(AP^{L}\) and \(AR^{L}\) values specifically measure the model’s performance on large objects. These metrics are particularly important in our case, as they reflect the model’s ability to accurately segment and track larger fish species.

The results across different datasets demonstrate that our model is capable of generalising well to unseen videos in other environments. This is a significant achievement, as it suggests that our approach could be applied to a wide range of underwater video data.

To the best of our knowledge, no prior research has reported detection and segmentation evaluations on these datasets. To compare our proposed unsupervised method to a supervised approach, we present the results of SoloV2 [80] on the three datasets in Table 2. This table displays the results of a fully supervised model trained with the original labels, not our generated pseudo-labels.

In both tables, Tables 1 and 2, higher values are better because they indicate that the model’s predictions are more accurate. From these tables, we can see that both unsupervised and supervised methods perform well across all three datasets, with some variation in performance depending on the specific dataset and whether detection or segmentation was being evaluated.

For example, in Table 1 (unsupervised method), we can see that the model performs best on the DeepFish dataset in terms of segmentation (\(AP^{M} = 31.2\), \(AR^{M} = 56.6\)), but struggles more with detection on this dataset (\(AP^{M} = 11.7\), \(AR^{M} = 34.5\)).

In contrast, in Table 2 (supervised method), we can see that although performance generally improves across all metrics compared to the unsupervised method, there are still some challenges with certain datasets—for example, detection on the DeepFish dataset (\(AP^{M} = 12.2\), \(AR^{M} = 41.0\)).

Our proposed unsupervised method yields accuracy close to that of the original supervised SoloV2 [80] in both detection and segmentation experiments, validating the effectiveness of our pseudo-label generation approach. Furthermore, our results suggest that the proposed model is not heavily affected by different underwater habitats, with broadly similar performance on the DeepFish [24] and Seagrass [23] datasets. The latter is particularly challenging due to the difficulty of visually detecting the fish. In some cases, the proposed model is not as good as fully supervised approaches. However, the primary objective of this study is the development of an unsupervised method for fish tracking and segmentation. We postulate that our proposed approach offers enhanced stability during training compared to other unsupervised methods that lack a dedicated pseudo-label generation step. This stability, coupled with the robust performance of our method across diverse datasets, underscores its potential for further refinement and application in this domain.

Table 1 Comparison of *unsupervised* detection and segmentation on Seagrass [23], DeepFish [24] and YouTube-VOS [75] datasets

The qualitative results of our algorithm for the DeepFish [24], Seagrass [23] and YouTube-VOS [75] datasets are illustrated in Figs. 8, 9 and 10, respectively. Additional examples of failure cases are provided in Fig. 11.

Despite the challenges posed by fast movements and complex, crowded backgrounds, which often result in significant distortion, our algorithm produces favourable outcomes for the majority of images. This is particularly noteworthy for non-rigid objects.

For a more dynamic view of our model’s predictions, you can watch a short video at this link https://youtu.be/Z5G7YBoL3eM. The video showcases the performance of our model in various scenarios, further demonstrating its effectiveness.

Table 2 Comparison of *supervised* detection and segmentation on Seagrass [23], DeepFish [24] and YouTube-VOS [75] datasets
Fig. 8

Sample images from our model results for DeepFish [24]; from left, the original image, the ground truth, the predicted image

Fig. 9

Sample images from our model results for Seagrass [23]; from left, the original image, the ground truth, the predicted image

Fig. 10

Sample images from our model results for YouTube-VOS [75]; from left, the original image, the ground truth, the predicted image

Fig. 11

Sample of the failure cases of our model. From the left, the original image, the ground truth mask overlay, and the predicted image. Images show instances where the model struggles with heavy occlusion, variability in fish size and shape, segmentation of foreground items, and influence of training videos. These scenarios highlight the limitations of our current approach and provide directions for future improvements

5.2 Ablation study

We performed an ablation study to demonstrate the proposed approach’s effectiveness in generating pseudo-labels. Specifically, we analysed the contribution of the vital component of the proposed method, the optical flow with background subtraction (Sect. 3.2). In addition, we evaluated the segmentation network trained with refined pseudo-labels (Sect. 3.4) for different numbers of epochs. The results reported in Table 3 are for unsupervised segmentation based on optical flow without background subtraction as a baseline, and the results reported in Table 4 are for the four epoch settings with the same random seeds; see Sect. 4.3 for details.

The metrics used in the ablation study are as follows:

  1. \(AP^{M}\) (average precision for medium objects): the average precision for medium-sized objects. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The higher the \(AP^{M}\), the better the model is at predicting medium-sized objects correctly.

  2. \(AP^{.50}\) and \(AP^{.75}\): the average precision values at different intersection over union (IoU) thresholds. IoU is a measure of overlap between two bounding boxes. \(AP^{.50}\) is the average precision when IoU is 0.50, and \(AP^{.75}\) is the average precision when IoU is 0.75. Higher values indicate better precision at these IoU thresholds.

  3. \(AP^{L}\) (average precision for large objects): the average precision for large-sized objects. Like \(AP^{M}\), a higher \(AP^{L}\) indicates that the model is better at predicting large-sized objects correctly.

  4. \(AR^{M}\) (average recall for medium objects): the average recall for medium-sized objects. Recall is the ratio of correctly predicted positive observations to all actual positives. The higher the \(AR^{M}\), the better the model is at identifying all actual medium-sized objects.

  5. \(AR^{L}\) (average recall for large objects): the average recall for large-sized objects. Like \(AR^{M}\), a higher \(AR^{L}\) indicates that the model is better at identifying all actual large-sized objects.

In all these metrics, higher values are better because they indicate that the model’s predictions are more accurate.

It is apparent from the results that the segmentation accuracy of our proposed method improves significantly compared to that of the baseline method. We also note that the accuracy of the models depends on the number of epochs used in training. We observe from the results in Table 4 that the segmentation accuracy decreases after 100 epochs, owing to over-fitting of the network to the noisy pseudo-labels. While the training losses for both the baseline and our model decreased, the segmentation accuracy of our model remained greater than that of the baseline.

Table 3 Comparison of *unsupervised* segmentation based on optical flow without background subtraction
Table 4 Comparison of *unsupervised* segmentation for different epochs: 50, 100, 150, 300

5.3 Failure cases

While our model has shown promising results, there are specific scenarios where it fails to perform optimally.

  • Occlusion: Our model’s performance degrades when several fish are heavily occluded. While it can estimate the fish mask in some parts as long as they are part of the animal body, it struggles when the occlusion is severe, see Fig. 11.

  • Variability in fish size and shape: The large variability in the size and shape of fish presents a challenge for our model. It can identify a certain shape of fish, but determining the number of fish in an image remains a difficult task.

  • Segmentation of foreground items: Given a set of unlabelled video collections, our model is only capable of segmenting foreground items and cannot distinguish between distinct object instances or semantic classes. Occasionally, the whole object or parts of the object may not be segmented out.

  • Influence of training videos: Our model’s performance is highly influenced by the characteristics of training videos, the coverage of object categories, and the motion of both the camera and the objects. This is similar to other data-driven learning techniques.

These failure cases provide valuable insights for future improvements to our model.

6 Discussion

Fish segmentation and tracking are notoriously difficult tasks, especially for small fish in video data where the background, lighting conditions and fish shape can vary significantly. In particular, for real data, the quality of ground truth labels varies from video to video, since it is difficult to annotate the animal’s entire path. Therefore, our model aims to generate a pseudo-ground truth by leveraging temporal consistency between frames and improving its quality based on self-supervised learning. The key to our proposed model is to leverage the intrinsic temporal consistency between consecutive frames by using the optical flow and background subtraction method to improve the generated labels. This is especially important when the fish is moving quickly and not in the same location in consecutive frames, as is the case in natural data. Tracking fish in video data is also challenging because their motion is very irregular and small fish may not be visible throughout the entire dataset. The other problem is that segmentation and tracking are time-consuming tasks, especially when dealing with large datasets.

Our model outperforms the baseline method (optical flow without background subtraction), with higher AP values in most cases. Our approach utilises temporal consistency to produce consistent labels. In the case of the DeepFish dataset [24], we observed that our proposed unsupervised model achieves higher accuracy than on the Seagrass dataset [23]. This is mainly due to the more challenging videos in the Seagrass dataset [23] compared to the DeepFish videos [24]. Furthermore, our model achieves similar accuracy across the different video datasets. Therefore, we expect accuracy to be similar when the model is tested under the same conditions on new underwater video datasets.

In addition, segmentation accuracy does not degrade after supervised training, and training converges in only a few epochs, as shown in Table 4. In our experiments, we found that segmentation quality has a significant impact on tracking performance, because the quality of the produced object bounding boxes strongly affects tracking. Even in this case, we still achieved decent results.

We also analysed the robustness of our proposed model with respect to the environmental conditions. We observed degradation of the model’s performance when several fish were heavily occluded, like in Fig. 11. However, our proposed model is still able to estimate the fish mask in some parts as long as they are part of the animal body. One of the main challenges in this task is the large variability in the size and shape of fish, as well as the variation in the shape of the fish’s body. Although it is possible to identify a certain shape of fish, it is not always possible to determine the number of fish in the image.

Given a set of unlabelled video collections, the main limitation of our study is that it is only capable of segmenting foreground items and cannot distinguish between distinct object instances or semantic classes. Occasionally, the entire object or parts of the object may not be segmented out. The performance of our model is highly influenced by the characteristics of training videos, the coverage of object categories, and the motion of both the camera and the objects, similar to other data-driven learning techniques. Our results are based on a few assumptions. One is that a small subset of semantically similar objects (e.g. all fish) exists in the scene, and these objects are likely to share the same motion feature or to be semantically similar. These assumptions are reasonable if the objects are within a certain size range, they all belong to the same class, and most of them share similar colours, shapes, and sizes. Another limitation of our approach is that we used a relatively large number of videos with a relatively small number of object categories (for instance compared to ImageNet). This allows our model to segment objects of all shapes and colours with only a handful of training examples.

One other limitation of our current framework is that in some cases, it is unable to detect all the objects that appear in the video. In future work, we intend to study how to develop a detection-based model that is able to detect all objects appearing in a given scene. Therefore, in the next step, we should look for a more robust and generic objectness model that is able to generalise across a variety of object categories and a variety of background types. Further work could be conducted on more fine-grained object segmentation, especially with new video datasets.

7 Conclusion

In this study, we introduced an innovative unsupervised methodology for the segmentation and tracking of fish in uncontrolled video environments. Our approach leverages a pseudo-label generation method that combines optical flow with background subtraction, followed by an unsupervised refinement network. This method has proven to yield accurate segmentation results when used to train a supervised deep neural network (DNN) for segmentation. Furthermore, our approach has shown its efficacy in tracking applications.

Our methodology was rigorously tested on three challenging datasets, with the results indicating its robustness and adaptability across different scenarios. This suggests that our approach could serve as a valuable tool for researchers and conservationists working with video data in aquatic environments.

Future research directions include extending our methodology to encompass other aquatic species. This extension, however, would necessitate further adaptations to account for the unique movement patterns and physical characteristics of these species. Another promising avenue for future research is the application of our model to autonomous driving systems for tracking-by-detection. Although this application would present additional challenges, such as dealing with faster-moving objects and more complex backgrounds, we believe the core principles of our approach remain applicable.

In conclusion, our study contributes to a significant advancement in the field of video processing for fish behaviour analysis. The proposed methodology not only enhances our ability to study fish behaviour, but also has potential implications for conservation efforts by providing more accurate data on fish populations and movements. Despite the limitations and challenges, we believe that our work lays a solid foundation for future research in this area.