1 Introduction

The Earth’s surface is covered by more than \(70\%\) ocean, which includes numerous biological species and natural resources. Human exploration of the ocean has been ongoing, and with the development of modern exploration technology, such as underwater robots, many previously unknown areas of the ocean have been uncovered. However, only \(5\%\) of the ocean floor has been explored so far, with \(95\%\) remaining unknown due to the vastness of the ocean. Currently, artificial intelligence has empowered deep learning-based marine image analysis, which has become a popular research topic. Various aspects of marine analysis, such as marine object detection [1, 2], and marine animal segmentation [3], have made significant progress. However, when a well-trained model encounters a "new class" or "different knowledge", it tends to misclassify the objects. In other words, the model assigns them to pre-defined categories [4]. Unknown Marine Objects (UMOs) with unknown categories frequently appear in real ocean scenes, making it challenging not only to label many known categories [5] but also to identify the locations of UMOs. Traditional detection or segmentation models, such as Mask RCNN [6], SOLO [7], and YOLOX [8], are unable to handle these "unknown classes". In practical systems, for the sake of performance and safety, it is crucial to make every effort to reject unknown objects to prevent irreparable losses caused by classification errors, such as misidentifying a peculiar-looking branch as an underwater robot. To address the problem of object instance segmentation of marine environment objects with UMO, an open instance segmentation model with prototype learning is proposed. This model aims to improve the misclassification issue encountered by traditional closed-set models when they encounter unknown objects. 
In the proposed model, we first separate known objects of different categories as far as possible in the feature space while minimizing the feature differences within each class, which yields a robust closed-set classifier. On this basis, the unknown probability is then learned from low-scoring samples. Specifically, we integrate a prototype module and an unknown learning module into the Mask RCNN model, which gives the model the capability for open-set detection. The advantage of using prototypes is that they can improve closed-set classification accuracy and make it possible to identify unknown samples [9]. With the prototype module, known classes become more compact in the feature space, while the unknown learning module optimizes the uncertainty of low-probability samples within the classifier. At test time, the unknown probability of an instance determines whether it is detected as an unknown object. To validate the effectiveness of our method and assess the impact of each module, we evaluate closed-set and open-set detection on the Trashcan dataset and the CH-DUTUSEG dataset. The results on both datasets show clear improvements in the open-set metrics while closed-set accuracy is maintained, and our model reduces the error of taking an unknown class as a known class. The main contributions of this paper are as follows:

  • This study introduces prototype learning into the open set instance segmentation model, thereby enhancing the accuracy of the model.

  • An unknown learning module is incorporated, and the unknown boundary is optimized by training on low-scoring samples, thereby enhancing the model’s capacity for identifying unknown objects.

  • Given the limited availability of marine life datasets, we extract samples from the existing marine dataset DUT-USEG and curate a novel dataset called CH-DUTUSEG for the purpose of model validation.

This article is structured as follows: In Sect. 2, we review related work. In Sect. 3, we provide a detailed introduction to the prototype module and unknown learning module. Section 4 discusses the experimental details and main results. Finally, in Sect. 5, we summarize our work.

2 Related work

Application of Deep Learning in Ocean Target Analysis. With the development of deep learning technology, numerous scholars have investigated its application in the marine field. In 2020, Tseng et al. [10] used a CNN to measure fish body length automatically. Siddiqui et al. [11] proposed a deep learning-based visual method for fine-grained fish classification. Reus et al. [12] presented a machine learning approach that uses a CNN to estimate seagrass coverage by describing seagrass patches and superpixels. Ma et al. [13] utilized a fusion algorithm to collect and integrate face image resources from videos, trained a face recognition model using R-CNN, and developed an application platform for crew face recognition and positioning analysis on ships. Wang et al. [14] provided an overview of recent developments in marine biological identification and a detailed analysis of the benefits and drawbacks of deep learning in this area. To address the issue of underwater image degradation, Chen et al. [15] proposed an underwater scene semantic segmentation network (USSSN), which can minimize artifacts and preserve the integrity of foreground objects while enhancing images. These examples indicate the growing maturity of deep learning models in the marine field.

Instance Segmentation Model. The fundamental idea of instance segmentation involves first detecting instances in an image and then generating a segmentation mask for each detected instance. Among these methods, Mask RCNN [6] evolved from Fast RCNN [16] by incorporating a mask branch into the object detection network to predict instance segmentation results. PointRend [17] treats instance segmentation as a rendering problem in image processing, producing better mask results than Mask RCNN. Another approach to two-stage instance segmentation is to perform pixel-level semantic segmentation first, followed by grouping pixels into instances through clustering and other post-processing techniques [18]. Influenced by the research on single-stage object detection, single-stage instance segmentation models have also been explored. YOLACT [19] uses different layers to generate mask coefficients and prototype masks, maintaining spatial consistency and near real-time speed. Using the concept of class activation maps to build instance activation maps and sparsify the corresponding connections, Cheng et al. [20] introduced a novel instance segmentation technique in 2022. However, these closed-set instance segmentation algorithms typically require strong supervision and struggle to reject unknown objects.

Open Set Recognition. Open Set Recognition (OSR) aims to overcome the limitations of closed-set models in real-world situations. OSR models are classified as either generative or discriminative depending on the modeling form [21]. Among discriminative models, SVMs [22, 23] were first used to minimize the open-set risk and optimize the space occupied by unknown classes. With the development of deep learning, Zheng et al. [24] took advantage of the RPN’s insensitivity to categories, treated high-confidence candidate boxes without label information as open-set objects, and then discriminated among open-set objects by clustering. Through contrastive learning and incremental learning, Joseph et al. [25] introduced the task of open-world object detection and achieved open-set object detection. Among generative models, Neal et al. [26] used GANs to expand the training set by generating synthetic open-set samples for model training. However, there are still practical differences between the generated samples and real open-set samples. Prototype learning is also widely used in open-set recognition: Yang et al. [9] first applied prototype learning to convolutional networks and showed that integrating prototypes improves the robustness of closed-set classification and makes it possible to identify unknown samples. Lu et al. [27] proposed a new framework for prototype mining and learning, and performed open-set recognition after considering the multiple attributes of prototype sets.

3 Methodology

3.1 Preparations

Our model design is based on two basic premises: (1) the real ocean scene is full of "unknown" possibilities, so open-set recognition is suitable for ocean scenes; (2) when faced with "unknown" objects, a traditional model will misclassify them, as shown in Fig. 1. We use \(D=\{(x,y), x\in X, y\in Y\}\) to represent the scene dataset, where x represents a sample instance and \(y=(c,b,m)\) represents the label of this instance, including category c, detection box b, and segmentation mask m. To reflect the complexity of the ocean scene and the open-set conditions encountered at test time, we train our model on \(D_{train}\), which contains only the K known classes, expressed as \(C_{K}=\{1,...,K\}\). We test our model on \(D_{test}\), which contains the known classes \(C_{K}\) as well as other classes \(C_{U}\) that did not appear in training; these are collectively labeled \(C_{U}=K+1\). Our goal is to make the model detect not only the known classes in \(D_{test}\) but also the locations of unknown classes, thus reducing the probability of misclassification.

At the same time, we consider that a single image may contain samples of both the known classes \(C_{K}\) and the unknown class \(C_{U}\) [28]. We therefore make the following preparation: we avoid unknown objects in training as far as possible, so that the model can better distinguish the background class \(C_B\) from the unknown class \(C_U\).
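The data preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the annotation keys `image_id` and `category`, the function name `build_open_set_split`, and the choice to drop every training image that contains a non-known instance are all hypothetical.

```python
from collections import defaultdict


def build_open_set_split(train_anns, test_anns, known_categories):
    """Relabel annotations for open-set evaluation.

    Each annotation is a dict with the (hypothetical) keys "image_id" and
    "category". Known categories are remapped to 1..K in the given order;
    every other category in the test set becomes the single label K+1.
    """
    remap = {cat: idx + 1 for idx, cat in enumerate(known_categories)}
    unknown_id = len(known_categories) + 1

    # Group training annotations by image so whole images can be dropped.
    by_image = defaultdict(list)
    for ann in train_anns:
        by_image[ann["image_id"]].append(ann)

    train = []
    for image_id, anns in by_image.items():
        # Assumption: an image with any non-known instance is excluded,
        # so unknown objects never appear (unlabeled) during training.
        if all(a["category"] in remap for a in anns):
            train.extend(
                {"image_id": image_id, "category": remap[a["category"]]}
                for a in anns
            )

    # Test images are kept as they are; unseen categories collapse to K+1.
    test = [
        {"image_id": a["image_id"],
         "category": remap.get(a["category"], unknown_id)}
        for a in test_anns
    ]
    return train, test
```

With K known categories, the returned test labels lie in 1..K+1, matching the notation \(C_{K}\) and \(C_{U}\) used in this section.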

Fig. 1 Incorrect classification. (a) A branch is classified as an eel; (b) a branch is classified as a remotely operated vehicle (ROV); (c) a plastic bag is classified as a fish

3.2 Model architecture

Considering the high accuracy and robustness of the Mask RCNN model, we use it as our baseline architecture. During baseline training, we found that the baseline tends to classify unknown objects as background or as known classes with low scores. This shows that although the classic model has some potential for rejection, it still misclassifies unknown objects because it cannot sufficiently separate the features of unknown classes. To enhance feature separation and unknown identification, we add a prototype module and an unknown learning module to the base network, as shown in Fig. 3.

3.3 Prototype module

In this section, we introduce the prototype module, a feature learning module that uses learnable prototypes to make different categories more separated and the same category more compact, as shown in Fig. 2. We classify features according to their distance scores from different prototypes. We use \(m_i\) to represent a prototype, where \(i\in \{1,2,...,K\}\) is the index of the known class corresponding to the prototype. Quantitatively, we use the Euclidean distance between a feature and each prototype to measure the probability score. The Euclidean distance is:

$$\begin{aligned} d(f(x), m_i) =\Vert f(x)-m_{i}\Vert _{2} \end{aligned}$$
(1)

Here, f(x) represents the features extracted by the earlier stages of the network, and \(d(f(x),m_i)\) represents the Euclidean distance from the sample features to the corresponding prototype. As shown in Fig. 2, during training each feature should be as close as possible to the prototype of its own class, hence we define \(loss_{d}\) as:

$$\begin{aligned} loss_{d}=\frac{1}{2N}\sum _{j=1}^{N}d(f_{j}, m_{c_{j}})^{2} \end{aligned}$$
(2)

where N is the total number of features, \(f_{j}\) is the j-th feature, and \(c_{j}\) is its class label, so that minimizing \(loss_{d}\) pulls each feature toward its own prototype. At the same time, we introduce a classification loss to strengthen the model’s robustness and improve the separation of the prototypes. The classification loss assigns labels based on the distance between each feature and the prototypes, which helps stabilize the features during training, as shown in Fig. 2. The squared Euclidean distance between each feature and each category prototype is computed to obtain a distance matrix D:

$$\begin{aligned} D_{ij}=-d(f_{j}, m_i)^{2} \end{aligned}$$
(3)

where \(i\in \{0,1,...,K\}\) and \(j\in \{1,2,...,N\}\); the index \(i=0\) denotes a background class prototype, which is kept to filter out negative samples. With \(Y_{ij}\) denoting the one-hot label (1 if feature j belongs to class i, and 0 otherwise), the cross-entropy loss is then applied on D, giving \(loss_{1}\):

$$\begin{aligned} loss_{1}=-\frac{1}{N}\sum _{j=1}^{N}\sum _{i=0}^{K}Y_{ij}\log \Big (\frac{\exp (D_{ij})}{\sum _{l=0}^{K}\exp (D_{lj})}\Big ) \end{aligned}$$
(4)
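The distance matrix of Eq. (3) and the distance-based cross-entropy of Eq. (4) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, row 0 of `prototypes` is assumed to be the background prototype, and `labels` is assumed to hold integer class indices in 0..K.

```python
import numpy as np


def neg_sq_distances(features, prototypes):
    """Return D with D[i, j] = -||f_j - m_i||^2 (Eq. 3).

    features:   array of shape (N, d)
    prototypes: array of shape (K+1, d); row 0 is assumed to be background
    """
    features = np.asarray(features, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    diff = prototypes[:, None, :] - features[None, :, :]  # (K+1, N, d)
    return -np.sum(diff ** 2, axis=-1)


def distance_cross_entropy(D, labels):
    """Cross-entropy over the softmax of D along the class axis (Eq. 4)."""
    labels = np.asarray(labels)
    # Subtract the column-wise maximum for numerical stability; the
    # softmax is unchanged by this shift.
    shifted = D - D.max(axis=0, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick the log-probability of the true class for every feature.
    return -log_prob[labels, np.arange(D.shape[1])].mean()
```

Because the logits are negative squared distances, a feature that sits on its own prototype and far from the others receives a loss close to zero, which is the separation effect the classification loss is meant to encourage.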

We also consider the impact of atypical points. Figure 2 illustrates that some feature points can lie rather far from their associated prototype, which could result in incorrect classification. Therefore, to penalize the incorrect classification of such boundary samples, we include a prototype region module. We select M low-scoring foreground and background samples (weak samples) and apply the same distance-based cross-entropy loss to them, where \(Y_{ij}\) is 1 if weak sample j belongs to class i and 0 otherwise:

$$\begin{aligned} loss_{2}=-\frac{1}{M}\sum _{j=1}^{M}\sum _{i=0}^{K}Y_{ij}\log \Big (\frac{\exp (D_{ij})}{\sum _{l=0}^{K}\exp (D_{lj})}\Big ) \end{aligned}$$
(5)

Finally, we define the prototype loss function as:

$$\begin{aligned} loss_{p}=\sigma _{1}*loss_{d}+loss_{1}+\sigma _{2}*loss_{2} \end{aligned}$$
(6)
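To make the combination in Eq. (6) concrete, the three terms can be sketched together in NumPy. Several choices here are assumptions and not taken from the paper: the pull term is taken as the positive mean squared distance to the true-class prototype, so that minimizing it draws features toward their prototypes; the weak samples are taken to be the `num_weak` samples with the lowest true-class probability under the distance softmax; and the function name `prototype_loss` is hypothetical. The defaults for `sigma1` and `sigma2` follow the values reported in Sect. 4.1.

```python
import numpy as np


def prototype_loss(features, prototypes, labels, num_weak,
                   sigma1=0.001, sigma2=0.0001):
    """Sketch of loss_p = sigma1 * loss_d + loss_1 + sigma2 * loss_2.

    features: (N, d); prototypes: (K+1, d); labels: (N,) ints in 0..K.
    num_weak must be at least 1.
    """
    features = np.asarray(features, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    labels = np.asarray(labels)
    n = features.shape[0]
    cols = np.arange(n)

    # D[i, j] = -||f_j - m_i||^2 (Eq. 3).
    D = -np.sum((prototypes[:, None, :] - features[None, :, :]) ** 2, axis=-1)

    # Pull term: mean squared distance to the true-class prototype (assumed sign).
    loss_d = -D[labels, cols].sum() / (2 * n)

    # Per-sample negative log-likelihood under the distance softmax.
    shifted = D - D.max(axis=0, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    nll = -log_prob[labels, cols]
    loss_1 = nll.mean()

    # Assumption: weak samples are those with the highest NLL, i.e. the
    # lowest true-class probability.
    weak = np.argsort(nll)[-num_weak:]
    loss_2 = nll[weak].mean()

    total = sigma1 * loss_d + loss_1 + sigma2 * loss_2
    return total, (loss_d, loss_1, loss_2)
```

In a trainable setting the prototypes would be learnable parameters updated by backpropagation; this sketch only evaluates the loss for fixed arrays.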

It is worth noting that in a closed-set detection environment, we need only determine the category score based on the corresponding distance, as follows:

$$\begin{aligned} s_{i}(x)&\propto -\Vert f(x)-m_{i}\Vert _{2}^{2}\nonumber \\ Y_{x}&=\arg \max _{0\le i\le K}s_{i}(x) \end{aligned}$$
(7)

Here, \(s_{i}(x)\) is the score of the feature f(x) for class i, and \(Y_{x}\) is the label predicted for sample x. In the open-set environment, if we relied only on a distance threshold to tune the model, the model would, like the traditional model, be unable to balance closed-set accuracy against open-set recognition ability. Therefore, we add an unknown learning module to preserve both as far as possible.
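The closed-set rule of Eq. (7), together with the plain distance-threshold rejection that this paragraph argues is insufficient on its own, can be sketched as follows. The function name, the `max_sq_dist` parameter, and the use of -1 to mark rejected samples are hypothetical choices made for illustration.

```python
import numpy as np


def nearest_prototype_labels(features, prototypes, max_sq_dist=None):
    """Assign each feature to its nearest prototype (Eq. 7).

    If max_sq_dist is given, features whose nearest prototype is farther
    than this squared distance are marked -1 (rejected). This is the
    naive threshold baseline, not the unknown learning module.
    """
    features = np.asarray(features, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    sq_dist = np.sum(
        (prototypes[:, None, :] - features[None, :, :]) ** 2, axis=-1
    )  # shape (K+1, N)

    # The smallest distance corresponds to the largest score s_i(x).
    labels = np.argmin(sq_dist, axis=0)
    if max_sq_dist is not None:
        labels = np.where(sq_dist.min(axis=0) > max_sq_dist, -1, labels)
    return labels
```

A single global threshold has to trade rejected unknowns against wrongly rejected known samples, which is the tension the unknown learning module in Sect. 3.4 is designed to ease.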

Fig. 2 Prototype module

3.4 Unknown learning module

This section focuses on how to make the model learn to judge the unknown without any unknown sample data. As noted above, traditional networks tend to classify unknown objects as low-scoring background or low-scoring known categories. From this observation, we believe that the traditional network has some ability to identify unknown objects; however, because it treats the label space as global and closed, it cannot reject them. Therefore, we extend the \(K+1\) classifier to a \(K+2\) classifier (including the unknown probability) in training. Our focus is on learning the uncertainty of low-scoring samples during training.

During training, we select an equal number of low-scoring known samples and background samples as the training data. We define the probability scores of the \(K+2\) classes (including the background and unknown classes) as follows:

$$\begin{aligned} s_{u}&= \exp (soft_{u})/(\sum _{j=0}^{K+1}\exp (soft_{j})-\exp (soft_{c}))\nonumber \\ s_{i}&= \exp (soft_{i})/\sum _{j=0}^{K+1}\exp (soft_{j}),~i\in [0, K] \end{aligned}$$
(8)

Fig. 3 The process of our method

In the above formula, "soft" represents the score assigned by the extended classifier, while "c" denotes the actual label of the training sample. We utilize features from low probability samples that resemble those from unknown samples to establish the boundary for unknown samples. Given this information, the loss function is defined as follows:

$$\begin{aligned} loss_{k}&=-\log (s_{c})\nonumber \\ loss_{u}&=-\gamma *\log (s_{u})\nonumber \\ loss_{ul}&=loss_{k}+\tau *loss_{u} \end{aligned}$$
(9)

where \(\gamma \) is a degree factor that weights the unknown probability relative to the probability of the labels other than the true one, indicating how likely the sample is to be unknown, and \(\tau \) is a balancing coefficient. We set \(\gamma =(1-s)^{T}*s\) to optimize the unknown probability score, where s is the probability of the true label and T is a hyperparameter.
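For a single training sample, Eqs. (8) and (9) can be sketched as below. This is a hedged illustration, not the authors' code: the score vector is assumed to be ordered as background, the K known classes, then unknown (so the unknown score is the last entry), the true label c is assumed to lie in 0..K, and \(\gamma \) is simply evaluated as a number (whether it is excluded from the gradient during training is not stated in the text). The defaults for `tau` and `T` follow Sect. 4.1.

```python
import numpy as np


def unknown_learning_loss(soft, c, tau=0.1, T=1.15):
    """Return (loss_ul, s_c, s_u) for one sample.

    soft: scores of the extended classifier, shape (K+2,), assumed order
          [background, class 1..K, unknown]
    c:    true label of the sample, an int in 0..K
    """
    soft = np.asarray(soft, dtype=float)
    # Shift by the maximum for numerical stability; all ratios below
    # are unchanged by this shift.
    e = np.exp(soft - soft.max())

    s_c = e[c] / e.sum()                 # probability of the true label (Eq. 8)
    s_u = e[-1] / (e.sum() - e[c])       # unknown score, true class removed (Eq. 8)

    gamma = (1.0 - s_c) ** T * s_c       # weighting factor from the text
    loss_k = -np.log(s_c)
    loss_u = -gamma * np.log(s_u)
    return loss_k + tau * loss_u, s_c, s_u
```

As a worked case, with K = 2 and all four scores equal, \(s_{c}=1/4\) and \(s_{u}=1/3\), so \(\gamma =0.75^{1.15}\cdot 0.25\). Since \(\gamma \) vanishes when s approaches 0 or 1, the unknown term has the most influence for samples of intermediate confidence.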

3.5 Detection optimization and prediction

Building on Sects. 3.3 and 3.4, this section outlines our open-set model, as shown in Fig. 3. First, to address feature separation and compactness, we add a prototype module. Second, to enable the model to distinguish the unknown, we add the unknown learning module. We define the loss function of the whole detection part as:

$$\begin{aligned} loss_{dect}=loss_{ul}+loss_{p} \end{aligned}$$
(10)

At prediction time, the object is first judged to be a known class \(i\in \{1,...,K\}\) or the unknown class \(K+1\) by comparing the highest known-class score with the unknown score:

$$\begin{aligned} Y_{x}= \left\{ \begin{aligned}&~i,~~~~~~~~~\max _{1\le l\le K}(soft_{l})>soft_{u}\\&K+1,~~~\max _{1\le l\le K}(soft_{l})\le soft_{u} \end{aligned} \right. \end{aligned}$$
(11)

If it is known, we determine the category label of the known class based on the distance score of the prototype to ensure the highest possible detection accuracy for the known class. The relationship between probability and distance is as follows:

$$\begin{aligned} s_{i}=\rho *\exp (D_{i})/\sum _{l=0}^{K}\exp (D_{l}) \end{aligned}$$
(12)

Here, \(\rho \) represents the degree coefficient of the closed set, which describes the influence of the open set on the score of the closed set. We set \(\rho =1\).
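The two-step prediction of Eqs. (11) and (12) can be sketched for one instance as follows. The sketch rests on several assumptions that are not spelled out in the text: the extended classifier scores are ordered as background, the K known classes, then unknown; ties between the best known score and the unknown score are resolved as unknown; the known label is chosen among the K known-class prototypes only; and background filtering is omitted. The function name is hypothetical.

```python
import numpy as np


def predict_open_set(soft, D_col, rho=1.0):
    """Return (label, score) for one instance.

    soft:  extended classifier scores, shape (K+2,), assumed order
           [background, class 1..K, unknown]
    D_col: negative squared distances to the prototypes
           [background, class 1..K], shape (K+1,)
    The label is in 1..K for known classes and K+1 for unknown
    (score is None in that case).
    """
    soft = np.asarray(soft, dtype=float)
    D_col = np.asarray(D_col, dtype=float)
    K = soft.shape[0] - 2

    # Eq. 11: known only if the best known-class score beats the unknown score.
    if soft[1:K + 1].max() <= soft[-1]:
        return K + 1, None

    # Eq. 12: distance-based probability, scaled by the closed-set coefficient.
    e = np.exp(D_col - D_col.max())
    s = rho * e / e.sum()
    label = 1 + int(np.argmax(s[1:]))  # skip the background prototype
    return label, float(s[label])
```

With \(\rho =1\), as set in the text, the returned score is simply the softmax probability over prototype distances.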

4 Experiment

4.1 Experimental setup

Baseline Method. We use the two-stage Mask RCNN network as the baseline for comparison. Simultaneously, we integrate ablation experiments to select specific experimental parameters in order to explore the influence of different modules on the result.

Validation Metrics. Our goal is to maximize the model’s accuracy in detecting known categories, approaching closed-set performance, while ensuring stable identification of unknown categories. Accordingly, we use the Mean Average Precision (mAP) to assess the test accuracy of known classes. Following open-set classification practice, we use the Absolute Open-Set Error (AOSE) to count the unknown objects that the model misclassifies as known classes. Furthermore, we consider the relationship between open and closed sets and use Wilderness Impact (WI) to measure the degree to which unknown objects are misclassified as known categories, where \(WI=p_{k}/p_{k\cup u}-1\), with \(p_{k}\) the precision when testing on known classes only and \(p_{k\cup u}\) the precision when testing on both known and unknown classes.
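As a rough illustration of the two open-set metrics, the sketch below computes WI from detection counts and AOSE from matched label pairs. The inputs are simplifying assumptions: precision is computed from raw true-positive and false-positive counts at a single operating point, and predictions are assumed to be already matched one-to-one with ground-truth objects. The function and parameter names are hypothetical.

```python
def wilderness_impact(tp_known, fp_known, fp_open):
    """WI = p_k / p_{k u u} - 1.

    tp_known: true positives on known classes
    fp_known: false positives caused by known-class objects
    fp_open:  additional false positives caused by unknown objects
              being detected as known classes
    """
    p_k = tp_known / (tp_known + fp_known)              # known-only precision
    p_ku = tp_known / (tp_known + fp_known + fp_open)   # known + unknown precision
    return p_k / p_ku - 1.0


def absolute_open_set_error(gt_labels, pred_labels, unknown_id, background_id=0):
    """Count unknown ground-truth objects predicted as some known class."""
    return sum(
        1
        for gt, pred in zip(gt_labels, pred_labels)
        if gt == unknown_id and pred not in (unknown_id, background_id)
    )
```

For example, with 80 true positives, 20 closed-set false positives, and 10 extra false positives from unknown objects, \(p_{k}=0.8\) and \(p_{k\cup u}=80/110\), so WI is 0.1. Lower values of both metrics indicate fewer unknown objects mistaken for known classes.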

Datasets. To evaluate the model’s effectiveness in real-world marine applications, we conduct experiments using the Trashcan dataset and the CH-DUTUSEG dataset. Figure 4 provides an overview of the datasets. The Trashcan dataset has 6065 training and 1147 testing images with 8 non-garbage and 14 garbage classes. Analyzing this dataset, we used 8, 14, and 22 categories to verify the prototype module functionality and Trashcan8-14 and Trashcan14-8 as two benchmarks to verify open-set performance. We select 400 images containing 1191 instances from the DUT-USEG dataset as the CH-DUTUSEG dataset.

Fig. 4 Panels a, b, and c are samples from the Trashcan dataset, and panels d, e, and f are samples from the CH-DUTUSEG dataset

Setup Details. We use ResNet-50 and Feature Pyramid Network (FPN) as the backbone of the improved model and baseline. Regarding hyperparameter settings, we set \(\sigma _{1}\) to 0.001, \(\sigma _{2}\) to 0.0001, \(\tau \) to 0.1, and T to 1.15. For optimizer and learning rate settings, we use an SGD optimizer with an initial learning rate of 0.08, a momentum of 0.9, and a weight decay of 0.0001.

4.2 Main results

Firstly, we verify the positive effect of adding a Prototype module on Trashcan and CH-DUTUSEG, in which the test set only contains known classes. The results are shown in Table 1.

Table 1 Test accuracy of different methods

The comparison in Table 1 indicates that adding the prototype module increases the model’s closed-set accuracy. We then compared the improved model to the baseline on the Trashcan8-14 and Trashcan14-8 benchmarks. Table 2 shows the comparison results on the Trashcan dataset, and Table 3 shows the comparison results on the CH-DUTUSEG dataset. As the CH-DUTUSEG dataset has fewer samples, we only used WI as the evaluation metric.

Table 2 Comparison results of different models on Trashcan subset
Table 3 Comparison results of different models on CH-DUTUSEG

Through the above comparison, we find that the accuracy of known classes is greatly improved by adding a prototype module, and the open set ability of models is improved by adding an unknown learning module. Finally, our model demonstrates an overall improvement compared to the baseline. Figure 5 below shows our prediction results on Trashcan.

Fig. 5 The comparison of results between baseline (top) and improved models (bottom)

4.3 Ablation experiments

In this section, we examine the performance of the individual modules and investigate the best hyper-parameters and their possible effects. For the prototype module, we study the impact of the latent layer’s dimension (i.e., the dimension of the constructed features) on known-class accuracy. We trained and tested on Trashcan8-14. The results are shown in Table 4:

Table 4 The influence of different feature dimensions

According to the results in Table 4, the dimension of the prototype features has an important influence on the accuracy of the model. For the sake of accuracy, we make it larger than the feature dimension of the classification layer. Regarding the unknown learning module, we investigate the impact of the T setting on the results. We conducted training and testing on the Trashcan14-8 dataset. The results are presented in Table 5:

Table 5 The impact of T setting on the results

Finally, we analyze the effect of \(\rho \) settings on the model by conducting training and testing on the Trashcan14-8 dataset. The results are displayed in the table below:

Table 6 The effect of closed-set degree coefficient \(\rho \) on results

5 Conclusions

This paper proposes a novel method for open-set instance segmentation in ocean scenes. Building upon the baseline model, we introduce two learning modules: the prototype module and the unknown learning module. The prototype module is designed to enhance the accuracy of closed-set classification, and together the two modules allow the model to maintain stable accuracy in identifying known classes while effectively recognizing unknown classes in open-set scenes. The performance of the model is evaluated on the Trashcan and CH-DUTUSEG datasets, demonstrating improved classification accuracy for closed sets and enhanced recognition capability for open sets. However, misclassification still exists in the improved model, because unknown samples are still likely to be classified incorrectly as background samples. Future work will focus on feature separation between background samples and unknown samples.