Introduction

Synthetic Aperture Radar (SAR) is a high-resolution imaging radar. On the one hand, as an active microwave imaging sensor, SAR can penetrate ground cover to a certain degree, so it is less affected by the environment and can effectively detect various hidden objects. On the other hand, its all-weather capability enables it to complete exploration missions under extreme conditions. Owing to these properties, SAR has been widely used in ship detection [1,2,3,4,5,6].

Traditional SAR ship detection methods rely on several handcrafted features. There are three main categories: methods based on contrast information [7,8,9], geometric and texture features [10, 11], and statistical analysis [12, 13]. In addition, the authors of [14] consider both marine clutter and signal backscattering in SAR images and propose a Generalized Likelihood Ratio Test (GLRT) detector. Lang et al. [15] proposed a Spatial Enhanced Pixel Descriptor (SEPD) to capture the spatial structure information of the ship target and improve the separability between the ship target and the ocean clutter. Leng et al. [16] defined the Area Ratio Invariant Feature Group (ARI-FG) to modify the traditional detector. Among them, the CFAR (Constant False Alarm Rate) [17,18,19] detection method and its variants are the most widely studied.

In recent years, thanks to advances in deep learning and GPU computing performance, convolutional neural networks have thrived in SAR ship detection [20,21,22,23]. Compared with traditional methods that use handcrafted features, object detection algorithms based on convolutional neural networks have significantly improved accuracy. Among them, dense detection networks increasingly lead the development of object detection. To study detection models systematically, they are usually divided into several components [24]: the backbone network, the neck network, and the head network. The backbone network is first trained on a large-scale image classification task and then transferred to the object detection task, where its parameters are fine-tuned, so many mature structures and parameters are available. The neck network is generally used to realize multi-scale feature fusion, and the fused features can better represent objects with various shapes and positions. The head network is a crucial structure of object detection, which uses the features output by the neck network to classify and locate targets. The earlier object detection algorithms, Faster RCNN [25], YOLO [26], and SSD [27], share a head network for classification and regression tasks. With the advent of RetinaNet [28], FCOS [29], FoveaBox [30], IoU-Net [31], and other algorithms [32,33,34], classification and regression tasks are separated, and parallel classification and regression branches are constructed. However, this structure still has some problems that limit further improvement of detection performance.

Inconsistencies between training and testing in regression branches. Specifically, the widely used regression losses are the L1, smooth L1, and L2 loss functions, which implicitly treat the coordinates of the proposal boxes as following a Dirac Delta distribution. This distribution pushes all coordinate predictions to concentrate on the labels as much as possible, as shown in Fig. 1a. Representative algorithms using the Dirac Delta distribution include SSD, FCOS, the YOLO series, the RCNN series, etc. However, the experiment in the Motivation section shows that the distribution of the actual results is often very different from the Dirac Delta distribution. Treating coordinate prediction as a Dirac Delta distribution during training therefore leads to an inconsistency between the training and testing processes, which ultimately reduces ship detection performance.

Fig. 1

a The conventional ship detection algorithm training and testing process respectively fit the distribution; b the ship detection algorithm training process to fit Gaussian distribution

Some recent work [35, 36] has modeled the predicted coordinates as Gaussian distributions to bridge the gap between training and testing, as shown in Fig. 1b. To some extent, a Gaussian distribution can describe the coordinate distribution and relax the requirement of the Dirac Delta distribution. In fact, however, the predicted coordinates may follow very rich and flexible distributions during testing, including distributions that do not match any commonly recognized form. Therefore, modeling the coordinate prediction as a Gaussian distribution is not the most appropriate choice and can only achieve suboptimal performance.

Inadequate training of classification branches. We noticed that the conventional classification branch usually uses only a single loss, which easily leads to low object classification scores when the background is cluttered. Ships with low scores are then removed as background during post-processing, which ultimately degrades the detection performance of the model. Current research mainly focuses on integrating the classification and regression losses, so that the regression loss can guide the classification loss. This can improve the training of classification branches to some extent. However, the background of SAR ship targets is usually cluttered, and background information interferes with both classification and regression features. The quality of these features therefore caps what a fused classification and regression loss can achieve, making it difficult to raise the classification score further. Consequently, current research still cannot break through the upper limit of classification branch training, and the classification branch remains insufficiently trained.

To solve the above problems, this paper proposes a simple but effective detection network named the twin branch network and designs two loss functions: the regression reverse convergence loss (RRC Loss) and the classification mutual learning loss (CML Loss). Firstly, the twin head network designed in this paper derives a second classification branch and a second regression branch, forming a twin classification network and a twin regression network. The two networks output two sets of data in parallel during training. On this basis, the RRC Loss is proposed to regularize the two predicted coordinate values in the twin regression network, and the relationship between the two predictions is then used to obtain more accurate coordinates. In addition, inspired by knowledge distillation, this paper proposes a mutual learning loss for classification, enabling self-knowledge distillation within the twin classification network. During training, the two classification branches alternate between the teacher and student roles and continue to learn from each other, which strengthens the training of the classification branch. Finally, experiments on the SSDD dataset show that, compared with the conventional RetinaNet, our method improves AP by 2.7–4.9% with different backbone networks. At the same time, our detection performance is better than that of other current advanced methods. In addition, experiments on the HRSID dataset show that the proposed method has good portability. For example, FoveaBox with the proposed method gains 1.5–2.0% AP with different backbone networks, RetinaNet with the proposed method gains 1.5–4.8% AP with different backbone networks, and PISA with the proposed method gains 1.5% AP with ResNet-50. These results demonstrate the effectiveness of our method.

The main contributions of our work can be summarized as follows:

  • We conducted detailed experiments and analyses of existing methods and identified two common problems in current methods: inconsistencies between training and testing in regression branches and inadequate training of classification branches.

  • This paper proposes a simple but effective detection network named the twin branch network and designs two loss functions: RRC Loss and CML Loss. To resolve inconsistencies between training and testing in regression branches, we use twin regression branches and the RRC Loss. To address inadequate training of classification branches, we use twin classification branches and the CML Loss.

  • We conducted extensive experiments on the benchmark SSDD and HRSID datasets to prove the effectiveness of the proposed method. The experimental results confirmed that the proposed method is effective.

Fig. 2

a The conventional ship detection algorithm training and testing process respectively fit the distribution; b the ship detection algorithm training process to fit Gaussian distribution

Motivation

In this section, the results obtained by the mainstream dense detector RetinaNet are studied to explore the main problems existing in the current detection model and then provide the basis for the theory of this paper.

RetinaNet is based on a feature pyramid network, and its detection head is a general parallel structure. The classification branch scores each anchor box, and the regression branch predicts each anchor’s center point offset \((\varDelta x,\varDelta y)\) and width and height offset \((\varDelta w,\varDelta h)\). The outputs of the classification and regression branches then jointly determine the detection result. In the training phase, the prediction \((\varDelta x,\varDelta y, \varDelta w,\varDelta h)\) of each anchor box learns a Dirac Delta distribution centered on the label y. At the same time, the classification score of each anchor fits the one-hot encoding of the category label. Finally, in the test phase, non-maximum suppression is applied according to the classification scores to remove overlapping prediction boxes.

We first study the regression branch of RetinaNet. Two images, each containing a single ship, are selected. A trained RetinaNet model detects the target in each image, and the raw predictions are collected without non-maximum suppression. Next, we conduct a statistical analysis of the collected data, and the results are shown in Fig. 2.

It can be seen from Fig. 2 that the distributions of \((\varDelta x,\varDelta y, \varDelta w,\varDelta h)\) for the two images do not resemble the Dirac Delta distribution. Nor do they form a Gaussian distribution; they are irregular, and the label value does not even fall within the highest-probability range. On the one hand, this phenomenon shows that mainstream detectors represented by RetinaNet suffer from inconsistency between training and testing. On the other hand, it also shows that fitting a Gaussian distribution during training is unreliable.

In order to deal with the above problems, this paper tries to consider the following two perspectives:

Increasing the number of branches. During testing, the predicted boxes follow an unknown distribution, and the mean of this distribution is the accurate predicted value. To obtain accurate target box coordinates from the unknown distribution, a simple idea is to draw many random samples from it and compute their mean. To achieve this, multiple regression branches are set up in the detector. During training, the output of each regression branch learns the label separately, so the outputs should be independent and identically distributed. The average of all regression outputs is then computed to obtain an accurate result.

Increasing the number of branches is a novel idea. However, it has no practical significance because it requires many output values as samples, which means that the number of regression branches and model parameters need to be significantly increased.
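The trade-off described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the label value, and the skewed zero-mean noise model (a shifted exponential) are hypothetical choices for illustration and are not taken from the experiments in this paper. It shows that averaging the outputs of independent branches moves the estimate toward the label, but that many branches are needed before the error becomes small.

```python
import random


def branch_output(label, rng):
    """One hypothetical regression branch: the label plus skewed, zero-mean noise.

    The shifted exponential noise is an assumption chosen only because it is
    clearly non-Gaussian; the real test distribution is unknown.
    """
    return label + rng.expovariate(1.0) - 1.0


def mean_abs_error(num_branches, label=0.5, trials=2000, seed=0):
    """Average |mean of branch outputs - label| over many simulated anchors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        outputs = [branch_output(label, rng) for _ in range(num_branches)]
        estimate = sum(outputs) / num_branches  # average over the branches
        total += abs(estimate - label)
    return total / trials


if __name__ == "__main__":
    # The error shrinks only slowly as the number of branches grows.
    for n in (1, 2, 8, 64):
        print(n, round(mean_abs_error(n), 3))
```

In this toy setting the error falls roughly with the square root of the number of branches, which is why simply adding branches is impractical and why two branches alone need additional guidance.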

Adjusting the distribution of the results. In some studies, it is assumed that the distribution of test results conforms to the Gaussian distribution. Then the coordinates of the prediction box are fitted to the Gaussian distribution during training, as shown in Fig. 1b. In the training process, the training samples should be as close to the distribution of test results as possible, but it has been proved in Fig. 2 that this method is not the most effective.

We may instead consider the reverse. Treat the Dirac Delta distribution used during training as the target, and let the test distribution approach it. Specifically, the test distribution is adjusted heuristically without modifying the training pipeline, as shown in Fig. 3. Although the test distribution is unknown, if its range and shape can be modified, it can be made as close to the Dirac Delta distribution as possible.

Fig. 3

The test distribution is adjusted heuristically without modifying the training pipeline

Based on the above two research perspectives, twin branch network and RRC Loss are proposed in this paper. In addition, this paper continues to study the classification branches. In order to intuitively understand the training of the RetinaNet model, all the predicted classification scores are counted in this paper. We assume that proposal boxes with a classification score greater than 0.5 are regarded as targets, while those with a classification score less than 0.5 are regarded as false detections. The statistical results are shown in Fig. 4.

It can be seen from Fig. 4 that among all detected ship targets, the number of targets with classification scores ranging from 0.8 to 0.85 is the largest. However, the number of detected targets dropped sharply from 0.85 to 1.0. The scores of most targets are in the range of 0.5–0.8, and the classification scores of targets are generally not high.

Fig. 4

The results of statistics on all classification scores on the test dataset. The bar chart on the left shows that most ships are in the range of 0.8–0.85. In the range of 0.85–1.0, the number of detected ships decreased sharply. Almost all ships have scored in the range of 0.5–0.8. In addition, we list the ship images with different classification scores on the right

Therefore, the above statistical results indicate that detectors represented by RetinaNet suffer from inadequate training of the classification branch. To alleviate this problem, we design the CML Loss based on the twin branch network.

Method

In this section, we first explain the structure of the twin branch network based on the classical RetinaNet model. Then, we introduce RRC loss’s composition and working principle. Finally, the CML Loss will be introduced in detail.

Twin branch network

The general detection model consists of classification and regression branches, and there is no interaction between the two branches. The classification and regression branches contain four convolution layers and activation functions. On this basis, the twin branch network designed in this paper derives the regression and classification branches, and the network structure is shown in Fig. 5.

Fig. 5

The structure of twin branch networks

It should be noted that the derived branch is structurally identical and symmetric to the original branch, but the two share no connections. This structure keeps the output distributions of the derived and original branches as close as possible. In addition, the design does not affect the original training pipeline, and it is simple in concept and easy to implement. Finally, the RRC Loss and CML Loss are added to the network while the original loss functions are retained. The total loss function is as follows:

$$\begin{aligned} \mathcal {L} = \underbrace{\mathcal {L}^1_\textrm{cls}+\mathcal {L}^2_\textrm{cls}}_{\text {Classification Loss}}+ \underbrace{\mathcal {L}^1_\textrm{reg}+\mathcal {L}^2_\textrm{reg}}_{\text {Regression Loss}}+\mathcal {M}_\textrm{cls}+\mathcal {M}_\textrm{reg} \end{aligned}$$
(1)

where \(\mathcal {L}^1_\textrm{cls}\) and \(\mathcal {L}^2_\textrm{cls}\) represent the original classification loss function in the RetinaNet model, that is, focal loss; \(\mathcal {L}^1_\textrm{reg}\) and \(\mathcal {L}^2_\textrm{reg}\) respectively represent the original regression loss function, and the smooth L1 loss function is used in this paper. Finally, \(\mathcal {M}_\textrm{cls}\) and \(\mathcal {M}_\textrm{reg}\) are CML loss and RRC loss, respectively.

Regression reverse convergence loss

The twin regression network splits the regression branch into two branches, increasing the number of regression branches. However, if the mean of the test distribution is estimated from only two independent regression branches, it is difficult to obtain an accurate prediction. Therefore, this paper uses the RRC Loss to guide the training of the two regression branches. The network structure is shown in Fig. 6.

Fig. 6

The structure of twin regression network

On the one hand, the distributions of the two regression branches should be as close as possible, so an L1 or L2 distance loss (written in its L2 form in Eq. (2)) is used to constrain the distance between the two predictions. On the other hand, to make the output distribution conducive to accurate predictions, the predictions of the two branches converge to the label from opposite directions, so that the outputs of the two regression branches are concentrated on opposite sides of the label. Finally, a new distribution is obtained by averaging the two distributions. After averaging, the distribution becomes sharper and closer to the Dirac Delta distribution. Specifically, cosine similarity is used to reduce the similarity between the directions of the two branches’ predictions relative to the label. The RRC Loss takes the following form:

$$\begin{aligned} \mathcal {M}_\textrm{reg}&= \frac{1}{\mathcal {N}_{\text {pos}}} \sum _{i} \cos \left( bbox_{i}^{1}-\mathcal {T}_{i},\, bbox_{i}^{2}-\mathcal {T}_{i}\right) \\&\quad +\frac{1}{\mathcal {N}_{\text {pos}}} \sum _{i} L_{2}\left( bbox_{i}^{1}, bbox_{i}^{2}\right) \end{aligned}$$
(2)

where \(bbox^1_{i}\) and \(bbox^2_{i}\) represent the predicted values of the ith positive sample on the two regression branches, \(\mathcal {N}_{\text {pos}}\) is the number of positive samples, \(\mathcal {T}_i\) represents the regression target value, and \(\cos \) computes the cosine similarity between the two branches’ deviations from the target. Subtracting the label \((\varDelta x,\varDelta y)\) from a predicted regression value \((\varDelta x_i,\varDelta y_i)\) gives the change vector of the regression value, \((\varDelta x_i - \varDelta x, \varDelta y_i - \varDelta y)\). Minimizing the cosine similarity makes the cosine between the change vectors of the two branches keep decreasing, so the angle between them gradually approaches \(\pi \). The cosine similarity between the two branches is calculated as follows:

$$\begin{aligned} \begin{array}{c} \textbf{x}=\left( x_{1}, x_{2}\right) , \quad \textbf{y}=\left( y_{1}, y_{2}\right) \end{array} \end{aligned}$$
(3)
$$\begin{aligned} \cos =\frac{\textbf{x} \cdot \textbf{y}}{|\textbf{x}| *|\textbf{y}|}=\frac{x_{1} * y_{1}+x_{2} * y_{2}}{\sqrt{x_{1}^{2}+x_{2}^{2}} * \sqrt{y_{1}^{2}+y_{2}^{2}}} \end{aligned}$$
(4)

In the latter part of Eq. (2), we use the L2 norm to reduce the Euclidean distance of the regression predicted values on the two branches to promote their eventual convergence to the same result.
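To make Eq. (2) concrete, the following is a minimal pure-Python sketch of the RRC Loss for a small batch of positive samples. It is an illustration under assumptions, not the implementation used in the experiments: the function names are hypothetical, the cosine similarity is taken over the full offset vector of each box, the convergence term is the (non-squared) Euclidean distance, and a small `eps` guards against division by zero.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between vectors u and v, as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)  # eps is an assumed safeguard


def rrc_loss(bbox1, bbox2, targets):
    """Sketch of Eq. (2): reverse term (cosine) plus convergence term (L2).

    bbox1, bbox2: predictions of the two regression branches, one vector
    per positive sample. targets: the matching regression targets.
    """
    n_pos = len(targets)
    if n_pos == 0:
        return 0.0
    reverse_term = 0.0
    convergence_term = 0.0
    for b1, b2, t in zip(bbox1, bbox2, targets):
        d1 = [p - q for p, q in zip(b1, t)]  # change vector of branch 1
        d2 = [p - q for p, q in zip(b2, t)]  # change vector of branch 2
        reverse_term += cosine_similarity(d1, d2)
        convergence_term += math.sqrt(sum((p - q) ** 2 for p, q in zip(b1, b2)))
    return (reverse_term + convergence_term) / n_pos


if __name__ == "__main__":
    target = [(0.0, 0.0, 0.0, 0.0)]
    same_side = rrc_loss([(0.3, 0.4, 0.0, 0.0)], [(0.3, 0.4, 0.0, 0.0)], target)
    opposite = rrc_loss([(0.3, 0.4, 0.0, 0.0)], [(-0.3, -0.4, 0.0, 0.0)], target)
    print(same_side, opposite)
```

In this sketch, two predictions lying on the same side of the target receive a larger loss than two predictions placed symmetrically around it, which is the behavior the reverse term is meant to encourage; the convergence term then pulls both predictions together toward the label.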

Classification mutual learning loss

Inspired by knowledge distillation, we construct self-knowledge distillation inside the twin classification network with the CML Loss in order to strengthen the training of the classification branches. During training, the two classification branches alternate between the teacher and student roles and continue to learn from each other. The model structure is shown in Fig. 7.

Fig. 7

The structure of twin classification network

When one classification branch is backpropagated, it will be treated as the student network, and the other branch is automatically treated as the teacher network. Once the classification score of the student network is lower than that of the teacher network, the teacher network will provide an additional loss for the student network to promote the training of the student network.

The CML Loss establishes mutual learning between the two branches to continuously strengthen the training of both classification branches. Unlike in conventional knowledge distillation, when the classification score of the student network is higher than that of the teacher network, the teacher network may provide a training direction contrary to the label, thus hindering the training of the student network. Therefore, when the classification score of the student network is higher than that of the teacher network, the student network is not penalized. Specifically, the CML Loss is as follows:

$$\begin{aligned} \mathcal {M}_\textrm{cls}\left( \mathcal {C}_\textrm{s}, \mathcal {C}_\textrm{t}, y\right) = {\left\{ \begin{array}{ll}\left\| \mathcal {C}_\textrm{s}-y\right\| _{2}^{2}, & \text {if }\left\| \mathcal {C}_\textrm{s}-y\right\| _{2}^{2}+m>\left\| \mathcal {C}_\textrm{t}-y\right\| _{2}^{2} \\ 0, & \text {otherwise},\end{array}\right. } \end{aligned}$$
(5)

where m represents the margin, which is set to 0 in the experiments in this paper, and y represents the classification label. Since SAR ship datasets usually contain a single category, \(y=1\) is used when the anchor box is a positive sample, and \(y=0\) is used otherwise. \(\mathcal {C}_\textrm{s}\) represents the classification score of the student network, and \(\mathcal {C}_\textrm{t}\) represents the classification score of the teacher network. The CML Loss imposes a penalty only when the student’s error is greater than the teacher’s error. This loss encourages the student network to match or outperform the teacher network in classification. However, it gives the student network little further boost once it reaches the teacher network’s performance.
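The piecewise rule in Eq. (5) can be sketched for scalar scores as follows. This is a hedged illustration rather than the training code: the function names are hypothetical, the teacher score is treated as a constant, and summing the two role assignments in `mutual_cml_loss` is an assumption about how the alternating teacher and student roles could be combined.

```python
def cml_loss(student_score, teacher_score, label, margin=0.0):
    """Sketch of Eq. (5) for one anchor with scalar scores.

    The student is penalized with its own squared error only when that
    error (plus the margin) exceeds the squared error of the teacher.
    """
    student_err = (student_score - label) ** 2
    teacher_err = (teacher_score - label) ** 2
    if student_err + margin > teacher_err:
        return student_err
    return 0.0


def mutual_cml_loss(score1, score2, label, margin=0.0):
    """Each branch takes the student role once (assumed combination rule)."""
    return (cml_loss(score1, score2, label, margin)
            + cml_loss(score2, score1, label, margin))


if __name__ == "__main__":
    # Positive anchor (label 1): the weaker branch (0.6) is penalized,
    # the stronger branch (0.9) is not.
    print(cml_loss(0.6, 0.9, 1.0))
    print(cml_loss(0.9, 0.6, 1.0))
    print(mutual_cml_loss(0.6, 0.9, 1.0))
```

With the margin set to 0, only the branch that is currently further from the label receives the extra loss, so the better branch is never pulled toward the worse one.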

Experiment

We first introduce the datasets, experiment setting, and evaluation criteria in the experiment section. Then, the effects of twin networks and different loss functions on the experimental results will be studied in detail through ablation study. Next, this paper will verify the detection performance and robustness of the proposed method on SSDD and HRSID datasets.

Datasets

To prove the superiority of this method, we conducted extensive experiments on the SSDD [37] and HRSID [38] datasets.

SSDD is the first SAR ship dataset established in 2017. It has been widely used by many researchers since its publication and has become the baseline dataset for SAR ship detection. The SSDD dataset contains many scenarios and ships and involves various sensors, resolutions, polarization modes, and working modes. Additionally, the label file settings of this dataset are the same as those of the mainstream PASCAL VOC [39] dataset, so training of the algorithms is convenient.

When using the SSDD dataset, researchers used to divide it randomly into training, validation, and test sets. These inconsistent divisions often meant that results lacked a common basis for comparison. As researchers gradually discovered this problem, they began to establish uniform training and test sets. Currently, approximately 80% of the dataset is used for training and the remaining 20% for testing. There are 1160 images in the SSDD dataset, so the training set contains 921 images and the test set contains 239 images. More specifically, images whose file names end with the digits 1 and 9 form the test set. In this way, the performance of various detection algorithms can be evaluated on a common basis.

The HRSID dataset was released by UESTC in January 2020. It is used for ship detection, semantic segmentation, and instance segmentation tasks in high-resolution SAR images. The dataset contains 5604 high-resolution SAR images and 16,951 ship instances. Its label file format is the same as that of the mainstream Microsoft Common Objects in Context (MS COCO) [40] dataset.

Evaluation metrics

To evaluate detection performance, we adopt the MS COCO evaluation metrics AP, AP50, AP75, APS, APM, and APL. Average Precision (AP) is the area under the precision-recall curve, and mean Average Precision (mAP) is the average of the per-category AP values, where precision and recall are defined in Eqs. (6) and (7). AP is averaged over Intersection over Union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05 (the primary challenge metric), AP50 is the AP at IoU = 0.5 (the PASCAL VOC metric), and AP75 is the AP at IoU = 0.75. APS, APM, and APL denote the AP for small, medium, and large targets, respectively: small targets have an area less than \(32^2\) pixels, medium targets have an area between \(32^2\) and \(96^2\) pixels, and large targets have an area greater than \(96^2\) pixels.

$$\begin{aligned} P={\frac{\text {TP}}{\text {TP}+\text {FP}}}\times {100\%}. \end{aligned}$$
(6)
$$\begin{aligned} R={\frac{\text {TP}}{\text {TP}+\text {FN}}}\times {100\%}. \end{aligned}$$
(7)

Here, TP (true positive) is the number of ships correctly detected, FP (false positive) is the number of non-ship regions incorrectly detected as ships, and FN (false negative) is the number of ships that are missed, that is, incorrectly classified as negative. AP is defined as

$$\begin{aligned} \text {AP}=\int \limits _{0}^{1}P(R)\text {d}R. \end{aligned}$$
(8)

where P represents precision and R represents recall. AP is equal to the area under the curve.
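As a worked example of Eqs. (6)–(8), the sketch below computes precision, recall, and the area under the precision-recall curve from a ranked list of detections at a single IoU threshold. It is a simplified illustration: the function names and the toy detection list are hypothetical, and the all-point interpolation used here is an assumed choice, whereas the MS COCO toolkit samples the curve at fixed recall levels and averages over several IoU thresholds.

```python
def precision_recall(tp_flags, num_gt):
    """Running precision and recall over detections sorted by descending score.

    tp_flags[i] is True if the i-th detection matches a ground-truth ship.
    Values are fractions in [0, 1] rather than percentages.
    """
    precisions, recalls = [], []
    tp = fp = 0
    for is_tp in tp_flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))  # Eq. (6)
        recalls.append(tp / num_gt)        # Eq. (7), since TP + FN = num_gt
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the precision-recall curve, Eq. (8), all-point interpolation."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


if __name__ == "__main__":
    # Four detections sorted by score, four ground-truth ships in total.
    p, r = precision_recall([True, True, False, True], num_gt=4)
    print(average_precision(p, r))
```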

Experimental settings

All experiments were implemented with PyTorch 1.6.0, CUDA 11.2, and cuDNN 7.4.2 on a machine with an Intel(R) Xeon(R) Silver 4110 CPU and an NVIDIA GeForce TITAN RTX GPU. The operating system is Ubuntu 18.04. Table 1 presents the computer and deep learning environment configuration for our experiments.

Table 1 Environment of this experiment
Table 2 The effect of each component on the experimental results
Table 3 The effect of measurement function on experimental results

The model in this paper is built on the MMDetection [41] framework. The training schedule \(1\times \) denotes 12 epochs of training, and \(2\times \) denotes 24 epochs. The optimizer is stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0001, and the learning rates used are 0.01, 0.005, and 0.001.

Ablation study

The influence of each model component

Our method is based on the RetinaNet model with a ResNet-50 backbone. Firstly, the influence of each component of our method on detection performance is studied. The experimental results are shown in Table 2.

Without the CML Loss and RRC Loss, we retain only the conventional classification and regression loss functions of RetinaNet, train the twin branch network, and average the two groups of classification and regression outputs for target prediction. The detection results are shown in row 2 of Table 2. The AP of the twin branch network is 50.2%, which is 1.4% higher than that of the conventional RetinaNet model.

Next, the CML Loss is added to the twin branch network, and the detection results are shown in row 4 of Table 2. The AP is 51.6%, which is 2.8% higher than the original structure and 1.4% higher than the twin branch network without this loss, showing that our CML Loss can enhance the model’s training.

Then, to verify the effectiveness of the RRC Loss, only the RRC Loss is added to the twin branch network for training, and the detection results are shown in row 3 of Table 2. After adding the RRC Loss, the AP is 52.9%, which is 4.1% higher than the original structure.

Finally, when the twin branch network, the CML Loss, and the RRC Loss are used simultaneously, the detection results are shown in line 4 of Table 2. The AP improves by 5.1% over the original structure, indicating that the twin branch network and the two loss functions proposed in this paper can significantly improve the detection performance.

The influence of different measurement functions

We studied the influence on the experimental results when the RRC Loss and the CML Loss adopt different measure functions. The experimental results are shown in Table 3, where \(N\backslash A\) indicates that no loss function is adopted.

Table 4 The effect of the output on the experimental results
Fig. 8

The density of the outputs of the two regression branches in the twin network. The two contour lines correspond to the dispersion of the two regression branches, which are distributed on both sides of the red dot

The CML Loss measures the distance between the outputs of the two classification branches. Since this distance is usually tiny, L1 and L2 losses are used to realize the CML Loss in this paper. In addition, since the RRC Loss is composed of a reverse term and a convergence term, cosine similarity is used to realize the reverse term, and L1 and L2 loss functions are used to realize the convergence term. Compared with the other schemes in the table, the model performs best when the CML Loss uses the L1 loss, the reverse term uses cosine similarity, and the convergence term uses the L1 loss, reaching an AP of 53.9%.

The influence of different aggregation methods

During inference, the twin branch network outputs two groups of classification scores and two groups of regression predictions. In this section, we study how different ways of combining these four groups of outputs affect detection performance, and the experimental results are shown in Table 4.

In the table, branch output X means that the output result of the X branch network is used for post-processing. In addition, \(1\backslash 2\) means that the output result of the two branches is combined, that is, the output result of the two branches is averaged.

As can be seen from the table, when only a single branch is used for detection, the AP decreases drastically. When both the classification outputs and the regression outputs are combined, the AP is the highest and is significantly improved. The results show that the two loss functions proposed in this paper constrain the training of the twin branch network. It is also worth noting that the AP values in the first four rows of the table are almost identical, indicating that the two branches within a twin branch network are independent of each other.
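The combined setting (denoted \(1\backslash 2\) in Table 4) can be sketched as an element-wise average of the two branches' outputs before post-processing. The helper name and the toy values below are hypothetical, and whether the classification outputs are averaged before or after the sigmoid is not stated here, so averaging the final scores is an assumption of this sketch.

```python
def average_outputs(out1, out2):
    """Element-wise mean of the outputs of the two twin branches."""
    if len(out1) != len(out2):
        raise ValueError("both branches must produce the same number of values")
    return [(a + b) / 2.0 for a, b in zip(out1, out2)]


if __name__ == "__main__":
    # Classification scores of two anchors from each classification branch.
    scores = average_outputs([0.8, 0.2], [0.6, 0.4])
    # Offsets (dx, dy, dw, dh) of one anchor from each regression branch.
    deltas = average_outputs([0.10, -0.20, 0.05, 0.00],
                             [0.06, -0.10, 0.03, 0.02])
    print(scores, deltas)
```

The averaged scores and offsets would then be decoded and passed to non-maximum suppression in the same way as the outputs of a single-branch head.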

The results on SSDD dataset

The visualization of reverse convergence

The RRC Loss uses cosine similarity to make the two regression branches approach the label from opposite directions. To show intuitively that this has a tangible effect on training, images are randomly selected from the test dataset, and the distribution of the predictions near the center point of the target is visualized to verify that the two regression branches approach the label from different directions, as shown in Fig. 8.

The red dots indicate the target location. The red and green contour lines represent the density levels of the outputs of the two regression branches. The darker the color in the figure, the greater the density of predicted values at that position. As can be seen from Fig. 8, the two sets of contour lines are located on opposite sides of the red dot, which again confirms that reverse convergence takes place.

The adequacy of classification training

To further illustrate that the CML Loss promotes model training, this section analyzes the changes in classification loss; the results are shown in Fig. 9.

Fig. 9

The loss curves of twin classification networks. a The proposed method has the lowest loss, indicating that it promotes the model’s training. b Classification scores of all objects in the test dataset. The red and blue areas are the statistics of the original method and our method, respectively. The proposed method significantly increases the number of predictions in the range 0.8–1, indicating that its classification scores are generally higher than those of the original method, which further supports that the proposed method strengthens the training of the model

Table 5 The experimental results on SSDD dataset

As shown in the figure, the loss curves of both RetinaNet and the proposed method fluctuate during training. However, the loss values of the proposed method are lower than those of the original method with both backbone networks. Therefore, the proposed method facilitates the training of twin classification networks.

In addition, all classification scores without non-maximum suppression were counted, as shown in Fig. 9b. The red bars correspond to the conventional RetinaNet model and the blue bars to the proposed method.

As can be seen from the figure, the classification scores of our method peak at 0.85–0.9 and are mainly concentrated in 0.75–0.95. Compared with the baseline method, the classification scores are significantly higher, which further supports that the proposed method positively affects classification training.
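The score statistics described above can be sketched as a simple binned count. The bin width of 0.05, the function names, and the toy scores below are assumptions for illustration; the paper does not state how the histogram was computed.

```python
def score_histogram(scores, num_bins=20):
    """Count scores in [0, 1] using equal-width bins (width 1 / num_bins)."""
    counts = [0] * num_bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(s * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


def fraction_in_range(scores, low, high):
    """Fraction of scores s with low <= s <= high."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if low <= s <= high) / len(scores)


# Hypothetical classification scores collected without NMS.
scores = [0.87, 0.88, 0.52, 0.91, 0.10]
counts = score_histogram(scores)
peak_bin = counts.index(max(counts))            # index of the fullest bin
share = fraction_in_range(scores, 0.75, 0.95)   # mass in 0.75-0.95
```

With 20 bins, bin index 17 covers 0.85–0.9, so a peak in that bin together with a large share in 0.75–0.95 corresponds to the pattern reported for the proposed method.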

The comparison with other advanced methods

To demonstrate the effectiveness of the method, we conducted an extensive comparison on the SSDD dataset; the experimental results are shown in Table 5.

For RetinaNet models, the AP of our method improved by 2.7% after 24 epochs of training with ResNet-50. With the ResNet-101 backbone network, the AP improved by 4.6% and 2.8% over the baseline method after 12 and 24 epochs, respectively. With the more powerful ResNeXt-101, the proposed method achieved a 4.9% AP improvement over the baseline method. In addition, various other models were tested; the best-performing of them, PAA, achieves an AP of 52.4%. Compared with PAA, our method improves AP by 1.2% with the same backbone network, and it is also better on the other criteria. These results show that our method significantly improves ship detection performance on the SSDD dataset with different backbone networks.

Figure 10 shows the visualized detection results on the SSDD dataset. As can be seen from the figure, the other seven models suffer from repeated detections and missed detections to varying degrees, whereas the proposed method detects SAR ships more accurately against complex backgrounds.

Fig. 10

The visual detection results on the SSDD dataset

The results on the HRSID dataset

To verify the robustness of the proposed method, the twin branch network is applied to the FoveaBox and PISA algorithms in this section, and an extensive comparison is made on the HRSID dataset. The experimental results are shown in Table 6.

Table 6 The experimental results on HRSID dataset

Unlike RetinaNet, FoveaBox is a classical anchor-free object detection method with advanced performance. The PISA algorithm is based on RetinaNet; it reweights training samples and improves the performance of the baseline method. Since FoveaBox differs from the RetinaNet training pipeline and PISA is an improved version of RetinaNet, the experiments conducted on these two methods are representative.

As can be seen from Table 6, with FoveaBox our method leaves the AP almost unchanged on the ResNet-50 backbone network but improves it by 1.3–2% on the other backbone networks. With PISA, our method improves the AP by 1.5% on the ResNet-50 backbone network. In addition, with RetinaNet the AP improves by 1.5–4.8%.

Experimental results show that the twin branch network significantly improves detection performance on two benchmark SAR ship datasets and with three advanced detection models, so the proposed method has good robustness.

Conclusion

In this paper, a twin branch network is proposed, together with two loss functions, the CML Loss and the RRC Loss. With these two losses, the twin branch network can standardize the test distribution and obtain more accurate detection results. As far as we know, most existing work leaves the test distribution unchanged and instead makes the network fit it during training. In contrast, we heuristically modify the test distribution so that it resembles the training distribution, which ultimately helps to obtain more accurate results. Finally, our method significantly improves SAR ship detection accuracy while adding only a few training parameters, and its application to other computer vision tasks is also worth exploring.