1 Introduction

Ship detection has broad application prospects in maritime trade, ship traffic control, port transportation, and national defense security, which makes research on advanced detection networks important [1]. Although researchers have devoted considerable effort to ship detection, it remains challenging due to the varied orientations of ships, the limited color contrast within specific ship types, and the high resolution required of ship images. Many networks have been proposed for ship detection; however, they rarely include a fine-grained study of specific ship types, because changes in external light intensity or the use of different colors on similar ships can greatly affect detection accuracy.

With the advancement of deep convolutional neural networks for object detection, two detection paradigms with different numbers of stages have emerged, distinguished by whether region proposals are used [2]. One-stage detectors directly regress positional coordinates and target categories, which reduces time complexity. Typical one-stage detectors are You Only Look Once (YOLO) [3], the single shot multibox detector (SSD) [4], and RetinaNet [5]. Two-stage detectors first generate region proposals as a coarse refinement step and then classify and locate the regions of interest in the second stage [6]. Typical two-stage detectors are the faster region-based convolutional neural network (Faster R-CNN) [7], the mask region-based convolutional neural network (Mask R-CNN) [8], and the feature pyramid network (FPN) [9]. In this paper, the base classifier is the one-stage RetinaNet, which relies on the focal loss. As a result, it combines the benefits of one-stage detectors, such as fast speed, with those of two-stage detectors, such as high detection precision. In computer vision, horizontal-frame target detection algorithms based on region-based convolutional neural networks (R-CNN) have rich application scenarios in the classification and detection of remote sensing images. However, they may introduce background noise when detecting targets with large aspect ratios, and targets are likely to be missed during non-maximum suppression (NMS). Therefore, in recent years, many scholars have studied rotating region proposal algorithms that introduce an anchor-frame rotation angle parameter, which effectively retains target orientation information while suppressing background noise [10] and improves tilted-target detection accuracy with the help of skew non-maximum suppression (skew-NMS).

In this paper, we introduce YOLF-ShipPnet, a novel network architecture that uses the state-of-the-art Pyramid Vision Transformer Version 2 (PVTv2) as the backbone of the RetinaNet base classifier. Rotating frames are used for global and fine-grained ship image detection. Distinguished from foregoing works, we increase the depth of the network and perform random HSV data augmentation from YOLOX to improve its fine-grained classification capability on the ship dataset.

Our main contributions are:

  1. We propose the YOLF-ShipPnet network for ship detection in commercial and military applications. In YOLF-ShipPnet, we introduce the popular PVTv2 architecture into the construction of the backbone of the RetinaNet base classifier, fully exploring its depth effectiveness. We also apply YOLOX's random HSV data augmentation to the ship datasets.

  2. We demonstrate that the detection precision of YOLF-ShipPnet outperforms the conventional scheme and gradually improves as the depth of the network increases. After random data augmentation, the model shows better generalization ability, confirming the effectiveness of the designed network. With the deepened network and effective data augmentation, the proposed YOLF-ShipPnet network is applicable to fine-grained ship detection and exceeds the baseline by a large margin.

The remaining parts of this paper are organized as follows. In Sect. 2, we summarize related work on the development of the transformer architecture and the evolution of rotated frame detection, and briefly review existing ship detection methods with different functionalities. In Sect. 3, we describe the detailed construction of the PVTv2 backbone and the mechanism of YOLOX's HSV data augmentation. In Sect. 4, we present the results of validation and ablation experiments on two datasets to prove the depth effectiveness of the PVTv2 network and the validity of YOLOX's HSV data augmentation strategy, show that our proposed network can be applied to fine-grained ship detection, and compare our network with other advanced networks to demonstrate its superiority. The conclusion of our research is given in Sect. 5. Reflections on our current work and future research prospects are presented in Sect. 6.

2 Related Work

The transformer architecture is used in place of the residual network (ResNet) [11] to form the backbone of the RetinaNet [5] network. In 2017, the Google team first proposed the transformer model, which abandoned the traditional convolutional neural network (CNN) and recurrent neural network (RNN) architectures, making the entire network structure consist entirely of the attention mechanism. The transformer pioneered a transduction model that computes representations of its input and output based entirely on self-attention [12], and it is now widely used in the computer vision field for image detection tasks.

The development of the transformer structure can be divided into three stages, marked by enhancements in functionality and gains in efficiency. In the first stage, the emergence of the attention mechanism enhanced the traditional CNN through an optimized fusion of the two. Bello et al. (2019) introduced a two-dimensional self-attention model that combines convolutional feature maps with feature maps generated by self-attention [13]. By leveraging a global perspective to analyze the entire image, the model outperforms a CNN that is limited to processing only local information, yielding a significant improvement in accuracy on image detection tasks. Later, with the attention mechanism applied to image detection, the transformer architecture reached the level of completely replacing CNNs thanks to its excellent performance. For example, Dosovitskiy et al. (2020) introduced the vision transformer (ViT), which is applied directly to a sequence of image patches without any reliance on CNNs, demanding less computational power and achieving better detection performance than first-class convolutional prototypes [14]. Since then, a series of methods based on ViT have emerged to improve and optimize the transformer structure for greater efficiency and effectiveness. Han et al. (2021) proposed a new vision transformer architecture called Transformer iN Transformer (TNT), which divides local patches into sub-patches, integrates both levels of information, and generates representations at patch granularity with the help of an outer transformer [15]. Wang et al. (2022) proposed the pyramid vision transformer (PVT), which obtains higher output resolution when trained on denser regions of an image and reduces the cost of computing large feature maps by employing a progressively shrinking pyramid [16]. To conclude, the transformer architecture integrates the attention mechanism into a feed-forward network and offers better parallelism and global optimization capabilities. It significantly improves the efficiency and accuracy of dense image detection, exhibiting broad application prospects in multimodality and object identification.

Rotated frame detection is widely used in ship detection. For example, Liu et al. (2016) proposed a novel ship rotation bounding box that accurately captures the true shape of ships embedded in complex backgrounds; the method generates representative candidate regions using a closed-form region approach and outperforms traditional horizontal-frame target detection schemes [17]. Hu et al. (2017) introduced the rotated region-based convolutional neural network (RR-CNN), which integrates a rotated region of interest (RRoI) pooling layer and a regression model equipped with rotating bounding boxes to accomplish ship detection. It excels in extracting key features within rotated regions and can thus capture inclined detection targets more precisely [18]. Liao et al. (2022) proposed a novel rotated region proposal network (R2PN) to generate multi-directional proposals that incorporate the angle information of ship orientation; it adopts a pooling layer driven by rotated regions of interest for key feature extraction and uses bounding box regression to increase the accuracy of inclined ship region proposals. The proposed network achieves superior performance in ship detection, particularly for ships with multiple orientations [10].

Existing methods for ship detection rarely investigate the depth effectiveness of the feature extraction networks and have limited generalization or fine-grained detection ability. Liu et al. (2023) proposed the multi-scale convolutional attention fusion network (MSCAF-Net), a framework with a PVTv2-B2 backbone for detecting camouflaged objects that focuses on learning context-sensitive features at different scales. While the efficacy of the network is evident on the reference datasets, its potential for deeper exploration is constrained by the use of a single layer of the PVTv2 network [19]. Sun et al. (2022) proposed the gradient harmonized transformer network (GHFormer-Net), which utilizes PVTv2-B1 as the backbone and incorporates the gradient harmonized classification (GHM-C) and gradient harmonized regression (GHM-R) loss functions to improve fruit detection in low-light conditions. The experimental results demonstrate the effectiveness of the model, but the study investigates only the first layer of the network and does not explore the potential benefits of deeper PVTv2 layers [20]. Hao et al. (2021) proposed a unified framework called UP-Vision Transformers (UP-ViTs) for the systematic pruning of vision transformers and their extensions. However, their study revealed that pruning PVTv2-B2 into UP-PVTv2-B1 with UP-ViTs on ImageNet-1k validation increased the accuracy of PVTv2-B1 but was less effective than the deepened PVTv2-B2, suggesting that the lack of depth effectiveness in the design may have contributed to the suboptimal results [21]. Liu et al. (2017) proposed RR-CNN, which features an intensive task approach for non-maximum suppression among different classes, overcoming challenges in detecting strip-like rotated assembled objects. The network outperforms baseline models by a significant margin, but its compatibility with other rotation-based frameworks is limited [18]. Yan et al. (2019) proposed an innovative data augmentation method that uses simulated remote sensing ship images to augment positive training samples, thereby improving the quality of the training set. Experimental results on a ship detection dataset using Faster R-CNN demonstrate the effectiveness of the approach. However, the method is applicable to only a limited number of ship models and does not possess fine-grained classification ability [22]. Zhao et al. [23] explore low-resolution fine-grained object classification and propose a new model that combines the feature equilibrium principle and progressive interaction theory. It improves the accuracy of the network on low-resolution image detection, but for fine-grained classification it improves on the baseline model by only 3.4%, which is not fully satisfactory.

3 Model and Network

We propose a brand-new network called YOLF-ShipPnet, which incorporates a deepened PVTv2 into the backbone of the RetinaNet network. The network structure of YOLF-ShipPnet is shown in Fig. 1. RetinaNet is a comprehensive baseline network consisting of a backbone, a neck, and a head made up of two subnets. The backbone network, PVTv2, extracts feature maps from the target image. The neck of the network is the feature pyramid network (FPN), which is used for multi-scale feature integration. The output of the FPN is then fed into the head, which comprises two subnets, the class subnet and the box subnet, distinguished by their branch functions: one for classification and the other for regression. Specifically, the first subnet performs object classification by convolving the backbone output, and the second subnet performs convolutional regression of the bounding boxes. In YOLF-ShipPnet, considering the need for higher precision along with a deepened network, we choose PVTv2 for its deepened network architecture.

Fig. 1
figure 1

The architecture of the YOLF-ShipPnet network
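To make the composition described above concrete, the following minimal PyTorch sketch chains a backbone, an FPN neck, and the two head subnets. It is a schematic under our own assumptions rather than the exact implementation: the module names, channel width, anchor count, class count, and the five-parameter rotated-box encoding are illustrative.

```python
# Schematic RetinaNet-style composition: backbone -> FPN neck -> two subnets.
# The backbone here is a placeholder; in the paper it is PVTv2.
import torch
import torch.nn as nn


class ClassSubnet(nn.Module):
    """Classification head applied to every FPN level (illustrative widths)."""
    def __init__(self, channels, num_anchors, num_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1),
        )

    def forward(self, feats):
        return [self.conv(f) for f in feats]


class BoxSubnet(nn.Module):
    """Rotated-box regression head; 5 values per anchor (x, y, w, h, angle)."""
    def __init__(self, channels, num_anchors):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_anchors * 5, 3, padding=1),
        )

    def forward(self, feats):
        return [self.conv(f) for f in feats]


class ShipDetector(nn.Module):
    """backbone (e.g. PVTv2) -> neck (e.g. FPN) -> classification and box heads."""
    def __init__(self, backbone, neck, channels=256, num_anchors=9, num_classes=19):
        super().__init__()
        self.backbone = backbone   # returns multi-scale feature maps
        self.neck = neck           # fuses them into a feature pyramid
        self.cls_head = ClassSubnet(channels, num_anchors, num_classes)
        self.box_head = BoxSubnet(channels, num_anchors)

    def forward(self, images):
        feats = self.neck(self.backbone(images))
        return self.cls_head(feats), self.box_head(feats)
```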

3.1 Backbone Network: PVTv2 with Transformer Architecture

Since the introduction of ViT, there has been a large amount of research on vision transformers, roughly along two main directions: one improves the effectiveness of ViT in image classification; the other applies ViT to other image tasks, such as image segmentation and target detection. The PVT [24] used in this paper belongs to the latter. PVT is a simple, convolution-free backbone that can be applied to many dense prediction tasks. Unlike ViT, which employs a pure transformer architecture, PVTv2 incorporates a hybrid architecture that combines transformer and convolutional neural network (CNN) structures. PVT overcomes the difficulty of applying a transformer to various dense prediction tasks with complex partitions and exhibits better feature extraction performance.

PVT was originally proposed by Wang Wenhai and Xie Enze at Nanjing University and has undergone two generations of evolution, Pyramid Vision Transformer Version 1 (PVTv1) and PVTv2 [25]. Generally, PVTv1 has three main limitations. First, PVTv1 treats images as a series of non-overlapping patches, which loses part of the local continuity of the images and limits its application to fine-grained ship classification. Second, the positional encoding in PVTv1 has a fixed size, which is inflexible when processing images of arbitrary size. The most significant drawback of PVTv1, however, is that its network architecture has limited depth, which harms the precision of image classification. Given that the detection precision of our baseline network remains low, we choose the deepened PVTv2 with depth-wise convolution as the backbone, trading a lightweight network for higher precision, as shown in Fig. 2. It can detect dense ship images and extract local features more smoothly for fine-grained classification, which makes it well suited to ship image detection (Table 1).

Fig. 2
figure 2

Comparison of the depth of two versions of PVT

Table 1 Overall network architecture of PVTv2

Different layers of the PVTv2 network (B0–B5) are constructed by changing the following hyperparameters:

\({S}_{i}:\) The stride of the overlapping patch embedding in the ith stage.

\({C}_{i}:\) The number of output channels of the ith stage.

\({L}_{i}:\) The number of transformer encoder layers in the ith stage.

\({R}_{i}:\) The reduction ratio of the spatial reduction attention (SRA) in the ith stage.

\({P}_{i}:\) The adaptive average pooling size of the linear SRA in the ith stage.

\({N}_{i}:\) The number of self-attention heads in the ith stage.

\({E}_{i}:\) The expansion ratio of the feed-forward layer in the ith stage.

The design of the PVTv2 network adheres to the principle used to construct ResNet, where the number of channel dimensions increases as the layers deepen, leading to a theoretical improvement in the detection precision of PVTv2 with increased depth. We therefore suppose that this benefit of deepening PVTv2 also manifests on the HRSC2016 dataset. Taking into account the match between the dataset, network complexity, and computational cost, we deepen our network from B1 to B5.
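The sketch below illustrates how the B-series variants are declared by varying the per-stage hyperparameters listed above, with deeper variants mainly adding encoder layers \(L_i\) in the third stage. The concrete numbers follow the public PVTv2 configurations as we recall them and are given only for illustration (the pooling size \(P_i\) of the linear SRA is omitted); Table 1 remains the authoritative reference for the configuration used here.

```python
# Illustrative declaration of PVTv2 B-series variants: only per-stage
# hyperparameters change between B0 and B5 (values recalled from the public
# PVTv2 release, for illustration only).
from dataclasses import dataclass


@dataclass
class StageCfg:
    stride: int      # S_i: stride of the overlapping patch embedding
    channels: int    # C_i: output channel dimension
    layers: int      # L_i: number of transformer encoder layers
    sr_ratio: int    # R_i: reduction ratio of spatial reduction attention
    heads: int       # N_i: number of self-attention heads
    mlp_ratio: int   # E_i: expansion ratio of the feed-forward layer


def make_variant(depths, channels, mlp=(8, 8, 4, 4)):
    """Build a four-stage PVTv2-style configuration; depth and width vary."""
    strides = [4, 2, 2, 2]      # fixed across variants
    sr_ratios = [8, 4, 2, 1]
    heads = [1, 2, 5, 8]
    return [StageCfg(s, c, l, r, n, e)
            for s, c, l, r, n, e in zip(strides, channels, depths,
                                        sr_ratios, heads, mlp)]


# Deeper variants mostly grow L_3; B5 also shrinks the feed-forward ratio E_i.
PVTV2_B0 = make_variant(depths=[2, 2, 2, 2], channels=[32, 64, 160, 256])
PVTV2_B2 = make_variant(depths=[3, 4, 6, 3], channels=[64, 128, 320, 512])
PVTV2_B5 = make_variant(depths=[3, 6, 40, 3], channels=[64, 128, 320, 512],
                        mlp=[4, 4, 4, 4])
```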

3.2 Neck: Feature Pyramid Net

We apply an FPN to the neck part of the network. The idea of FPN originates from the image pyramid in traditional image processing [26]. It aims to enhance the robustness of the model when input images are of different sizes or when various objects appear in a target detection scene. FPN adopts a multi-scale feature fusion method that considers both global and local features during target detection. FPN enhances a conventional convolutional network with lateral connections and top-down pathways, thereby constructing a comprehensive, multi-scale feature pyramid from a single input image. Each layer of the pyramid can be used to detect objects at different scales. FPN is a powerful technique for improving multi-scale predictions from fully convolutional networks (FCN) and has been used in a range of subsequent networks, such as the region proposal network (RPN), DeepMask object proposals, and two-stage detectors like Faster R-CNN and Mask R-CNN.
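The following is a minimal, self-contained FPN sketch under assumed backbone channel widths; it shows only the core idea of lateral 1 × 1 projections, a top-down additive pathway, and 3 × 3 smoothing convolutions, omitting the extra pyramid levels that RetinaNet appends.

```python
# Minimal FPN: lateral 1x1 convs + top-down upsampling + 3x3 smoothing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(128, 320, 512), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: backbone maps ordered from fine (high resolution) to coarse.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: add the upsampled coarser level to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [sm(x) for sm, x in zip(self.smooth, laterals)]


# Toy usage with randomly generated backbone outputs (channel widths assumed).
feats = [torch.randn(1, 128, 64, 64),
         torch.randn(1, 320, 32, 32),
         torch.randn(1, 512, 16, 16)]
pyramid = SimpleFPN()(feats)   # three maps, each with 256 channels
```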

3.3 Head: Classification and Regression of Rotating Frame Networks

The objective of the focal loss [27] is to address the problem of imbalanced class distribution and the resulting classification difficulty, particularly when the dataset contains a large number of easy background samples and only a few hard-to-classify foreground samples. The focal loss mitigates these problems and enhances detection accuracy by modifying the cross-entropy function, introducing a category weight \(\alpha\) and a sample-difficulty modulating factor \({(1-{p}_{t})}^{\gamma }\). The focal loss function takes the following form:

$$FL\left({p}_{t}\right)=-{\alpha }_{t}{\left(1-{p}_{t}\right)}^{\gamma }\mathrm{log}\left({p}_{t}\right).$$
(1)

In formula 1, \(-\mathrm{log}\left({p}_{t}\right)\) stands for the initial cross-entropy loss function, \({\alpha }_{t}\) is the weight parameter between categories, \({\left(1-{p}_{t}\right)}^{\gamma }\) is the modulating factor between simple and complex samples, and \(\gamma\) is the focusing parameter.
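As a hedged sketch of formula 1, the function below evaluates the focal loss on predicted foreground probabilities; the default values \(\alpha = 0.25\) and \(\gamma = 2\) follow the common RetinaNet setting and are not necessarily those used in our experiments.

```python
import torch


def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal loss of formula 1. probs: predicted foreground probabilities in (0, 1);
    targets: 0/1 ground-truth labels of the same shape."""
    # p_t is the probability assigned to the true class.
    p_t = torch.where(targets == 1, probs, 1.0 - probs)
    # alpha_t balances positive and negative classes.
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1.0 - alpha))
    # (1 - p_t)^gamma down-weights easy samples; -log(p_t) is the CE term.
    return (-alpha_t * (1.0 - p_t).pow(gamma) * torch.log(p_t + eps)).mean()


# Example: an easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
probs = torch.tensor([0.9, 0.1])
targets = torch.tensor([1.0, 1.0])
print(focal_loss(probs, targets))
```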

One common loss function used for bounding box regression in the head of object detection models is the L1 loss. In ship detection, the L1 loss is particularly useful for accurately predicting the coordinates of the bounding box around a ship. By minimizing the sum of absolute differences between the predicted and actual bounding box coordinates, the L1 loss helps to improve the accuracy of the ship detection model. The formula for the L1 loss is as follows:

$$L1= \sum_{i=1}^{n}\left|{y}_{i}-f\left({x}_{i}\right)\right|,$$
(2)

where \({y}_{i}\) denotes the true label and \(f\left({x}_{i}\right)\) indicates the predicted label.
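A correspondingly small sketch of formula 2; the five-parameter rotated-box encoding (x, y, w, h, θ) and the example numbers are assumptions for illustration, and in practice the loss is usually averaged over boxes.

```python
import torch


def l1_box_loss(pred, target):
    """Formula 2: sum of absolute differences between predicted and true box parameters."""
    return (pred - target).abs().sum()


# Example with two rotated boxes encoded as (x, y, w, h, theta) -- illustrative only.
pred = torch.tensor([[50.0, 40.0, 30.0, 10.0, 0.30], [120.0, 80.0, 60.0, 20.0, -0.10]])
target = torch.tensor([[48.0, 42.0, 32.0, 10.0, 0.25], [118.0, 79.0, 62.0, 21.0, -0.05]])
print(l1_box_loss(pred, target))
```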

3.4 Data Augmentation Strategy: HSV [28]

HSV is a color space put forward by A. R. Smith in 1978, inspired by the intuitive properties of color [29], and is also known as the hexcone model. In the field of data augmentation for ship detection, it is used to enhance the color contrast of an image by adjusting the intensity ratios of the hue, saturation, and value channels [30]. It extracts and represents the feature color space of ship images as the lighting conditions and external colors of ships change. The HSV color space can be visualized as a cone, as shown in Fig. 3, accompanied by target images with contrasting brightness and colors. H (hue) represents the phase angle of the color, ranging from \(0^\circ\) to \(360^\circ\). S (saturation) is a ratio correlated with the purity of a specific color; moving along the S axis, the purity of the color increases. V (value) represents the brightness of the color, ranging from 0 at the black bottom point of the cone to 1 at the white top point, with higher values indicating greater brightness.

Fig. 3
figure 3

HSV augmentation for color-oriented data augmentation

In the YOLF-ShipPnet network, HSV is used for data augmentation on the ship dataset by adjusting the three channel values of the color space, aiming to simulate the background state of ship images under various lighting conditions and to adjust the brightness, colors, and other factors of the images so as to reduce the sensitivity of our proposed model to ship colors. This data augmentation strategy suppresses disturbance factors such as changes in light intensity and color differences within a specific ship type, which significantly improves the local feature extraction ability and robustness of the network. The efficiency of training and the performance of our network are further enhanced with the help of YOLOX's HSV random data augmentation technique.
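A minimal sketch in the spirit of YOLOX's HSV random augmentation: one random gain per channel perturbs the hue, saturation, and value of a BGR image. The gain ranges below are assumptions rather than the exact values of the original transform.

```python
import cv2
import numpy as np


def random_hsv_aug(img_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly perturb hue, saturation, and value of a uint8 BGR image."""
    # Draw one random gain per channel in [1 - gain, 1 + gain].
    r = 1.0 + np.random.uniform(-1, 1, 3) * np.array([h_gain, s_gain, v_gain])
    h, s, v = cv2.split(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV))
    h = ((h.astype(np.float32) * r[0]) % 180).astype(np.uint8)          # hue wraps at 180
    s = np.clip(s.astype(np.float32) * r[1], 0, 255).astype(np.uint8)   # saturation
    v = np.clip(v.astype(np.float32) * r[2], 0, 255).astype(np.uint8)   # brightness
    return cv2.cvtColor(cv2.merge((h, s, v)), cv2.COLOR_HSV2BGR)


# Example: augment a synthetic image; real usage would read a ship image from disk.
img = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
aug = random_hsv_aug(img)
```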

3.5 YOLOF-ShipPnet

The model of YOLF-ShipPnet is shown in Fig. 4. The HSV color space is employed as a data augmentation technique to simulate varying lighting effects on ships and the external colors of certain ships. This approach produces a set of synthesized images from the HRSC2016 dataset, which helps to improve model training. We select PVTv2 as the backbone network, an enhanced transformer network whose depth can be inherited and extended. After testing and refinement, our proposed network is expected to carry out global ship detection and fine-grained classification.

Fig. 4
figure 4

Design flowchart of YOLF-ShipPnet

4 Experiment and Analysis

4.1 Dataset

Ablation experiments are performed on the famous remote sensing dataset HRSC2016 [31] and the synthetic aperture radar (SAR) dataset SSDD [32] to validate the effectiveness of our proposed YOLF-ShipPnet network.

Northwestern Polytechnical University published the HRSC2016 [31] dataset in 2016. The set, sourced from Google Earth, contains 1061 images with 4 classes and 19 subclasses, covering 2976 instances of ships. The training, validation, and test sets contain 436, 181, and 444 images, respectively. The image sizes of HRSC2016 range from \(300\times 300\) to \(1500\times 900\), with the majority of images larger than \(1000\times 600\). The dataset covers 27 types of remote sensing ground objects. For a fair comparison with other networks, only ship objects are selected for our experiments.

The SSDD dataset [32] was first unveiled at the SAR in Big Data Era (BIGSARDATA) conference in Beijing in 2017. The set contains 1160 images and 2456 ships, with an average of 2.12 ships per image. The image sizes are around \(500\times 500\). The set is partitioned into training, validation, and test sets with a random ratio of \(7:1:2\). This dataset contains SAR images specifically intended for ship detection with a single ship class.

Our ablation experiments use average precision (\(AP\)) and mean average precision (\(mAP\)) to evaluate the performance of YOLF-ShipPnet. In MMRotate, \(AP\) is generally defined as the total area below the precision–recall curve. \(Precision\) measures the accuracy of the predictions, while \(recall\) reflects the proportion of positive samples that are successfully retrieved. To calculate them, three quantities must be known in advance: \(tp\), the number of correctly determined positive samples; \(fp\), the number of negative samples incorrectly determined as positive; and \(fn\), the number of positive samples that are missed. Formulas 3 and 4 illustrate the calculation of \(precision\) and \(recall\):

$$Precision=\frac{tp}{tp+fp}.$$
(3)
$$Recall=\frac{tp}{tp+fn}.$$
(4)

After plotting the corresponding \(precision\) and \(recall\) data points as a curve, the value of \(AP\) can be calculated by integrating the area beneath the curve. Then, \(mAP\) is derived by averaging the \(AP\) over all categories.
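As a hedged sketch of formulas 3 and 4 and the \(AP\) integral, the function below ranks detections by confidence, accumulates precision and recall, and integrates the area under the resulting curve; matching rotated boxes to ground truth by IoU, which MMRotate performs internally, is omitted here.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """AP from ranked detections.
    scores: confidence of each detection; is_tp: 1 if the detection matches a
    ground-truth ship, else 0; num_gt: total number of ground-truth ships."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    precision = tp / (tp + fp)    # formula 3
    recall = tp / num_gt          # formula 4, since tp + fn = num_gt
    # Integrate precision over recall (area under the PR curve).
    return float(np.trapz(precision, recall))


# Example: 4 detections, 3 ground-truth ships; mAP would average AP over classes.
print(average_precision(scores=[0.9, 0.8, 0.7, 0.6], is_tp=[1, 0, 1, 1], num_gt=3))
```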

4.2 Configuration of Ablation Experiment and Model Training

All the experiments are conducted on a deep-learning server. The detailed configuration is shown in Table 2.

Table 2 Configuration of parameters

Our experiments are trained on the HRSC2016 dataset. The optimizer of YOLF-ShipPnet is AdamW, with a momentum coefficient of \(0.9\) and a weight decay coefficient of 0.05. The initial learning rate of the model is \(0.0001\) and is gradually reduced during training so that the model converges quickly. Training is run for 72 epochs to ensure the convergence of the network.
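The settings above correspond roughly to the following PyTorch optimizer configuration; the step-decay schedule and its milestones are illustrative assumptions, since the exact decay scheme is not specified here.

```python
import torch

# model = YOLFShipPnet(...)  # hypothetical constructor, for illustration only
model = torch.nn.Linear(8, 2)  # stand-in module so the snippet runs on its own

# AdamW with the settings reported above: lr = 1e-4, beta1 = 0.9, weight decay = 0.05.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.05)
# Illustrative step decay so the learning rate shrinks as training progresses;
# training is capped at 72 epochs as in our experiments.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[48, 66], gamma=0.1)

for epoch in range(72):
    # ... one training pass over HRSC2016 would go here ...
    scheduler.step()
```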

4.3 Ablation Experiments

The YOLF-ShipPnet we propose employs PVTv2 as the backbone, and YOLOX’s HSV is used for random data augmentation. To analyze the extent to which the proposed network elevates the performance of the model, we design a set of ablation experiments.

RetinaNet serves as the baseline object detection framework in our experimental setup, acting as a standard of comparison. It is composed of a backbone network and an FPN. The backbone network extracts image features, while the FPN produces feature maps of varying resolutions for further regression and classification. To demonstrate the effectiveness of our backbone network, we compare the performance of PVTv2 with the baseline. For depth effectiveness experiments, we explore the efficacy of PVTv2 layer by layer, comparing their \(mAPs\) and investigating the general trend of \(mAPs\) with increased depth. Random data augmentation based on HSV is also performed and compared with the baseline and the PVTv2 layer with the best performance. Additionally, we assess the fine-grained classification capability of our network and use the baseline for comparison. Finally, the generalization ability of YOLF-ShipPnet over different datasets is evaluated by replacing the original dataset with SSDD.

4.3.1 Effectiveness of PVTv2

In this section, we replace only the backbone of the baseline with PVTv2, in its PVTv2-B0 configuration, to evaluate the effectiveness of PVTv2. Table 3 below presents the results of the experiment after 72 epochs.

Table 3 Ablation experiments of PVTv2 on HRSC2016

According to the results in Table 3, the feature extraction accuracy of the baseline reaches \(52.50\%\), indicating that our baseline is reliable. Comparing the \(AP\) values across categories, the PVTv2 group generally has higher \(AP\) values than the RetinaNet group, and the \(mAP\) is improved by \(0.41\%\). Therefore, it can be concluded that PVTv2 effectively enhances the feature extraction ability of YOLF-ShipPnet.

4.3.2 Depth Effectiveness of PVTv2

In this section, PVTv2-B0 is used as the control group. We inherit PVTv2-B0 and modify the weights to obtain the B-series networks based on PVTv2 to prove the depth effectiveness of PVTv2. To observe the changes in indicators, Table 4 lists the effect of B0, B3, and B5: the control group, the group with moderate effect, and the group with the best effect.

Table 4 Ablation experiments of the depth effect of PVTv2 on HRSC2016

According to Table 4, it can be seen that there is an upward trend in the \(mAP\) from B0 to B5. The overall detection accuracy of the model was improved by \(2.27\%\) and \(2.78\%\) in each step, with a total increase of \(5.05\%\). Comparing the average accuracy of each method in Table 4, the mean precision level shows an overall upward trend from PVTv2-B0 to PVTv2-B5. PVTv2-B5 has the highest mean average precision, which is expected and demonstrates the depth effectiveness of PVTv2.

4.3.3 Effectiveness of HSV Data Augmentation on Ship Dataset

In this section, we aim to verify the contribution of data augmentation to the detection performance of our model. Based on the networks used in the above experiments, we only add the YOLOX HSV random augmentation (\(YOLOXHSVRandom\)) to randomly adjust the hue, saturation, and value of ship images.

Considering the inheritance relationship among the networks, we first add the augmentation to the baseline alone to verify that it improves detection accuracy without PVTv2. Then, we add the HSV strategy to PVTv2-B5 to verify that the deepened PVTv2 and data augmentation jointly contribute to model performance.

Table 5 presents the results of two groups of experiments based on the baseline, with and without data augmentation. The \(mAP\) value increases from \(52.50\%\) to \(53.84\%\) after augmentation, a gain of \(1.34\%\) in average precision, indicating that adding data augmentation alone can improve model performance.

Table 5 Ablation experiments of HSV data augmentation on HRSC2016

Table 6 shows the results of the three experimental groups: PVTv2-B0, PVTv2-B5, and PVTv2-B5_Aug. Comparing PVTv2-B5 with and without data augmentation, the \(mAP\) increases by \(0.17\%\). The \(mAP\) values of PVTv2-B5 and PVTv2-B5_Aug are \(5.05\%\) and \(5.22\%\) higher than that of PVTv2-B0, respectively. The results demonstrate the robustness of our model and show that the augmentation can further improve detection performance on top of PVTv2, verifying the effectiveness of the data augmentation strategy.

Table 6 Ablation experiments of HSV data augmentation with increased depth on HRSC2016

4.3.4 Effectiveness of Fine-Grained Classification Experiment for Ship Dataset

The above ablation experiments validate the effectiveness of PVTv2-B5_Aug, which is \(5.63\%\) more accurate than the baseline.

In this section, the baseline and PVTv2-B5_Aug are used to detect 31 subclasses of ships to examine the ability of the YOLF-ShipPnet network for fine-grained ship detection. Table 7 shows the results of the fine-grained experiments of the baseline and PVTv2-B5_Aug on HRSC2016.

Table 7 Fine-grained classification experiments on HRSC2016

From Table 7, it can be seen that both the baseline and PVTv2-B5_Aug can be used for fine-grained detection. PVTv2-B5_Aug performs better on fine-grained detection, improving on the baseline by \(10.03\%\).

4.3.5 Performance of YOLF-ShipPnet on the SSDD Dataset

In this section, the dataset is replaced with SSDD to verify the generalization ability of PVTv2-B5_Aug (YOLF-ShipPnet). Table 8 shows the \(mAP\) of the baseline and PVTv2-B5_Aug on the SSDD dataset.

Table 8 Ablation experiments on the SSDD dataset

Comparing the detection accuracy of the two groups, PVTv2-B5_Aug shows an improvement of 5.46% over the baseline. This reflects the strong generalization ability of our proposed network and indicates its potential for application to other datasets.

4.3.6 Loss Curve for Training

The following plots show the training losses of the above ablation experiments. In these plots, the networks converge after 72 epochs (Figs. 5, 6, 7).

Fig. 5
figure 5

Loss curve for ablation experiments on HRSC2016

Fig. 6
figure 6

Loss curve for ablation experiments on SSDD

Fig. 7
figure 7

Loss curve for fine-grained classification experiments on HRSC2016

4.4 Visualization of the Result

We visualize the results of baseline and PVTv2-B5_Aug on HRSC2016 to intuitively compare the detection effect before and after the model improvement.

As shown in Fig. 8, part (i) shows the visualization results of the baseline and part (ii) demonstrates the results of PVTv2-B5_Aug.

Fig. 8
figure 8

Effectiveness of PVTv2-B5_Aug on HRSC2016

In Fig. 8, some ships that are not identified by the baseline detector are identified by PVTv2-B5_Aug, indicating that PVTv2-B5_Aug achieves better detection performance than the baseline.

In Fig. 9, the detection precision of PVTv2-B5_Aug is higher than that of the baseline for the same ship, indicating that PVTv2-B5_Aug can identify ships more accurately. From Fig. 10, we find that both the baseline and PVTv2-B5_Aug can detect multiple classes of ships, and PVTv2-B5_Aug performs better in terms of identifiability and accuracy.

Fig. 9
figure 9

Higher detection accuracy of PVTv2-B5_Aug on HRSC2016

Fig. 10
figure 10

Effectiveness of PVTv2-B5_Aug with fine-grained experiment on HRSC2016

At the same time, PVTv2-B5_Aug on the SSDD dataset also achieves a better detection effect, which verifies the generalization ability of the model. Figure 11 demonstrates the visualization results, which show the effectiveness of PVTv2-B5_Aug on the SSDD dataset.

Fig. 11
figure 11

Effectiveness of PVTv2-B5_Aug on SSDD

4.5 Comparisons Among the Advanced Networks

Table 9 shows the performance of YOLF-ShipPnet and other networks on the HRSC2016 dataset. The \(mAP\) of our proposed network shows a significant improvement over other ship detection models, further verifying the depth effectiveness of the PVTv2 backbone and the excellent performance of YOLOX's HSV random data augmentation strategy. In addition, the networks listed can only perform global ship detection, whereas our network extends to fine-grained classification for more specific ship classification purposes.

Table 9 Performance of YOLF-ShipPnet and other networks on HRSC2016

5 Conclusion

This paper proposes a rotation ship detection network YOLF-ShipPnet based on RetinaNet, which innovatively introduces the application of deepened PVTv2 network and HSV strategy for data augmentation. Generally, the backbone network utilizes the popular transformer structure along with the deepened PVTv2 network, which focuses on exploring the depth effectiveness in the context of ship image detection. The neck part employs the FPN model for multi-scale fusion of features. The head part takes in the combined characteristics and performs classification and regression of the rotating frame. To further improve the generalization and fine-grained classification abilities of our proposed network, we applied the random data augmentation strategy HSV on the ship datasets to complement the PVTv2 network and achieve a more cohesive and effective performance. Through a series of validation and ablation experiments, it has been confirmed that the YOLF-ShipPnet exhibits promising depth effectiveness for the detection of ships. Furthermore, the efficacy of the HSV data augmentation strategy has been demonstrated, resulting in significantly improved accuracy compared to the baseline model. The use of this strategy also makes the models less sensitive to such external factors as color or light changes. In addition, the YOLF-ShipPnet has demonstrated exceptional generalization abilities, particularly for fine-grained classification, as verified using the HRSC2016 and SSDD datasets. These results suggest that the proposed network has great potential for applications in industrial ship management. Overall, our work highlights the significant strengths of the PVTv2 network for enhancement of accuracy in the depth dimension and the importance of the HSV data augmentation strategy for improving the generalization capability. The use of this network in real-world scenarios may lead to significant improvements in the efficiency and effectiveness of ship management systems.

6 Reflection and Future Work

The YOLF-ShipPnet network has shown promising results in terms of its depth effectiveness. However, there is still room for improvement in the model's performance by further tuning the parameters associated with the number of layers, which will be investigated in future studies. While FPN has been used as the neck part of the network, other frameworks such as Faster R-CNN may offer promising performance in ship detection tasks due to their robustness and flexibility. Therefore, it may be worthwhile to retrain the network using Faster R-CNN and compare the results with the previous ones. Currently, the HSV technique is utilized as a means of random data augmentation to enhance the network's ability to generalize when presented with ship images that vary in color or lighting conditions. However, this method is limited in some circumstances, and other data augmentation strategies should be explored in the future to accommodate different application scenarios.