Ship feature recognition methods for deep learning in complex marine environments

With the advancement of edge computing, computing power that was originally centralized is deployed closer to the terminal, which directly accelerates the iteration of the "sensing-communication-decision-feedback" chain in complex marine environments, including ship collision avoidance. The increase in sensor equipment, such as cameras, has also accelerated the development of ship identification technology based on feature detection in the maritime field. Based on the SSD framework, this article proposes a deep learning model called DP-SSD. By adjusting the size of the detection frame, different feature parameters can be detected. Through learning and testing on actual data, it is compared with Faster RCNN, SSD and other classic algorithms. The proposed method provided high-quality results in terms of calculation time, processed frame rate, and recognition accuracy. As an important part of future smart ships, this method has theoretical value and engineering significance.


Introduction
The collision of a large cruise ship or tanker is a major maritime disaster. A collision causes a series of consequences, such as personal injury, property damage, and pollution of the affected sea area. As shown in Fig. 1, the tanker Sanchi sank in the East China Sea. The accident was investigated through the International Maritime Organization (IMO), and the expert group determined that it was mainly caused by human factors, specifically the visual misrecognition of targets in complex marine environments. To address this problem, DP-SSD is proposed as a real-time identification and navigation method for ships based on deep learning (DL). Artificial intelligence (AI) recognition technology has been developing rapidly, and target recognition and tracking methods for complex environments have been studied and tested [9,53]. At the same time, AI and DL research in maritime engineering has mainly focused on real-time monitoring data fusion methods for integrating navigation and communication (NavCom) in the complex maritime environment and mainly processes the data of the automatic ship identification system (AIS) [2,11]. Because previous researchers did not use experimental platforms for feature recognition of marine ships, no DL-based real-time monitoring method for NavCom in the complex maritime environment has been available [43].
Because of its reporting time intervals, AIS does not reflect ship data in real time. In specific ship navigation scenarios, such as narrow waterways, waters with a high density of ship traffic, or extremely low navigation, AIS may be invalid.
The real-time ship identification method based on the single-shot multi-box detector (SSD) for NavCom in a complex maritime environment is a brand-new method that can help ships avoid collisions and identify obstacles in autonomous navigation. Unlike the traditional ship AIS, the SSD real-time ship identification method is modeled on human vision. In addition, ships sailing in polar regions, narrow waterways, and ship anchorages have a strong practical need for it. In recent years, numerous marine accidents and marine pollution incidents have been caused by vessel navigation and marine engineering construction. Only new methods can reduce these types of incidents.
In the past two decades, the large-scale use of automatic ship identification systems has not substantially reduced maritime accidents. Therefore, research on multisource heterogeneous data fusion methods for NavCom in a complicated maritime environment is important. This paper focuses on three points: first, there is an actual need for a ship feature identification method based on deep learning for the integration of communication and navigation in a complex maritime environment; second, the paper explains the shortcomings of the current AIS and discusses the limitations of real-time dynamic ship monitoring and identification based on AIS data; finally, the paper explains the role of the ship feature identification method based on deep learning for the integration of communication and navigation in a complex maritime environment. The results show that the deep learning-based ship feature recognition method with integrated communication and navigation can mimic biological visual recognition and identify ships on the sea surface in a complex marine environment.
The rest of this paper is organized as follows: in Sect. 2, we present a literature review on the research of multisource heterogeneous data fusion methods for the integration of communication and navigation in a complex maritime environment. In Sect. 3, the experimental framework is used to reveal a ship's feature recognition method based on DL for NavCom in complex environments. Section 4 introduces the research area and dataset, before presenting the results and discussion. Finally, a conclusion is made.

Literature survey
Collision avoidance spans multiple disciplines and has been studied by many scholars and engineers. The solutions introduced can be divided into methods based on positioning and navigation information, on radio frequency identification (RFID) information, and on image information detection and multi-sensor data fusion [41]. Among them, the method based on positioning information uses positioning and navigation information as the basis for collision detection. Its advantage is that it can use position, velocity and time (PVT) information to solve collision prediction and real-time detection issues. The disadvantage is that it strongly relies on PVT data: the refresh rate is high, and it is difficult to raise real-time alarms for the positioning deviation caused by multiple factors [29,35,57]. The method based on RFID information converts the multi-object detection problem into an anti-collision problem between multiple readers and tags. The advantage is that the time division multiple access (TDMA) ALOHA algorithm can effectively increase the number of objects detected, but the disadvantage is that a time reference is required to synchronize all tags in the reader's reading area [1,3,51,52]. The method based on image information uses image information as the cue for detecting collisions. Its advantage is that it is more intuitive and follows bionic principles, such as those of the human eye. The disadvantage is that accurate and fast recognition depends on training sets and algorithms [12,19,59].
Fig. 1 "Sangji" accident scene
In certain specific fields, AI has already shown unique advantages over manual operations. For example, based on artificial intelligence technology, the detection of the human colon, gallbladder, appendix, and stomach, as well as the identification of diseased bodies, can provide doctors with robust and reliable pathological data [14,39,40,58]. AI target recognition methods based on visual recognition have been applied in research on multisource heterogeneous data fusion for the integration of communication and navigation in complex maritime environments [34,50].
Machine learning has been widely used in various fields. The literature [20] focuses on natural language recognition tasks and studies a network-based sentence clustering recognition method. The concept of pre-reliability is used to measure the credibility of the sentence recognition results. A network simulation program analyzes the influence of the neural network on sentence recognition and applies a post-reliability evaluation index to the credibility of the model's construction. The literature [8,33,36,37,46] examines multiple application scenarios, including B2B communication services, wireless sensor network keys and personal wireless networks. Among them, [46] proposes a deep Boltzmann machine (DBM) for learning a generative model of data composed of multiple input modes. The experimental results on bimodal data composed of images and text show that the multimodal DBM can learn a good generative model. The joint space for image and text input is useful for retrieving information from unimodal and multimodal queries.
There are also many reviews on AI target recognition. The literature [44] systematically introduces deep learning technology for marine target recognition. Its methods chapter is divided into two parts: supervised learning and unsupervised learning. The supervised learning part introduces the respective principles and progress of AlexNet, ZFNet, VGG-16, GoogLeNet, ResNet, and SENet, while the unsupervised learning part focuses on the deep belief network (DBN) and the autoencoder model. The review briefly introduces surface and underwater target datasets, lists the existing datasets Fish4K, Kyutech-10K, QuickBird, SPOT-5, HRSC2016, R2VV-p, VAIS, E2S2-Vessel, FleetMon, MARVEL, and CCF-BDCI, and compares them. The method is analyzed in three parts: data preprocessing, feature extraction and recognition, and model optimization. Among them, image preprocessing is a key step in the recognition of ocean targets with deep learning models because images or videos are the premise and application foundation of deep learning methods. The literature [22,28,31,55] compares target detection algorithms (R-CNN, Fast-RCNN, Faster-RCNN), end-to-end detection algorithms (YOLO series, SSD algorithm) and recently proposed algorithms (Cascaded RCNN, RefineDet, SNIP, R-FCN-3000, DES, and STDN). Using accuracy and speed as measurement indicators, it analyzes and summarizes their respective advantages and disadvantages on public datasets.
The main task of target detection is composed of object recognition and object location. The task of the former is to classify objects, and the task of the latter is to obtain the position of the object in the picture [32,42].
Mainstream target detection models can be divided into two-stage models and one-stage models. The former first generates a series of candidate frames, using methods such as selective search, EdgeBoxes, DeepMask, and the region proposal network (RPN), and then performs classification and regression in a second step; region convolutional neural networks (R-CNN), Fast R-CNN, and Faster R-CNN are typical representatives. The latter performs classification and regression while generating candidate boxes in each cell of the feature map; You Only Look Once (YOLO) and the single shot multi-box detector (SSD) are typical representatives.
The two-stage models evolved from R-CNN to Fast R-CNN and then to Faster R-CNN. R-CNN target detection can be divided into four steps: setting the extraction frames, extracting features frame by frame, training the classifier for image classification, and using regression to fine-tune the position of the selected frame. During the evolution of R-CNN, the idea of spatial pyramid pooling (SPP Net) contributed substantially [23,30]. Fast R-CNN was proposed by Microsoft in 2015. Compared with the R-CNN algorithm, there are two differences. First, when training the classifier, a neural network is used to replace the support vector machine (SVM). Second, in terms of efficiency, Fast R-CNN is 9 times faster than R-CNN in the training phase and 213 times faster in the test phase [15]. The Faster R-CNN algorithm consists of two modules: the RPN candidate frame extraction module and the Fast R-CNN detection module. Among them, RPN is a fully convolutional neural network used to extract candidate frames. Fast R-CNN detects and recognizes the targets in the proposals extracted by RPN [6,27,47].
Among the one-stage models, the YOLO algorithm uses a simple convolutional neural network to predict the bounding box and the category at the same time, using the features of the entire image for prediction to reduce errors. It does not require preprocessing and postprocessing. Its limitations are that the test scale must be consistent with the training scale and that each grid cell can only predict one object. YOLOv2 builds on YOLO and is faster and more accurate, with more than 9,000 recognition categories. However, the disadvantage is that with the 13 × 13 output feature map, smaller objects may not have obvious features or may even be ignored, although YOLOv2 uses a 1 × 1 convolution to reduce the number of channels from 512 to 64. In addition, batch normalization (BN) is added after the convolutional layers to achieve a 2% improvement; it helps regularize the model so that dropout can be removed. YOLOv3 uses logistic regression to predict the box confidence, and a prior whose Intersection over Union (IOU) with the actual box is greater than 0.5 is treated as a positive example. Unlike SSD, if multiple priors satisfy this condition, only the prior with the largest IOU is taken [18,45,49,56].
In summary, among target detection networks, two-stage models have higher accuracy, and one-stage models are faster.

Proposed methodology
At present, target feature learning and recognition technology has made great progress in industry. For example, leading technology companies such as China SenseTime Technology and Questyle Technology have incorporated classic algorithms into engineering applications. However, there has been no fundamental breakthrough in innovative research based on classic algorithms. In principle, YOLO and SSD are both very concise. They do not perform deep fitting on the sample dataset but only on the linear connection of multidimensional data sample points. Therefore, the test is meaningful only with respect to the test set and the data samples. The numerical values of the points are simple and relative; therefore, their single-stage design gives them a fast detection speed relative to Faster RCNN.
Furthermore, target recognition has become a trending research topic in various areas. Shi et al. [38] proposed an accurate and effective target detection method called the feature fusion and enhancement for single shot detection (FFESSD) model. Experiments verify that at 54.3 FPS with a 300 × 300 input image, the average accuracy of FFESSD reaches 79.1%, and at 30.2 FPS with a 512 × 512 input image, it reaches 81.8%, which is significantly better than the conventional SSD model, the deconvolution single shot detector (DSSD), feature fusion SSD (FSSD) and other advanced detectors. Liu et al. [25] proposed a new type of SSD model. In the MICCAI challenge, it was found to have the fastest speed and the best accuracy and recall rate; in general, it has excellent performance in polyp detection. Target detection for SAR ship images has also become a very prevalent research topic. Zhang et al. [53] proposed an improved Faster-RCNN method that improves the detection of small-scale ships while simultaneously improving the recall rate, providing a detection method based on high-resolution remote sensing images for offshore and inland ships. Chang et al. [5] proposed a YOLOv2-reduced method. Testing and verification on the SAR (SSDD) and DSSD showed that, compared with Faster R-CNN, the YOLOv2-reduced method improved the accuracy of ship detection and substantially reduced the calculation time. Zhang et al. [54] proposed a high-speed ship detection method for SAR images based on a convolutional neural network (ABYI, G-CNN for short). The experimental results show that G-CNN detects faster than other existing methods, such as faster regional convolutional neural networks (Faster R-CNN), SSD, and YOLO. Li et al. [21] proposed a context-based feature fusion single-shot multi-box detector (CBFF-SSD) framework for SAR images and adopted parallel deep learning processing with multiple neural processing units. The high-efficiency hardware architecture of the neural processing unit (NPU) is composed of two-dimensional processing elements (PE), which can calculate multiple output feature maps at the same time. Experiments show that it can be further optimized for the detection of small targets, reducing the computational complexity and parameter size. Lorencin et al. [16] used AI algorithms to identify marine targets from the air and classified them into ships and other targets.
In 2016, Liu et al. [26] proposed SSD, a method that uses a single deep neural network to detect objects in images. It discretizes the output space of the bounding boxes into a set of default boxes with different aspect ratios and scales for each feature map location. On the VOC2007 test, SSD reaches 74.3% mAP at 59 FPS for a 300 × 300 input and 76.9% mAP for a 512 × 512 input, which is better than the Faster R-CNN model of that time. In 2017, Fu et al. [10] proposed a method that introduces additional context into general object detection. With a 513 × 513 input, their DSSD reaches 81.5% mAP on the VOC2007 test, 80.0% mAP on the VOC2012 test, and 33.2% mAP on the COCO test, which is better than the R-FCN method on each dataset. Based on the traditional SSD, the literature [17] obtains better generalizability by changing the structure close to the classifier network instead of adding layers. On the Pascal VOC 2007 test set, when trained on the VOC 2007 and VOC 2012 training sets, the mAP is 78.5% with a 300 × 300 input at 35.0 FPS. With a 512 × 512 input at 16.6 FPS, the mAP reaches 80.8%. The accuracy is better than those of the traditional SSD, YOLO, Faster-RCNN and RFCN, and the method is also faster than Faster-RCNN and RFCN. Reference [48] proposed a context-aware single-shot multi-frame object detector (CSSD) based on the traditional SSD. It combines high detection accuracy with real-time speed, and experiments demonstrate that CSSD achieves better detection efficiency, especially when detecting small objects. The literature [4] proposed a multilevel feature fusion method for small object detection. The experimental results show that the mAP obtained by the multilevel feature fusion method on PASCAL VOC2007 is 2-3% higher than that of the baseline SSD, especially for small objects. Their speeds are 43 FPS and 40 FPS, which are better than the 29.4 and 26.4 FPS of DSSD. Reference [24] proposed a novel method based on pyramids and an SSD, called the function single shot detector. On the Pascal VOC 2007 test, the network reaches 82.7 mAP with an input size of 300 × 300 at a speed of 65.8 FPS. Its performance on other test sets is also better than those of the other methods. Using multiscale context fusion, reference [7] proposes the WeaveNet method, which greatly simplifies the structure. The PASCAL VOC 2007 and PASCAL VOC 2012 tests show that WeaveNet brings large performance improvements.

DP-SSD framework
To suppress repeated detection frames while limiting the computing power needed to process a large amount of data, we propose an optimization algorithm that uses the SSD network topology. The algorithm is a deep learning algorithm based on a feedforward CNN with optimized computation, which identifies bounding frames of arbitrary size according to the size of the target in the picture. We used a feature extractor to extract features from the network images and compared the VGG-16 network with a convolutional neural network (CNN) in the experiment. We designed a new feature recognition framework to obtain two completely different feature detectors. As shown in Fig. 2, compared with the traditional convolutional neural network (CNN), it has stronger feature recognizability.
As shown in Fig. 2, we use image features generated from layers with different spatial sizes and resolutions to construct a new feature extractor. Because the irregular pyramid structure consists of multilayer images with different sizes and resolutions, it places extremely high robustness requirements on the network's feature recognition. Based on this, we append several convolutional feature layers to the end of the network, one for each feature map detector of a different scale. Each convolutional feature layer produces its predictions by applying its own matched set of convolutional filters. For a feature layer with c channels and a resolution of m × n, a 3 × 3 × c convolution kernel is used to compute either a classification score or the coordinate offsets relative to the default frame.
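The size of such a prediction layer can be illustrated with a small calculation. The sketch below assumes the standard SSD convention, in which every location of an m × n feature layer predicts one score per class plus four offsets for each default frame; the function name, the number of default frames per cell and the two-class setup (ship and background) are illustrative assumptions, not values taken from the DP-SSD configuration.

```python
def ssd_head_outputs(m, n, boxes_per_cell, num_classes):
    """Count the outputs of one 3x3 prediction head on an m x n feature layer.

    Each of the m*n locations predicts, for every default frame, one score
    per class plus four coordinate offsets (cx, cy, w, h).
    """
    per_box = num_classes + 4
    filters = boxes_per_cell * per_box   # number of 3x3xc kernels needed
    outputs = filters * m * n            # values produced per image
    return filters, outputs


# Hypothetical example: a 38 x 38 layer, 4 default frames per cell,
# 2 classes (ship and background).
filters, outputs = ssd_head_outputs(38, 38, boxes_per_cell=4, num_classes=2)
print(filters, outputs)
```

Under these assumptions, the number of kernels depends only on the number of default frames and classes, while the number of output values grows with the resolution of the feature layer.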

Training model
First, the ship tag information is automatically identified and labeled on the target ship pictures recognized by the system. Second, the images with ship information labels are input to the feature recognizer of the SSD network model. Finally, the detection results are output after the recognition computation. Once the correspondence between default boxes and real label boxes is clearly specified, the loss function and backpropagation can be applied end to end (Fig. 3).
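Network training is divided into three steps:
Step 1: each default box is matched to the real label boxes. Let

$$x_{ij}^{p} \in \{0, 1\},$$

where $x_{ij}^{p} = 1$ indicates that the i-th default box matches the j-th real label box of category p. According to this matching strategy, $\sum_{i} x_{ij}^{p} \geq 1$ is obtained; that is, at least one default box can match the j-th real label box. The overall target loss function is the weighted sum of the confidence loss and the position loss and can be expressed by:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right),$$

where x denotes whether the matched box belongs to class p and c is the confidence level. If x is 1, the default box is positive; otherwise, x is 0, and the default box is negative. l indicates the predicted box, and g denotes the truth box. N is the total number of matching default boxes, and α is the weight of the position loss. When N = 0, the loss is considered to be 0. The confidence loss is

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right),$$

where $\hat{c}_{i}^{p}$ represents the probability that the i-th default box belongs to p, which can be defined as:

$$\hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}.$$

The position loss is the smoothing loss between the prediction box and the true label value box parameters:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p} \, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right),$$

where $l_{i}^{m}$ denotes the offsets of the predicted box relative to the default box, $\hat{g}_{j}^{m}$ represents the truth box matching the j-th real label box of category p, and m indicates the four-dimensional vector as:

$$\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad \hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad \hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}}.$$

The smoothing loss is given as:

$$\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5 z^{2}, & |z| < 1 \\ |z| - 0.5, & \text{otherwise,} \end{cases}$$

where cx, cy are the center coordinates and w, h are the width and height of a box, $g_{j}$ is the j-th truth box, and $d_{i}$ is the i-th default box, with $d_{i}^{w}$ and $d_{i}^{h}$ being the width and height of the i-th default box, respectively.
[Fig. 3 panel labels: Loss function calculation; Model parameter update]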
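To make the loss concrete, the following is a minimal pure-Python sketch of the confidence loss, the position loss and their weighted sum for a handful of default boxes. It follows the standard SSD formulation; the function names, the data layout of `positives` and `negatives`, and the default weight `alpha=1.0` are assumptions made for illustration and are not taken from the DP-SSD implementation.

```python
import math


def smooth_l1(z):
    """Smooth L1 loss: quadratic near zero, linear elsewhere."""
    return 0.5 * z * z if abs(z) < 1.0 else abs(z) - 0.5


def encode(g, d):
    """Encode truth box g relative to default box d; both are (cx, cy, w, h)."""
    return ((g[0] - d[0]) / d[2],
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),
            math.log(g[3] / d[3]))


def softmax(scores):
    """Turn raw class confidences into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ssd_loss(positives, negatives, alpha=1.0):
    """Weighted sum of confidence and position loss, divided by N.

    positives: list of (class_scores, label, predicted_offsets,
               truth_box, default_box) for matched default boxes.
    negatives: list of class_scores for boxes assigned to background
               (class index 0, an assumption of this sketch).
    """
    n = len(positives)
    if n == 0:
        return 0.0  # the loss is defined as 0 when nothing matches
    conf = 0.0
    loc = 0.0
    for scores, label, l, g, d in positives:
        conf -= math.log(softmax(scores)[label])
        g_hat = encode(g, d)
        loc += sum(smooth_l1(li - gi) for li, gi in zip(l, g_hat))
    for scores in negatives:
        conf -= math.log(softmax(scores)[0])
    return (conf + alpha * loc) / n
```

A perfectly localized positive box contributes no position loss, so in that case the total reduces to the confidence term alone.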
Step 2: to reduce the high computing power demand of the ship identification system in complex marine environments, the calculation process is greatly simplified, in addition to running the feature recognition on the GPU. This makes the training robust to pictures of different scales and yields the best results. At the same time, the experimental results of Hariharan et al. [13] show that using feature maps from the lower layers improves the segmentation quality of the image. Similarly, global context features can be used to smooth the results. By combining the predictions of all default frames with different aspect ratios at every location of feature maps of different sizes, a prediction set covering input objects of different sizes and shapes can be obtained. For example, in Fig. 4a, the green box of the ship matches the 4 × 4 feature map, and the red box matches the 8 × 8 feature map, as shown in Fig. 4b and c. However, sometimes a default box cannot match the corresponding object: the default boxes have different sizes, and one that does not match the ship's box is considered a negative sample during training.
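How default frames of different sizes and aspect ratios tile feature maps of different resolutions can be sketched as follows. The linear scale rule and the limits 0.2 and 0.9 follow the original SSD design; the aspect ratios, the number of layers and the assignment of scales to the 4 × 4 and 8 × 8 maps are illustrative assumptions, since the DP-SSD settings are not listed here.

```python
import math


def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Evenly spaced frame scales, one per feature layer (fraction of image size)."""
    if num_layers == 1:
        return [s_min]
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) default frames for a square fmap_size x fmap_size layer."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) / fmap_size   # frame centers sit at cell centers
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy,
                              scale * math.sqrt(ar),
                              scale / math.sqrt(ar)))
    return boxes


scales = layer_scales(6)
coarse = default_boxes(4, scales[-2])   # 4 x 4 map: few, large frames
fine = default_boxes(8, scales[-3])     # 8 x 8 map: more, smaller frames
print(len(coarse), len(fine))
```

In this sketch the coarse map produces fewer but larger frames, which is why a large ship tends to match the 4 × 4 map while a smaller one matches the 8 × 8 map.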
Step 3: after matching against the ship target labels of the real experimental data, only a small part of the default frames are positive samples, which leads to a serious imbalance between positive and negative samples during training. This also decreases the detection rate of marine ship target monitoring. Instead of using all of the negative samples in the dataset, we keep only the top-ranked ones so that the ratio between negative and positive samples is 3:1. Our experiments show that, for ship target recognition training in a complex maritime navigation environment, data augmentation makes the model more robust to different input sizes and shapes. Each training image is randomly sampled: if the center of a real ship label frame lies in the sampled patch, the overlapping part of the frame is retained. At the same time, to ensure the robustness and reliability of the feature recognition, each sampled patch is resized to a fixed resolution and then flipped with a probability of 0.5 to enhance the final detection result.
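The sample balancing and the random flip in Step 3 can be sketched as below. Ranking the negatives by their confidence loss follows the original SSD practice and is an assumption here, as are the function names, the normalized (xmin, ymin, xmax, ymax) box layout and the choice of a horizontal flip; in a full pipeline the image pixels would be mirrored together with the boxes.

```python
import random


def mine_hard_negatives(negative_losses, num_positives, ratio=3):
    """Return the indices of the negatives kept for training.

    negative_losses: confidence loss of every negative default frame.
    Keeps the highest-loss negatives, at most ratio * num_positives of them.
    """
    keep = ratio * num_positives
    order = sorted(range(len(negative_losses)),
                   key=lambda i: negative_losses[i], reverse=True)
    return sorted(order[:keep])


def maybe_flip(boxes, rng, p=0.5):
    """Horizontally flip normalized (xmin, ymin, xmax, ymax) boxes with probability p."""
    if rng.random() >= p:
        return boxes, False
    flipped = [(1.0 - x2, y1, 1.0 - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped, True


# Hypothetical example: one positive frame and seven negatives.
kept = mine_hard_negatives([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.05], num_positives=1)
boxes, was_flipped = maybe_flip([(0.1, 0.2, 0.4, 0.6)], random.Random(0))
```

With one positive frame and a 3:1 ratio, only the three negatives with the largest loss are kept, so the easy background frames no longer dominate the training signal.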

Proposed algorithm
The DP-SSD network algorithm is used to detect ships in the following steps:
Step 1: similar to SSD, the training method classifies the default boxes into positive samples (objects) and negative samples (background).
Step 2: the IOU values between each default box and the true boxes (truth value boxes) are calculated. A default box with IOU > 0.5 is a positive sample, and its label is the label of the truth box with which it has the largest IOU; default boxes with IOU < 0.5 are marked as negative samples.
Step 3: because the classification vector includes a background class, negative samples can participate in the loss calculation of the category, though they cannot participate in the loss calculation of the coordinate regression.
In summary, we chose the SSD algorithm as the basis because of its timeliness and accuracy. Accordingly, DP-SSD is a new feature recognition method that uses a basic network structure, a network feature extraction method and an optimized multilayer network computing architecture. This is a brand-new feature recognition method for the automatic identification of marine target features.
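The labeling rule in the steps above can be sketched in a few lines. The corner-format boxes, the function names and the use of -1 for background are assumptions of this sketch; a box with IOU exactly equal to 0.5 is treated as negative here because the text leaves that case open.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_default_boxes(defaults, truths, threshold=0.5):
    """Return one label per default box.

    The label is the index of the truth box with the largest IOU when that
    IOU exceeds the threshold, and -1 (background) otherwise.
    """
    labels = []
    for d in defaults:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(truths):
            overlap = iou(d, g)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        labels.append(best_j if best_iou > threshold else -1)
    return labels
```

Boxes labeled -1 would contribute only to the category loss, while the others contribute to both the category loss and the coordinate regression loss. The guarantee from the training section that every truth box is matched by at least one default box would need an extra pass, which is omitted here.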

Data selection
We use the DP-SSD model to measure and evaluate Kaggle's ship dataset. We also adopt experimental scenes recorded at different times in the same year and under different complicated maritime navigation environments. For example, more than 10 voyages in total were made in Shanghai; Fuzhou City, Fujian Province; and Zhoushan City, Zhejiang Province, collecting more than 50 video sets and more than 2000 experimental pictures. The details of the datasets are as follows. First, the Kaggle ship dataset is used to test the feature recognition of more than 1000 ships, including special scenes with different backgrounds and complex environments, such as multitarget ships, coral reefs, and lighthouses. In addition, our experiment adaptively selects more than 1000 multitarget and special scene pictures from a total of 25 video detection results in 5 voyages.
We examine, pixel by pixel, the results of ship feature recognition in complex maritime scenes produced by the SSD model. Our experimental results show that the maritime ship identification built on the SSD method is limited by the real frame boundary. To avoid this situation, we convert each mask to a frame structure.
In this way, all ship targets can be placed in frames that the SSD model can identify, as shown in Fig. 5.
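One simple way to turn a pixel mask into a frame is to take the tightest axis-aligned rectangle around its nonzero pixels. The sketch below assumes one binary mask per ship, stored as a list of rows; the function name and this data layout are illustrative and do not describe the dataset's actual encoding.

```python
def mask_to_box(mask):
    """Return (xmin, ymin, xmax, ymax) pixel indices of the nonzero region.

    mask is a list of rows of 0/1 values; None is returned for an empty mask.
    """
    rows = [r for r, line in enumerate(mask) if any(line)]
    if not rows:
        return None
    cols = [c for line in mask for c, value in enumerate(line) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


# Hypothetical 4 x 4 mask containing one small ship.
example = [[0, 0, 0, 0],
           [0, 1, 1, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 0]]
print(mask_to_box(example))
```

The resulting frame discards the shape of the mask, which is acceptable here because the detector only regresses rectangular frames.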

Evaluation indicators
This article presents a method for evaluating ship target recognition in complex maritime navigation scenarios. First, we determine whether each detection in the sailing scenes is true or false. If a ship is present and is detected, the result is a true positive. If any target other than a ship is detected, it is considered a false alarm, that is, a false positive. If a ship is present but is not found, and no ship picture is output, the result is defined as a false negative. If no ship is present and the ship monitoring system does not find or output a picture with ship characteristics, the result is defined as a true negative. True positives are denoted as TP, false positives as FP, false negatives as FN, and true negatives as TN. The measurement results differ greatly across scenarios in complex environments at sea. We use different indicators, such as accuracy, recall, specificity, the F1 measure, and the F2 measure, to evaluate the experiment (see Table 1).
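For reference, the indicators named above can be computed from the four counts as in the following sketch. The formulas are the standard definitions; whether Table 1 uses exactly these forms is an assumption, and precision is included because the F measures are built from it.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute standard indicators from TP, FP, FN and TN counts."""

    def ratio(num, den):
        return num / den if den else 0.0   # avoid division by zero

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)

    def f_beta(beta):
        # beta > 1 weights recall more heavily than precision
        b2 = beta * beta
        return ratio((1 + b2) * precision * recall, b2 * precision + recall)

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }


# Hypothetical counts for one test scene.
scores = detection_metrics(tp=6, fp=2, fn=4, tn=8)
```

The F2 measure weights recall more heavily than precision, which suits collision avoidance, where a missed ship is usually more costly than a false alarm.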
Section 4 uses these key evaluation indicators to measure the performance of DP-SSD target feature recognition and its adaptability to different scenarios in complex environments. In summary, all of the algorithm models for ship target recognition in a complex navigation environment are evaluated with key performance indicators such as the precision-recall histogram, mAP, FPS and running time.
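As an illustration of how the average precision behind mAP and the frame rate can be obtained, the sketch below computes the area under the precision-recall curve for one class and a simple frames-per-second figure. The all-point interpolation used here is one common protocol and is an assumption, as are the function names and the input layout; the evaluation protocol used for the reported numbers is not specified in this section.

```python
def average_precision(detections, num_truths):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_truths: number of ground-truth ships in the test set.
    """
    if num_truths == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []                      # (recall, precision) after each detection
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_truths, tp / (tp + fp)))
    # Replace each precision by the best precision at any higher recall.
    best = 0.0
    envelope = []
    for recall, precision in reversed(points):
        best = max(best, precision)
        envelope.append((recall, best))
    envelope.reverse()
    ap, prev_recall = 0.0, 0.0
    for recall, precision in envelope:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def frames_per_second(num_frames, elapsed_seconds):
    """Average processing rate over a run."""
    return num_frames / elapsed_seconds
```

The mAP is then the mean of this value over all classes; with a single ship class it coincides with the AP of that class.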

Implementation
We use the Python programming language for the experiments, with an NVIDIA GeForce RTX 2080Ti graphics card and an Intel(R) Core(TM) i9-9900KF CPU @ 3.6 GHz processor, to measure and evaluate ship target recognition in complex sailing environments. All parameters are shown in Table 2.
Our experiment is based on a training dataset with a limited number of sampled targets. Therefore, we repeat the training and sampling process and average the results to improve the reliability of ship feature recognition and reduce overfitting. The experiments and evaluations are conducted as follows:
- Random sampling, different scenarios;
- Adjust all pictures to 300 × 300, 320 × 320, 416 × 416, 448 × 448, 512 × 512, 544 × 544;
- Random sampling;
- Finally, the results of multiple sets of experiments are evaluated.
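The repeated sampling procedure listed above might be organized as in the following sketch. The number of runs, the sampled fraction, the seed and the `evaluate` callable are all hypothetical placeholders; only the list of input sizes comes from the text.

```python
import random
from statistics import mean

# Input sizes named in the experimental procedure.
INPUT_SIZES = [300, 320, 416, 448, 512, 544]


def repeated_evaluation(samples, evaluate, runs=5, fraction=0.8, seed=0):
    """Average a score over repeated random subsamples for every input size.

    evaluate(subset, size) is a user-supplied callable (hypothetical here)
    that trains or tests on the subset resized to size x size and returns
    a numeric score.
    """
    rng = random.Random(seed)
    results = {}
    for size in INPUT_SIZES:
        scores = []
        for _ in range(runs):
            subset = rng.sample(samples, int(len(samples) * fraction))
            scores.append(evaluate(subset, size))
        results[size] = mean(scores)
    return results
```

Averaging over several random subsamples reduces the influence of any single draw, which matters when the number of labeled targets is small.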

Fig. 5 The green rectangular bounding box is the annotation used during DP-SSD training
The experimental results show that ship target recognition in a complex sailing environment using SSD classifies and regresses the targets in a picture while also recognizing picture pixels. The method makes the most efficient use of the hardware operating environment and can identify ship targets in different scenarios, as well as in complex navigation environments. Accordingly, this method is more accurate and effective than two-stage detection methods, such as Faster RCNN.

Recall
As shown in the figure above, the legend entries 1 to 10 on the right side indicate recall rates from 10% to 100% in steps of 10%. Legend 1 indicates that the recall rate is 10%, 2 indicates that the recall rate is 20%, and 10 indicates that the recall rate is 100%. The SSD with three backbones therefore detects targets more accurately than the other methods for ship target recognition in a complex navigation environment, although its performance is slightly lower than that of the two-stage target recognition method.
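A histogram over the ten recall levels can be derived from a precision-recall curve as in the sketch below. It assumes that each bar shows the best precision reachable at or above the corresponding recall level, which is a common convention but is not confirmed by the text; the function name is illustrative.

```python
from typing import List, Optional, Sequence


def precision_at_recall_levels(
    precisions: Sequence[float],
    recalls: Sequence[float],
    levels: Optional[Sequence[float]] = None,
) -> List[float]:
    """Best precision achieved at recall >= level, for each recall level.

    Returns 0.0 for a level that the detector never reaches.
    """
    if levels is None:
        # Recall levels 10%, 20%, ..., 100% (legend entries 1 to 10).
        levels = [i / 10 for i in range(1, 11)]
    histogram: List[float] = []
    for level in levels:
        # Small tolerance guards against floating-point rounding at the boundary.
        reachable = [p for p, r in zip(precisions, recalls) if r >= level - 1e-12]
        histogram.append(max(reachable) if reachable else 0.0)
    return histogram
```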

Results and analysis
Real-time detection of ships is a very difficult task because ships differ in shape and are captured from different video monitoring angles. During testing, ship roll and pitch and strong or weak lighting also affect the speed and results of real-time detection. In this article, the tested SSD framework has a high accuracy, a low computational complexity, and a fast speed. To explore the capability of the SSD method, multiple sets of experiments were performed with Faster RCNN (VGG16), YOLO, YOLOv2, YOLOv3, SSD300, SSD512, RefineDet320, RefineDet512, and SSD (DP-SSD300 and DP-SSD512). Finally, all of these methods were tested on image and video data from Shanghai, Zhoushan City, Zhejiang Province, and Fuzhou City, Fujian Province, to compare and evaluate their real-time ship monitoring performance.
As shown in Fig. 7a, the average recognition precision (blue legend) of DP-SSD is slightly better than that of Faster RCNN (VGG16), but the difference is only approximately 5%. The recognition frame rate (red legend) of DP-SSD300 reaches 54.49, which is significantly better than those of DP-SSD512 and Faster RCNN (VGG16). In Fig. 7b, the average recognition precision of YOLO reaches 90.5%, which is significantly better than DP-SSD300's 76.45% and DP-SSD512's 77.97%. The recognition frame rate of YOLO, at up to 42.33, lies between those of DP-SSD300 and DP-SSD512. YOLO v2 is roughly the same as DP-SSD in average recognition precision, but its recognition frame rate is better than that of DP-SSD, as shown in Fig. 7c. In Fig. 7d, YOLO v2 544 × 544 is roughly the same as DP-SSD in average recognition precision, but its recognition frame rate lies between those of DP-SSD300 and DP-SSD512. The average recognition precision of YOLO v3 reaches 85.11%, which is better than those of DP-SSD300 and DP-SSD512. However, the mAP of YOLO v2 reaches 74.21%, which is slightly weaker than that of DP-SSD300 and much stronger than that of DP-SSD512. As shown in Fig. 7f, SSD300 is slightly lower than DP-SSD300 and DP-SSD512 in average recognition precision, while its FPS reaches 58.79, which is better than that of DP-SSD. In Fig. 7g, the mAP of SSD512 is close to that of DP-SSD, and the FPS of SSD512 is slightly higher than that of DP-SSD512 but clearly below that of DP-SSD300. In Fig. 7h, the mAP of RefineDet320 reaches 76.81%, which is comparable to that of DP-SSD. However, the FPS of RefineDet320 reaches 46.81, which is significantly superior to that of DP-SSD512 but slightly lower than that of DP-SSD300.
In Fig. 7i, the average recognition precision of RefineDet512 is comparable to that of DP-SSD, reaching 77.71%, while its FPS is 29.46, which is much lower than that of DP-SSD300 and slightly higher than that of DP-SSD512.
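Because the comparison above trades precision against frame rate, it can help to list the methods that no other method beats on both indicators at once. The sketch below does this for the four methods whose mAP and FPS are both quoted in the text; the remaining methods are omitted because one of the two numbers is not stated. The helper name pareto_front is an illustrative assumption.

```python
from typing import Dict, List, Tuple

# (mAP in %, FPS) pairs quoted in the text above; other methods are left out
# because the text does not give both numbers for them.
RESULTS: Dict[str, Tuple[float, float]] = {
    "DP-SSD300": (76.45, 54.49),
    "YOLO": (90.5, 42.33),
    "RefineDet320": (76.81, 46.81),
    "RefineDet512": (77.71, 29.46),
}


def pareto_front(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of methods not dominated in both mAP and FPS by any other method."""
    front: List[str] = []
    for name, (m_ap, fps) in results.items():
        dominated = any(
            other_map >= m_ap and other_fps >= fps
            and (other_map > m_ap or other_fps > fps)
            for other, (other_map, other_fps) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

With these four entries, pareto_front(RESULTS) keeps DP-SSD300, YOLO and RefineDet320, and drops RefineDet512 because YOLO is both more precise and faster.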

Precision comparison
The results clearly show that DP-SSD is considerably better than the other methods in terms of target detection performance, in both average accuracy and detection speed. We attribute this to the multiscale network proposed in this paper, which can also detect multilevel features and thereby makes the final detection result more accurate and more efficient.

Comparison with RefineDet
Target detection with the RefineDet method proceeds in multiple stages. It first generates several detection frames simultaneously, then performs statistical regression and classification on the detected target pictures, and finally determines the target recognition result.
Because this process makes RefineDet cumbersome for ship detection, the proposed scheme has comparative advantages over it in terms of detection accuracy and time.

Detection and identification
The confidence test of the ship area is shown in Fig. 8. Different methods, and even different scales within the same method, can have a considerable impact on the average detection accuracy. At the easy confidence level of the ship recognition area, the average recognition precision of both DP-SSD300 and DP-SSD512 proposed in this paper reaches 90.13%. This is superior to YOLO v2 (89.21%), YOLO v2 544 × 544 (88.97%), YOLO v3 (87.38%), SSD300 (88.15%), SSD512 (76.22%) and RefineDet512 (89.79%), and slightly lower than YOLO (91.99%) and RefineDet320 (90.37%). However, at the moderate and hard confidence levels, our proposed models do not achieve the desired results.
Regarding the complexity of the algorithms, as illustrated in Fig. 9, the running time of DP-SSD is approximately 0.3 s, which is less than the 0.37 s of YOLO and the 0.7 s of SSD512 and RefineDet320, and slightly better than the 0.34 s of RefineDet512.
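Running time and frame rate of a detector can be measured with a simple timing loop such as the sketch below. The callable detect stands for any detector and is a hypothetical placeholder; the text does not state whether the reported running time refers to one image or to a batch, so the sketch simply reports the mean time per frame and the resulting FPS.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def measure_runtime(
    detect: Callable[[Any], Any],  # hypothetical detector: frame -> detections
    frames: Sequence[Any],
    warmup: int = 1,               # assumed number of untimed warm-up frames
) -> Tuple[float, float]:
    """Return (mean seconds per frame, frames per second) over all frames."""
    if not frames:
        raise ValueError("frames must not be empty")
    # Warm-up runs are excluded so that one-off start-up costs do not skew the result.
    for frame in frames[:warmup]:
        detect(frame)
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    seconds_per_frame = elapsed / len(frames)
    fps = len(frames) / elapsed if elapsed > 0 else float("inf")
    return seconds_per_frame, fps
```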
Differences between the target and its external environment can cause ships to be missed during detection. In response to this problem, the proposed DP-SSD network adopts a variety of feature combinations in the learning stage so that the learning network converges to the target. Under differing environments, this robust combination of features allows the detector to generalize at test time.
In addition, the multiscale, multi-shape convolution allows the method proposed in this article to obtain more differentiated features, from contours to textures.
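One way in which multiple scales and shapes enter an SSD-style detector is through its default boxes. The sketch below follows the scale and aspect-ratio rule of the original SSD framework, with scales spread linearly between a minimum and a maximum across the feature maps; the specific values used by DP-SSD are not given in the text, so the defaults here are assumptions taken from standard SSD.

```python
import math
from typing import List, Sequence, Tuple


def default_box_scales(num_maps: int, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Linearly spaced box scales, one per feature map (fractions of image size)."""
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shapes(
    scale: float,
    aspect_ratios: Sequence[float] = (1.0, 2.0, 0.5),  # assumed ratios
) -> List[Tuple[float, float]]:
    """(width, height) of each default box; every box keeps the area scale**2."""
    return [
        (scale * math.sqrt(ratio), scale / math.sqrt(ratio))
        for ratio in aspect_ratios
    ]
```

Shallow feature maps receive the small scales and deep maps the large ones, so distant and nearby ships are matched by boxes of different sizes.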
The choice of filter is of great importance for ship targets, at least in terms of connection speed and feature extraction. Specifically, lower-level features reflect direct characteristics, such as geometric features, and the higher the level, the more hidden the characteristics that the features reflect.
The DP-SSD method for real-time ship detection is shown in Fig. 10. When the proposed scheme correctly detects a ship object, the target frame is green. The figure shows that our method can operate at different distances and scales and can identify single or even multiple targets, which supports the feasibility of the scheme proposed in this article.
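Whether a detection counts as correct is commonly decided by the overlap between the predicted frame and the annotated frame. The sketch below uses intersection over union (IoU) with a threshold of 0.5; both the criterion and the threshold are assumptions, since the text does not state how a correct detection is defined.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_detection(predicted: Box, annotated: Box, threshold: float = 0.5) -> bool:
    """True when the predicted frame overlaps the annotation by at least the threshold."""
    return iou(predicted, annotated) >= threshold
```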

Conclusions and future work
In this research, the authors propose a neural network named DP-SSD. In the feature extraction training stage, it uses extraction boxes of different sizes to learn from the sample training set. In the testing stage, video samples combined with picture sequences are used as the test set. A comprehensive evaluation along three dimensions (calculation time, processed frame rate, and recognition accuracy) shows that DP-SSD is better than the classic algorithms used for comparison. As an important part of future edge platforms for the automatic identification of intelligent ships, this research has both theoretical value and practical engineering significance. It should be emphasized that, owing to the limited number of data samples collected in this study, sample differences and the limitations of the equipment itself, there is still a gap between the feasibility of the project and its application in actual scenes. At the same time, the external environment of the terminal sensors cannot be completely consistent, which causes the performance results to be stable in the short term but time-varying in the long term. The success of ship detection also depends on the weights of the features. In future work, these factors will be further adjusted from the perspective of weights, giving the algorithm stronger time-invariant characteristics and robustness.
In the future, hardware that improves detection accuracy and efficiency will be further developed and will become cheaper and more stable. In addition, recommendations will be made to the IMO and the IALA for the establishment of demonstration projects (Fig. 11).