Introduction

Ships face a complex marine environment and harsh sea conditions when sailing at sea. Ship stability is an important measure of whether a ship can navigate safely. At present, the Second Generation Intact Stability Criteria classify ship stability failure into five modes [1]: Pure loss of stability, Parametric roll, Dead ship stability, Surf-riding and broaching, and Excessive acceleration. Among them, Dead ship stability and Excessive acceleration are two failure modes that tend to occur when the ship is sailing under crosswinds and waves, and they have many similarities in terms of phenomena and calculation methods. Therefore, both are studied together in this paper. The ship stability failure probability is the probability of a failure mode occurring when a ship sails under a range of sea states and wind and wave conditions. Its calculation not only covers a large number of sea conditions and requires a large number of ship features, but also involves computing parameters such as the roll damping coefficient, steady heel angle and resistance. Therefore, this paper is dedicated to building a black-box model with machine learning methods that predicts the stability failure probability of a ship from a small number of input ship features.

For rapid prediction, assessment and intelligent control [2,3,4,5,6], machine learning algorithms such as neural networks and support vector machines are emerging as tools for solving multi-dimensional fitting problems and approximating nonlinear functions in intelligent marine systems [7,8,9,10,11,12]. In the field of ship performance assessment, machine learning is widely used to calculate and predict ship speed and resistance, maneuverability and seakeeping. Regarding ship speed and resistance prediction, researchers have adopted neural networks to predict various resistances of ships [13,14,15,16]. Yang [16] adopted a Radial Basis Function Neural Network (RBFNN) to predict the resistance of container ships and compared it with the prediction results of an Error Back Propagation Neural Network (BPNN), Support Vector Machine (SVM) and Random Forest (RF). In terms of ship seakeeping, scholars have carried out extensive research on methods for predicting the rolling motion of ships [17, 18], and have further extended roll prediction to multi-degree-of-freedom motions [19,20,21]. Silva conducted related research on the six-degree-of-freedom motion of ships [20]. In the study of ship maneuverability [23], He [22] established a non-parameterized ship maneuvering motion model based on a fully connected neural network. Machine learning prediction has also been widely extended to ship engine fuel consumption [24,25,26,27], ship engine state prediction [28], ship positioning [29], ship classification [30] and other studies [31]. Fan [32] used a machine learning method to predict the ship collision force on a bridge. Artificial intelligence technology has also increasingly been introduced at the ship design stage. Cepowski adopted an artificial neural network to determine design formulas for container ships at the preliminary design stage [33].

Table 1 Nomenclature and abbreviations

However, in the field of ship stability, most research has performed direct calculation and assessment in compliance with the level criteria of the second generation intact stability [34,35,36]. Some scholars have used other methods for stability prediction: Duan [37] proposed a domain prediction method for ship nonlinear motion and acceleration response, and Liu [38] adopted the Computational Fluid Dynamics (CFD) method to evaluate the pure loss of stability. All of these studies assess ship stability through direct calculation.

In this paper, we try eight machine learning methods that perform well in other fields: two neural network models, RBFNN and BPNN, and six ensemble models, RF, Bagging, Extremely randomized trees (Extra-tree), Adaptive Boosting (AdaBoost), Gradient Boosting Decision Tree (GBDT) and eXtreme Gradient Boosting (XGBoost). Based on them, we propose a joint multi-model confidence-based method for predicting ship stability failure probability. The method is inspired by the idea of co-training in semi-supervised learning and the idea of heterogeneous ensembles in ensemble learning. It combines only the models that have high confidence in the current prediction, thus avoiding the influence of inferior learners. Tests on a test set verify that the proposed method has a lower mean square error than the other models while maintaining a low mean absolute percentage error, i.e., higher accuracy. The main contributions of this paper are as follows:

  • Ship features required for the prediction are obtained by sensitivity analysis. The initial features include various ship parameters, such as principal dimensions, ship form parameters and load parameters. Features that have no effect on a given failure mode are removed as redundant when predicting the corresponding failure probability.

  • Eight machine learning models are tried and, based on them, a joint multi-model confidence-based method is proposed. The core of this method is to add the prediction result of the current model, as a label, to the training sets of the other models, to measure through cross-validation how much the accuracy of the other models improves after this addition, and to use that improvement as the prediction confidence of the current model. Ultimately, only high-confidence models are retained for combination.

  • By conducting experiments that simply average the results of the component learners, it is confirmed that simply combining different models does not necessarily improve accuracy and is sometimes very susceptible to poor-quality learners. In most cases, the accuracy of the simple average is only better than that of the worst learner and worse than that of the best.

  • Experiments on various pre-processing methods for the ship features are conducted. Tests on real ship data verify that the proposed method is more accurate than the other models: it not only has a smaller variance in prediction, but also maintains a smaller bias. Unlike the other models, the method is not susceptible to changes in how the input ship features are processed.

The remainder of this article is organized as follows. In section “Ship Stability Failure Mode”, we introduce the principles of the two failure modes studied in this paper. In section “Machine Learning Algorithms for Ship Stability Failure Probability Prediction”, we introduce several commonly used machine learning prediction methods, describe our proposed method, and elaborate on its specific implementation steps. In section “Other Details”, some other details are presented, including the search strategy for hyperparameters and the screening method for ship features. Then, in section “Experiments”, we investigate the effects of multiple input feature preprocessing methods on the prediction performance. The experimental results under multiple evaluation metrics are provided to demonstrate the superiority of our proposed method. In the final section, a discussion and conclusion of the study are presented. The nomenclature and abbreviations of this paper can be seen in Table 1.

Ship stability failure mode

Currently, the Second Generation Intact Stability Criteria issued by the International Maritime Organization (IMO) classify ship stability failure into five modes: Dead ship stability, Excessive acceleration, Pure loss of stability, Parametric rolling and Surf-riding/broaching [39]. Among them, Dead ship stability and Excessive acceleration are two failure modes that tend to occur when the ship is sailing under crosswinds and waves. These two failure modes have many similarities in terms of phenomena as well as calculation methods. Therefore, both are studied together in this paper.

Dead ship stability failure mode

Dead ship stability refers to the stability failure mode in which the ship loses power and maneuvering ability due to mechanical failure while sailing in wind and waves, and resonant rolling (or even capsizing) occurs under free-floating conditions [40].

Fig. 1
figure 1

Scenario of stability failure in dead-ship conditions

As shown in Fig. 1, when the ship is rolling to the windward side, it is subjected to both sudden wind and constant wind. The ship starts to roll back under the joint action of wind and waves, and the speed and reaction force of lateral drift increase. As the ship continues to roll to the upwind side, the speed and reaction force of lateral drift continue to increase, providing an additional heeling moment. The ship then reaches its maximum roll angle on the upwind side, at which point the chance of stability failure is greatest.

Excessive acceleration failure mode

Excessive acceleration [41] is the phenomenon of excessive rolling acceleration when the ship is sailing in wind and waves, which causes damage to cargo as well as injuries to people.

As shown in Fig. 2, when a ship rolls, the higher a position is (such as the bridge), the greater its distance of transverse movement. The rolling period is the same for every point on the ship, so the greater the distance of transverse movement, the greater the linear velocity of that point. When the direction of the ship’s rolling changes, a point with greater linear velocity therefore experiences greater acceleration, and the greater the acceleration, the greater the inertia force. For people and cargo in different positions on the ship, the danger of the lateral inertia force is much greater than that of the vertical inertia force.

Machine learning algorithms for ship stability failure probability prediction

In this section, we introduce the proposed joint multi-model machine learning prediction method, which is used to make predictions of ship stability failure probability. In addition, the specific implementation steps of the method are described in detail.

Artificial neural network prediction models

Artificial neural networks are widely used in fields such as ship resistance forecasting and ship maneuverability forecasting, and they have a powerful nonlinear fitting capability. In this paper, BPNN and RBFNN are taken as the focus of the research.

Fig. 2
figure 2

Plot of ship excessive acceleration

BPNN [42] is a multilayer feedforward network trained with the error back propagation algorithm. A typical BP neural network consists of one input layer, one or more hidden layers, and one output layer. The number of hidden layers and the number of nodes in each hidden layer can have a great impact on its prediction accuracy, as can the choice of activation function and optimizer. In this paper, “ReLU” is selected as the activation function and L-BFGS as the optimizer. The “ReLU” activation function is calculated as follows:

$$\begin{aligned} f(x)=\max (0,x) \end{aligned}$$
(1)

RBFNN is a three-layer feedforward neural network that can be used to solve multi-dimensional fitting problems. In the forward pass, the signal enters the network at the input layer and is processed by the radial basis functions in the hidden layer; the network output is then obtained by linearly combining the outputs of the hidden nodes. Figure 3 shows the structure of the RBFNN used to predict ship stability.

Fig. 3
figure 3

Network structure of ship stability prediction

Depending on the stability failure mode to be predicted, the ship parameters are taken as input and processed by the basis functions in the hidden layer. The outputs of the hidden layer are then linearly superimposed to obtain the failure probability of each failure mode. In Fig. 3, \(f_{1}\),\(f_{2}\),...,\(f_{n}\) are the basis functions with different centers and widths, and \(w_{ij}\) is the weight coefficient from the ith hidden node to the jth output node.

In this study, the Gaussian function [43] is used as the basis function, and its calculation formula is as follows:

$$\begin{aligned} \varphi (r)=e^{-\frac{r^{2}}{2\sigma ^{2}}} \end{aligned}$$
(2)

where r is the distance between the input vector x and the center of the basis function, and \(\sigma \) is the standard deviation.

When using the trained RBFNN to predict the ship stability, the prediction results can be obtained by taking parameters such as ship length, ship width, moulded depth, and speed as input according to the different failure modes selected for prediction. The calculation formula is as follows:

$$\begin{aligned} y_{RBF}(x)=\sum _{p=1}^{P}w_{p}\varphi (||c_{p}-x||) \end{aligned}$$
(3)

where x is the vector of input features, P is the number of hidden nodes, \(w_{p}\) is the weight coefficient from the pth hidden node to the output layer, and \(c_{p}\) is the center of the pth Gaussian function.

The performance of RBFNN is affected by the number of center points, the positions of the center points and the width of the basis functions. In this paper, we use random search to determine the number of center points and the Particle Swarm Optimization (PSO) algorithm to find the positions of the center points, while the standard deviation is determined by the following equation:

$$\begin{aligned} \sigma = \frac{d_{max}}{\sqrt{2n}} \end{aligned}$$
(4)

where \(d_{max}\) is the maximum distance between centers, n is the number of centers.
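
As a minimal illustrative sketch (not the implementation used in this paper), the RBFNN forward pass of Eqs. (2) to (4) can be written in a few lines of Python. The centers and weights below are hypothetical values chosen only for illustration; in the paper the centers come from PSO and the weights from training.

```python
import math

def rbf_width(centers):
    """Eq. (4): sigma = d_max / sqrt(2 n), with d_max the largest
    distance between any two centers and n the number of centers."""
    n = len(centers)
    d_max = max(math.dist(a, b) for a in centers for b in centers)
    return d_max / math.sqrt(2 * n)

def rbf_predict(x, centers, weights, sigma):
    """Eq. (3): weighted sum of Gaussian basis functions (Eq. (2))."""
    total = 0.0
    for c, w in zip(centers, weights):
        r = math.dist(c, x)  # distance between input and center
        total += w * math.exp(-r ** 2 / (2 * sigma ** 2))
    return total

# Hypothetical two-feature example (values are assumptions).
centers = [(0.0, 0.0), (3.0, 4.0)]
weights = [0.5, 1.5]
sigma = rbf_width(centers)
y = rbf_predict((0.0, 0.0), centers, weights, sigma)
```

Here the two centers are 5 units apart, so Eq. (4) gives a common width of 5/√4 = 2.5 for both basis functions.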

Ensemble learning prediction models

Ensemble Learning is a class of methods that address supervised machine learning tasks by integrating multiple learning algorithms to improve prediction results. By combining multiple learners, Ensemble Learning can often achieve significantly better generalization performance than a single learner. Ensembles are of two types: sequential ensembles and parallel ensembles. The algorithms studied in this paper can accordingly be divided into two categories, one based on the idea of Bagging and the other based on the idea of Boosting.

Bagging

The Bagging [44] algorithm generates training data by sampling with replacement. By randomly sampling the initial training set with replacement over multiple rounds, multiple training sets are generated in parallel, and a base learner is trained on each of them (there are no strong dependencies between base learners). These base learners are then combined to build a strong learner. The essence of the method is the introduction of sample perturbation, which reduces variance by increasing sample randomness. In this paper, a bagging algorithm with regression trees as base learners is used (hereinafter referred to as Bagging-tree), and the output is obtained by averaging.

RF (Random Forest) is an ensemble learning algorithm based on decision trees, which adds random attribute selection to bagging. RF is simple, easy to implement and has very little computational overhead, and it shows impressive performance for both classification and regression. RF was proposed by Breiman [45] and can be used for classification, regression, and multidimensional data processing. The basic unit of RF is a series of decision trees that follow binary rules, also known as classification and regression trees (CART). Compared with traditional regression models, the RF model tolerates noise and outliers better, has higher computational efficiency, can learn multi-dimensional nonlinear mappings, and achieves a better fit. Figure 4 shows the structure of the RF for predicting ship stability.

Fig. 4
figure 4

Forest structure for ship stability prediction

When combining prediction outputs, RF usually uses the voting method for classification tasks and the averaging method for regression tasks. This study is a regression problem, so the averaging method is used to combine the prediction outputs. Taking parameters such as ship length, ship width, moulded depth and speed as input features, and using a separate random forest for each predicted failure mode, the prediction results are obtained as follows:

$$\begin{aligned} y_{RF}(x)=\frac{1}{T}\sum _{i=1}^{T}h_{i}(x) \end{aligned}$$
(5)

where T is the number of CART trees and \(h_{i}(x)\) is the prediction result of the ith regression tree.
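
The bootstrap sampling of Bagging and the averaging of Eq. (5) can be illustrated with the following minimal Python sketch. It is not the paper's implementation: to stay short, each base learner is a hypothetical stand-in that simply predicts the mean target of its bootstrap sample instead of a regression tree, and the sample data are invented for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) samples with replacement (one Bagging training set)."""
    return [rng.choice(data) for _ in data]

def fit_mean_learner(sample):
    """Hypothetical stand-in for a regression tree: always predicts
    the mean target value of the sample it was trained on."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y

def bagging_fit(data, n_learners, seed=0):
    """Train n_learners base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [fit_mean_learner(bootstrap_sample(data, rng))
            for _ in range(n_learners)]

def ensemble_average(learners, x):
    """Eq. (5): average the predictions of the T base learners."""
    return sum(h(x) for h in learners) / len(learners)

# Hypothetical (feature, failure probability) pairs.
data = [(1.0, 0.02), (2.0, 0.03), (3.0, 0.05), (4.0, 0.04), (5.0, 0.08)]
learners = bagging_fit(data, n_learners=10)
y_bag = ensemble_average(learners, 2.5)
```

Replacing `fit_mean_learner` with a regression tree gives Bagging-tree; additionally restricting each split to a random subset of features gives RF.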

Extra-Trees (Extremely randomized trees) [46] are very similar to Random Forests, and are sometimes referred to as Random Forests. The extreme randomness of Extra-Trees compared to Random Forest is manifested in the partitioning of decision tree nodes: Extra-Trees split a node directly using a random feature and a random threshold on that feature. The randomness of each submodel (decision tree) in Extra-Trees is therefore greater, and so is the variability between submodels. When making predictions, the base regressor combination method of Extra-Trees is exactly the same as that of RF, as shown in Eq. (5).

Boosting

The training process of Boosting is ladder-like: the base models are trained sequentially, and each base model learns on the basis of what the previous base model has learned. Finally, the predictions of all base models are combined to produce the final prediction, most commonly by weighting.

AdaBoost (Adaptive Boosting) [47] is adaptive in the sense that samples that are wrongly classified by the previous base classifier are given greater weight, and the reweighted samples are used to train the next base classifier. A new weak classifier is added in each round until a predefined, sufficiently small error rate or a pre-specified maximum number of iterations is reached. In this study, regression trees are used as the base learners of AdaBoost, and the prediction result when AdaBoost is used for regression is calculated as follows:

$$\begin{aligned} y_{AdaBoost}(x)=\sum _{m=1}^{M}\left( \ln \frac{1}{a_{m}}\right) g(x) \end{aligned}$$
(6)

where M is the number of base regressors, \(a_{m}\) is the weight of the mth regressor, and g(x) is the median of all \(a_{m}G_{m}(x)\), i.e., the median of the weighted outputs of all weak learners \(G_{m}(x)\).

GBDT (Gradient Boosting Decision Tree) [48] is also an Ensemble Learning algorithm based on the idea of boosting. The core of GBDT is to accumulate the results of all trees as the final result. Each tree of GBDT is fitted to the residuals left by the previous trees, so that the outputs of all trees add up to the predicted value of GBDT, as shown below:

$$\begin{aligned} y_{GBDT}(x)=\sum _{i=1}^{k}f_{i}(x) \end{aligned}$$
(7)

where k is the number of base regressors and \(f_{i}(x)\) is the result of the ith tree.
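
The residual-fitting idea behind Eq. (7) can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the GBDT configuration used in this paper: it uses squared-error loss, a single input feature, depth-one regression trees (stumps), no shrinkage, and invented sample data.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: choose the split
    threshold that minimizes the squared error of the two leaf means."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both leaves non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gbdt_fit(xs, ys, k):
    """Fit k stumps in sequence, each on the residuals of the sum so far."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(k):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + stump(x) for p, x in zip(preds, xs)]
    return trees

def gbdt_predict(trees, x):
    """Eq. (7): the prediction is the sum of the outputs of all trees."""
    return sum(f(x) for f in trees)

def training_mse(trees, xs, ys):
    return sum((y - gbdt_predict(trees, x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Hypothetical one-feature data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.10, 0.20, 0.15, 0.60, 0.70, 0.65]
trees = gbdt_fit(xs, ys, k=5)
mse_one_tree = training_mse(trees[:1], xs, ys)
mse_five_trees = training_mse(trees, xs, ys)
```

Because every stump is fitted to the current residuals, the training error cannot increase as trees are added; practical implementations add shrinkage and regularization to control overfitting.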

The basic idea of XGBoost (eXtreme Gradient Boosting) [49] is the same as that of GBDT, but with some optimizations, such as second-order derivatives to approximate the loss function more accurately, regularization terms to avoid overfitting of the trees, and block storage to allow parallel computation. When making predictions, the base regressor combination method of XGBoost is exactly the same as that of GBDT, as shown in Eq. (7).

Combined strategies for the joint multi-model machine learning prediction method

In this paper, we propose a joint multi-model prediction method to predict the ship stability failure probability. The method is inspired by the idea of co-training in semi-supervised learning [50,51,52,53,54]. When multiple models are used to predict the same failure mode, the models with high prediction confidence are combined, and the models with low prediction confidence are discarded. The key question is how to judge the confidence of a model when it forecasts the stability failure probability of a ship that is not in the training set. Co-training in semi-supervised learning considers two learners trained on the same training set. A set of features from the test set is input into learner 1, and the prediction of learner 1 is used as the label and added to the training set of learner 2. If retraining learner 2 on the newly constructed training set improves its forecasting accuracy, then the prediction confidence of learner 1 is considered high for this test sample. The joint multi-model prediction method proposed in this paper follows this idea.

T different regression models are obtained by applying different algorithms to the same training set L. Although the training sets of the different models are identical, the training set of each model is still denoted as \(L_{i}\) for convenience of presentation. The T regressors are used to predict a test sample \(x_{i}\), and the prediction result of each regressor \(y_{i}\) is denoted as \(y_{i}(x_{i})\). The pair \(\{x_{i}, y_{i}(x_{i})\}\) is added to the training set of every regressor except \(y_{i}\), each of these models is retrained, and the mean square error \(MSE_{i}\) of each model is obtained using cross-validation. Fivefold cross-validation is used in this paper: each time, four folds are used as the training set and one fold as the validation set, and \(\{x_{i}, y_{i}(x_{i})\}\) is always added to the training folds. The mean square errors of all folds are averaged to give \(MSE_{i}\). Note that, for different models, this value may either increase or decrease after retraining, so the mean square errors of all models are averaged to obtain \(\overline{MSE}\):

$$\begin{aligned} \overline{MSE} = \frac{1}{T}\sum _{t=1, t \ne i}^{T}MSE_{t} = \frac{1}{T}\sum _{t=1,t \ne i}^{T}(h_{t}-y_{t}(x_{t}))^{2} \end{aligned}$$
(8)

where T is the number of regressors, \(h_{t}\) is the expected value, \(y_{t}()\) is the tth regressor, and \(x_{t}\) is the feature vector of the test sample. After obtaining \(\overline{MSE}\), the prediction confidence of each model can be calculated as follows:

$$\begin{aligned} Confidence = \frac{\overline{MSE}_{before} - \overline{MSE}_{after}}{\overline{MSE}_{before}} \end{aligned}$$
(9)

where \(\overline{MSE}_{before}\) is the average mean square error before the models are retrained and \(\overline{MSE}_{after}\) is the average mean square error after the models are retrained. After obtaining the prediction confidence of each model, the results of the Z models with the highest confidence are selected and averaged as the prediction result of the joint prediction model. In this paper, the simple averaging shown in Eq. (10) is used for the combination, which is one of the points that can be improved in the future:

$$\begin{aligned} y_{joint} = \frac{1}{Z} \sum _{z=1}^{Z} y_z(x) \end{aligned}$$
(10)

To verify the effectiveness of this method, it is compared on the test set not only with the eight individual machine learning prediction models, but also with a simple averaging method, which averages the results of all models.
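
The confidence computation of Eqs. (8) and (9) and the top-Z averaging of Eq. (10) can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the paper's code: the three base learners (mean, nearest-neighbour and straight-line fit), the one-feature data set and all function names are hypothetical stand-ins for the eight models, and the average over the other models divides by T − 1 instead of T, which leaves the ratio in Eq. (9) unchanged.

```python
def fit_mean(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_nearest(train):
    return lambda x: min(train, key=lambda s: abs(s[0] - x))[1]

def fit_line(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)

def cv_mse(fit, data, extra=None, folds=5):
    """k-fold cross-validated MSE. If given, the pseudo-labelled sample
    `extra` is added to the training part of every fold."""
    errors = []
    for k in range(folds):
        valid = data[k::folds]
        train = [s for i, s in enumerate(data) if i % folds != k]
        if extra is not None:
            train = train + [extra]
        model = fit(train)
        errors.append(sum((y - model(x)) ** 2 for x, y in valid) / len(valid))
    return sum(errors) / len(errors)

def confidences(fits, data, x_new):
    """Eq. (9) for each model: relative drop in the average CV error of
    the OTHER models after adding this model's prediction as a label."""
    preds = [fit(data)(x_new) for fit in fits]
    conf = []
    for i, y_i in enumerate(preds):
        others = [f for j, f in enumerate(fits) if j != i]
        before = sum(cv_mse(f, data) for f in others) / len(others)
        after = sum(cv_mse(f, data, extra=(x_new, y_i))
                    for f in others) / len(others)
        conf.append((before - after) / before)  # assumes before > 0
    return preds, conf

def joint_predict(fits, data, x_new, z):
    """Eq. (10): average the predictions of the z most confident models."""
    preds, conf = confidences(fits, data, x_new)
    top = sorted(range(len(preds)), key=lambda i: conf[i], reverse=True)[:z]
    return sum(preds[i] for i in top) / z

# Hypothetical one-feature training set and test sample.
data = [(1.0, 0.021), (2.0, 0.043), (3.0, 0.058), (4.0, 0.082),
        (5.0, 0.097), (6.0, 0.123), (7.0, 0.139), (8.0, 0.162),
        (9.0, 0.178), (10.0, 0.203)]
fits = [fit_mean, fit_nearest, fit_line]
preds, conf = confidences(fits, data, x_new=5.5)
y_joint = joint_predict(fits, data, x_new=5.5, z=2)
```

A positive confidence means that the pseudo-labelled sample reduced the cross-validated error of the other models; a negative value means it made them worse, so that model is less likely to be among the Z retained for this test sample.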

The flow chart of the joint multi-model prediction method for ship stability failure probability is shown in Fig. 5.

Fig. 5
figure 5

Flow chart of the joint multi-model prediction method

The pseudocode for this method is shown in Algorithm 1:

Algorithm 1
figure a

Combined algorithm training process

Other details

In this section, we introduce the search strategy for the hyperparameters of each model. In addition, we describe the input features screening method based on sensitivity analysis in detail.

Models hyperparameters searching strategy

The hyperparameters of a machine learning model affect its performance and therefore usually require manual or automatic tuning to ensure optimal model performance. In this paper, we adopt the idea of grid search and cross-validation for the hyperparameter search. When tuning a model, the hyperparameters to be adjusted are randomly combined within a given range. The training set is then divided into five parts; each time, four parts are used as the training set and one part as the validation set, so that each part serves as the validation set exactly once. The performance of the model under the current hyperparameters is determined from all the validation sets. Finally, the model with the best performance is used to predict the stability failure probability of the ship. The hyperparameters of each model and their search ranges are shown in Table 2.
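
The search procedure described above can be sketched in Python as follows. This is an illustrative example only: the k-nearest-neighbour regressor, its two hyperparameters, the grid values and the data are hypothetical and are not the models or ranges of Table 2. Candidate combinations are drawn at random from the grid, each is scored by fivefold cross-validated MSE, and the combination with the lowest score is kept.

```python
import itertools
import random

def knn_fit(train, k, weighted):
    """Hypothetical model to tune: k-nearest-neighbour regression
    on a single feature, optionally distance-weighted."""
    def predict(x):
        nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
        if weighted:
            w = [1.0 / (abs(s[0] - x) + 1e-9) for s in nearest]
            return sum(wi * s[1] for wi, s in zip(w, nearest)) / sum(w)
        return sum(s[1] for s in nearest) / len(nearest)
    return predict

def cv_mse(data, params, folds=5):
    """Fivefold cross-validation: every part is the validation set once."""
    errors = []
    for k in range(folds):
        valid = data[k::folds]
        train = [s for i, s in enumerate(data) if i % folds != k]
        model = knn_fit(train, **params)
        errors.append(sum((y - model(x)) ** 2 for x, y in valid) / len(valid))
    return sum(errors) / len(errors)

def search(data, grid, n_candidates, seed=0):
    """Score randomly chosen hyperparameter combinations and
    return (best CV MSE, best parameters)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    combos = [dict(zip(grid, values))
              for values in itertools.product(*grid.values())]
    candidates = rng.sample(combos, min(n_candidates, len(combos)))
    scored = [(cv_mse(data, p), p) for p in candidates]
    return min(scored, key=lambda t: t[0])

# Hypothetical data and search range.
data = [(1.0, 0.021), (2.0, 0.043), (3.0, 0.058), (4.0, 0.082),
        (5.0, 0.097), (6.0, 0.123), (7.0, 0.139), (8.0, 0.162),
        (9.0, 0.178), (10.0, 0.203)]
grid = {"k": [1, 2, 3], "weighted": [False, True]}
best_mse, best_params = search(data, grid, n_candidates=4)
```

Setting `n_candidates` to the total number of combinations turns the random sampling into an exhaustive grid search.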

Table 2 Hyperparameters of each model and the search range
Fig. 6
figure 6

Training process of different prediction models

The model evaluation criterion chosen for the hyperparameter search in this paper is the Mean Square Error. The complete model training, hyperparameter search and testing process is shown in Fig. 6. All algorithms and models in this paper are implemented in Python, and random seeds are fixed for easy reproduction.

Selection of ship features for ship stability failure probability prediction

The input features of a machine learning prediction method are generally determined according to the needs of the researchers. Too few input features may prevent accurate predictions, while too many lead to long training times. Therefore, to determine appropriate input features for ship stability prediction, it is necessary to study the influence of ship parameters on each failure mode in the second-generation intact stability criteria, and to select the ship parameters with an obvious influence as the input features of each prediction model. The ship parameters studied in this paper are as follows: the principal dimensions of the ship, namely length, breadth, moulded depth and draught; the ship form parameters, namely block coefficient, waterline coefficient and mid-ship section coefficient; and the load parameters, namely the height of the center of gravity. This feature selection method filters out irrelevant features, alleviates the dimensionality explosion problem and reduces the difficulty of the machine learning task.

For the excessive acceleration failure mode (results shown in Fig. 7), changes in moulded depth have little effect on the calculated long-term failure probability. Therefore, in this study, ship length (L), ship breadth (B), draught (\(T_{m}\)), height of center of gravity (\(Z_{g}\)), block coefficient (\(C_{b}\)), waterline coefficient (\(C_{w}\)), mid-ship section coefficient (\(C_{m}\)) and longitudinal position of the center of buoyancy (\(X_{b}\)) are selected as the input features of each prediction model to predict the probability of this failure mode.

Fig. 7
figure 7

Influence of ship parameters change on long-term failure probability of excessive acceleration

For the dead ship stability failure mode (results shown in Fig. 8), in addition to the above-mentioned ship parameters, the effect of the flooding angle on the capsizing probability is also taken into consideration. It is found that changes in the longitudinal position of the center of buoyancy (\(X_{b}\)) have almost no effect on the calculation results. Therefore, in this study, ship length (L), ship breadth (B), moulded depth (D), draught (\(T_{m}\)), height of center of gravity (\(Z_{g}\)), block coefficient (\(C_{b}\)), waterline coefficient (\(C_{w}\)), mid-ship section coefficient (\(C_{m}\)) and flooding angle (\(\varphi _{f}\)) are selected as the input features of each prediction model to predict the probability of this failure mode.

Fig. 8
figure 8

Influence of ship parameters change on dead ship stability failure probability

The ship features used by the machine learning models for each failure mode are shown in Table 3.

Table 3 Input characteristics of each failure mode

The symbols in Table 3 include: ship length (L), ship breadth (B), moulded depth (D), draught (\(T_{m}\)), the height of center of gravity (\(Z_{g}\)), block coefficient (\(C_{b}\)), waterline coefficient (\(C_{w}\)), mid-ship section coefficient (\(C_{m}\)), flooding angle (\(\varphi _{f}\)), and longitudinal position of the center of buoyancy (\(X_{b}\)).

Experiments

In this section, a series of experiments are executed to evaluate the performance of the proposed model. A fair comparison with other methods is conducted to validate the efficacy of our approach.

Stability failure probability of sample ships

In this study, a series of ships are selected as research objects. The range of each ship feature in the ship sample set is shown in Table 4.

Table 4 Range of features of sample ships

The ship stability failure probability is the probability of a failure mode occurring when a ship sails under a range of sea states and wind and wave conditions. Each sea state corresponds to a short-term failure probability, and the long-term failure probability is a weighted average of these short-term probabilities:

$$\begin{aligned} C=\sum _{i=1}^{N}W_{i}C_{i}/\sum _{i=1}^{N}W_{i} \end{aligned}$$
(11)

where C is the long-term failure probability; \(W_{i}\) is the weight coefficient under the different sea states which is obtained from North Atlantic Wave Scattering Map shown in Table 5, \(C_{i}\) is the short-term failure probability.
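
Eq. (11) is a weighted average, which the following short Python sketch makes concrete. The three sea-state weights and short-term probabilities are invented for illustration and are not taken from Table 5.

```python
def long_term_probability(weights, short_term):
    """Eq. (11): weighted average of the short-term failure
    probabilities C_i with sea-state weights W_i."""
    if len(weights) != len(short_term):
        raise ValueError("one weight is required per sea state")
    weighted_sum = sum(w * c for w, c in zip(weights, short_term))
    return weighted_sum / sum(weights)

# Hypothetical example with three sea states (values are assumptions).
weights = [10.0, 30.0, 60.0]      # occurrence counts W_i
short_term = [0.0, 0.01, 0.05]    # short-term probabilities C_i
C = long_term_probability(weights, short_term)
```

In this example the result is (0 + 0.3 + 3.0) / 100 = 0.033; the most frequent sea state dominates the long-term probability.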

Table 5 Table of North Atlantic Wave Scattering Map [55]

The long-term failure probabilities of each failure mode corresponding to the selected ship samples in this paper are shown in Figs. 9 and 10. It should be noted that the number of sample vessels for the study of each failure mode varies in this paper, with 116 vessels for Dead ship stability and 94 vessels for Excessive acceleration. The images show the failure probabilities of the training samples only. To verify the generalization ability of the algorithm, 8 vessels were randomly selected as test samples for each failure mode.

Fig. 9
figure 9

Dead ship stability failure probability of different ships

Fig. 10
figure 10

Excessive acceleration stability failure probability of different ships

Evaluation metrics of prediction models

In this paper, the performance of the model is evaluated using the Mean Squared Error (MSE), the Mean Absolute Percentage Error (MAPE) and the R-squared (\(R^{2}\)). This is because some models focus on reducing the bias and some focus on reducing the mean squared error, so it is necessary to consider the accuracy of the models from various aspects. The formulas for calculating the three evaluation metrics are as follows:

$$\begin{aligned} MSE=\frac{1}{N}\sum _{i=1}^{N}(f_{i}-y_{i})^{2} \end{aligned}$$
(12)

$$\begin{aligned} MAPE=\frac{1}{N}\sum _{i=1}^{N}\frac{|f_{i}-y_{i}|}{y_{i}} \end{aligned}$$
(13)

$$\begin{aligned} R^{2}=1-\frac{\sum _{i=1}^{N}(f_{i}-y_{i})^{2}}{\sum _{i=1}^{N}(\hat{y}_{i}-y_{i})^{2}} \end{aligned}$$
(14)

where N is the number of predicted samples, \(f_{i}\) is the expected value, \(y_{i}\) is the predicted value, and \(\hat{y}\) is the average of the expected value.
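
The three metrics can be computed with the short Python sketch below. It is an illustration under an explicit assumption: the functions follow the conventional definitions, in which MAPE is normalized by the expected (reference) value and the denominator of \(R^{2}\) measures the spread of the expected values about their mean. The sample values are invented.

```python
def mse(expected, predicted):
    """Mean squared error between expected and predicted values."""
    n = len(expected)
    return sum((f - y) ** 2 for f, y in zip(expected, predicted)) / n

def mape(expected, predicted):
    """Mean absolute percentage error, as a fraction. Assumption:
    errors are normalized by the expected value (must be non-zero)."""
    n = len(expected)
    return sum(abs(f - y) / abs(f) for f, y in zip(expected, predicted)) / n

def r_squared(expected, predicted):
    """Coefficient of determination. Assumption: the denominator is the
    spread of the expected values about their mean."""
    mean_f = sum(expected) / len(expected)
    ss_res = sum((f - y) ** 2 for f, y in zip(expected, predicted))
    ss_tot = sum((f - mean_f) ** 2 for f in expected)
    return 1.0 - ss_res / ss_tot

# Hypothetical failure probabilities for three test ships.
expected = [0.10, 0.20, 0.40]
predicted = [0.11, 0.18, 0.40]
scores = (mse(expected, predicted),
          mape(expected, predicted),
          r_squared(expected, predicted))
```

MSE is dominated by the largest absolute errors, whereas MAPE weights each ship by its relative error, which is why a model can rank well on one metric and poorly on the other.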

The dead ship stability failure probability prediction results

When using machine learning models, it is often necessary to preprocess the input features, and different pre-processing methods can affect the performance of the model. So a total of three data pre-processing approaches are tried to process the input ship features, including standard scaling, max absolute scaling and normalizing.
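
The three pre-processing approaches can be illustrated with the following minimal Python sketch. It assumes the common definitions of these operations: standard scaling and max absolute scaling act on each feature (column), while normalizing rescales each sample (row) to unit Euclidean length. Whether these match the exact variants used in this paper is an assumption, and the feature values are invented.

```python
import math

def standard_scale(rows):
    """Per feature: subtract the mean and divide by the standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]

def max_abs_scale(rows):
    """Per feature: divide by the largest absolute value."""
    cols = list(zip(*rows))
    scales = [max(abs(v) for v in c) or 1.0 for c in cols]
    return [[v / s for v, s in zip(row, scales)] for row in rows]

def normalize_rows(rows):
    """Per sample: divide the feature vector by its Euclidean norm."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out

# Hypothetical (length, breadth) values for three ships, in metres.
rows = [[100.0, 20.0], [200.0, 30.0], [300.0, 40.0]]
standardized = standard_scale(rows)
max_abs_scaled = max_abs_scale(rows)
normalized = normalize_rows(rows)
```

Note that row-wise normalizing discards the overall magnitude of each sample, so two ships with the same proportions but different sizes map to the same feature vector.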

Table 6 shows the MSE, MAPE and \(R^{2}\) of different ship stability prediction models on the test set of Dead ship stability failure probability. For this failure mode, the hyperparameter Z of the joint prediction model is set to 3. Figure 11 shows the calculated results of the three performance criteria for all models with different ship features processing methods. According to the calculation results of the accuracy criterion of each model, the following conclusions are obtained.

Table 6 Prediction result of different models on Dead ship stability failure probability
Fig. 11

Calculated results of the three performance criteria for each model

  • The choice of input feature processing affects the accuracy of the models, especially RBFNN. Normalizing the input features degrades the performance of all models on every metric.

  • Simply averaging the prediction results of all models can reduce MSE and improve \(R^{2}\) to some extent, but MAPE is not reduced, i.e., it can reduce the prediction variance but not effectively reduce the bias.

  • The prediction results show that a model with a lower MSE does not necessarily have a lower MAPE. Without preprocessing, GBDT has the smallest MSE, but its MAPE is more than 10%, higher than that of the joint prediction model. After max absolute scaling, the MSE of the averaging method is the smallest and similar to that of the joint prediction model, but its MAPE is twice as large as that of the joint method. After normalization, the MSE of the joint method is the smallest, but its MAPE is more than 20%. After standard scaling of the input features, Bagging has the smallest MAPE, but its MSE is still higher than that of some other models.

  • After standard scaling of the input features, the joint method has the smallest MSE and the largest \(R^{2}\) among all models under all data processing methods, while its MAPE is only 6.2%.
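The observation that a lower MSE does not imply a lower MAPE can be illustrated with a small worked example. The numbers below are hypothetical and chosen only to show the effect: because the MAPE divides each error by the expected value, small absolute errors on small failure probabilities dominate the MAPE while contributing almost nothing to the MSE.

```python
def mse(expected, predicted):
    """Mean squared error."""
    return sum((p - e) ** 2 for e, p in zip(expected, predicted)) / len(expected)


def mape(expected, predicted):
    """Mean absolute percentage error as a fraction (expected values must be positive)."""
    return sum(abs(p - e) / e for e, p in zip(expected, predicted)) / len(expected)


# Hypothetical expected failure probabilities for three ships.
expected = [0.001, 0.002, 0.1]

# Hypothetical model A: small absolute errors, but on the small probabilities.
model_a = [0.002, 0.003, 0.1]
# Hypothetical model B: one larger absolute error, but on the large probability.
model_b = [0.001, 0.002, 0.105]

for name, predicted in (("A", model_a), ("B", model_b)):
    print(name, mse(expected, predicted), mape(expected, predicted))
```

Here model A has the lower MSE but the higher MAPE, which is why the two metrics are read together in the comparison above.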

Among all the models, only RF, Bagging and the joint model have a MAPE of less than 10%, while the MSE of Bagging does not reach the level of the averaging method. The Dead ship stability failure probabilities predicted on the test set by RF, Bagging-Tree and the joint model are shown in Figs. 12 and 13. The figures show that, for each prediction sample, the prediction results of the joint model are closer to the expected value.

Fig. 12

Prediction results of the Dead ship stability failure probability by RF and joint model

Fig. 13

Prediction results of the Dead ship stability failure probability by Bagging-Tree and joint model

The excessive acceleration failure probability prediction results

The results of excessive acceleration failure probability prediction of each model for the ships in test set under different data processing methods are shown in Table 7. For this failure mode, the hyperparameter Z of the joint model is also set to 3. Figure 14 shows the calculated results of the three performance criteria for all models with different ship features processing methods. The following conclusions can be drawn from the prediction results.

Table 7 Prediction result of different models on Excessive Acceleration failure probability

  • As with the Dead ship failure mode, the way the ship features are preprocessed has a strong influence on the neural network models. RBFNN and BPNN perform better when the ship features are not processed. Ensemble models based on the Bagging idea improve their Excessive Acceleration failure probability prediction performance after either standard scaling or max absolute scaling of the ship features. Ensemble models based on the Boosting idea are generally unaffected, except by normalization. For the joint prediction model proposed in this paper, all preprocessing methods can improve the prediction accuracy.

  • Simply averaging the prediction results of all models only yields an accuracy between that of the best and the worst model, and therefore does not improve the prediction.

  • RBFNN has the smallest MSE when the ship features are not processed, but its MAPE is higher than 10%. After max absolute scaling of the features, using the average of all models as the final prediction achieves the smallest MSE, but its MAPE is still 10.23%. Only after normalizing the features is the prediction accuracy of RF comparable to that of the joint model, with an MSE of 1.14E−07 and a MAPE of 4.87%.

  • After standard scaling of the input features, the joint method has, among all models under all data processing methods, the smallest MSE (1.05E−07), the largest \(R^{2}\) (0.991832) and the lowest MAPE (4.7%).

In all cases, the joint prediction model not only reduces the variance, but also significantly reduces the bias of the prediction. Figure 15 compares, on the test set, the Excessive Acceleration failure probability predicted by RF, which has the highest accuracy among the 8 models, with that predicted by the joint prediction model. It can be seen that even with repeated tuning of the hyperparameters and adjustment of the ship feature processing, the best-performing individual model still produces some predictions that differ significantly from the expected values.

Fig. 14

Calculated results of the three performance criteria for each model

Fig. 15

Prediction results of the excessive acceleration failure probability by RF and joint model

Taking all the prediction results and the error comparisons together, the joint multi-model ship stability prediction method proposed in this paper does not combine component learners blindly, but selects the top k component learners with high confidence in the current prediction results and combines only those. The method is therefore less sensitive than the other models to the choice of input ship feature processing method. Moreover, the method does not improve one error measure at the expense of the other, i.e., it does not leave either the bias or the variance higher than that of the remaining models, but reduces both the bias and the variance together.
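The difference between blind averaging and confidence-based selection can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: the confidence of each learner is taken as a given number per sample (how it is computed is not reproduced here), the selected learners are combined by an unweighted mean, and k = 3 as well as all numerical values are hypothetical.

```python
def average_all(predictions):
    """Blind combination: unweighted mean over every component learner."""
    return sum(predictions.values()) / len(predictions)


def combine_top_k(predictions, confidences, k=3):
    """Keep the k learners with the highest confidence for this sample
    and return their unweighted mean together with the selected names.

    Assumption: a larger confidence value means a more trusted prediction.
    """
    ranked = sorted(predictions, key=lambda name: confidences[name], reverse=True)
    selected = ranked[:k]
    combined = sum(predictions[name] for name in selected) / len(selected)
    return combined, selected


# Hypothetical predictions of one failure probability by five learners.
predictions = {"RF": 0.0102, "Bagging-Tree": 0.0098, "GBDT": 0.0100,
               "RBFNN": 0.0300, "BPNN": 0.0030}
# Hypothetical per-sample confidences of the same learners.
confidences = {"RF": 0.90, "Bagging-Tree": 0.85, "GBDT": 0.80,
               "RBFNN": 0.20, "BPNN": 0.30}

blind = average_all(predictions)
joint, used = combine_top_k(predictions, confidences, k=3)
print(blind, joint, used)
```

In this toy case the two low-confidence learners pull the blind average away from the cluster formed by the high-confidence learners, while the top-k combination ignores them.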

Conclusion

This paper uses eight machine learning models to predict two stability failure probabilities, Dead ship stability and Excessive acceleration, of fishing vessels and fishery administration vessels, and on this basis proposes a joint multi-model stability prediction model based on confidence level. The method is inspired by the idea of co-training in semi-supervised learning and the idea of heterogeneous ensembles in ensemble learning. By calculating the confidence of each model in its current prediction results, the component learners with high confidence are selected for combination and the low-confidence learners are discarded.

The models studied include the artificial neural network models RBFNN and BPNN, the Ensemble models Bagging-tree, RF and Extra-tree based on the idea of Bagging, and Adaboost, GBDT and XGBoost based on the idea of Boosting. The optimal hyperparameters of each model are obtained through grid search. The ship features used as model input are obtained by sensitivity analysis. This paper also investigates the effect of different ship feature processing methods on the different models. Comparing the prediction results of each model on the test set, the following conclusions can be drawn:

  • Regardless of the prediction of the failure probability for either failure mode, the data processing of the ship characteristics can have a significant impact on RBFNN as well as BPNN. These two models are more suitable for prediction without any processing.

  • RF, GBDT and Bagging-Tree have higher accuracy than the other Ensemble models when predicting the ship’s Dead ship stability failure probability. All of the Ensemble models have good accuracy in predicting the ship’s Excessive acceleration failure probability. The Ensemble models are generally not affected by the data processing method applied to the input ship features.

  • Simply averaging the results of multiple component learners does not necessarily achieve the expected improvement in accuracy; instead, it is sometimes very susceptible to poor-quality learners, which raises the MAPE. In particular, the MSE of the averaging method is only moderate when predicting the Dead ship stability failure probability, while the MAPE is higher than 10% regardless of the data processing strategy used.

  • The joint multi-model prediction method proposed in this paper has higher accuracy than the other models in predicting both stability failure modes. In predicting the ship’s Dead ship stability failure probability, it has the lowest mean squared error, the highest R-squared and the second-lowest MAPE, after Bagging-Tree. In predicting the ship’s excessive acceleration failure probability, it achieves the lowest MAPE together with the lowest mean squared error and the highest R-squared.

The confidence-based joint multi-model ship stability prediction method can maintain a small bias while also having a small mean squared error. In future research, the computational efficiency can be improved by expanding the data set so that the cross-validation process for each confidence level calculation can be discarded. More combinations of component learners can also be tried to further improve the accuracy of the model. In addition, expanding the dataset with more scales and types of ships would give the method a wider range of applications. Further research may also be conducted from another perspective by treating the prediction of ship stability failure probability as a time-series problem. Based on this idea, models applicable to time-series problems, such as the recurrent neural network (RNN) and long short-term memory (LSTM) [56], may be used, which has the potential to improve the prediction accuracy.