1 Introduction

Air emissions are among the top environmental issues for port authorities (PAs) (COGEA 2017) and citizens (Ortiz et al. 2019), contributing to climate change and causing direct negative health impacts for approximately 230 million people in the vicinity of the top 100 ports (Merk 2014). Shipping in particular can contribute the largest share of port emissions (Merk 2014). Yet, less than half of the world’s top 49 container ports provide emissions reporting (Cammin et al. 2022) and thus lack transparency concerning the performance of emissions abatement measures on the way to becoming greener ports. PAs, being the regulators of ports (Notteboom et al. 2021), justify the lack of emissions inventories (EIs) with the economic effort they generate, as shown in case studies (Cammin et al. 2020). The motivation to create EIs, based on, e.g., political, regulatory or stakeholder requirements, may not be sufficient to justify this effort, for instance, for obtaining and further processing data, or for engaging or integrating external (data) service providers. The vested interest to minimize data costs conforms with our observations in practice and academia. The shared and accepted solution is the use of assumptions for unavailable data in EI methodologies.

Although this practice is accepted, it increases the imprecision of EIs. Therefore, it seems advisable to carefully scrutinize the capability of methodologies, including the one proposed in this study, to determine the performance of emissions abatement measures.

An example of vessel-characteristics and activity-based assumptions is the use of average engine power and average speed per vessel type, respectively. Among the works that surmount the unavailability of data with assumptions are activity-based bottom-up methodologies. For instance, POLA (2019) and Ekmekçioğlu et al. (2021) adapt the EPA and ENTEC methodologies, respectively, i.e., they use assumptions by relying on historic data on average speeds in certain areas. Likewise, academics suffer from the unavailability of data and try to find a trade-off between accuracy and costs, which is reasonable specifically for multi-port EIs. For instance, Wan et al. (2020) use multi-port Automatic Identification System (AIS) data and estimate engine powers with a linear regression based on deadweight tonnage (DWT), and Merk (2014) extrapolates data from one month to a year for budgetary reasons to create multi-port vessel EIs. As the necessary data scales with the number of ports assessed, less data-demanding methodologies can foster the (successive) creation of port-related vessel EIs.

To this end, the objective of this paper is the development of tiered prediction models to create port-related container ship EIs while taking data requirements into account. In this sense, lower-tier models rely on fewer features, potentially resolving data dependencies as well as reducing costs to adhere to budgetary constraints. A range of alternative models can then serve as input to a trade-off analysis, which has not received much attention in previous studies that focus on the prediction performance.

In this paper, the development of prediction models is based on (1) data from the one-off application of an activity-based bottom-up methodology that incorporates detailed vessel-characteristics data and (2) the use of vessel-characteristics data, both as input for (3) different machine-learning algorithms. While aiming to foster the creation of EIs in order to promote green ports, we also point out limitations and pitfalls of this approach.

The remainder of this paper is organized as follows. Section 2 reviews the application of machine learning for fuel consumption and emissions prediction of vessels. Next, Sect. 3 describes the methodology of the proposed approach and Sect. 4 addresses the numerical experiments. Section 5 discusses the key results and limitations of the presented approach. Finally, Sect. 6 concludes this study and outlines limitations and potential future research.

2 Literature review

Academia has taken advantage of data-driven models and machine learning for a variety of problems in fuel consumption prediction. Solving this problem is an intermediate step towards emissions calculation: the necessary next steps are determining the fuel type used and applying the corresponding emissions factors, taking into account exhaust gas cleaning systems such as scrubbers, which may be installed on vessels or barge-mounted to serve vessels at berth. Regarding the prediction of vessel emissions and fuel consumption, the literature can be classified by its focus on different machine-learning algorithms (e.g., artificial neural network (ANN), multiple linear regression (MLR), support vector regression (SVR), least absolute shrinkage and selection operator (LASSO), random forest (RF), nonlinear regression (NLR)), the focus on different ship types (e.g., container or cruise ships), the scale of application (e.g., single to multiple vessels), and the feature set. Notably, many of the works that employ vessel-specific and voyage-specific data, e.g., shaft revolutions per minute (RPM) from on-board sensors or cargo weight, do so for a single vessel. From the perspective of PAs or public institutions, it is questionable whether such data coverage of all vessels operating in a port is feasible, as it would generate high effort, whereas obtaining vessel-characteristics data from the usual maritime data providers is a common choice in literature and practice. Many of the works employ noon report data for the features and the response.

The examined literature is categorized into three groups. The first two groups refer to shipping-related literature addressing the prediction of fuel consumption and of emissions; a summary is presented in Table 1. The third group refers to shipping-unrelated references concerning the prediction of emissions, which is summarized in Table 2. Due to the narrower domain, we found fewer shipping-related references concerning emissions, for which more insights are provided in the following.

First, Ekmekçioğlu et al. (2021) apply regression analysis using the feature gross tonnage (GT) and the response emissions for container ships in the activity modes of cruising, maneuvering, and berthing at four ports in Turkey. Beforehand, the emissions are calculated using the ENTEC activity-based bottom-up method. The study generalizes the times for the cruising and maneuvering modes, but appears to use actual berthing times for the calculation. Notably, times are not used as a feature for training. The model’s performance is reported using the coefficient of determination (\(\hbox {R}^2\)), with 0.93 for the cruising mode, 0.92 for the maneuvering mode and 0.80 for the berthing mode.

The weaker performance for the berthing mode indicates that GT does not sufficiently correlate with the berthing time, although a loose correlation is reasonable, i.e., larger vessels potentially berth longer due to longer overall container handling operations. High variations of berthing times for similar-sized vessels are adverse for the performance of the model. This variation could stem from inconsistent container volumes handled across similar-sized vessels (see Cullinane et al. (2006)) or from disturbances in terminal operations such as the breakdown of quay cranes (Nourmohammadzadeh and Voß 2022). If the effort to obtain the berthing time is acceptable, a regression model including time and GT could improve the performance for the berthing mode. Concerning cruising and maneuvering times, including the time as a feature would strengthen the model’s performance in ports with different sailing routes, and thus different sailing durations, within the geographic domain of the EI.

Second, Fabregat et al. (2021) use machine-learning algorithms to predict the hourly local pollutant concentration and estimate the impact of cruise ships’ activities on the air quality of the port of Barcelona. To this end, twenty-five features and the target from eight stations are employed. The application of six algorithms shows that the day of the year and the traffic intensity are the two most important features. The gradient boosting machine (GBM) has the best prediction performance, with an \(\hbox {R}^2\) of 0.80 and a root mean squared error (RMSE) of 14%.

Third, Fletcher et al. (2018) use regression algorithms to predict the emissions concentration of two cargo ships’ main engines and auxiliary engines in the activity modes of cruising, maneuvering, and berthing. The selected features are engine power and shaft speed. The results show that all five regression algorithms have a similar prediction performance for the cruising mode, while using supervised mixture probabilistic latent factor regression has the best prediction performance for the maneuvering and berthing modes.

Fourth, Schaub et al. (2019) apply an ANN to simulate the time series of the particulate matter (PM) concentration during transient engine operation. The input features include the values of RPM and fuel consumption per time unit, while the prediction target is the PM concentration per time unit. The ANN simulation results are overall consistent with the measurements, although many small-amplitude oscillations occur and the peak values of the PM concentration are underestimated.

Another recent review of fuel consumption prediction (which also addresses optimization) is carried out by Yan et al. (2021a). In line with the findings by Yan et al. (2021b), this review indicates that machine learning has not been applied extensively for port vessel EIs, showing potential for its further exploration. Moreover, the application of machine-learning algorithms to predict emissions unrelated to shipping, as exemplified in Table 2, corroborates the idea of emissions prediction. It shows that machine-learning algorithms have recently received much attention for the prediction of emissions in various fields. In general, popular methods include but are not limited to ANN, MLR and SVR. It is observed that the literature focuses on improving the prediction performance of single models, whereas the present study examines the prediction performance of models under different levels of data scarcity.

Table 1 Summary of shipping-related literature concerning the prediction of fuel consumption and emissions
Table 2 Summary of shipping-unrelated literature concerning the prediction of emissions

3 Methodology

In this section, the methodology carried out is presented. Section 3.1 outlines the data acquisition and Sect. 3.2 presents the steps concerning data preprocessing. Section 3.3 presents and justifies the feature engineering. In Sect. 3.4, the prediction models are defined and the algorithmic exposition is provided in Sect. 3.5. The performance assessment metrics are outlined in Sect. 3.6. The results are shown in the succeeding Sect. 4. The overall framework employed is compiled in Fig. 1.

Fig. 1


3.1 Data acquisition

For this study, a three-month data set covering August to October 2019 from a large North American PA, residing in an emissions control area, is used. The data set includes the activity data per vessel and the generated emissions. For each vessel and call, the operational time in the different activity modes is provided. Moreover, the data set includes the volume of emitted emissions (disaggregated by gas type) by each engine type during those activity modes. The activity modes comprise transiting, maneuvering and berthing. The engines considered are the main engine, auxiliary engine and boiler. Finally, the gas types include \(\hbox {CO}_{2}\), \({\hbox {SO}_{\mathrm{x}}}\), \(\hbox {NO}_{\mathrm{x}}\) and \({\hbox {PM}_{2.5}}\). The emissions abatement measure shore-side electricity is not considered in the calculations; thus, the data set is considered a preliminary EI. The majority of vessels calling at this port are container ships and bulk carriers, which is reflected in the EI. Vessel-characteristics data, such as the power of the different engines, are employed; in this study, we focus on 104 container ships addressed in the preliminary EI for which we could acquire the data for all the features (see Sect. 3.3). These account for 119 vessel port calls. The EI methodology of the PA follows an activity-based bottom-up calculation approach with assumptions concerning the load factor in different areas of the geographic domain, and is based on AIS data gathered via the PA’s own terrestrial antennas, as well as satellite AIS and vessel-characteristics data from external maritime data providers.

3.2 Data preprocessing

Concerning activity data preprocessing, we differentiate between the common tasks of data smoothing and data transformation. Data smoothing is considered an essential step of data preparation to address missing data and suppress outliers through data denoising schemes, such as the Empirical Mode Decomposition (Chen et al. 2021; Zhao et al. 2022). This has been taken into account in the preparation of the provided preliminary EI; thus, data smoothing is not further implemented here.

The data set of the EI comprises different data tables that store the information for 119 port calls. These tables are transformed into a single table. The resulting fields are vessel identifier (IMO number), vessel port call, minutes, mode, engine, gas, and emissions in tons. In total, 3,808 samples are generated, based on the 119 vessel port calls and 32 combinations (purposes). Each sample thus refers to one of the 32 combinations arising from the three engines, three modes and four gas types, where the main engine does not emit during berthing. A combination expresses the specific purpose of a model (see Fig. 1). The gas type further specifies the target variable, tons of emissions. Furthermore, the data is merged with vessel-characteristics data fields based on the vessel identifier. The resulting data set is the input to the feature engineering described in the subsequent Sect. 3.3. Notably, a combination is used to filter the samples accordingly for model training; hence, Engine, Mode and Gas are considered technical constraint variables.
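To make this transformation concrete, the following sketch (with hypothetical column and category names, not the PA's actual schema) expands port calls into one sample per (engine, mode, gas) combination:

```python
import pandas as pd

# Hypothetical illustration: expand each port call into one sample per
# (engine, mode, gas) combination, skipping the main engine at berth,
# which yields the 32 combinations described above.
engines = ["Main", "Aux", "Boiler"]
modes = ["TRN", "MAN", "BER"]          # transiting, maneuvering, berthing
gases = ["CO2", "SOx", "NOx", "PM2.5"]

combos = [
    (e, m, g)
    for e in engines
    for m in modes
    for g in gases
    if not (e == "Main" and m == "BER")  # main engine does not emit at berth
]
assert len(combos) == 32

# Two fictitious port calls; real data would come from the EI tables.
calls = pd.DataFrame({"imo": [9400000, 9500000], "call": [1, 1]})

# Cross join: one row (sample) per call and combination.
combo_df = pd.DataFrame(combos, columns=["engine", "mode", "gas"])
samples = calls.merge(combo_df, how="cross")
print(len(samples))  # 2 calls x 32 combinations = 64 samples
```

With the paper's 119 port calls, the same cross join produces the reported 119 × 32 = 3,808 samples.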

3.3 Feature engineering

Feature engineering can be differentiated into feature generation and feature selection. As we do not merely aim to provide one best model but several best model alternatives for each tier, the feature selection for dimensionality reduction is limited to excluding redundant features (see Venkatesh and Anuradha (2019)).

The feature selection could be driven by known physical (causal) relationships from domain knowledge, or by statistical correlations (Karagiannidis and Themelis 2021; Lee and Lee 2021). The feature Min (minutes) relates to the operational runtime in the activity Mode and is suspected to have a major effect on a vessel’s fuel consumption. Activity times in all modes can vary per vessel and port; thus, Min could be beneficial in a port-independent prediction model. For instance, the terminal locations in the port EI geographic domain could require north- or south-bound transiting routes with different sailing distances, as found in this paper’s case port. The constraint variable Mode is included because it might reflect average load factors assumed in the activity-based bottom-up methodology. Moreover, the engine-power-related features TKW ME (main engine power in kW), THP ME (main engine power in hp), TKW AE (auxiliary engine power in kW), as well as AE Num (number of auxiliary engines) and DS (design speed) can be utilized. TKW ME and THP ME correlate by about 0.75 (compare Fig. 2); hence, TKW ME is selected arbitrarily for model training to reduce redundancy. The feature GT correlates with a vessel’s size, similar to DWT, which is used, e.g., in Wan et al. (2020) to estimate the engine power by regression. Hence, GT is regarded as a potential option to substitute the aforementioned engine-power-related features. Table 4 presents the correlation of the features. The strongest correlations can be observed between TKW ME and DS as well as between TKW ME and GT. In this study, the categorical feature IMOT (IMO Tier) is generated based on the vessel build year, relating to the nitrogen oxide (\(\hbox {NO}_{\mathrm{x}}\)) control requirements of MARPOL Annex VI (IMO 2019). The feature IMOT is label-encoded to reflect the ordinal dependency (see Stauder and Kühl (2021)).
Furthermore, we apply zero-mean standardization to normalize the feature data (Wang et al. 2018).
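As an illustration of these two steps, the following sketch label-encodes IMOT and standardizes the numeric features with scikit-learn; the feature values and the `tier_order` mapping are made-up stand-ins, not the study's actual data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ordinal label encoding of the IMO Tier feature (hypothetical mapping).
imot = np.array(["Tier I", "Tier II", "Tier III", "Tier II"])
tier_order = {"Tier I": 0, "Tier II": 1, "Tier III": 2}
imot_encoded = np.array([tier_order[t] for t in imot])

# Zero-mean standardization of numeric features (made-up values,
# e.g., TKW ME and GT).
numeric = np.array([[25000.0, 90000.0],
                    [45000.0, 140000.0],
                    [60000.0, 190000.0],
                    [30000.0, 110000.0]])

scaler = StandardScaler()                # zero mean, unit variance per column
scaled = scaler.fit_transform(numeric)
print(np.allclose(scaled.mean(axis=0), 0.0))  # True
```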

Table 3 Features and constraints
Table 4 Correlation of features

The variable VT (vessel type) could be used as a feature or as a constraint variable to filter out samples for training multiple models, each for a different vessel type. In this study, VT is used merely as a constraint variable and is fixed to container ships. Thereby, the trained models can respect vessel-type-specific methodologies and assumptions, such as operating patterns. However, in this study, no further categorization or voyage data is employed. Employing voyage data would be beneficial if the training data included voyage-related information, for instance, the number of powered reefer containers.

Although both GT and VT are in principle observable, options to obtain such data in bulk legally, conveniently, and reliably are most likely only accessible through paid services.Footnote 1

Given the unavailability of the exact EI methodology employed by the PA, statistical correlation serves as an indicator of the usefulness of the features. However, in this paper, all feature set combinations are explored by computations with different algorithms so that alternative models can be suggested. Table 3 provides an overview of the used features and constraint variables.

3.4 Prediction models

Two sets of feature sets are defined. The following notation is used:

\(F\):
Set of n features, \(F=\{f_1, \, f_2, \, \ldots , \, f_n\}\).

\(F_1\):
Set of feature sets based on select-k-best, \({ {F_1}=\{(f_{1}), \, (f_{1}, \, f_{2}), \, \ldots , \, (f_{1}, \, \ldots , \, f_{n})\} }\), where the elements of F are sorted in descending order of their mutual information.

\(F_2\):
Set of non-empty feature sets, \({{F_2}=\{{\hat{F}} \, |\, {\hat{F}}\subseteq F, \, {\hat{F}}\ne \emptyset \}}\).

\(F_i(e, m, g)\):
Models using features of set \(F_i\), constrained by engine e, mode m and gas g.

\(T_i\):
Tier of models, with i being the number of features used.

The impact assessments refer to the applied algorithms, the combinations (purposes) of the models, and the feature sets. For instance, \(F_1(Mai, MAN, CO_2)\) denotes models based on the feature sets of \(F_1\) with the purpose to predict \(\hbox {CO}_{2}\) emissions generated by the main engine when vessels maneuver (see, e.g., the use of this notation in Figs. 2 and 3).

The set \(F_1\) is defined based on the select-k-best method. The seven features are sorted in descending order of their mutual information score, which measures the statistical dependence between a feature and the target, as shown in Fig. 2. The mutual information is a non-negative value estimated via the Shannon entropy from k-nearest-neighbor distances (Kraskov et al. 2004; Ross 2014). If the mutual information equals zero, the two variables are independent; a higher value indicates a stronger dependency between the two variables. The features are then added cumulatively, resulting in seven feature sets, as shown in Table 5. The purpose of \(F_1\) is to exemplify a statistical approach based on models using \(F_1(Mai, MAN, CO_2)\). The results are presented in Sect. 4.1.
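The ranking behind \(F_1\) can be sketched with scikit-learn's k-nearest-neighbor mutual-information estimator; the data below is synthetic, with the columns standing in for three of the features:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in data: feature 0 carries most of the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

# kNN-based mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)
order = np.argsort(mi)[::-1]    # descending MI order, as used for F1
print(order[0])                 # 0: the dominant feature ranks first

# Cumulatively adding features in this order yields the F1 feature sets;
# SelectKBest picks the top-k features directly.
selector = SelectKBest(mutual_info_regression, k=2).fit(X, y)
```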

The set \(F_2\) is defined as all non-empty subsets of the seven features; thus, \(F_2\) contains \(2^7-1=127\) feature sets.Footnote 2 The purpose of \(F_2\) is fourfold:

  • The best-performing models for all purposes \(F_2(e, m, g)\) are identified to show the viability of the approach. The results are presented in Sect. 4.2.

  • The performance of the models is differentiated by tier and algorithm for selected e, m and g. This may show the benefits of using different algorithms as well as the performance improvements with higher tiers. The results are presented in Sect. 4.3.

  • The best models and alternatives for \(F_2(Mai, MAN, CO_2)\), including the used features, adopted algorithm, and hyperparameters are shown. By doing so, substituting important features to resolve data unavailability is exemplified. The results are presented in Sect. 4.4.

  • A comparison with the study of Ekmekçioğlu et al. (2021), introduced in Sect. 2, who trained regression models with GT for container ships, is conducted. We exemplify the effect of using the feature Min on the prediction performance for time-dependent targets. The results are determined in Sect. 4.5.
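The enumeration of \(F_2\) can be sketched as follows; the feature identifiers are abbreviated stand-ins for the features in Table 3:

```python
from itertools import combinations

# Abbreviated stand-ins for the seven features (hypothetical identifiers).
features = ["Min", "TKW_ME", "TKW_AE", "AE_Num", "DS", "GT", "IMOT"]

# All non-empty subsets of the seven features; tier T_i groups
# the subsets of size i.
feature_sets = [subset
                for k in range(1, len(features) + 1)
                for subset in combinations(features, k)]
print(len(feature_sets))  # 2^7 - 1 = 127 feature sets
```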

Fig. 2

Mutual information used for univariate feature selection based on \(F_1(Mai, MAN, CO_2)\)

Table 5 Constraints and features utilized for models based on \(F_1(Mai, MAN, CO_2)\)

3.5 Algorithms

This section briefly introduces the algorithms applied in this study. The selection of ANN, MLR and SVR is based on their widespread usage in the literature, as presented in Sect. 2; according to Yan et al. (2021a), ANN is by far the most frequently used algorithm for fuel consumption prediction. Besides ANN, MLR and SVR are also popular supervised learning algorithms for the prediction of fuel consumption. In contrast to ANN and SVR, which can both capture nonlinear and linear relationships, MLR is only able to find linear relationships. Moreover, ANN, SVR, and MLR are representative algorithms of neural-network-based models (Panapakidis et al. 2020; Karagiannidis and Themelis 2021), instance-based models (Zhou et al. 2021; Zhu et al. 2021), and statistical-learning-based models (Uyanık et al. 2020; Kim et al. 2021), respectively. In this study, to compare and analyze the prediction performance of these three types of models, ANN, MLR, and SVR are selected to predict the vessel emissions.

The ANN algorithm is developed by imitating the information handling process of human brain neurons in order to capture the complex relationships between input and output pairs of data (Rojas 1996). The structure of the ANN algorithm consists of input layers, hidden layers, and output layers. Its training process is based on forward information transfer combined with backpropagation of the errors between the predicted and the actual values. During the training process, the connection weights between any two neurons are adjusted continuously until the errors are small enough to be accepted (Heidari et al. 2016). In this paper, the architecture of the ANN consists of one input layer with nodes corresponding to the input feature set, a varying number of hidden layers determined by hyperparameter tuning, and one output layer yielding the predicted emissions. Given a neuron with m inputs, \(x_i\) is the ith input, \(i\in \left\{ 1,\ldots , m\right\} \), \(\omega _{i}\) is the weight on the connection, b is the bias, and \(f\left( \cdot \right) \) is the activation function; then y is the output according to Eq. (1). The training process of the ANN is to find the proper values of \(\omega \) and b based on the loss function shown in Eq. (2), where n is the number of samples, \(D_i\) is the actual value and \(y_i\) is the predicted value.

$$\begin{aligned}&y=f\left( \sum _{i=1}^{m}\omega _{i}x_i+b\right) \end{aligned}$$
$$\begin{aligned}&{\mathop {\arg \min }\limits _{\varvec{\omega },\,{\varvec{b}}}} \frac{1}{n}\sum _{i=1}^{n}\left( D_i-y_i\right) ^2 \end{aligned}$$

The SVR algorithm is an extension of the support vector machine (SVM) algorithm (Smola and Schölkopf 2004). In the SVR algorithm, the features (\(x_1, x_2,\ldots , x_m\)), which are not linearly separable, are mapped into a higher-dimensional space in which a linear model can be fitted, as shown in Eq. (3), where \(\omega \) is the weight vector, b is the bias and \(\varphi \left( x\right) \) is the feature mapping associated with the kernel function.

$$\begin{aligned}&y = \omega \varphi \left( x\right) +b \end{aligned}$$
$$\begin{aligned}&{\mathop {\min }\limits _{\varvec{\omega },\,{\varvec{b}}}} \frac{1}{2}\Vert \omega \Vert ^2 \end{aligned}$$

subject to:

$$\begin{aligned} |D_i-y_i|\le \varepsilon , \quad i=1,\ldots ,n \end{aligned}$$

In the higher-dimensional space, a regression hyperplane is fitted. The training process of the SVR algorithm minimizes the norm of the weight vector (Eq. (4)), subject to the constraint that the distance from any mapped point to the hyperplane is no more than a defined tolerance margin \(\varepsilon \) (Eq. (5)).

The MLR algorithm uses regression analysis to identify the relationship between one dependent variable (y) and multiple independent variables (\(x_1, x_2,\ldots , x_m\)), as formulated by Eq. (6) (Tranmer et al. 2020). The training process of the MLR algorithm identifies the values of \(\theta _0, \theta _1,\ldots ,\theta _m\) that minimize the residual sum of squares between the actual and the predicted values, as shown in Eq. (7).

$$\begin{aligned}&y = \theta _0+\theta _1x_1+\theta _2x_2+ \cdots +\theta _jx_j+\cdots +\theta _mx_m \end{aligned}$$
$$\begin{aligned}&{\mathop {\arg \min }\limits _{\varvec{\theta }}}\left\{ \sum _{i=1}^{n} \left( D_i-y_i\right) ^2\right\} \end{aligned}$$
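A minimal sketch of the three algorithm families using scikit-learn follows; the data is synthetic, standing in for the standardized feature matrix, and the hyperparameters are library defaults rather than the configuration used in this study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data, e.g., three features such as Min, TKW ME, GT.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "MLR": LinearRegression(),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "ANN": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(100,),
                                      max_iter=2000, random_state=0)),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))  # in-sample R^2
```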

3.6 Performance assessment

To evaluate the performance of the prediction models, we adopt the metrics RMSE, mean absolute error (MAE), and \(\hbox {R}^2\). Five-fold cross-validation is applied, with the performance assessment metrics being averaged over the folds. The RMSE summarizes the spread of the residuals, which express the distances between the predicted and the actual values. The MAE is the mean absolute error between the predicted and the actual values. Lower values of RMSE and MAE represent better prediction results. The \(\hbox {R}^2\) assesses the goodness of fit of the prediction model, with a best value of 1. The calculations of RMSE, MAE and \(\hbox {R}^2\) are shown in Eqs. (8), (9), and (10), respectively, where y is the actual value, \({\hat{y}}\) is the predicted value, \({\bar{y}}\) is the mean of the y values, and n is the number of samples.

$$\begin{aligned} RMSE\left( y,{\hat{y}}\right)&= \sqrt{\frac{1}{n}\sum _{i=1}^n\left( y_i-{\hat{y}}_i\right) ^2} \end{aligned}$$
$$\begin{aligned} MAE\left( y,{\hat{y}}\right)&= \frac{1}{n}\sum _{i=1}^{n}|y_i-\hat{y_i} |\end{aligned}$$
$$\begin{aligned} R^2\left( y,{\hat{y}}\right)&= 1-{\sum _{i=1}^{n}\left( y_i-\hat{y_i}\right) ^2} /{\sum _{i=1}^{n}\left( y_i-{\bar{y}}\right) ^2} \end{aligned}$$
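The five-fold cross-validation with averaged metrics can be sketched with scikit-learn as follows (synthetic data; note that scikit-learn reports error metrics as negated scores):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data for the prediction task.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=100)

# Five folds; RMSE, MAE and R^2 averaged over the folds.
scores = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
)
rmse = -scores["test_neg_root_mean_squared_error"].mean()
mae = -scores["test_neg_mean_absolute_error"].mean()
r2 = scores["test_r2"].mean()
print(round(rmse, 3), round(mae, 3), round(r2, 3))
```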

4 Numerical experiments

This section presents the numerical experiments based on the models using \(F_1\) and \(F_2\), where the definitions are given in Sect. 3.4. The training uses subsets of the presented seven features on 3,808 samples and employs ANN, SVR and MLR. The numerical experiments are carried out using sklearn (Pedregosa et al. 2011) based on the default hyperparameters, although their optimization is exemplified in Sect. 4.4.

First, the results based on incremental select-k-best are exemplified with the models for a specific purpose in Sect. 4.1. Second, the best-performing models for all purposes are identified in Sect. 4.2. Subsequently, a differentiation by tier and algorithm is exemplified in Sect. 4.3. Next, the best-performing models and alternative models are exemplified in Sect. 4.4. Finally, the effect of the feature Min on a time-dependent target is assessed in Sect. 4.5.

4.1 Results for models based on incremental select-k-best

Using the seven feature sets (see Table 5) and a specific combination described by \(F_1(Mai, MAN, CO_2)\) as well as three algorithms (MLR, SVR, ANN), 21 models are trained. Differentiated by tier and algorithm, the values for the metrics \(\hbox {R}^2\), MAE and RMSE are provided in Fig. 3.

Concerning the tiers, the models of this approach yield an acceptable performance starting from \(T_5\), from where on ANN and MLR perform better than SVR. Concerning \(\hbox {R}^2\), Fig. 3a shows that the best-performing models, with an \(\hbox {R}^2\) between 0.90 and 0.92, can be found in \(T_5\), \(T_6\) and \(T_7\) using ANN and MLR. Notably, once the feature Min is added from \(T_5\) onwards, a large performance increase can be observed. Before decreasing the data dependency while trying to maximize the prediction performance, the next section presents the overall best-performing models for each purpose.

Fig. 3

Performance metrics for models based on \(F_1 (Mai, MAN, CO_2)\)

4.2 Best-performing models differentiated by engine, mode and gas

Based on the 127 feature sets and 32 combinations described by \(F_2(e, m, g)\), the application of the three algorithms (MLR, SVR, ANN) results in 12,192 trained models. The subsequent Sects. 4.3, 4.4 and 4.5 make use of a subset of these models in the sense that the combinations and/or feature sets are limited with respect to different aims.

Let us now consider the \(\hbox {R}^2\) values of the best-performing models, irrespective of features or algorithms, in Table 6. The results illustrate that the formulation of \(F_2\) provides a competitive performance with regard to the main engine and the boiler, exceeding an arbitrary threshold of 0.85 \(\hbox {R}^2\). Notably, in the activity-based bottom-up calculation, boiler emissions are based on a simple factor multiplication with the variable Min. In contrast, the main engine emissions depend on load factor changes during activity over time; this complication reduces the performance. Furthermore, the prediction performance regarding the auxiliary engine is significantly worse, ranging from about 0.2 to 0.75 \(\hbox {R}^2\). In general, a strong correlation between the prediction performances of different gases for the same combination of engine and mode is observed, indicating a calculation using factor multiplication.

Table 6 \(R^2\) of each best-performing model based on \(F_2(e, m, g)\)
Fig. 4

\(\hbox {R}^2\) for selected models in different tiers based on \(F_2\)

4.3 Models differentiated by tier and algorithm

In this section, models are differentiated by tier and algorithm for selected \(F_2(e, m, g)\), i.e., \(F_2(Mai, MAN, CO_2)\) and \(F_2(Mai, TRN, NO_x)\). The \(\hbox {R}^2\) values are provided in Fig. 4a and b.

Figure 4a shows no significant performance increase after models in \(T_2\). In general, the sub-figures exemplify the performance diversification between algorithms depending on the constraints. For instance, Fig. 4a shows that MLR provides better results than SVR throughout most of the tiers; however, ANN achieves the best performance in most tiers. Notably, the sub-figure shows that most models reside in two clusters, achieving an \(\hbox {R}^2\) value either greater than 0.7 or less than 0.2. In comparison to \(F_1(Mai, MAN, CO_2)\) (see Fig. 3), the models with fewer features achieve a better performance. Furthermore, Fig. 4b exemplifies the advantage of MLR in one scenario, achieving the highest \(\hbox {R}^2\) value. Moreover, models after \(T_3\) show no significant increase in performance. While the sub-figures provide an overview of the achievable prediction performance per tier, they cannot show on which features the models rely. Therefore, the next section explores models based on \(F_2(Mai, MAN, CO_2)\) in more detail.

4.4 Best-performing and alternative models in each tier

In this section, the best models and alternative models for \(F_2(Mai, MAN, CO_2)\), differentiated by tier and algorithm, are shown. Hyperparameter tuning is carried out for ANN and SVR using a simple grid search; for the ANN, six variants concerning the number of hidden layers and neurons are used, i.e., one or two layers with each layer having 100, 25, or 5 neurons. For the SVR, 16 variants over \(C=\{0.01, 0.1, 1, 10\}\) and \(\epsilon =\{0.01, 0.1, 1, 10\}\) are used. Table 7 shows the experimental results for models with a distinctive feature set. For clarity, only the best models, trained with the same feature set but with different algorithms and hyperparameters, are shown. We exemplify the options to substitute features with one another in case of data unavailability as well as to add features in order to improve the prediction performance.
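The stated grids can be sketched with scikit-learn's GridSearchCV; the data below is synthetic, and only the SVR search is executed here for brevity:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Six ANN variants: one or two hidden layers of 100, 25 or 5 neurons.
ann_grid = {"hidden_layer_sizes": [(n,) for n in (100, 25, 5)]
                                  + [(n, n) for n in (100, 25, 5)]}
# Sixteen SVR variants over C and epsilon.
svr_grid = {"C": [0.01, 0.1, 1, 10], "epsilon": [0.01, 0.1, 1, 10]}

# Synthetic stand-in data for the prediction task.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=80)

search = GridSearchCV(SVR(), svr_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_)
```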

In general, one aims to minimize the number of features, particularly those requiring more effort to obtain, e.g., Min in different activity modes or TKW ME as vessel-characteristics data. This is in contrast to features such as GT and IMOT, which may be easier to obtain (see Sect. 3.1). Accordingly, after model training, each port should conduct a trade-off analysis between the number of costly features and the prediction performance.
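Such a trade-off analysis can be sketched as follows. The \(\hbox {R}^2\) values 0.945, 0.93, and 0.91 mirror the case study; the remaining value and the designation of costly features are illustrative assumptions:

```python
# Select, per number of costly features, the best-performing candidate model.
costly = {"Min", "TKW ME"}  # features assumed expensive to obtain

candidates = [  # (features used, R^2); last entry is a hypothetical value
    ({"Min", "TKW ME"}, 0.945),
    ({"Min", "GT", "IMOT"}, 0.93),
    ({"Min", "GT"}, 0.91),
    ({"GT", "IMOT"}, 0.31),
]

best_per_cost = {}
for feats, r2 in candidates:
    cost = len(feats & costly)  # how many costly features the model needs
    if cost not in best_per_cost or r2 > best_per_cost[cost][1]:
        best_per_cost[cost] = (feats, r2)

for cost in sorted(best_per_cost):
    feats, r2 = best_per_cost[cost]
    print(cost, sorted(feats), r2)
```

A port could then pick the lowest cost level whose best \(\hbox {R}^2\) still meets its accuracy requirement.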

Regarding \(T_1\) models, the performance of the two best models diverges strongly; the importance of the feature Min far exceeds that of the remaining features. In general, a comparison of models within each tier shows that models whose feature set lacks Min perform significantly worse than the best models (see \(\ddagger \) in Table 7). At least in this case study, this cannot be compensated by extending the feature set with the remaining features.

As discovered in Sect. 4.3, using two features for model training can increase the performance significantly. The results show that the six best models all include Min. Notably, for each of these models in \(T_2\), the importance rank of the additional feature changed compared to the \(T_1\) models; see, e.g., TKW ME and GT. Only three models in \(T_2\) achieve a performance above the arbitrary threshold of 0.85 \(\hbox {R}^2\).

Considering that GT and IMOT are easier to obtain, their combination in conjunction with Min (\(T_3\)) achieves a better performance, with 0.93 \(\hbox {R}^2\), than their separate use with Min (\(T_2\)), which achieves 0.91 and 0.73 \(\hbox {R}^2\), respectively. The \(T_3\) model could thus be considered a viable substitute for the \(T_2\) model using Min and TKW ME, which achieves a similar performance of 0.945.

The model with the overall best performance achieves an \(\hbox {R}^2\) value of 0.9577 and belongs to \(T_4\); however, the difference from the best model in \(T_2\), which achieves 0.9450, is minor. Hyperparameter tuning increases the best model's performance only insignificantly.

Table 7 Performance metrics for best-performing model and selected alternative models based on \(F_2(Mai, MAN, CO_2)\), from \(T_1\) to \(T_5\), ordered by tier and \(\hbox {R}^2\)

4.5 Effect of the feature minutes on a time-dependent target

Based on the numerical results obtained in Sect. 4.3, models using the features Min and GT, based on \(F_2(Mai, MAN, CO_2)\), are revisited by listing the \(\hbox {R}^2\) values, the adopted algorithms, and the hyperparameters in full in Table 8. With regard to the activity-based bottom-up methodology, which considers a geographic environment with two main waterways of different sailing distances, the table underlines the importance of including the feature Min in the training of the models.

Here, the models achieve the highest performance using ANN, with an \(\hbox {R}^2\) value of 0.9108. Furthermore, the importance of experimenting with different algorithms is exemplified by the significant performance gaps observed between them.
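The mechanics of such an algorithm comparison on one feature set can be sketched as follows; the emission relation, the data, and the resulting scores are synthetic assumptions for illustration, not the case-study results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
minutes = rng.uniform(30, 300, 400)   # placeholder activity time (Min)
gt = rng.uniform(10, 100, 400)        # placeholder gross tonnage (in 1000 GT)
y = minutes * np.sqrt(gt) / 100.0     # assumed nonlinear emission relation
X = np.column_stack([minutes, gt])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = {}
for name, model in [("MLR", LinearRegression()),
                    ("ANN", MLPRegressor((25,), max_iter=5000, random_state=0)),
                    ("SVR", SVR(C=10))]:
    # Standardize features so ANN and SVR are not dominated by feature scale
    pipe = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr)
    scores[name] = round(r2_score(y_te, pipe.predict(X_te)), 3)
print(scores)
```

Comparing the three scores on held-out data mirrors the experimentation advocated above, although the gaps on this toy relation need not resemble those observed in Table 8.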

Table 8 Performance metrics for models based on \(F_2(Mai, MAN, CO_2)\) using features Min and GT

5 Discussion

Taken together, we show that in the given case study, some prediction models for the same purpose provide a similar performance while requiring less data. In light of the numerical experiments, three aspects are discussed: the case-by-case use of prediction models, the feature dependency between models for different purposes, and the life cycle of prediction models.

Arguably, when the vessel-characteristics data pertinent to a single port, which serve as input to EI formulations, are inconsistent, using different prediction models that fit the available data on a case-by-case, i.e., vessel-by-vessel, basis may constitute an advantage: the prediction models using ANN and SVR are able to capture strong nonlinearity using few features covering the port, the calling vessels' characteristics, and their typical trajectories. This may be difficult to realize with traditional calculation formulation methods (Le et al. 2020; Jassim et al. 2022). For instance, if essential input parameters of an activity-based bottom-up calculation formulation, such as the speed or load factor in small time intervals, are missing, the formulation requires assumptions as substitutes. For nonlinear relationships, setting up groups of vessels to approximate, e.g., load factors, even from the few available features such as ship size or engine power, requires additional effort and may lead to inferior approximations compared to using ANN. Along this line of thinking, establishing a range of prediction models with different feature sets, as presented here, is expected to take less effort than adapting calculation formulations or setting up assumptions for them. We exemplify this idea by exploring a range of prediction models for predicting \(\hbox {CO}_{2}\) emissions of the main engine in the activity mode maneuvering. Here, the prediction performance is 0.94 and 0.91 \(\hbox {R}^2\) using ANN with the feature activity time in minutes in conjunction with the total power of the main engine and the GT, respectively. Moreover, the prediction performance concerning emissions originating from the main engine and the boiler by far exceeds that concerning the auxiliary engine.
Models for boiler emissions achieve the best performance, which can be explained by a simple linear relationship with the activity time in minutes that MLR captures easily. A major drawback is that the activity time is still required to achieve an acceptable performance, which in turn requires, e.g., obtaining and processing AIS data.
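The case-by-case idea could be realized with a simple model registry that dispatches per vessel based on the features actually available. The model identifiers are hypothetical; the \(\hbox {R}^2\) values mirror those reported for the case study:

```python
# Registry of trained models: (required features, R^2, hypothetical model id)
registry = [
    (frozenset({"Min", "TKW ME"}), 0.945, "ANN_T2_a"),
    (frozenset({"Min", "GT", "IMOT"}), 0.93, "ANN_T3"),
    (frozenset({"Min", "GT"}), 0.91, "ANN_T2_b"),
]

def pick_model(available):
    """Return the best-performing model usable with the available features."""
    usable = [(r2, mid) for feats, r2, mid in registry if feats <= available]
    return max(usable, default=None)  # None if no model's inputs are covered

print(pick_model({"Min", "GT"}))           # only the Min+GT model is usable
print(pick_model({"Min", "TKW ME", "GT"})) # a better model becomes usable
```

Per vessel call, the dispatcher falls back to a cheaper model whenever the costlier inputs are missing, instead of substituting assumptions into a calculation formulation.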

Furthermore, we demonstrate that higher-tier models can achieve a higher performance until the relevant features are exhausted. Some features can hardly be substituted (Min), while others may be substituted (TKW ME) with an acceptable loss in prediction performance; thus, some solution space for a trade-off analysis is provided. However, such a trade-off analysis for one purpose cannot be carried out independently if models for other purposes are needed; maximizing the overlap of features between those models has to be considered as well. One way to remedy this issue could be to simplify the framework and reduce the number of models by predicting activity-mode emissions irrespective of the engine, or vice versa.
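A joint trade-off across purposes can be sketched as follows. The main-engine \(\hbox {R}^2\) values are taken from the case study, whereas the auxiliary-engine values are purely hypothetical:

```python
perf = {  # R^2 per (purpose, shared feature set); AE values are hypothetical
    ("ME_CO2", frozenset({"Min", "TKW ME"})): 0.945,
    ("ME_CO2", frozenset({"Min", "GT"})): 0.91,
    ("AE_CO2", frozenset({"Min", "TKW ME"})): 0.50,
    ("AE_CO2", frozenset({"Min", "GT"})): 0.55,
}
purposes = ["ME_CO2", "AE_CO2"]
candidates = [frozenset({"Min", "TKW ME"}), frozenset({"Min", "GT"})]

# Score a shared feature set by its summed R^2 across all purposes, so that
# one set of features can serve every model that must be maintained.
best = max(candidates, key=lambda fs: sum(perf[(p, fs)] for p in purposes))
print(sorted(best))
```

Under these assumed numbers, Min and GT win jointly even though Min and TKW ME would be the best choice for the main engine alone, illustrating why the trade-off cannot be resolved per purpose in isolation.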

Another important concern, not limited to this study, is the life cycle of prediction models. Generic EI formulations and their implementations in information systems can be adapted to the geographic and operational characteristics of any port. In contrast, the trained models are black boxes that cannot be adapted. Their performance likely declines when they are applied in ports with differing characteristics. This even applies to the same port over time, e.g., through the implementation of speed reduction policies or the retrofitting of vessels with scrubbers. Similarly, caution is necessary when employing volatile, policy-influenced vessel ratings as features; for instance, the operational carbon intensity rating of a vessel. Its rating code (A–E) can change over time without any change in the vessel's characteristics or fuel efficiency, since the rating code is planned to be influenced both by the carbon intensity indicator (carbon emissions per unit transport work) and by policy-based declining upper bounds for each rating code over the years (Wang et al. 2021). In these cases, the entire process presented in Fig. 1, including the EI calculation based on an updated methodology and model training, must be carried out again.

6 Conclusions

In this paper, we introduce a framework to create port-related vessel EIs, motivated by the lack of continuous EIs in ports. We investigate whether our approach yields acceptable prediction models with a similar performance while reducing data requirements. The managerial implication is that a range of acceptable prediction models can be subjected to a trade-off analysis under the constraints discussed in Sect. 5. When carrying out the proposed approach, different algorithms, such as ANN, MLR, and SVR, should be employed to achieve a better prediction performance.

One important limitation of this work is that the investigation is constrained to container ships and a time scope of three months. This is in contrast to EIs in practice, which usually provide an annual quantification across multiple vessel types; by extending the scope, uncertainty about the applicability can be mitigated and relevance to practice improved. Second, while the prediction models for the main engine provide an acceptable performance, the auxiliary engine models require improved feature engineering based on the applied EI methodology that provides the target data. For instance, feature sets could be extended to account for the number of reefer containers that consume energy; the load factor of auxiliary engines for reefer container ships is about twice as high as for container ships (EPA 2009). Although this work explores a range of popular algorithms, including ANN and SVR, which can pick up nonlinear patterns, the hyperparameter tuning could be improved by adopting more sophisticated methods such as Gaussian process-based Bayesian optimization (Snoek et al. 2012), the tree-structured Parzen estimator approach (Bergstra et al. 2011), and sequential model-based optimization (Hutter et al. 2011). Finally, algorithms other than ANN, MLR, and SVR, and their tailoring, could be employed to improve the prediction performance.

While addressing the above-mentioned limitations will benefit the applicability of the approach within a port, an interesting research avenue concerns sharing trained models between ports. Future work could evaluate the performance of prediction models applied in one port but trained in another, to avoid the need to carry out another activity-based bottom-up calculation. This may spark interest in the community of ports and promote the use of predictive tools to create EIs, contributing to the development of green ports. Another avenue for future research could be the implementation of EIs for other modes of transportation such as rail and public transport (see, e.g., Chipindula et al. 2022).