Introduction

Processing and utilizing big data is a major challenge in the modern world. The presence of outlier samples in the data often intensifies this challenge. Although outliers can be simply defined as samples that stand out as aberrant or erroneous compared to the rest of the population, detecting them is far more challenging. Fortunately, the problem of outlier detection (sometimes referred to as anomaly detection) has been a topic of research and interest since the \(19^{th}\) century [1], leading to the development of a wide range of outlier or anomaly detection methods and algorithms. Some of these methods are developed for, and only relevant to, specific application domains, while others are more generic and can be adopted for a variety of applications [2]. It is nevertheless important to devise and demonstrate an outlier detection and handling scheme, which may contain one or more of the readily available outlier detection methods, for a dataset concerning a new application domain.

The formulation of an outlier detection scheme for a new application domain depends on several factors, such as the application domain itself, the type and source of data, the objective of further analysis, etc. Such a scheme is generally incorporated in the data pre-processing steps to obtain a clean and ready-to-use dataset. Omitting outlier detection and handling mechanisms may lead to imprecise and unreliable data analysis as well as inaccurate and misleading results, as suggested by McClelland [3]. Moreover, it is well-known that the performance of most data-driven methods deteriorates in the presence of outliers [4]. Thus, outlier detection and handling should be considered an important step before carrying out any substantive analysis on a dataset.

As aforementioned, the formulation of an outlier detection scheme depends on the type of dataset. In the case of a correlated dataset, like one from ships in service, an outlier can be viewed as a sample that defies the expected correlations by a substantial margin, and it may therefore be termed a correlation-defying outlier, as suggested by Gupta et al. [5]. It may be possible to detect these outliers using one of the well-known statistics- or Machine-Learning (ML) based outlier detection techniques. However, most of these techniques detect outliers by taking into account the distribution of the data in the high dimensional variable (or feature) space, paying little attention to the correlations between the variables. Such a technique may cause more harm than good, as it would flag extreme values (like extreme weather observations) as well as rare-event samples as outliers, leading to poor predictions in extreme or rare conditions from models calibrated on the cleaned dataset, as concluded by Suboh and Aziz [6]. Moreover, in the case of an unbalanced dataset, the data samples present in the sparse regions of the high dimensional variable space would probably suffer the same fate as extreme or rare events, resulting in the loss of valuable information. Therefore, it is recommended here to use a correlation-based outlier detection scheme for unbalanced correlated datasets, like ship in-service datasets.

A correlation-based outlier detection scheme can be implemented using any method that identifies the samples which do not follow the prominent correlations observed in the dataset. Although it may be possible to identify such samples using simple regression or curve fitting techniques (as such samples will result in high residuals when fitted to the model), the presence of such outliers will also degrade the model fitting performance, in turn making it difficult to identify the outliers. Moreover, the curse of dimensionality and complex non-linear correlations would make it even more difficult to diagnose the outliers using such a technique. However, it may be possible to use the well-known correlation-based dimensionality reduction algorithms, namely, Principal Component Analysis (PCA) and autoencoders, as they help address some of these challenges, and more so because they have been demonstrated to work effectively in the field of outlier detection, as shown by Sakurada and Yairi [7].

The aim of the current work is to establish a correlation-based outlier detection scheme for an unbalanced dataset obtained from a ship during operations. Such datasets are known to be highly unbalanced and contain several rare but valuable samples, especially at low ship speeds. In order to detect outliers, PCA and autoencoder models are fitted on a dataset representing the hydrodynamic state of a sea-going ship. The dataset includes variables which are either recorded onboard the ship or obtained from a hindcast (or metocean) weather data repository. Since the problem of ship hydrodynamics is well-researched from the physics point of view, it is expected that the data variables would follow some non-linear correlations. To account for the non-linearities in the case of PCA (as PCA is a linear model), some non-linear transformations are introduced based on the domain knowledge (also suggested by Gupta et al. [8]). The calibrated PCA model is further used to understand the physical meaning of the latent variables (or Principal Components). Finally, a correlation-based outlier detection scheme is proposed for ship in-service datasets using the dimensionality reduction methods.

The contents of this paper are organized as follows. Section "Literature review: outlier detection (OD)" presents an overview of method selection based on a brief literature survey. Section "Methods" contains the theory and formulation associated with the adopted methods, i.e., PCA and autoencoders. The results and conclusions are presented in sects. "Results and discussion" and "Conclusion", respectively, with sect. "Data exploration & processing" describing the dataset used to calibrate the models and produce the results.

Literature review: outlier detection (OD)

Since the advent of advanced digital technologies like big data, the Internet of Things (IoT), Machine-Learning (ML), Artificial Intelligence (AI), etc., interest in the topic of outlier or anomaly detection has grown manifold. At the same time, the increasing amount of internet traffic, public surveillance and industrial asset monitoring has created a need to devise methods for detecting and filtering out aberrations (or anomalies) in the collected data as well as recognizing malicious activities autonomously. The field of modern medicine has also taken advantage of the above-mentioned digital technologies and produced several useful methods for ML-assisted disease diagnosis based on anomaly detection algorithms. These factors, along with several others, have encouraged research and interest in the field of anomaly detection, resulting in the development of several advanced algorithms. In fact, quite recently, a group of researchers assembled an open-source library in Python, called PyOD, presented by Zhao et al. [9], containing implementations of more than 40 different outlier detection algorithms.

The availability of a vast number of methods and open-source libraries, with the corresponding software code, is undoubtedly quite helpful when adopting one or more of these methods for a novel application domain. However, it can be quite challenging to ascertain the best-suited method for the given application. It is, therefore, important to develop a basic understanding of the application domain as well as of the long list of anomaly detection methods available at one's disposal. The latter is further complicated by the vast amount of literature available on these anomaly detection methods, where each new method is presented as superior to any other known method. This is probably the biggest dilemma in today's ML-based research community. The overwhelming amount of literature and methods induces a state of confusion for researchers trying to adopt an ML-based approach for a new application domain. Therefore, the method selection process presents the first challenge here.

Method selection

According to the learning methodology, ML-based methods are, in general, broadly divided into the following three categories: supervised, semi-supervised and unsupervised methods. The first two types of anomaly or outlier detection methods require target labels or information regarding the outliers (present in the training dataset) so that the model can learn how to differentiate an outlier from an inlier (or normal sample). The third type, i.e., unsupervised outlier detection methods, can learn to detect outliers without the need for any target labels. The current problem focuses on detecting outliers in the onboard recorded data for sea-going ships, which is certainly unlabeled as well as tedious to label manually. Thus, it is best to adopt an unsupervised ML method here. Moreover, the authors of PyOD reported that the unsupervised approach for outlier detection may perform better than the supervised and semi-supervised approaches in most cases [10]. Unsupervised outlier detection methods detect outliers based on several different techniques. The most widely used techniques are listed as follows:

  (i) Distribution-based: These methods detect outliers based on the distribution of the data samples. If the data distribution is a well-known one, simple statistical measures like the Z-score, interquartile range (IQR), etc. can be used to detect outliers. Otherwise, empirical distributions and histograms can be employed. Generally, in such methods, a threshold is defined by the user for the chosen measure or statistic, and all the samples above this threshold are detected as outliers. Histogram-based Outlier Detection (HBOD; [11]) and Kernel Density Estimation (KDE; [12]) are well-known histogram- and empirical-distribution-based outlier detection methods.

  (ii) Decision-boundary-based: These outlier detection methods generally fall in the category of classification algorithms. They detect outliers by drawing a decision boundary around the normal (or inlier) data samples, which is generally obtained such that the margin between the boundary and the data samples is maximized. The One-Class Support Vector Machine (OCSVM), originally proposed by Schölkopf et al. [13], is one of the most popular decision-boundary-based outlier detection methods. Owing to its popularity as well as its success in the field of anomaly detection, several modifications have been proposed since its inception, most of them implementing a different shape of the decision boundary or surface (originally proposed as a hyperplane by Schölkopf et al. [13]). OCSVM is quite effective for datasets with complex correlations, resulting in very complex graphical distributions, as it first maps the data vectors from the input space to a feature space using non-linear kernel functions. The optimized decision surface is then obtained in the feature space.

  (iii) Decision-tree-based: A decision tree in ML is implemented as a data splitting or partitioning algorithm. Here, the data is recursively split by a randomly chosen feature (or variable) at a randomly chosen splitting value. The recursive data splitting is performed until the objective is achieved, resulting in a hierarchical tree-like structure. When used in a supervised fashion for classification, the objective or the criterion for stopping recursive splitting is to separate samples belonging to individual classes. For unsupervised outlier detection, the objective is to separate outliers from inliers, generally based on the assumption that the samples in sparse regions of the data domain are outliers. Isolation forest [14] is one of the most popular unsupervised decision-tree-based outlier detection methods. Here, each individual sample is split or isolated using lines that are orthogonal to the data axes, and a higher anomaly score is assigned to the samples which need fewer splits. In other words, the samples which can be isolated very easily are considered to be outliers.

  (iv) Distance-based: As mentioned above, outliers are sometimes viewed as samples which are isolated or far away from other data samples. Thus, it is possible to calculate the distance between the samples, and thereafter, identify samples which are beyond a certain threshold from other samples. These methods are further sub-categorized into clustering and nearest-neighbor (or proximity-based) methods. The former divides the data into clusters, and the latter determines the number of samples in the neighborhood of each data sample. The samples falling outside the clusters or having a very sparse neighborhood are detected as outliers. Clustering is one of the most widely used techniques in data analysis and processing, which has led to the development of several advanced clustering algorithms. The most popular ones include k-means [15], density-based spatial clustering of applications with noise (DBSCAN; [16]) and Gaussian mixture models (GMM; [17]). The nearest-neighbor methods, on the other hand, are considered state-of-the-art in the field of outlier detection as they have proven quite effective for datasets with very complex distributions, without the need for projecting the data onto a kernel-based feature space (as in the case of OCSVM). The local outlier factor (LOF; [18]) is the most frequently adopted and successful nearest-neighbor outlier detection method. A brief usage sketch of some of these methods is given after this list.
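To make the usage of these off-the-shelf detectors concrete, the following minimal sketch applies three of the listed methods (isolation forest, LOF and OCSVM) via scikit-learn to a placeholder data matrix; the synthetic data and the 1% contamination level are illustrative assumptions, not values used in this study.

```python
# Minimal sketch: unsupervised outlier detection with scikit-learn.
# X is a placeholder numeric data matrix (rows = samples, columns = variables);
# the 1% contamination level is an illustrative choice, not from the paper.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))                 # placeholder data matrix
X_std = StandardScaler().fit_transform(X)      # standardize the variables

# (iii) Isolation forest: samples isolated with fewer splits get higher anomaly scores
iso = IsolationForest(contamination=0.01, random_state=0).fit(X_std)
iso_labels = iso.predict(X_std)                # -1 = outlier, +1 = inlier

# (iv) Local outlier factor: samples with very sparse neighborhoods are flagged
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_labels = lof.fit_predict(X_std)

# (ii) One-class SVM: kernel-based decision boundary around the inliers
ocsvm = OneClassSVM(kernel="rbf", nu=0.01).fit(X_std)
svm_labels = ocsvm.predict(X_std)

print("IsolationForest outliers:", int(np.sum(iso_labels == -1)))
print("LOF outliers:            ", int(np.sum(lof_labels == -1)))
print("OCSVM outliers:          ", int(np.sum(svm_labels == -1)))
```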

The weakness of all the above methods lies in the fact that they can, at best, detect outliers based on the location and distribution of data in the multidimensional variable space. None of them accounts for the correlations between data variables, and they would therefore tend to detect data samples in sparse regions of the variable space as outliers, even when such samples are rare and valuable. A correlation-based outlier detection scheme, discussed below, would learn the prominent correlations in the dataset and detect outliers based on their deviation from these correlation trends.

Correlation-based outlier detection

A dataset generally contains correlated variables, and the data samples are expected to depict this correlation. It is, therefore, possible to define outliers based on the fact that outliers would substantially deviate from the correlation observed in the rest of the data. In an unsupervised setting, a correlation-based model is calibrated using the complete dataset, and the samples which result in high calibration residuals are detected as outliers. There are two main types of unsupervised correlation-based outlier detection methods: regression and factorization. The regression methods regress the data on itself, i.e., the model is calibrated such that the input and target variables are the same. In other words, the model attempts to learn the correlation between the input variables, and thereafter tries to reconstruct the dataset on the target side. Here, the reconstruction error is minimized to obtain the optimum model, and the final reconstruction errors or residuals are used to assign outlier scores. The samples with high reconstruction residuals are detected as outliers. Neural-network-based autoencoders [19] are the most popular unsupervised regression method for outlier detection.

Factorization methods, on the other hand, factorize the data matrix into several factors or components, each factor carrying a certain amount of variance (extracted from the data matrix). Here, the idea is to project the input features onto a latent space, characterized by the obtained factors. The factors or components are, in fact, the direction cosines of the latent space. Principal Component Analysis (PCA; [20]) is a frequently used method for factorization-based outlier detection. PCA obtains the components such that the first component carries the maximum variance and the succeeding components carry as much of the remaining variance as possible. Since PCA is a linear method, the outliers can be detected with the help of Hotelling’s \(T^2\) statistics [21] and the so-called Q-residuals [22], further explained in sect. "Principal component analysis (PCA)". Hotelling’s \(T^2\) statistics represent the leverage of an individual data sample on the model, and the Q-residuals represent the reconstruction residuals. The samples with high leverage and residuals are considered outliers as they deviate significantly from the underlying correlations in the dataset.

The advantage of using a correlation-based method is that it would probably result in better generalization for a correlated dataset. In other words, it may be able to effectively detect outliers even outside the available (or training) data limits. A better generalization would also be advantageous for unbalanced datasets, where the data is not distributed evenly over the complete data domain. Unlike the methods mentioned in the previous section, the correlation-based outlier detection methods would not identify samples located in sparse regions, i.e., rare but valuable samples, as outliers, unless they defy the expected correlations. The dataset used here is obtained from a sea-going ship, and ship in-service datasets are generally unbalanced, as the ship speed is kept constant for most of a voyage. Moreover, a correlation-based method can also be used to study the correlations between the data variables and confirm our physical understanding of the phenomenon. Therefore, correlation-based outlier detection methods, namely, PCA and autoencoders, are used to detect outliers in the current study. One drawback of using correlation-based methods is that the model may develop a bias due to the presence of a large number of outliers (compared to the total number of samples). In such a case, robust versions of PCA and autoencoders can be employed, as suggested by Chalapathy et al. [23].

Methods

Following the arguments presented above, the current work uses correlation-based outlier detection methods, namely, PCA and autoencoders, to detect outliers. Since PCA is a linear method, an attempt is made to enhance it using non-linear transformations derived from the domain knowledge applicable to the dataset used here.

Principal component analysis (PCA)

PCA is an unsupervised ML method, which factorizes the data matrix into orthogonal and uncorrelated factors, called Principal Components (PCs). The PCs absorb the variability available in the dataset such that the first PC absorbs the maximum variance and the subsequent PCs absorb as much of the remaining variance as possible. Consequently, due to the accumulation of the majority of the variance in the first few PCs, PCA is helpful in reducing the dimensionality of the dataset. In such a case, the last remaining PCs, containing a small amount of remaining variance, are discarded as noise in the dataset. Thus, PCA splits the data matrix (\({\textbf{X}}\)) into a factorized or modelled part (\({\textbf{X}}_{M,\ A}\)) and the noise or residuals (\({\textbf{E}}_A\)) as follows:

$$\begin{aligned} {\textbf{X}}^{m \times n} = {\textbf{X}}_{M,\ A}^{m \times n} + {\textbf{E}}_A^{m \times n} \end{aligned}$$
(1)

The superscripts show the dimensions of each term, i.e., \(m \times n\) shows that the data matrix (\({\textbf{X}}\)) has m rows (or samples) and n columns (or variables). The subscript, A, is formally called the dimensionality of the factorized part, i.e., the number of PCs containing the majority of the variance. It should be noted that the data matrix (\({\textbf{X}}\)) is generally standardized by subtracting the mean and dividing by the standard deviation before factorization, as shown by Gupta et al. [24]. The modelled or factorized part (\({\textbf{X}}_{M,\ A}\)) can be further written as a dot product of PC scores matrix (\({\textbf{T}}_A\)) and the transpose of PC loadings matrix (\({\textbf{P}}_A^\top\)). Here, \({\textbf{P}}_A^\top\) represents the transpose of \({\textbf{P}}_A\).

$$\begin{aligned} {\textbf{X}}^{m \times n} = {\textbf{T}}_A^{m \times A}.\ {\textbf{P}}_A^{\top \ A \times n} + {\textbf{E}}_A^{m \times n} = \sum _{i=1}^A {\varvec{t}}_i^{m \times 1}.\ {\varvec{p}}_i^{\top \ 1 \times n} + {\textbf{E}}_A^{m \times n} \end{aligned}$$
(2)

Here, \({\varvec{t}}_i\) is a column vector, \({\varvec{p}}_i^\top\) is a row vector, and i is the PC number. The loadings (\({\varvec{p}}_i\)) represent the orthonormal eigenvectors or direction cosines in the latent or PC space, and the scores (\({\varvec{t}}_i\)) represent the corresponding eigenvalue-associated orthogonal vectors or the location of the data samples in the latent space. The eigenvectors are also called PCs, which are basically just the latent variables. The two most popular algorithms used to estimate the PC scores and loadings are Singular Value Decomposition (SVD; [25]) and Nonlinear Iterative Partial Least Squares (NIPALS; [26]).
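As an illustration of this factorization, the following sketch computes the scores, loadings and residuals of eq. 2 via SVD on a placeholder standardized data matrix; the matrix dimensions and the choice A = 7 are assumptions for illustration only.

```python
# Minimal sketch of the PCA factorization in eq. 2 using SVD (numpy).
# X is a placeholder standardized data matrix; A is the chosen number of PCs.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 9))            # m x n data matrix (assumed standardized)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

A = 7                                    # retained dimensionality (assumed)
P_A = Vt[:A].T                           # loadings, n x A (orthonormal eigenvectors)
T_A = U[:, :A] * s[:A]                   # scores, m x A
X_model = T_A @ P_A.T                    # factorized / modelled part X_{M, A}
E_A = X - X_model                        # residual matrix E_A

# eigenvalues of the covariance matrix, i.e., the variance carried by each PC
eigvals = s**2 / (X.shape[0] - 1)
explained = eigvals[:A] / eigvals.sum()
print("explained variance of first", A, "PCs:", explained.round(3))
```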

Influence Plots

In order to detect outliers using PCA, statisticians generally employ influence plots. The influence plot helps identify outliers based on the residuals (\({\textbf{E}}_i\)) and the leverage or influence of each data sample on the model. Here, \({\textbf{E}}_i\) is the residual matrix left after extracting i PCs from \({\textbf{X}}\). The residuals (\({\textbf{E}}_i\)) are used to calculate the Q-residuals for each sample by simply squaring the standardized residuals, and then summing over all the variables. In vector notation, the Q-residual corresponding to the \(k^{th}\) data sample and \(i^{th}\) PC is calculated as:

$$\begin{aligned} Q_{i,\ k} = {\varvec{e}}_{i,\ k}.\ {\varvec{e}}_{i,\ k}^\top \end{aligned}$$
(3)

Here, \({\varvec{e}}_{i,\ k}\) is the \(k^{th}\) row, corresponding to the \(k^{th}\) data sample, in the standardized residual matrix (\({\textbf{E}}_i.\ {\textbf{S}}\), where \({\textbf{S}}\) is a diagonal matrix containing the inverse of standard deviations for each data variable), and the symbol \(^\top\) represents the transpose. If the data matrix (\({\textbf{X}}\), in eq. 2) is standardized, the estimates of standard deviations would be 1 and \({\textbf{S}}\) would become an identity matrix. Since the residuals of a linear regression model follow a Gaussian (or normal) distribution, it is possible to obtain a critical limit for the Q-residuals (as presented by Jackson and Mudholkar [22, 27], and recently by Thennadil et al. [28]), which can be further used to detect outliers (demonstrated further in sect. "Outlier detection").

The other axis of an influence plot, i.e., leverage, is quantified as Hotelling’s \(T^2\) statistics [21]. The \(T^2\) statistic represents the distance of the data sample from the center (or multivariate mean) of the dataset, and therefore, its influence on the model. Hotelling’s \(T^2\) statistics corresponding to \(k^{th}\) data sample and \(i^{th}\) PC can be calculated as follows:

$$\begin{aligned} T_{i,\ k}^2 = {\varvec{t}}_{i,\ k}.\ \lambda _i^{-1}.\ {\varvec{t}}_{i,\ k}^\top \end{aligned}$$
(4)

Where \(\lambda _i\) is the eigenvalue associated with \(i^{th}\) PC. Similar to Q-residuals, it is possible to obtain a critical limit for Hotelling’s \(T^2\) statistics, as presented by MacGregor and Kourti [29] and recently by Thennadil et al. [28].
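A minimal sketch of how these two influence-plot quantities can be computed per sample, following eqs. 3 and 4, is given below; the placeholder data and the retained dimensionality are illustrative assumptions.

```python
# Minimal sketch of sample-wise Q-residuals (eq. 3) and Hotelling's T^2 (eq. 4)
# for a PCA model with A retained components; placeholder standardized data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 9))                 # assumed standardized (S = identity)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = 7                                         # retained PCs (assumed)
P_A = Vt[:A].T                                # loadings
T_A = U[:, :A] * s[:A]                        # scores
eigvals = s**2 / (X.shape[0] - 1)             # eigenvalue per PC

E_A = X - T_A @ P_A.T                         # residuals after A PCs
Q = np.sum(E_A**2, axis=1)                    # Q-residual per sample (eq. 3)

# Hotelling's T^2 per sample: sum over retained PCs of t_i^2 / lambda_i (eq. 4)
T2 = np.sum(T_A**2 / eigvals[:A], axis=1)

print("Q-residuals (first 5):", Q[:5].round(3))
print("Hotelling T^2 (first 5):", T2[:5].round(3))
```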

Autoencoders (AE)

Autoencoders, originally introduced as the Replicator Neural Network (RNN; [19]), are a neural-network-based ML method. A neural network can be seen as a complex mathematical transformation applied to the input data in order to map it to the corresponding target values. Here, the complexity arises from the fact that the mathematical transformation is actually a series of linear or non-linear mappings (or mathematical operations) applied in succession, depicted by the sequence of layers in the network architecture. The target (\({\textbf{Y}}\)) can, therefore, be written as a function of the input (\({\textbf{X}}\)) as follows:

$$\begin{aligned} {\textbf{Y}} = f({\textbf{X}}; \Theta ) = f^{L}_{\theta _L}(f^{L-1}_{\theta _{L-1}}(...f^2_{\theta _2}(f^1_{\theta _1}({\textbf{X}})))) \end{aligned}$$
(5)

Here, \(\Theta = [\theta _1, \theta _2,..., \theta _{L-1}, \theta _L]\) is the set of layer parameters, and L is the number of layers in the network architecture. Further, each mapping (\(f^l_{\theta _l}\)) is defined as a linear or non-linear transformation applied after a linear operation.

$$\begin{aligned} f^l_{\theta _l}({\textbf{X}}) = \sigma _l({\textbf{W}}_l \cdot f^{l-1}_{\theta _{l-1}}({\textbf{X}}) + {\textbf{b}}_l) \end{aligned}$$
(6)

Here, \(\sigma _l\), \({\textbf{W}}_l\) and \({\varvec{b}}_l\) are the activation function, weight matrix and bias vector of the \(l^{th}\) layer in the model (together constituting \(\theta _l\)), respectively. The non-linearity in the model is introduced by using a non-linear activation function, which may be the same or differ across the layers of the model. The most frequently used activation functions are linear, sigmoid, ReLU (Rectified Linear Unit) and tanh (hyperbolic tangent). The activation functions are user-specified hyperparameters, whereas the weights and biases are parameters estimated by minimizing the cost function during model training, considering the hypothesis \({\textbf{Y}} = f({\textbf{X}}; \Theta )\). The most commonly used cost or loss function is the Mean Squared Error (MSE) with \(L_2\) regularization, as presented below:

$$\begin{aligned} {{\textbf {J}}}(\Theta )=\frac{1}{m}(\hat{{{\textbf {Y}}}} -{{\textbf {Y}}})^\top (\hat{{{\textbf {Y}}}} - {{\textbf {Y}}}) + \lambda \sum _{l=1}^{L} (||{\textbf{W}}_l||^2 + ||{\varvec{b}}_l||^2) \end{aligned}$$
(7)

Where m is the number of samples in Y, \(\hat{{\textbf{Y}}}\) is the estimate of \({\textbf{Y}}\) obtained by the model and \(\lambda\) is the weight decay or regularization parameter.

An autoencoder is an adaptation of an ordinary Artificial Neural Network (ANN) with the following main differences:

  (i) An ANN is a supervised ML method, requiring target or labeled data, whereas an autoencoder is an unsupervised ML method, where the input itself is used as the target, resulting in its original name, Replicator Neural Network [19].

  (ii) Unlike an ANN, an autoencoder must have an odd number of layers in the model architecture, where the neurons in the center-most layer are analogous to the factors or PCs in the case of factorization methods like PCA. The simplest autoencoder model contains 3 layers, i.e., an input layer, an output (target) layer and one hidden layer. Moreover, the center-most hidden layer generally contains fewer neurons than the number of input features. A minimal sketch of such a network is given after this list.
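The sketch below, written in PyTorch, illustrates the input-as-target training, the bottleneck latent layer, and the MSE loss with \(L_2\) regularization of eq. 7 (here applied through the optimizer's weight decay); the layer sizes, activations and optimizer settings are illustrative assumptions and not necessarily those listed in Table 2.

```python
# Minimal autoencoder sketch in PyTorch: the input is regressed onto itself,
# the center-most layer forms the latent variables (bottleneck), and the MSE
# loss plus weight decay mirrors the regularized cost function of eq. 7.
# Layer sizes and activations are illustrative assumptions, not Table 2 values.
import torch
import torch.nn as nn

n_features, n_latent = 9, 7

model = nn.Sequential(
    nn.Linear(n_features, 9), nn.Tanh(),   # encoder hidden layer
    nn.Linear(9, n_latent), nn.Tanh(),     # center-most (latent) layer
    nn.Linear(n_latent, 9), nn.Tanh(),     # decoder hidden layer
    nn.Linear(9, n_features),              # linear output (reconstruction) layer
)

X = torch.randn(500, n_features)           # placeholder standardized data
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

for epoch in range(200):                   # simple full-batch training loop
    optimizer.zero_grad()
    X_hat = model(X)                       # reconstruction of the input
    loss = loss_fn(X_hat, X)               # reconstruction error (input = target)
    loss.backward()
    optimizer.step()

# Reconstruction with gradient calculations turned off, as done in the paper
with torch.no_grad():
    X_rec = model(X)
    sample_mse = ((X_rec - X) ** 2).mean(dim=1)   # sample-wise MSE for outlier scoring
```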

Results and discussion

The primary objective of the current paper is to demonstrate correlation-based outlier detection using dimensionality reduction methods, namely, PCA and autoencoders. However, it may be a good idea to first understand how these methods work. It is clear from sect. "Literature review: outlier detection (OD)" that PCA obtains a set of optimized latent variables (popularly called PCs) to reduce the dimensionality of the data, and, although it is less easy to see, autoencoders do the same. In the latter case, the neurons in the center-most layer (generally containing the smallest number of units) represent the latent variables. The primary difference between PCA and autoencoders is, therefore, just the fact that the latent variables from PCA are orthogonal to each other, whereas they are non-orthogonal in the case of autoencoders, which, as shown further on, makes a substantial difference. In order to demonstrate this, sect. "Latent variables (LVs)" presents the latent variables obtained using both these algorithms, based on the dataset described in the following section ("Data exploration & processing"). The latent variables obtained using PCA are further studied to understand their physical meaning (in sect. "Physical meaning of LVs"). Further, sect. "Data reconstruction" presents the data reconstructed using a selected set of latent variables from both algorithms, and finally, sect. "Outlier detection" presents the results for outlier detection using PCA and autoencoders.

Data exploration & processing

The data used for the current work is an assimilation of data recorded onboard a 7000 TEU post-panamax container ship and weather hindcast data obtained from one of the metocean data repositories. The onboard recorded data samples are obtained as 15-minute averaged values using an onboard installed energy management web application, called Marorka Online. The data is recorded over a period of about 4.5 years covering numerous sea voyages. The wind and wave hindcast data is interpolated, in space and time, to the ship's location for all the onboard recorded data samples. The wave data is obtained from MFWAM (Météo-France WAve Model), but the source of the wind data is unknown. Unfortunately, no additional information regarding the ship is available as the data is supplied anonymously.

Figure 1 shows the speed-through-water (STW) and shaft power raw data recorded onboard the ship over a period of about 4.5 years. The raw data is further processed as per the data processing framework presented by Gupta et al. [5]. Since the onboard recorded ship data does not include the information regarding the acceleration and deceleration of the ship, the two-stage quasi-steady-state filter suggested by Gupta et al. [5] is used here to remove all the samples where the propeller shaft rpm of the ship is voluntarily changed by the ship’s captain. Such samples would be dominated by unaccounted dynamic effects, like non-zero gradients of the ship’s motion. Including them for model calibration may result in an undesirably biased model. The right-hand side subplot in Fig. 1 shows the samples remaining after applying the quasi-steady-state filter. These remaining samples are said to be in a quasi-steady state, where the gradients of the ship’s motions are negligible.

Fig. 1: Speed-through-water vs shaft power data recorded onboard the ship. The raw data contains samples from all the voyages recorded over a period of about 4.5 years. The cleaned data is obtained after applying the quasi-steady-state filter [5] to the raw data

Looking at the cleaned data (Fig. 1b), it is observed that some samples, falling on horizontal straight lines just above the 6000 kW mark and just below the 22,000 kW mark, depict strange behavior. Additionally, some more samples at very small shaft power (< 1000 kW) have unexpectedly large STW (around 15 knots). These samples clearly do not follow the correlation depicted by the other data samples, and therefore, they can be labeled as correlation-defying outliers (as suggested by Gupta et al. [5]). A further investigation reveals that these samples show anomalous behavior due to the temporary freezing of the shaft power and rpm sensors during data recording. In any case, such samples should be identified and dealt with before the dataset can be used for any further analysis. Thus, the methodology developed here is validated based on the criterion that at least these anomalous samples are identified as outliers.

Latent variables (LVs)

As mentioned in sect. "Principal component analysis (PCA)", an LV is nothing but a direction cosine, representing a particular direction in the high dimensional variable space. The latent space is, in fact, the same as the original variable space but with a different set of axes, and in the case of linear models like PCA, the new axes are just rotated and/or translated versions of the original data axes. However, if a non-linear model like an autoencoder is used, the axes can be transformed in a much more complicated manner. Similar results can be obtained by applying a set of non-linear transformations to the dataset before carrying out PCA, forming the basis for methods like kernel PCA [30]. Unfortunately, such methods are known to be computationally expensive (as pointed out by Sakurada and Yairi [7]) as well as difficult to interpret. However, since the problem of ship hydrodynamics is well-studied, it is possible to use simple non-linear transformations obtained from domain knowledge to handle the non-linearities in the data, as demonstrated by Gupta et al. [8]. Table 1 shows the non-linear transformations used for NL-PCA, i.e., the PCA model with non-linear transformations, as well as the variables (or features) used in the (linear) PCA and autoencoder models.

Table 1 Data variables used by ML models. Here, ‘All’ means all three models, i.e., (linear) PCA, NL-PCA, and Autoencoders, and \(V_{WL}\) is used as a symbol for longitudinal wind speed
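As an illustration of the NL-PCA idea, the sketch below appends domain-knowledge-based non-linear features to a placeholder dataset before fitting a linear PCA; the cubic terms reflect the cubic speed/rpm trends discussed later, but the exact transformations and variable set of Table 1 are not reproduced here and the feature names should be treated as assumptions.

```python
# Illustrative sketch of adding domain-knowledge-based non-linear features
# before a linear PCA (the NL-PCA idea). The cubic terms follow the cubic
# speed/rpm trends mentioned in the paper; the exact feature set of Table 1
# is not reproduced here, and all column names and ranges are assumptions.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "stw": rng.uniform(5, 20, 1000),          # speed-through-water [kn] (placeholder)
    "shaft_rpm": rng.uniform(30, 90, 1000),   # shaft rpm (placeholder)
    "shaft_power": rng.uniform(1e3, 2.2e4, 1000),   # shaft power [kW] (placeholder)
    "V_WL": rng.normal(0, 5, 1000),           # longitudinal wind speed (placeholder)
})

# Non-linear transformations motivated by domain knowledge (assumed here)
df["stw_cubed"] = df["stw"] ** 3
df["shaft_rpm_cubed"] = df["shaft_rpm"] ** 3

X = StandardScaler().fit_transform(df.values)  # standardize before factorization
pca = PCA().fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```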

In the case of PCA, the number of PCs (or LVs) is limited by the rank of the data matrix, i.e., the number of linearly independent columns or variables. Additionally, the latent space is organized such that most of the variance is contained within a very small number of PCs, resulting in dimensionality reduction. However, if most of the variables are linearly independent, which is most definitely true for the current case (observing the list of variables in Table 1), reducing the dimensionality of the data is not lucrative anymore, but studying the obtained PCs to understand the correlations between the data variables is still useful. Figure 2 shows the LVs obtained using the (linear) PCA, NL-PCA, and autoencoder models. In the case of PCA, the projection of the \(i^{th}\) LV on the variable space (shown in Fig. 2) is obtained by first calculating the \(i^{th}\) PC matrix (\(\textbf{PC}_i\)), as per the following equation, and then extracting the columns corresponding to the variables (speed-through-water and shaft power in this case) on which the projection is being presented.

$$\begin{aligned} \textbf{PC}_i = {\varvec{t}}_i^{m \times 1}.\ {\varvec{p}}_i^{\top \ 1 \times n} \end{aligned}$$
(8)
Fig. 2: Latent variables (LVs) obtained from all three models plotted over the cleaned (QSS filtered) data

Table 2 Hyperparameters for the autoencoders model

The LVs from the autoencoder model, on the other hand, are calculated by dropping (or multiplying by zero) the output from all but one neuron in the center-most hidden layer. The percentage values presented in the legends of Fig. 2 show the explained variance, calculated as the variance of the LV divided by the total variance in the data. It should be noted here that the total explained variance of all the LVs from PCA adds up to an almost perfect 100%, whereas this is not the case for autoencoders due to non-orthogonality, which results in duplication or leakage of variance. The hyperparameters of the autoencoder model, listed in Table 2, are tuned to obtain the optimum results for the given dataset. However, in order to draw a fair comparison with the PCA models, the number of neurons in all 3 hidden layers is fixed at 9 for the results presented in the current section. For the later sects. ("Data reconstruction" and "Outlier detection"), the number of neurons in the hidden layers is also tuned. Looking at Fig. 2, as expected, the PCs from the linear PCA model are unable to fit the non-linear trends, whereas the non-linear PCs (from the NL-PCA model) clearly fit the cubic trends in the data well. On the other hand, it is surprising to see that the LVs obtained from the autoencoder model appear mostly linear, even though non-linear activation functions are used in each hidden layer. The results, therefore, indicate that the PCs produced by the NL-PCA model are the most relevant for studying and understanding the prominent correlations within the dataset.
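The following sketch illustrates, on placeholder data, how the projection of eq. 8 and the explained variance of an individual LV (variance of the LV divided by the total data variance) can be computed for a PCA model; the column indices chosen for the projection are assumptions.

```python
# Sketch of eq. 8: rank-one projection of the i-th PC back into variable space,
# and its explained variance (variance of the LV / total data variance).
# Placeholder data; the column indices for STW and shaft power are assumptions.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 9))              # standardized data matrix (placeholder)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                                  # all PC scores
P = Vt.T                                   # all PC loadings

i = 0                                      # first PC
PC_i = np.outer(T[:, i], P[:, i])          # eq. 8: t_i . p_i^T  (m x n matrix)

# projection onto two chosen variables, e.g. STW (col 0) and shaft power (col 1)
proj_stw_power = PC_i[:, [0, 1]]

explained_i = T[:, i].var() / X.var(axis=0).sum()
print(f"PC{i + 1} explained variance: {100 * explained_i:.1f}%")
```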

Physical meaning of LVs

The correlation between the data variables and PCs is quantified as correlation loadings, also used by Cadima and Jolliffe [31] to interpret the PCs. The correlation loadings can further be used to: (a) Study the correlation between different data variables; (b) Understand the relative importance of each data variable; and (c) Interpret the physical meaning of each LV. Table 3 presents the correlation loadings for all the PCs obtained using the current dataset. Two (or more) data variables which are strongly correlated with an individual PC (highlighted by the red background in Table 3) intrinsically have a strong correlation with each other. The correlation loadings corresponding to each PC are also shown graphically in Fig. 3 after scaling, such that the highest correlation loading is scaled up to 1 for each PC. Table 3 and Fig. 3 clearly show that the shaft power is highly correlated with the shaft rpm and speed-through-water, whereas the correlation with other variables is almost negligible. This is expected as the data, presented in Fig. 1, shows a very small influence of environmental loads.
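A minimal sketch of how such correlation loadings can be computed, assuming they are taken as the Pearson correlation between each (standardized) data variable and the scores of each PC, is given below on placeholder data.

```python
# Minimal sketch of correlation loadings: the Pearson correlation between each
# data variable and the scores of each PC. Placeholder standardized data.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 9))                  # standardized data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                                      # PC scores (m x n)

n_vars, n_pcs = X.shape[1], T.shape[1]
corr_loadings = np.empty((n_vars, n_pcs))      # rows = variables, columns = PCs
for j in range(n_vars):
    for i in range(n_pcs):
        corr_loadings[j, i] = np.corrcoef(X[:, j], T[:, i])[0, 1]

print(np.round(corr_loadings, 2))
```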

Table 3 Correlation loadings for NL-PCA model
Fig. 3: Scaled correlation loadings for the NL-PCA model, based on the values presented in Table 3. The percentage values in the plot legend are the amount of variance or information contained in the corresponding PC

The physical meaning of each LV can be clearly understood by looking at the scaled correlation loadings (Fig. 3) and their actual numerical values (shown in Table 3). Besides, visualizing the projection of the LVs in the real data space (as shown in Fig. 2) acts as a further confirmation. For instance, it is quite clear (from Figs. 2, 3 and Table 3) that the first PC from both PCA models mainly represents the correlation between the speed, power, and rpm. In other words, the first PC represents the line along which the speed-through-water and shaft power would move if the rpm of the propeller is changed. A similar analysis can be done for the autoencoder model using its weight matrices, but it would be far more complicated due to the complex interconnections, especially in the case of more than one hidden layer.

Data reconstruction

Fig. 4: Data reconstruction with PCA, NL-PCA and autoencoders using 7 LVs. The original data (top-left) is also presented here for visual comparison. The reconstruction error (RMSE and MAE) is presented in the title of the corresponding subplots

In the case of PCA, the data is reconstructed by simply multiplying the PC scores (\({\textbf{T}}_A\)) corresponding to each data sample with the transpose of the PC loadings (\({\textbf{P}}_A\)), as given in eq. 2. Here, the number of prominent factors (or the model dimensionality, A) is user-specified, generally based on the distribution of information or variance among the factors. For instance, in the case of the current dataset, the first 7 PCs contain almost 100% of the variance contained in the original dataset. Therefore, this dataset can be reconstructed using only the first 7 PCs. Although the reduction from 9 data variables to 7 PCs is not significant at all, the reconstruction process allows us to detect correlation-defying outliers, as shown in the next section ("Outlier detection").

In the case of autoencoders, the number of LVs is regulated by the number of neurons in the center-most hidden layer, and the data is reconstructed using the trained network, after turning off the gradient calculations. Here, it is crucial to check whether the reconstructed data depicts the same trends, linear and non-linear, as present in the original data. If not, then either the adopted hyperparameters or the number of LVs used to reconstruct the data should be reconsidered. However, including too many LVs for data reconstruction may result in carrying undesired noise into the reconstructed data (as the left-out LVs generally constitute uncorrelated noise in the data) and increase the computational time. Based on these considerations as well as an acceptably small reconstruction error, namely, the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), 7 autoencoder LVs are used to reconstruct the current dataset. Figure 4 shows the reconstructed data obtained using all three models, i.e., PCA, NL-PCA, and autoencoders. The figure also shows the original data so that a visual comparison can be carried out. As expected, the reconstructed data from NL-PCA and autoencoders resembles the original data quite well, and the reconstruction error from both models is substantially small, whereas the same cannot be said for the linear PCA model. With only 7 LVs, the linear PCA model fails to capture the non-linear trends in the data.

Fig. 5: Data reconstruction with the linear PCA model using 7, 8, and 9 PCs. The reconstruction error (RMSE and MAE) is presented in the title of the corresponding subplots

The reconstructed data in Fig. 4 indicates that the linear PCA model is unable to capture the non-linear trends in the data. However, it is known that using the maximum number of PCs (equal to the rank of the data matrix) would reproduce the dataset itself (with negligible reconstruction error), even in the case of the linear PCA model. Therefore, it may be interesting to investigate further and have a look at the last 2 discarded PCs. Figure 5 shows the data reconstructed using the linear PCA model with 7, 8 and 9 PCs. The data reconstructed using 8 and 9 PCs (i.e., including the discarded PCs) is, in fact, able to model the non-linear trends in the dataset, countering the general belief that linear PCA cannot model the non-linearities in the data. Rather, this indicates that the non-linear trends are extracted as separate PCs, which may be deprioritized by the model, i.e., obtained as some of the last PCs, and thereby discarded when the data is factorized. Nevertheless, it may not be a good idea to use a linear PCA model for outlier detection in this case, as it models the non-linearities using the last few PCs, which may also be contaminated with undesired noise.

Outlier detection

Fig. 6: Outliers detected (marked by red squares) with NL-PCA using 99% confidence limits on Q-residuals and Hotelling's \(T^2\) statistics

Fig. 7: Outliers detected (marked by red squares) with NL-PCA using the 99% confidence limit only on Q-residuals

The reconstructed data forms the basis for correlation-based outlier detection. Here, it is asserted that the data samples falling far away from the correlation trends, i.e., correlation-defying samples, are considered outliers. Based on this assertion, sample-wise reconstruction error is calculated, and the samples with high reconstruction error are recognized as outliers. However, in the case of PCA, the influence plots (explained in sect. "Principal component analysis (PCA)") present another way to detect outliers, based on the hypothesis that only samples with both high reconstruction error and high leverage should be considered outliers, as the samples with small leverage are basically harmless due to their small influence on the model. Figure 6 shows the outliers detected in the current dataset using the influence plot obtained from the NL-PCA model with 7 PCs. Here, the x-axis shows the leverage in terms of Hotelling’s \(T^2\) statistics [21], and the y-axis shows the reconstruction error in terms of Q-residuals (equivalent to normalized mean squared error) for each data sample. The samples above 99% confidence limits for Hotelling’s \(T^2\) statistics and Q-residuals are recognized as outliers.
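For illustration, the sketch below computes the two influence-plot quantities and one common form of their 99% confidence limits (an F-distribution-based limit for Hotelling's \(T^2\) and Box's chi-squared approximation for the Q-residuals); this is a hedged sketch on placeholder data, not necessarily the exact limit formulations of [22, 28, 29] used in the paper.

```python
# Hedged sketch of 99% confidence limits for Hotelling's T^2 and Q-residuals.
# Uses one common F-distribution form for T^2 and Box's chi-squared
# approximation for Q (not necessarily the exact Jackson-Mudholkar form).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 9))                 # standardized placeholder data
m, n = X.shape
A = 7                                         # retained PCs (assumed)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T_A = U[:, :A] * s[:A]
P_A = Vt[:A].T
eigvals = s**2 / (m - 1)

T2 = np.sum(T_A**2 / eigvals[:A], axis=1)     # Hotelling's T^2 per sample
Q = np.sum((X - T_A @ P_A.T)**2, axis=1)      # Q-residual per sample

alpha = 0.99
T2_lim = A * (m - 1) / (m - A) * stats.f.ppf(alpha, A, m - A)

theta1 = eigvals[A:].sum()                    # based on the discarded eigenvalues
theta2 = (eigvals[A:]**2).sum()
g, h = theta2 / theta1, theta1**2 / theta2
Q_lim = g * stats.chi2.ppf(alpha, h)          # Box's approximation of the Q limit

outliers = (T2 > T2_lim) & (Q > Q_lim)        # influence-plot criterion (as in Fig. 6)
print("outliers flagged:", int(outliers.sum()))
```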

Fig. 8: Outliers detected (marked by red squares) with AE using the 99% confidence limit on the empirical distribution of sample-wise MSE

Fig. 9: Outliers detected (marked by red squares) with NL-PCA using the 99% confidence limit on the empirical distribution of sample-wise MSE

The samples marked in Fig. 6 do not seem to include all the expected outliers (discussed in sect. "Data exploration & processing"), probably due to the exclusion of samples with small leverage (or Hotelling's \(T^2\) statistics). Figure 7 shows the outliers detected with the 99% confidence limit applied only on the Q-residuals, i.e., also including samples with small leverage. Here, most of the expected outliers are marked, demonstrating the effectiveness of influence plots and PCA for outlier detection. Moreover, the methodology avoids misclassifying rare but valuable samples in the lower speed-through-water range as outliers. In the case of autoencoders, it is not possible to calculate the leverage and statistical confidence limits due to the non-linear nature of the model. However, it is possible to construct an empirical distribution from the sample-wise mean squared error (MSE) and use it to detect potential outliers. Figure 8 shows the sample-wise MSE from the autoencoder model and the outliers detected with the 99% confidence limit on the empirical distribution. In order to draw a fair comparison, similar results are presented for the NL-PCA model in Fig. 9.
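A minimal sketch of this empirical-distribution criterion, flagging samples whose reconstruction MSE exceeds the 99th percentile, is given below; the data and its reconstruction are placeholders.

```python
# Sketch of the empirical-distribution criterion: flag samples whose
# reconstruction MSE exceeds the 99th percentile of the sample-wise MSE.
# X and X_rec are placeholders for the original and reconstructed data.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 9))
X_rec = X + rng.normal(scale=0.1, size=X.shape)   # stand-in for a model reconstruction

sample_mse = np.mean((X - X_rec)**2, axis=1)      # sample-wise MSE
threshold = np.percentile(sample_mse, 99)         # 99% empirical confidence limit
outliers = sample_mse > threshold                 # ~1% of samples flagged by construction

print(f"threshold = {threshold:.4f}, outliers = {int(outliers.sum())}")
```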

The results from Figs. 8 and 9 indicate that using empirical distributions (based on the sample-wise MSE) for outlier detection provides good results, especially in the case of autoencoders. However, the risk of using an empirical distribution should be kept in mind, as it would always result in the removal of a fixed fraction of the samples (depending on the chosen confidence limit), irrespective of how many outliers are actually present in the dataset. If a known statistical distribution is used to obtain the confidence limits, the number of detected outliers would depend on the actual distribution of the residuals, avoiding the problem of removing too many or too few samples. On another note, comparing the outliers detected from NL-PCA using the influence plot and the sample-wise MSE (i.e., Figs. 7 and 9), it is observed that the former results in detecting too many outliers in the higher range of speed and power. This is due to the cubic transformations applied to the speed-through-water and shaft rpm in the NL-PCA model, resulting in higher numerical values of the Q-residuals for samples with high speed and rpm. Thus, linearizing the residuals before carrying out outlier detection may be desirable, but the linearized residuals would not fit any known distribution, and therefore, no statistical distribution-based confidence limits can be obtained for them. However, it may be possible to use an ensemble of linear PCA models (without any non-linear transformation) in order to avoid this problem.

Conclusion

Outlier detection is a challenging task when handling large amounts of data. Most of the popularly-known outlier detection methods misidentify rare but valuable samples, observed in sparse regions of the variable space, as outliers, thereby making outlier detection ineffective for unbalanced datasets. The current work mainly addresses this issue and contributes the following:

  • The proposed scheme, using PCA and autoencoders, detects correlation-defying outliers and avoids misidentifying rare but valuable samples as outliers in unbalanced datasets.

  • The validation of the proposed scheme is carried out using in-service data recorded onboard a sea-going ship, which is highly unbalanced, and therefore, quite suitable for robust validation.

  • The paper demonstrates that the linear but efficient PCA model, empowered by non-linear transformations (here referred to as NL-PCA) obtained using domain knowledge, is able to model non-linear correlations. The principal components (PCs) from the NL-PCA model are also found to be best suited for understanding the physical meaning of the latent variables as well as studying the correlations between the data variables.

As mentioned above, the validation of the proposed scheme is carried out using a dataset recorded onboard a ship over several sea voyages. Because the propulsive state remains unchanged over large portions of a voyage, onboard recorded datasets from ships are generally highly unbalanced, and they contain several rare but valuable samples, especially at low ship speeds. The proposed outlier detection scheme not only detects the appropriate outliers but also avoids detecting rare samples as false positives. It is therefore proven effective for detecting outliers in unbalanced datasets, like ships' in-service datasets as well as other similar datasets. Nevertheless, a thorough comparison with the other outlier detection methods, discussed in sect. "Method selection", may be required. However, that would necessitate labeling the outliers or obtaining a similar dataset with labeled outliers. Thus, it is left as possible future work.