1 Introduction

Knowledge of the sea state is important for some critical maritime operations [4]; for example, landing helicopters on ships is more dangerous in higher sea states. Additionally, ocean structures, platforms and ships can be built more robustly when sea state information is available. This creates a need for instruments that can measure the ocean surface, and for techniques that can use these data to provide accurate and reliable information about the state of the sea.

The ocean surface is usually measured with in situ devices, such as wave buoys and tidal gauges [15, 20]. In the literature, algorithms such as harmonic analysis [28] and the wavelet network model [12] are applied to tidal gauge data in the nearshore for the prediction of the water level. The ocean surface can also be measured with remote sensing devices, such as shipborne radars [8], satellites [6] and video cameras [5].

This work investigates the estimation of the sea state from a single uncalibrated camera. We do not utilise techniques that are used for the prediction of the wave elevation from tidal gauges (such as harmonic analysis and wavelet networks) because the pixel intensity from video does not correspond directly to wave elevation. The pixel intensity can be considered proportional to the light reflected from the water surface [11].

Remote sensing from simple video cameras has been widely applied for acquiring the nearshore hydrodynamics and morphology [7, 14, 26]. The bathymetry is estimated with a celerity-based depth-inversion method that utilises the dispersion relation of shallow water and the spatial correlation of pixel intensity signals indicating propagation of waves. Based on this information, the nearshore sea levels [9, 18] and current predictions [22] are acquired.

The present work presents a technique that is applicable to real environments (unlike [11]) and to deep water (unlike [18, 19, 27, 29, 30]), does not use in situ devices for calibration (unlike [11, 30]) and is validated with videos that have corresponding in situ measurements in a variety of sea states (unlike [19, 23, 24, 25]). Unlike the techniques that estimate hydrodynamics, morphology or sea state in the nearshore, this work does not utilise information from foam produced by breaking waves. (In deep water, foam is present only in very high sea states.)

In our previous work [16], we used the linear Kalman filter and the least squares approximate solution in order to form the uncalibrated ocean video amplitude spectrum. We then used ocean theory in order to calibrate this spectrum into metres and estimate the significant wave height. In the present work, the goal is to solve the same problem with a novel methodology. We verify the sea state estimations with the same video data as before.

We start, in Sect. 2, by providing an introduction to key ocean theory used by the methodology, before describing the methodology itself (Sect. 3), and then demonstrating its efficacy on real videos of the sea, with in situ buoy measurements for validation (Sect. 4).

2 Ocean theory: the Pierson–Moskowitz spectrum

The Pierson–Moskowitz spectrum [21] is an empirical spectrum of the ocean formed from data acquired from accelerometers installed on weather ships. The spectral energy in terms of angular frequency \(\omega \) is expressed as:

$$\begin{aligned} S(\omega ) = \frac{\alpha {g^2}}{\omega ^5}\exp \left( {-\beta {\left( {\frac{\omega _0}{\omega }} \right) ^4}}\right) \end{aligned}$$
(1)

where \(\alpha =8.1\times 10^{-3}\), \(\beta =0.74\), g is the gravitational acceleration and \(\omega _0=g/U\), where U is the wind speed at 19.5 m above the ocean surface. The dominant angular frequency \(\omega _m\) is equal to:

$$\begin{aligned} \omega _m=0.87\frac{g}{U} \end{aligned}$$
(2)

The area under the spectrum is equal to the integral of the function:

$$\begin{aligned} \int _0^{\infty }{S(\omega ){\mathrm{d}}\omega } = \frac{\alpha {U^4}}{4\beta {g^2}} \end{aligned}$$
(3)

The significant wave height can be found to be equal to four times the square root of the area under the spectral density [3].
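As an illustration of Eqs. (1)–(3), the following minimal Python sketch evaluates the Pierson–Moskowitz spectrum on a frequency grid and compares the numerical peak and area with the closed-form expressions. The value of g, the wind speed and the frequency grid are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

ALPHA = 8.1e-3  # Pierson-Moskowitz constant alpha, Eq. (1)
BETA = 0.74     # Pierson-Moskowitz constant beta, Eq. (1)
G = 9.81        # gravitational acceleration in m/s^2 (assumed value)


def pm_spectrum(omega, wind_speed):
    """Spectral energy S(omega) of Eq. (1) for a wind speed U in m/s."""
    omega = np.asarray(omega, dtype=float)
    omega_0 = G / wind_speed
    return ALPHA * G**2 / omega**5 * np.exp(-BETA * (omega_0 / omega) ** 4)


def pm_area(wind_speed):
    """Closed-form area under the spectrum, Eq. (3)."""
    return ALPHA * wind_speed**4 / (4.0 * BETA * G**2)


wind_speed = 15.0  # hypothetical wind speed in m/s
omega = np.linspace(0.05, 20.0, 400_000)  # hypothetical grid in rad/s
spectrum = pm_spectrum(omega, wind_speed)

# Numerical peak location, to be compared with Eq. (2).
omega_peak = omega[np.argmax(spectrum)]

# Trapezoidal integration, to be compared with Eq. (3).
area_numeric = np.sum(0.5 * (spectrum[1:] + spectrum[:-1]) * np.diff(omega))

# Significant wave height: four times the square root of the area.
h_s = 4.0 * np.sqrt(pm_area(wind_speed))
```

The numerical peak lands slightly above \(0.87g/U\) because 0.87 is a rounded value of \((4\beta /5)^{1/4}\).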

3 Methodology

The aim of this work is to track the main oscillatory component from video time series of pixel intensities that is associated with the ocean’s movement. This enables the estimation of the ocean dominant frequency and the significant wave height. To achieve this, a methodology is introduced that combines the SSA algorithm and the extended Kalman filter. It also incorporates the ocean theory presented in Sect. 2.

3.1 Singular spectrum analysis (SSA) algorithm

Historically, the SSA algorithm is associated with work published in the 1980s, e.g. [10]. In the context of time series analysis, the SSA algorithm decomposes the input signal into a set of additive components, which are labelled as either trend, oscillatory or noise components.

In the context of this work, a time series of pixel intensities is given to the SSA algorithm. The first four elementary reconstructed components (RCs) are summed to provide a new time series, which is given to the extended Kalman filter in order to estimate the dominant frequency. The hypothesis is that the SSA algorithm will concentrate the information of the central component from the video, which is associated with the dominant wave of the ocean, in the first RCs.

Although it would be expected that the dominant frequency is isolated in the first RC, in practice this was not found to be the case. Empirically, selecting the first four RCs in all cases (see Sect. 4.2) was sufficient. This selection is also based on observations from the matrix of w-correlations (see Fig. 1). Specifically, in many cases it was observed that the first four RCs had a strong correlation.
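The SSA step described above can be sketched in Python as follows. This is a minimal illustration of basic SSA (embedding, singular value decomposition and diagonal averaging) together with the matrix of w-correlations; it is not the implementation used for the experiments. The function names, the synthetic input series and the window length of 60 are hypothetical.

```python
import numpy as np


def ssa_reconstructed_components(series, window_length):
    """Return the elementary reconstructed components (RCs), one per row."""
    series = np.asarray(series, dtype=float)
    n = series.size
    k = n - window_length + 1
    # Embedding: column i of the trajectory matrix is the i-th lagged window.
    trajectory = np.column_stack(
        [series[i:i + window_length] for i in range(k)]
    )
    u, s, vt = np.linalg.svd(trajectory, full_matrices=False)
    components = np.empty((s.size, n))
    for i in range(s.size):
        elementary = s[i] * np.outer(u[:, i], vt[i])
        # Diagonal averaging: average each anti-diagonal of the matrix.
        flipped = elementary[::-1]
        components[i] = [
            flipped.diagonal(j).mean() for j in range(-window_length + 1, k)
        ]
    return components


def w_correlation(components, window_length):
    """Absolute weighted correlations between all pairs of RCs."""
    n = components.shape[1]
    k = n - window_length + 1
    idx = np.arange(n)
    # Weight of sample i: number of times it appears in the trajectory matrix.
    weights = np.minimum(np.minimum(idx + 1, min(window_length, k)), n - idx)
    gram = (components * weights) @ components.T
    norms = np.sqrt(np.diag(gram))
    return np.abs(gram / np.outer(norms, norms))


# Synthetic stand-in for one pixel-intensity time series (hypothetical).
rng = np.random.default_rng(0)
t = np.arange(300)
series = (np.sin(2 * np.pi * t / 40) + 0.5 * np.sin(2 * np.pi * t / 13)
          + 0.3 * rng.standard_normal(t.size))

components = ssa_reconstructed_components(series, window_length=60)
wcorr = w_correlation(components, window_length=60)
smoothed = components[:4].sum(axis=0)  # sum of RCs 1-4
```

Summing all RCs recovers the input series exactly, so the choice of how many leading RCs to keep controls how much of the remaining video content is discarded.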

The next step involves the determination of the dominant frequency from the sum of RCs 1–4 of the SSA algorithm. Practically, the Fourier transform of this time series includes more than one peak. Selecting the highest peak does not in all cases correspond to the dominant frequency of the ocean (see Fig. 2). Since this selection of RCs usually includes more than one wave, determining one frequency is not a straightforward task. This is the reason the extended Kalman filter is used in the following step.
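The limitation of peak picking can be illustrated with a short Python sketch. The synthetic series below contains two sinusoids; the 0.10 Hz component stands in for the ocean wave, while a stronger 0.25 Hz component stands in for another element of the video. The frequencies, amplitudes, sampling rate and function name are hypothetical.

```python
import numpy as np


def highest_peak_frequency(series, sample_rate):
    """Frequency (Hz) of the largest peak in the amplitude spectrum."""
    series = np.asarray(series, dtype=float)
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.fft.rfftfreq(series.size, d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]


sample_rate = 20.0                    # frames per second (hypothetical)
t = np.arange(1200) / sample_rate     # 60 s of samples
wave_hz, other_hz = 0.10, 0.25        # hypothetical component frequencies
series = (1.0 * np.sin(2 * np.pi * wave_hz * t)
          + 1.2 * np.sin(2 * np.pi * other_hz * t))

# The highest peak belongs to the stronger component, not to the wave.
peak_hz = highest_peak_frequency(series, sample_rate)
```

In this constructed case the highest peak is at the frequency of the stronger component, so selecting it would miss the component of interest.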

Fig. 1

Example of matrix of w-correlations of SSA algorithm from ocean video. From this example, a correlation of certain PCs is visible, such as PCs 1–2, 3–4, 5–6, 7–8. Additionally, some higher PCs can be considered to contain more noise (PCs 11-max)

Fig. 2

Example of Fourier transform of the sum of RCs 1–4 of the SSA algorithm from shipborne video (solid blue line) showing multiple peaks that do not correspond to the ocean dominant frequency (here indicated with red dashed line from buoy measurements) (color figure online)

3.2 Extended Kalman filter algorithm

The extended Kalman filter algorithm with the environment definition described in the following text is very efficient at distinguishing one main frequency and isolating the remaining video elements as noise. As mentioned in the previous section, the goal is to identify the principal component of water movement from the sum of RCs 1–4, which is hypothesised to be related to the dominant wave of the ocean. The true signal is a sinusoid:

$$\begin{aligned} x = a\sin ({\omega {t}+\phi }) \end{aligned}$$
(4)

where a is the amplitude, \(\phi \) the phase, and t is the time. The derivative of the signal with respect to time is equal to:

$$\begin{aligned} {\dot{x}} = a\omega \cos ({\omega {t}+\phi }) \end{aligned}$$
(5)

and the second derivative:

$$\begin{aligned} \ddot{x} = -a\omega ^2\sin ({\omega {t}+\phi }) \end{aligned}$$
(6)

The second derivative can be expressed as a function of the angular frequency and the true signal as:

$$\begin{aligned} \ddot{x} = -\omega ^2{x} \end{aligned}$$
(7)

This is useful in the context of this work, because it does not include amplitude and phase, and instead focuses on the true signal. Additionally, the sinusoidal form of the signal is now included in the environment definition. The environment definition is:

$$\begin{aligned} {\mathbf {g}} = \left( \begin{array}{l} {\dot{x}}\\ \ddot{x}\\ {\dot{\omega }} \end{array}\right) = \left( \begin{array}{lll} 0 &{}\quad 1 &{}\quad 0\\ -\omega ^2 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 \end{array}\right) \left( \begin{array}{l} x\\ {\dot{x}}\\ \omega \end{array}\right) \end{aligned}$$
(8)

The Jacobian of \({\mathbf {g}}\) is computed at each time step with the current estimates of the states. Taking the partial derivatives of \({\mathbf {g}}\) with respect to the states gives the system’s dynamics matrix \({\mathbf {F}}\):

$$\begin{aligned} {\mathbf {F}} = \frac{\partial {{\mathbf {g}}}}{\partial {x}} = \begin{pmatrix} 0 &{}\quad 1 &{}\quad 0\\ -{\hat{\omega }}^2 &{}\quad 0 &{}\quad -2{\hat{\omega }}{\hat{x}}\\ 0 &{}\quad 0 &{}\quad 0 \end{pmatrix} \end{aligned}$$
(9)

where \({\hat{\omega }}\) is the predicted value or estimate of \(\omega \) and similarly \({\hat{x}}\) is the predicted value or estimate of x at the current time step. The fundamental matrix is not used for propagating the states, but only for the calculation of the gains, and is approximated with the first two terms of its Taylor series expansion. With the environment definition described here, the nonlinear Kalman filter can be solved as in [2].
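A minimal Python sketch of an extended Kalman filter built on Eqs. (8) and (9) is given below. It is an illustration under stated assumptions rather than the implementation used for the experiments: the Runge–Kutta integrator, the initial covariance, the process noise spectral density `process_psd`, the measurement variance and the synthetic input are all hypothetical choices. The process noise matrix is derived here from the two-term fundamental matrix with white noise acting on \({\dot{\omega }}\) only.

```python
import numpy as np


def dynamics(state):
    """Right-hand side of Eq. (8): time derivative of (x, x_dot, omega)."""
    x, x_dot, omega = state
    return np.array([x_dot, -omega**2 * x, 0.0])


def rk4_step(state, dt):
    """Integrate the state equations over one sampling interval."""
    k1 = dynamics(state)
    k2 = dynamics(state + 0.5 * dt * k1)
    k3 = dynamics(state + 0.5 * dt * k2)
    k4 = dynamics(state + dt * k3)
    return state + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def ekf_track_frequency(measurements, dt, omega_init, meas_var,
                        process_psd=1e-4):
    """Return the omega estimate and its standard deviation per time step."""
    state = np.array([measurements[0], 0.0, omega_init], dtype=float)
    cov = np.eye(3)                  # hypothetical initial covariance
    h = np.array([[1.0, 0.0, 0.0]])  # only x is measured
    omega_est = np.empty(len(measurements))
    omega_std = np.empty(len(measurements))
    for k, z in enumerate(measurements):
        if k > 0:
            x, _, omega = state
            c = -2.0 * omega * x
            # System dynamics matrix F of Eq. (9) at the current estimates.
            f = np.array([[0.0, 1.0, 0.0],
                          [-omega**2, 0.0, c],
                          [0.0, 0.0, 0.0]])
            phi = np.eye(3) + f * dt  # two-term Taylor series
            q = process_psd * np.array([
                [0.0, 0.0, 0.0],
                [0.0, c * c * dt**3 / 3.0, c * dt**2 / 2.0],
                [0.0, c * dt**2 / 2.0, dt],
            ])
            # Phi is used for the covariance (gains) only; the states are
            # propagated by integrating the differential equations.
            cov = phi @ cov @ phi.T + q
            state = rk4_step(state, dt)
        innovation_var = (h @ cov @ h.T).item() + meas_var
        gain = (cov @ h.T) / innovation_var
        state = state + gain[:, 0] * (z - state[0])
        cov = (np.eye(3) - gain @ h) @ cov
        cov = 0.5 * (cov + cov.T)     # keep the covariance symmetric
        omega_est[k] = state[2]
        omega_std[k] = np.sqrt(max(cov[2, 2], 0.0))
    return omega_est, omega_std


# Synthetic noisy sinusoid standing in for the sum of RCs 1-4 (hypothetical).
rng = np.random.default_rng(0)
dt = 0.05
t = np.arange(0.0, 60.0, dt)
true_omega = 0.7
signal = np.sin(true_omega * t) + 0.05 * rng.standard_normal(t.size)

omega_est, omega_std = ekf_track_frequency(
    signal, dt, omega_init=1.0, meas_var=0.05**2
)
# Only omega squared enters the model, so the sign of the estimate is ignored.
final_omega = float(np.mean(np.abs(omega_est[-100:])))
```

The third state gives the angular frequency estimate at each time step, and the square root of the third diagonal element of the covariance matrix gives the corresponding theoretical error.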

Fig. 3

Example of running the methodology with shipborne video. a Time series of pixel intensities from video (solid blue line) and the sum of reconstructed components (RCs) 1–4 of the SSA algorithm (dashed red line) b The extended Kalman filter (first state estimate in dashed red line) attempts to establish one main frequency from the sum of RCs 1–4 (solid blue line). The extended Kalman filter gives in the third state the estimate of the unknown angular frequency. c Angular frequency estimation (third state) of the extended Kalman filter in regard to time compared to in situ buoy measurements. d Theoretical errors in the angular dominant frequency estimation in regard to time (color figure online)

The derivative of the angular frequency is equal to zero. The true state of the angular frequency is constant because the sea state is not expected to change over the duration of the video. Although the unknown true state of the angular frequency is constant, the algorithm’s estimate of this value varies at each iteration, as can be seen in Fig. 3c. Including the angular frequency in our environment definition is important because its estimate is used for inferring the value of the significant wave height in the next step of the methodology.

The Kalman filter outputs a value of the unknown angular frequency. This angular frequency can be used directly as the ocean dominant angular frequency. In the following section, this value is given as input to the theory of the Pierson–Moskowitz spectrum in order to get a value of the significant wave height. As a side note, the described methodology is performed on one pixel time series. For acquiring more accurate and reliable dominant frequency estimations, a set of pixels (or all pixels) can be used individually and the dominant frequency is found as the average. In the case of the experimental results of this work (see Sect. 4), a set of pixels equal to the image width is used.

Figure 3 demonstrates the functionality of the methodology. The time series of pixel intensities is used as input to the SSA algorithm, and the principal component of the movement of water in the video is hypothesised to be included in the sum of RCs 1–4 (Fig. 3a). This main component is isolated from all other video components with the extended Kalman filter (Fig. 3b). The filter provides an estimate of what we hypothesise to be the ocean dominant angular frequency in each time step (Fig. 3c) and the limits of certainty for that prediction from the square root of the corresponding diagonal element of the covariance matrix (Fig. 3d).

3.3 Significant wave height

The significant wave height (\(h_s\)) can be found as equal to four times the square root of the area under the ocean spectral density [3]. The dominant frequency from the previous steps can be given as input to the Pierson–Moskowitz spectrum equation (see Sect. 2). The shape of this empirical spectrum can then be used in order to approximate \(h_s\). From Eqs. (2) and (3), \(h_s\) can be expressed in terms of dominant angular frequency \(\omega _m\) (found in the previous steps) as:

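$$\begin{aligned} h_s=2g\left( \frac{0.87}{\omega _m}\right) ^2\sqrt{\frac{\alpha }{\beta }} \end{aligned}$$
(10)

where \(\alpha =8.1\times 10^{-3}\), \(\beta =0.74\) and g is the gravitational acceleration. The factor g follows from substituting \(U=0.87g/\omega _m\) from Eq. (2) into Eq. (3) and taking four times the square root of the result. The values of dominant frequency and significant wave height are the outputs of the methodology.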
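The conversion from dominant angular frequency to significant wave height can be sketched in Python as follows. The sketch carries out the substitution step by step: the wind speed is recovered from Eq. (2), inserted into the area of Eq. (3), and \(h_s\) is taken as four times the square root of that area. Carrying the substitution through leaves one factor of g in the closed form, which gives \(h_s\) in metres. The value of g, the example frequency and the function names are assumptions.

```python
import math

ALPHA = 8.1e-3  # Pierson-Moskowitz constant alpha
BETA = 0.74     # Pierson-Moskowitz constant beta
G = 9.81        # gravitational acceleration in m/s^2 (assumed value)


def significant_wave_height(omega_m):
    """h_s in metres from the dominant angular frequency in rad/s."""
    wind_speed = 0.87 * G / omega_m                      # Eq. (2) inverted
    area = ALPHA * wind_speed**4 / (4.0 * BETA * G**2)   # Eq. (3)
    return 4.0 * math.sqrt(area)


def significant_wave_height_closed_form(omega_m):
    """Same quantity after simplifying the substitution algebraically."""
    return 2.0 * G * (0.87 / omega_m) ** 2 * math.sqrt(ALPHA / BETA)


omega_m = 0.7  # hypothetical dominant angular frequency in rad/s
h_s = significant_wave_height(omega_m)
```

For this hypothetical input, corresponding to a wave period of about 9 s, the result is close to 3.2 m.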

Fig. 4

Shipborne and tower video data. a Ship video with horizon stabilisation b Ship video left tracking point c Ship video after preprocessing d Tower video e Buoy station 41013 f Tower video after preprocessing. Lines in (c) and (f) denote the selection of pixels utilised for the estimations

4 Experimental results

Two sets of video data that have corresponding in situ measurements are used in order to test the accuracy of the proposed technique. Both sets comprise videos with a duration of approximately one minute. The first set is taken from a shipborne camera in experiments carried out on 24 November 2014 in the North Atlantic. Two buoys measured the ocean at the same time in a nearby location. The sea state in this set of videos is approximately the same, as the state is not expected to change to a large degree in the time span of a few hours. The significant wave height for the shipborne videos is approximately 3.1–3.4 m.

The second set of video data is taken from a camera on the Frying Pan Shoals tower, a former lighthouse located approximately 39 miles southeast of Southport, North Carolina. Live 24-h video footage of the ocean is available online [13]. Although the camera pans to show a panoramic view, at some time instances it is stable at fixed positions, enabling us to capture the ocean surface. A nearby buoy, station 41013 [17], owned and maintained by the National Data Buoy Center, provides sea state measurements.

Fig. 5

Shipborne video showing stability of video estimations of a similar sea state and correlation of the significant wave height (\(h_s\)) estimation between video and buoys. Buoy 1: \(h_s=3.18\) m (9:15–10:00 am), \(h_s=3.18\) m (10:00–11:00 am). Buoy 2: \(h_s=3.15\) m (9:15–10:00 am), \(h_s=3.41\) m (10:00–11:00 am). Videos before 9:15 am do not have corresponding buoy measurements. They are included in this figure because the sea state is not expected to change to a large degree in the time span of 1 h. They are useful in demonstrating that the methodology provides consistent estimations for approximately the same sea state

Fig. 6

Tower video results showing correlation of the significant wave height (\(h_s\)) estimation between video and buoy across a variety of sea states. Buoy \(h_s\): \(\min =0.5\) m, \(\max =3.6\) m. The dashed lines denote the significant wave height range of the sea states according to the Beaufort scale

4.1 Preprocessing

The preprocessing step involves stabilisation. For the shipborne video, the rotational movement of the ship (pitch, roll, yaw) is stabilised by stabilising the horizon. This is achieved with the rotational tracker of Adobe After Effects [1]. Two rectangles are drawn above the video in order to stop the tracking points from moving horizontally while tracking the horizon. Figure 4a shows a typical frame from ship video and the rectangles drawn in order to track the horizon, and Fig. 4b presents how each tracking point is selected.

For the tower video, the video is stabilised only in cases of high local wind with the stabilisation features of Adobe After Effects. A single set of pixel locations is used for computational efficiency as in Fig. 4c and f. Figure 4c presents the result from ship video after preprocessing. Figure 4d shows a typical frame from tower video and Fig. 4f the result after preprocessing.

4.2 Main methodology

The first set of data, from the ship, examines the behaviour of the methodology for an approximately statistically stationary sea state. The second set of data, from the tower, examines the behaviour for a variety of sea states, as the videos were captured on different days. The SSA algorithm is run in all cases with a window length of 350, which is determined empirically.

The matrix of w-correlations provides a good indication of whether the window is too small or too large. Specifically, with a smaller window size the association between RCs in the main diagonal is weak (the model is too general). With a larger window size, high values are concentrated in positions further away from the main diagonal, which can be interpreted as the algorithm overfitting. From empirical observations, the estimations are relatively insensitive to the window length. That is, even if different window lengths are used (for example 250 or 450), the impact on the estimations is minimal (see Sect. 4.3).

The shipborne results are presented in Fig. 5. Each point represents a one-minute video captured on the same day. The buoys were deployed on the sea surface after 9:15 a.m.; any videos before then are presented here only to show the behaviour of the methodology. As mentioned, the sea state is not expected to change to a large degree in the span of 1 h. The error metrics between the video estimations and buoy measurements are presented in Table 1.

Table 1 Error metrics with shipborne video (includes only videos with concurrent buoy sea state)

From Fig. 5, it is observed that the video estimations vary but remain close to what the buoys indicate to be the true sea state. From Table 1, the RMSE of 0.19 m and the MAPE of 4.83% indicate that the video estimations are close to the buoy measurements. Up to this point, the results support the hypothesis of the present work that the methodology estimates the significant wave height to a satisfactory degree of accuracy.

The tower videos are used for examining the behaviour of the method across a variety of sea states. The tower video results are presented in Fig. 6. Each point represents a video captured on a specific date. The technique estimates lower dominant frequencies (higher \(h_s\)) for higher sea states and higher dominant frequencies (lower \(h_s\)) for lower sea states, as expected.

The error metrics from tower video are presented in Table 2. Although the error metrics of 0.23 m RMSE and 15.29% MAPE are higher than those observed from shipborne video, they remain at acceptable levels and show that the estimations are meaningful. It should be mentioned that the MAPE metric gives a higher error when the true values are lower, because the same absolute error is divided by a smaller value. This is one possible reason for the higher value of MAPE with the tower video, as the tower video is captured in both lower and higher sea states. The experimental results show the methodology’s potential for estimating the significant wave height.
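The two error metrics, and the sensitivity of MAPE to low sea states, can be illustrated with the short Python sketch below. The wave heights are hypothetical values chosen so that both pairs have the same absolute error of 0.2 m; they are not the measurements of Tables 1 and 2.

```python
import numpy as np


def rmse(true_values, estimates):
    """Root mean square error, in the units of the inputs."""
    true_values = np.asarray(true_values, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_values) ** 2)))


def mape(true_values, estimates):
    """Mean absolute percentage error, in percent."""
    true_values = np.asarray(true_values, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    return float(100.0 * np.mean(np.abs((estimates - true_values) / true_values)))


# Hypothetical low and high sea states with the same 0.2 m absolute error.
low_true, low_est = [0.5], [0.7]
high_true, high_est = [3.0], [3.2]

rmse_low, rmse_high = rmse(low_true, low_est), rmse(high_true, high_est)
mape_low, mape_high = mape(low_true, low_est), mape(high_true, high_est)
```

Both pairs give the same RMSE, but the percentage error of the low sea state is several times larger than that of the high sea state.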

Table 2 Error metrics with tower video

4.3 Sensitivity analysis

From the matrix of w-correlations (see Fig. 1), it is observed that in many cases the first two or the first four elementary reconstructed components (RCs) have a strong correlation. It is also observed that components after RC 10 contain more noise. Additionally, there is a strong correlation between different components, such as RCs 3–4 and RCs 5–6.

Fig. 7

Example of sensitivity analysis with a varying selection of RCs of the SSA algorithm with tower videos as input. The root mean square error (RMSE) for the estimation of the significant wave height

From empirical observations with multiple videos, the selection of only the first two RCs does not provide accurate estimations. The selection of RCs 1–4, 1–6 and 1–8 provides more accurate estimations. From RCs 1–10 and higher, the estimations become less accurate. A possible reason for this effect is the inclusion of more noise as the component number increases. For example, see Fig. 7.

In terms of the window length parameter of the SSA algorithm, values have been tested ranging from 50 to 1000 (the number of frames is approximately 1000 for the shipborne video and 2000 for the tower video). For values between 100 and 850, the sea state estimation does not vary to a large degree. For example, see Fig. 8.

The speculation for the less accurate results with a small window size is that the model is too general; that is, the association between the different elementary reconstructed components (RCs) is not clearly defined. Similarly, the speculation for the less accurate results with a very high window size is that the model is too specific; that is, the association between the different RCs is overspecified.

A two-term Taylor series is used for approximating the fundamental matrix, which is only used for the calculation of the Kalman gains. The sea state estimation is approximately the same with the use of higher-order Taylor series. Up to five-term Taylor series have been included for the sensitivity analysis, with no significant improvements in the estimation of the sea state. Including more terms could potentially be beneficial if the approximate fundamental matrix was used for propagating the states. However, in this work the propagation of the states is achieved with integration of the differential equations over the sampling interval.
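The effect of the number of Taylor series terms can be illustrated with the following Python sketch, which builds the approximate fundamental matrix from the system dynamics matrix of Eq. (9) for two and for five terms. The state estimates, the sampling interval and the function name are hypothetical.

```python
import numpy as np


def fundamental_matrix(f, dt, terms=2):
    """Taylor series approximation: sum of (F*dt)^k / k! for k < terms."""
    phi = np.zeros_like(f, dtype=float)
    term = np.eye(f.shape[0])
    for k in range(terms):
        phi = phi + term
        term = term @ (f * dt) / (k + 1)
    return phi


# Hypothetical current estimates and sampling interval.
omega_hat, x_hat, dt = 0.7, 1.0, 0.05

# System dynamics matrix F of Eq. (9).
f = np.array([[0.0, 1.0, 0.0],
              [-omega_hat**2, 0.0, -2.0 * omega_hat * x_hat],
              [0.0, 0.0, 0.0]])

phi_two = fundamental_matrix(f, dt, terms=2)   # I + F*dt
phi_five = fundamental_matrix(f, dt, terms=5)
max_difference = float(np.max(np.abs(phi_five - phi_two)))
```

For a sampling interval of this size, the higher-order terms change the entries of the matrix only slightly, which is consistent with the small effect on the gains reported above.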

Fig. 8

Example of sensitivity analysis with a varying window size of the SSA algorithm with tower videos as input. The root mean square error (RMSE) for the estimation of the significant wave height

5 Conclusion

The range of significant wave height for which the technique is tested is between 0.5 and 3.6 m. Higher sea states are not examined, and thus, the applicability in such instances is not known. Testing the technique in stable and varying sea states shows the potential of the methodology for estimating the sea state from video.

Future work will include further testing for higher sea states. In the context of practical utilisation of the present work, in some cases estimation of higher sea states might not be required. For example, in the case of use of the work for the execution of maritime operations, very high sea states can already be identified.