1 Introduction

Short-term wave prediction at a location is essential in the design and implementation of maritime operations. Reliable prediction allows improvements in safety for conducting operation-related activities, helps offshore structures to avoid the dangers of harsh conditions (Jain and Deo, 2007), and improves the efficiency of wave energy converters (Li et al., 2012). The operational wave forecast has been widely explored for its application value in engineering over recent decades. A large number of wave forecast models have been developed. According to the theoretical differences among the various methods, wave forecast models may be classified into four types of approaches: energy balance equation (EBE)-based models, classical time series models, intelligent-technique-based nonlinear models, and hybrid models.

Conventional numerical models for wave forecasts are based on EBEs. These models are usually used to forecast waves over a large spatial and temporal domain (The WAMDI Group, 1988; Komen et al., 1994; Janssen, 2008; Sandhya et al., 2014; Tolman, 2014). However, their predictions depend on how precisely the physical phenomena are expressed in the formulations (Mandal and Prabaharan, 2010). In addition, their implementation remains difficult because of the high computational cost and the unavailability of forcing functions (Londhe and Panchang, 2006).

Classical time series models provide another possible way for achieving wave forecasting. Their simple model methodology without wind-to-wave conversion does not require exogenous data and massive computing memory and time (Jain and Deo, 2007). However, classical time series models are not suitable for nonlinear and non-stationary wave prediction because of their linear and stationary assumptions. Therefore, several improved models, such as the bilinear model (Hannan, 1982), the threshold autoregressive model (Tong and Lim, 1980), and the autoregressive conditional heteroskedasticity model (Engle, 1982), have been developed. These nonlinear models are limited by the hypothesized explicit relationships for the available data series (Zhang et al., 1998).

To resolve the nonlinearity of ocean waves, intelligent-technique-based nonlinear models such as artificial neural networks (ANNs) and genetic programming (GP) models have been extensively studied. Real-life applications of these soft computing techniques can be found in different fields (Chau, 2007; Wu and Chau, 2013; Taormina and Chau, 2015). In these studies, it was found that the ANN model performed better for short-interval prediction and produced similar results to classical time series models for long-interval forecasts. GP provides another possible intelligent solution for nonlinear ocean wave prediction. Real-time wave forecasting at two locations in the Gulf of Mexico (Gaur and Deo, 2008) suggested that GP performed better than ANN for longer-interval forecasts. Although the intelligent-technique-based nonlinear models may perform well in handling nonlinearity, they may not be capable of modeling non-stationary data without preprocessing (Cannas et al., 2006; Deka and Prahlada, 2012), especially for long-interval forecasts. In addition, a substantial sample size is required for training ANN models, which may imply a high computational cost. For example, Deo et al. (2001), Agrawal and Deo (2002), Mandal and Prabaharan (2010), and Kamranzad et al. (2011) used 80% of the data to train ANN models.

Modeling a nonlinear and non-stationary data set by applying a single nonlinear model is very difficult because there are too many possible patterns hidden in the data. A single model may not be general enough to capture all the important features. Hybrid models that combine preprocessing techniques with single models provide more effective modeling. Time series of wave height are frequently decomposed into several simple components, and each component is then modeled using a single prediction model.

Conventionally, Fourier transform and wavelet analysis are the approaches adopted most often. As Fourier transform is a linear and stationary method, it is unsuitable for nonlinear and non-stationary time series. Wavelet-based models, such as wavelet fuzzy logic model (Özger, 2010) and wavelet neural network (WLNN) (Deka and Prahlada, 2012), have been used for wave forecasting in which the wavelet technique is effective in handling non-stationarity. However, wavelet-based hybrid models have deficiencies in nonlinear and non-stationary wave forecasting. Essentially, wavelet transform is a linear and non-stationary technique. It represents a signal by a linear combination of wavelet basis functions. Its decomposition results for nonlinear data can be misleading (Huang and Wu, 2008; Kim et al., 2012). Furthermore, wavelet analysis suffers from its non-adaptive nature as it applies the same type of basis functions to the entire range of data. A set of basis functions that reflects the time-varying property of a signal is required.

A data-driven technique known as empirical mode decomposition (EMD) was proposed by Huang et al. (1998). This approach is powerful and adaptive in analyzing nonlinear and non-stationary data sets. It provides an effective approach for decomposing a signal into a collection of so-called intrinsic mode functions (IMFs), which can be treated as empirical basis functions. The EMD technique acts essentially as a dyadic filter (Flandrin et al., 2004) that separates a complex signal with a wide frequency band into relatively simple components with various time scales.

The EMD technique has been widely applied to improve the performance of prediction models (Duan et al., 2015; Wang et al., 2015). In this study, a hybrid EMD-AR model was developed for the nonlinear and non-stationary wave forecasting. Compared with the intelligent-technique-based nonlinear models, the AR model is promising in practical engineering applications as it is convenient for real-time model identification, is highly adaptive, and requires a low computational cost (Zhang and Chu, 2005). However, it suffers from the limitations of nonlinearity and non-stationarity. This situation is overcome by hybridizing the EMD technique with the AR model. Implementation of the EMD-AR model for wave forecast consists of three steps. In the first step, the measured wave time series is decomposed into several stationary components called IMFs. In the second step, AR is used to forecast each component. In the final step, the prediction results of all components are aggregated to obtain the expected wave forecasts.

Derived from the EMD technique and AR model, the EMD-AR model is data-driven, highly adaptive, and suitable for nonlinear and non-stationary time series. An investigation of the EMD-AR model for nonlinear and non-stationary wave forecasting was conducted by using wave data sets from three buoys. The buoys were located at various sites and maintained by the National Data Buoy Center (NDBC), USA. For comparison, the AR model was also studied using the same data sets. The results indicate the superiority of the EMD-AR model and the effectiveness of the EMD technique in extending the scope of the AR model in wave forecasting.

In this paper, the theoretical formulations and numerical schemes of the AR and EMD-AR models are presented first. Brief descriptions of the wave data and accuracy measures are then given. Finally, numerical results using various significant wave height data sets are presented.

2 Theoretical formulations

2.1 AR prediction model

The AR model considers the relations among the values of a time sequence; therefore, the present value can be represented as a linear combination of previous values. For a given time series {x(t), t=1, 2, …, n}, the model is formulated as

$$\begin{array}{*{20}c}{x\left(t \right) = {\varphi _1}x\left({t - 1} \right) + {\varphi _2}x\left({t - 2} \right) + \ldots + {\varphi _p}x\left({t - p} \right) + a\left(t \right),} \\ {t = 1, 2,\; \ldots, n,} \\ \end{array}$$
(1)

where p is the model order and \(\{\varphi_1, \varphi_2, \ldots, \varphi_p\}\) are the unknown parameters of the AR model. The variable {a(t), t=1, 2, …, n} is zero-mean white noise. Identification of the AR model shown in Eq. (1) involves the selection of the model order p and the corresponding parameters \(\{\varphi_1, \varphi_2, \ldots, \varphi_p\}\).

A variety of algorithms have been developed for estimating the model parameters, of which the least mean squares (LMS), recursive least squares (RLS), and Levinson-Durbin (L-D) algorithms are the most widely used. However, LMS algorithms suffer from low convergence speed and eigenvalue spread problems. The sliding-window RLS algorithm is complicated to implement, memory intensive, and potentially numerically unstable (Douglas, 1996). Additionally, the determination of the forgetting factor is not always adaptive, leading to non-negligible fluctuations in prediction accuracy. Therefore, the L-D algorithm was adopted to estimate the model parameters in this study.

For a given time sequence \(\{x_1, x_2, \ldots, x_{n-1}, x_n\}\), the L-D algorithm for an AR model of order p consists of the following steps: (1) compute the autocorrelation matrix R with a size of (p+1)×(p+1) using Eq. (2); (2) set the initial conditions using Eqs. (3) and (4); (3) compute the coefficients of order k using the coefficients of model order k−1 based on Eqs. (5)–(8) until k equals the preset order p.

$${r_k} = {1 \over {n - k}}\sum\limits_{i = 1}^{n - k} {{x_i}{x_{i + k}}}, \quad k = 0,\;1,\; \cdots, \;p,$$
(2)

where \(r_k\) denotes the autocorrelation function of the sequence \(\{x_1, x_2, \ldots, x_{n-1}, x_n\}\) for a lag k.

$${\varphi _{1,1}} = {\rho _1},$$
(3)
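$$\sigma _1^2 = {r_0}(1 - \rho _1^2),$$
(4)

where \(\varphi_{1,1}\) is the first-order model parameter, \(\sigma_1^2\) is the corresponding variance, and \(\rho_1\) is the first-order reflection coefficient given by Eq. (5); the recursion for higher orders follows in Eqs. (6) and (7):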

$${\rho _1} = {r_1}/{r_0},$$
(5)
$${\varphi _{i,k}} = \left\{ {\begin{array}{*{20}c}{{\rho _k},\quad \quad \quad \quad \quad } & {i = k,\quad \quad \quad \quad } \\ {{\varphi _{i,k - 1}} - {\rho _k}{\varphi _{k - i,k - 1}},} & {i = 1,\;2,\; \cdots, \;k - 1,} \\ \end{array} } \right.$$
(6)
$$\sigma _k^2 = \sigma _{k - 1}^2(1 - \rho _k^2),$$
(7)

where \(\varphi_{i,k}\) is the ith parameter of the kth-order model and \(\sigma_k^2\) is the corresponding variance. The kth-order reflection coefficient \(\rho_k\) is formulated as

$${\rho _k} = {{{r_k} - \sum\limits_{i = 1}^{k - 1} {{\varphi _{i,k - 1}}{r_{k - i}}} } \over {\sigma _{k - 1}^2}}{.}$$
(8)
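The recursion in Eqs. (2)–(8) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this study: the function names `autocorr` and `levinson_durbin` are hypothetical, the sequence is assumed to be zero-mean, the sum for lag k runs over the n−k available products, and p is assumed to be at least 1.

```python
def autocorr(x, p):
    """Eq. (2): autocorrelation r_0..r_p of a zero-mean sequence x."""
    n = len(x)
    # For lag k only n - k products x_i * x_{i+k} exist, so average over those.
    return [sum(x[i] * x[i + k] for i in range(n - k)) / (n - k)
            for k in range(p + 1)]


def levinson_durbin(r, p):
    """Return (phi, var) with phi = [phi_1, ..., phi_p] and var = sigma_p^2."""
    rho = r[1] / r[0]                  # Eq. (5)
    phi = [rho]                        # Eq. (3)
    var = r[0] * (1.0 - rho ** 2)      # Eq. (4)
    for k in range(2, p + 1):
        # Eq. (8): reflection coefficient from the order k-1 solution.
        acc = sum(phi[i - 1] * r[k - i] for i in range(1, k))
        rho = (r[k] - acc) / var
        # Eq. (6): update phi_1..phi_{k-1}, then append phi_{k,k} = rho_k.
        phi = [phi[i - 1] - rho * phi[k - i - 1] for i in range(1, k)] + [rho]
        var *= 1.0 - rho ** 2          # Eq. (7)
    return phi, var
```

As a quick check, the theoretical autocorrelations of an AR(1) process with parameter 0.5, r = [1, 0.5, 0.25], give a second-order reflection coefficient of zero, so `levinson_durbin(r, 2)` recovers the parameters [0.5, 0] with a residual variance of 0.75.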

Another problem in AR modeling is the selection of an optimal order. In recent decades, numerous criteria have been proposed to determine the AR order of a specified time series. Although it has been a long time since they were first proposed, the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Akaike, 1979) are still the most popular approaches. These criteria have been widely used in various engineering disciplines and especially in economic studies. Let the residual variance, which measures the fit of AR(p) to the data, be denoted by \(\hat \sigma _{\rm{a}}^2(p)\). It can be formulated as

$$\hat \sigma _{\rm{a}}^2(p) = {1 \over {N - p}}\sum\limits_{t = p + 1}^N {{{\left({{x_t} - \sum\limits_{i = 1}^p {{\varphi _i}{x_{t - i}}} } \right)}^2}} {.}$$
(9)

With this definition of the residual variance, the BIC order selection criterion is given by

$${\rm{BIC}}(p) = {{\lg \hat \sigma _{\rm{a}}^2(p) + (p + 1)\lg N} \over N}{.}$$
(10)
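A short sketch of how Eqs. (9) and (10) could be evaluated for a set of candidate orders is given below. It is illustrative only: the names `residual_variance`, `bic`, and `select_order` are hypothetical, lg is taken to be the base-10 logarithm, Eq. (10) is coded exactly as printed, and the candidate coefficient sets are assumed to have been estimated beforehand (for example, by the L-D algorithm).

```python
import math


def residual_variance(x, phi):
    """Eq. (9): mean squared one-step residual of an AR(p) fit with p = len(phi)."""
    p, n = len(phi), len(x)
    total = 0.0
    for t in range(p, n):
        # phi[i] multiplies the value i + 1 steps before time t.
        pred = sum(phi[i] * x[t - 1 - i] for i in range(p))
        total += (x[t] - pred) ** 2
    return total / (n - p)


def bic(x, phi):
    """Eq. (10) as printed, with lg interpreted as log10 (an assumption)."""
    p, n = len(phi), len(x)
    return (math.log10(residual_variance(x, phi)) + (p + 1) * math.log10(n)) / n


def select_order(x, candidates):
    """candidates maps each order p to its coefficient list; return the order with minimum BIC."""
    return min(candidates, key=lambda p: bic(x, candidates[p]))
```

Note that a fit with exactly zero residual variance would make the logarithm undefined, so the sketch assumes noisy data.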

In this study, the BIC principle is applied in order selection. The model order \(p_0\) leading to the minimum BIC value is chosen as the optimal order. Once the prediction model as presented in Eq. (1) is determined, a k-step-ahead adaptive predictor can be presented as

$$\hat x(t + k) = \left\{ {\begin{array}{*{20}l}{\sum\limits_{i = 1}^p {{\varphi _i}x(t + k - i)},} & {k = 1,} \\ {\sum\limits_{i = 1}^{k - 1} {{\varphi _i}\hat x(t + k - i)} + \sum\limits_{i = k}^p {{\varphi _i}x(t + k - i)},} & {k = 2,\;3,\; \cdots, \;p,} \\ {\sum\limits_{i = 1}^p {{\varphi _i}\hat x(t + k - i)},} & {k > p,} \\ \end{array} } \right.$$
(11)

where \(\hat x(t + k)\) is the k-step-ahead prediction made at time t.
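The three cases of Eq. (11) reduce to a single recursion in which each forecast is appended to the history and reused once the measured values run out. A minimal Python sketch follows; the function name `ar_forecast` is hypothetical, and the history is assumed to contain at least p samples.

```python
def ar_forecast(x, phi, k):
    """Return [x_hat(t+1), ..., x_hat(t+k)] for a history x ending at time t."""
    p = len(phi)
    buf = list(x[-p:])  # the p most recent values, oldest first
    out = []
    for _ in range(k):
        # phi[i] multiplies the value i + 1 steps back, measured or forecast.
        nxt = sum(phi[i] * buf[-1 - i] for i in range(p))
        out.append(nxt)
        buf.append(nxt)  # forecasts stand in for values not yet measured
    return out
```

For example, with the single parameter 0.5 and a last measured value of 4.0, the 1-, 2-, and 3-step-ahead forecasts are 2.0, 1.0, and 0.5.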

2.2 Hybridization process of the EMD-AR model

Decomposition is a critical part of signal processing. Complex signals are frequently decomposed into several simple components, and the information in each component is then analyzed to reduce the complexity and enhance interpretability. EMD was proposed by Huang et al. (1998), and it is powerful and adaptive in analyzing nonlinear and non-stationary data sets. It provides an effective approach to decompose a signal into a collection of so-called IMFs, which can be treated as empirical basis functions driven by data. An IMF resulting from the EMD procedure should satisfy two conditions: (I) the number of extrema and the number of zero-crossings should either be equal or differ by at most one, and (II) the local average should be zero, i.e., the mean of the upper envelope defined by the local maxima and the lower envelope defined by the local minima should be zero. The first condition is similar to the traditional narrow band requirements for a stationary Gaussian process (Huang et al., 1998). Therefore, the IMF produced through the EMD procedure is stationary.

For a given sequence x(t), the implementation of EMD comprises the following steps: (1) identify the local extrema; (2) generate the upper envelope u(t) and the lower envelope l(t) via spline interpolation among all the local maxima and the local minima, respectively, and then obtain the mean envelope: m(t)=[l(t)+u(t)]/2; (3) subtract m(t) from the signal x(t) to obtain the IMF candidate, that is h(t)=x(t)−m(t); (4) verify whether h(t) satisfies the conditions for IMFs; if not, treat h(t) as the new signal and repeat steps (1)–(3) until h(t) is an IMF; (5) take the resulting h(t) as the nth IMF component \({\rm{imf}}_n(t)=h(t)\) (obtained after a number of sifting iterations) and compute the corresponding residue r(t)=x(t)−h(t); (6) repeat the whole algorithm with r(t) obtained in step (5) as the new signal until the residue is a monotonic function.

By implementing the presented algorithm, the signal can be decomposed according to the following Eq. (12). As an example, Fig. 1 displays decomposition results of the significant wave height data shown in Fig. 4a, where it can be clearly seen that the complex wave height time series can be represented by several simple components.

$$x(t) = \sum\limits_{i = 1}^n {{\rm{im}}{{\rm{f}}_i}(t)} + r(t){.}$$
(12)
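The sifting steps and the reconstruction in Eq. (12) can be illustrated with the simplified Python sketch below. It is not the scheme used in this study and departs from it in three labeled ways: the envelopes are piecewise linear rather than spline interpolated, they are held flat beyond the first and last extremum instead of being extended by a prediction model, and a fixed number of sifting iterations (`n_sifts`) replaces the IMF condition check. All function names are hypothetical.

```python
def interp(idx, vals, n):
    """Piecewise-linear curve through (idx, vals), held flat beyond both ends."""
    out, j = [], 0
    for t in range(n):
        if t <= idx[0]:
            out.append(vals[0])
        elif t >= idx[-1]:
            out.append(vals[-1])
        else:
            while idx[j + 1] < t:
                j += 1
            w = (t - idx[j]) / (idx[j + 1] - idx[j])
            out.append(vals[j] + w * (vals[j + 1] - vals[j]))
    return out


def sift_once(x):
    """One pass of steps (1)-(3); returns None when x has no interior maxima or minima."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] >= x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] <= x[i + 1]]
    if not maxima or not minima:
        return None
    upper = interp(maxima, [x[i] for i in maxima], len(x))
    lower = interp(minima, [x[i] for i in minima], len(x))
    return [xi - (u + l) / 2.0 for xi, u, l in zip(x, upper, lower)]


def emd(x, n_sifts=10, max_imfs=8):
    """Return (imfs, residue) such that the IMFs plus the residue rebuild x (Eq. 12)."""
    imfs, residue = [], list(x)
    while len(imfs) < max_imfs and sift_once(residue) is not None:
        h = residue
        for _ in range(n_sifts):  # fixed sift count: a simplifying assumption
            nxt = sift_once(h)
            if nxt is None:
                break
            h = nxt
        imfs.append(h)
        residue = [r - hi for r, hi in zip(residue, h)]
    return imfs, residue
```

For the toy sequence 6, 4, 6, 4, 6, 4, 6, the sketch returns a single IMF alternating between 1 and −1 and a constant residue of 5. A practical implementation would use cubic-spline envelopes and an explicit treatment of the boundary effects.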
Fig. 1 Decomposition results of significant wave height time series data set I using the EMD technique

Figs. 1a–1h display the simple components with different amplitude and frequency modulations. The data were measured by buoy 42085, which was maintained by the NDBC. Details about the data are provided in Table 1.

When implementing the EMD technique in a time series prediction problem, boundary effects should be taken into account. Researchers have proposed several techniques for processing boundary effects, such as the characteristic wave extending method (Huang et al., 1998), the ratio extension method (Wu and Riemenschneider, 2010), and the mirror image extending method (Zhao and Huang, 2001). Among the various approaches, the symmetric extending method is the most popular. However, extended results from the symmetric extension method are far from satisfactory. Distinct differences always exist between the extended extrema and the real ones. The influence of end effects on the performance of EMD-based models has been examined by Xiong et al. (2014) and Huang et al. (2015). They found that using prediction models for end effect processing leads to more reasonable extended results. In this study, the AR prediction model presented in Section 2.1 was used in the processing of boundary effects.

Time series of ocean waves are complicated nonlinear and non-stationary signals that consist of different oscillation scales. The multiple oscillation scales cause difficulties for AR models when conducting wave forecasts. The combination of an EMD model with an AR model provides an effective way to improve wave prediction. The procedure for carrying out wave forecasts using the hybrid EMD-AR model comprises three steps (Fig. 2). In the first step, the wave height time series is decomposed into several simple and meaningful IMFs and a residual by EMD. In the second step, prediction of the decomposed components is performed individually using the AR model. In the final step, the component predictions are aggregated to attain the final predictions.

Fig. 2 Implementation of significant wave height forecasting using the EMD-AR model
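The aggregation in the final step of the hybrid procedure amounts to a step-by-step sum of the component forecasts. The sketch below illustrates only that structure. The names `hybrid_forecast` and `persistence` are hypothetical, the components are assumed to be the IMFs and the residual from the EMD step, and the persistence forecaster is a placeholder standing in for the AR predictor of Section 2.1.

```python
def persistence(component, k):
    """Placeholder forecaster: repeat the last value k times (not an AR model)."""
    return [component[-1]] * k


def hybrid_forecast(components, forecaster, k):
    """Forecast every component k steps ahead and sum the forecasts step by step."""
    per_component = [forecaster(c, k) for c in components]
    return [sum(step) for step in zip(*per_component)]
```

In use, `forecaster` would be replaced by a function that fits an AR model to each component and returns its k-step-ahead predictions, so that each oscillation scale gets its own model order and parameters.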

3 Brief descriptions of the wave data

Ocean wave data from three buoys maintained by the NDBC were used in the forecasting study. The geographical locations where the significant wave height time series are measured and brief non-stationarity analysis of wave data are described in Sections 3.1 and 3.2. Statistical error measures for evaluating prediction performance are presented in Section 3.3.

3.1 Locations and data

To study the performance of the models in forecasting ocean waves with sufficiently different statistical characteristics, significant wave height data measured by buoys on the coast of Ponce (No. 42085), San Juan (No. 41053), and the South Virgin Islands (No. 41052) were chosen. Location information and data availability of these buoys are depicted in Table 1. Some of the hourly time series records (source files from http://www.caricoos.org/drupal/data_download) of the significant wave heights are presented in Fig. 3. The variation in the range of significant wave heights among the three buoys can be seen in Table 1. In view of these differences among the sites, it is reasonable to describe the data from these three buoys as representing a range of geographical and statistical properties (Londhe and Panchang, 2006).

Fig. 3 Significant wave height time series from the wave measurements by buoys: (a) No. 42085, (b) No. 41053, and (c) No. 41052

Table 1 Buoy locations and data availability

3.2 Non-stationarity analysis

According to the traditional definition, a time series {x(t)} is stationary in the wide sense if, for all t,

$$\begin{array}{*{20}c}{E[x(t)] = {\rm{constant}} < \infty, \;\;} \\ {E[{x^2}(t)] < \infty, \quad \quad \quad \quad \;\;} \\ {E[x({t_1})x({t_2})] = R({t_2} - {t_1}),} \\ \end{array}$$
(13)

where E[·] is the expected value defined as the ensemble average of the quantity, and R is the covariance function.

Based on the definition of a stationary process, quantitative methods of consecutive statistics are used to analyze the stationarity of the significant wave height time series. For a stationary time series, the expected value and the covariance at a fixed time delay are required to be constant in time. Fig. 4 shows the expected value and covariance functions of a stationary time series. Specifically, the time delay τ in the covariance function R(τ) is set to 10. The expected value and covariance functions of the stationary time series are seen to be nearly constant. According to the definition of a stationary process formulated in Eq. (13), this demonstrates that the IMF produced by the EMD in Fig. 4a is stationary. Fig. 5 presents the statistical functions of the significant wave height data. It shows that the expected value functions and the covariance functions R(10) are notably time varying, demonstrating the presence of non-stationarity in the significant wave height data.
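One way such consecutive statistics could be computed is sketched below. The windowing scheme is not specified in the text, so the moving window, its length, and the name `sliding_stats` are assumptions made purely for illustration. The second quantity is the windowed average of x(t)x(t+τ), following the third line of Eq. (13).

```python
def sliding_stats(x, window, lag=10):
    """Windowed estimates of E[x(t)] and E[x(t) x(t + lag)] along the series."""
    means, covs = [], []
    for s in range(len(x) - window - lag + 1):
        means.append(sum(x[s:s + window]) / window)
        covs.append(sum(x[s + i] * x[s + i + lag] for i in range(window)) / window)
    return means, covs
```

For a stationary series both returned sequences stay close to constant, whereas a trend or a slowly varying amplitude shows up as a drift in the windowed values.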

Fig. 4 Expected value and covariance functions of a stationary time sequence: (a) Example of IMF produced by implementing EMD technique; (b) Expected value and covariance functions of IMF

Fig. 5 Statistics of significant wave height data: expected values and covariance functions of data set I (a), data set II (b), and data set III (c)

3.3 Evaluation of forecasting performance

Error measures that are used for the evaluation of forecasting performance usually include the root mean square error (RMSE), the correlation coefficient (r), the scatter index (SI), and the mean absolute error (MAE). Each of these error criteria has its own strengths and limitations (Kalra et al., 2005). For example, the correlation coefficient r is a widely accepted measure of the degree of linear association between the target and the realized outcome, but it is highly sensitive to extreme values. Hence, the measures should be viewed together when drawing any inference based on their magnitudes.

In this study, prediction results were studied by (I) comparing time histories of the above models’ forecasts with measured wave heights, (II) computing the RMSE, the correlation coefficient (r), and the SI as shown in Eqs. (14)–(16), and (III) drawing scatter diagrams and computing the corresponding best-fit line slope. The RMSE is a measure representing the ensemble error of the prediction results, and its magnitude scales with the observed mean. The SI, which normalizes the RMSE by the observed mean, therefore forms a good non-dimensional error measure.

$$r = {{\sum\limits_{t = 1}^n {({{\hat x}_t} - {{\hat x}_m})({x_t} - {x_m})} } \over {\sqrt {\sum\limits_{t = 1}^n {{{({{\hat x}_t} - {{\hat x}_m})}^2}\sum\limits_{t = 1}^n {{{({x_t} - {x_m})}^2}} } } }},$$
(14)
$${\rm{RMSE}} = \sqrt {{1 \over n}\sum\limits_{t = 1}^n {{{({{\hat x}_t} - {x_t})}^2}}},$$
(15)
$${\rm{SI}} = {{{\rm{RMSE}}} \over {{x_m}}},$$
(16)

where \({\hat x_t}\) is the forecast with the mean value \({\hat x_m}\), \(x_t\) is the measured wave height with the mean value \(x_m\), and n is the number of test samples.
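Eqs. (14)–(16) translate directly into a few lines of Python. The sketch below is illustrative, and the function name `error_measures` is hypothetical. It assumes that the two sequences have equal length, that neither is constant, and that the observed mean is nonzero.

```python
import math


def error_measures(pred, obs):
    """Return (r, rmse, si) as defined in Eqs. (14)-(16)."""
    n = len(obs)
    pm, om = sum(pred) / n, sum(obs) / n
    cov = sum((p - pm) * (o - om) for p, o in zip(pred, obs))
    spread = math.sqrt(sum((p - pm) ** 2 for p in pred) *
                       sum((o - om) ** 2 for o in obs))
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / n)
    return cov / spread, rmse, rmse / om
```

A forecast that is offset from the observations by a constant illustrates why the measures should be read together: observations 1, 2, 3 and forecasts 2, 3, 4 give r = 1 even though the RMSE is 1 and the SI is 0.5.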

4 Results and discussion

The AR and EMD-AR models were tested using significant wave heights measured by buoys (No. 42085, No. 41053, and No. 41052). A fixed sliding window with a sample size of 500-h wave height records was designed to construct prediction models, while the subsequent 500-h data were used for validation purposes.
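The sliding-window evaluation can be sketched as a generator of training windows and forecast targets. The exact indexing used in this study is not stated, so the scheme below is an assumption: for each forecast origin, the most recent `window` samples form the training history and the sample k steps after the last history value is the target. The name `sliding_windows` is hypothetical.

```python
def sliding_windows(x, window, n_test, k):
    """Yield (history, target) pairs for k-step-ahead validation.

    history holds the `window` samples ending at the forecast origin and
    target is the measured value k steps after the last history sample.
    """
    for t in range(window, window + n_test - k + 1):
        yield x[t - window:t], x[t + k - 1]
```

With `window=500` and `n_test=500`, each model would be re-identified on the latest 500-h history before every forecast, and the targets would all fall within the subsequent 500-h validation segment.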

4.1 Results

4.1.1 Prediction results using data set I

Time histories of the 1-h, 3-h, and 6-h predictions of significant wave heights on the coast of Ponce are shown in Figs. 6–8. Scatter diagrams of the forecasts are presented in Figs. 9–11. The values of the error measures, including r, RMSE, and SI, under various lead times are summarized in Table 2. Additionally, the RMSE and r are plotted in Fig. 12 to show the relations between their magnitudes and the prediction lead times.

Fig. 6 1-h forecast of significant wave height on the coast of Ponce by AR model (a) and EMD-AR model (b)

Fig. 7 3-h forecast of significant wave height on the coast of Ponce by AR model (a) and EMD-AR model (b)

Fig. 8 6-h forecast of significant wave height on the coast of Ponce by AR model (a) and EMD-AR model (b)

Fig. 9 Scatter diagram of observations and 1-h predictions by AR (a) and EMD-AR (b) models using data set I

Fig. 10 Scatter diagram of observations and 3-h predictions by AR (a) and EMD-AR (b) models using data set I

Fig. 11 Scatter diagram of observations and 6-h predictions by AR (a) and EMD-AR (b) models using data set I

Fig. 12 RMSE (a) and correlation coefficient (b) of prediction models with various lead times using data set I

Table 2 Error measures of AR and EMD-AR models in predicting significant wave heights in Ponce

4.1.2 Prediction results using data set II

Further comparisons of the prediction models were carried out using the significant wave height records measured by buoy 41053 on the coast of San Juan. Figs. 13 and 14 show the 1-h and 6-h predicted time histories, respectively, while Figs. 15 and 16 exhibit the corresponding scatter diagrams. Ensemble error measures are summarized in Table 3, and the RMSE and r are plotted in Fig. 17.

Fig. 13 1-h forecast of significant wave height on the coast of San Juan by AR (a) and EMD-AR (b) models

Fig. 14 6-h forecast of significant wave height on the coast of San Juan by AR (a) and EMD-AR (b) models

Fig. 15 Scatter diagram of observations and 1-h predictions by AR (a) and EMD-AR (b) models using data set II

Fig. 16 Scatter diagram of observations and 6-h predictions by AR (a) and EMD-AR (b) models using data set II

Fig. 17 RMSE (a) and correlation coefficient (b) of prediction models with various lead times using data set II

Table 3 Error measures of the AR and EMD-AR models in predicting significant wave heights in San Juan

4.1.3 Prediction results using data set III

Explorations of the prediction models were consolidated by forecasting simulations using significant wave heights measured by buoy 41052 arranged on the coast of the South Virgin Islands. Similarly, results are represented in the form of historical time series, scatter diagrams, and error measures. For brevity, only 6-h forecasting time historical results (Fig. 18) and the corresponding scatter diagrams (Fig. 19) are presented. Summaries of error measures used in various prediction lead times are shown in Table 4 and Fig. 20.

Fig. 18 6-h forecast of significant wave height on the coast of the South Virgin Islands by AR (a) and EMD-AR (b) models

Fig. 19 Scatter diagram of observations and 6-h predictions by AR (a) and EMD-AR (b) models using data set III

Fig. 20 RMSE (a) and correlation coefficient (b) of prediction models with various lead times using data set III

Table 4 Error measures of the AR and EMD-AR models in predicting significant wave heights in South Virgin Islands

4.2 Discussion

It is clear from the results that 1-h wave forecasts at various locations using the AR model agree with the measurements to a reasonable degree. As Figs. 6 and 13 suggest, the general patterns of the recorded significant wave height variation in different locations were well captured by the AR model. Tables 2–4 present the values of the error measures, where the correlation coefficients for the wave forecasts in Ponce, San Juan, and the South Virgin Islands were 0.92, 0.93, and 0.97, respectively, indicating a relatively high degree of linear association between the predicted and recorded wave heights.

However, prediction errors remain noticeable in the predicted time series as shown in Figs. 6 and 13. Spatial offsets appear as large parts of the troughs and peaks are underestimated. Figs. 9 and 15 show that the best-fit line slopes for the scatters with respect to wave forecasts in Ponce and San Juan are only 0.857 and 0.785, respectively. In addition to the spatial offsets, Figs. 6 and 13 imply that even if the peaks and troughs were well predicted by the AR model, a shift between the recorded and predicted wave time series can still be noted. The shift is a kind of prediction error that can also be found in other research studies of wave forecasting using the AR model (Deo and Sridhar, 1998) and ANN (Londhe and Panchang, 2006). The shift results mainly from the non-stationarity hidden in the measured wave time series. Even if the nonlinear ANN is used to forecast the nonlinear and non-stationary wave height, the shift remains.

The shift grows with the lead time. In Figs. 6–8, it is easy to see that the shift between the AR-based predicted and recorded wave time series increases as the lead time grows. Predictions by the AR model in San Juan and the South Virgin Islands support these observations. As presented in Tables 2–4 and in Figs. 12, 17, and 20, the RMSE and SI increase, while the correlation coefficient decreases, as the lead time increases.

Owing to the linear and stationary limitations, the AR model fails to predict the nonlinear and non-stationary wave heights accurately when the lead time reaches 6 h. The best-fit line slopes of the scatters with respect to wave forecasts in Ponce, San Juan, and the South Virgin Islands are only 0.614, 0.702, and 0.798, respectively, indicating a relatively low level of forecasting accuracy. The nonlinear and non-stationary wave forecasts are considerably improved by using the proposed EMD-AR model. The predictions of the hybrid EMD-AR model show better agreement with the targets. When the lead time is short, not only are the peaks and troughs of the targets precisely captured for the most part but also the short-term fluctuations in the sequence are reproduced remarkably well (Figs. 6 and 13). For instance, the spatial offsets between the observations and the predictions by the AR model are quite noticeable in the range of 650–700 h, especially for forecasts with a large lead time (Fig. 14). This situation is noticeably improved by introducing the EMD technique (Figs. 13 and 14). Additionally, the best-fit line slopes in the scatter of 1-h wave forecasts using the AR and EMD-AR models in Ponce (Fig. 9) are 0.876 and 0.957, respectively. Despite the small spatial offsets relative to the target when the lead time grows, forecasts of the EMD-AR model display a level of fidelity to the measured significant wave heights which is certainly acceptable for most practical applications.

As shown in Figs. 6–9, 13, 14, and 18, the shifts between the predicted and recorded wave time series in various locations were eliminated by using the EMD-AR model instead of the AR model. This improvement is confirmed by comparing the error measures of the EMD-AR and AR models. For example, the correlation coefficients of the 6-h predictions in Ponce, San Juan, and the South Virgin Islands by the AR model were 0.65, 0.86, and 0.91, respectively, while those by the EMD-AR model were 0.90, 0.94, and 0.97, respectively. Meanwhile, the SIs of the 6-h predictions in these locations by the AR model were 0.1563, 0.2405, and 0.2041, while those of the EMD-AR model were 0.0625, 0.1450, and 0.1099, respectively. In addition, Tables 2–4 and Figs. 12, 17, and 20 summarize error measures with various lead times, providing general evidence for the above claims. The EMD-AR model led to lower ensemble RMSE and higher r. The graphs in Figs. 12, 17, and 20 combined with Tables 2–4 demonstrate large improvements in prediction accuracy by using the EMD technique in the AR model. Considerable reductions in RMSE and increases in correlation coefficient were obtained. Taking 6-h wave forecasts as an example, in Table 2, the reduction in RMSE was about 60%, while the increase in the correlation coefficient was more than 50%.

5 Concluding remarks

This study developed a hybrid EMD-AR model to improve the accuracy of prediction of nonlinear and non-stationary waves. The EMD-AR and AR models were compared using wave data with various geographical and statistical properties measured by NDBC buoys in Ponce, San Juan, and the South Virgin Islands. Consistent results were obtained from the predictions of significant wave heights in different locations. For short-interval predictions, the AR model may produce reasonable results. However, spatial offsets and shifts occur widely in the nonlinear and non-stationary wave forecasts. This is because the AR model is suitable only for linear and stationary time series prediction, whereas nonlinearity and non-stationarity are features of all the measurements. These errors increased as the lead time grew. This difficulty was overcome by the proposed hybrid EMD-AR model. Owing to the capability of the EMD technique in processing nonlinearity and non-stationarity, the accuracy of the wave forecast was greatly improved. Not only were the general tendencies satisfactorily reproduced but also most of the peaks and troughs were correctly captured. Considerable improvements in prediction accuracy were obtained using the hybrid EMD-AR model. Graphs related to predicted time histories (Figs. 6–8, 13, 14, and 18) suggest that the shifts between the predicted and recorded wave time series were eliminated by the EMD technique. The superiority of the EMD-AR model to the AR model was confirmed by the smaller ensemble RMSE and SI and the higher r. However, the hybrid EMD-AR model has a limitation: it requires more computational resources than the single AR model.