1 Introduction

The conditional nonlinear optimal perturbation (CNOP) method was proposed by Mu et al. (2003) to study the predictability and sensitivity of oceanic and climatic events in nonlinear systems (Wang et al. 2009). CNOP defines the perturbation that leads to the largest nonlinear development at the prediction time under the given constraint. Studying the solved perturbation helps researchers understand the corresponding physical mechanism and improve the prediction skill. Therefore, CNOP has been widely applied in studies of various events such as the El Niño–Southern Oscillation (ENSO) (Zhang et al. 2018), typhoons (Mu et al. 2019) and the Kuroshio (Zhang et al. 2019a, b, c), and these studies verify the effectiveness of CNOP.

One problem in studying CNOP is how to obtain the perturbation leading to the largest development. Because CNOP is in essence an optimization problem, there are two general ways to solve it. One way is to apply gradient-based methods (Sun et al. 2010), but these methods easily run into problems such as a missing adjoint component and the influence of discontinuous on–off switches (Mu et al. 2005). The other way is to apply gradient-free methods; for example, intelligent algorithms, also called heuristic algorithms, can be applied to solve CNOP (Zheng et al. 2014). However, intelligent algorithms generally cannot obtain an effective solution within a reasonable time when the scale of the problem is large. Therefore, some researchers propose a framework, called the feature extraction-based intelligent algorithm (FEIA) framework in this paper, to reduce the search space of the intelligent algorithm with a dimension reduction method such as principal component analysis (PCA) (Wold et al. 1987; Ringnér 2008; Abdi and Williams 2010). For example, Mu et al. (2015) combine PCA and particle swarm optimization (PSO) to solve the CNOP of ENSO in the Zebiak–Cane (ZC) model, and Yuan et al. (2019a) combine PCA and simulated annealing (SA) to solve the CNOP of double-gyre variation in the Regional Ocean Modeling System (ROMS). Although utilizing PCA improves time efficiency to some extent, a problem remains: because of the fixed latent space of PCA, the probability of obtaining an effective solution is quite low.

Recently, neural networks have attracted the attention of many researchers because of their convenience and excellent performance. Meanwhile, many neural network structures can be used to construct a low-dimensional latent search space. Compared with PCA, a neural network might obtain a relatively sparse and uniform search space or a better reconstruction mapping for a specific problem, which might be helpful for the search. However, few studies concentrate on employing neural networks as the feature extraction component of the FEIA framework to improve the solving performance. Hence, in this paper, two possible ways of applying neural networks in the FEIA framework are considered. One way is to apply a neural network with a reduction function. For example, the auto-encoder (AE) (Rumelhart et al. 1986; Hinton et al. 2006; Wang et al. 2016; Ramamurthy et al. 2020) and its variants such as the sparse AE (SAE) (Ng 2011; Liu et al. 2019; Zhang et al. 2021), the convolutional AE (CAE) (Masci et al. 2011; Chen et al. 2018) and the variational AE (VAE) (Kingma et al. 2014; Xie et al. 2019; Liu et al. 2020a, b; Lin et al. 2020; Jiao et al. 2020) might replace the role of PCA in the FEIA framework. The other way is to apply PCA to obtain the latent space and a neural network, such as a decoder or generative adversarial nets (GAN) (Goodfellow et al. 2014; Creswell et al. 2018; Zhang et al. 2019a, b; Schonfeld et al. 2020), to reconstruct the original space. We then conduct experiments to verify the feasibility of adopting the above two ways to solve the CNOP of double-gyre variation in ROMS. The results demonstrate that, in contrast to the PCA-based FEIA framework, the FEIA framework with neural networks can obtain more effective solutions in terms of better objective values and larger probabilities of triggering the expected physical phenomenon.

The rest of this paper is organized as follows. This paper takes solving the CNOP of double-gyre variation in ROMS as the case study, and the related contents of CNOP, neural network-based dimension reduction methods and the case are described in Sect. 2. In Sect. 3, the FEIA framework, whose intelligent algorithm component is PSO, the related neural networks and the way of coupling the networks to the FEIA framework are introduced. The tuning process of the networks and the solving results are shown and analyzed in Sect. 4. Finally, the conclusion and the prospect of future work are given in Sect. 5.

2 Related works

2.1 CNOP

CNOP method, proposed by Mu et al. (2003), has been widely applied in the field of atmospheric and oceanic sciences to study the predictability and sensitivity (Wang et al. 2020; Jiang and Duan 2020; Liu et al. 2020a, b).

The mathematical description of CNOP is as follows.

For a given problem, assume that X0 is the initial state, Mt is the nonlinear propagator of the model from time 0 to t and x0 is the initial perturbation. In order to explore the initial perturbation \(\mathbf{x}_{0}^{*}\) that makes the model's development deviate maximally from the reference state at the prediction time under the given constraint δ, the CNOP problem can be written as follows (Eq. 1):

$$ \mathbf{x}_{0}^{*} = \arg \max_{\left\| \mathbf{x}_{0} \right\|_{C} \le \delta } J(\mathbf{x}_{0}) = \arg \max_{\left\| \mathbf{x}_{0} \right\|_{C} \le \delta } \left\| M_{t}(\mathbf{X}_{0} + \mathbf{x}_{0}) - M_{t}(\mathbf{X}_{0}) \right\|_{E} $$
(1)

where ||⋅||E is the energy norm and ||⋅||C is the constraint norm for the problem.

CNOP is in essence an optimization problem with certain constraints. Generally, depending on whether the gradient is involved, there are two types of approaches to solving CNOP. One is to apply gradient-based methods (Sun et al. 2010). The adjoint method, a traditional gradient-based method, is highly dependent on the adjoint component of the numerical model and requires a lot of computation (Towara and Naumann 2013). Besides, due to the frequent occurrence of discontinuous on–off switches (Mu et al. 2005) in nonlinear systems, gradient-based methods may fail to compute the correct gradient, which results in failure to solve CNOP. The other is to apply gradient-free methods. Intelligent algorithms are the general name of a series of gradient-free algorithms designed by imitating natural laws. Such algorithms can be applied to solve CNOP (Zheng et al. 2014). When the scale of the problem is small, intelligent algorithms can obtain the global optimum. However, the scale of oceanic and climatic events is large, and in this case intelligent algorithms cannot obtain an effective solution within a reasonable time. With the aim of reducing the search space of intelligent algorithms, the FEIA framework is proposed. PCA (Wold et al. 1987; Ringnér 2008; Abdi and Williams 2010) is often used as the dimension reduction method in the FEIA framework. For example, Mu et al. (2015) combine PCA and particle swarm optimization (PSO) to solve the CNOP of ENSO in the Zebiak–Cane (ZC) model, and Yuan et al. (2019a) combine PCA and simulated annealing (SA) to solve the CNOP of double-gyre variation in the Regional Ocean Modeling System (ROMS). Although utilizing PCA improves time efficiency to some extent, the probability of obtaining an effective solution is quite low, since PCA with its fixed latent space cannot balance the trade-off between dimension reduction and information loss in some cases.

2.2 Neural network-based dimension reduction

With the development of artificial intelligence, many neural network structures can be used to construct a relatively sparse and uniform latent search space or a better reconstruction mapping. AE is a neural network structure dedicated to transforming inputs into outputs with minimal information loss. Rumelhart et al. (1986) first introduced AE in the 1980s. AE can map input data to a low-dimensional latent space and then use that latent space to generate output data similar to the input data. Because of its ability to learn useful features from data, AE has been widely applied in dimension reduction. Wang et al. (2016) compare AE with state-of-the-art dimension reduction methods; the experimental results show that AE can learn features that differ from those of other dimension reduction methods. Researchers also apply AE to hyperspectral image classification as the dimension reduction component and show that this technique achieves image denoising and high performance (Ramamurthy et al. 2020). Besides, there are some variants of AE such as SAE, CAE and VAE. Ng (2011) introduces the concept of sparsity into the training of traditional auto-encoders and proposes SAE to make the hidden variables show more obvious characteristics. SAE has been widely used in the feature extraction of images (Liu et al. 2019; Zhang et al. 2021). By introducing convolutional layers and deconvolutional layers, CAE is proposed to enable the network to better capture spatial features and the relevance between data; it is applied in anomaly detection to learn nonlinear relationships between features (Chen et al. 2018). VAE assumes the latent feature follows a distribution, which is generally a Gaussian distribution. VAE-based dimension reduction is applied in text learning (Xie et al. 2019; Liu et al. 2020a, b), biology-related analysis (Lin et al. 2020; Jiao et al. 2020), etc. In addition, GAN, proposed by Goodfellow et al. (2014), shows high potential in data generation (Creswell et al. 2018). GAN and its variants (Zhang et al. 2019a, b; Schonfeld et al. 2020) generally consist of a generator that generates simulated data and a discriminator that judges whether the data are real or not. The structure of GAN meets the requirements of reconstruction.

2.3 The case of double-gyre variation in ROMS

Double gyre, which consists of a sub-polar gyre and a sub-tropical gyre, is a typical large-scale ocean circulation in the northern mid-latitude ocean basins (Shen et al. 1999). Double-gyre variation is one of the low-frequency variability phenomena (Nauw and Dijkstra 2001). The study of the variation is helpful to understand the dynamic mechanism of double gyre and how the oceanic variability contributes to the mid-latitude climate variability (Qiu 2000).

ROMS is a split-explicit, free-surface, topography-following-coordinate ocean model (Shchepetkin and McWilliams 2005). It has been widely used in a variety of applications in the scientific community. Double gyre is one of the events simulated in ROMS, and the simulation follows Moore et al. (2004). The model simulates the double gyre in a region whose zonal and meridional extents are 1000 km and 2000 km, respectively, and the region is divided into four vertical layers of 125 m each. The state data of the double gyre consist of three parts: the eastward velocity u, the northward velocity v and the sea surface height ζ. Under the resolution of 18.5 km, the sizes of u, v and ζ are 55 × 110 × 4, 56 × 109 × 4 and 56 × 110, respectively, and the total size of the state data is 54,776.

Generally, the double gyre can be in one of three states: the symmetry state (Fig. 1a), the jet-up state (Fig. 1b) and the jet-down state (Fig. 1c). Figure 1 shows the representation of the three states in ROMS. In general, the double gyre stays in one state or shifts between the symmetry state and another state. When the variation happens, a shift between the jet-up state and the jet-down state appears. CNOP can be used to obtain the initial perturbation causing the variation, and the obtained perturbation can be used to study the dynamic mechanism of the double gyre (Yuan et al. 2019a).

Fig. 1 The a symmetry state, b jet-up state, c jet-down state of double gyre

According to Zhang et al. (2015), the energy norm (Eq. 2) and the constraint norm (Eq. 3) of double gyre can be defined as follows:

$$ \left\| {M_{t} \left( {{\mathbf{X}}_{0} + {\mathbf{x}}_{0} } \right) - M_{t} \left( {{\mathbf{X}}_{0} } \right)} \right\|_{E} = \frac{1}{2}\left[ {\int_{\Lambda } {h\left( {\Delta {\mathbf{u}}_{{\mathbf{t}}}^{2} + \Delta {\mathbf{v}}_{{\mathbf{t}}}^{2} } \right)dxdydz} + \int_{\Lambda } {\Delta {\mathbf{\zeta }}_{{\mathbf{t}}}^{2} dxdy} } \right] $$
(2)
$$ \left\| \mathbf{x}_{0} \right\|_{C} = \frac{1}{2}\left[ \int h\left( \Delta \mathbf{u}^{2} + \Delta \mathbf{v}^{2} \right) dxdydz + \int g \Delta \boldsymbol{\zeta}^{2} \, dxdy \right] $$
(3)

where X0 = {u0, v0, ζ0} and x0 = {Δu, Δv, Δζ} are the initial state vector and the initial perturbation state vector, respectively, for the double-gyre data simulated in ROMS. Mt is the nonlinear propagator of ROMS from time 0 to t, which can be regarded as a black-box function. {Δut, Δvt, Δζt} is the development state vector, which is obtained from the result of Mt as follows:

$$ \{ {\mathbf{\Delta u}}_{{\mathbf{t}}} ,{\mathbf{\Delta v}}_{{\mathbf{t}}} ,{\mathbf{\Delta \zeta }}_{{\mathbf{t}}} \} = \{ {\mathbf{u}}_{{{\mathbf{t\_new}}}} ,{\mathbf{v}}_{{{\mathbf{t\_new}}}} ,{{\varvec{\upzeta}}}_{{{\mathbf{t\_new}}}} \} - \{ {\mathbf{u}}_{{\mathbf{t}}} ,{\mathbf{v}}_{{\mathbf{t}}} ,{{\varvec{\upzeta}}}_{{\mathbf{t}}} \} = M_{t} ({\mathbf{X}}_{{\mathbf{0}}} + {\mathbf{x}}_{{\mathbf{0}}} ) - M_{t} ({\mathbf{X}}_{{\mathbf{0}}} ). $$
(4)

g is the gravitational acceleration, whose value is set to 9.8 m/s^2, and h is the vertical layer thickness, whose value is 125 m. The energy norm is integrated over the region Λ (0 km ≤ x ≤ 600 km, 750 km ≤ y ≤ 1250 km), and the constraint norm is integrated over the whole simulation region (0 km ≤ x ≤ 1000 km, 0 km ≤ y ≤ 2000 km). The other settings follow the previous experiment (Yuan et al. 2019a), where the initial state X0 is a jet-up state and the constraint value of the perturbation δ is set to 4.0 × 10^11 m^5/s^2, which is about 10% of the constraint norm value of the initial state. The intent of the problem is to obtain the perturbation \(\mathbf{x}_{0}^{*}\), which can lead to double-gyre variation.
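To make the discrete evaluation concrete, the following is a minimal numpy sketch (not the authors' code) of the discretized norms in Eqs. 2 and 3, assuming the perturbation fields are given on the ROMS grid described above and, for the energy norm, are already restricted to the region Λ; the ROMS propagator Mt itself is treated as a black box, and the function and variable names are illustrative.

```python
import numpy as np

G = 9.8           # gravitational acceleration (m/s^2)
H = 125.0         # thickness of each vertical layer (m)
DX = DY = 18.5e3  # horizontal resolution (m)

def energy_norm(du_t, dv_t, dzeta_t):
    """Discrete form of Eq. 2; the fields are assumed restricted to the region Lambda."""
    kinetic = H * (np.sum(du_t ** 2) + np.sum(dv_t ** 2)) * DX * DY
    surface = np.sum(dzeta_t ** 2) * DX * DY
    return 0.5 * (kinetic + surface)

def constraint_norm(du0, dv0, dzeta0):
    """Discrete form of Eq. 3, integrated over the whole simulation region."""
    kinetic = H * (np.sum(du0 ** 2) + np.sum(dv0 ** 2)) * DX * DY
    surface = G * np.sum(dzeta0 ** 2) * DX * DY
    return 0.5 * (kinetic + surface)
```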

3 Methods

3.1 FEIA framework

An intelligent algorithm is a type of method that combines rules and randomness to imitate natural phenomena and seek the optimal value (Lee et al. 2005). The basic flow of an intelligent algorithm can be summarized as follows:

  • Step 1: Determine the initial solution x.

  • Step 2: Calculate the objective function value f with x.

  • Step 3: Judge the iteration condition. If the termination condition is satisfied, output the best solution x*; otherwise, go to Step 4.

  • Step 4: Update solution x with the related rules and the objective value f calculated in Step 2. Go to Step 2.

One problem of intelligent algorithms is the curse of dimensionality. Assume the scale of the problem, i.e., the dimension of x, is n. Because the essence of an intelligent algorithm is a random search with some rules in the n-dimensional space, if n is too large, the efficiency of the algorithm is very low. However, in the process of solving an actual problem, the solution generally has some features. Assume the whole n-dimensional space is O; the points with the related features in O make up a subspace S. For example, in the CNOP problem of this paper, the optimal initial perturbation is in the subspace that shows the perturbation feature of the double gyre. If there are two mappings p and r, where p maps x in S into an m-dimensional (m ≪ n) latent space F and r reconstructs the low-dimensional solution w in F into x, the optimization problem can be reformulated as follows (Eq. 5):

$$ \begin{gathered} \mathbf{x}^{*} = r(\mathbf{w}^{*}),\quad \mathbf{w}^{*} = \arg \mathop{\mathrm{opt}}\limits_{\mathbf{w}} f(r(\mathbf{w})) \\ {\text{s.t.}}\;\; g_{i}(\mathbf{x}) = g_{i}(r(\mathbf{w})) \ge 0,\; i = 1, \ldots ,k \\ \end{gathered} $$
(5)

where f is the objective function and gi (i = 1, …, k) are the constraints.

Based on the above idea, this paper refers to the process in which an intelligent algorithm solves the optimization problem in the low-dimensional space as the FEIA framework. The basic flow of the FEIA framework can be summarized as follows:

  • Step 1: Collect the samples in S. Determine the mapper p and the re-constructor r.

  • Step 2: Determine the initial solution x. Map x into w by p.

  • Step 3: Reconstruct w into x by r. Calculate the objective function value f with x.

  • Step 4: Judge the iteration condition. If the termination condition is satisfied, output the best solution x* = r(w*); otherwise, go to Step 5.

  • Step 5: Update solution w with the related rules and the objective value f calculated in Step 3. Go to Step 3.

One point is that the initial solution in Step 2 of the FEIA framework should also be in S, so the construction of the initial solution combines experience and randomness rather than relying completely on randomness. In the past, the mapper and the re-constructor were usually the feature matrix of PCA and its transpose. In this paper, the PSO algorithm is selected as the intelligent algorithm component of the FEIA framework. Therefore, PCA and the PSO algorithm are introduced briefly as follows.

3.1.1 PCA

PCA is a classical machine learning method, which is widely applied to dimension reduction problems. The intent of PCA is to obtain a set of vector bases such that the data have the maximum projection in the directions of the bases, so that the approximation error between the reduced data and the original data is as small as possible. Assume X ⊆ S is the sample set of the possible solutions. The vector bases can be obtained by eigen-decomposition of XX^T (Eq. 6).

$$ {\mathbf{U,\Sigma }} = eigen\_decom({\mathbf{XX}}^{{\mathbf{T}}} ) $$
(6)

where Σ = {λ1, λ2, …, λn} contains the eigenvalues in descending order and U = (u1, u2, ⋯, un) contains the corresponding eigenvectors. Assuming n is the dimension of the original data and m is the dimension of the reduced data, the vector bases consist of the first m eigenvectors Um = (u1, u2, ⋯, um). If PCA is the feature extraction component of the FEIA framework, p and r can be constructed from Um (Eq. 7):

$$ p({\mathbf{x}}) = {\mathbf{xU}}_{{\mathbf{m}}} ,r({\mathbf{w}}) = {\mathbf{wU}}_{{\mathbf{m}}}^{T} . $$
(7)
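As an illustration (not the authors' code), the mapper and re-constructor of Eq. 7 can be sketched in numpy as follows; for a large state dimension, the truncated SVD of the sample matrix is used here, which yields the same leading directions as the eigen-decomposition in Eq. 6 when the samples are stored as the rows of X.

```python
import numpy as np

def fit_pca(X, m):
    """X holds one sample per row; returns the first m principal directions U_m (n x m)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # equivalent to the eigen-decomposition of Eq. 6
    return Vt[:m].T

def p(x, U_m):
    """Mapper of Eq. 7: original space -> m-dimensional latent space."""
    return x @ U_m

def r(w, U_m):
    """Re-constructor of Eq. 7: latent space -> original space."""
    return w @ U_m.T
```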

3.1.2 PSO

The PSO algorithm is an intelligent algorithm imitating the process in which birds search for food. The algorithm assumes there are l particles searching for the optimal solution and updates the particles by recording the local best solutions and the global best solution. The local best solution is the best solution found by each particle in the past iterations, and the global best solution is the best solution found by all the particles in the past iterations. For the FEIA framework, the pseudocode of the PSO algorithm component is as follows.

figure a (pseudocode of the PSO algorithm component in the FEIA framework)

In the pseudocode, the initial solution x0 is an empirical solution which follows the features of the subspace S. The inertia coefficient ic is the adaptive parameter that keeps part of the past velocity. In this paper, most constants of the algorithm are set to empirical values: l is set to 20, ic0 is set to 0.9, Δic is set to 0.01 and c1 and c2 are set to 2. Because the model integration is time-consuming in the CNOP problem, max_iter is set to 30 to verify whether the algorithm can obtain an effective solution in a reasonable time. One important thing to note is that a constraint projection function should be defined if the problem has constraints. For the CNOP problem in this paper, the projection function is defined as Eq. 8.

$$ cons\_pro(\mathbf{w}^{j}) = \begin{cases} \mathbf{w}^{j}, & {\text{if }}\left\| r(\mathbf{w}^{j}) \right\|_{C} \le \delta \\ \sqrt{\dfrac{\delta}{\left\| r(\mathbf{w}^{j}) \right\|_{C}}}\, \mathbf{w}^{j}, & {\text{if }}\left\| r(\mathbf{w}^{j}) \right\|_{C} > \delta \end{cases} $$
(8)
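The following is a hedged Python sketch of the PSO component summarized in figure a, using the standard PSO update rule and the parameter values stated above (l = 20, ic0 = 0.9, Δic = 0.01, c1 = c2 = 2, max_iter = 30); the re-constructor r, the objective J, the constraint norm, the initial spread of the swarm and the inertia floor are illustrative assumptions supplied by the rest of the framework.

```python
import numpy as np

def cons_pro(w, r, cons_norm, delta):
    """Constraint projection of Eq. 8."""
    c = cons_norm(r(w))
    return w if c <= delta else np.sqrt(delta / c) * w

def pso_feia(w0, r, J, cons_norm, delta, l=20, ic0=0.9, d_ic=0.01,
             c1=2.0, c2=2.0, max_iter=30, spread=0.1):
    rng = np.random.default_rng(0)
    W = w0 + spread * rng.standard_normal((l, w0.size))            # particles around the initial solution
    W = np.array([cons_pro(w, r, cons_norm, delta) for w in W])
    V = np.zeros_like(W)
    pbest = W.copy()                                               # local best solutions
    pbest_val = np.array([J(r(w)) for w in W])
    gbest = pbest[np.argmax(pbest_val)].copy()                     # global best solution
    ic = ic0
    for _ in range(max_iter):
        r1, r2 = rng.random(W.shape), rng.random(W.shape)
        V = ic * V + c1 * r1 * (pbest - W) + c2 * r2 * (gbest - W)
        W = np.array([cons_pro(w, r, cons_norm, delta) for w in W + V])
        vals = np.array([J(r(w)) for w in W])
        better = vals > pbest_val
        pbest[better], pbest_val[better] = W[better], vals[better]
        gbest = pbest[np.argmax(pbest_val)].copy()
        ic = max(ic - d_ic, 0.4)                                   # decaying inertia (the floor is an assumption)
    return r(gbest), pbest_val.max()
```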

3.2 AE and its variants

AE is a common neural network for dimension reduction. Compared with other neural networks, AE has the following characteristics: (1) the number of units in the input layer and that in the output layer are the same; (2) the number of units in the middle latent layer is less than that in the input layer or the output layer; (3) the network from the input layer to the middle latent layer is called the encoder and the network from the middle latent layer to the output layer is called the decoder; and (4) training the network should make the input and output as similar as possible. Figure 2 shows a simple AE with three layers.

Fig. 2 A simple three-layer AE

In Fig. 2, l1 is the input layer, l2 is the middle latent layer and l3 is the output layer. l1 and l2 make up the encoder, and l2 and l3 make up the decoder. The relation between two adjacent data layers can be written as follows (Eq. 9):

$$ {\mathbf{X}}_{{{\mathbf{i + 1}}}} = h_{i} \left( {{\mathbf{X}}_{{\mathbf{i}}} |{\mathbf{W}}_{i} ,{\mathbf{b}}_{{\mathbf{i}}} } \right) $$
(9)

where Xi is the data in the ith layer, hi is a hidden function and Wi and bi are the parameters of the function. It is easy to see that the encoder and decoder of AE can serve as the mapper p and the re-constructor r in the FEIA framework. In this paper, four kinds of AEs are tested in the experiment, and they are introduced briefly below. The detailed settings, such as the activation function, are discussed in Sect. 3.4.

3.2.1 AE

For the original AE, the data vectors are passed through the network by fully connected layers. Besides the three necessary data layers, there can be other latent data layers in the encoder and decoder. The hidden function can be written as follows (Eq. 10):

$$ h_{i} \left( {{\mathbf{X}}_{{\mathbf{i}}} |{\mathbf{W}}_{i} ,{\mathbf{b}}_{{\mathbf{i}}} } \right) = act_{i} ({\mathbf{X}}_{{\mathbf{i}}} {\mathbf{W}}_{i} + {\mathbf{b}}_{{\mathbf{i}}} ) $$
(10)

where acti(⋅) is the activation function, which is discussed in Sect. 3.4, Wi is the weight matrix of the fully connected layer and bi is the bias vector of the fully connected layer. The intent of AE is to minimize the reconstruction error, so the cost function of the network is defined as Eq. 11.

$$ c({\mathbf{W}},{\mathbf{b}}) = \frac{1}{2}mse({\mathbf{X}},{\mathbf{Y}}) + reg({\mathbf{W}}) $$
(11)

where W and b are all the weights and biases adjusted in the network, X is the input data, Y is the output data, mse(⋅) (Eq. 12) represents the reconstruction error, and reg(⋅) (Eq. 13) is the regularization function to avoid overfitting.

$$ mse({\mathbf{X}},{\mathbf{Y}}) = \frac{1}{n}\sum\limits_{i = 1}^{n} {(x_{i} - y_{i} )^{2} } $$
(12)
$$ reg({\mathbf{W}}) = \lambda_{reg} \sum\nolimits_{i,j,l} {w_{i,j,l}^{2} } $$
(13)

where λreg is a constant which represents the weight of the regularization cost and i, j and l are the indices of row, column and layer for the weight matrix, respectively.
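As a concrete illustration of Eqs. 10-13 under the settings of Sect. 3.4 (linear output layer, leaky ReLU latent layers, L2 regularization, no bias in the re-constructor), a minimal Keras sketch of such an AE might look as follows; the layer sizes and the latent dimension are placeholders, not the authors' exact structure.

```python
import tensorflow as tf

n, m, lam_reg = 54776, 40, 1e-6
reg = tf.keras.regularizers.l2(lam_reg)
act = lambda x: tf.nn.leaky_relu(x, alpha=0.2)   # activation of Sect. 3.4.1

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n,)),
    tf.keras.layers.Dense(256, activation=act, kernel_regularizer=reg),
    tf.keras.layers.Dense(m, kernel_regularizer=reg),                 # latent layer (linear)
])
decoder = tf.keras.Sequential([                                       # re-constructor without biases (Sect. 3.4.2)
    tf.keras.Input(shape=(m,)),
    tf.keras.layers.Dense(256, activation=act, use_bias=False, kernel_regularizer=reg),
    tf.keras.layers.Dense(n, use_bias=False, kernel_regularizer=reg),
])
ae = tf.keras.Sequential([encoder, decoder])
ae.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
           loss=lambda x, y: 0.5 * tf.reduce_mean(tf.square(x - y)))  # 0.5*mse term of Eq. 11
```

The L2 regularizers attached to the layers contribute the reg(W) term of Eq. 11, so the compiled loss only needs to provide the reconstruction part.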

3.2.2 SAE

The structure of SAE is similar to that of AE, with two main differences: (1) the data layers only contain the three main layers and (2) a sparse cost (Eq. 14) is added to the cost function (Eq. 15).

$$ spa(\hat{\boldsymbol{\rho}}) = \lambda_{spa} \sum\limits_{j = 1}^{m} KL(\rho \parallel \hat{\rho}_{j}) = \lambda_{spa} \sum\limits_{j = 1}^{m} \left[ \rho \log \frac{\rho}{\hat{\rho}_{j}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_{j}} \right] $$
(14)
$$ c({\mathbf{W}},{\mathbf{b}}) = \frac{1}{2}mse({\mathbf{X}},{\mathbf{Y}}) + reg({\mathbf{W}}) + spa({\hat{\mathbf{\rho }}}) $$
(15)

where the sparse cost is represented by the Kullback–Leibler divergence between the activation degree of the middle-layer data \(\hat{\boldsymbol{\rho}}\) and a sparseness constant \(\rho\), which is set to the empirical value 0.1 in this paper. λspa is a constant which represents the weight of the sparse cost. By controlling the activation degree of the middle layer, SAE can keep the features relatively sparse.
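A minimal sketch of the sparse cost in Eq. 14 is given below, assuming the activation degrees of the middle layer lie in (0, 1) (e.g., after a sigmoid-like squashing); this is an illustration rather than the authors' implementation.

```python
import tensorflow as tf

def sparse_cost(hidden, rho=0.1, lam_spa=1e-3, eps=1e-8):
    """Sparse cost of Eq. 14; 'hidden' is a batch of middle-layer activation degrees in (0, 1)."""
    rho_hat = tf.reduce_mean(hidden, axis=0)          # average activation degree of each latent unit
    kl = (rho * tf.math.log(rho / (rho_hat + eps))
          + (1.0 - rho) * tf.math.log((1.0 - rho) / (1.0 - rho_hat + eps)))
    return lam_spa * tf.reduce_sum(kl)
```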

3.2.3 CAE

The difference between CAE and AE is that CAE introduces convolutional layers and deconvolutional layers to replace the fully connected layers. Unlike in the fully connected layer, the data passed through the convolutional layer are matrices, and the hidden function can be written as Eq. 16:

$$ h_{i} \left( {{\mathbf{X}}_{{\mathbf{i}}} |{\mathbf{W}}_{i} ,{\mathbf{b}}_{{\mathbf{i}}} } \right) = act_{i} (p_{c\_d} ({\mathbf{X}}_{{\mathbf{i}}} ,{\mathbf{W}}_{{\mathbf{i}}} ) + {\mathbf{b}}_{{\mathbf{i}}} ) $$
(16)

where pc_d represents the process of convolution (Fig. 3a) or deconvolution (Fig. 3b), Wi is the set of (de)convolutional kernels and bi is the bias matrix. As shown in Fig. 3a, the convolution maps the dot product between the shaded submatrix of the original data and the kernel to one entry of the projected data. As shown in Fig. 3b, the deconvolution multiplies each entry of the original data by the kernel and adds the result to the shaded submatrix of the projected data. The cost function of CAE is the same as that of AE (Eq. 11). Compared with AE, CAE can capture the spatial information of the data, and the convolutional kernels can reduce the memory usage.

Fig. 3 The process of a convolution and b deconvolution
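The convolution/deconvolution idea can be sketched as follows, assuming the state has been arranged as a 2-D grid with several channels (the 56 × 110 × 4 shape below is a hypothetical arrangement of the double-gyre fields); it illustrates the structure of this subsection rather than the authors' exact network.

```python
import tensorflow as tf

act = lambda x: tf.nn.leaky_relu(x, alpha=0.2)
conv_encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 110, 4)),                  # hypothetical grid arrangement of the state
    tf.keras.layers.Conv2D(8, 3, strides=2, padding="same", activation=act),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(40),                           # latent feature
])
conv_decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(40,)),
    tf.keras.layers.Dense(28 * 55 * 8, activation=act, use_bias=False),
    tf.keras.layers.Reshape((28, 55, 8)),
    tf.keras.layers.Conv2DTranspose(4, 3, strides=2, padding="same", use_bias=False),
])
```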

3.2.4 VAE

VAE assumes the latent feature follows a distribution, which is generally a Gaussian distribution. Based on this assumption, the encoder of the network outputs a mean and a standard deviation of the distribution to construct the latent feature rather than outputting the latent feature directly; Fig. 4 shows this flow. Meanwhile, a distribution cost (Eq. 17) is added to the cost function (Eq. 18).

$$ dis(\boldsymbol{\mu}, \boldsymbol{\sigma}) = - \frac{\lambda_{vae}}{2} \sum\limits_{i = 1}^{m} \left( 1 + \log (\sigma_{i}^{2}) - \mu_{i}^{2} - \sigma_{i}^{2} \right) $$
(17)
$$ c({\mathbf{W}},{\mathbf{b}}) = \frac{1}{2}mse({\mathbf{X}},{\mathbf{Y}}) + reg({\mathbf{W}}) + dis({{\varvec{\upmu}}},{{\varvec{\upsigma}}}) $$
(18)

where μ and σ are the mean vector and the standard deviation vector output by the encoder, respectively. λvae is a constant which represents the weight of the distribution cost.

Fig. 4 The simple flow of VAE
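A hedged sketch of the VAE-specific parts is shown below: the reparameterized sampling of the latent feature and the distribution cost of Eq. 17, assuming the encoder outputs the mean and the log-variance of the latent Gaussian; the function names are illustrative.

```python
import tensorflow as tf

def sample_latent(mu, log_var):
    """Reparameterized sampling of the latent feature from the encoder outputs."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def dis_cost(mu, log_var, lam_vae=1e-4):
    """Distribution cost of Eq. 17 (KL divergence to a standard Gaussian, weighted by lambda_vae)."""
    return -0.5 * lam_vae * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var))
```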

3.3 The mapping model based on PCA and neural network

Another possible way of applying a neural network in the FEIA framework is to train the neural network only as the re-constructor. This way assumes the latent features obtained by PCA are good enough and tries to train a different re-constructor with the neural network. Compared with the neural networks in Sect. 3.2, the input data Xw are calculated by multiplying the original data X by the reduction matrix Um obtained by PCA, and the output data Y are expected to be the same as the original data X. In this paper, the decoder and GAN are tested as the re-constructor in the experiment, and they are introduced briefly below.
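For clarity, a minimal sketch of this data pairing is given below (an illustration under the notation of Sect. 3.1.1, with an illustrative function name): the network is trained to map the PCA-reduced input Xw = XUm back to the original data X.

```python
import numpy as np

def make_reconstruction_pairs(X, U_m):
    """Pair the PCA-reduced inputs with the original data as reconstruction targets."""
    X_w = X @ U_m      # latent inputs produced by the PCA mapper (Eq. 7)
    return X_w, X      # (network input, expected output)
```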

3.3.1 Decoder

As the name indicates, the decoder is the second part of AE. The structure of the decoder can follow the description in Sect. 3.2.

3.3.2 GAN

GAN consists of a generator network G and a discriminator network D. The generator maps the latent data Xw into the real data space, and the discriminator evaluates and distinguishes the real data X from the generated data Y. Figure 5 shows the basic structure of GAN.

Fig. 5 The basic structure of GAN

It is clear that the generator of GAN can serve as the re-constructor in the FEIA framework. The inner structure of the generator can be the same as that of the decoder; the difference is that the cost function (Eq. 19) of the generator consists of the cost function of the decoder and a generator discrimination cost (Eq. 20).

$$ c_{g} ({\mathbf{W}},{\mathbf{b}}) = \frac{1}{2}mse({\mathbf{X}},{\mathbf{Y}}) + reg({\mathbf{W}}) + gan_{g} ({\mathbf{Y}}) $$
(19)
$$ gan_{g} ({\mathbf{Y}}) = - \lambda_{{g{\text{an}}}} \log (D({\mathbf{Y}})) $$
(20)

where D(⋅) represents the output of the discriminator and λgan is a weight constant for the GAN cost. The inner structure of the discriminator can be a classifier network whose output dimension is 1 and whose output-layer activation function is a sigmoid function (Eq. 21). A larger output of the discriminator indicates a larger probability that the input is real data. The cost function (Eq. 22) of the discriminator consists of a discriminator discrimination cost (Eq. 23) and a regularization cost.

$$ sigmoid(x) = \frac{1}{{1 + e^{ - x} }} $$
(21)
$$ c_{d} ({\mathbf{W}},{\mathbf{b}}) = reg({\mathbf{W}}) + gan_{d} ({\mathbf{X}},{\mathbf{Y}}) $$
(22)
$$ gan_{d} ({\mathbf{X}},{\mathbf{Y}}) = - \lambda_{gan} (\log (D({\mathbf{X}})) + \log (1 - D({\mathbf{Y}}))). $$
(23)

3.4 The coupling of neural network for FEIA framework

In this section, the settings of neural network for FEIA framework in the experiment are discussed. In detail, five points for the network are introduced in the following.

3.4.1 Activation function

The data of the double gyre in ROMS can be negative, and the reduced solution in the FEIA framework can also be negative. Therefore, the activation function of the output layer of the mapper and the re-constructor is set to the linear activation function, and the activation function of the other latent layers is set to the leaky ReLU function (Eq. 24).

$$ leaky\_relu(x) = \left\{ {\begin{array}{*{20}c} {x,} & {{\text{if}}{\kern 1pt} x \ge 0} \\ {ax,} & {{\text{if}}{\kern 1pt} x < 0} \\ \end{array} } \right. $$
(24)

where a is the negative slope factor, which is set to the empirical value 0.2 in the experiment.

3.4.2 Re-constructor bias

CNOP is an optimization problem with a constraint, and the bias of the re-constructor can make the solution violate the constraint. According to Eq. 8, although the projection function changes the value of the latent solution wj, the result of the norm is always influenced by b. For example, Eq. 25 shows the influence of b on the computation of the squared 2-norm in a two-layer linear network.

$$ \left\| w_{s}(\tau x) + b \right\|_{2}^{2} = (w_{s}(\tau x))^{2} + 2 w_{s}(\tau x) b + b^{2} $$
(25)

where ws is the summed weight for the element x in the data vector, b is the bias for x and τ is the projection coefficient for x. It can be seen that τ does not influence the b^2 term in the result, so the constraint might never be satisfied. Therefore, in the experiment, the bias of the re-constructor is always set to 0, so that the constraint can be satisfied with the constraint projection function.

3.4.3 Weight parameter selection

As described in Sects. 3.2 and 3.3, several weight constants λ are introduced to construct the cost functions of the networks. Considering that the mean square error is the main cost, this paper evaluates the various costs in Sects. 3.2 and 3.3 before training, and the λs are chosen so that each auxiliary cost is about one percent of the initial value of the mean square error. Therefore, the network does not ignore the main intent and is still assisted by the auxiliary costs in the later training stage. The values of the λs are as follows: λreg is set to 10^-6, λspa is set to 10^-3, λvae is set to 10^-4 and λgan is set to 10^-4.

3.4.4 Training data and validation data

Properly determining the training data and the validation data helps the training of the network. In the FEIA framework, samples in the possible solution space can be chosen as the training data, and the initial solution for the intelligent algorithm can be chosen as the validation data. In this way, how well the network fits the solving task can be evaluated simply, and the network can be adjusted further.

For example, for the double gyre simulated in ROMS, a set of non-periodic oscillation data and a set of steady data are obtained by adjusting the model parameters according to Yuan (2019b). The differences between the non-periodic oscillation data and the steady data make up the training data, whose size is 2000 × 54,776. The training data are also the original matrix used to carry out PCA. On the other hand, the initial solution is constructed from the difference between jet-down data and symmetry data, and this initial solution for the intelligent algorithm component is used as the validation data.

3.4.5 Training process

The Adam optimizer (Kingma et al. 2014) is used to train the networks in the experiment, and the flow of the training is as follows:

figure b (pseudocode of the training flow)

The shuffle operation is important: it eliminates the influence of the data ordering and improves the generalization ability. In the experiment, the number of epochs is first set to 1000, and the performance of the network is observed. According to the performance on the validation data, the number of epochs is then set to the value that brings the validation data closest to the best evaluation. The values of other parameters such as Sizebatch and lr are discussed in the experiment.

All the networks in the experiment are trained with TensorFlow, and the parameters that are not mentioned are set to the default values in TensorFlow.
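A hedged TensorFlow sketch of this training flow (summarized in figure b) is given below: the samples are shuffled every epoch, mini-batches are updated with Adam, and the related error on the validation data is tracked to select the best epoch; the function names, the batch handling and the early-stopping criterion are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
import tensorflow as tf

def train(model, loss_fn, X_train, x_val, epochs=1000, batch_size=40, lr=1e-4):
    """x_val is assumed to already carry a batch dimension."""
    opt = tf.keras.optimizers.Adam(lr)
    best_err, best_weights = np.inf, model.get_weights()
    for epoch in range(epochs):
        idx = np.random.permutation(len(X_train))                     # shuffle every epoch
        for start in range(0, len(X_train), batch_size):
            batch = tf.constant(X_train[idx[start:start + batch_size]], tf.float32)
            with tf.GradientTape() as tape:
                loss = loss_fn(batch, model(batch, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            opt.apply_gradients(zip(grads, model.trainable_variables))
        rec = model(tf.constant(x_val, tf.float32), training=False).numpy()
        err = np.linalg.norm(x_val - rec) / (np.linalg.norm(x_val) + 1e-8)  # related error of Eq. 26
        if err < best_err:
            best_err, best_weights = err, model.get_weights()
    model.set_weights(best_weights)                                   # keep the weights of the best epoch
    return best_err
```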

4 Experiment and results

This section presents the experiments in which the FEIA framework with neural networks solves the CNOP of the double gyre in ROMS. The problem and methods can be reviewed in Sects. 2 and 3. Five feature dimensions, namely 20, 40, 60, 80 and 100, are tested in the experiment. As the reference data, the results obtained without reduction and with PCA are shown in Tables 1 and 2.

Table 1 The single experiment result of FEIA framework (PCA) and no reduction
Table 2 The statistical results of FEIA framework (PCA) over ten runs

From the above tables, it can be seen that the objective value obtained without reduction is quite small and the corresponding solution has no chance of leading to double-gyre variation, which is verified in the next experiment. With the reduction by PCA, the results show a significant improvement, and an effective solution can be obtained when the feature dimension is set to 40. However, it can also be seen that the probability of obtaining an effective solution is low. This paper suggests two ways of applying neural networks in the FEIA framework, and the corresponding experiments and results are shown below.

4.1 The experiment for the first way

As described in Sect. 3.2, the first way is to train a network that can serve as both the mapper and the re-constructor. AE and its three variants are tested, and the corresponding structures are shown in Table 3. One point to note is that the number of units in many layers is set to 256, which is limited by the machine's memory.

Table 3 The tested network structures of the first way

In the training stage, the batch size and the learning rate of the optimizer are adjusted to investigate their influence. Figures 6, 7, 8 and 9 show the curves of the related error (Eq. 26) on the validation data for the four networks within 1000 epochs.

$$ related\_error\,=\,\frac{{\left\| {{\mathbf{X}} - r(p({\mathbf{X}}))} \right\|_{F} }}{{\left\| {\mathbf{X}} \right\|_{F} + e}} $$
(26)

where X is the original data, p(⋅) is the mapper of the FEIA framework, r(⋅) is the re-constructor of the FEIA framework, ||⋅||F is the Frobenius norm and e is a small constant. The related error represents the relative difference between the original data and the reconstructed data, and it can be used to evaluate the quality of the network. It is worth noting that the related error is actually caused by the mapper rather than the re-constructor. Therefore, for the FEIA framework, the only related error appears in the stage of constructing the initial feature solution.

Fig. 6 The related error varying curves of validation data for AE (BS is the batch size and LR is the learning rate, which are the same in Figs. 7, 8 and 9 and in Tables 4 and 5)

Fig. 7 The related error varying curves of validation data for SAE

From Fig. 9, it can be found that VAE cannot converge and has a large related error during the 1000 epochs. This is because the feature generating process of VAE relies on random sampling, which leads to instability and makes it difficult to select an epoch with a proper related error, so VAE is not tested and discussed in the following experiments. From Figs. 6, 7 and 8, it can be found that (1) the batch size does not seem to have a significant or consistent influence on the different networks; for example, with the increase in the batch size, the degree of overfitting decreases for AE and SAE but increases for CAE; (2) decreasing the learning rate can decrease the degree of oscillation and overfitting. Meanwhile, from Figs. 6, 7 and 8, the epoch which brings the validation data closest to the best related error is selected, and the corresponding networks used in the FEIA framework are trained with early stopping. Tables 4, 5 and 6 show the related errors and the single experiment results for the trained networks.

Fig. 8 The related error varying curves of validation data for CAE

Fig. 9 The related error varying curves of validation data for VAE

Table 4 The related errors and the single experiment results for AE
Table 5 The related errors and the single experiment results for SAE

From Tables 4, 5 and 6, it can be found that SAE has the best performance among the trained networks. One possible reason is that the size of the training data is only 2000, and a simple network structure can decrease the degree of overfitting; SAE has a relatively simple structure compared with AE and CAE, so it shows the best performance. From Tables 4, 5 and 6, it can also be found that the batch size and the learning rate do not seem to have a significant influence on the results. Therefore, in the following experiments, the batch size is set to 40 and the learning rate is set to 10^-4, which avoids the influence of extreme values on the judgment. On the other hand, the results in Tables 4, 5 and 6 verify that the first way of applying a neural network (in particular SAE) in the FEIA framework is effective and can even find a better solution. In order to further verify it, the statistical results of SAE over ten runs are shown in Table 7.

Table 6 The related errors and the single experiment results for CAE

Compared with the results of PCA in Table 2, the results of SAE in Table 7 further verify the effectiveness of the first way. Compared with PCA, SAE can not only obtain results in the interval (1.55–1.65), where the solutions may lead to the variation, but also results in the interval (> 1.65), where the solutions almost certainly lead to the variation. The mean value and the max value of the results in Table 7 are both larger than those in Table 2. On the other hand, in Table 2, PCA only helps to obtain an effective solution when the feature dimension is 40, whereas SAE helps to obtain an effective solution for all feature dimensions except 20. One possible reason is that PCA is not trained for a specific feature dimension, whereas the neural network is trained for the specific feature dimension. Therefore, compared with PCA, the neural network might save the cost of determining the feature dimension.

Table 7 The statistical results of FEIA framework (SAE) over ten runs

4.2 The experiment for the second way

As shown in Sect. 3.3, the second way is combining PCA and the network, where PCA serves as the mapper and the network serves as the re-constructor. In this section, decoder and GAN are tested to serve as the re-constructor, and the corresponding structures are shown in Table 8.

Table 8 The tested network structures of the second way

Because simpler structures obtain better results in Sect. 4.1, the number of extra latent layers is adjusted to investigate its influence. Figures 10 and 11 show the curves of the related error on the validation data for the two networks within 1000 epochs. With the best epochs selected from Figs. 10 and 11, the related errors and the single experiment results for the trained networks are shown in Tables 9 and 10.

Fig. 10 The related error varying curves of validation data for decoder

Fig. 11 The related error varying curves of validation data for GAN

Table 9 The related errors and the single experiment results for decoder
Table 10 The related errors and the single experiment results for GAN

From Figs. 10 and 11, it can be found that decreasing the number of extra latent layers can decrease the degree of oscillation and overfitting. From Tables 9 and 10, it can be found that decreasing the number of extra latent layers also leads to better related errors and objective function values. These results further verify the conclusion in Sect. 4.1: if the training sample set is small, a simpler structure is more suitable for the FEIA framework. On the other hand, the results of solving CNOP are similar for the two structures, and both structures can obtain effective solutions leading to double-gyre variation, which verifies the effectiveness of the second way. Similar to Sect. 4.1, GAN (EL = 0) is selected and run ten times to further verify the performance of the second way, and the results are shown in Table 11.

Table 11 The statistical results of FEIA framework (GAN) over ten runs

The results in Table 11 are similar to those in Table 7. Compared with PCA, the results of the second way show the following characteristics: (1) solutions in the better interval can be obtained; (2) better mean and max values of the objective function can be obtained; and (3) effective solutions can be obtained for more feature dimensions. These results further show the effectiveness of the second way.

4.3 The discussion for the result

The above experiments verify the effectiveness of neural networks in the FEIA framework. Both the first way and the second way can find effective solutions that lead to double-gyre variation, and with proper training the FEIA framework with a neural network can even show a better performance than the FEIA framework with PCA. Based on the above experiments, the optimization details of the networks and the performance of the FEIA framework with neural networks are summarized and discussed.

4.3.1 Structure and parameter selection

In the experiments, the influences of the batch size, the learning rate and the complexity of the network are tested. Under the condition that the size of the training data is relatively small (2000 × 54,776), their influence can be summarized as follows: (1) the complexity of the network has the largest influence, the batch size has the smallest, and the influence of the learning rate is in between; (2) one of the main causes of performance loss is overfitting, and decreasing the complexity and the learning rate can decrease the degree of oscillation and overfitting; (3) although decreasing the learning rate can also decrease the degree of overfitting, it does not lead to a smaller error at the best epoch; therefore, with early stopping, only changing the complexity brings a relatively significant improvement. In summary, in order to obtain a proper mapper and re-constructor for the FEIA framework, the structure of the network needs to be considered first according to the characteristics and size of the data; then, adjusting the training process, such as decreasing the learning rate, might give a further improvement.

4.3.2 Performance analysis

In fact, the process of PCA is generally faster than training a network. However, in some problems, such as solving the CNOP of the double gyre in the experiment, the main time cost lies in the calculation of the intelligent algorithm. In this paper, because the calculation of the objective function involves the integration of the model, even the training time of CAE, which is the highest among the networks trained in the experiments, is lower than the time cost of the intelligent algorithm. Therefore, training a proper mapper and re-constructor is relatively more important than obtaining the mapper and re-constructor quickly. In the experiments, neural networks show three advantages compared with PCA: (1) neural networks help to find solutions in the better interval, where the solutions almost certainly lead to the variation; (2) the solutions obtained by the FEIA framework with neural networks have larger mean and max values; (3) effective solutions, which can lead to the variation, can be obtained for more feature dimensions. The first two points suggest that a neural network with a proper design can construct a better mapping–reconstruction structure to help the FEIA framework solve the problem. The last point shows that neural networks might save the cost of determining the feature dimension, which means the number of runs of the intelligent algorithm part might be decreased. These advantages might arise because a neural network can do a more specialized fitting than PCA; for example, PCA is not trained for a specific feature dimension, but the neural network is trained for the specific feature dimension. In summary, according to the results of the experiments, the neural network is proven to be an effective component that can be applied in the FEIA framework, and with a proper design, the performance of the FEIA framework with a neural network might be better than that of the FEIA framework with the classical method.

5 Conclusion

In this paper, two ways of applying neural networks in the FEIA framework are suggested. The first way is to train a network to serve as both the mapper and the re-constructor, and the second way is to use a classical method as the mapper and to train a network as the corresponding re-constructor. With the experiments solving the CNOP of double-gyre variation in ROMS, how to train a proper neural network in the FEIA framework is discussed, and the good performance of the FEIA framework with neural networks is verified. Compared with PCA, a neural network with a proper design can construct a better mapping–reconstruction structure. Therefore, the solutions obtained by the FEIA framework with neural networks achieve better objective values and have a larger probability of leading to the expected physical phenomenon.

In fact, besides the CNOP problem, the FEIA framework can be applied to many other problems. In this paper, the size of the training data for the networks is relatively small. It is worth looking into the performance of the FEIA framework on more problems and into ways of applying the networks with more data.