1 Introduction

The fuzzy cognitive map (FCM) is a class of computationally intelligent model that inherits properties from both the fuzzy logic and neural network domains. FCMs were initially introduced by Kosko in 1986 [1] as a cognitive structure capable of describing the dynamic behavior of a given system. These graph-based models consist of concept nodes that influence each other through signed weighted edges in a recurrent way. Concept nodes express the principal elements of the system under investigation, such as states, variables and goals, while weighted interconnections describe the causal relations among system elements. Feedback connections are also supported, introducing a memory state representation feature for concept nodes. FCMs and their extensions have been successfully applied in a wide range of applications such as: system modeling [2, 3], classification [4,5,6,7,8], decision making [9, 10], industrial systems [11, 12], time series prediction [13,14,15,16,17,18], control problems [19, 20], business model analysis [21], health services quality evaluation [22], robotic navigation [23] and medical applications [24, 25].

The wide applicability of FCMs has led to an intensified interest in developing different topological and training extensions. Fuzzy Grey Cognitive Maps (FGCMs) [26] combine grey systems theory and fuzzy cognitive maps to deal with problems characterized by high uncertainty. High-order Intuitionistic FCMs (IFCMs) [27] are based on evidential reasoning theory in order to model complex systems with both qualitative and quantitative uncertainties. Instead of numerical values, the weight interconnections in Granular Cognitive Maps [28] are defined by intervals and fuzzy sets. Fuzzy-Rough Cognitive Networks (FRCNs) [29, 30] eliminate the parameter learning requirements of rough cognitive network (RCN) [31] models, presenting performance comparable to state-of-the-art classifiers. Multi-Layer Fuzzy Cognitive Maps (MLFCMs) [32] decompose a complex system into smaller structures of related nodes organized in layers, providing enhanced decision support capabilities and increased explainability. A deep fuzzy learning approach has been proposed in [33] for time series classification based on FCMs. In this method, original time series are elevated to a fuzzy concept space and an FCM-based representation of each time series is formed.

Regarding the training mechanism of the interconnection weights in FCM-based models, diverse algorithms have been proposed [34, 35]. A plethora of algorithms are related to Hebbian learning, for instance Active Hebbian Learning (AHL) [36] and data-driven non-linear Hebbian learning (DDNHL) [37]. Evolutionary algorithms have also been adopted for training [38]. Moreover, the learning procedure has been modeled as a multi-modal optimization problem, formulating a niching-based multi-modal multi-agent genetic algorithm \((NMM_{MAGA})\) [39], which learns several candidate FCMs and selects one FCM among those learned candidates.

The Fuzzy Cognitive Network (FCN) is an FCM-based learning framework that encapsulates proper convergence conditions and an essential weight updating procedure based on linear and bi-linear parametric modeling [20]. Recently, functional forms of weight interconnections have been proposed in [6] in order to avoid storing the acquired weight knowledge in large fuzzy rule databases and to eliminate the intervention of experts in the principal elements of the fuzzy inference procedures.

Compared with conventional FCMs, FCNs demonstrate remarkable convergence properties. Moreover, by employing functional weights they become capable of approximating highly complex non-linear input–output mappings, which explains their success in pattern recognition and system approximation tasks. However, their structure contains only one layer of nodes and therefore lacks the depth that could enhance their representational power. A challenging task is thus their transition to deep structures, leveraging the learning capabilities of deep neural networks to enhance their efficiency and extend their applicability.

To this end, we first study the transition difficulties of Fuzzy Cognitive Networks with functional weights (FCNs-FW) towards deep learning structures. These difficulties are directly connected with the topological limitations of FCMs when adding intermediate layers between the input and output layers. In the context of this analysis, the incorporation of hidden layers in a feed-forward topology is considered in order to expose the constraints induced on the adaptive learning approach of the network. Then we propose three hybrid representations in this direction. More specifically, the main contributions of this work are:

1. The study of the learning restrictions of FCNs when adding intermediate layers towards deep FCN formulations.

2. The introduction of three hybrid representations towards the development of deep learning structures in the general FCN framework, namely: a) Convolutional Neural Network-FCN (CNN-FCN); b) Echo State Network-FCN (ESN-FCN); c) Autoencoder-FCN (AE-FCN).

These distinct approaches combine the representational properties of CNNs, ESNs and autoencoders, respectively, with the learning capabilities of FCNs-FW, which handle polynomial structures of weights instead of plain values. A set of diverse benchmark datasets is considered for evaluation purposes, including: (a) handwritten digit recognition (MNIST); (b) stock prediction (S&P 500); (c) index tracking (HSI); and (d) remaining useful life prediction (C-MAPSS turbofan engine). The latter is used in order to compare all three hybrid implementations on a turbofan engine degradation dataset provided by NASA.

The rest of this paper is organized as follows: Sect. 2 presents the basic concepts of FCNs. Section 3 is devoted to the analysis of the learning limitations when hidden layers are added to the FCM and FCN structures. This analysis leads to the development of the proposed hybrid implementations and their experimental evaluation on benchmark examples. Section 4 compares all three approaches on a remaining useful life problem. Section 5 provides a discussion of the research outcomes of this work. Results and contributions are given in Sect. 6.

2 FCN Description

Fuzzy Cognitive Networks (FCNs) [40, 20] constitute a direct extension of Fuzzy Cognitive Maps (FCMs) [1], introducing an enhanced interaction and synergy with the system they describe. The learning mechanism encapsulates proper convergence conditions and compact adaptive algorithms, allowing partial human knowledge in the model. The acquired knowledge is stored automatically in the form of fuzzy rules. Later improvements of the FCN information storage mechanism replace the use of fuzzy rule databases with functional representation concepts from fuzzy inference systems. This way the network’s weight associations are replaced by polynomial surrogates, avoiding the use of large fuzzy rule databases and leading to FCNs with Functional Weights (FCNs-FW). Moreover, the increased memory requirements of large fuzzy rule databases are avoided and the required human intervention in issues related to the principal elements of the fuzzy inference procedures is minimized. As a result of this methodology, FCNs have been applied in a wide range of applications.

2.1 Basic Notions and Graphical Representation

The core conception of Fuzzy Cognitive Maps is based on the development of appropriate graph structures that represent causal knowledge reasoning. Graphs are comprised of concepts and weighted arcs, representing the dynamic behavior of a system. Each concept describes an attribute of the system (input variable, trend, operational parameter, output, action) and each weighted arc denotes a different degree of causality between causal objects (concept nodes). Weight interconnections can indicate a direct or inverse influential relation between two concepts and can also be utilized to model concepts’ self-feedback connections. Together, the concept nodes and weighted arcs compose the knowledge of the underlying system and describe its dynamical behavior.

Each concept node in the network is symbolised as \(C_i\), with \(i=1,2,\ldots ,n\), and its state value is represented as \(A_i\), which is quantified in the interval [0, 1] (or [-1, 1]). The concept state values are gathered uniformly in a vector as:

$$\begin{aligned} {\textbf {A}}={{\left[ \begin{matrix} {{A}_{1}} & {{A}_{2}} & \ldots & {{A}_{n}} \end{matrix} \right] }^{\top }} \end{aligned}$$
(1)

The property of causality between nodes is expressed through the utilization of weighted edges, \(w_{ij}\), which are restricted in the interval [-1,1]. Thus, the cause-effect relationship is translated into weight interconnections, which in turn affect the activation degree of the node values. The level of influence between concepts \(C_{i}\) and \(C_{j}\) is related to the absolute value of \(w_{ij}\), while the sign of the weight demonstrates whether this influential relation is direct or inverse. Three potential conditions emerge: a) Direct: \(w_{ij}>0\) indicates that an increase (decrease) of concept \(C_{j}\) will increase (decrease) the value of concept node \(C_{i}\); b) Inverse: \(w_{ij}<0\) means that an increase (decrease) of the value of concept \(C_{j}\) will result in a decrease (increase) of concept \(C_{i}\); and c) No connection: if \(w_{ij}=0\) there is no relation between these nodes. A node that influences other nodes but is not influenced by them is called a steady node.

Fig. 1 FCN scheme. a Example of general system modeling representation with 5 concept nodes; b example of an FCN classifier with three inputs and two outputs; c learning process; d knowledge storage using a fuzzy rule database; e knowledge storage adopting functional weight representations

The next state (activation degree) of each concept is computed using an iterative equation in which the possible influence to this specific concept from its causal concepts is encapsulated:

$$\begin{aligned} {{A}_{i}}\left( k+1 \right) =f\left( {{d}_{ii}}{{A}_{i}}\left( k \right) +\sum \limits _{{\begin{matrix} j=1 \\ j\ne i \end{matrix}}}^{n}{{{w}_{ij}}{{A}_{j}}\left( k \right) } \right) \end{aligned}$$
(2)

where \(A_i(k+1)\) is the activation level of concept \(C_i\) at iteration \(k+1\), and \(A_i(k)\) and \(A_j(k)\) are the values of concepts \(C_i\) and \(C_j\) at discrete time k, respectively. Variable \(d_{ii}\) is associated with the existence of strong or weak self-feedback of node i and \(w_{ij}\) are the interconnection weight values lying in the interval [-1, 1]. The activation function f is a non-linear continuous squashing function which maps the concept values into the interval [a, b]. The most commonly used squashing function is the logistic sigmoid, \(f=1/(1+{e}^{-{{c}_{l}}x})\), which quantifies the concept states into the interval [0,1]. A typical FCN structure with 5 concept nodes is illustrated in Fig. 1a. Each row of the weight interconnection matrix contains the degrees by which the corresponding node is influenced by the other nodes, and the diagonal values show the self-feedback values of each concept.
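The iterative activation rule of Eq. 2 can be sketched as follows; the weight values, self-feedback strengths and stopping tolerance are illustrative choices and not values from this work:

```python
import numpy as np

def sigmoid(x, c_l=1.0):
    return 1.0 / (1.0 + np.exp(-c_l * x))

def fcn_step(A, W, d):
    # Eq. 2: next state of every node from its self-feedback and its causal neighbors;
    # W[i, j] is the influence of node j on node i, d holds the self-feedback strengths.
    off_diag = W - np.diag(np.diag(W))
    return sigmoid(d * A + off_diag @ A)

def run_to_equilibrium(A0, W, d, tol=1e-6, max_iter=500):
    A = A0.copy()
    for _ in range(max_iter):
        A_next = fcn_step(A, W, d)
        if np.max(np.abs(A_next - A)) < tol:   # fixed-point attractor reached
            return A_next
        A = A_next
    return A                                   # may correspond to a limit cycle or chaos

# toy example with 3 concept nodes
W = np.array([[ 0.0, 0.4, -0.3],
              [ 0.2, 0.0,  0.5],
              [-0.1, 0.6,  0.0]])
d = np.array([0.1, 0.1, 0.1])                  # weak self-feedback
print(run_to_equilibrium(np.array([0.5, 0.2, 0.8]), W, d))
```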

Note that in pattern classification tasks, an FCN structure includes steady nodes to represent input concepts and non-steady nodes for output concepts, as presented in Fig. 1b. Steady nodes influence but are not influenced by others, which means that during the iterative computation of Eq. 2 they remain the same (hence steady). In a simplified manner, the steady nodes act like different external excitations. Regarding output concepts, interconnections among each other can optionally be included (dashed lines in Fig. 1b), incorporating weights among output concepts. The inclusion of these weights has been studied in [6], concluding that this operational mode slightly enhances the network’s performance, because additional information coming from the other output nodes is exploited. However, more weights then have to be trained due to the incorporation of interconnections between output concepts. Therefore the inclusion of such interconnections should be assessed in a trade-off manner.

2.2 Adaptive Learning Approaches

A recognized issue in conventional FCMs is related to the network’s state convergence. Starting from an initial state, the activation rule (Eq. 2) may arrive at one of the following behavioral states after a number of iterations: a) fixed-point attractor (the network’s state becomes fixed after a number of iterations), b) limit cycle (oscillations between state vectors), c) chaotic attractor (chaotic behavior). In FCNs, the learning mechanism copes with this convergence challenge by including proper conditions that satisfy the existence and uniqueness of a solution [41].

Therefore, convergence to the desired equilibrium is assured by specific conditions on the network’s parameters (weights and inclination parameters of the sigmoid functions) and by employing parameter projection methods to satisfy these conditions. The complete framework of FCNs [20] incorporates an essential learning procedure based on two distinct updating algorithms (linear and bi-linear parametric modeling). The training procedure modifies the trainable parameters to reduce the error between the desired state \(A^{des}\) and the network’s converged equilibrium state \(A^{eq}\). In the linear parametric model (inclination parameters assumed to be equal to one) the algorithm adjusts solely the weights, while in the case of bi-linear modeling weights and inclination parameters are estimated.

Assuming an operating condition as a desired state \(A^{des}\) then Eq. 2 can be written as:

$$\begin{aligned} {{\textbf {A}}^{des}}=f\left( {{\textbf {W}}^{*}}{{\textbf {A}}^{des}} \right) \end{aligned}$$
(3)

where \({\textbf {A}}^{des}={{\left[ \begin{matrix} A_{1}^{des} & A_{2}^{des} & \ldots & A_{n}^{des} \end{matrix} \right] }^{\top }}\) is a column vector, \({\textbf {W}}^*\) is the ideal weight matrix which fulfills Eq. 3 and f is a vector-valued function \(f:{{\Re }^{n}}\rightarrow {{\Re }^{n}}\), defined as \(f(x)={{\left[ \begin{matrix} {{f_1}(x_1)} & {{f_2}(x_2)} & \ldots & {{f_n}(x_n)} \end{matrix} \right] }^{\top }}\), where \(x \in \Re ^n\) and \({{f_i}(x_i)}=\frac{1}{1+{{e}^{-x_i}}}\), for \(i=1,2,\ldots ,n\), with n the number of nodes. The weight updating rule for the linear parametric model is given by:

$$\begin{aligned} {{\textbf {w}}_{i}}\left( k \right) ={{\textbf {w}}_{i}}\left( k-1 \right) +\alpha {\varvec{\varepsilon }_{i}}\left( k \right) {{({{\textbf {A}}^{des}})}^{\top }} \end{aligned}$$
(4)

and the error law is given by:

$$\begin{aligned} {\varvec{\varepsilon }_{i}}\left( k \right) =\frac{f_{i}^{-1}({\textbf {A}}_{i}^{des})-{{\textbf {w}}_{i}}\left( k-1 \right) {{\textbf {A}}^{des}}}{c+{{({{\textbf {A}}^{des}})}^{\top }}{{\textbf {A}}^{des}}} \end{aligned}$$
(5)

where \({\textbf {w}}_i(k)\) and \({\textbf {w}}_i(k-1)\) are the \(i^{th}\) rows of the weight matrices \({\textbf {W}}(k)\) and \({\textbf {W}}(k-1)\) respectively, \({\textbf {A}}^{des}\) is a constant vector which describes the equilibrium point of each concept, \(f_i^{ - 1}({\textbf {A}}_i^{des}) = \ln \left( {{\textbf {A}}_i^{des}/\left( {1 - {\textbf {A}}_i^{des}} \right) } \right)\) is constant and results straightforwardly from the sigmoid function with \(c_l=1\), and \(\alpha ,c > 0\) are design parameters.
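A minimal sketch of the linear parametric updating rule of Eqs. 4 and 5 for a single desired operating condition; the learning gains, number of sweeps and node partition are illustrative assumptions:

```python
import numpy as np

def train_linear_fcn(A_des, out_idx, alpha=0.1, c=1.0, n_sweeps=200):
    """Adapt the rows w_i of W so that the equilibrium matches A_des (Eqs. 4-5).

    A_des   : desired concept state vector, values strictly inside (0, 1)
    out_idx : indices of the non-steady (output) nodes whose rows are trained
    """
    n = A_des.size
    W = np.zeros((n, n))                       # initial weight guess
    denom = c + A_des @ A_des                  # constant denominator of Eq. 5
    for _ in range(n_sweeps):
        for i in out_idx:
            # Eq. 5: normalized estimation error for row i (c_l = 1)
            eps_i = (np.log(A_des[i] / (1.0 - A_des[i])) - W[i] @ A_des) / denom
            # Eq. 4: row update driven by the error
            W[i] = W[i] + alpha * eps_i * A_des
    return W

A_des = np.array([0.7, 0.3, 0.9, 0.6])         # two steady inputs, two output nodes
print(train_linear_fcn(A_des, out_idx=[2, 3]))
```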

In the case of bi-linear modeling, both weight and inclination parameters are estimated, providing a more flexible approach to equilibrium convergence. The weight estimation is performed following Eq. 4 and the inclination parameters are updated by:

$$\begin{aligned} {{\textbf {c}}_{{{l}_{i}}}}\left( k \right) ={{\textbf {c}}_{{{l}_{i}}}}\left( k-1 \right) +\gamma {\varvec{\varepsilon }_{i}}\left( k \right) \left( {{\textbf {w}}_{i}}\left( k-1 \right) {{\textbf {A}}^{des}} \right) \end{aligned}$$
(6)

where \(c_{l_{i}}(k)\) is the inclination parameter of the squashing function for each concept i and \(\gamma\) is a learning rate factor. The corresponding error of the adaptive estimation algorithm based on bi-linear modeling is of the form:

$$\begin{aligned} {\varvec{\varepsilon }_{i}}\left( k \right) =\frac{f_{i}^{-1}({\textbf {A}}_{i}^{des})-{{\textbf {c}}_{{{l}_{i}}}}\left( k-1 \right) {{\textbf {w}}_{i}}\left( k-1 \right) {{\textbf {A}}^{des}}}{c+{{\textbf {c}}_{{{l}_{i}}}}(0){{({{\textbf {A}}^{des}})}^{\top }}{{\textbf {A}}^{des}}+\gamma {{\left( {{\textbf {w}}_{i}}\left( k-1 \right) {{\textbf {A}}^{des}} \right) }^{2}}} \end{aligned}$$
(7)

The bi-linear parametric model introduces slightly enhanced performance during training as a result of incorporating both weight and inclination parameters. This provides a more flexible structure allowing the use of smaller FCN structures [42]. A comparison study between linear and bi-linear training models on a variety of classification datasets has been given in [6], providing evidence of this overall performance improvement. Figure 1c shows the general learning scheme of the FCN classifier with three inputs and two output nodes. However, in this work we use exclusively the linear parametric model during learning.

2.3 FCN Topology (Input–Output)

This topology describes the basic architecture of FCNs in a supervised feed-forward manner, where the input patterns are represented as steady nodes and the output concepts as non-steady ones, as depicted in Fig. 1b. Both input and output nodes/neurons constitute the concept state vector A. The operating procedure is applied repetitively, computing the next state of each concept (activation degree of each node). During training, the weight interconnections are modified properly, satisfying specific conditions of existence and uniqueness of a solution, in order to minimize the error between the desired and the equilibrium states in a supervised manner. The equilibrium state (\(A^{eq}\)) is the state to which the FCN converges for a given pair of initial concept value vector and weight interconnection matrix.

2.4 Knowledge Storage in the Form of Functional Weights

After the training process, the accumulated knowledge is stored in the different pairs of weight matrices with the corresponding values of the equilibrium points (operating condition point vector). Each association pair expresses the unique path that allows the FCN to capture every operating condition of the physical system under examination through i/o data associations. Thus, the accumulated knowledge can be stored in a fuzzy if-then rule database in order to later recall the relevant associations using a fuzzy inference procedure [41, 42]. An example of such an implementation is depicted in Fig. 1d, where every operating condition is expressed by the left-hand side (if part), while the weights that drive the FCN to this equilibrium state constitute the right-hand side (then part). In the case of the bi-linear parametric model, inclination parameters are also included on the right-hand side.

Alternatively, the idea of functional weight representations for the network’s interconnections was initially used in diverse applications such as motor fault detection [43], system identification and indirect inverse control in a two-tank system [44], switching control of a DC motor [45] and incipient short-circuit fault detection in induction generators [46]. The complete framework of functional weights has been described in full detail in [6], offering a compact and flexible structure of weight interconnections where: i) the network can handle multiple relationships between nodes; ii) the network is relieved of the requirements related to the use of large fuzzy rule databases; iii) human intervention is minimized in the issues related to the principal elements of the fuzzy inference procedures.

Considering a fuzzy system with n inputs (\(x_j\) with j=1,2,\(\ldots\),n), m outputs (\(y_k\) with k=1,2,\(\ldots\),m) and N rules, then in the context of fuzzy IF-THEN rules, the \(i^{th}\) rule will be given as:

\({{R}^{i}}\): IF \({{x}_{1}}\) is \(A_{1}^{i}\) and \({{x}_{2}}\) is \(A_{2}^{i}\) and \(\ldots\) and \({{x}_{n}}\) is \(A_{n}^{i}\)        THEN \({{y}_{1}}\) is \(B_{1}^{i}\) and \(\ldots\) \({{y}_{m}}\) is \(B_{m}^{i}\)

The following fuzzy logic representation can be used [47]:

$$\begin{aligned} {\textbf {y}}=f({\textbf {x}}\vert \varvec{\theta })=\sum \limits _{i=1}^{N}{{{\theta }_{i}}}{{\varphi }_{i}(x)}=\varvec{\theta }^{\top }\varvec{\varphi }({\textbf {x}}) \end{aligned}$$
(8)

where \(\varvec{\theta }\) are the adjustable parameters of the output membership function centers \(z_{ki}\). More specifically, \(\varvec{\theta }={{\left[ \begin{matrix} {{\theta }_{1}} & {{\theta }_{2}} & \ldots & {{\theta }_{N}} \end{matrix} \right] }^{\top }}\), \(\varvec{\varphi }({\textbf {x}})={{\left[ \begin{matrix} {{\varphi }_{1}(x)} & {{\varphi }_{2}(x)} & \ldots & {{\varphi }_{N}(x)} \end{matrix} \right] }^{\top }}\) and \(\varphi _{i}(x)\) is the fuzzy basis function [48] defined by:

$$\begin{aligned} {{\varphi }_{i}(x)}&=\frac{\prod \limits _{j=1}^{n}{{{\mu }_{A_{j}^{i}}}\left( {{x}_{j}} \right) }}{\sum \limits _{i=1}^{N}{\prod \limits _{j=1}^{n}{{{\mu }_{A_{j}^{i}}}\left( {{x}_{j}} \right) }}}\;, \\ {{\mu }_{A_{j}^{i}}}\left( {{x}_{j}} \right)&={{e}^{-{{\left( \frac{{{x}_{j}}-c_{j}^{i}}{\sigma _{j}^{i}} \right) }^{2}}}} \end{aligned}$$
(9)

where \({{\mu }_{A_{j}^{i}}}\left( {{x}_{j}} \right)\) is the Gaussian membership function of the linguistic term \(A_{j}^{i}\), and \(c_{j}^{i}\), \(\sigma _{j}^{i}\) are the input membership function centers and spreads, respectively.
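A short sketch of the fuzzy basis functions of Eqs. 8 and 9; the rule count, Gaussian centers and spreads below are illustrative:

```python
import numpy as np

def fuzzy_basis(x, centers, spreads):
    """Eq. 9: normalized products of Gaussian memberships.

    x       : input vector of length n
    centers : (N, n) array of input membership centers c_j^i
    spreads : (N, n) array of input membership spreads sigma_j^i
    """
    mu = np.exp(-((x - centers) / spreads) ** 2)    # membership of each input per rule
    rule_strengths = np.prod(mu, axis=1)            # product over the n inputs
    return rule_strengths / np.sum(rule_strengths)  # normalization over the N rules

def fuzzy_output(x, theta, centers, spreads):
    # Eq. 8: y = theta^T phi(x)
    return theta @ fuzzy_basis(x, centers, spreads)

# toy system with n = 2 inputs and N = 3 rules
centers = np.array([[0.2, 0.3], [0.5, 0.5], [0.8, 0.7]])
spreads = np.full((3, 2), 0.25)
theta = np.array([0.1, 0.6, 0.9])                   # output membership centers
print(fuzzy_output(np.array([0.4, 0.6]), theta, centers, spreads))
```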

In order to simplify Eq. 9 and avoid the estimation of the involved fuzzy parameters, alternative functional weight forms have been proposed [6]. They are multi-variable polynomials of \(p^{th}\) order with adjustable coefficients that can be estimated using input–output information. The i/o \((x,w_k)\) associations acquired during FCN training and convergence of Eq. 4 (and Eq. 6 for the bi-linear parametric modeling) can subsequently be used for the estimation of these functional structures. The unknown polynomial coefficients can be computed using least squares methods, and the choice of the polynomial order p affects the shape of the fitted approximation. Consequently, in terms of FCNs, Eq. 8 can be expressed as a regression model for each weight interconnection:

$$\begin{aligned} {\mathrm {{\textbf {w}}}_{k}}\left( \mathrm {{\textbf {x}}} \right) =\varvec{\theta } {^{\top }}\varvec{\varphi }{\left( \mathrm {{\textbf {x}}} \right) } +\mathrm {{\textbf {e}}}\left( \mathrm {{\textbf {x}}} \right) \end{aligned}$$
(10)

where \(\varvec{\theta }\) are the unknown parameters to be estimated, \(\varvec{\varphi }\) are the known regression variables and \({\textbf {e}}\left( {\textbf {x}} \right)\) denotes the function approximation error. Supposing that there are M input–output data pairs, the generic problem of selecting the model parameters becomes:

$$\begin{aligned} \varvec{\theta } =\arg \ \underset{\theta }{\mathop {min}}\,\left[ \sum \limits _{l=1}^{M}{{{\left( {{\textbf {f}}_{\varvec{\theta } }}\left( {{\textbf {x}}_{l}} \right) -{\textbf {w}}_{k}^{l} \right) }^{2}}} + \lambda ^2 \left\| \varvec{\theta } \right\| ^{2} \right] \end{aligned}$$
(11)

where \({\textbf {f}}_{\varvec{\theta }}({\textbf {x}})\) expresses the polynomial approximation of the actual weight interconnection function \({\textbf {w}}_k\) based on M data pairs \((x, w_k)\), \(\lambda\) is a regularization parameter following a Tikhonov regularization rationale and \(\Vert .\Vert\) is the Euclidean norm. The parameter \(\lambda\) determines how much weight is given to the minimization of \(\Vert \varvec{\theta } \Vert\) relative to the minimization of the residuals in the summation.

The polynomial parameters are determined using QR decomposition with column pivoting in order to obtain plausible approximation models of \({\textbf {w}}_k\). In Fig. 1e the functional weights scheme is depicted utilizing the approximated parameters. Note that the number of regressors (monomials) is directly connected with the number of inputs and the \(p^{th}\) order of the functional weights’ polynomial forms.
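A sketch of fitting and recalling one functional weight according to Eqs. 10 and 11; for brevity the regressors are restricted to monomials without cross terms, the regularized problem is solved in closed form rather than with the pivoted QR decomposition mentioned above, and the data are synthetic:

```python
import numpy as np

def monomials(X, order=1):
    """Regressor matrix phi(x): a constant term plus per-variable monomials up to `order`
    (a simplified choice of basis, without cross terms)."""
    cols = [np.ones(X.shape[0])]
    for p in range(1, order + 1):
        cols.extend((X ** p).T)
    return np.column_stack(cols)

def fit_functional_weight(X, w_samples, order=1, lam=1e-3):
    # Eq. 11 in closed form: theta = (Phi^T Phi + lam^2 I)^{-1} Phi^T w
    Phi = monomials(X, order)
    A = Phi.T @ Phi + (lam ** 2) * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ w_samples)

def recall_weight(theta, x, order=1):
    # Eq. 10 without the approximation error term: w_k(x) = theta^T phi(x)
    return float(monomials(np.atleast_2d(x), order) @ theta)

# toy data: weight samples collected from FCN training for M = 50 input patterns
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))
w_samples = 0.4 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * rng.standard_normal(50)
theta = fit_functional_weight(X, w_samples, order=1)
print(recall_weight(theta, X[0], order=1))
```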

3 Towards Deep Structures Based on FCN

In this section, we consider adding hidden layers between input and output. As described in Subsection 2.3, all concept nodes are integrated into a single vector \({\textbf {A}}\) for each condition state. During training, each operating condition of the problem/system under investigation is considered as a desired state, \({\textbf {A}}^{des}\), and the FCN estimates its weights adaptively in order to minimize the error between \({\textbf {A}}^{des}\) and the predicted equilibrium \({\textbf {A}}^{eq}\). Adding hidden layers in a feed-forward topology gives the state vector:

$$\begin{aligned} {\textbf {A}} = {{\left[ \begin{matrix} ({A}^{in}_{1} & \ldots & {A}^{in}_{n}) & ({A}^{hi}_{1} & \ldots & {{A}^{hi}_{nn}}) & ({A}^{out}_{1} & \ldots & {{A}^{out}_{N}}) \end{matrix} \right] }^{\top }} \end{aligned}$$
(12)

where \(A^{in}\), \(A^{hi}\), \(A^{out}\) are the activation levels of the input, hidden and output nodes and n, nn, N denote the corresponding numbers of nodes. This structure constrains the adaptive learning approach, as the hidden node values would need to be given for each input pattern as an additional intermediate target.

Table 1 Comparison results in terms of accuracy (%) using simple implementation of hidden layers

Although a training procedure adopting an algorithm with the same operating properties as back-propagation could allow the incorporation of this topology while keeping the recurrent nature of FCMs and FCNs, such an implementation is not pursued in this work. Alternatively, the hidden layers of the feed-forward topology can be exploited as representation layers that perform non-linear transformations of the input patterns to feed the output layer, where the training mechanism is performed.

For example, assume that one simple representation layer is enclosed between input and output (an extension of the classic FCN topology). In this case, the inputs are forwarded to the intermediate layer, the activation degrees of its concept nodes \(({A}^{hi}_{1} \ldots {{A}^{hi}_{nn}})\) are computed iteratively using random initial weights, and the converged activation values are then utilized as inputs to the training mechanism that takes place in the output layer; a minimal sketch of this step is given below. This amounts to shifting the FCN training procedure so that it is applied between the hidden and output layers. In order to demonstrate this simple FCN representation example, the classic 28-by-28 grayscale MNIST digits dataset is chosen. Table 1 shows the evaluation results in terms of accuracy for such a case, including a different number of neurons in two distinct topologies of hidden layers. As expected, increasing the number of neurons in a shallow FCN affects the performance positively. However, increasing the number of intermediate hidden layers does not provide the expected result. At the same time, we compare the result with a well-known class of feedforward neural networks, the Extreme Learning Machine (ELM). The reason for this indicative comparison is twofold: a) the FCN representation example is a feedforward network like the ELM, which is the most appropriate candidate among single-hidden-layer feedforward neural networks; b) the FCN hidden layer(s) serve as representation layer(s) and the training procedure is shifted to be applied between the last hidden layer and the output layer; similarly, in the ELM the only free parameters to be adjusted are the connection weights between the hidden layer and the output layer. As depicted in Table 1, the FCN with functional weights presents similar performance to the ELM, but the increase of representation layers affects the classification performance negatively.
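A minimal sketch of the representation-layer step described above: the hidden concepts are iterated to convergence with fixed random weights, and their converged activations replace the raw inputs in the output-layer training. The layer sizes, iteration count and weight scaling are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_representation(x_in, W_ih, W_hh, n_iter=20):
    """Converged activations of a random (untrained) representation layer.

    x_in : steady input pattern, kept fixed across iterations
    W_ih : random input-to-hidden weights
    W_hh : random hidden-to-hidden weights
    """
    A_hi = np.full(W_hh.shape[0], 0.5)                 # neutral initial hidden state
    for _ in range(n_iter):
        A_hi = sigmoid(W_ih @ x_in + W_hh @ A_hi)      # Eq. 2 restricted to the hidden layer
    return A_hi

rng = np.random.default_rng(0)
n_in, n_hidden = 784, 100                              # MNIST-like sizes, illustrative
W_ih = rng.uniform(-1, 1, size=(n_hidden, n_in)) / np.sqrt(n_in)       # scaled to limit saturation
W_hh = rng.uniform(-1, 1, size=(n_hidden, n_hidden)) / np.sqrt(n_hidden)
x = rng.uniform(0, 1, size=n_in)                       # one flattened input pattern
features = hidden_representation(x, W_ih, W_hh)
# `features` would then feed the FCN training applied between hidden and output layers
```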

Since the above inclusion of simple representation layers does not offer any significant improvement, three alternative approaches are proposed next, namely: a) convolutional layers in a hybrid CNN-FCN topology; b) an extended intermediate layer using an Echo State Network, i.e., a hybrid ESN-FCN model; and c) an autoencoder in an AE-FCN formulation. In the following subsections, these hybrid extensions are designed for distinct topological cases and different benchmark datasets are examined to evaluate their feasibility and reliability.

3.1 Hybrid CNN-FCN

In this section the overall hybrid CNN-FCN scheme is presented and experimental results evaluate its feasibility. Such a hybrid structure can be designed with the CNN acting as a representative feature extractor and the FCN as a recognizer in a synergistic way. Processes related to feature extraction hold a key role in pattern recognition systems. It is crucial for the produced features to retain invariant properties within the same class and, at the same time, provide strong perceptible characteristics among distinct classes.

In contrast to conventional neural networks, CNNs incorporate local transformation and invariance properties through weight sharing, sub-sampling via pooling layers and local receptive fields. Thus, the salient features are extracted through convolutional layers and a process of feature selection/reduction is performed, also intending to limit overfitting. Pooling layers achieve dimensionality reduction of the feature maps in each layer, retaining the invariance property while also reducing the network’s parameters. Therefore, the spatial resolution reduction of the feature maps performed by pooling allows the salient features to be invariant to shifts and shape distortions to a certain level, as the sensitivity of the convolution output to these effects is reduced.

Fig. 2 Architecture of the LeNet-5

Fig. 3 Architecture of the hybrid CNN-FCN

Table 2 Side-by-side comparison between CNN-FCN and Lenet-5 in terms of accuracy using MNIST dataset

The Convolutional Neural Network, as introduced by LeCun in [50], is a multi-layer neural network in a deep learning architectural scheme, as presented in Fig. 2. This network is composed of two core structural components: the input data are forwarded to an automatic feature extraction network and the resulting extracted features are then forwarded to a trainable classifier, as illustrated in Fig. 2. The automatic feature extraction is performed through convolutional and pooling layer pairs, where digital filters perform the convolution and the pooling layer is used in succession to reduce the spatial size of the representation. Each output feature map of a CNN is computed as the convolution of the kernel weights with the input volume, followed by a non-linear activation function. Thus, the forward pass of the CNN is computed by:

$$\begin{aligned} {\textbf {x}}^{l}_{j}=f(({\textbf {W}}^{l}_{j}*{\textbf {x}}^{l-1})+{\textbf {b}}^{l}_j) \end{aligned},$$
(13)

where \({\textbf {x}}^{l}_{j}\) and \({\textbf {x}}^{l-1}\) are the jth output feature map and the input from the previous layer respectively, \({\textbf {W}}^{l}_{j}\) and \({\textbf {b}}^{l}_j\) are the trainable weights and bias of the output, and f is the non-linear activation function. The classifier and the learnable filter (kernel) weights are trained using the back-propagation algorithm. According to Eq. 13, the output values of the intermediate (hidden) layers are provided by the linear combination of the previous layer node values with the weights, plus a bias. Consequently, these neural values can be handled as features by the FCN classifier.

Following the aforementioned rationale, the hybrid architecture of CNN-FCN can be designed by replacing the fully connected output layer of the conventional CNN with the FCN classifier. The structure of the hybrid CNN-FCN model is depicted in Fig. 3. The numbers in the parentheses denote the corresponding feature sizes and the number of feature maps for the case of MNIST dataset, in which evaluation tests will be performed.

The training procedure of the hybrid model is conducted as follows: i) the input layer is fed with the normalized input signal (image); ii) the original CNN with the fully connected output layer (Multi-Layer Perceptron) is trained until sufficient convergence occurs (for a specified number of epochs); iii) the output layer is then replaced by the FCN; iv) training of the FCN model is managed using the output maps of the previous layer as a new feature vector; in this case 84 features are produced, as depicted in Fig. 3, and Eq. 4 and Eq. 5 are applied during the FCN training; v) the trained weights are structured in polynomial surrogates (functional weights) containing the acquired knowledge using Eq. 10 and Eq. 11, following the functional weights scheme depicted in Fig. 1e. Finally, the well-trained FCN performs the classification, retrieving weights through their functional forms, and produces new output decisions using Eq. 2 on the testing set employing the automatically extracted features. A sketch of this two-phase procedure is given below.
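A compact sketch of the two training phases in PyTorch, assuming a LeNet-5-like backbone and an MNIST data loader named `loader`; the optimizer, epoch count and layer sizes are illustrative, and the FCN training of steps iv)-v) is only indicated, not implemented here:

```python
import torch
import torch.nn as nn

class LeNetFeatures(nn.Module):
    """LeNet-5-style backbone; the final 10-way layer is the part that the hybrid
    scheme later replaces with the FCN classifier."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5),           nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(16, 120, kernel_size=5),         nn.Tanh(),
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),             # F6 layer: the 84 FCN features
        )
        self.classifier = nn.Linear(84, 10)            # temporary fully connected output

    def forward(self, x):
        f6 = self.features(x)
        return self.classifier(f6), f6

model = LeNetFeatures()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def pretrain(loader, epochs=5):
    # phase 1 (step ii): train the CNN together with its fully connected output layer
    for _ in range(epochs):
        for images, labels in loader:
            logits, _ = model(images)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def fcn_features(images):
    # phase 2 (steps iii-v): discard the classifier and keep the 84 F6 activations,
    # which become the steady input concepts of the FCN trained with Eqs. 4-5 and
    # compressed into functional weights with Eqs. 10-11.
    with torch.no_grad():
        _, f6 = model(images)
    return f6.numpy()
```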

Fig. 4 Performance comparison between LeNet-5 and CNN-FCN for different optimizers over epoch evolution. a batch 128 case; b full-batch case

In order to evaluate the performance of the hybrid CNN-FCN model, the MNIST dataset has been chosen. We also study the generalization performance under full-batch training, reusing the same (full) batch at each step, as an "extreme" case scenario for comparison purposes. In accordance with the FCN classifier, in both training and testing procedures the input patterns are represented by the 84 neural values produced by the F6 layer as a new feature vector (see Fig. 3). Table 2 presents a comparison study between the conventional convolutional neural network (LeNet-5) and the hybrid implementation (CNN-FCN) utilizing diverse optimizers. The optimization method is used throughout the LeNet-5 training process, while in the case of CNN-FCN it is used only for the first phase of learning, as underlined in step ii) above.

The numbers in parentheses (Table 2) denote the case where interconnections between the CNN-FCN output classes are incorporated. Note that this is an optional operational mode of FCNs, as described in Subsection 2.1, where interconnection weights among output concepts (classes) exist. Indicatively, the resulting accuracy of this mode is presented for the case of 100 epochs (numbers in parentheses) in Table 2. This architectural approach slightly enhances the classification performance because the FCN is able to exploit additional information coming from the other output nodes, preserving the cognitive nature of the network’s operation. However, this improvement comes with an additional computational cost, as more weights need to be trained. A visual representation of Table 2 is given in Fig. 4 for small batch size and full-batch adoption. The proposed CNN-FCN classifier produces sufficient overall performance compared with LeNet-5. The latter produces slightly better accuracy when a small batch size is used (see Fig. 4a). Although the hybrid CNN-FCN classifier is not superior in all cases, it provides notable results in all situations, avoiding local minima. Note that in the case of full-batch training, the proposed model presents a more reliable approach. Such an observation is clearly visible in Fig. 4b, especially where the Sgd, Adadelta and Adagrad optimizers are adopted. Indicatively, Fig. 5 compares side-by-side the performance after 100 epochs of LeNet-5, CNN-FCN and the CNN-FCN with weight interconnections among output concepts (optional mode) for the full-batch case.

Fig. 5 Performance comparison for the full-batch case—100 epochs

The FCN overcomes the insufficiencies that arise from vanishing gradients, poor conditioning and saddle points by treating each input pattern with its corresponding output labels as a desired equilibrium state during training. The learned multiple input–output associations are then stored in compact functional structures. This is akin to using multiple local approximators whose acquired knowledge is then compressed into polynomial structures of weights, retaining the global learning properties. For this reason, the CNN-FCN model exhibits comparative efficacy in the full-batch case. Therefore, based on the evaluated efficiency and reliability, the use of a hybrid model constituted by a convolutional representation algorithm and an FCN recognition algorithm is feasible and justifiable.

3.2 Hybrid ESN-FCN

In this section a recurrent hybrid implementation is presented, utilizing the operational structures of Echo State Networks (ESNs) and Fuzzy Cognitive Networks in conjunction. In this hybrid approach, the ESN represents an extended intermediate layer, retaining the recurrent nature of the model. The echo state network was proposed by Jaeger [51] as a dynamic Recurrent Neural Network (RNN) consisting of: i) an input layer; ii) a recurrent layer, called the reservoir, which utilizes a number of sparsely connected neurons; and iii) an output layer. There are two stages in training a typical ESN: first the states are updated from the previous step and then the output weight matrix is trained. More specifically, the dynamics of the ESN that formulate its state transition are defined as:

$$\begin{aligned} \tilde{{\textbf {x}}}(k+1)=f({\textbf {W}}_{in}{\textbf {u}}(k+1)+{\textbf {W}}_{res}{\textbf {x}}(k)) \end{aligned}$$
(14)
$$\begin{aligned} {\textbf {x}}(k+1)=(1-\alpha ){\textbf {x}}(k)+\alpha \tilde{{\textbf {x}}}(k+1) \end{aligned}$$
(15)
$$\begin{aligned} {\textbf {y}}(k+1)={\textbf {W}}_{out}[{\textbf {u}}(k+1);{\textbf {x}}(k+1)] \end{aligned}$$
(16)

where \({\textbf {u}}\in \mathbb {R}^{N_u}\), \({\textbf {x}}\in \mathbb {R}^{N_x}\), \(\tilde{{\textbf {x}}}\in \mathbb {R}^{N_x}\) and \({\textbf {y}}\in \mathbb {R}^{N_y}\) express the input vector, the internal states of the reservoir, the vector of reservoir neuron updates and the output vector, respectively, where the output is computed from the vertical concatenation of the input and reservoir neuron activations; \({\textbf {W}}_{in} \in \mathbb {R}^{N_x\times N_u}\), \({\textbf {W}}_{res} \in \mathbb {R}^{N_x\times N_x}\) and \({\textbf {W}}_{out} \in \mathbb {R}^{N_y\times (N_u+N_x)}\) are the input, recurrent and output weight matrices respectively; \(\alpha \in (0,1]\) is the leaking rate, which controls the speed with which the reservoir is updated. The output weights \({\textbf {W}}_{out}\) describe the feedforward associations between reservoir neurons and output neurons. Based on a set of data points, \({\textbf {W}}_{out}\) can be calculated by minimizing the error between the predicted output \({\textbf {y}}(k+1)\) and the target (ground truth) values \({\textbf {y}}^{target}(k+1)\). The readout weights \({\textbf {W}}_{out}\) are obtained by ridge regression as:

$$\begin{aligned} {\textbf {W}}_{out}=({\textbf {X}}^\top {\textbf {X}} + \beta \mathbb {I})^{-1}{\textbf {X}}^\top {\textbf {Y}}^{target} \end{aligned},$$
(17)

where \({\textbf {X}}\) stands for the outputs of the reservoir neurons, \({\textbf {Y}}^{target}\) contains the target outputs, \(\beta\) is the regularization term and \(\mathbb {I}\) is the identity matrix. Regarding initialization, three core hyper-parameters have to be chosen before ESN training: i) the elements of matrix \({\textbf {W}}_{in}\) are randomly initialized from a uniform distribution utilizing an input-scaling parameter \(w^{in}\), such that \({\textbf {W}}_{in}\) is initialized in \([-w^{in},w^{in}]\); ii) the sparsity parameter sp of \({\textbf {W}}_{res}\), denoting the ratio of non-zero elements in the matrix; iii) the spectral radius \(\rho ({W}_{res})\) of the reservoir, i.e., the highest absolute eigenvalue of \({\textbf {W}}_{res}\); here \({\textbf {W}}_{res}\) is obtained from a matrix W whose elements are generated randomly in [-1,1], with \(\lambda _{max}(W)\) its largest absolute eigenvalue, so that \(W_{res}=\rho ({W}_{res}) \frac{W}{\lambda _{max}(W)}\). A minimal sketch of this reservoir construction and state collection is given below.
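A sketch of the reservoir construction and state collection of Eqs. 14-15, together with the standard ridge readout of Eq. 17 (which the hybrid model replaces by the FCN); the tanh activation, hyper-parameter values and toy input sequence are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_reservoir(n_in, n_x, w_in=0.5, sp=0.5, rho=0.9):
    """Random ESN initialization using the three hyper-parameters described above."""
    W_in = rng.uniform(-w_in, w_in, size=(n_x, n_in))
    W = rng.uniform(-1, 1, size=(n_x, n_x))
    W[rng.random((n_x, n_x)) > sp] = 0.0                     # keep a ratio sp of non-zeros
    W_res = rho * W / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius rho
    return W_in, W_res

def run_reservoir(U, W_in, W_res, alpha=0.8):
    """Collect the reservoir states for an input sequence U (Eqs. 14-15, tanh assumed)."""
    x = np.zeros(W_res.shape[0])
    states = []
    for u in U:
        x_tilde = np.tanh(W_in @ u + W_res @ x)              # Eq. 14
        x = (1 - alpha) * x + alpha * x_tilde                # Eq. 15: leaky integration
        states.append(x.copy())
    return np.array(states)

def ridge_readout(X, Y_target, beta=1e-4):
    # Eq. 17: conventional readout; in the hybrid model the FCN takes its place
    return np.linalg.solve(X.T @ X + beta * np.eye(X.shape[1]), X.T @ Y_target)

# toy sequence: scalar input, 100 reservoir neurons
U = rng.uniform(0, 1, size=(200, 1))
W_in, W_res = init_reservoir(n_in=1, n_x=100)
states = run_reservoir(U, W_in, W_res)
# a subset of `states` (selected by trial and error) feeds the FCN as input concepts
```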

Fig. 6 Architecture of the hybrid ESN-FCN

Table 3 ESN-FCN evaluation results

Following the aforementioned rationale, the general hybrid ESN-FCN model is illustrated in Fig. 6, where the readout system of conventional ESNs (Eq. 17) is replaced by the FCN. Note that the weight associations of the FCN (\({\textbf {W}}_{FCN}\)) are trained using Eq. 4 and Eq. 5 (linear parametric model) and subsequently follow a functional form as described in Subsection 2.4, using a subset of the ESN states as input concepts. In order to evaluate the performance of the hybrid ESN-FCN model, the MNIST dataset is chosen as a classification problem and the S&P 500 stock index as a time series candidate. In the case of the MNIST dataset, the 28-by-28 images feed the 784 nodes of the input layer. The time series evaluation problem concerns the S&P 500 stock index for two different periods: a) from January 4, 2016 to December 30, 2016, including 251 observations of one year (120 data points are used for training, 30 for validation and the remaining 101 points for testing); and b) from January 4, 2018 to December 30, 2019, including 500 observations of two consecutive years (240 data points are used for training, 60 for validation and the remaining 200 points for testing). In all examples, the leaking rate \(\alpha\) of the ESN-FCN model was set to 0.8 by trial and error, the spectral radius was set to 0.9 and the sparsity parameter to 0.5. Table 3 presents the classification accuracy and the prediction error for MNIST and the S&P 500 stock index respectively. It should be noted that in the case of the MNIST dataset the FCN receives solely a subset of the ESN states without also adding the input u, i.e., \(FCN_{input}\subseteq ESN_{states}\), while in the case of the S&P 500 the FCN also receives the input u, thus \(FCN_{input}=[u\;subset\_of\_ESN_{states}]\) (see Table 3).

The subset of states that the FCN receives from the reservoir is selected by trial and error; the idea is to take a subset instead of the totality of states in order to obtain a more informative representation. Note that in the case of the simple implementation with one hidden layer (see Sect. 3), the corresponding accuracy for 50 and 100 hidden FCN nodes was 72.43% and 80.58% respectively (see Table 1). The ESN-FCN hybrid model slightly increases the overall performance with respect to the simple implementation with one hidden layer. However, the conventional ESN performs better in the MNIST case, while it is outperformed by the hybrid ESN-FCN model in the S&P 500 prediction problem, as depicted in Table 3. On the one hand, we observe consistency in the superiority of the hybrid model compared to the single ESN in the time series problem, as reported in Table 3 (better performance in terms of RMSE in both evaluation periods). On the other hand, the hybrid ESN-FCN model fails to produce an improved classification approach compared to the single ESN. Both the ESN and the FCN are recurrent networks which inherently enclose memory capabilities. Therefore, their merits are coupled in the time series problem, producing an enhanced model due to their temporal relationships. This operational advantage cannot be exploited in a classification task and intuitively is more likely to produce a degraded model. This is indicatively supported by the experimental results, as the hybrid ESN-FCN implementation presents slightly better performance with respect to the single FCN with one hidden layer, but practically fails to provide enhanced accuracy compared to the single ESN.

3.3 Hybrid Autoencoder-FCN

The autoencoder is a well-known class of unsupervised neural networks associated with the task of representation learning, performing reconstruction of the input from its encoded form. More specifically, high-dimensional input data are compressed to lower-dimensional representations (codes) focusing on the most important features, and the input vectors are then reconstructed back.

Fig. 7 Architecture of the hybrid Autoencoder-FCN

The autoencoder can be considered as a non-linear generalization of Principal Component Analysis (PCA) consisting of an encoder which transforms the high-dimensional input data into low-dimensional encoded data and a similar decoder which reconstructs the original high-dimensional data from the compressed knowledge representation. The encoder-decoder networks are trained together in order to minimize the discrepancy between the original and the reconstructed data. Thus, the autoencoder tries to learn an approximation function to the identity function, so as:

$$\begin{aligned} \hat{{\textbf {x}}}=f_{{\textbf {W}},{\textbf {b}}}({\textbf {x}}) \approx {{\textbf {x}}} \end{aligned},$$
(18)

where x is the input data, \({\textbf {x}}\in {{\mathbb {R}}^{n\times N}}\), with n and N the number of samples and input dimension respectively, that will be mapped through a deterministic mapping to a hidden representation \({\textbf {h}}\in {{\mathbb {R}}^{n\times m}}\) with m being the number of the compressed feature representations. The weights and biases of both encoder-decoder layers are given by \({\textbf {W}}=\{{\textbf {W}}_1,{\textbf {W}}_2\}\) and \({\textbf {b}}=\{{\textbf {b}}_1,{\textbf {b}}_2\}\) respectively. The input passes through the encoder and the representation mapping is given by:

$$\begin{aligned} {\textbf {h}}=g_{{\textbf {W}}_1,{\textbf {b}}_1}({\textbf {x}})= \sigma ( {\textbf {x}} {\textbf {W}}_1+{\textbf {b}}_1) \end{aligned},$$
(19)

where \({\textbf {W}}_1 \in {{\mathbb {R}}^{N\times m}}\) and \({\textbf {b}}_1 \in {{\mathbb {R}}^{1\times m}}\) are the parameters of the encoder part and \(\sigma\) is the element-wise activation function. At the decoder stage, the latent representation h is mapped back to the reconstruction \(\hat{{\textbf {x}}} \in {{\mathbb {R}}^{n\times N}}\) as given by:

$$\begin{aligned} \hat{{\textbf {x}}}=g_{{\textbf {W}}_2,{\textbf {b}}_2}({\textbf {h}})= \sigma ( {\textbf {h}} {\textbf {W}}_2+{\textbf {b}}_2) \end{aligned},$$
(20)

where \({\textbf {W}}_2 \in {{\mathbb {R}}^{m\times N}}\) and \({\textbf {b}}_2 \in {{\mathbb {R}}^{1\times N}}\). In order to optimize the trainable parameters, the average reconstruction error is used:

$$\begin{aligned} J_E({\textbf {W}},{\textbf {b}})=\frac{1}{n} \sum _{r=1}^{n} \frac{1}{2} \Vert \hat{{\textbf {x}}}^{(r)} - {\textbf {x}}^{(r)} \Vert ^2 \end{aligned}$$
(21)

The low-dimensional code (middle layer) imposes a bottleneck which forces a compressed knowledge representation of the original input data. Instead of restraining the size of the latent code h to force the encoder to compress the input data x (undercomplete autoencoders), a class of autoencoders called sparse autoencoders introduces a sparsity constraint where only a fraction of the neurons is activated. This penalty prevents the neural network from activating more nodes and serves as a regularizer. The approach is effective even in the case of a large code h, as the autoencoder is compelled to represent each input as a combination of a limited number of hidden units, allowing the network to explore structures in the data.

In order to incorporate the sparsity regularizer term, the Kullback-Leibler (KL) divergence function is used considering the activations over a collection of samples, rather than summing them (as in the case of L1 Loss), and then the average activation of each node is constrained encouraging neurons to only activate for a subset of the training observations. The average activation of hidden unit j with respect to the input \({\textbf {x}}^{(r)}\) is given by:

$$\begin{aligned} \hat{p}_j=\frac{1}{n} \sum _{r=1}^{n} h_j ({\textbf {x}}^{(r)}) \end{aligned},$$
(22)

where \(\hat{p}_j\) is the averaged activation of j over the training set which is then constrained such that, \(\hat{p}_j=\rho\). The sparsity parameter, \(\rho\), is a small positive value close to zero which penalizes the neurons that are too active. This is incorporated in the optimization objective in order to penalize \(\hat{p}_j\) in case of a significant deviation from \(\rho\). Thus, the KL divergence term is added in the loss function intending to measure the difference between two probability distributions, i.e., calculates the difference between the average activation of hidden neurons \(\hat{p}_j\) (actual probability) and the target probability \(\rho\) that a neuron in the hidden-coding layer will activate. The target (ideal) distribution function takes the value 1 with a small probability \(\rho\) and the value 0 with a probability \(1-\rho\). The KL divergence similarity is given by:

$$\begin{aligned} J_{KL}(\rho \ \Vert \ \hat{{\textbf {p}}})=\sum _{j=1}^{m} \rho \log \frac{\rho }{\hat{p}_j} + (1-\rho ) \log \frac{1- \rho }{1- \hat{p}_j} \end{aligned}$$
(23)

If the neurons’ activation is equal to the defined sparsity rate then \(J_{KL}(\rho \ \Vert \ \hat{{\textbf {p}}})=0\) and the penalty term is minimized, otherwise as \(\hat{p}_j\) diverges from \(\rho\) a loss function with high value will be produced penalizing the network’s result. A weight decay term, which tends to decrease the magnitude of the weights, is added to Eq. 21 and Eq. 23 in order to prevent overfitting, so as the overall cost function for the sparse autoencoder becomes:

$$\begin{aligned} \begin{aligned} J_{SAE}({\textbf {W}},{\textbf {b}})&=J_E({\textbf {W}},{\textbf {b}}) + \beta J_{KL}(\rho \ \Vert \ \hat{{\textbf {p}}}) \\&+ \frac{\lambda }{2} \sum _{l=1}^{2} \sum _{i=1}^{s_l} \sum _{j=1}^{s_{l+1}} (w_{ij}^{(l)})^2 \end{aligned} \end{aligned},$$
(24)

where the first term refers to Eq. 21, the second term is the KL divergence of Eq. 23 with \(\beta\) controlling the weight of the sparsity penalty term, and the third term is the weight decay, with the parameter \(\lambda\) controlling its relative importance. The adjacent layers are denoted by \(s_l\) and \(s_{l+1}\). A compact sketch of this cost function is given below.
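A compact sketch of the overall sparse-autoencoder cost of Eqs. 18-24 for a sigmoid encoder/decoder; the layer sizes are illustrative, and the parameter values for β and λ simply follow those used later in Sect. 3.3.1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(x, W1, b1, W2, b2, rho=0.05, beta=0.5, lam=0.01):
    """Eq. 24 evaluated on a data matrix x of shape (n samples, N inputs)."""
    h = sigmoid(x @ W1 + b1)                                  # Eq. 19: encoder
    x_hat = sigmoid(h @ W2 + b2)                              # Eq. 20: decoder
    J_E = np.mean(0.5 * np.sum((x_hat - x) ** 2, axis=1))     # Eq. 21: reconstruction error
    p_hat = np.mean(h, axis=0)                                # Eq. 22: average activations
    J_KL = np.sum(rho * np.log(rho / p_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - p_hat)))  # Eq. 23: sparsity penalty
    J_W = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))     # weight decay term
    return J_E + beta * J_KL + J_W

# toy check with n = 20 samples, N = 8 inputs and m = 3 code units
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(20, 8))
W1, b1 = 0.1 * rng.standard_normal((8, 3)), np.zeros(3)
W2, b2 = 0.1 * rng.standard_normal((3, 8)), np.zeros(8)
print(sparse_ae_cost(x, W1, b1, W2, b2))
```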

The general architecture of the hybrid AE-FCN approach unfolds in two phases, as depicted in Fig. 7. The autoencoder is trained using N inputs following the process described in Eqs. 18-24, and the m latent representations are used as input concepts to the FCN. The FCN then follows a standard supervised learning approach, using the latent representations as steady nodes and the target values as non-steady ones. The training procedure of the FCN model follows the linear parametric model (Eqs. 4-5). After that, the trained weights are structured in polynomial surrogates (Subsection 2.4) and later recalled based on testing input data to produce the output decision. This procedure is similar to the one described in Subsection 3.1 regarding the handling of features by the FCN and its learning scheme.

In the following subsections, two time series applications are chosen in order to evaluate the AE-FCN synergistic operational scheme, which encapsulates the representation and prediction merits of both approaches.

3.3.1 Time-Series Prediction

The evaluation time series problem concerns the S&P 500 stock index from January 4, 2016 to December 30, 2016, including 251 observations. The first 120 data points are used for training, 30 for validation and the remaining 101 points for testing. During the training phase, the autoencoder receives the normalized daily open prices of the training set and the standard limited-memory BFGS quasi-Newton method [52] is applied for the minimization of Eq. 24. The parameters are set to \(\beta =0.5\) and \(\lambda =0.01\) and the autoencoder has three layers, as presented in Fig. 7.

After completing the autoencoder’s training phase, m features are chosen to serve as inputs to the FCN. Each training instance is considered as an equilibrium and the FCN estimates the interconnection weights, minimizing the error between the network’s output and the target value. Subsequently, the synaptic weights are transformed into functional representations using the input values to form the known regressors \(\varphi\) (also setting the \(p^{th}\) order of the weights’ polynomial structure), and then the least squares procedure and QR decomposition with column pivoting are followed (see Subsection 2.4) to estimate the adjustable parameters \(\theta\). During testing, the weights are retrieved based on input values from the testing set and the FCN produces the prediction result (see Fig. 7).

Table 4 Testing performance in terms of RMSE for different number of autoencoder representations
Fig. 8 Predicted S&P 500 time series using 10 compressed knowledge representation nodes in the autoencoder

The number of neurons in the middle layer, m, denotes the number of compressed feature representations. The m representation mappings of Eq. 19 affect the number of inputs to the FCN and thus the final result. Table 4 presents the testing performance in terms of RMSE for different representation mappings, with \(m=10\) being the best case. The best-case prediction is depicted in Fig. 8, and a comparison study with other relevant FCM-based methods, as well as conventional ANFIS and ANN configurations, is presented in Table 5. Note that both hybrid FCN models assume polynomial forms of \(1^{st}\) order and the ESN-FCN model is described in Subsection 3.2.

Table 5 S&P 500 forecast accuracy comparison with other prediction models in terms of RMSE

3.3.2 Index Tracking

One of the most widely accepted passive portfolio management methods is index tracking, often known as index replication. The objective of this technique is to reproduce, as closely as possible, the performance of a market index by constructing and managing a portfolio of assets comprised in the index. The created portfolio of stocks or bonds is called an index fund, and its replication or tracking difference from the performance of the financial index is called the tracking error. An easy to manage and operate practice is the full replication method, which refers to the incorporation of all assets of the index by buying quantities of all the constituents that compose it. In this case, the large size of the index fund, and thus the capital allocation among all the assets, introduces high transaction costs and large tracking errors, while the portfolio may also consist of many illiquid stock positions.

For this reason, it is crucial to develop a selection strategy where a subset of constituent stocks is chosen to comprise the tracking portfolio. Appropriate methods are thus capable of coping with the aforementioned defects and even of achieving enhanced performance compared with the index. Generally, the adopted approaches to the index tracking problem are divided into static and dynamic ones. In the static case, the constructed portfolio is kept on hold during the whole period of interest. On the contrary, the dynamic approach dictates readjusting the portfolio following a trading strategy.

Various methods have been proposed in the literature, introducing statistical methods for constructing tracking portfolios [55], evolutionary heuristic methods [56], sparse non-convex optimization [57] and neural network predictors [58]. In recent years, deep learning algorithms and autoencoders have been employed in order to provide an accurate solution to the index tracking problem. More specifically, an autoencoder can be utilized to select stocks and then a neural network is applied in succession to track the market index based on the selected stocks. Following this rationale, we incorporate an FCN replacing the market index model.

The asset selection can be performed using autoencoders to rank stocks based on the information that they share with the market index (communal information content). This is closely related to Eq. 19, where the encoded representation mapping h can be interpreted as the market composed of N stocks. This stems from the fact that h acts as the market factor, similar to the Capital Asset Pricing Model [59], which states that each asset price can be expressed as a function of the market value. With the utilization of the reconstruction error, the similarity between a stock and its reconstructed version is calculated. Note that an interesting work in this field extends that notion, incorporating the similarity between an asset and the market index through a correlation coefficient [60]. Autoencoders with a single hidden layer and \(m=4\) nodes, three hidden layers with \(m=1\) in the middle layer, and diverse autoencoder configurations have been studied in [61, 62] and [60, 63] respectively.

Table 6 Size of each data subset
Fig. 9

The constituent stock (2382.HK) that shares the most common information with HSI

In this work, we use the Hang Seng Index (HSI) as a benchmark, following the autoencoder architecture with one hidden layer and \(m=4\) proposed in [61] (similar to the autoencoder illustrated in Fig. 7). Table 6 shows the period intervals under examination. The HSI for this period is composed of 51 constituents; after excluding 5 of them because of missing data (0288.HK, 1113.HK, 1299.HK, 1928.HK, 1997.HK), 46 constituent stocks are fed to the autoencoder. The autoencoder input data follow z-score standardization as a pre-processing phase. Indicatively, the stock that shares the most common information with the HSI market is illustrated in Fig. 9.
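To make the ranking step concrete, the following minimal Python sketch (not the authors' code) trains a single-hidden-layer autoencoder with \(m=4\) nodes on a hypothetical matrix of z-score standardized constituent returns and ranks the stocks by their per-stock reconstruction error; the function name, training settings and the returns variable are illustrative assumptions.

```python
# Minimal sketch: rank constituents by communal information content using
# the per-stock reconstruction error of a single-hidden-layer autoencoder.
import numpy as np
import tensorflow as tf

def rank_by_communal_information(returns, m=4, epochs=200):
    """returns: (T, N) matrix of z-score standardized stock returns (hypothetical input)."""
    _, N = returns.shape
    inp = tf.keras.Input(shape=(N,))
    h = tf.keras.layers.Dense(m, activation="sigmoid")(inp)   # encoded market factor h
    out = tf.keras.layers.Dense(N, activation="linear")(h)    # reconstructed stocks
    ae = tf.keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(returns, returns, epochs=epochs, batch_size=32, verbose=0)

    recon = ae.predict(returns, verbose=0)
    # Smaller reconstruction error -> more communal information with the market
    errors = np.sqrt(np.mean((returns - recon) ** 2, axis=0))
    return np.argsort(errors)  # stock indices ordered from most to least communal
```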

Fig. 10

HSI tracking performance for the testing period

The stocks included in the tracking portfolio are selected by taking the K assets identified as the most relevant and most informative ones based on the reconstruction error given by Eq. 21. In addition, L stocks with \(L>K\) that share the least communal information content with the market portfolio are chosen. More directly, the autoencoder ranks the assets based on their communal information content; a fixed number of stocks with the closest proximity to their autoencoded versions is chosen (K stocks), while a number of the least communal stocks is also selected (L stocks), forming a tracking portfolio of \(K+L\) assets. This is common practice, as it is not beneficial to include too many stocks contributing the same information [61]. In our case \(K=3\) and \(L=15\). More specifically, the most relevant stocks in descending order according to the reconstruction error are: 2382.HK, 0175.HK and 0700.HK. The 15 least communal stocks, in ascending order, are: 1211.HK, 0762.HK, 0241.HK, 0002.HK, 0267.HK, 0941.HK, 0017.HK, 2319.HK, 1109.HK, 2628.HK, 0386.HK, 0883.HK, 0016.HK, 1044.HK, 0005.HK.
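A minimal sketch of the \(K+L\) selection rule just described, assuming the ranking produced by the autoencoder above (indices ordered from most to least communal); the helper name and the defaults \(K=3\), \(L=15\) mirror our setting but the function itself is illustrative.

```python
def select_tracking_portfolio(ranked_indices, K=3, L=15):
    """Combine the K most communal and the L least communal stocks into one portfolio."""
    most_communal = list(ranked_indices[:K])     # closest proximity to their autoencoded versions
    least_communal = list(ranked_indices[-L:])   # least shared information with the market
    return most_communal + least_communal        # K + L assets for the tracking portfolio
```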

Fig. 11

Cumulative return for the testing period

After that, an FCN is applied in succession to perform index tracking utilizing the selected stocks. The FCN follows a learning procedure and a functional representation of its trained weights similar to those described in the previous sections. Note that within the FCN structure, a weight interconnection indicates causality between two concepts. In addition to this property, the weight association of each selected stock with the market index is stored in functional form instead of as a plain, constant and time-independent value. Therefore, the weights retrieved from the polynomial surrogates permit the mapping of multiple stock-market associations at different time intervals. This attribute allows recalling weights based on diverse market conditions, introducing a dynamic approach to the index tracking problem. Figure 10 shows the HSI tracking performance, while Fig. 11 presents the cumulative return of the tracking portfolio against the original index. The tracking error (TE) is given by:

$$\begin{aligned} TE=\frac{1}{T}\sqrt{\sum _{t=1}^{T}\left( R^{I}_{t}-R^{P}_{t}\right) ^2} \end{aligned}$$
(25)

where T represents the total number of testing periods and \(R^{I}_{t}\), \(R^{P}_{t}\) are the returns of the index and the tracking portfolio at time t, respectively. The resulting weekly tracking error for our implementation is \(TE=0.094\%\). The corresponding tracking error using a deep learning approach, as presented in [62], was \(TE=0.12\%\).
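For reference, the tracking error of Eq. 25 can be computed from the two weekly return series as in the short sketch below; the function name and the assumption that both inputs are aligned weekly return series are ours.

```python
import numpy as np

def tracking_error(index_returns, portfolio_returns):
    """Tracking error following Eq. 25: (1/T) * sqrt(sum of squared return differences)."""
    d = np.asarray(index_returns) - np.asarray(portfolio_returns)
    T = len(d)
    return np.sqrt(np.sum(d ** 2)) / T
```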

4 Comparison Evaluation in a Remaining Useful Life (RUL) Prediction Problem

In this section we evaluate the performance of the proposed hybrid methods using the turbofan engine degradation dataset developed by NASA [64]. The dataset has been generated with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) program, simulating turbofan engines which operate in different conditions and under different initial health states. As time progresses, each engine begins to deteriorate, affecting its health condition until failure occurs. The effects of faults and deterioration are simulated in five main components: Fan, Low Pressure Compressor (LPC), High Pressure Compressor (HPC), High Pressure Turbine (HPT) and Low Pressure Turbine (LPT). Table 7 shows a description of the 21 simulated outputs and the three environment variables. The symbols in green are the features selected for our RUL prediction approaches.

Table 7 C-MAPSS dataset overview
Fig. 12

Engine #26 in FD001 dataset. a Standardized signals (bold signals in Table 7); b Piece-wise RUL function (maximum RUL value is 125 time cycles)

The C-MAPSS dataset is divided into four subsets concerning different operating conditions and fault modes, named FD001, FD002, FD003 and FD004. Each subset contains training data of engine sensor measurements and operational settings from healthy state to failure. The testing data provide sensor and operational records for a limited period of time for each engine, and the purpose is to predict the RUL of these turbofan engines.

In this study, only the first subset, FD001, is considered for evaluation. This subset contains 100 engines for training with 20,631 samples and 100 engines for testing with 13,096 samples in total, and the objective is to predict the RUL of each turbofan engine at the end of its testing record, as mentioned. The pre-processing phase includes z-score normalization (standardization) of the selected features, as depicted indicatively for engine #26 in Fig. 12a. For the RUL target label, a piece-wise linear degradation function is adopted, limiting the RUL value to a maximum threshold as described in [65,66,67]. The maximum constant value of RUL is chosen as 125 time cycles; beyond that value the engine is assumed to degrade linearly down to failure, as illustrated in Fig. 12b.
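The piece-wise linear RUL target described above can be sketched as follows, assuming a run-to-failure training trajectory of known length; the helper name and the cycle indexing convention are illustrative.

```python
import numpy as np

def piecewise_rul(total_cycles, max_rul=125):
    """Piece-wise linear RUL target: capped at max_rul, then decreasing linearly to 0 at failure."""
    linear_rul = np.arange(total_cycles - 1, -1, -1)  # total_cycles-1, ..., 1, 0
    return np.minimum(linear_rul, max_rul)
```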

The CNN-FCN architecture includes five consecutive sets of a 1D convolution, a batch normalization and a sigmoid layer, followed by a fully connected layer and a dropout layer. The convolution operation is carried out along the time sequence dimension for each feature. The 1D-CNN layers consist of 16, 32, 64, 128 and 256 filters with filter sizes \(5\times 1\), \(7\times 1\), \(9\times 1\), \(11\times 1\) and \(13\times 1\) respectively, with \(stride=1\) and \(padding="causal"\). The number of hidden units is 100, the dropout probability is 0.5 and the batch size is 16. Regarding the AE-FCN architecture, a hidden size of 12 neurons is chosen with sigmoid activation functions and a batch size of 512. In the case of ESN-FCN, 20 reservoir neurons are chosen for the ESN layer and 15 states from the ESN are fed to the FCN as feature representations. The ESN parameters are chosen as: leaking rate \(\alpha =0.8\), sparsity parameter \(sp=0.5\) and spectral radius \(\rho =0.95\). In all hybrid configurations the sigmoid function is chosen as the activation function in order to comply with the iterative equation for computing the concept nodes in the FCN (Eq. 3). Moreover, in all cases the linear parametric model is used during the FCN learning procedure (Eqs. 4, 5) and functional structures of FCN weights are constructed using \(1^{st}\) order polynomials. Note that for selecting the hyper-parameters of the hybrid implementations appropriately, standard 5-fold cross-validation has been performed solely on the training set. This procedure mainly concerns the representative part of the hybrid models (the CNN, AE and ESN parts) in order to tune and finally choose the aforementioned hyper-parameter values; for the FCN part, the order of the polynomials has been selected. In FCNs, expert intervention is limited to the initial FCN graph construction, i.e., denoting the number of nodes and the existence of causal relationships. Initial weights are chosen randomly and expertise about initial values of nodes and weight interconnections is not required [6].
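For illustration, a minimal Keras sketch of the 1D-CNN representative part with the filter counts, kernel sizes and causal padding listed above is given below; the input window length, the flattening step before the fully connected layer and the layer ordering details are assumptions, and the FCN recognizer with its functional-weight learning (Eqs. 3-5) is not reproduced here.

```python
import tensorflow as tf

def build_cnn_feature_extractor(window_len, n_features, hidden_units=100, dropout=0.5):
    """Sketch of the CNN feature extractor of CNN-FCN (illustrative, not the authors' code)."""
    x = inp = tf.keras.Input(shape=(window_len, n_features))
    # Five sets of 1D convolution -> batch normalization -> sigmoid activation
    for filters, kernel in zip([16, 32, 64, 128, 256], [5, 7, 9, 11, 13]):
        x = tf.keras.layers.Conv1D(filters, kernel, strides=1, padding="causal")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation("sigmoid")(x)
    x = tf.keras.layers.Flatten()(x)                              # assumed reshaping step
    x = tf.keras.layers.Dense(hidden_units, activation="sigmoid")(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    return tf.keras.Model(inp, x)
```

The extracted feature vector would then serve as the input concept values of the FCN part, whose functional weights are trained as described in the previous sections.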

Fig. 13

Predicted RUL for each engine using all three hybrid implementations in FD001 dataset

The predicted RUL for each engine in the testing set, utilizing the three hybrid FCN approaches, is depicted in Fig. 13. Generally, the CNN-FCN implementation consistently provides the best prediction performance, the AE-FCN model struggles to predict low RUL values, i.e., in the range [0, 20], while the ESN-FCN results in the least accurate predictions. The adopted evaluation metrics are:

$$\begin{aligned} RMSE=\sqrt{\frac{1}{n} \sum _{i=1}^n d_i^2} \end{aligned}$$
(26)
$$\begin{aligned} Score= {\left\{ \begin{array}{ll} \sum _{i=1}^n \left( \exp {\left( \frac{-d_i}{13}\right) }-1\right) , &{} \text {for } d_i <0 \\ \sum _{i=1}^n \left( \exp {\left( \frac{d_i}{10}\right) }-1\right) , &{} \text {for } d_i \ge 0 \end{array}\right. } \end{aligned}$$
(27)

where \(d_i\) is the prediction error and n stands for the number of units under test. The difference between these two metrics is that RMSE gives equal penalty weight to early and late predictions of the same absolute deviation, while the Score function penalizes late predictions more heavily than early ones. The prediction error between actual and predicted RUL, in terms of RMSE and Score, is given in Table 8, which also includes the performance of other methods reported in the literature.
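A direct implementation of Eqs. 26 and 27 is sketched below, assuming the convention \(d_i\) = predicted RUL minus actual RUL, so that late predictions correspond to \(d_i \ge 0\); the function names are illustrative.

```python
import numpy as np

def rmse(d):
    """Eq. 26: d is the vector of prediction errors (predicted minus actual RUL, assumed convention)."""
    d = np.asarray(d, dtype=float)
    return np.sqrt(np.mean(d ** 2))

def score(d):
    """Eq. 27: asymmetric score, penalizing late predictions (d >= 0) more heavily than early ones."""
    d = np.asarray(d, dtype=float)
    return np.sum(np.where(d < 0, np.exp(-d / 13) - 1, np.exp(d / 10) - 1))
```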

Table 8 Prediction comparison in terms of RMSE and Score
Fig. 14

Performance comparison of FCN hybrid implementations from evaluation metrics’ perspective along 100 test engines: a RMSE representation for each individual prediction error; b Score representation for each individual prediction error

Figure 14 illustrates, in an unfolded way, the results of the proposed approaches, providing insights into their performance from the evaluation metrics' perspective along the 100 test engines. As can be observed, the RUL error concentration (each error point concerns the performance for an individual test engine) is located in a narrower band around zero in the case of CNN-FCN, indicating the best performance. It should be noted that, although the AE-FCN model produces a slightly worse performance in terms of RMSE compared with the standard CNN [69], it performs better in terms of Score (see Table 8). This stems from the fact that the prediction operation in the AE-FCN model tends toward RUL underestimation (early predictions, illustrated in both Figs. 13 and 14), which is less penalized by the Score mapping.

Particular interest arises in the case of ESN-FCN, as a single error, the upper right point in both Fig. 14a and b, contributes heavily to the overall performance. More specifically, this point, which corresponds to test engine number 41, contributes \(2.16\times 10^3\) to the total Score of \(3.555\times 10^3\) reported in Table 8. This high contribution emerges from a deviation of 76.76 cycles from the target RUL of 18 cycles (a late RUL prediction of 94.76 cycles). Generally, the ESN-FCN model produces a greater number of late predictions compared with the other two hybrid implementations, which explains its lower performance. Summarizing, it should be highlighted that there is a huge difference in the number of trainable parameters between CNN-FCN and the other two hybrid implementations: in descending order, CNN-FCN, AE-FCN and ESN-FCN utilize 566k, 601 and 444 weight parameters respectively. This is an added property that should be taken into consideration as a trade-off between performance and computational load. Therefore, even the ESN-FCN model provides a result that paves the way for future efforts in parallel and hierarchical ESN topologies in conjunction with FCNs for prediction tasks.

5 Discussion and Lessons Learnt

This work is devoted to the development of three hybrid representative structures utilizing FCNs-FW. After presenting the main concepts and notions of FCNs-FW (Sect. 2), an analysis is conducted regarding the ability of these networks to be formulated in deep topologies. We present the learning restrictions that arise when adding intermediate layers toward a deep FCN formulation. Then, we compared the performance of such a structure with the most appropriate single hidden layer feedforward neural network candidate, i.e., the ELM (Sect. 3). The results showed that increasing the number of neurons in a shallow FCN affects the performance positively, whereas increasing the number of intermediate hidden layers does not provide the desired outcome. This observation led to the development of three hybrid FCN-based implementations, namely CNN-FCN, ESN-FCN and AE-FCN, where functional weights are assumed in all cases. The evaluation results indicated the feasibility of these methods in diverse applications, presenting also multifaceted benefits.

Initially, we tested the performance of the CNN-FCN architecture, using the CNN as a representative feature extractor and the FCN as a recognizer in a synergistic way, on the handwritten digits recognition (MNIST) dataset. The comparison with the standard CNN candidate (Lenet-5) was performed under a regular batch size (128), but also under full batch, which is a known extreme case for training stochastic gradient descent-related algorithms. Although the hybrid CNN-FCN classifier produced a sufficient, but not dominant, overall performance compared with Lenet-5 for a batch size of 128, in the full-batch case it showed notable results, avoiding local minima and providing a more reliable and resilient approach to insufficiencies like vanishing gradients, poor conditioning and saddle points. This result highlighted the virtues of functional weights, where the learned multiple input-output associations are stored in compact functional structures. This resembles using multiple local approximators whose acquired knowledge is then compressed into polynomial weight structures, retaining the global learning properties.

Then, the hybrid ESN-FCN model has been presented under classification and time series tasks. The peculiarity of this model is that the FCN tries to extract useful knowledge from the ESN reservoir states. In the classification task, the hybrid implementation failed to produce an enhanced classification performance compared with the standard ESN. On the other hand, in the S&P 500 prediction task, the ESN-FCN model produced an enhanced performance compared with the standard ESN (Table 3) and with 3 out of 5 other competitors, as depicted in Table 5. As noted, this behavior is expected, as both ingredients of the hybrid model belong to the general family of recurrent neural networks, inherently enclosing memory capabilities. Thus, their merits are coupled in the time series problem, producing an enhanced model due to their temporal relationships, while this advantage cannot be exploited in a classification task, verifying the expected result.

The AE-FCN model was evaluated on the S&P 500 prediction task and compared with classic algorithms as well as with the most dominant FCM-based approaches for time series tasks (Table 5). Note that in this task the ESN-FCN produced a better result compared with the single ESN and a fair performance compared with the FCM-based methods. However, in this problem the AE-FCN produced the best outcome in terms of RMSE, efficiently utilizing 10 representation mappings (neurons). In the index tracking problem, the AE-FCN model produced a lower tracking error compared with a deep learning approach. In this particular problem, the causality encapsulated in the network's weights, as well as the formulation of multiple stock-market associations at different time intervals, allows the network to recall weights based on diverse market conditions. This leads to a dynamic approach with higher levels of flexibility.

Finally, the proposed hybrid implementations were evaluated on a remaining useful life prediction problem. Their performances were compared with each other as well as with other methods reported in the literature. The evaluation was performed under two evaluation metrics, i.e., RMSE and an asymmetric scoring function. The comprehensive analysis from the evaluation metrics' point of view indicated that the CNN-FCN model produced a more compact behavior, with most of the engine predictions lying uniformly around zero error (on-time). Also, the AE-FCN model produced a decent performance in terms of RMSE while performing slightly better than the standard CNN in terms of Score. The ESN-FCN approach performed poorly in terms of Score, being dominated mainly by one heavily late prediction. However, both the AE-FCN and ESN-FCN models utilize a dramatically lower number of trainable parameters. Of particular interest would be the evaluation of more complex variants of the already established hybrid implementations toward narrowing down late predictions. Autoencoder and CNN extensions could be utilized for the first part of the AE-FCN and CNN-FCN respectively, while for the ESN-FCN model hierarchically connected reservoirs could be used to discover higher level features of the signal under examination.

6 Conclusion

In this paper three hybrid implementations have been introduced toward deep learning structures in the area of Fuzzy Cognitive Maps. The general issues related to deep topologies using Fuzzy Cognitive Networks with functional weights have been examined, leading to three hybrid schemes named: a) CNN-FCN; b) ESN-FCN and; c) AE-FCN. The CNN-FCN model presented an overall accuracy of 99.06% on the MNIST dataset when interconnections between output classes were incorporated. Furthermore, compared with the standard Lenet-5, this model demonstrated high levels of reliability and efficiency in the full-batch case. This is due to the incorporation of functional weight structures, where the knowledge acquired from multiple local approximators is compressed while retaining the global learning properties. The hybrid AE-FCN model presented a higher performance in a stock market prediction example (11.79 RMSE on the S&P 500) and in an index tracking problem (0.094% tracking error on the HSI), when compared with other FCM-based approaches and a deep learning model respectively. The ESN-FCN structure presented a decent performance, but was outperformed by the other two models in all benchmarks. Finally, a comparison evaluation was performed on a remaining useful life problem using the C-MAPSS turbofan engine degradation dataset. The CNN-FCN model outperformed the other two hybrid structures, presenting the lowest prediction error, i.e., 16.35 in terms of RMSE and 706 in terms of the asymmetric scoring function (Score). The evaluation results provided strong evidence that hybrid implementations using FCNs with functional weights are feasible, with satisfying prediction capabilities. Future work will be devoted to the utilization of hybrid deep FCN-FW methods in new application domains, as well as to the development of a distributed adaptive control approach incorporating the FCN-FW formulation.