1 Introduction

During the last decades, neural networks (NN) have received much attention from the research community, especially in the area of artificial intelligence. Their appeal comes from emulating the human brain in a computing environment through mathematical models. A NN is a strongly connected structure of artificial neurones.

An artificial neurone is a mathematical abstraction of the nerve cell—an excitable cell that communicates with other cells via specialised connections. An artificial network resembles a set of biological neurones only in a few aspects, its main differences being the topology and the number of neurones used. Also, layers are evaluated in a fixed order, while biological neurones can act asynchronously. The speed of information transfer and the learning process are other distinguishing features.

To mirror the brain, the perceived inputs are propagated through the network, changing on the way as a result of the connections’ different weights, as well as of the mixing and signal-triggering functions that characterise every neurone. As the information held by the NN depends on its structure and on the idiosyncrasy of every neurone, the dimension of the structure, its topology and the type of its neurones define the NN capacity to retain specific information. Hence, an adequate learning algorithm should be used either to adjust the weights within the NN to retain the required information or to adjust the interconnection structure. In this way, this complex structure can resemble an artificial brain that is able to memorise, map information, and give answers according to the received inputs. Learning methods are classified into supervised, unsupervised, and reinforced [1,2,3].

Assuming the success of the chosen learning algorithm and that the NN is well-dimensioned, the network will produce the required information. Most learning methods pay attention neither to the form in which the information spreads within the network nor to the physical meaning of the network weights. The information disseminates through the structure without any legibility or inter-functionality of its local values.

The neurone is the NN basic unit that retains information about the structure in question through its weights, i.e., information about its integration and activation functions. Once its structure is chosen, a NN retains information in the weights of all its neurones. Whenever the information memorised in the network can be reorganised, particularly through the decomposition into parcels of information, a relevant question is whether this decomposition can be carried out locally in the neurones.

To answer this question, we need to determine whether it is possible to “split” a neurone, keeping the core of its structure but assuming all its variables to be multidimensional and the activation function to be decomposed into orthogonal components. Such a decomposition should be reversible in the sense that the sum of the sub-models is always equal to the transfer function of the initial neurone (before being split). Another issue is to present a method able to transfer information between neurones for this new model structure (i.e., vectorial signals and integration and activation functions) and that, at the same time, guarantees the immutability of the neurone and consequently of the whole NN.

Therefore, the present work aims to enable an already existing NN to re-arrange its information by transforming its scalar inputs, outputs and functions into multivariable entities, so as to guarantee the decomposition of the signals that circulate within the NN. Crucial to this paradigm is to guarantee that the transfer functions of the NN are not changed, to make sure that the information is preserved. This is possible since the decomposition of the multidimensional functions represents the transfer function of every neurone as an invariant sum of parcel functions that equals its original value.

The back-propagation algorithm is traditionally used in the supervised learning of NN [4]. It propagates the NN output error to every neurone and then, locally, readjusts the weights. Its objective is for the NN to learn from the training data and then to be able to reproduce data (the same or other data) with the smallest possible error. The algorithm that we propose in this work—the Multi-back-propagation algorithm—has two phases, but its contribution lies in Phase-2, since Phase-1 only produces the NN that we always assume to exist. Phase-1, or learning phase, obtains the NN structure with some other method, e.g. the back-propagation algorithm. Phase-2, or splitting phase, transforms the NN obtained at Phase-1 into a multidimensional version, assuring that no information is lost and that the NN output components are taken as answers of the decomposed NN. At any instant, the sum of the NN components should remain the same as the original NN (from Phase-1). Thus, the objective of this method is to back-propagate the separability factor of the output components, making sure that an adjustment factor of separability arrives at every vectorial neurone of the NN. Once this value is available, it is possible to locally transfer information between the sub-models of the vectorial neurone.

To the best of our knowledge, transfer function transformation and optimisation has received little research attention so far. The idea presented in this paper, i.e., to split the transfer function into orthogonal components, is innovative and can be applied to all problems that benefit from NN readability. Moreover, it also makes it possible to interrogate the NN about portions of the modelled knowledge. We envision its use, with clear benefits, in structural NN reorganisation and in the formulation of new machine learning strategies.

The idea of splitting the NN is interesting and there are just a few papers on this topic. In [5], the NN automatically learns to split the network weights into either a set or a hierarchy of multiple groups that use disjoint sets of features, by learning both the class-to-group and feature-to-group assignment matrices along with the network weights. In [6], experiments show that when the network performs two different tasks, the neurones naturally split into clusters, where each cluster is responsible for processing a different task. This behaviour not only corresponds to biological systems, but also allows for further insight into interpretability or continual learning. In [7], a new technique is proposed to split a convolutional network structure into small parts that consume less memory than the original one. The split parts can be processed almost separately, which plays an essential role for better memory management. In [8], a novel, complete algorithm is proposed for the verification and analysis of ReLU-based feedforward NN. The algorithm, based on symbolic interval propagation, introduces a new method for determining split-nodes which evaluates the indirect effect that splitting has on the relaxations of successor nodes. A constructive algorithm is proposed in [9] for feedforward NN which uses node-splitting in the hidden layers to build large networks from smaller ones. Modification of the transfer function and the error back-propagation technique applied to NN can be seen in [4] and the transfer function pooling in [10].

In Sect. 2, the activation function is written as a Fourier sum of m terms to facilitate the decomposition of the input and output signals. The optimality of this approximation is proven. In Sect. 3, the multidimensional neurone is formulated and the orthogonality of the partial activation functions is demonstrated. In Sect. 4, the new learning method is outlined—the Multi-back-propagation algorithm. It conveys the transfer of information between different parts of the decomposed neurones of a feedforward NN with the use of a reinforcement back-propagation technique. The demonstration of the decomposition procedure, as well as of the effectiveness of the proposed splitting algorithm, can be found in Sect. 5. Namely, (1) we illustrate the extension of the activation function to a periodic one; (2) we give an example with the hyperbolic tangent as activation function to demonstrate the virtue of the approximation of the output by an orthogonal sum; (3) the splitting of the NN transfer function, as well as the invariance of the sum of its parts, is substantiated in two different examples. The work concludes with Sect. 6, where the main conclusions are outlined. Also, some directions for future research in view of maturing the Multi-back-propagation method are stated.

2 Activation Function Orthogonal Decomposition

A classical NN is formed by elementary units—the neurones.

A single artificial neurone is represented mathematically as:

$$\begin{aligned} y = \varphi \left( \sum _{ i=1}^{n+1} w_{i} x_{i}\right) . \end{aligned}$$
(1)

Consider \(a=\varvec{w}^{T}\varvec{x},\) where \(\varvec{w}=\left( w_{1}, w_{2}, \ldots , w_{n}, w_{n+1}\right) ^{T}, w_{i} \in \mathbb {R}, i=1, \ldots , n,\) is the vector of the weights and \(\varvec{x}=\left( x_{1}, x_{2}, \ldots , x_{n}, x_{n+1}\right) ^{T}\) the input vector; that is, the neurone takes n inputs \(x_{i}\in \mathbb {R}, i=1,\ldots , n,\) each one having its own weight, \(w_{i}, i=1,\ldots , n.\) A bias input exists at every node and is represented by the constant \(x_{n+1}=1\) and the adjusted weight \(w_{n+1}.\) Thence \(y=\varphi (a).\)
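
For concreteness, a minimal Octave/Matlab sketch of (1) follows (ours, not part of the code in [19]); the input, weight and activation choices are merely illustrative.

```matlab
% Minimal sketch of a single neurone, Eq. (1); all values are illustrative.
x   = [0.2; -1.0; 0.5; 1];        % n = 3 inputs plus the bias input x_{n+1} = 1
w   = [0.7;  0.1; -0.4; 0.3];     % n weights plus the bias weight w_{n+1}
phi = @(a) tanh(a);               % one usual choice of activation function
a   = w' * x;                     % integration function, a = w^T x
y   = phi(a)                      % neurone output, y = phi(a)
```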

Fig. 1

Structure of an artificial neurone. \(\Sigma \) is the integration function and \(\varphi \) the activation function

Two mathematical functions are the basis of its structure. The neurone calculates the sum of the weighted inputs—the integration function (\(\Sigma \))—and passes the result through \(\varphi ,\) which represents the neurone’s activation potential. Typically, \(\varphi \) is non-linear and has the shape of a sigmoid—the activation function. Usual choices are the hyperbolic tangent, the Heaviside function or the ReLU [11, 12]. \(\varphi \) is often monotonically increasing, continuous, differentiable and bounded. This function is used to pass the information further into the network, with the activation lower and upper bounds represented by \(-\delta \) and \(+\delta ,\) respectively. A threshold level is used to shift the action potential value of the neurone to the network, represented here by the bias of every neurone, \(x_{n+1}.\)

2.1 Orthogonal Decomposition of \(\varphi \)

The activation function defines how the output of the integration function is transformed into the output of the neurone. Without loss of generality, \(\varphi \) is a monotonically increasing saturated function. Define the saturation level as \(\delta \) and the interval where \(\varphi \) is defined as \(\left[ \underline{a},\overline{a} \right] ,\) such that \(\varepsilon = \varphi ( \underline{a}) + \delta =\delta - \varphi ( \overline{a}).\) This means that the extremes of the interval are sufficiently close to the saturation level. Furthermore, since \(\varphi \) is odd, we have that \( \underline{a} =- \overline{a}\) and \(\underline{a} = -\varphi ^{-1}(\delta -\varepsilon ), \overline{a} = \varphi ^{-1}(\delta -\varepsilon ),\) where \(\varepsilon \) is the required precision. Next, a periodic extension of function \(\varphi \) to the interval \(\left[ 2\underline{a},2 \overline{a} \right] \) is defined:

$$\begin{aligned} \Phi (a) = \left\{ \begin{array}{lll} \varphi (-a + 2 \underline{a}) &{}: &{} 2 \underline{a} \le a \le \underline{a} \\ \varphi (a) &{}: &{} \underline{a} \le a \le \overline{a} \\ \varphi (-a + 2 \overline{a}) &{}: &{} \overline{a} \le a \le 2 \overline{a}. \end{array}\right. \end{aligned}$$
(2)

Function \(\Phi (a) \) is periodic of fundamental period \(T= 2 \left( \overline{a} - \underline{a}\right) , \) i.e., \(\Phi (a+kT) = \Phi (a), k \in \mathbb {Z}.\) The extension of \(\varphi \) by a periodic function is illustrated in Fig. 3 of Sect. 5.1.
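
A short sketch of the extension (2) over one period is given below (ours; \(\varphi =\tanh \) and \(\overline{a}=4\) are assumed values, anticipating Sect. 5.1).

```matlab
% Sketch of the periodic extension (2) on one period [2*ubar, 2*abar] = [-T/2, T/2].
% Assumed values: phi = tanh and abar = 4 (so ubar = -4 and T = 16), as in Sect. 5.1.
phi  = @(a) tanh(a);
abar = 4;  ubar = -abar;  T = 2*(abar - ubar);
Phi  = @(a) phi(a)            .* (a >= ubar & a <= abar) ...
          + phi(-a + 2*ubar)  .* (a <  ubar)             ...
          + phi(-a + 2*abar)  .* (a >  abar);
a = linspace(-T/2, T/2, 1001);
plot(a, Phi(a));                  % reproduces the shape shown in Fig. 3
```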

Remark 1

The decomposition of the activation function into a Fourier series requires the activation function to be bounded (to saturate), and therefore its transformation into a periodic function. For non-saturating activation functions, the same method may be applied using another type of orthogonal decomposition, for instance Legendre, Chebyshev or Hermite polynomials [13, 14].

Theorem 2.1 states that if function \(\Phi \) complies with certain assumptions then it can be written as an infinite sum—using the Fourier Series.

Theorem 2.1

(Approximation of the activation function by a Fourier Series) Assume that the periodic extension \(\Phi \) has been defined in a way that is continuous and absolutely integrable over its period, so that the Dirichlet conditions are fulfilled. Then:

$$\begin{aligned} \Phi (a)=S_{m} (a)+ R_{m} (a) \end{aligned}$$
(3)

where

$$\begin{aligned} S_{m} (a):= & {} \sum _{n=1}^{m } \hat{\Phi }_{n} \sin ( \omega _{n} a), \end{aligned}$$
(4)
$$\begin{aligned} R_{m} (a):= & {} \sum _{n=m+1}^{\infty } \hat{\Phi }_{n} \sin ( \omega _{n} a), \end{aligned}$$
(5)
$$\begin{aligned} \hat{\Phi }_{n}= & {} \displaystyle \dfrac{2}{T} \int \limits _{-T/2}^{T/2} \sin (\omega _{n} a ) \, \Phi (a) da, \end{aligned}$$
(6)

with \(\omega _{n}:= (2n-1)\omega _{0}\) and \(\omega _{0}=\dfrac{2 \pi }{T}.\)

Proof

See Appendix A. \(\square \)
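
A numerical sketch of the coefficients (6) and of the truncated sum (4) is given below (ours, not the code in [19]); \(\varphi =\tanh \), \(\overline{a}=4\) (hence \(T=16\)) and \(m=8\) are assumed values taken from Sect. 5.1, and the integrals are approximated by the trapezoidal rule.

```matlab
% Sketch: numerical evaluation of the coefficients (6) and of S_m in (4).
% Assumed values: phi = tanh, abar = 4 (T = 16) and m = 8, as in Sect. 5.1.
phi  = @(a) tanh(a);
abar = 4;  T = 4*abar;  w0 = 2*pi/T;  m = 8;
Phi  = @(a) phi(a)            .* (abs(a) <= abar) ...
          + phi(-a - 2*abar)  .* (a < -abar)      ...
          + phi(-a + 2*abar)  .* (a >  abar);
a  = linspace(-T/2, T/2, 4001);
Sm = zeros(size(a));  Phi_hat = zeros(1, m);
for n = 1:m
  wn         = (2*n - 1)*w0;                             % odd harmonics only
  Phi_hat(n) = (2/T) * trapz(a, sin(wn*a) .* Phi(a));    % Eq. (6), trapezoidal rule
  Sm         = Sm + Phi_hat(n) * sin(wn*a);              % partial sum, Eq. (4)
end
max(abs(Phi(a) - Sm))             % truncation error |R_m| over one period
```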

The following corollary asserts that if some assumptions on \(\Phi \) are fulfilled, its approximation by a Fourier sum can be as accurate as required by controlling the number of terms.

Corollary 2.1.1

In the conditions of Theorem 2.1, and for an input signal that is real, \(\lim _{m \rightarrow \infty } \big |\Phi (a) - S_{m} (a) \big |\rightarrow 0\) and the truncated Fourier series of degree m, \(S_{m} (a),\) is the best approximation of this order in \(L^{2}\left( [-T/2, T/2 ]\right) \) to function \(\Phi (a).\)

Proof

From the Riesz-Fischer Theorem [15] we know that if \(S_{m} (a)\) is the Fourier Series of m terms for the square-integrable function \(\Phi (a)\) then

$$\begin{aligned} \lim _{m \rightarrow \infty }\big |\Phi (a) - S_{m} (a) \big |\rightarrow 0. \end{aligned}$$
(7)

Moreover, the truncated Fourier Series of degree m, \(S_{m} (a),\) is the best approximation of this order in \(L^{2}\left( [-T/2, T/2 ]\right) \) to function \(\Phi (a)\) [16]. That is, if \(\Phi (a) \in L^{2}\left( [-T/2, T/2 ]\right) \) and \(\zeta _{1}, \zeta _{2}, \ldots , \zeta _{m} \) are real numbers, then:

$$\begin{aligned} \big |\Phi (a) -S_{m} (a)\big |\le \big |\Phi (a) - \sum _{n=1}^{m} \zeta _{n} \sin ( \omega _{n} a ) \big |\end{aligned}$$
(8)

and equality holds only when \(\zeta _{n} = \hat{\Phi }_{n}, n=1,\ldots , m.\) \(\square \)

Lemma 2.1 states that the amount of energy contained in function \(\Phi \) in a certain finite interval equals the energy of the Fourier series in the same interval.

Lemma 2.1

For \(\Phi \) defined in \(L^{2} \left( [2\underline{a}, 2\overline{a}]\right) ,\) we have

$$\begin{aligned} \int \limits _{-T/2}^{T/2} { |\Phi (a) |}^{2} da= & {} \int \limits _{-T/2}^{T/2} |S_{m}(a) |^{2} da + \int \limits _{-T/2}^{T/2} |R_{m}(a) |^{2} da. \end{aligned}$$

Proof

See Appendix A. \(\square \)

Corollary 2.1.1

In the conditions of Lemma 2.1, we have:

$$\begin{aligned} \dfrac{1}{T} \int \limits _{-T/2}^{T/2} |S_{m}(a) |^{2} da = \sum _{n= 1}^{m} \hat{\Phi }_{n}^{2}. \end{aligned}$$
(9)

Also, the mean quadratic error is the sum of the energies of the neglected harmonics:

$$\begin{aligned} \dfrac{1}{T} \int \limits _{-T/2}^{T/2} |R_{m}(a)|^{2} da = \sum _{n=m+1}^{\infty } \hat{\Phi }_{n}^{2}. \end{aligned}$$
(10)

Lemma 2.2 states that, for an odd function \(\Phi \), the coefficients of the Fourier series can be computed over the interval \(\left[ \underline{a}, \overline{a}\right] \) where \(\varphi \) was defined before (2), i.e. \(\hat{\Phi }_{n}\) is twice the respective \(\hat{\varphi }_{n}.\)

Lemma 2.2

For an odd function \(\Phi \), the coefficients of the Fourier series in (6) can be redefined as:

$$\begin{aligned} \hat{\Phi }_{n} = \displaystyle \dfrac{4}{T} \int _{-T/4}^{T/4} \sin ( \omega _{n} a ) \, \Phi (a) da. \end{aligned}$$

Proof

See Appendix A. \(\square \)

It is then easily concluded that the energy of \(\Phi \) in \(\left[ -\dfrac{T}{2}, \dfrac{T}{2} \right] \) is twice the energy of \(\varphi \) in \(\left[ -\dfrac{T}{4}, \dfrac{T}{4} \right] .\)
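
A short numerical check of Lemma 2.2 (ours; \(\varphi =\tanh \) and \(T=16\) are assumed values) is given below: the coefficient computed with (6) over a full period of \(\Phi \) coincides with the quarter-period expression of the lemma.

```matlab
% Sketch: check of Lemma 2.2 for one coefficient (n = 2 chosen arbitrarily).
abar = 4;  T = 4*abar;  w0 = 2*pi/T;  n = 2;  wn = (2*n - 1)*w0;
Phi  = @(a) tanh(a).*(abs(a) <= abar) + tanh(-a - 2*abar).*(a < -abar) ...
          + tanh(-a + 2*abar).*(a > abar);
af = linspace(-T/2, T/2, 4001);                       % full-period grid
aq = linspace(-T/4, T/4, 2001);                       % quarter-period grid
c_full    = (2/T) * trapz(af, sin(wn*af) .* Phi(af)); % Eq. (6)
c_quarter = (4/T) * trapz(aq, sin(wn*aq) .* tanh(aq));% Lemma 2.2
[c_full, c_quarter]                                   % the two values agree
```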

Remark 2

For an odd function \(\varphi (a)\)

$$\begin{aligned} \dfrac{1}{2}\int \limits _{-T/2}^{T/2} |\Phi (a) |^{2} da = \int \limits _{-T/4}^{T/4} |\varphi (a) |^{2} da. \end{aligned}$$

This remark follows directly from the Parseval-Rayleigh’s identity, since \(\varphi (a)\) coincides with \(\Phi (a)\) in the interval \(\left[ -\dfrac{T}{4},\dfrac{T}{4} \right] .\)

Remark 3

For an odd function

$$\begin{aligned} R_{m}(a) = \sum _{n=m+1}^{\infty } {\hat{\Phi }}_{n} \sin ( \omega _{n} a) = 2 \sum _{n=m+1}^{\infty } {\hat{\varphi }}_{n} \sin ( \omega _{n} a ) \end{aligned}$$

with

$$\begin{aligned} {\hat{\varphi }}_{n}= \dfrac{1}{2} {\hat{\Phi }}_{n}= \dfrac{2}{T} \int \limits _{-T/4}^{T/4} \sin ( \omega _{n} a )\, \varphi (a) da \end{aligned}$$

From this it is clear that the error of the approximation of function \(\Phi (a)\) by the Fourier series is double the corresponding error for function \(\varphi (a).\)

This remark follows from Lemma 2.2.

Remark 4

Knowing (5), then

$$\begin{aligned} \dfrac{2}{T} \int \limits _{-T/4}^{T/4} |R_{m} (a)|^{2} da = \sum _{n=m+1}^{\infty } {\hat{\Phi }}_{n}^{2}. \end{aligned}$$

This remark follows from Lemma 2.2.

3 Multidimensional Neurones

In this section, we consider the multivariable counterpart of the neurone formalised in Sect. 2. The neurone input and output are multidimensional variables with every component being a fraction of the original signal. Hence

$$\begin{aligned} \begin{array}{rcl} \varphi : \mathbb {R}&{} \longrightarrow &{} \mathbb {R}^{\ell } \\ a &{} \leadsto &{} y=\varphi (a), \end{array} \end{aligned}$$

where \( \varvec{a}=\sum _{i=1}^{n+1} w_{i}\vec {x}_{i}, w_{i}, \vec {x}_{i} \in \mathbb {R}^{p}.\) The multidimensional case is represented for \(p=2\) and \(\ell =2\) in Fig. 2, i.e., every input has two different channels, leading to the duplication of the integration/activation functions, which are orthogonal between themselves, as demonstrated in Theorem 3.1.

Fig. 2

Structure of the multidimensional neurone with \(\ell = 2\)

In the multidimensional formulation, the counterpart of \(\varvec{x}\) in (1) is \(\varvec{X}=\begin{pmatrix} \vec {x}_{1} \\ \vec {x}_{2} \\ \vdots \\ \vec {x}_{n+1}\end{pmatrix}, \) where \(\vec {x_{i}}=\left( x_{i}(1), x_{i}(2), \ldots , x_{i}(p)\right) ,\) for \(i=1,\ldots ,n.\) To decompose \(x_{i}\) into p components, set \(x_{i}(k)= x_{i} \beta _{i}(k),\) with \(\vec \beta _{i}^{T}= \left( \beta _{i}(1), \ldots , \beta _{i}(p)\right) \) and \(\sum _{k=1}^{p} \beta _{i}(k)=1,\) so that \(x_{i}\) in (1) becomes \(x_{i} =\displaystyle \sum \nolimits _{k=1}^{p} x_{i}(k).\) Therefore, \(\varvec{B} =\begin{pmatrix} \vec \beta _{1}&\vec \beta _{2}&\cdots \vec \beta _{n+1}\end{pmatrix}^{T}\) is the matrix of splitting factors of the input \(\varvec{x}. \) Thence \(\varvec{X}=\begin{pmatrix} \vec {x}_{1} \\ \vec {x}_{2} \\ \vdots \\ \vec {x}_{n+1}\end{pmatrix}= \begin{pmatrix}X_{1}&X_{2}&\cdots&X_{p}\end{pmatrix},\) where \(X_{k}\in \mathbb {R}^{n+1}\) is a column vector that contains component-k of every input \(\vec {x_{i}}.\) Hence \(X_{k}=B_{k} \odot \varvec{x}= diag(B_{k})\cdot \varvec{x},\quad k=1,\ldots , p,\) where \(\odot \) is the Hadamard product, \(\cdot \) is the matrix-vector product and \(diag(B_{k})\) is the diagonal matrix whose main diagonal is vector \(B_{k}\)—corresponding to fraction-k of every input \(x_{i}.\) An alternative way to write matrix X is using the Khatri-Rao product:

$$\begin{aligned} \varvec{X}=\varvec{B} * \varvec{x}= \begin{pmatrix} \vec \beta _{1}^{T} \\ \vec \beta _{2}^{T} \\ \vdots \\ \vec \beta _{n+1}^{T} \end{pmatrix} * \begin{pmatrix}x_{1} \\ x_{2} \\ \vdots \\ x_{n+1}\end{pmatrix} = \begin{pmatrix} \vec \beta _{1}^{T}\otimes x_{1} \\ \vec \beta _{2}^{T}\otimes x_{2} \\ \vdots \\ \vec \beta _{n+1}^{T}\otimes x_{n+1} \end{pmatrix}. \end{aligned}$$
(11)

In a similar manner, the output \(y= \begin{pmatrix} y(1)&y(2)&\cdots&y(\ell ) \end{pmatrix},\) with \(y=\sum _{k=1}^{\ell } y(k),\) \(y(k) = \zeta (k) y\) and \( \sum _{k=1}^{\ell } \zeta (k) =1, \zeta = \begin{pmatrix} \zeta (1)&\cdots&\zeta (\ell ) \end{pmatrix}.\) That is, \( Y= \zeta \otimes y\) with \(\zeta \) being the splitting factor of the output. Note that here we have assumed that \(\ell =p.\) The information contained in a neurone is strongly dependent on the weight values, \(w_{i},\) and its transfer within the decomposed neurone can be done by assigning different weights to the different channels of the same input. Thence, consider \(\vec \tau _{i}^{T}= \begin{pmatrix}\tau _{i}(1)&\cdots&\tau _{i}(p) \end{pmatrix} \) the splitting factor of the weight \(w_{i},\) i.e., \(w_{i}(k)=w_{i}\tau _{i}(k),\) \(w_{i}=\sum _{k=1}^{p} w_{i}(k).\) Define \(\varvec{T}= \begin{pmatrix}\tau _{1}&\tau _{2}&\cdots&\tau _{n+1}\end{pmatrix}^{T} \) and then \(\varvec{W} = \varvec{T}*\varvec{w},\) likewise as in (11). For simplicity's sake, in what follows, we consider \(p=\ell =2.\) Hence \(\vec \tau _{i}^{T}= \begin{pmatrix} \tau _{i}&1-\tau _{i}\end{pmatrix} \) and \( \varvec{T} = \begin{pmatrix} \varvec{\tau }&\textbf{1}-\varvec{\tau }\end{pmatrix}, \varvec{\tau } \in \mathbb {R}^{n+1}\) with \(\varvec{\tau }(i)=\tau _{i}.\) Also \(\varvec{B}=\begin{pmatrix} \varvec{\beta }&\textbf{1}-\varvec{\beta }\end{pmatrix}, \varvec{\beta } \in \mathbb {R}^{n+1}\) with \(\varvec{\beta }(i)=\beta _{i}, i=1,\ldots ,n,\) where \(\textbf{1}\) is a vector of ones.
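
A small sketch of this splitting for \(p=2\) is given below (ours; the numerical values are purely illustrative); it builds \(\varvec{X}\) and \(\varvec{W}\) column by column and checks that the two channels add back to the original input and weight vectors.

```matlab
% Sketch (p = 2): split the inputs and the weights into two channels.
x    = [0.2; -1.0; 0.5; 1];            % n = 3 inputs plus the bias x_{n+1} = 1
w    = [0.7;  0.1; -0.4; 0.3];
beta = [0.3;  0.8;  0.5; 1.0];         % input splitting factors,  B = [beta, 1-beta]
tau  = [0.6;  0.2;  0.9; 0.5];         % weight splitting factors, T = [tau, 1-tau]
X    = [beta .* x,  (1 - beta) .* x];  % columns X_1 and X_2 (channel components)
W    = [tau  .* w,  (1 - tau)  .* w];
max(abs(sum(X, 2) - x))                % every row sums back to the original input
max(abs(sum(W, 2) - w))                % every row sums back to the original weight
```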

A standard neurone has several input signals, which are weighted and conducted to the cell nucleus and, from there—through the activation function—to the output. All variables and the image of the function are scalar. With the method proposed here, the structure of the neurone is replicated to accept vector inputs. The activation function of the neurone is decomposed into orthogonal functions, taking advantage of its representation as a Fourier series and assuming it to be a periodic function. Each component of the activation function takes as input the weighted sum of the associated coordinate of the input vector. In this decomposition process, the weights of the original NN are distributed across various channels, always ensuring that the sum of the output components of the neurone is equal to the value of the standard neurone output. The weight distribution that occurs in each neurone is adjusted by a supervised sharing learning technique. Let \(\zeta (k)\) be the separability vector of sample-k from the training set. With this new algorithm, the splitting factors \(\zeta (k)\) are back-propagated through the structure of the NN to the input of each neurone, where the weights W are then decomposed into the various channels.

The values \(\zeta (k)\), \(k=1,...,\ell \), are the result of applying a certain separability criterion to the output signal, based on a dissimilarity measure or on decomposition techniques of the transfer function. This process is not considered in this study. The result of this splitting is that the flow of information in the NN happens through multi-dimensional channels. The coordinates of the output vector of the NN are the components of the decomposition of the NN response.

In this section, we explain how to decompose every input, as well as the respective weights, into different channels. The next theorem gives a representation for each of the components.

Theorem 3.1

(Decomposition of the integration function) Considering all the previous definitions, the splitting of the integration function for \(p=2,\) i.e., \(a=\varvec{w}^{T}\varvec{x}=a_{1}+a_{2},\) becomes

$$\begin{aligned} a_{1}= & {} \varvec{w}^{T}\textrm{diag }( \varvec{\tau } \odot \varvec{\beta })\varvec{x}, \end{aligned}$$
(12)
$$\begin{aligned} a_{2}= & {} \varvec{w}^{T}\textrm{diag } (\textbf{1}- \varvec{\tau }\odot \varvec{ \beta })\varvec{x}. \end{aligned}$$
(13)

Proof

See Appendix B. \(\square \)

Define

$$\begin{aligned} \varvec{ \alpha }:= \varvec{\tau } \odot \varvec{\beta } \in \mathbb {R}^{n+1} \end{aligned}$$
(14)

which is a new parameter vector associated with the network input that specifies how the information is broken down. E.g., in the bidimensional case, if \(\alpha _{1}=0.5\) it means that the first and second components of \(x_{1}\) hold the same amount of information. On the contrary, if \(\alpha _{1}=1,\) there will be no information stored in the second component of the same input (this is what happens without splitting). Next, Theorem 3.2 gives a Fourier decomposition for the output components.
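
A minimal sketch of Theorem 3.1 for \(p=2\) follows (ours; the numerical values are illustrative), confirming that the split of the integration function is exact.

```matlab
% Sketch of Theorem 3.1 (p = 2): a = w'*x is split into a1 + a2 via alpha in (14).
x     = [0.2; -1.0; 0.5; 1];
w     = [0.7;  0.1; -0.4; 0.3];
beta  = [0.3;  0.8;  0.5; 1.0];
tau   = [0.6;  0.2;  0.9; 0.5];
alpha = tau .* beta;                   % Eq. (14)
a1    = w' * (alpha .* x);             % Eq. (12), i.e. w' * diag(alpha) * x
a2    = w' * ((1 - alpha) .* x);       % Eq. (13), the complementary channel
[a1 + a2,  w' * x]                     % the split is exact: a1 + a2 = a
```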

Theorem 3.2

(Splitting of the Output) Assuming the input decomposed as in Theorem 3.1, \(y\approx \varphi (a)\) can be split in the following manner: \(\varphi =\varphi _{1}(a_{1},a_{2})+\varphi _{2}(a_{1},a_{2}),\) where

$$\begin{aligned} \varphi _{1}(a_{1},a_{2})= & {} \sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin ( \omega _{n} a_{1} ) \cos (\omega _{n} a_{2} ) , \end{aligned}$$
(15)
$$\begin{aligned} \varphi _{2}(a_{1},a_{2})= & {} \sum _{n=1}^{m} {\hat{\Phi }}_{n} \cos ( \omega _{n} a_{1} ) \sin (\omega _{n} a_{2} ) \end{aligned}$$
(16)

and \(y_{i} = \varphi _{i}(a_{1},a_{2}), i=1,2.\)

Proof

See Appendix B. \(\square \)
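
The following sketch (ours; it assumes the \(\tanh \) extension of Sect. 5.1 with \(m=8\)) evaluates (15) and (16) for an arbitrary split of a and confirms that \(\varphi _{1}+\varphi _{2}\) equals the truncated series \(S_{m}(a)\).

```matlab
% Sketch of Theorem 3.2: phi_1 and phi_2 for an arbitrary split a = a1 + a2.
% Assumed values: phi = tanh, abar = 4 (T = 16), m = 8, as in Sect. 5.1.
abar = 4;  T = 4*abar;  w0 = 2*pi/T;  m = 8;
ag  = linspace(-T/2, T/2, 4001);
Phi = tanh(ag).*(abs(ag) <= abar) + tanh(-ag - 2*abar).*(ag < -abar) ...
    + tanh(-ag + 2*abar).*(ag > abar);
wn      = (2*(1:m) - 1)*w0;                               % odd harmonics
Phi_hat = arrayfun(@(k) (2/T)*trapz(ag, sin(wn(k)*ag).*Phi), 1:m);   % Eq. (6)
a1 = 0.7;  a2 = 1.5;                                      % an arbitrary split
phi1 = sum(Phi_hat .* sin(wn*a1) .* cos(wn*a2));          % Eq. (15)
phi2 = sum(Phi_hat .* cos(wn*a1) .* sin(wn*a2));          % Eq. (16)
[phi1 + phi2,  sum(Phi_hat .* sin(wn*(a1 + a2))),  tanh(a1 + a2)]
% phi1 + phi2 equals S_m(a1 + a2), which in turn approximates tanh(a1 + a2)
```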

Remark 5

From Theorem 3.2, one may infer:

$$\begin{aligned}{} & {} a_{1} =0 \implies \varphi \left( a_{1}+a_{2} \right) = \varphi _{2} \left( a_{1}, a_{2} \right) = \sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin ( \omega _{n} a_{2} )\\{} & {} \text { and } {\hat{\Phi }}_{n} = \dfrac{2}{T} \int \limits _{-T/2}^{T/2} \sin \left( \omega _{n} a_{2}\right) \varphi (a_{2}) \, da_{2}\\{} & {} a_{2} =0 \implies \varphi \left( a_{1}+a_{2} \right) =\varphi _{1} \left( a_{1}, a_{2} \right) = \sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin ( \omega _{n} a_{1} ) \\{} & {} \text { and } {\hat{\Phi }}_{n} = \dfrac{2}{T} \int \limits _{-T/2}^{T/2} \sin \left( \omega _{n} a_{1}\right) \varphi (a_{1})\, da_{1}. \end{aligned}$$

Remark 6

Theorem 3.2 may be generalised to the decomposition of \(\varphi \) into three or more orthogonal components. E.g. \(\varphi (a_{1}+a_{2}+a_{3})=\varphi _{1}(a_{1},a_{2},a_{3})+\varphi _{2}(a_{1},a_{2},a_{3})+\varphi _{3}(a_{1},a_{2},a_{3})+\varphi _{4}(a_{1},a_{2},a_{3}),\) where \(\lbrace \varphi _{1},\varphi _{2},\varphi _{3},\varphi _{4} \rbrace \) are orthogonal functions between themselves.

Axiom 3.1

The activation function \(\varphi (a)\) obeys the following properties:

Subspace projection:

\(\varphi (a)= \varphi _{1}(a+0)= \varphi _{2}(0+a)\)

Decomposition:

\(\varphi \left( a_{1}+a_{2} \right) = \varphi _{1}\left( a_{1}+a_{2} \right) + \varphi _{2}\left( a_{1}+a_{2} \right) \)

Additivity:

\( \varphi \left( a_{1}+a_{2} \right) =\varphi _{1}\left( a_{1}+a_{2} \right) + \varphi _{2}\left( a_{1}+a_{2} \right) \)

Theorem 3.3 establishes the interesting property that the components of the output are orthogonal between themselves.

Theorem 3.3

(complementarity of \(\varphi _{i}, i=1,2\)) The activation function components, \(\varphi _{1}\left( a_{1}+a_{2} \right) \) and \(\varphi _{2}\left( a_{1}+a_{2} \right) ,\) are orthogonal between themselves.

Proof

See Appendix B. \(\square \)

For computational purposes, there is also a representation for the splitting factor of the output, given in Theorem 3.4. That is, \( \varphi _{1}\left( a_{1},a_{2} \right) = \zeta \varphi (a). \)

Theorem 3.4

(Output split factor) The split factor for the output is

$$\begin{aligned} \zeta = \dfrac{1}{2} \left( 1 + \dfrac{\sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin \left( \omega _{n} \left( a_{1} - a_{2} \right) \right) }{\varphi (a)} \right) , \end{aligned}$$
(17)

with \(a=a_{1}+a_{2}.\)

Proof

$$\begin{aligned} \zeta= & {} \dfrac{ \varphi _{1}\left( a_{1},a_{2} \right) }{\varphi (a)}\\= & {} \dfrac{ \sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin (\omega _{n} a_{1} ) \cos (\omega _{n} a_{2}) }{\sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin \left( \omega _{n}\left( a_{1}+ a_{2} \right) \right) }\\= & {} \dfrac{1}{2} \dfrac{ \sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin \left( \omega _{n} \left( a_{1}+ a_{2} \right) \right) + \sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin \left( \omega _{n} \left( a_{1}- a_{2} \right) \right) }{\sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin \left( \omega _{n}\left( a_{1}+ a_{2} \right) \right) }\\= & {} \dfrac{1}{2} \left( 1 + \dfrac{\sum _{n=1}^{m} {\hat{\Phi }}_{n} \sin \left( \omega _{n} \left( a_{1} - a_{2} \right) \right) }{\varphi (a)} \right) \end{aligned}$$

\(\square \)

Remark 7

If \(a_{2}=0\), then \(a=a_{1}\) and \(\zeta =1\). If \(a_{1}=0\), then \(a=a_{2}\) and \(\zeta =0\). If \(a_{1}=a_{2}\), then \(\zeta =\dfrac{1}{2}.\)
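
A short numerical sketch of the split factor is given below (ours, under the same \(\tanh \) assumptions as in Sect. 5.1); \(\zeta \) is computed both directly as \(\varphi _{1}/S_{m}(a)\) and with the closed form (17), taking \(S_{m}(a)\) in the denominator as in the proof, and the two values coincide.

```matlab
% Sketch of Theorem 3.4: the output split factor zeta for the truncated tanh series.
abar = 4;  T = 4*abar;  w0 = 2*pi/T;  m = 8;
ag  = linspace(-T/2, T/2, 4001);
Phi = tanh(ag).*(abs(ag) <= abar) + tanh(-ag - 2*abar).*(ag < -abar) ...
    + tanh(-ag + 2*abar).*(ag > abar);
wn      = (2*(1:m) - 1)*w0;
Phi_hat = arrayfun(@(k) (2/T)*trapz(ag, sin(wn(k)*ag).*Phi), 1:m);
a1 = 0.7;  a2 = 1.5;  a = a1 + a2;
Sm   = @(s) sum(Phi_hat .* sin(wn*s));                    % truncated series S_m
phi1 = sum(Phi_hat .* sin(wn*a1) .* cos(wn*a2));          % Eq. (15)
zeta_direct = phi1 / Sm(a);                               % zeta = phi_1 / phi(a)
zeta_closed = 0.5*(1 + sum(Phi_hat .* sin(wn*(a1 - a2))) / Sm(a));   % Eq. (17)
[zeta_direct, zeta_closed]                                % the two values coincide
```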

4 Multi-back-propagation Algorithm

The back-propagation method [17] is used to train a feedforward NN to fit a data set, \(\left\{ x_{j}, d_{j}\right\} _{j=1,\ldots ,N},\) where \(x_{j}\) is the input and \(d_{j}\) the desired output. That is, the weights \(W \in \mathbb {R}^{N}\) are fine-tuned based on the current error, where every \(d_{j}=y_{j} +e_{j} \) and \(y_{j}=NN\left( x_{j}, W \right) .\) Prior to running the algorithm, the user needs to choose the number of layers, the number of neurones in each layer, the integration and activation functions, the number of iterations and the learning rate. The NN weights, together with the bias of every neurone, are initially randomised and then iteratively readjusted, by sending messages forward and backward alternately in the NN, until the error becomes acceptable [18]. At the end of this process, the adjustment function has been found. This identification process corresponds to a supervised learning method, widely used in the parametric learning of NN, conventionally called the NN back-propagation algorithm. Whenever this process proves suitable, the NN weights encode the information of the training data set. This is here referred to as the learning phase or Phase-1.

Assuming the decomposition of the information into several parcels to be possible, the herein proposed method will perform the adjustment task at the multidimensional neuron level, transferring the information between the respective sub-models. This is a multidimensional version of the error back-propagation technique, whose main objective is to maximise the separability criterion of the output components of the NN—Phase-2 or splitting phase.

Thus, once the adjustment function to the data has been found—Phase-1—the splitting phase—Phase-2—takes place: the neurones are forced to divide the data representing the adjustment function into two connected parts, which can be considered as the first and the second components. To do this, conceive the data set to be such that \(\varvec{d}_{j}=\left( d_{j}(1), d_{j}(2)\right) \) with \(d_{j}= d_{j}(1)+ d_{j}(2) \) and \(d_{j}(\ell ) \approx y_{j}(\ell ), \ell = 1,2,\) where \(y_{j}=y_{j}(1)+y_{j}(2)\) is calculated as in Theorem 3.2. The weights, W,  and the splitting coefficients, \(\zeta \) in (17), are randomised to initialise the Multi-back-propagation calculations. To calculate the split output to a desired precision, the fraction mean square error (FMSE) is minimised:

$$\begin{aligned} \min _{W} \sum _{j=1}^{N} \left\| \varvec{e}_{j} \right\| ^{2} = \min _{W} \dfrac{1}{N}\sum _{j=1}^{N} 2 \left( e_{j}(1)-e_{j}\right) e_{j}(1) + e_{j}^{2}. \end{aligned}$$
(18)

Since

$$\begin{aligned} d_{j}= & {} y_{j} +e_{j} \Leftrightarrow e_{j} = d_{j} -y_{j}\\ \text {with }d_{j}= & {} d_{j}(1)+ d_{j}(2) \implies d_{j}(2)= d_{j} -d_{j}(1) \\ y_{j}= & {} y_{j}(1)+ y_{j}(2) \implies y_{j}(2)= y_{j} -y_{j}(1) \\ \implies \left\| \varvec{e}_{j} \right\| _{2}^{2}= & {} e_{j}(1)^{2}+e_{j}(2)^{2} \\= & {} \left( d_{j}(1)- y_{j}(1)\right) ^{2} +\left( d_{j}(2)- y_{j}(2)\right) ^{2} \\= & {} \left( d_{j}(1)- y_{j}(1)\right) ^{2} +\left( \left( d_{j}- y_{j}\right) - \left( d_{j}(1) -y_{j}(1)\right) \right) ^{2} \\= & {} e_{j}(1)^{2}+ \left( e_{j} -e_{j}(1) \right) ^{2}. \end{aligned}$$

So \(\displaystyle \min _{W,\alpha } \left\| \varvec{e}_{j} \right\| _{2}^{2} \implies \min _{W,\alpha } e_{j}(1)^{2}+ \left( e_{j} -e_{j}(1) \right) ^{2}.\) \(\square \)
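
A minimal sketch of (18) is given below (ours; the targets, outputs and their first components are illustrative values); it shows that the right-hand side of (18) coincides with the sum of the squared component errors.

```matlab
% Sketch of the fraction mean square error (18) for a batch of N samples.
d  = [ 1.0; -0.5;  0.8];   d1 = [ 0.6; -0.1;  0.3];   % targets and their 1st components
y  = [ 0.9; -0.4;  0.8];   y1 = [ 0.5; -0.3;  0.5];   % outputs and their 1st components
N  = numel(d);
e  = d - y;   e1 = d1 - y1;                           % overall and 1st-component errors
FMSE_direct = (1/N) * sum(e1.^2 + (e - e1).^2);       % ||e_j||^2 averaged over samples
FMSE_eq18   = (1/N) * sum(2*(e1 - e).*e1 + e.^2);     % right-hand side of (18)
[FMSE_direct, FMSE_eq18]                              % identical, by the derivation above
```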

4.1 Algorithm Outline

Based on the results found in Sect. 3, an algorithm is outlined next for the multi-dimensional version of the NN—a multidimensional back-propagation algorithm. To make the proposed algorithm easier to understand, it is presented for dimension two.

The aim of the algorithm is to find a multilayer NN able to approximate the data set to a prescribed precision, that is, given \(\left\{ x_{j}, d_{j}\right\} _{j=1,\ldots ,N},\) obtain \(y_{j}\) such that \(\dfrac{1}{N} \sum _{j=1}^{N} \left( d_{j} -y_{j}\right) ^{2} < \varepsilon ,\) where \(\varepsilon \) is the required precision. Moreover, \(y_{j}\) is split into two components such that \(d_{j} (\ell )\approx y_{j}(\ell ), \ell = 1,2,\) and \(d_{j}=d_{j}(1)+d_{j}(2).\)

This algorithm takes place at Phase-2 and has two main stages. The first stage is optional, since the multidimensional NN may already exist. Otherwise, Stage-1 takes place (as described below), transforming the NN into a multidimensional NN. To do so, all variables are transformed from scalars to vectors and the activation function is decomposed in an orthogonal way. The new parameters are initialised.

After Stage-1, the NN has a multidimensional structure, where one of the neurone’s sub-models is a copy of the original neurone and the others are devoid of information (zero-valued weights). Stage-2 of the algorithm then transfers information between the sub-models (dimensions) of the neurone using a process identical to the back-propagation algorithm. However, in this case, back-propagation is not performed on the approximation error of the NN output, but rather on the separability value of the output components of the multidimensional neural network, according to (18).

Stage–1:

Transforms a NN into its multidimensional version.

  • Write \(\varphi (a)\) as a Fourier Series of m terms: \(S_{m}(a)\)

  • Decompose the activation functions using Theorem 3.2.

  • Initialise randomly the NN bias and splitting parameters \(\varvec{\alpha }.\)

Stage–2:

Propagating the separability error criterion backwards. Minimise the FMSE in (18) by tuning \(\varvec{\alpha }\).

  1. 1.

    Calculate \(y_{j}(1)\) and \(y_{j}(2)\) using the splitting factor calculated in (17). I.e.

    1. (a)

      Decompose the integration function using Theorem 3.1.

    2. (b)

      Calculate the splitting factor \(\zeta \) using (17).

    3. (c)

      Calculate \(y_{j}(1)\) and \(y_{j}(2)\) using \(y_{j}(1)=\zeta y_{j}\) and \(y_{j}(2)=(1-\zeta ) y_{j}\), \(j=1,\ldots , N\).

  2. 2.

    Calculate \(e_{j}(1) = y_{j}(1)- d_{j}(1)\) and \(e_{j} = y_{j} - d_{j}\), \(j=1,\ldots , N\).

  3. 3.

    Evaluate \(\dfrac{dE}{de_{j}(1)}=\dfrac{2}{N} \sum _{j=1}^{N} \left( 2 e_{j}(1) -e_{j}\right) \) to guide the search for maximum splitting into components \(y_{j}(1)\) and \(y_{j}(2)\).

  4. 4.

    Propagating the error backwards. Readjust \(\varvec{\alpha }\) to minimise (18).

In summary, the final objective of the algorithm is to decompose the network into two complementary parts. To do this, it minimises (18) taking into account maximum separability, which is achieved by taking advantage of component orthogonality. Obviously, every obtained component can be further split.
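
To make Stage-2 concrete, a minimal numerical sketch is given below for a single decomposed neurone (ours, not the algorithm implementation in [19]); the splitting vector \(\varvec{\alpha }\) is tuned by gradient descent on the criterion (18). The data, the learning rate and the finite-difference gradient are simplifying assumptions with respect to the back-propagation of the separability factor described above.

```matlab
% Sketch of Stage-2 for one neurone: tune alpha so that component 1 of the
% output matches a prescribed target split d1, while y1 + y2 stays equal to y.
% Assumed values: tanh extension of Sect. 5.1 (abar = 4, T = 16, m = 8),
% illustrative x, w, d1, learning rate eta and a finite-difference gradient.
abar = 4;  T = 4*abar;  w0 = 2*pi/T;  m = 8;
ag  = linspace(-T/2, T/2, 4001);
Phi = tanh(ag).*(abs(ag) <= abar) + tanh(-ag - 2*abar).*(ag < -abar) ...
    + tanh(-ag + 2*abar).*(ag > abar);
wn      = (2*(1:m) - 1)*w0;
Phi_hat = arrayfun(@(k) (2/T)*trapz(ag, sin(wn(k)*ag).*Phi), 1:m);
x  = [0.2; -1.0; 0.5; 1];                 % inputs plus bias (Phase-1 neurone)
w  = [0.7;  0.1; -0.4; 0.3];              % Phase-1 weights (kept fixed)
y  = sum(Phi_hat .* sin(wn*(w'*x)));      % undecomposed (truncated-series) output
d  = y;  d1 = 0.25*y;                     % desired split: 25% in component 1
y1of = @(al) sum(Phi_hat .* sin(wn*(w'*(al.*x))) .* cos(wn*(w'*((1-al).*x))));
E    = @(al) (d1 - y1of(al))^2 + ((d - d1) - (y - y1of(al)))^2;      % Eq. (18), N = 1
alpha = ones(size(x));  eta = 0.05;  h = 1e-6;
for it = 1:2000
  g = zeros(size(alpha));                 % finite-difference gradient of E w.r.t. alpha
  for i = 1:numel(alpha)
    dv = zeros(size(alpha));  dv(i) = h;
    g(i) = (E(alpha + dv) - E(alpha - dv)) / (2*h);
  end
  alpha = min(max(alpha - eta*g, 0), 1);  % gradient step, alpha kept in [0, 1]
end
[y1of(alpha), d1]                         % component 1 approaches the target split
```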

5 Discussion, Tests and Results

This section contains the assessment of different stages of the Multi-back-propagation algorithm and, in particular, of the neurone decomposition. Namely:

  1. A.

    The validation of the approximation of an activation function—the hyperbolic tangent—by a Fourier Series of m terms. See Sect. 5.1.

  2. B.

    The illustration of the decomposition of the activation function. See Sect. 5.2.

  3. C.

    To illustrate the performance of the Multi-back-propagation algorithm, two different examples are implemented, a 2D classifier and the decomposition of a 3D network. Both examples demand the estimation of a NN with high precision for a non-linear target function. Hence:

    1. 1.

      The decomposition of a 2D classification example into two sub-classifiers is presented in Sect. 5.3.

    2. 2.

      The decomposition of a 3D NN whose transfer function, \(y={NN} \left( x_{1},x_{2}\right) \), has the shape of a volcano is presented in Sect. 5.4. The network is split into two sub-models, \(y= {NN}_{1} \left( x_{1},x_{2}\right) + {NN}_{2} \left( x_{1},x_{2}\right) \), where one identifies a mountain, \({NN}_{1} \left( x_{1},x_{2}\right) \), and the other, \({NN}_{2} \left( x_{1},x_{2}\right) \), the volcano crater, that once added up returns the volcano.

For both problems, the transfer function, that is the sum of the estimated sub-models, is not changed and is equal to the initial one, which has been identified using the standard back-propagation algorithm.

To assess the transfer of information in every NN layer, namely at the n inputs, the following metric is used:

$$\begin{aligned} TI = \dfrac{\sum _{i=1}^{n} |\alpha _{i} ||w_{i} |}{\left\| {\tilde{w}} \right\| }, \end{aligned}$$
(19)

where n is the number of inputs of the neurone, and \(\varvec{\alpha }\) and \(\varvec{w}\) are as defined in Sect. 3. This metric is used to assess the information transfer in each neurone of every layer. It is calculated as the product of the \(\mathbf \alpha \) parameters by the weights of the normalised neurone. E.g., for \(n=2\) and \(p=2\) it translates into \(TI=\dfrac{ |\alpha _{1} w_{1}|+ |\alpha _{2} w_{2} |}{\sqrt{w_{1}^{2}+w_{2}^{2}}}.\)
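
A one-line numerical sketch of (19) for a single neurone follows (ours; \(\varvec{\alpha }\), \(\varvec{w}\) and the reading of \(\left\| {\tilde{w}} \right\| \) as the Euclidean norm of the weight vector are assumptions).

```matlab
% Sketch of the information-transfer metric (19) for one neurone with n = 2 inputs.
alpha = [0.9; 0.1];                                 % splitting parameters (illustrative)
w     = [0.7; -0.4];                                % corresponding weights (illustrative)
TI    = sum(abs(alpha) .* abs(w)) / norm(w)         % Eq. (19)
```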

The code [19] has been written in Matlab/Octave without using any specific toolbox; everything was implemented from scratch. Matlab served as the coding platform, chosen for its graphical capabilities.

In the implementation of both back-propagation algorithms, the classic (scalar) version and the multidimensional version, the gradient-descent method has been used; this method is simple to implement but slow to converge. Although the computational results were quite good, convergence might be improved by choosing a non-linear method such as Levenberg-Marquardt. However, since the results obtained in this work were quite satisfactory, we do not consider this to be an essential issue in the work.

5.1 Approximation of \(\varphi \) by a Fourier Series

To assess the decomposition of the activation function into orthogonal components, we consider \(\varphi (a) = \tanh (a),\) a commonly chosen activation function [12]. \(\Phi \) is defined as an extension of the hyperbolic tangent \(\varphi (a) =\dfrac{ \left( e^{a} - e^{-a}\right) }{\left( e^{a} + e^{-a}\right) },\) with \(\underline{a}=-4 \) and \(\overline{a}=4. \) As \(\delta =1,\) then \(\varepsilon = 1 + \varphi (-4) = 1 -\varphi (4) \approx 6.7 \times 10^{-4}.\)

Fig. 3

Representation of \(\Phi (a), a \in \left[ -\dfrac{T}{2}, \dfrac{T}{2}\right] \) and period \(T=16\)

To obtain the following result, we apply Lemma 2.1 to \(\varphi (a) = \tanh (a).\)

Corollary 1

Considering \(\varphi (a) = \tanh (a)\) approximated by the Fourier series of m terms, \(S_{m}(a),\) in the interval \(\left[ -\dfrac{T}{4},\dfrac{T}{4}\right] ,\) the value of the mean quadratic error is

$$\begin{aligned} \int \limits _{-T/4 }^{T/4} R_{m}^{2}(a)da = \dfrac{T}{2}\left( 1-\dfrac{4}{T}\tanh \left( \dfrac{T}{4}\right) - \dfrac{1}{2} \sum _{n=m+1}^{\infty } {\hat{\Phi }}_{n}^{2}\right) . \end{aligned}$$
(20)

Proof

$$\begin{aligned} \int \limits _{-T/4}^{T/4} |\varphi (a) |^{2} da= & {} \int \limits _{-T/4}^{T/4} |S_{m} (a)|^{2} da + \int \limits _{-T/4}^{T/4} |R_{m}(a) |^{2} da \Leftrightarrow \\ \Leftrightarrow \dfrac{2}{T} \int \limits _{-T/4}^{T/4} |R_{m}(a) |^{2} da= & {} \dfrac{2}{T} \int \limits _{-T/4}^{T/4} |\varphi (a) |^{2} da - \dfrac{2}{T} \int \limits _{-T/4}^{T/4} |S_{m}(a) |^{2} da \Leftrightarrow \\= & {} \dfrac{2}{T} \int \limits _{-T/4}^{T/4} |\tanh (a)|^{2} da - \dfrac{2}{T} \int \limits _{-T/4}^{T/4} |S_{m}(a) |^{2} da \Leftrightarrow \\= & {} \dfrac{2}{T} |a -\tanh (a)|_{-T/4}^{T/4} - \dfrac{T}{2} \sum _{n=m+1}^{\infty } {\hat{\Phi }}_{n}^{2} \Leftrightarrow \\= & {} 1-\dfrac{4}{T}\tanh \left( \dfrac{T}{4}\right) - \dfrac{1}{2} \sum _{n=m+1}^{\infty } {\hat{\Phi }}_{n}^{2}. \end{aligned}$$

\(\square \)

Corollary 2

Assuming the saturation level \(\delta =1\) for the activation function \(\varphi (a) = \tanh (a),\) the value of the mean quadratic error in intervals \(\left[ -\infty , -\dfrac{T}{4} \right] \cup \left[ \dfrac{T}{4},\infty \right] \) is

$$\begin{aligned} \int \limits _{-\infty }^{-T/4} R_{m}^{2}(x)dx + \int \limits _{T/4 }^{\infty } R_{m}^{2}(x)dx= & {} 2 \int \limits _{-\infty }^{-T/4} (1+\tanh (x))^{2}dx \nonumber \\= & {} 4 \left( \log \left( 2 \cosh \dfrac{T}{4}\right) - \dfrac{T}{4}\right) + 2 \tanh \dfrac{T}{4}-2. \end{aligned}$$
(21)

Proof

The result is obtained after a few standard calculations. \(\square \)

Take \(m=8\) and \(T=16\); the hyperbolic tangent activation function then becomes \(\displaystyle \varphi (a) = \tanh (a) \approx \sum \nolimits _{n=1}^{m=8} {\hat{\Phi }}_{n} \sin (\omega _{n} a)\). Table 1 lists the values of the Fourier series coefficients, \({\hat{\Phi }}_{n}\), \(n=1,\cdots , m\). The Fourier harmonic terms are shown in Fig. 4, as well as the Fourier series approximation.

From Formulas (20) and (21) the errors are calculated:

  • \(\left[ -T/4, T/4\right] \): \( R_{m}^{2} (x)=1.0 \times 10^{-7}\)

  • \(\left[ -\infty , -T/4\right] \cup \left[ T/4, \infty \right] \): \(R_{m}^{2}(x)=6.7 \times 10^{-4}\)

Table 1 \(\varphi (x) = \tanh (x),\) \(\omega _{0}=\pi /8,\) \(m=8,\) \(T=16\)
Fig. 4

Fourier harmonics representation: \(\varphi (x) = \tanh (x),\) \(\omega _{0}=\pi /8,\) \(m=8\) and \(T=16\)

5.2 Orthogonal Decomposition of \(\varphi \)

Fig. 5

Decomposition of \(\tanh (a)\) into orthogonal components \(\varphi _{1}\left( a_{1},a_{2} \right) \) and \(\varphi _{2}\left( a_{1},a_{2} \right) \) using different splitting values

Considering \(\varphi (a)= \tanh (a), \) Fig. 5 shows its decomposition into orthogonal components \(\varphi _{1}\left( a_{1},a_{2} \right) \) and \(\varphi _{2}\left( a_{1},a_{2} \right) \) using different splitting values \(\lambda ,\) i.e., \(a_{1}=\lambda a\) and \(a_{2}= (1-\lambda ) a.\) Black represents component \(\varphi _{1}\left( a_{1},a_{2} \right) ,\) blue \(\varphi _{2}\left( a_{1},a_{2} \right) \) and magenta the sum \(\varphi _{1}\left( a_{1},a_{2} \right) + \varphi _{2}\left( a_{1},a_{2} \right) .\) The results reveal that the two components complement each other, with the sum \(\varphi _{1}\left( a_{1},a_{2} \right) + \varphi _{2}\left( a_{1},a_{2} \right) \) matching \(\varphi \left( a \right) \) for every \(\lambda \in \left\{ 0, \dfrac{1}{8}, \dfrac{1}{4}, \dfrac{3}{8}, \dfrac{1}{2}, \dfrac{5}{8}, \dfrac{3}{4}, \dfrac{7}{8}, 1\right\} .\) According to the input decomposition obtained, \(a=a_{1}+a_{2},\) the output splitting factor \(\zeta \) is calculated from (17).

5.3 2D Classification Problem

In this 2D example, a NN classifier is decomposed into two sub-classifiers such that the first one has the left half-plane as domain and the second the right half-plane. The chosen function/data represents four semi-circles, two in each half-plane, as shown in Fig. 6a. The semi-circles are not compact, to make the function harder to approximate. The classification has the following requirements: (1) whenever a small semi-circle belongs to a class, the exterior semi-circle in the same half-plane should belong to the other class; (2) whenever a small semi-circle belongs to a class, the other small semi-circle should belong to the other class. The classification was done in two consecutive stages: Stage-1, using a standard feedforward NN and the standard back-propagation algorithm; Stage-2, conceiving a multivariable network and using the Multi-back-propagation algorithm.

Fig. 6

Training data for the classification problem: blue—class+1, \(d=+1\); magenta—class-1, \(d=-1\)

The initial NN divides the space into regions, labelled Classes 1 and 2 (in magenta and blue, respectively, and labelled −1 and 1). The results obtained at the first stage entirely reproduced the initial data, i.e., Fig. 6. After \(10^{5} \) iterations using the standard back-propagation algorithm, the NN shows an MSE of \(1.403805\times 10^{-4}\) in the classification. For a null threshold of the neural network output, the MSE is zero.

Fig. 7

Classification using the Multi-back-propagation algorithm

The proposed method decomposes the original NN into components. The first component classifies only the left half-plane and the second the right half-plane. In the regions not covered by each of the sub-models, the NN output components yield a null result (labelled 0), which means the absence of classification. This example aims to illustrate the decomposition of a classification problem into subproblems restricted to sub-regions of the classification space.

Fig. 7 shows the results obtained at the second stage. The first component of the NN classifies the left half-plane, \(x_{1} <0,\) and is neutral in the right half-plane—null classification value (see Fig. 7a). The second component of the NN classifies the region \(x_{1}\ge 0\) and is null for the left half-plane (see Fig. 7b). Figure 8 shows the surfaces of the transfer functions for both NN components.

After \(10^{5}\) iterations, with a learning factor of 0.2 and an MSE of 0.0038 in the last iteration, the sum of the two partial classifications was obtained with an MSE \( = 6.5939\times 10^{-7}\) when compared to the original classification of the first stage. However, both show a null classification error for thresholds of \(\pm 0.5\). One can observe that the sum of the two classifiers gives the same result as the standard approach.

Fig. 8

Classification curves of the sub-models

Fig. 9

Assessment of the results obtained with the splitting method applied to the classifier

Fig. 9a shows the transfer of information TI, as defined in (19), and the input splitting vector \(\varvec{ \alpha } \), as defined in (14), for all neurones of the two layers of the NN. In Fig. 9b, the split classifier shows an FMSE equal to \(4.922010\times 10^{-1}\) and a partial MSE for the first subproblem, according to formula (18), of 0.0037.

5.4 Surface Decomposition: Volcano

We discuss the decomposition of the identification of a surface that has the shape of a volcano. In this example, a volcano surface is obtained as the sum of a mountain and a crater. This is illustrated in the first row of Fig. 10, where the mountain is the first figure, the crater the second and the volcano the third. At Stage-1, to initialise the whole procedure, the standard back-propagation technique was used with a NN of 4 layers, with 20 and 10 neurones in the hidden layers, respectively, and \(\tanh (a)\) as the activation function.

Next, at Stage-2, we want to decompose the initial \(NN \left( x_1, x_2 \right) \), whose transfer function represents the surface of a volcano. The two components \(NN_1 \left( x_1, x_2 \right) \) and \(NN_2 \left( x_1, x_2 \right) \) model the mountain and the crater, respectively, using a separability criterion that is dominantly mountain-shaped for the 1st component, while the 2nd component is left with the remaining part of the NN (i.e., the crater-shaped form).

Fig. 10

Row 1: the envisaged surface decomposition. Row 2: surface decomposition using the splitting approach

Fig. 11

Assessment of the results: splitting method applied to the volcano

The decomposition was done using the Multi-back-propagation algorithm. The results obtained can be seen in the second row of Fig. 10, where the third figure of the row shows the reconstitution of the volcano. Figure 11 shows the transfer of information for every neurone of the two layers of the NN, using the metric defined in (19) together with the splitting vector \(\varvec{\alpha }\) in (14). This example shows that it is possible to decompose the transfer function of a NN as a sum of components where every component models a different aspect.

6 Conclusions and Future Work

This work presents a new algorithm for network training that aims to re-arrange the information of an already existing NN. To do this, the neurone is seen as a multivariable entity whose inputs, bias, and outputs are split into two or more components to optimise a separability criterion. Fundamental to the whole procedure is to demonstrate that the activation function can be approximated by a Fourier series of m terms and that this is the best approximation of its order in the space \(L^{2}\left( [-T/2, T/2 ]\right) .\) Moreover, the components of the activation function are orthogonal between themselves, assuring full decomposition as well as reconstitution of the original whole. Thus, the transfer function is decomposed into orthogonal sub-models whose sum reconstitutes the original shape, which means that information is preserved.

The result is a new algorithm, the Multi-back-propagation method. Its outcome is the splitting of the NN into disjoint components. In the first phase, the algorithm uses the standard back-propagation technique to train the NN from data. This first phase is optional, since the NN may already exist. In the second phase, the trained network is split into components. The whole procedure was outlined in Sect. 4. In Sect. 5, our objective was to illustrate the different features of the algorithm, which we view as an extension of the back-propagation technique, applicable whenever the decomposition of a problem becomes possible. Thence, in Sect. 5.1, the approximation of the activation function by a truncated Fourier series is illustrated for a particular function - the hyperbolic tangent. This approximation is decomposed into orthogonal components, which enables the splitting of the input and output signals and leads to a sum that recovers the original trained NN. The orthogonal decomposition, using different splitting values, is assessed in Sect. 5.2. Moreover, the different aspects of the decomposition of the multivariable neurone have been mathematically demonstrated. In Sects. 5.3 and 5.4, the method is applied to two different problems: (1) a 2D classification problem and (2) the identification of a 3D surface. In both cases, it was possible to recover the original transfer function of the NN from the disjoint split components. The assessment examples are original and have been constructed from scratch to accommodate the idiosyncrasies of the presented method.

We would like to mention that, although the issue of information separability criteria has not been addressed in this work, we present two examples to assess the proposed method. In the first example (Sect. 5.3), the classifier is unfolded into two sub-classifiers applicable to the two distinct half-planes of the classification domain. In the second example (Sect. 5.4), the separability criterion used favours the sub-model with a mountain shape, leaving the remaining information to the other sub-model.

From our research, we find this method very promising, specifically in separating different aspects of a problem. We also understand that a few issues still need further research, and we would like to delve deeper into them. Namely, we would like to do some more work to strengthen the algorithm. For instance, in the example of Sect. 5.1, the decomposition of \(\tanh (a)\) has been done using the same \(\varvec{\alpha }\) for all the harmonics of the Fourier series. However, different values of \(\varvec{\alpha }\) could be used to calculate the different terms.

The issue of information separability criteria should be addressed. Also, different orthogonal functions can be used to approximate the activation function. An example that splits the NN information into more than two parts has yet to be constructed. Moreover, different metrics can be used to assess the transfer of information between neurones. It is also paramount to apply the splitting multi-algorithm to problems with real added value.