1 Introduction

Artificial neural networks (ANNs) have proved to be valuable tools in a host of different applications, such as function approximation and data fitting [2], the solution of ordinary and partial differential equations [15,16,17], time-series prediction for the stock market [34], pattern recognition [2, 29], classification [35] and clustering [7]. ANNs are flexible modeling functions known for their excellent approximation capabilities [6, 10,11,12, 14] and have been termed “Universal Approximators.” ANNs may be designed according to various architectures, the main structural elements being the number of hidden layers, the number of neurons and the type of activation functions. Deep neural networks (DNNs) are ANNs with multiple hidden layers and can model complex mappings between the input and output layers.

ANNs suffer from the issue of overfitting, i.e., producing a model that is very accurate on a subset of the data while failing to account for the rest. In DNNs, the overfitting issue is even more pronounced, because the extra layers make it easier to fit outliers. Several techniques, known collectively under the name “Regularization Methods,” have been developed to combat overfitting. Examples are node pruning [31], weight decay (or \(L_2\) regularization) [1], weight bounding [20], sparsity (or \(L_1\) regularization) and, more recently, the “dropout” technique [32], which may be roughly described as random pruning, determinantal point processes (DPPs) [22] and approximate empirical Bayes methods [37]. ANNs are trained using a so-called training set, and their performance is evaluated using a “test set.” Networks that perform well on the test set are said to generalize. An overfit/overtrained network obviously does not generalize and therefore cannot be trusted for further use.

In the present article, we introduce a new type of ANN, the “functionally weighted neural network” (FWNN). Single-hidden-layer ANNs may be expressed as a linear combination of a number of parametric basis functions. Common forms are based on the logistic and Gaussian activation functions, namely:

$$\begin{aligned} N_l(\varvec{x};\;\theta )&=w_0+\sum _{k=1}^{K}\frac{w_k}{1+\exp \left( -(c_k^T\varvec{x}+b_k)\right) } \quad \hbox {(Logistic MLP)} \end{aligned}$$
(1)
$$\begin{aligned} N_G(\varvec{x};\;\theta )&=w_0+\sum _{k=1}^{K}w_k\exp \left( -\frac{1}{2}\left| \frac{\varvec{x}-\varvec{\mu }_k}{\sigma _k}\right| ^2\right) \quad \hbox {(Gaussian RBF)} \end{aligned}$$
(2)

where \(\theta\), in both cases, stands collectively for the adjustable parameters and K is the number of neural nodes.

Our proposal introduces a neural network that employs a continuous nodal distribution \(\rho (s),\) instead of a countable set of discrete nodes. The corresponding functionally weighted expressions for logistic and Gaussian activation functions may be cast as:

$$\begin{aligned} N_{Fl}(\varvec{x};\theta )&=\int \frac{w(s)}{1+\exp \left( -(c(s)^T\varvec{x}+b(s))\right) }\rho (s){\rm d}s \end{aligned}$$
(3)
$$\begin{aligned} N_{FG}(\varvec{x};\theta )&= \int w(s)\exp \left( -\frac{1}{2}\left| \frac{\varvec{x}-\varvec{\mu }(s)}{\sigma (s)}\right| ^2\right) \rho (s){\rm d}s \end{aligned}$$
(4)

Preliminary results assessing the performance of FWNNs have been reported earlier [3] and have been presented at the Sofianos-2017 international symposium. The substitution of discrete weights by continuous functions has also been considered in [30], where, however, the activation is restricted to be an odd function, and the weights are either piecewise constant or piecewise affine functions. Polynomials were not considered there, because the integrals involved cannot be expressed in closed analytic form. To the best of our knowledge, this work has not been followed up.

In Sect. 2, we introduce the proposed neural network with continuous weight functions, by relating it to an ordinary radial basis function (RBF) network and presenting the transition to the continuum. Section 3 gives technical details about the numerical quadrature, the training optimization methods and the software platforms used. In Sect. 4, we report the results of numerical experiments conducted on simulated homemade datasets as well as on established benchmarks from the literature. Finally, in Sect. 5, we summarize the strengths of the method and pose a few questions that may become the subject of future research.

2 Neural networks with an infinite number of hidden units

Radial basis functions are known to be suitable for function approximation and multivariate interpolation [4, 27]. Assuming an n-dimensional input space, \(\varvec{x} \in R^n\), an RBF neural network consisting of K Gaussian nodes with parameters \(\varvec{\mu }_k \in R^n\) and \(\sigma _k \in R\) is given by Eq. (2).

The set \(\theta =\{ w_0,\left( w_k,\varvec{\mu }_k,\sigma _k \right) _{k=1}^K \}\) denotes collectively the network parameters to be determined via the training procedure. The total number of adjustable parameters is given by the expression

$$\begin{aligned} N_{var}^{RBF}= K(2+n) + 1 \end{aligned}$$
(5)

which grows linearly with the number of network nodes. Consider a dataset \(S=\{\varvec{x}_{i},t_{i}\}\), where \(t_i\) is the desired output (target) for the corresponding input \(\varvec{x}_i\). Let also \(T \subset S\) be a subset of S with cardinality \(\#T\). The approximating RBF network is then determined by minimizing the mean squared deviation over T:

$$\begin{aligned} E_{[T]}(\theta ) {\mathop {=}\limits ^{\text {\tiny def}}}\frac{1}{\#T} \sum _{\varvec{x}_i,t_i \in T} \left( N_{G}(\varvec{x}_i; \theta )-t_i \right) ^2 \end{aligned}$$
(6)

Let \(\widehat{\theta }= \lbrace \widehat{w}_0, \left( \widehat{w}_k,\widehat{\varvec{\mu }}_k,\widehat{\sigma }_k \right) _{k=1}^K\rbrace\) be the minimizer of \(E_{[T]}(\theta )\), i.e.,

$$\begin{aligned} \widehat{\theta } = \arg \min _{\theta } \lbrace E_{[T]}(\theta ) \rbrace . \end{aligned}$$
(7)

The network’s generalization performance is measured by the mean squared deviation, \(E_{[ S-T ]}(\widehat{\theta })\), over the relative complement set \(S-T\). In the neural network literature, T is usually referred to as the “training” set, while \(S-T\) as the “test” set. A well-studied issue is the proper choice for K, which denotes the number of nodes in the neural network architecture.
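For concreteness, the sketch below evaluates the Gaussian RBF network of Eq. (2) and the training error of Eq. (6) with numpy; the function and variable names are purely illustrative and not part of any specific library.

```python
import numpy as np

def rbf_output(x, w0, w, mu, sigma):
    """Gaussian RBF network of Eq. (2): x is an n-vector, w has shape (K,),
    mu has shape (K, n) and sigma has shape (K,)."""
    d2 = np.sum((x - mu) ** 2, axis=1)                # squared distances |x - mu_k|^2
    return w0 + np.sum(w * np.exp(-0.5 * d2 / sigma ** 2))

def training_error(X, t, w0, w, mu, sigma):
    """Mean squared deviation of Eq. (6) over the training pairs (X, t)."""
    preds = np.array([rbf_output(x, w0, w, mu, sigma) for x in X])
    return np.mean((preds - t) ** 2)
```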

The training “error” \(E_{[T]}(\widehat{\theta })\) is a monotonically decreasing function of K, while the test “error” \(E_{[S-T]}(\widehat{\theta })\) is not. Hence, we may encounter a situation where adding nodes, in an effort to reduce the training error, results in an increase of the test error, thereby spoiling the network’s generalization ability. This behavior is known as “overfitting” or “overtraining” and is clearly undesirable. An early analysis of this phenomenon, under the name “bias–variance dilemma,” may be found in [9]. Overfitting is a serious problem, and considerable research effort has been invested in ways to mitigate it, leading to the development of several techniques such as model selection, cross-validation, early stopping, regularization and weight pruning [2, 9, 13, 24, 25].

2.1 Functionally weighted neural network

We define the “functionally weighted neural network” (FWNN) to be the limit of the conventional ANN, as the number of nodes \(K \rightarrow \infty\). The set of discrete nodes indexed by an integer (k) is replaced by a nodal distribution \(\rho (s)\) that depends on a continuous variable (s). The FWNN may then be cast, in correspondence with Eq. (2), as:

$$\begin{aligned} N_{FG}(\varvec{x};\theta )= \int _{-1}^{1}{\rm d}s \ \rho (s) \ \tilde{w}(s) \exp \left( -\frac{ |\varvec{x}-\varvec{\mu }(s)|^2 }{2 \sigma ^2 (s) }\right) , \end{aligned}$$
(8)

by applying the following transitions:

$$\begin{aligned} w_k&\longrightarrow \tilde{w}(s) \end{aligned}$$
(9a)
$$\begin{aligned} \varvec{\mu }_k&\longrightarrow \varvec{\mu }(s) \end{aligned}$$
(9b)
$$\begin{aligned} \sigma _{k}&\longrightarrow \sigma (s)\end{aligned}$$
(9c)
$$\begin{aligned} \sum _{k=1}^{K}&\longrightarrow \int _{-1}^{1} {\rm d}s \ \rho (s) \end{aligned}$$
(9d)

The density function \(\rho (s)\) should lead to an infinite number of nodes, i.e.

$$\begin{aligned} \int _{-1}^{+1} \rho (s) \ {\rm d}s \rightarrow \infty . \end{aligned}$$
(10)

For the density function, we have chosen the following form that satisfies (10):

$$\begin{aligned} \rho (s)=\frac{1}{1-s^2} \end{aligned}$$
(11)

The weight functions \(\tilde{w}(s),\varvec{\mu }(s) \hbox { and } \sigma (s)\) are parametrized, and these parameters are collectively denoted by \(\theta\). In this article, we have examined the following functional forms:

$$\begin{aligned} {\tilde{w}}(s)&\equiv \sqrt{1-s^2} w(s) = \sqrt{1-s^2} \sum _{j=0}^{L_w} w_j s^j \end{aligned}$$
(12a)
$$\begin{aligned} \varvec{\mu }(s)&= \sum _{j=0}^{L_{\mu }}\varvec{\mu }_j s^j \end{aligned}$$
(12b)
$$\begin{aligned} \sigma (s)&= \sum _{j=0}^{L_{\sigma }}\sigma _j s^j \end{aligned}$$
(12c)

Note that \(\varvec{\mu }(s)\) and \(\varvec{\mu }_j=\left( \mu _{jl}\right) _{l=1}^n \ , \ j=0,\ldots ,L_{\mu }\) are vectors in \(R^n\).

The set of adjustable parameters is then represented by:

$$\begin{aligned} \theta =\lbrace \left( w_j \right) _{j=0}^{L_w}, \ \left( \mu _{jl} \right) _{j=0, l=1}^{\ L_{\mu } , \ n }, \ \left( \sigma _j \right) _{j=0}^{L_{\sigma }} \rbrace \end{aligned}$$
(13)

with a total parameter number given by:

$$\begin{aligned} N_{var}^{FW} = (1+L_w)+ n( L_{\mu }+1) + (L_{\sigma }+1) = L_w+nL_{\mu }+L_{\sigma }+n+2 \end{aligned}$$
(14)
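As a concrete check of Eq. (14), consider the settings used for the simulated datasets of Sect. 4.1, namely \(L_w=5\) and \(L_{\mu }=L_{\sigma }=1\): Eq. (14) then gives \(N_{var}^{FW}=(1+5)+n(1+1)+(1+1)=2n+8\), i.e., ten adjustable parameters for a one-dimensional input and twelve for a two-dimensional one.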

The “cost” function \(C(\theta )\) is formed by adding a regularization term \(R(\theta )\) to the mean squared deviation of Eq. (6),

$$\begin{aligned} C(\theta ) \;{\mathop {=}\limits ^{\text {\tiny def}}} \;E_{[T]}(\theta ) + R(\theta ) \end{aligned}$$
(15)

\(C(\theta )\) serves as the objective function for the optimization task, and from now on we redefine \(\widehat{\theta }\) as \(\widehat{\theta }= \arg \min _{\theta } \lbrace C(\theta ) \rbrace\). For the regularization term \(R(\theta )\), the squared Euclidean (\(L_2\)) norm of the parameter vector, multiplied by a penalty factor, has been adopted.
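Written out, this choice takes the form \(R(\theta ) = \alpha \Vert \theta \Vert _2^2\), where we write \(\alpha\) for the penalty factor.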

3 Technical details

In this section, we present the numerical methods used in our calculations. Namely, we describe the employed integration technique, the optimization procedure, and we also refer to the relevant software.

Substituting the nodal density from Eq. (11) in Eq. (8) and using Eq. (12a), the FWNN may be rewritten as:

$$\begin{aligned} N_{FG}(\varvec{x};\theta )= \int _{-1}^{1}\frac{{\rm d}s}{\sqrt{1-s^2}}w(s) \exp \left( -\frac{ |\varvec{x}-\varvec{\mu }(s)|^2 }{2 \sigma ^2 (s) }\right) . \end{aligned}$$
(16)

3.1 Approximating integrals

Integrals were approximated by Gauss–Chebyshev quadrature:

$$\begin{aligned} \int _{-1}^{1}\frac{{\rm d}s}{\sqrt{1-s^2}} g(s) \approx \frac{\pi }{M}\sum _{i=1}^{M} g(s_i) , \end{aligned}$$
(17)

where

$$\begin{aligned} s_i=\cos \left( \frac{2i-1}{2M}\pi \right) . \end{aligned}$$

The above explains our choice for the functional form of \({\tilde{w}}(s)\) in Eq. (12a). In our experiments, we have used \(M=100\). The number of integration points has been increased up to \({M=200}\), without noticing any appreciable difference.
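To make the construction explicit, the following Python sketch combines the polynomial weight functions of Eqs. (12a)–(12c) with the quadrature rule of Eq. (17) to evaluate Eq. (16); it is a minimal illustration with names of our choosing, not the code used in the experiments.

```python
import numpy as np

def fwnn_output(x, w_coef, mu_coef, sigma_coef, M=100):
    """Evaluate the FWNN of Eq. (16) at an n-dimensional input x with
    M-point Gauss-Chebyshev quadrature, Eq. (17).

    w_coef     : (L_w + 1,)      coefficients of w(s), Eq. (12a)
    mu_coef    : (L_mu + 1, n)   coefficients of mu(s), Eq. (12b)
    sigma_coef : (L_sigma + 1,)  coefficients of sigma(s), Eq. (12c)
    """
    i = np.arange(1, M + 1)
    s = np.cos((2 * i - 1) * np.pi / (2 * M))                        # Chebyshev nodes s_i

    w = (s[:, None] ** np.arange(len(w_coef))) @ w_coef              # w(s_i)
    mu = (s[:, None] ** np.arange(mu_coef.shape[0])) @ mu_coef       # centres mu(s_i), shape (M, n)
    sigma = (s[:, None] ** np.arange(len(sigma_coef))) @ sigma_coef  # widths sigma(s_i)

    g = w * np.exp(-0.5 * np.sum((x - mu) ** 2, axis=1) / sigma ** 2)  # integrand of Eq. (16)
    return (np.pi / M) * np.sum(g)                                   # Gauss-Chebyshev sum
```

As a usage example, `fwnn_output(np.zeros(2), np.ones(6), np.zeros((2, 2)), np.array([1.0, 0.0]))` evaluates a 2d-input network (with \(L_w=5\), \(L_\mu =L_\sigma =1\)) at the origin.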

3.2 Learning procedure and software platforms

Determination of the FWNN parameters is accomplished by minimizing the cost function given in Eq. (15). Since objectives of this kind are known to be multimodal, global optimization should be considered. We have employed a simple stochastic global optimization technique known as “Multistart” [33]. This is a two-phase method, consisting of an exploratory global phase and a subsequent local minimum-seeking phase.

In Multistart, a point \(\theta\) is sampled uniformly from within the feasible region, \(\theta \in S\), and a local search \(\mathcal L\) is started from it, leading to a local minimum \(\widehat{\theta } =\mathcal L(\theta )\). If \(\widehat{\theta }\) is a minimum found for the first time, it is stored; otherwise, it is rejected. The cycle goes on until a stopping rule [18] instructs termination. An algorithmic presentation of Multistart is given below:

Simple Multistart Algorithm

  1. Initialize: Set \(k=1\), sample \(\theta \in S\) and set \(\widehat{\theta }_k=\mathcal L(\theta )\).

  2. If a termination rule applies, set \(\widehat{\theta }=\widehat{\theta }_m\) and stop (m is the index with the property \(\displaystyle C(\widehat{\theta }_m)= \min _i\{C(\widehat{\theta }_i)\}\)).

  3. Main iteration: Sample \(\theta \in S\) and compute \(\widehat{\theta }=\mathcal L(\theta )\). If \(\widehat{\theta } \notin \{\widehat{\theta }_1,\widehat{\theta }_2,\dots ,\widehat{\theta }_k\}\), set \(k \leftarrow k+1\) and \(\widehat{\theta }_k \leftarrow \widehat{\theta }\).

  4. Repeat from step 2.

The computer code was written in Python. For the local phase, we have relied on the quasi-Newton BFGS method with a line search satisfying the weak Wolfe–Powell conditions, as implemented in Python's scipy.optimize library.
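A compact sketch of the Multistart procedure outlined above, using scipy's BFGS routine as the local search \(\mathcal L\), is given below; the sampling box, the distinctness tolerance and the fixed start budget (standing in for the stopping rule of [18]) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def multistart(cost, dim, bounds=(-5.0, 5.0), n_starts=50, tol=1e-6, rng=None):
    """Simple Multistart: repeatedly sample a uniform starting point, run a
    BFGS local search and keep distinct minima; return the best minimizer."""
    rng = np.random.default_rng(rng)
    minima = []                                   # list of (theta_hat, C(theta_hat))
    for _ in range(n_starts):                     # fixed budget stands in for a stopping rule
        theta0 = rng.uniform(bounds[0], bounds[1], size=dim)
        res = minimize(cost, theta0, method="BFGS")
        # store the local minimum only if it has not been found before
        if not any(np.linalg.norm(res.x - m[0]) < tol for m in minima):
            minima.append((res.x, res.fun))
    return min(minima, key=lambda m: m[1])[0]     # theta_hat with the lowest cost
```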

4 Numerical experiments, comparative analysis and extrapolation performance

A series of numerical experiments was devised for testing the performance of the proposed FWNN, by comparing its outcomes against those obtained by a number of established alternatives. We have considered both homemade simulated datasets and benchmarks that are widely used in the relevant scientific literature.

In our experiments, we have compared the FWNN with MLP and RBF networks, as well as with Gaussian processes (GPs). For the neural networks, a host of architectural configurations, created by varying the number of hidden nodes \(K \in [5, 100]\), has been considered. MLPs were trained by the “Limited Memory BFGS” (L-BFGS) method, which has modest memory requirements and has proved to be quite efficient. For the RBFs, the exponential parameters were determined by K-means clustering, while the amplitudes were determined by linear regression. For the Gaussian processes, we have considered RBF kernels with automatic determination of the scalar kernel parameter in the range \([10^{-5}, 10^5]\). In all cases (MLP, RBF, GPs), the following values of the regularization parameter have been used: \(\alpha = \{10^{-10}, 10^{-5}, 10^{-3}, 10^{-2}, 10^{-1}, 1.0, 10, 10^2, 10^{3}, 10^5\}\). We have noticed that in some cases the regularization parameter had a significant effect. For every experiment, only the best result of each approach is reported for comparison with the corresponding FWNN outcome. The reasons for including Gaussian processes in our experimental study are, first, their modeling potential and, second, the fact that certain neural networks become identical to a Gaussian process with a specific type of covariance function in the limit of infinitely many hidden units [25, 28]. Finally, we have used Python’s Scikit-learn library for the implementation of the above three regression methodologies.
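As an illustration of this setup, the snippet below shows how one such (K, α) combination can be assembled with Scikit-learn; the placeholder data, the particular values of K and α, and the width heuristic used for the RBF network are illustrative assumptions, not the exact settings of our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 2))              # placeholder training inputs
t_train = np.sin(X_train[:, 0]) * np.cos(X_train[:, 1])  # placeholder targets

K, alpha = 20, 1e-3                                      # one (nodes, regularization) pair

# Logistic MLP trained with L-BFGS and L2 penalty alpha
mlp = MLPRegressor(hidden_layer_sizes=(K,), activation="logistic",
                   solver="lbfgs", alpha=alpha, max_iter=5000).fit(X_train, t_train)

# Gaussian process with an RBF kernel; the length scale is optimized within [1e-5, 1e5]
gp = GaussianProcessRegressor(kernel=RBF(length_scale_bounds=(1e-5, 1e5)),
                              alpha=alpha).fit(X_train, t_train)

# RBF network: centres from K-means, a common width from the centre spread (heuristic),
# amplitudes from regularized linear regression on the Gaussian features
centres = KMeans(n_clusters=K, n_init=10).fit(X_train).cluster_centers_
width = np.mean(np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1))
phi = np.exp(-0.5 * np.linalg.norm(X_train[:, None, :] - centres[None, :, :], axis=-1) ** 2
             / width ** 2)
rbf_amplitudes = Ridge(alpha=alpha).fit(phi, t_train)
```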

4.1 Experiments with simulated datasets

Several datasets were constructed by evaluating selected test functions at predefined sets of equidistant points. Four test functions were employed for the 1d experiments and three for the 2d ones; their plots and formulas are depicted in Figs. 1a–d and 2a–c.

Fig. 1 Generating functions used for creating the 1d datasets. In each case, 100 training and 1000 testing points were used

Fig. 2 Generating functions used for creating the 2d datasets. In each case, 100 training and 1000 testing points were used

Each dataset was divided into a training set and a test set. The target values of the training sets have been deliberately “contaminated” by the addition of noise. The test sets, on the other hand, have been left “clean,” i.e., without noise, so that one can assess the capability of the tested methods to filter out the noise and recover the underlying function.

In our experiments, we compare the FWNN to logistic MLP and Gaussian RBF networks with “weight decay” (\(L_2\)) regularization. For the evaluation, we use the “Normalized Mean Squared Error” (NMSE) over the test set \([S-T]\), a measure that is almost insensitive to data scaling, namely:

$$\begin{aligned} {\text{NMSE}} = \frac{100}{\#[S-T]} \sum \limits _{\varvec{x}_i,t_i \in [S - T]} \left( \frac{N(\varvec{x}_i; \widehat{\theta })-t_i}{t_i} \right) ^2 \end{aligned}$$
(18)
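In code, Eq. (18) reduces to a one-line helper (assuming no target value is exactly zero):

```python
import numpy as np

def nmse(predictions, targets):
    """Normalized mean squared error of Eq. (18), in percent."""
    predictions, targets = np.asarray(predictions), np.asarray(targets)
    return 100.0 * np.mean(((predictions - targets) / targets) ** 2)
```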

The experimental setup for the simulated datasets has been detailed in an earlier publication [3].

Two noise levels were considered for generating the simulated training sets: medium (signal-to-noise ratio of \(-5\ dB\)) and large (\(-10\ dB\)). For each noise level, 50 independent runs were performed, and the corresponding NMSE mean and standard deviation are reported. For the FWNN, we have used throughout the following polynomial degrees:

  • \(L_w=5\), for the polynomial contained in w(s)

  • \(L_{\mu }=1\), for each of the \(\varvec{\mu }(s)\) polynomials

  • \(L_{\sigma }=1\), for the \(\sigma (s)\) polynomial

As a consequence of the above settings, the total number of the FWNN adjustable parameters equals \(2n+8\).

The results are listed in Tables 1 and 2 for the 1d and 2d datasets, respectively. Notice that for the MLP and RBF networks, as well as for the Gaussian process, only the results corresponding to the best performing case are listed. By inspection, the FWNN’s generalization is superior, especially for large noise levels. This advantage becomes even more pronounced in the 2d case. While the FWNN employs only ten and twelve parameters for the 1d and 2d datasets, the MLP and RBF networks require a significantly larger number, in the ranges \([31, 301]\) and \([41, 401]\), respectively, in order to achieve a comparable test error. For these datasets, a plethora of experiments and related results may be found in [3].

Table 1 Comparison of the NMSE mean over the test set, resulting from 50 independent experiments, for the 1d datasets related to Fig. 1a–d
Table 2 Comparison of the NMSE mean over the test set, resulting from 50 independent experiments, for the 2d datasets related to Fig. 2a–c

Additional experiments were conducted in order to study the generalization performance of the FWNN as a function of the number of network parameters. We have examined a limited number of cases; hence, our results are only indicative, not conclusive. In doing so, we have retained first-degree polynomials for both \(\varvec{\mu }(s)\) and \(\sigma (s)\) and varied only the degree of the polynomial in w(s). Accordingly, for the MLP and RBF networks, we have varied the number of hidden nodes. Again 50 independent experiments were performed for each case, and the corresponding NMSE mean was calculated. We have selected two artificial datasets, generated by the functions plotted in Figs. 1b and 2b. We have observed that for the FWNN, the dependence of NMSE on the number of parameters was significantly weaker.

4.2 Extrapolation in one dimension

Consider a 1d dataset with points \(x_1,x_2,\ldots , x_M\) arranged in ascending order and corresponding targets \(y_1,y_2,\ldots ,y_M\). Let \(N(x,\theta )\) be a network trained over this set. Estimating the target value \(Y=N(X,\theta )\) at a point \(X \in (x_j,x_{j+1})\) is called interpolation, while estimating it at a point \(X \notin [x_1,x_M]\) is called extrapolation. It has been argued in [21] that artificial neural networks extrapolate rather poorly. To study the extrapolation potential of the FWNN, the first two test functions of Fig. 1 have been employed, namely:

$$\begin{aligned} f(x)&=2x^2 + \exp (\pi /x)\sin (2 \pi x) \ \hbox { and } \\ f(x)&=x\sin (x)\cos (x) , \end{aligned}$$

for generating two datasets, each with 150 equidistant data points. The first 100 points were used for training, while the remaining 50 points labeled as \(z_1,\ldots , z_{50}\) were used for evaluating the quality of extrapolation. We base the assessment for the extrapolation capability on the relative deviation at an extrapolation point defined by:

$$\begin{aligned} r_i \equiv \frac{|f(z_i) - N(z_i,\widehat{\theta })|}{\max \{1,|f(z_i)|\}} \end{aligned}$$
(19)

By imposing an upper bound \(r_b\), for the acceptable relative deviation, we determine J, the number of consecutive extrapolation points satisfying:

$$\begin{aligned} r_i<r_b,\ \forall \ i \le J \hbox { and }r_{J+1}>r_b \end{aligned}$$
(20)

Given a value for the upper bound \(r_b\), inside a reasonable range \(r_b \in [0,0.25]\), the best method for extrapolation is the one with the highest value of J.
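A direct transcription of Eqs. (19) and (20) into code is given below; the argument names are hypothetical.

```python
import numpy as np

def extrapolation_index(f_true, f_pred, r_b):
    """Number J of consecutive extrapolation points whose relative
    deviation r_i, Eq. (19), stays below the bound r_b, as in Eq. (20)."""
    f_true, f_pred = np.asarray(f_true), np.asarray(f_pred)
    r = np.abs(f_true - f_pred) / np.maximum(1.0, np.abs(f_true))
    J = 0
    for r_i in r:                      # count until the first violation
        if r_i < r_b:
            J += 1
        else:
            break
    return J
```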

Table 3 contains the extrapolation results for three values of the upper bound, \(r_b=\{0.05, 0.15, 0.25\}\). In particular, we show the mean values of the J-index that have resulted from 50 independent experiments. By inspection, it is clear that the FWNN outperforms the rival MLP and RBF networks, as well as the Gaussian processes. Further details and extrapolation experiments have been presented earlier in [3].

Table 3 Comparison of the extrapolation index J, for the two datasets related to Fig. 1a, b

4.3 Experiments with real-world benchmarks

Additional experiments were performed on a variety of established benchmarks.

4.3.1 Experiments with UCI datasets

We have selected nine benchmarks from the UCI Machine Learning Repository, which are briefly described in Table 4. Note that the last two datasets (pima, wine) are benchmarks used primarily for evaluating classification methods and contain data belonging to two and seven classes, respectively.

Table 4 Summary of the selected real-world datasets from the UCI repository

For each dataset and network architecture, 50 experiments were carried out. For these experiments, we have used fifth-degree polynomials (\(L_w=L_\mu =L_\sigma =5\)), corresponding to \(6(n+2)\) model parameters. For the MLP, RBF and GPs, we have experimented with a host of different architectural and regularization parameters, and in Table 5 we quote, for each of them, the best performing configuration. Observing these results, we note that the FWNN outperforms all competitors in five (out of nine) datasets and shares the top position with the GPs in another one. The MLP is top in one dataset and is tied at the top with the GPs in another. The GPs are at the top in one dataset, while the RBF failed to take the top position in any of the UCI datasets.

Table 5 Comparison of the NMSE mean over the test set, resulting from 50 independent experiments, for the nine UCI datasets

Since the pima and wine datasets are classification benchmarks, the classification capability of FWNN has been tested. For this purpose, the classification accuracy is calculated as the percentage of the correctly classified test points within a tolerance (see [5]). The results are presented in Table 6 for four different tolerance values, namely: \(\eta =0.10,0.25, 0.5\) and 1.0. In these experiments, FWNN together with GPs performs better than both the MLP and RBF networks. It is interesting to note the remarkable classification accuracy of the FWNN, particularly for the low tolerance value of \(\eta =0.10\).

Table 6 Classification accuracy for several tolerance values
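The tolerance-based accuracy can be computed as sketched below; this is our reading of the criterion of [5], where a test point counts as correctly classified when the network output deviates from the target class label by less than \(\eta\). The precise definition is given in [5].

```python
import numpy as np

def tolerance_accuracy(outputs, labels, eta):
    """Percentage of test points whose network output lies within a
    tolerance eta of the target class label (our reading of [5])."""
    outputs, labels = np.asarray(outputs), np.asarray(labels)
    return 100.0 * np.mean(np.abs(outputs - labels) < eta)
```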

4.3.2 Large-scale experiments

To further test the approximation quality of the FWNN, experiments on extensively studied, complex, large datasets were performed. The datasets are summarized in Table 7. The SARCOS dataset is a real-world robotics benchmark [28], representing the inverse dynamics of a seven-joint robot arm performing rhythmic motions. The task is to map a 21-dimensional input space (seven joint positions, seven joint velocities, seven joint accelerations) to the corresponding seven joint torques.

Table 7 Summary of the datasets used in our large-scale experiments

The training in this case was performed using fifth-degree polynomials (\(L_w=L_{\mu }=L_{\sigma }=5\)), corresponding to a total of \(138 \ (=6n+12)\) parameters. The FWNN results, along with results published by different authors using GPs, are listed in Table 8 and compare favorably. In Fig. 3, the predicted values are plotted against the actual ones for all seven DOFs, making the model’s quality apparent: all points are scattered symmetrically around, and close to, the diagonal \(x=y\) line that represents the perfect match.

Table 8 Mean and normalized mean squared errors for the SARCOS dataset
Fig. 3 Plots of the predicted (y-axes) versus the actual (x-axes) values of the 4484 test cases, for each of the seven DOFs in the SARCOS dataset. The diagonal line (thin) denotes the perfect match

For the remaining datasets (Elevators, Kin40k, Pole Telecomm, Pumadyn32-nm), the FWNN results are listed in Table 9 along with results provided by a state-of-the-art Gaussian process approach reported in [19]. Despite its simplicity, the FWNN’s performance is better than or similar to that of a sophisticated, computationally demanding, state-of-the-art method.

Table 9 Comparison of the results (NMSE criterion) obtained with the proposed FWNN and those published in the literature

5 Discussion and conclusions

In the present article, we have proposed a new type of neural network, the FWNN, in which the weights are functions of a continuous variable. This may be interpreted as a neural network with an infinite number of hidden nodes. In the conducted numerical experiments, the FWNN surpassed the MLP and RBF networks, as well as the Gaussian processes, in generalization performance. This is evidence of robustness, reliability and modeling potential.

The FWNN has a number of interesting properties. There is ample experimental evidence that the generalization performance is superior. This may be related to the fact that the number of required parameters is limited, which in turn prevents serious overtraining.

The positions of the Gaussian centers are determined by \(\varvec{\mu }(s)\) and the corresponding widths by \(\sigma (s)\), with \(s \in [-1, 1]\). For the simulated datasets, we have used an affine form; hence, the \(\varvec{\mu }(s)\) curve is a straight-line segment joining the two end points \(\varvec{\mu }(-1)\) and \(\varvec{\mu }(+1)\) in \(R^n\), while the widths increase or decrease linearly with s, depending on the sign of \(\sigma _1\). Although this might seem to be a severe constraint, it has not degraded the network’s performance. We credit this to the infinite number of nodes, which renders the approximation of any function feasible [6, 10]. For the real benchmarks, the affine model imposes an overly strict constraint, and it was therefore replaced by a higher-order polynomial, at the expense of a few extra parameters. The Gaussian centers may then lie on a parabolic or cubic locus, and the widths acquire higher adaptability.

The attractive features of the proposed FWNN may be briefly summarized as:

  1. Frugal model, incorporating a small number of adjustable parameters.

  2. Resistant to overtraining.

  3. Superior interpolation and extrapolation performance.

We consider that some issues need further investigation and will become part of our future research effort. In particular,

  • The model behavior when using different density functions.

  • The effect caused by choosing different functional forms for the weights.

  • The difference in using other than Gaussian kernels.

  • The possibility of extending the shallow architecture to a deep one.

Furthermore, we would like to assess the effectiveness of the FWNN in complex problems, such as solving partial and ordinary differential equations [15,16,17], modeling interatomic potentials [26] and forecasting time series [34]. This work is currently underway.