1 Introduction

Artificial neural networks (ANNs) have proved to be valuable tools in a host of different applications, such as function approximation and data fitting [2], the solution of ordinary and partial differential equations [15,16,17], time-series prediction for the stock market [34], pattern recognition [2, 29], classification [35] and clustering [7]. ANNs are flexible modeling functions known for their excellent approximation capabilities [6, 10,11,12, 14] and have been termed “Universal Approximators.” ANNs may be designed according to various architectures, the main structural elements being the number of hidden layers, the number of neurons and the type of activation functions. Deep neural networks (DNNs) are ANNs with multiple hidden layers and can model complex mappings between the input and output layers.

ANNs suffer from the issue of overfitting, i.e., producing a model that is very accurate on a subset of the data while failing to account for the rest. In DNNs, the overfitting issue is even more pronounced, because the extra layers make it easier to fit outliers. Several techniques, known collectively under the name “Regularization Methods,” have been developed to combat overfitting. Examples are node pruning [31], weight decay (or \(L_2\) regularization) [1], weight bounding [20], sparsity (or \(L_1\) regularization) and, more recently, the “dropout” technique [32], which may be roughly described as random pruning, determinantal point processes (DPPs) [22] and approximate empirical Bayes methods [37]. ANNs are trained using a so-called training set, and their performance is evaluated using a “test set.” Networks that perform well on the test set are said to generalize. An overfit/overtrained network obviously does not generalize and therefore cannot be trusted for further use.

In the present article, we introduce a new type of ANN, the “functionally weighted neural network” (FWNN). Single-hidden-layer ANNs may be expressed as a linear combination of a number of parametric basis functions. Common forms are based on the logistic and Gaussian activation functions, namely:

$$\begin{aligned} N_l(\varvec{x};\;\theta )&=w_0+\sum _{k=1}^{K}\frac{w_k}{1+\exp \left( -(c_k^T\varvec{x}+b_k)\right) } \quad \hbox {(Logistic MLP)} \end{aligned}$$
(1)
$$\begin{aligned} N_G(\varvec{x};\;\theta )&=w_0+\sum _{k=1}^{K}w_k\exp \left( -\frac{1}{2}\left| \frac{\varvec{x}-\varvec{\mu }_k}{\sigma _k}\right| ^2\right) \quad \hbox {(Gaussian RBF)} \end{aligned}$$
(2)

where \(\theta\), in both cases, stands collectively for the adjustable parameters and K is the number of neural nodes.

Our proposal introduces a neural network that employs a continuous nodal distribution \(\rho (s),\) instead of a countable set of discrete nodes. The corresponding functionally weighted expressions for logistic and Gaussian activation functions may be cast as:

$$\begin{aligned} N_{Fl}(\varvec{x};\theta )&=\int \frac{w(s)}{1+\exp \left( -(c(s)^T\varvec{x}+b(s))\right) }\rho (s){\rm d}s \end{aligned}$$
(3)
$$\begin{aligned} N_{FG}(\varvec{x};\theta )&= \int w(s)\exp \left( -\frac{1}{2}\left| \frac{\varvec{x}-\varvec{\mu }(s)}{\sigma (s)}\right| ^2\right) \rho (s){\rm d}s \end{aligned}$$
(4)

Preliminary results assessing the performance of FWNNs have been reported earlier [3] and have been presented at the Sofianos-2017 international symposium. The substitution of discrete weights by continuous functions has also been considered in [30], where, however, the activation is restricted to be an odd function, and the weights are either piecewise constant or piecewise affine functions. Polynomials were not considered there, because the integrals involved cannot be expressed in closed analytic form. To the best of our knowledge, this work has not been followed up.

In Sect. 2, we introduce the proposed neural network with continuous weight functions, by relating it to an ordinary radial basis function (RBF) network and presenting the transition to the continuum. Section 3 gives technical details about the numerical quadrature, the training optimization methods and the software platforms used. In Sect. 4, we report the results of numerical experiments conducted on simulated homemade datasets as well as on established benchmarks from the literature. Finally, in Sect. 5, we summarize the strengths of the method and pose a few questions that may become the subject of future research.

2 Neural networks with an infinite number of hidden units

Radial basis functions are known to be suitable for function approximation and multivariate interpolation [4, 27]. Assuming an n-dimensional input space, \(\varvec{x} \in R^n\), an RBF neural network consisting of K Gaussian nodes with parameters \(\varvec{\mu }_k \in R^n\) and \(\sigma _k \in R\) is given by Eq. (2).

The set \(\theta =\{ w_0,\left( w_k,\varvec{\mu }_k,\sigma _k \right) _{k=1}^K \}\) denotes collectively the network parameters to be determined via the training procedure. The total number of adjustable parameters is given by the expression

$$\begin{aligned} N_{var}^{RBF}= K(2+n) + 1 \end{aligned}$$
(5)

which grows linearly with the number of network nodes. Consider a dataset \(S=\{\varvec{x}_{i},t_{i}\}\), where \(t_i\) is the desired output (target) for the corresponding input \(\varvec{x}_i\). Let also \(T \subset S\) be a subset of S with cardinality \(\#T\). The approximating RBF network is then determined by minimizing the mean squared deviation over T:

$$\begin{aligned} E_{[T]}(\theta ) {\mathop {=}\limits ^{\text {\tiny def}}}\frac{1}{\#T} \sum _{\varvec{x}_i,t_i \in T} \left( N_{G}(\varvec{x}_i; \theta )-t_i \right) ^2 \end{aligned}$$
(6)

Let \(\widehat{\theta }= \lbrace \widehat{w}_0, \left( \widehat{w}_k,\widehat{\varvec{\mu }}_k,\widehat{\sigma }_k \right) _{k=1}^K\rbrace\) be the minimizer of \(E_{[T]}(\theta )\), i.e.,

$$\begin{aligned} \widehat{\theta } = \arg \min _{\theta } \lbrace E_{[T]}(\theta ) \rbrace . \end{aligned}$$
(7)

The network’s generalization performance is measured by the mean squared deviation, \(E_{[ S-T ]}(\widehat{\theta })\), over the relative complement set \(S-T\). In the neural network literature, T is usually referred to as the “training” set, while \(S-T\) as the “test” set. A well-studied issue is the proper choice for K, which denotes the number of nodes in the neural network architecture.
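For concreteness, the sketch below evaluates the Gaussian RBF network of Eq. (2) and the training error of Eq. (6) with numpy; the function and variable names are purely illustrative and not part of any specific library.

```python
import numpy as np

def rbf_output(x, w0, w, mu, sigma):
    """Gaussian RBF network of Eq. (2): x is an n-vector, w has shape (K,),
    mu has shape (K, n) and sigma has shape (K,)."""
    d2 = np.sum((x - mu) ** 2, axis=1)                # squared distances |x - mu_k|^2
    return w0 + np.sum(w * np.exp(-0.5 * d2 / sigma ** 2))

def training_error(X, t, w0, w, mu, sigma):
    """Mean squared deviation of Eq. (6) over the training pairs (X, t)."""
    preds = np.array([rbf_output(x, w0, w, mu, sigma) for x in X])
    return np.mean((preds - t) ** 2)
```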

The training “error” \(E_{[T]}(\widehat{\theta })\) is a monotonically decreasing function of K, while the test “error” \(E_{[S-T]}(\widehat{\theta })\) is not. Hence, we may encounter a situation where adding nodes, in an effort to reduce the training error, results in an increase of the test error, thereby spoiling the network’s generalization ability. This behavior is known as “overfitting” or “overtraining” and is clearly undesirable. An early analysis of this phenomenon, under the name “bias–variance dilemma,” may be found in [9]. Overfitting is a serious problem, and considerable research effort has been invested in ways to mitigate it, leading to the development of several techniques such as model selection, cross-validation, early stopping, regularization and weight pruning [2, 9, 13, 24, 25].

2.1 Functionally weighted neural network

We define the “functionally weighted neural network” (FWNN) to be the limit of the conventional ANN, as the number of nodes \(K \rightarrow \infty\). The set of discrete nodes indexed by an integer (k) is replaced by a nodal distribution \(\rho (s)\) that depends on a continuous variable (s). The FWNN may then be cast, in correspondence with Eq. (2), as:

$$\begin{aligned} N_{FG}(\varvec{x};\theta )= \int _{-1}^{1}{\rm d}s \ \rho (s) \ \tilde{w}(s) \exp \left( -\frac{ |\varvec{x}-\varvec{\mu }(s)|^2 }{2 \sigma ^2 (s) }\right) , \end{aligned}$$
(8)

by applying the following transitions:

$$\begin{aligned} w_k&\longrightarrow \tilde{w}(s) \end{aligned}$$
(9a)
$$\begin{aligned} \varvec{\mu }_k&\longrightarrow \varvec{\mu }(s) \end{aligned}$$
(9b)
$$\begin{aligned} \sigma _{k}&\longrightarrow \sigma (s)\end{aligned}$$
(9c)
$$\begin{aligned} \sum _{k=1}^{K}&\longrightarrow \int _{-1}^{1} {\rm d}s \ \rho (s) \end{aligned}$$
(9d)

The density function \(\rho (s)\) should lead to an infinite number of nodes, i.e.

$$\begin{aligned} \int _{-1}^{+1} \rho (s) \ {\rm d}s \rightarrow \infty . \end{aligned}$$
(10)

For the density function, we have chosen the following form that satisfies (10):

$$\begin{aligned} \rho (s)=\frac{1}{1-s^2} \end{aligned}$$
(11)

The weight functions \(\tilde{w}(s),\varvec{\mu }(s) \hbox { and } \sigma (s)\) are parametrized, and these parameters are collectively denoted by \(\theta\). In this article, we have examined the following functional forms:

$$\begin{aligned} {\tilde{w}}(s)&\equiv \sqrt{1-s^2} w(s) = \sqrt{1-s^2} \sum _{j=0}^{L_w} w_j s^j \end{aligned}$$
(12a)
$$\begin{aligned} \varvec{\mu }(s)&= \sum _{j=0}^{L_{\mu }}\varvec{\mu }_j s^j \end{aligned}$$
(12b)
$$\begin{aligned} \sigma (s)&= \sum _{j=0}^{L_{\sigma }}\sigma _j s^j \end{aligned}$$
(12c)

Note that \(\varvec{\mu }(s)\) and \(\varvec{\mu }_j=\left( \mu _{jl}\right) _{l=1}^n \ , \ j=0,\ldots ,L_{\mu }\) are vectors in \(R^n\).

The set of adjustable parameters is then represented by:

$$\begin{aligned} \theta =\lbrace \left( w_j \right) _{j=0}^{L_w}, \ \left( \mu _{jl} \right) _{j=0, l=1}^{\ L_{\mu } , \ n }, \ \left( \sigma _j \right) _{j=0}^{L_{\sigma }} \rbrace \end{aligned}$$
(13)

with a total parameter number given by:

$$\begin{aligned} N_{var}^{FW} = (1+L_w)+ n( L_{\mu }+1) + (L_{\sigma }+1) = L_w+nL_{\mu }+L_{\sigma }+n+2 \end{aligned}$$
(14)
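As a concrete check of Eq. (14), consider the settings used for the simulated datasets of Sect. 4.1, namely \(L_w=5\) and \(L_{\mu }=L_{\sigma }=1\): Eq. (14) then gives \(N_{var}^{FW}=(1+5)+n(1+1)+(1+1)=2n+8\), i.e., ten adjustable parameters for a one-dimensional input and twelve for a two-dimensional one.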

The “cost” function \(C(\theta )\) is formed by adding a regularization term \(R(\theta )\) to the mean squared deviation of Eq. (6),

$$\begin{aligned} C(\theta ) \;{\mathop {=}\limits ^{\text {\tiny def}}} \;E_{[T]}(\theta ) + R(\theta ) \end{aligned}$$
(15)

\(C(\theta )\) serves as the objective function for the optimization task, and from now on we redefine \(\widehat{\theta }\) as \(\widehat{\theta }= \arg \min _{\theta } \lbrace C(\theta ) \rbrace\). For the regularization term \(R(\theta )\), the squared Euclidean (\(L_2\)) norm of the parameter vector, multiplied by a penalty factor, has been adopted.
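Written out, this choice takes the form \(R(\theta ) = \alpha \Vert \theta \Vert _2^2\), where we write \(\alpha\) for the penalty factor.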

3 Technical details

In this section, we present the numerical methods used in our calculations. Namely, we describe the employed integration technique, the optimization procedure, and we also refer to the relevant software.

Substituting the nodal density from Eq. (11) in Eq. (8) and using Eq. (12a), the FWNN may be rewritten as:

$$\begin{aligned} N_{FG}(\varvec{x};\theta )= \int _{-1}^{1}\frac{{\rm d}s}{\sqrt{1-s^2}}w(s) \exp \left( -\frac{ |\varvec{x}-\varvec{\mu }(s)|^2 }{2 \sigma ^2 (s) }\right) . \end{aligned}$$
(16)

3.1 Approximating integrals

Integrals were approximated by Gauss–Chebyshev quadrature:

$$\begin{aligned} \int _{-1}^{1}\frac{{\rm d}s}{\sqrt{1-s^2}} g(s) \approx \frac{\pi }{M}\sum _{i=1}^{M} g(s_i) , \end{aligned}$$
(17)

where

$$\begin{aligned} s_i=\cos \left( \frac{2i-1}{2M}\pi \right) . \end{aligned}$$

The above explains our choice for the functional form of \({\tilde{w}}(s)\) in Eq. (12a). In our experiments, we have used \(M=100\). The number of integration points has been increased up to \({M=200}\), without noticing any appreciable difference.
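To make the construction explicit, the following Python sketch combines the polynomial weight functions of Eqs. (12a)–(12c) with the quadrature rule of Eq. (17) to evaluate Eq. (16); it is a minimal illustration with names of our choosing, not the code used in the experiments.

```python
import numpy as np

def fwnn_output(x, w_coef, mu_coef, sigma_coef, M=100):
    """Evaluate the FWNN of Eq. (16) at an n-dimensional input x with
    M-point Gauss-Chebyshev quadrature, Eq. (17).

    w_coef     : (L_w + 1,)      coefficients of w(s), Eq. (12a)
    mu_coef    : (L_mu + 1, n)   coefficients of mu(s), Eq. (12b)
    sigma_coef : (L_sigma + 1,)  coefficients of sigma(s), Eq. (12c)
    """
    i = np.arange(1, M + 1)
    s = np.cos((2 * i - 1) * np.pi / (2 * M))                        # Chebyshev nodes s_i

    w = (s[:, None] ** np.arange(len(w_coef))) @ w_coef              # w(s_i)
    mu = (s[:, None] ** np.arange(mu_coef.shape[0])) @ mu_coef       # centres mu(s_i), shape (M, n)
    sigma = (s[:, None] ** np.arange(len(sigma_coef))) @ sigma_coef  # widths sigma(s_i)

    g = w * np.exp(-0.5 * np.sum((x - mu) ** 2, axis=1) / sigma ** 2)  # integrand of Eq. (16)
    return (np.pi / M) * np.sum(g)                                   # Gauss-Chebyshev sum
```

As a usage example, `fwnn_output(np.zeros(2), np.ones(6), np.zeros((2, 2)), np.array([1.0, 0.0]))` evaluates a 2d-input network (with \(L_w=5\), \(L_\mu =L_\sigma =1\)) at the origin.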

3.2 Learning procedure and software platforms

Determination of the FWNN parameters is accomplished by minimizing the cost function given in Eq. (15). Since objectives of this kind are known to be multimodal, global optimization should be considered. We have employed a simple stochastic global optimization technique known as “Multistart” [33]. This is a two-phase method, consisting of an exploratory global phase and a subsequent local minimum-seeking phase.

In Multistart, a point \(\theta\) is sampled uniformly from within the feasible region, \(\theta \in S\), and a local search \(\mathcal L\) is started from it, leading to a local minimum \(\widehat{\theta } =\mathcal L(\theta )\). If \(\widehat{\theta }\) is a minimum found for the first time, it is stored; otherwise, it is rejected. The cycle goes on until a stopping rule [18] instructs termination. An algorithmic presentation of Multistart is given below:

Simple Multistart Algorithm

  1. Initialize: Set \(k=1\), sample \(\theta \in S\) and set \(\widehat{\theta }_k=\mathcal L(\theta )\).

  2. If a termination rule applies, set \(\widehat{\theta }=\widehat{\theta }_m\) and stop (m is the index with the property \(\displaystyle C(\widehat{\theta }_m)= \min _i\{C(\widehat{\theta }_i)\}\)).

  3. Main iteration: Sample \(\theta \in S\) and compute \(\widehat{\theta }=\mathcal L(\theta )\). If \(\widehat{\theta } \notin \{\widehat{\theta }_1,\widehat{\theta }_2,\dots ,\widehat{\theta }_k\}\), set \(k \leftarrow k+1\) and \(\widehat{\theta }_k \leftarrow \widehat{\theta }\).

  4. Repeat from step 2.

The computer code was written in Python. For the local phase, we have relied on the quasi-Newton BFGS method with a line search satisfying the weak Wolfe–Powell conditions, as implemented in Python's scipy.optimize library.
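A compact sketch of the Multistart procedure outlined above, using scipy's BFGS routine as the local search \(\mathcal L\), is given below; the sampling box, the distinctness tolerance and the fixed start budget (standing in for the stopping rule of [18]) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def multistart(cost, dim, bounds=(-5.0, 5.0), n_starts=50, tol=1e-6, rng=None):
    """Simple Multistart: repeatedly sample a uniform starting point, run a
    BFGS local search and keep distinct minima; return the best minimizer."""
    rng = np.random.default_rng(rng)
    minima = []                                   # list of (theta_hat, C(theta_hat))
    for _ in range(n_starts):                     # fixed budget stands in for a stopping rule
        theta0 = rng.uniform(bounds[0], bounds[1], size=dim)
        res = minimize(cost, theta0, method="BFGS")
        # store the local minimum only if it has not been found before
        if not any(np.linalg.norm(res.x - m[0]) < tol for m in minima):
            minima.append((res.x, res.fun))
    return min(minima, key=lambda m: m[1])[0]     # theta_hat with the lowest cost
```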

4 Numerical experiments, comparative analysis and extrapolation performance

A series of numerical experiments was devised for testing the performance of the proposed FWNN, by comparing its outcomes against those obtained by a number of established alternatives. We have considered both homemade simulated datasets and benchmarks that are widely used in the relevant scientific literature.

In our experiments, we have compared the FWNN with MLP and RBF networks, as well as with Gaussian processes (GPs). For the neural networks, a host of architectural configurations, created by varying the number of hidden nodes \(K \in [5, 100]\), has been considered. MLPs were trained by the “Limited Memory BFGS” (L-BFGS) method, which has modest memory requirements and has proved to be quite efficient. For the RBFs, the exponential parameters were determined by K-means clustering, while the amplitudes were determined by linear regression. For the Gaussian processes, we have considered RBF kernels with automatic determination of the scalar kernel parameter in the range \([10^{-5}, 10^5]\). In all cases (MLP, RBF, GPs), the following values of the regularization parameter have been used: \(\alpha = \{10^{-10}, 10^{-5}, 10^{-3}, 10^{-2}, 10^{-1}, 1.0, 10, 10^2, 10^{3}, 10^5\}\). We have noticed that in some cases the regularization parameter had a significant effect. For every experiment, only the best result of each approach is reported for comparison with the corresponding FWNN outcome. The reasons for including Gaussian processes in our experimental study are, first, their modeling potential and, second, the fact that certain neural networks become identical to a Gaussian process with a specific type of covariance function in the limit of infinitely many hidden units [25, 28]. Finally, we have used Python’s Scikit-learn library for the implementation of the above three regression methodologies.
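As an illustration of this setup, the snippet below shows how one such (K, α) combination can be assembled with Scikit-learn; the placeholder data, the particular values of K and α, and the width heuristic used for the RBF network are illustrative assumptions, not the exact settings of our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 2))              # placeholder training inputs
t_train = np.sin(X_train[:, 0]) * np.cos(X_train[:, 1])  # placeholder targets

K, alpha = 20, 1e-3                                      # one (nodes, regularization) pair

# Logistic MLP trained with L-BFGS and L2 penalty alpha
mlp = MLPRegressor(hidden_layer_sizes=(K,), activation="logistic",
                   solver="lbfgs", alpha=alpha, max_iter=5000).fit(X_train, t_train)

# Gaussian process with an RBF kernel; the length scale is optimized within [1e-5, 1e5]
gp = GaussianProcessRegressor(kernel=RBF(length_scale_bounds=(1e-5, 1e5)),
                              alpha=alpha).fit(X_train, t_train)

# RBF network: centres from K-means, a common width from the centre spread (heuristic),
# amplitudes from regularized linear regression on the Gaussian features
centres = KMeans(n_clusters=K, n_init=10).fit(X_train).cluster_centers_
width = np.mean(np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1))
phi = np.exp(-0.5 * np.linalg.norm(X_train[:, None, :] - centres[None, :, :], axis=-1) ** 2
             / width ** 2)
rbf_amplitudes = Ridge(alpha=alpha).fit(phi, t_train)
```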

4.1 Experiments with simulated datasets

Several datasets were constructed by evaluating selected test functions at predefined sets of equidistant points. Four test functions were employed for the 1d experiments and three for the 2d ones; their plots and formulas are depicted in Figs. 1a–d and 2a–c.

Fig. 1 Generating functions used for creating the 1d datasets. In each case, 100 training and 1000 testing points were used

Fig. 2 Generating functions used for creating the 2d datasets. In each case, 100 training and 1000 testing points were used

Each dataset was divided into a training set and a test set. The target values of the training sets have been deliberately “contaminated” by the addition of noise. The test sets, on the other hand, have been left “clean,” i.e., without noise, so that one can assess the capability of the tested methods to filter out the noise and recover the underlying function.

In our experiments, we compare the FWNN to logistic MLP and Gaussian RBF networks with “weight decay” (\(L_2\)) regularization. For the evaluation, we use the “Normalized Mean Squared Error” (NMSE) over the test set \([S-T]\), a measure that is almost insensitive to data scaling, namely:

$$\begin{aligned} {\text{NMSE}} = \frac{100}{\#[S-T]} \sum \limits _{\varvec{x}_i,t_i \in [S - T]} \left( \frac{N(\varvec{x}_i; \widehat{\theta })-t_i}{t_i} \right) ^2 \end{aligned}$$
(18)
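In code, Eq. (18) reduces to a one-line helper (assuming no target value is exactly zero):

```python
import numpy as np

def nmse(predictions, targets):
    """Normalized mean squared error of Eq. (18), in percent."""
    predictions, targets = np.asarray(predictions), np.asarray(targets)
    return 100.0 * np.mean(((predictions - targets) / targets) ** 2)
```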

The experimental setup for the simulated datasets has been detailed in an earlier publication [3].

Two noise levels were considered for generating the simulated training sets: medium (signal-to-noise ratio of \(-5\ dB\)) and large (\(-10\ dB\)). For each noise level, 50 independent runs were performed, and the corresponding NMSE mean and standard deviation are reported. For the FWNN, we have used throughout the following polynomial degrees:

  • \(L_w=5\), for the polynomial contained in w(s)

  • \(L_{\mu }=1\), for each of the \(\varvec{\mu }(s)\) polynomials

  • \(L_{\sigma }=1\), for the \(\sigma (s)\) polynomial

As a consequence of the above settings, the total number of the FWNN adjustable parameters equals \(2n+8\).

The results are listed in Tables 1 and 2 for the 1d and 2d datasets, respectively. Notice that for the MLP and RBF networks, as well as for the Gaussian process, only the results corresponding to the best performing case are listed. By inspection, the FWNN’s generalization is superior, especially for large noise levels. This advantage becomes even more pronounced in the 2d case. While the FWNN employs only ten and twelve parameters for the 1d and 2d datasets, the MLP and RBF networks require a significantly larger number, in the ranges \([31, 301]\) and \([41, 401]\), respectively, in order to achieve a comparable test error. For these datasets, a plethora of experiments and related results may be found in [3].

Table 1 Comparison of the NMSE mean over the test set, resulting from 50 independent experiments, for the 1d datasets related to Fig. 1a–d
Table 2 Comparison of the NMSE mean over the test set, resulting from 50 independent experiments, for the 2d datasets related to Fig. 2a–c

Additional experiments were conducted in order to study the generalization performance of the FWNN as a function of the number of network parameters. We have examined a limited number of cases; hence, our results are only indicative, not conclusive. In doing so, we have retained first-degree polynomials for both \(\varvec{\mu }(s)\) and \(\sigma (s)\) and varied only the degree of the polynomial in w(s). Accordingly, for the MLP and RBF networks, we have varied the number of hidden nodes. Again 50 independent experiments were performed for each case, and the corresponding NMSE mean was calculated. We have selected two artificial datasets, generated by the functions plotted in Figs. 1b and 2b. We have observed that for the FWNN, the dependence of NMSE on the number of parameters was significantly weaker.

4.2 Extrapolation in one dimension

Consider a 1d dataset with points \(x_1,x_2,\ldots , x_M\) arranged in ascending order and corresponding targets \(y_1,y_2,\ldots ,y_M\). Let \(N(x,\theta )\) be a network trained over this set. Estimating the target value \(Y=N(X,\theta )\) at a point \(X \in (x_j,x_{j+1})\) is called interpolation, while estimating it at a point \(X \notin [x_1,x_M]\) is called extrapolation. It has been argued in [21] that artificial neural networks extrapolate rather poorly. To study the extrapolation potential of the FWNN, the first two test functions of Fig. 1 have been employed, namely:

$$\begin{aligned} f(x)&=2x^2 + \exp (\pi /x)\sin (2 \pi x) \ \hbox { and } \\ f(x)&=x\sin (x)\cos (x) , \end{aligned}$$

for generating two datasets, each with 150 equidistant data points. The first 100 points were used for training, while the remaining 50 points labeled as \(z_1,\ldots , z_{50}\) were used for evaluating the quality of extrapolation. We base the assessment for the extrapolation capability on the relative deviation at an extrapolation point defined by:

$$\begin{aligned} r_i \equiv \frac{|f(z_i) - N(z_i,\widehat{\theta })|}{\max \{1,|f(z_i)|\}} \end{aligned}$$
(19)

By imposing an upper bound \(r_b\), for the acceptable relative deviation, we determine J, the number of consecutive extrapolation points satisfying:

$$\begin{aligned} r_i<r_b,\ \forall \ i \le J \hbox { and }r_{J+1}>r_b \end{aligned}$$
(20)

Given a value for the upper bound \(r_b\), inside a reasonable range \(r_b \in [0,0.25]\), the best method for extrapolation is the one with the highest value of J.
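A direct transcription of Eqs. (19) and (20) into code is given below; the argument names are hypothetical.

```python
import numpy as np

def extrapolation_index(f_true, f_pred, r_b):
    """Number J of consecutive extrapolation points whose relative
    deviation r_i, Eq. (19), stays below the bound r_b, as in Eq. (20)."""
    f_true, f_pred = np.asarray(f_true), np.asarray(f_pred)
    r = np.abs(f_true - f_pred) / np.maximum(1.0, np.abs(f_true))
    J = 0
    for r_i in r:                      # count until the first violation
        if r_i < r_b:
            J += 1
        else:
            break
    return J
```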

Table 3 contains the extrapolation results for three values of the upper bound, \(r_b=\{0.05, 0.15, 0.25\}\). In particular, we show the mean values of the J-index that have resulted from 50 independent experiments. By inspection, it is clear that the FWNN outperforms the rival MLP and RBF networks, as well as the Gaussian processes. Further details and extrapolation experiments have been presented earlier in [3].

Table 3 Comparison of the extrapolation index J, for the two datasets related to Fig. 1a, b

4.3 Experiments with real-world benchmarks

Additional experiments were performed on a variety of established benchmarks.

4.3.1 Experiments with UCI datasets

We have selected nine benchmarks from the UCI Machine Learning Repository, which are briefly described in Table 4. Note that the last two datasets (pima, wine) are benchmarks used primarily for evaluating classification methods and contain data belonging to two and seven classes, respectively.

Table 4 Summary of the selected real-world datasets from the UCI repository

For each dataset and network architecture, 50 experiments were carried out. For these experiments, we have used fifth-degree polynomials (\(L_w=L_\mu =L_\sigma =5\)), corresponding to \(6(n+2)\) model parameters. For the MLP, RBF and GPs, we have experimented with a host of different architectural and regularization parameters, and in Table 5 we quote, for each of them, the best performing configuration. Observing these results, we note that the FWNN outperforms all competitors in five (out of nine) datasets and shares the top position with the GPs in another one. The MLP is top in one dataset and is tied at the top with the GPs in another. The GPs are at the top in one dataset, while the RBF failed to take the top position in any of the UCI datasets.

Table 5 Comparison of the NMSE mean over the test set, resulting from 50 independent experiments, for the nine UCI datasets

Since the pima and wine datasets are classification benchmarks, the classification capability of FWNN has been tested. For this purpose, the classification accuracy is calculated as the percentage of the correctly classified test points within a tolerance (see [5]). The results are presented in Table 6 for four different tolerance values, namely: \(\eta =0.10,0.25, 0.5\) and 1.0. In these experiments, FWNN together with GPs performs better than both the MLP and RBF networks. It is interesting to note the remarkable classification accuracy of the FWNN, particularly for the low tolerance value of \(\eta =0.10\).

Table 6 Classification accuracy for several tolerance values
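The tolerance-based accuracy can be computed as sketched below; this is our reading of the criterion of [5], where a test point counts as correctly classified when the network output deviates from the target class label by less than \(\eta\). The precise definition is given in [5].

```python
import numpy as np

def tolerance_accuracy(outputs, labels, eta):
    """Percentage of test points whose network output lies within a
    tolerance eta of the target class label (our reading of [5])."""
    outputs, labels = np.asarray(outputs), np.asarray(labels)
    return 100.0 * np.mean(np.abs(outputs - labels) < eta)
```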

4.3.2 Large-scale experiments

To further test the approximation quality of the FWNN, experiments on extensively studied, complex, large datasets were performed. The datasets are summarized in Table 7. The SARCOS dataset is a real-world robotics benchmark [28], representing the inverse dynamics of a seven-joint robot arm performing rhythmic motions. The task is to map a 21-dimensional input space (seven joint positions, seven joint velocities, seven joint accelerations) to the corresponding seven joint torques.

Table 7 Summary of the datasets used in our large-scale experiments

The training in this case was performed using fifth-degree polynomials (\(L_w=L_{\mu }=L_{\sigma }=5\)), corresponding to a total of \(138 \ (=6n+12)\) parameters. The FWNN results, along with results published by different authors using GPs, are listed in Table 8 and compare favorably. In Fig. 3, the predicted values are plotted against the actual ones for all seven DOFs, making the model’s quality apparent: all points are scattered symmetrically around, and close to, the diagonal \(x=y\) line that represents the perfect match.

Table 8 Mean and normalized mean squared errors for the SARCOS dataset
Fig. 3 Plots of the predicted (y-axes) versus the actual (x-axes) values of the 4484 test cases, for each of the seven DOFs in the SARCOS dataset. The diagonal line (thin) denotes the perfect match

For the remaining datasets (Elevators, Kin40k, Pole Telecomm, Pumadyn32-nm), the FWNN results are listed in Table 9 along with results provided by a state-of-the-art Gaussian process approach reported in [19]. Despite its simplicity, the FWNN’s performance is better than or similar to that of a sophisticated, computationally demanding, state-of-the-art method.

Table 9 Comparison of the results (NMSE criterion) obtained with the proposed FWNN and those published in the literature

5 Discussion and conclusions

In the present article, we have proposed a new type of neural network, the FWNN, in which the weights are functions of a continuous variable. This may be interpreted as a neural network with an infinite number of hidden nodes. In the conducted numerical experiments, the FWNN surpassed the MLP and RBF networks, as well as the Gaussian processes, in generalization performance. This is evidence of robustness, reliability and modeling potential.

The FWNN has a number of interesting properties. There is ample experimental evidence that the generalization performance is superior. This may be related to the fact that the number of required parameters is limited, which in turn prevents serious overtraining.

The positions of the Gaussian centers are determined by \(\varvec{\mu }(s)\) and the corresponding widths by \(\sigma (s)\), with \(s \in [-1, 1]\). For the simulated datasets, we have used an affine form; hence, the \(\varvec{\mu }(s)\) curve is a straight-line segment joining the two end points \(\varvec{\mu }(-1)\) and \(\varvec{\mu }(+1)\) in \(R^n\), while the widths increase or decrease linearly with s, depending on the sign of \(\sigma _1\). Although this might seem to be a severe constraint, it has not degraded the network’s performance. We credit this to the infinite number of nodes, which renders the approximation of any function feasible [6, 10]. For the real benchmarks, the affine model imposes an overly strict constraint, and it was therefore replaced by a higher-order polynomial, at the expense of a few extra parameters. The Gaussian centers may then lie on a parabolic or cubic locus, and the widths acquire higher adaptability.

The attractive features of the proposed FWNN may be briefly summarized as:

  1. Frugal model, incorporating a small number of adjustable parameters.

  2. Resistant to overtraining.

  3. Superior interpolation and extrapolation performance.

We consider that some issues need further investigation and will become part of our future research effort. In particular,

  • The model behavior when using different density functions.

  • The effect caused by choosing different functional forms for the weights.

  • The difference in using other than Gaussian kernels.

  • The possibility of extending the shallow architecture to a deep one.

Furthermore, we would like to assess the effectiveness of the FWNN in complex problems, such as solving partial and ordinary differential equations [15,16,17], modeling interatomic potentials [26] and forecasting time series [34]. This work is currently underway.