1 Introduction

Although Artificial Intelligence (AI) has broadened its numerical methods and extended its fields of application (Shahiri Tabarestani and Afzalimehr 2021; Shaibani et al. 2021; Mohebbi Tafreshi et al. 2020; Kasiviswanathan and Sudheer 2017), empirical rigor has not kept pace with such advancements (Sculley et al. 2018; Bakas et al. 2022), with researchers questioning the accuracy of iterative algorithms (Hutson 2018a), as the results obtained for a certain problem are not always reproducible (Hutson 2018b; Belthangady and Royer 2019). In theory, Artificial Neural Networks (ANN) are capable of approximating any continuous function (Hassoun et al. 1995) but, beyond guaranteeing existence, the theory alone does not provide a universal approach for calculating an optimal set of ANN model parameters, also referred to as weights, and a variety of algorithmic implementations have been developed for this purpose (Li et al. 2016; Yang and Wu 2016; Lin et al. 2019). Along these lines, iterative optimization algorithms (Ruder 2016) are usually applied to reach an optimal set of ANN weights, which minimize the total error of the model estimates. Note, however, that apart from trivial cases rarely met in practice, the optimization problem has more than one local minimum, and its solution requires multiple iterations that significantly increase the computational load. To resolve this issue, enhanced optimization methods, such as stochastic gradient descent (Bottou 2010; Johnson and Zhang 2013), have been proposed. Another common issue in ANN applications is overfitting, which relates to the selection of a weighting scheme that approximates a given set of data well, while failing to generalize the accuracy of the predictions beyond the training set. To remedy overfitting, several methods have been proposed and effectively applied, such as dropout (Srivastava et al. 2014). Additional, and probably more important, concerns regarding the effective application of ANN algorithms are: (a) the arbitrary selection of the number of computational neurons, which may result in an unnecessary increase of the computational time, and (b) the optimization of the hyper-parameters of the selected ANN architecture (Bergstra et al. 2011; Bergstra and Bengio 2012; Feurer and Hutter 2019), which corresponds to solving an optimization problem whose objective function values are determined by the solution of another optimization problem, namely the calculation of the ANN weights for a given training set.

The purpose of this work is to develop a numerical scheme for the calculation of the optimal weights, the number of neurons, and other parameters of ANN algorithms, which relies on theoretical arguments, in our case the Universal Approximation Theorem, and, at the same time, is fast and precise. This has been attained without deviating from the classical ANN representation, by utilizing a novel numerical scheme: first dividing the studied data set into small neighborhoods and subsequently performing matrix manipulations for the calculation of the sought weights. Local approximation with the Heaviside activation function cannot be constructed in a Euclidean space of dimension higher than 1 (Chui et al. 1994); thus we propose a scheme with other sigmoid functions such as the logistic, tanh, etc. The numerical experiments exhibit high accuracy, attaining a remarkably low number of errors in the test set of well-known data sets such as the MNIST database for computer vision (LeCun et al. 2010), and of complex nonlinear functions for regression, while the computational time is kept short. Interestingly, the same algorithmic scheme may be applied to approximate the solution of Partial Differential Equations (PDEs), appearing in Physics, Engineering, Environmental Sciences, etc. The paper is organized as follows: In Sect. 2, we present the general formulation of the suggested method, hereafter referred to as ANNbN (Artificial Neural Networks by Neighborhoods). More precisely, the basic formulation of the ANNbN approach is progressively developed in Sects. 2.1.1 and 2.1.2. Section 2.2 extends the method to the case when radial basis functions are utilized, while Sects. 2.3 and 2.4 implement the method for the approximation of derivatives and the solution of PDEs, respectively. Section 2.5 transforms the original scheme to Deep Networks, and Sect. 2.6 to Ensembles of ANNs. The results of the conducted numerical experiments are presented and discussed in Sect. 3. Conclusions and future research directions are presented in Sect. 4. An open-source computer code written in the Julia (Bezanson et al. 2017) and Python (Python 2001-2021) programming languages is available at https://github.com/nbakas/ANNbN.jl.

2 Artificial neural networks by neighborhoods (ANNbN)

Let \({{x}_{ij}}\) denote the given data for \(j\in \left\{ 1,2,\ldots ,n \right\}\) input variables over \(i\in \left\{ 1,2,\ldots ,m \right\}\) observations, with corresponding responses \(y_i\). The Universal Approximation Theorem (Cybenko 1989; Tadeusiewicz 1995) ensures the existence of an integer N, such that

$$\begin{aligned} y_i\cong {} {{f}_{i}}({{x}_{i1}},{{x}_{i2}},\ldots ,{{x}_{in}})= \sum \limits _{k=1}^{N}{{{v}_{k}}}\sigma \left( \sum \limits _{j=1}^{n}{{{w}_{jk}}{{x}_{ij}}}+{{b}_{k}} \right) +{{b}_{0}}, \end{aligned}$$

with approximation errors \(\epsilon _i=y_i-f_i\) between the given responses \(y_i\) and the corresponding simulated values \(f_i\) that can be made arbitrarily small. N is the number of neurons, \({{w}_{jk}}\) and \({{b}_{k}}\) denote the local approximation weights and bias terms, respectively, of the linear summation conducted for each neuron k, and \({{v}_{k}},{{b}_{0}}\) correspond to the global approximation weights and bias term, respectively, of the linear summation over all neurons. \(\sigma\) is any sigmoid function, as presented below in Sect. 2.1.1.
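For concreteness, the following minimal sketch (in Python/NumPy, with hypothetical randomly generated weights; not the ANNbN training procedure itself, which is developed below) evaluates this single-hidden-layer representation for all observations at once.

```python
import numpy as np

def ann_forward(X, W, b, v, b0):
    """Evaluate f_i = sum_k v_k * sigma(sum_j w_jk * x_ij + b_k) + b0 for every row of X.

    X : (m, n) inputs, W : (n, N) local weights, b : (N,) biases,
    v : (N,) global weights, b0 : scalar global bias.
    """
    sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # logistic sigmoid
    H = sigma(X @ W + b)                         # (m, N) neuron outputs
    return H @ v + b0                            # (m,) approximations f_i

# toy usage with random (hypothetical) weights
rng = np.random.default_rng(0)
X = rng.random((5, 3))                                  # m = 5 observations, n = 3 features
W, b = rng.normal(size=(3, 4)), rng.normal(size=4)      # N = 4 neurons
v, b0 = rng.normal(size=4), 0.1
print(ann_forward(X, W, b, v, b0))
```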

The suggested ANNbN (Artificial Neural Networks by Neighborhoods) method is based on the segmentation of a given data set into smaller clusters of data, so that each cluster k is representative of the local neighborhood of the \(y_{ik}\) responses; subsequently, the weights \({{w}_{jk}}\) calculated for each cluster are used to derive the global weights v of the overall approximation. To determine the neighborhoods (i.e. the proximity clusters) of the response observations \(y_{i}\), we use the well-known k-means clustering algorithm (see e.g. MacQueen et al. 1967; Hartigan and Wong 1979) and k-means++ for the initial seed (Arthur and Vassilvitskii 2007). Any other clustering algorithm can be utilized as well, and by fixing the initial seed, the obtained results are always reproducible. It is worth mentioning that the method works well even without clustering the data; however, clustering increases the accuracy and is more compatible with a strict implementation of the Universal Approximation Theorem, as we present in Sect. 3.1. Clustering adds significant computational load, especially for large data sets; however, as presented in Table 1 for the MNIST dataset, ANNbN yields competitive results even without clustering.

Table 1 Computer vision (MNIST)

2.1 Basic formulation for shallow networks

Figure 1 illustrates the calculation process for the ANNbN weights. Contrary to the regular ANN approach where all responses \(y_{i}\) are treated in a single step as a whole, the ANNbN method first splits the responses into proximity clusters, calculates the weights \(w_{jk}\) in each cluster k using the responses \(y_{ik}\) and corresponding input data \(x_{ijk}\), and subsequently uses the derived weights \({{w}_{jk}}\) for each cluster to calculate the global weights \(v_k, b_0\) of the overall approximation. The aforementioned two-step approach is detailed in Sects. 2.1.1 and 2.1.2 below.

Fig. 1

Illustration of the numerical procedure to calculate ANNbN local and global weights: Initial calculation of local weights \(w_{jk}\) for each neuron k (left panel), and subsequent calculation of the global weights \(v_k, b_0\) of the entire network (right panel)

2.1.1 Calculation of \(w_{jk}\) and \(b_k\) in the kth cluster

Let \(m_k\) be the number of observations in the kth cluster, with \(\sum _{k=1}^N m_k=m\), and \(\sigma\) the sigmoid function, which may be selected among a variety of sigmoids, such as \(\sigma (x)={\frac{1}{1+e^{-x}}}\), with the inverse sigmoid \(\sigma ^{-1}\) being \(\sigma ^{-1}(y)=\log \left( {\frac{y}{1-y}}\right)\).

By defining \([n] {:}{=}\{1,2, \ldots , n\}\) as the iterator over features, \([m] {:}{=}\{1,2, \ldots , m\}\) as the iterator over samples, \([m_k] {:}{=}\{1,2, \ldots , m_k\}\) as the local sample indices, and \({\textbf{X}}_k\) as the \(m_k \times n\) matrix

$$\begin{aligned} {\textbf{X}}_k {:}{=}(x_{ijk})_{i \in [m_k], j \in [n]}, \end{aligned}$$

\({\textbf{w}}_k {:}{=}\{w_{1,k},w_{2,k}, \ldots , w_{n,k},b_{k}\}^{T}\) the weights’ vector, with \(A^{T}\) denoting the transpose of matrix or vector A, and \({\textbf{y}}_k {:}{=}\{y_{1,k},y_{2,k}, \ldots , y_{m_k,k}\}^{T}\) the target values found in each cluster, we may write

$$\begin{aligned} \sigma \odot \Big (\big ({\textbf{X}}_k \Big |{\textbf{1}}\big ) \times {\textbf{w}}_k\Big ) = {\textbf{y}}_k, \end{aligned}$$

where \(\big ({\textbf{X}}_k \Big | {\textbf{1}}\big )\) denotes the matrix \({\textbf{X}}_k\) with a column of 1s appended. The symbol \(\odot\) implies the element-wise application of \(\sigma\) on \(\big ({\textbf{X}}_k \Big | {\textbf{1}}\big ) \times {\textbf{w}}_k\), and the symbol \(\times\) denotes the matrix product. By applying the inverse sigmoid function \(\sigma ^{-1}\) to both sides, we deduce that

$$\begin{aligned} \big ({\textbf{X}}_k \Big | {\textbf{1}}\big ) \times {\textbf{w}}_k = \sigma ^{-1} \odot ({\textbf{y}}_k ) \end{aligned}$$
(1)

Hence, because the dimensions of \({\textbf{X}}_k\) are small \((m_k\ll m)\), we may rapidly calculate the approximation weights \({\textbf{w}}_k\) (Fig. 1 left) in the kth cluster (corresponding to the kth neuron) by solving the linear system

$$\begin{aligned} \big ({\textbf{X}}_k \Big | {\textbf{1}}\big ) \times {\textbf{w}}_k= {\hat{{\textbf{y}}}}_k, \end{aligned}$$
(2)

with respect to \({\textbf{w}}_k\), with \({\hat{{\textbf{y}}}}_k {:}{=}\sigma ^{-1} \odot {\textbf{y}}_k\).

In cases with \(m_k = n+1\) and linearly independent columns of \({\textbf{X}}_k\), we may solve Eq. 2 with Gaussian Elimination, while when \({\textbf{X}}_k\) is not of full column rank, least squares or generalized inversion may be applied (Marlow 1993; Chapra et al. 2010).
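As a minimal sketch of this local step (in Python/NumPy; the responses are assumed to have been scaled into (0, 1), e.g. [0.1, 0.9] as in Sect. 3.1, so that the inverse logistic sigmoid is well defined):

```python
import numpy as np

def local_weights(Xk, yk):
    """Solve (X_k | 1) w_k = sigma^{-1}(y_k) for one cluster (Eq. 2).

    Xk : (m_k, n) cluster inputs, yk : (m_k,) responses in (0, 1).
    Returns w_k of length n + 1, whose last entry is the bias b_k.
    """
    A = np.hstack([Xk, np.ones((Xk.shape[0], 1))])   # append the column of 1s
    y_hat = np.log(yk / (1.0 - yk))                  # inverse logistic sigmoid
    # least squares covers both the square case m_k = n + 1 and rank-deficient X_k
    wk, *_ = np.linalg.lstsq(A, y_hat, rcond=None)
    return wk
```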

2.1.2 Calculation of \(v_k\) and \(b_0\) exploiting all the given observations

Following the computation of the weights \({\textbf{w}}_k\), for each neuron k in the hidden layer, and concatenating the weights’ vectors \({\textbf{w}}_k\), we obtain the matrix of the weights for all neurons N, \({\textbf{w}} {:}{=}[{\textbf{w}}_1 \, {\textbf{w}}_2 \, \cdots \, {\textbf{w}}_N]\). Let

$$\begin{aligned} {\hat{{\textbf{X}}}} {:}{=}\bigg [ \sigma \odot \Big (\big ({\textbf{X}} \Big | {\textbf{1}}\big )\times {\textbf{w}} \Big ) \bigg | {\textbf{1}}\bigg ], \end{aligned}$$
(3)

where \({\textbf{X}}\) corresponds to the entire sample, in contrast to the previous step, which utilized \({\textbf{X}}_k\) containing only the observations in cluster k. Hence, in order to compute the weights of the output layer \({\textbf{v}} {:}{=}\{v_1,v_2, \ldots , v_{N}, b_0 \}^{T}\), for all the neurons connected to the output layer, we solve the linear system

$$\begin{aligned} {\hat{{\textbf{X}}}} \times {\textbf{v}} = {\textbf{y}}, \end{aligned}$$
(4)

for \({\textbf{v}}\), where \({\textbf{y}}=\{y_1, y_2, \ldots , y_m\}^{T}\) is the entire vector of observations.

In the numerical experiments, the local approximation weight vectors \({\textbf{w}}_k\) are distinct, while the number of neurons is usually smaller than the number of observations (\(N<m\)); hence, one can obtain the entire representation of the ANNbN by solving:

$$\begin{aligned} {\textbf{v}}=({\hat{{\textbf{X}}}}^{T} {\hat{{\textbf{X}}}})^{-1} {\hat{{\textbf{X}}}}^{T}{\textbf{y}}, \end{aligned}$$
(5)

As with the linear systems in Eq. 2, if \({\hat{{\textbf{X}}}}\) is not of full column rank, we may use least squares or other solvers for dense, tall matrices. The ANNbN approximation scheme is concisely described in Algorithm 1 below.

Algorithm 1 The ANNbN approximation scheme
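The following sketch (in Python with NumPy and scikit-learn's KMeans; a simplified illustration under the above assumptions, not the reference ANNbN.jl implementation) puts the two steps together for a shallow network with the logistic sigmoid:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_annbn(X, y, N, seed=0):
    """Shallow ANNbN sketch: local weights per cluster (Eq. 2), then global weights (Eqs. 4-5).

    X : (m, n) inputs, y : (m,) responses scaled into (0, 1), N : number of neurons/clusters.
    """
    m, n = X.shape
    labels = KMeans(n_clusters=N, random_state=seed, n_init=10).fit_predict(X)
    W = np.zeros((n + 1, N))                     # column k holds w_k (last entry is b_k)
    for k in range(N):
        Xk, yk = X[labels == k], y[labels == k]
        A = np.hstack([Xk, np.ones((len(yk), 1))])
        W[:, k], *_ = np.linalg.lstsq(A, np.log(yk / (1.0 - yk)), rcond=None)
    H = 1.0 / (1.0 + np.exp(-np.hstack([X, np.ones((m, 1))]) @ W))   # Eq. 3, all samples
    Xhat = np.hstack([H, np.ones((m, 1))])
    v, *_ = np.linalg.lstsq(Xhat, y, rcond=None)                     # Eqs. 4-5
    return W, v

def predict_annbn(X, W, v):
    H = 1.0 / (1.0 + np.exp(-np.hstack([X, np.ones((len(X), 1))]) @ W))
    return np.hstack([H, np.ones((len(X), 1))]) @ v
```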

2.2 ANNbN with radial basis functions as kernels

In what follows, the method is further expanded by using Radial Basis Functions (RBFs) for the approximation, \(\varphi (r)\), which depend on the distances r among the observations, instead of their raw values (Fig. 2). The operation is conducted in the identified clusters of data, as per Sect. 2.1, instead of the entire sample. A variety of studies exist on the approximation efficiency of RBFs (Yiotis and Katsikadelis 2015; Babouskos and Katsikadelis 2015); however, they refer to noiseless data and to the entire sample, instead of neighborhoods. We should also distinguish this approach of RBFs implemented as ANNbN from Radial Basis Function Networks (Schwenker et al. 2001; Park and Sandberg 1991), with \(\varphi ({{\textbf{x}}})=\sum _{{i=1}}^{N}a_{i}\varphi (||{\textbf{x}}-{{\textbf{c}}}_{i}||)\), where the centers \({{\textbf{c}}}_{i}\) are the clusters' means instead of collocation points, N is the number of neurons, and the \(a_{i}\) are calculated by training instead of matrix manipulation. In the proposed formulation, the representation regards the pairwise distances \(r_{ijk}\) (Fig. 2) between the observations \({\textbf{x}}_{ik}=\{x_{i1k},x_{i2k},\ldots ,x_{ink}\}\), with \({i} \in \{1,2,\ldots ,m_k\}\), and \({\textbf{x}}_{jk}=\{x_{j1k},x_{j2k},\ldots ,x_{jnk}\}\), with \({j} \in \{1,2,\ldots ,m_k\}\), in cluster k with n features (dimensions). For each cluster \(k \in [N]\), we define

$$\begin{aligned} \pmb \varphi _k {:}{=}\big (\varphi (r_{ijk})\big )_{i \in [m_k], j \in [m_k]}, \end{aligned}$$

while, when using RBFs, the local weights of cluster k do not comprise a bias term; hence,

$$\begin{aligned} {\textbf{w}}_k {:}{=}\{w_{1,k},w_{2,k}, \ldots , w_{m_k,k}\}^{T}. \end{aligned}$$
Fig. 2

kth cluster of Radial ANNbN

In most cases matrix \(\pmb \varphi _k\) is invertible (see below for further elaboration), thus, one may approximate the responses in the kth cluster \(y_{ik}\), as

$$\begin{aligned} \pmb \varphi _k \times {\textbf{w}}_k = {\textbf{y}}_k, \end{aligned}$$
(6)

and compute \({\textbf{w}}_k\), by

$$\begin{aligned} {\textbf{w}}_k=\pmb \varphi ^{-1}_k {\textbf{y}}_k. \end{aligned}$$

The elements \(\varphi _{ijk}=\varphi (\Vert {\textbf{x}}_{jk}-\textbf{x}_{ik}\Vert )\) of the symmetric matrix \(\pmb \varphi _k\) denote the application of the function \(\varphi\) to the pairwise Euclidean distances (or norms) of the observations in the kth cluster. Note that the vector \({\textbf{w}}_{k}\) has length \(m_k\) for each cluster k, instead of \(n+1\) for the sigmoid approach. Afterwards, similar to the case of the sigmoid kernels in the previous section (Eqs. 4–5), and by using

$$\begin{aligned} {\textbf{w}}&{:}{=}[{\textbf{w}}_1 \, {\textbf{w}}_2 \, \cdots \, {\textbf{w}}_N],\\ \pmb \varphi&{:}{=}[\pmb \varphi _1 \, \pmb \varphi _2 \, \cdots \, \pmb \varphi _N], \end{aligned}$$

one can obtain the entire representation for all clusters, for the weights of the output layer \({\textbf{v}}\), by solving

$$\begin{aligned} \big [\pmb \varphi \times {\textbf{w}} \,\big |\, {\textbf{1}}\big ] \times {\textbf{v}} = {\textbf{y}}. \end{aligned}$$
(7)

The rows of matrices \(\pmb \varphi _1, \pmb \varphi _2, \ldots\) contain the observations of the entire sample, while the columns contain the collocation points found in each cluster. For calculation of the weights, we use \(\varphi _{ijk}=\varphi (\Vert {\textbf{x}}_{jk}-\textbf{x}_{ik}\Vert )\). After computing the \({\textbf{w}}_k\) and \({\textbf{v}}\), one may interpolate for any new \({\textbf{x}}\) (out-of-sample), using

$$\begin{aligned} \varphi _{jk}({\textbf{x}})=\varphi (\Vert {\textbf{x}}_{jk}-{\textbf{x}}\Vert ), \end{aligned}$$

where \({\textbf{x}}_{jk}\) are the RBF collocation points for the approximation, same as in Eq. 6. Hence, we may predict for out-of-sample observations by using

$$\begin{aligned} f({\textbf{x}})=\sum ^N_{k=1}\left( \sum ^{m_k}_{j=1} w_{jk} \varphi _{jk}({\textbf{x}})\right) v_k + b_0. \end{aligned}$$
(8)

As the weights \({\textbf{w}}_k\) are applied directly by multiplication, the calculation of the inverse function \(\varphi ^{-1}\) (corresponding to \(\sigma ^{-1}\) in Eq. 1) is not needed. This results in a convenient formulation for the approximation of derivatives, as well as for the solution of PDEs.

In case the matrix \(\pmb \varphi _k\) turns out to be singular [see the Mairhuber–Curtis theorem (Mairhuber 1956)], one should select an alternative kernel for the data under consideration, in order to obtain an invertible \(\pmb \varphi _k\), as well as to increase accuracy (Fasshauer and Zhang 2007; Fasshauer and McCourt 2012). Some examples of radial basis kernels are the Gaussian \(\varphi (r)=e^{-r^2/c^2}\), the Multiquadric \(\varphi (r)={\sqrt{1+(c r)^{2}}}\), etc., where \(r=\Vert {\textbf{x}}_{j}-{\textbf{x}}_{i}\Vert\), and the shape parameter c controls the width of the function. c may take a specific value or be optimized, to attain higher accuracy for the particular dataset studied. Similar to the sigmoid functions, after the computation of \({\textbf{w}}_k\), we use Eq. 7 to compute \({\textbf{v}}\) and obtain the entire representation.
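A minimal sketch of the local RBF step, with a Gaussian kernel and the cluster's own observations as collocation points (Python/NumPy/SciPy; the shape parameter value is an arbitrary illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_rbf(r, c=1.0):
    return np.exp(-(r / c) ** 2)

def rbf_local_weights(Xk, yk, c=1.0):
    """Solve phi_k w_k = y_k (Eq. 6) for one cluster; Xk are both samples and collocation points."""
    Phi_k = gaussian_rbf(cdist(Xk, Xk), c)   # symmetric m_k x m_k kernel matrix
    # if Phi_k turns out singular, switch kernel or fall back to least squares, as noted above
    return np.linalg.solve(Phi_k, yk)

def rbf_local_predict(X_new, Xk, wk, c=1.0):
    """Local contribution of cluster k at new points, to be combined via v as in Eqs. 7-8."""
    return gaussian_rbf(cdist(X_new, Xk), c) @ wk
```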

2.3 ANNbN for the approximation of derivatives

Equation 8 offers an approximation to the sought solution, by applying algebraic operations on \(\varphi _j (r)\), where

$$\begin{aligned} r=\Vert {\textbf{x}}_{j}-{\textbf{x}}\Vert =\sqrt{\sum ^{n}_{p=1}{(x_{jp}-x_p)^2}}, \end{aligned}$$
(9)

and \(\varphi _j\) is a differentiable function with respect to any out-of-sample \({\textbf{x}}\), considering the n-dimensional collocation points \({\textbf{x}}_{j}\) as constants.

Accordingly, one may compute any higher-order derivative of the approximated function, by utilizing Eq. 8, simply differentiating the kernel \(\varphi\), and multiplying by the computed weights \({\textbf{w}}_k=(w_{jk})\), for all \(\textbf{x}_{j}\). In particular, we may approximate the lth derivative with respect to the pth dimension, at the location of the ith observation, by

$$\begin{aligned} \frac{ {\partial }^l f_{i}}{\partial {x^l_{ip}}}=\begin{pmatrix} \frac{ {\partial }^l \varphi _{{i}{1}}}{\partial {x^l_{ip}}} \quad \frac{ {\partial }^l \varphi _{{i}{2}}}{\partial {x^l_{ip}}} \quad \cdots \quad \frac{ {\partial }^l \varphi _{{i}{m_k}}}{\partial {x^l_{ip}}} \end{pmatrix}\mathbf {w_k}, \end{aligned}$$
(10)

where

$$\begin{aligned} \varphi _{{i}{j}}=\varphi _j({\textbf{x}}_i)=\varphi (\Vert \textbf{x}_{jk}-{\textbf{x}}_i\Vert ), \end{aligned}$$

and

$$\begin{aligned} \frac{ {\partial } \varphi _{{i}{j}}}{\partial {x_{ip}}}= \frac{ {\partial } \varphi _{{i}{j}}}{\partial {r_{{i}{j}}}} \frac{ {\partial } r_{{i}{j}}}{\partial {x_{ip}}}, \end{aligned}$$
(11)

where \({\textbf{x}}_{jk}\) denote the collocation points of cluster k, and \({\textbf{x}}_i\) the points where \(f_i=f({\textbf{x}}_i)\) is computed. Since the vector \({\textbf{v}}\) applies by multiplication and summation to all N clusters (Eq. 8), one may obtain the entire approximation for each partial derivative, by differentiating \(\varphi\) and applying all \({\textbf{w}}_k\) and the vector \({\textbf{v}}\) to \(\frac{ {\partial }^l \varphi _{ij}}{\partial {x^l_{ip}}}\). The weights remain the same for the function and its derivatives. We should underline that the differentiation in Eq. 11 holds for any dimension \(p\in \left\{ 1,2,\ldots ,n \right\}\) of \({\textbf{x}}_i\); hence, with the same formulation, we derive the partial derivatives with respect to any variable, in a concise setting.

For example, to approximate a function \(f(x_1,x_2)\) and later compute its partial derivatives with respect to \(x_1\), one can utilize the collocation points \({\textbf{x}}_{j}\) and write the squared distance

$$\begin{aligned} r=\left( {\textbf{x}}_{j1} - x_{1}\right) ^{2} + \left( {\textbf{x}}_{j2} - x_{2}\right) ^{2}, \end{aligned}$$

and using as kernel

$$\begin{aligned} \varphi (x_1,x_2)=-\frac{r^{4}}{4}, \end{aligned}$$

one obtains

$$\begin{aligned} \frac{ {\partial } \varphi (x_1,x_2)}{\partial {x_{1}}} =2 \left( {\textbf{x}}_{j1} - x_{1}\right) r^{3}, \end{aligned}$$

and hence

$$\begin{aligned} \frac{ {\partial }^2 \varphi (x_1,x_2)}{\partial {x^2_{1}}} =-2\left( {\textbf{x}}_{j1} - x_{1}\right) 6 \left( {\textbf{x}}_{j1} - x_{1}\right) r^{2} - 2 r^{3}. \end{aligned}$$

The variable \(x_1\) may take values from the collocation points or any other intermediate point, after the weights' calculation, in order to produce predictions for out-of-sample observations. In practice, one may select among the RBFs available in the literature, try new ones, or optimize their shape parameter c. In Appendix I, we also provide a simple computer code for the symbolic differentiation of any selected RBF, using the SymPy (Meurer et al. 2017) package.
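A minimal sketch of such symbolic differentiation, applied to the example kernel above (an illustration written here, not the Appendix I code itself):

```python
import sympy as sp

x1, x2, xj1, xj2 = sp.symbols('x1 x2 xj1 xj2', real=True)

r = (xj1 - x1) ** 2 + (xj2 - x2) ** 2     # squared distance to a collocation point
phi = -r ** 4 / 4                         # the example kernel of this section

dphi_dx1 = sp.diff(phi, x1)               # first partial derivative with respect to x1
d2phi_dx1 = sp.diff(phi, x1, 2)           # second partial derivative
print(sp.simplify(dphi_dx1))
print(sp.simplify(d2phi_dx1))

# turn the symbolic derivative into a fast numerical function for the weighted sums of Eq. 10
d2_fun = sp.lambdify((x1, x2, xj1, xj2), d2phi_dx1, 'numpy')
```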

Of particular interest are the Integrated RBFs (IRBFs) (Bakas 2019; Babouskos and Katsikadelis 2015; Yiotis and Katsikadelis 2015), which are formulated by indefinite integration of the kernel, such that their derivative is the RBF \(\varphi\). Accordingly, we may integrate the kernel more than once, to approximate higher-order derivatives. For example, by utilizing \(\text{erf}(x)=\frac{1}{\sqrt{\pi }}\int _{-x}^{x}{{{e}^{-{{t}^{2}}}}dt}\), and the twice-integrated Gaussian RBF for \(\varphi\) at the collocation points \(x_j\),

$$\begin{aligned} {{\varphi }_{j}}(x) =\frac{{{\text{c}}^{2}}{{\text{e}}^{\frac{-{{(x-{{x}_{j}})}^{2}}}{{{c}^{2}}}}}+\text{c}\sqrt{\pi }(x-{{x}_{j}})\,\text{erf}\frac{(x - {{x}_{j}})}{c}}{2}, \end{aligned}$$

we deduce that

$$\begin{aligned} \frac{ {d} \varphi _{{j}}}{d {x}}=\frac{\text{c}\sqrt{\pi }\,\text{erf}\frac{(x - {{x}_{j}})}{c}}{2}, \end{aligned}$$

and hence

$$\begin{aligned} \frac{ {d}^2 \varphi _{{j}}}{d {x^2}}={{e}^{-\frac{{{(x-x_{j})}^{2}}}{{{c}^{2}}}}}, \end{aligned}$$

which is the Gaussian RBF, approximating the second derivative \(\ddot{f}(x)\), instead of f(x).
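This relation can be verified symbolically; a short sketch with SymPy (assuming a positive shape parameter c):

```python
import sympy as sp

x, xj = sp.symbols('x x_j', real=True)
c = sp.symbols('c', positive=True)

# twice-integrated Gaussian IRBF from the text
phi = (c**2 * sp.exp(-(x - xj)**2 / c**2)
       + c * sp.sqrt(sp.pi) * (x - xj) * sp.erf((x - xj) / c)) / 2

print(sp.simplify(sp.diff(phi, x)))      # c*sqrt(pi)*erf((x - x_j)/c)/2
print(sp.simplify(sp.diff(phi, x, 2)))   # exp(-(x - x_j)**2/c**2), the Gaussian RBF
```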

2.4 ANNbN for the solution of partial differential equations

Similar to the numerical differentiation, we may easily apply the proposed scheme to numerically approximate the solution of Partial Differential Equations (PDEs). We consider a generic Differential operator

$$\begin{aligned} T=\sum _{l=1}^{p}g_{l}({\textbf{x}})D^{l}, \end{aligned}$$

depending on the \(D^{l}\) partial derivatives of the sought solution f, for some coefficient functions \(g_{l}({\textbf{x}})\), which satisfy

$$\begin{aligned} Tf=h, \end{aligned}$$

where h may be any function in the form of \(h(x_1,x_2,\ldots ,x_n)\). We may approximate f by

$$\begin{aligned} f=\sum \limits _{j=1}^{n}{{{w}_{j}}} \varphi _j({{x}}) \end{aligned}$$
(12)

By utilizing Eq. 10, we obtain a system of linear equations. Hence, the weights \(w_{jk}\) may be calculated by solving the resulting system, as per Eq. 6.

For example, consider the following generic form of the Laplace equation

$$\begin{aligned} \nabla ^{2}f&=h, \nonumber \\ \frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}+\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}&=h(x,y). \end{aligned}$$
(13)

The weights \({w}_{j}\) in Eq. 12 are constant, hence the differentiation regards only function \(\varphi\). Thus, by using

$$\begin{aligned} \mathbf {D^2}\pmb \varphi _k {:}{=}\Big (\frac{{{\partial }^{2}}\varphi _{i,j}}{\partial {{x}^{2}}}+ \frac{{{\partial }^{2}}\varphi _{i,j}}{\partial {{y}^{2}}}\Big )_{i \in [m_k], j \in [n]} \end{aligned}$$

and writing Eq. 13 for all \(h_{ik}=h(\textbf{x}_{ik})=y_{ik}\) found in cluster k, we obtain

$$\begin{aligned} \mathbf {D^2}\pmb \varphi _k \times {\textbf{w}}_k = {\textbf{y}}_k. \end{aligned}$$
(14)

Because the weights \(w_{jk}\) are the same for the approximated function and its derivatives, we may apply some boundary conditions for the function or its derivatives \(D^l\), at some boundary points \([b] {:}{=}\{1,2,\ldots ,m_b\}\),

$$\begin{aligned} \frac{{{\partial }^{l}}f({\textbf{x}}_b)}{\partial {{x}^{l}_p}}=y_b \end{aligned}$$

by defining

$$\begin{aligned} {\mathbf {D^l}\pmb \varphi _k} {:}{=}\Big (\frac{{{\partial }^{l}}\varphi _{ij}}{\partial {{x}^{l}_p}}\Big )_{i \in [m_b], j \in [b]} \end{aligned}$$

and using

$$\begin{aligned} {\mathbf {D^l}\pmb \varphi _k} \times {\textbf{w}}_k = {\textbf{y}}_b \end{aligned}$$
(15)

Hence, we may compute \({\textbf{w}}_k\) by solving the resulting system of equations

$$\begin{aligned} \begin{pmatrix} \mathbf {D^2}\pmb \varphi _k \\ \mathbf {D^l}\pmb \varphi _k \\ \end{pmatrix} {\textbf{w}}_{k} = \begin{pmatrix} {\textbf{y}}_k \\ {\textbf{y}}_b \\ \end{pmatrix}, \end{aligned}$$
(16)

similar to Eq. 6 for cluster k. Afterwards, we may obtain the entire representation for all clusters, by using Eq. 7 for the computation of \({\textbf{v}}\). Finally, we obtain the sought solution by applying the computed weights \({\textbf{w}},{\textbf{v}}\) in Eq. 8, for any new \({\textbf{x}}\).
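As a minimal single-cluster sketch of assembling the stacked system of Eq. 16 for the above Laplace/Poisson example, using a Gaussian kernel and its analytic Laplacian (the collocation strategy and the shape parameter value are assumptions made for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian(r, c):
    return np.exp(-(r / c) ** 2)

def gaussian_laplacian(r, c):
    # (d^2/dx^2 + d^2/dy^2) of exp(-r^2/c^2), with r the Euclidean distance in 2D
    return (4.0 * r**2 / c**4 - 4.0 / c**2) * gaussian(r, c)

def solve_poisson_cluster(X_in, h_in, X_bc, f_bc, c=0.5):
    """Interior rows impose the PDE (Eq. 14), boundary rows impose Dirichlet values (Eq. 15);
    the interior points of the cluster serve as collocation points."""
    A = np.vstack([gaussian_laplacian(cdist(X_in, X_in), c),   # D^2 phi_k
                   gaussian(cdist(X_bc, X_in), c)])            # boundary conditions on f
    rhs = np.concatenate([h_in, f_bc])
    wk, *_ = np.linalg.lstsq(A, rhs, rcond=None)               # stacked system of Eq. 16
    return wk   # f is then approximated by gaussian(cdist(X_new, X_in), c) @ wk
```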

2.5 Deep networks

A method for the transformation of shallow ANNbNs to Deep Networks is also presented. Although Shallow Networks exhibit very high accuracy even for unstructured and complex data sets, Deep ANNbNs may be utilized for research purposes, for example at the intersection of neuroscience and artificial intelligence. After the calculation of the weights \(w_{jk}\) for the first layer, we use them to create a second layer (Fig. 3a), where each node corresponds to the given \(y_i\). We then use the same procedure for each neuron k of layer \([l] {:}{=}\left\{ 2,3,\ldots ,L \right\}\), by solving:

$$\begin{aligned} \bigg [ \sigma \odot \Big (\big ({\textbf{X}} \Big |{\textbf{1}}\big ) \times {\textbf{w}}_l\Big ) \bigg ] \times {\textbf{v}}_l = \sigma ^{-1} \odot {\textbf{y}} \end{aligned}$$
(17)

with respect to \({\textbf{v}}_l\). For each layer l, \({\textbf{w}}_l\) corresponds to \({\textbf{w}}\) of Eq. 3. We may arbitrarily select any number of neurons within each layer l, without additional computational cost, as the solution \({\textbf{v}}_l\) of Eq. 17 depends only on the weights \({\textbf{w}}_l\), corresponding to the previously computed layer. We should note that although Eq. 17 is similar to Eqs. 3 and 4, we now use both the sigmoid (left-hand side) and the inverse sigmoid (right-hand side) functions. This procedure is iterated for all neurons k of layer l. The matrix \({\textbf{w}}_l\) corresponds to the weights of layer \(l-1\). Finally, for the output layer, we calculate the linear weights \(v_k\), as per Eq. 4. This procedure results in a good initialization of the weights, close to the optimal solution, and if we normalize \(y_i\) in a range close to the linear part of the sigmoid function \(\sigma\) (say [0.4, 0.6]), we rapidly obtain a deep network with errors approximately equal to those of the shallow one. Afterwards, any optimization method may be supplementarily applied to compute the final weights; however, the accuracy is already very high.

Fig. 3

Transformation of the basic Numerical Scheme

Alternatively, we may utilize the layer \({\hat{{\textbf{X}}}}\) obtained from the shallow implementation of ANNbN (see Eq. 4) as input \(x_{ij}\) for a second layer (without computing \({\textbf{v}}\)), then for a third, and sequentially up to any desired number of layers.
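A minimal sketch of this stacking alternative (the `train_layer` callback is a hypothetical stand-in for the shallow procedure of Sect. 2.1, assumed to return the layer weights and the sigmoid activations of all samples):

```python
import numpy as np

def stack_layers(X, y, train_layer, n_layers=3):
    """Feed the hidden activations of each shallow ANNbN forward as inputs of the next layer."""
    inputs, weights = X, []
    for _ in range(n_layers):
        W_l, H_l = train_layer(inputs, y)   # e.g. the shallow two-step procedure of Sect. 2.1
        weights.append(W_l)
        inputs = H_l                        # activations become the next layer's features
    # final linear output layer, solved by least squares as in Eq. 4
    Xhat = np.hstack([inputs, np.ones((len(inputs), 1))])
    v, *_ = np.linalg.lstsq(Xhat, y, rcond=None)
    return weights, v
```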

2.6 Generalization in terms of ensembles

A generalisation of the approach presented in Sects. 2.1–2.5 can be obtained by averaging the results of different ANNbN models, each one fitted to a different portion of the data (Dietterich et al. 2002; Li et al. 2016). This can be done by randomly sub-sampling a percentage \(\alpha \%\) of the observations, running the ANNbN algorithm multiple times \(i_f\in \left\{ 1,2,\ldots ,n_f \right\}\), and averaging the results with weights inversely proportional to the errors \(\epsilon _{i_f}\) over all \(n_f\) folds:

$$\begin{aligned} y_i=\frac{\sum ^{n_f}_{i_f=1} y_{i,i_f}\frac{1}{\epsilon _{i_f}}}{\sum ^{n_f}_{i_f=1} \frac{1}{\epsilon _{i_f}}}, \end{aligned}$$

and using \(y_i\) to constitute an Ensemble of ANNbN models (Fig. 3b). Ensembles of ANNbNs exhibit increased accuracy and generalization properties for noisy data, as per the following Numerical Experiments.
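A minimal sketch of the error-weighted averaging (Python/NumPy; the individual model predictions and their training errors are assumed to be available):

```python
import numpy as np

def ensemble_predict(predictions, errors):
    """Average n_f ANNbN predictions, each weighted by 1/epsilon_{i_f} as in the scheme above.

    predictions : (n_f, m) array of per-model predictions, errors : (n_f,) positive errors.
    """
    w = 1.0 / np.asarray(errors, dtype=float)
    return (w[:, None] * np.asarray(predictions)).sum(axis=0) / w.sum()

# usage sketch with three hypothetical fitted models:
# y_hat = ensemble_predict([pred_1, pred_2, pred_3], [eps_1, eps_2, eps_3])
```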

2.7 Time complexity of the ANNbN algorithm

The training of an ANN with two layers and only three nodes has been proven to be NP-complete (Blum and Rivest 1989), if the nodes compute linear threshold functions of their inputs. Even simple cases, and approximating hypotheses, result in NP-complete problems (Engel 2001). Apart from the theoretical point of view, the slow speed of learning algorithms is a major flaw of ANNs. On the contrary, ANNbNs are fast, because the main part of the approximation regards operations with small square matrices of size \((n+1)\times (n+1)\), with n being the number of features. Below we provide a theoretical investigation of the ANNbNs' time complexity, which has been empirically validated by running the supplementary code.

Statement 1: ANNbN Training is conducted in three distinct steps: a) Clustering, b) Solution of linear systems of equations with small-sized matrices \({\textbf{X}}_k\) (Eq. 1) for the calculation of \(w_{jk}\) weights, and c) Calculation of \(v_k\) weights (Eq. 5).

Statement 2: Let m be the number of observations, and n the number of features. We assume that \(m\gg n\) (e.g. \(m>10n\), as usual for regression problems), which corresponds to the case where there are more samples than features. We note that the number of clusters N is also equal to the number of neurons (see Eq. 1, and Fig. 1).

Lemma 1

Time complexity of step (a) is \({\mathcal {O}}(mNni_{cl})\).

Proof

The running time of Lloyd's algorithm (and most variants) is \({\mathcal {O}}(mNni_{cl})\) (Hartigan and Wong 1979; Manning et al. 2008), with \(i_{cl}\) denoting the number of iterations necessary to converge. We should note that, in the worst case, the complexity of Lloyd's algorithm is super-polynomial (Blömer et al. 2016; Arthur and Vassilvitskii 2006), with \(i_{cl}=2^{\varOmega ({\sqrt{m}})}\), and the same holds for the ANNbN algorithm. However, in practice, the algorithm converges for small \(i_{cl}\). For the case when the \(k-\)means\(++\) algorithm with \(D^2-\)weighting is used, the expected clustering error is within a factor of \({\mathcal {O}}(\log {N})\) of the optimal (Arthur and Vassilvitskii 2007). For Theorem 1, we use the more conservative approximation, that is, Lloyd's algorithm with complexity \({\mathcal {O}}(mNni_{cl})\). \(\hfill\square\)

Lemma 2

Time complexity of step (b) is \({\mathcal {O}}(Nmn^2+Nn^3)\)

Proof

Time complexity of step (b) regards the solution of linear systems with matrices of size \((n+1) \times (n+1)\) (Eq. 1). However, in the general case, the clustering algorithm may result in clusters of unequal sizes. Hence, one needs to solve a system with non-square matrices \({\textbf{X}}_k\) of size \(m_k \times n\) (Eq. 1). In the worst-case scenario, a cluster contains \(m-N+1\) samples, while each of the remaining clusters comprises a single sample. Accordingly, the complexity of step (b) regards the solution of a linear system with \({\textbf{X}}_k\) having dimensions \((m-N+1) \times n\). When solving this system, and assuming that \({\textbf{X}}_k\) is of full column rank, the multiplication \({\hat{{\textbf{X}}}}_k^{T}{\hat{{\textbf{X}}}_k}\) results in complexity \({\mathcal {O}}(nmn)={\mathcal {O}}(n^2m)\), as \(m>(m-N+1)\). Then, one proceeds to solve the corresponding square linear system with complexity \({\mathcal {O}}(n^3)\), as well as the multiplication of \(({\hat{{\textbf{X}}}_k}^{T}{\hat{{\textbf{X}}}_k})^{-1}\) with \({\hat{{\textbf{X}}}_k}^{T}\), resulting in complexity \({\mathcal {O}}(nnm)\), and of \(({\hat{{\textbf{X}}}_k}^{T}{\hat{{\textbf{X}}}_k})^{-1} {\hat{{\textbf{X}}}_k}^{T}\) with \(\sigma ^{-1} \odot ({\textbf{y}}_k )\), resulting in complexity \({\mathcal {O}}(nm)\). Thus, the total complexity is \({\mathcal {O}}(mn^2+n^3+mn^2+mn)\). This is repeated N times, hence the complexity is \({\mathcal {O}}(Nmn^2+Nn^3)\). \(\hfill\square\)

Lemma 3

Time complexity of step (c) is \({\mathcal {O}}(mN^2+N^3)\)

Proof

Step (c) regards the solution of an \(m\times N\) system of equations (Eq. 4). We assume that the linear systems to be solved in Eq. 5 are of full column rank. Hence, the complexity regards the multiplication \({\hat{{\textbf{X}}}}^{T}{\hat{{\textbf{X}}}}\) with complexity \({\mathcal {O}}(NmN)={\mathcal {O}}(N^2m)\), obtaining the solution of the corresponding linear system with complexity \({\mathcal {O}}(N^3)\), as well as the multiplication of \(({\hat{{\textbf{X}}}}^{T}{\hat{{\textbf{X}}}})^{-1}\) with \({\hat{{\textbf{X}}}}^{T}\), with complexity \({\mathcal {O}}(NNm)\), and of \(({\hat{{\textbf{X}}}}^{T}{\hat{{\textbf{X}}}})^{-1} {\hat{{\textbf{X}}}}^{T}\) with \({\textbf{y}}\), with complexity \({\mathcal {O}}(Nm)\). Thus, the total complexity is \({\mathcal {O}}(mN^2+N^3+mN^2+mN)={\mathcal {O}}(mN^2+N^3)\).

\(\hfill\square\)

Theorem 1

(ANNbN Complexity) The running time of ANNbN algorithm is \({\mathcal {O}}(mNni_{cl} + Nmn^2+Nn^3 + mN^2+N^3)\).

Proof

By considering the time complexity of each step (Lemmas 1–3), we deduce that the total computational complexity of the ANNbN algorithm is \({\mathcal {O}}(mNni_{cl} + Nmn^2+Nn^3 + mN^2+N^3)\). \(\hfill\square\)

In practice, the clustering algorithm converges after a few iterations \(i_{cl}\), and the computing time is a third-order polynomial in the number of neurons, second-order in the number of features, and linear in the number of samples. A comparison between the theoretical and experimental computing times is illustrated in Fig. 4.

Fig. 4

Experimental vs Theoretical Time of the ANNbN algorithm, when varying the samples-m (a), features-n (b), and neurons-N (c)

Particularly, Fig. 4 depicts the Experimental and Theoretical Computing Times for the case of Griewank Function. Figure 4a corresponds to \(n=10^2\) features, \(N=10\) neurons, and a variation of samples m ranging from \(10^3\) to \(10^4\) with step \(10^3\). Figure 4b regards \(m=10^4\) samples, \(N=10^2\) neurons and a variation of features n ranging from 10 to \(10^2\) with step 10. Finally, Fig. 4c corresponds to \(m=10^5\) samples, \(n=10\) features, and a variation of neurons N ranging from \(10^2\) to \(10^3\) with step \(10^2\). Each case has been run 10 times with random samples for input and the Average, Minimum and Maximum Experimental Times are reported. The Theoretical Times presented in Figs. 4a–c have been obtained by standardising the resulting complexity of Theorem 1, based on the average Experimental Times measured on Cyclone Facility (https://hpcf.cyi.ac.cy/), using a single node with 40 cores, for cases (a), (b) and (c), respectively.

Within the limits of this experiment, we notice a linear pattern (with slope less than 1) in the average Experimental Time in Fig. 4a, which is expected from Theorem 1 for the case when we vary m. In Fig. 4b, we may see a nonlinear growth of the experimental time with n. A second-order polynomial fit to the average experimental time results in an \(R^2=0.9942\), which is consistent with Theorem 1 as well. Finally, we would expect a third-order polynomial in the number of neurons N; however, the trend of the experimental time exhibits a sublinear stabilization pattern. This may be a result of the actual logarithmic complexity of the first step (clustering). Additionally, the BLAS operations, which run in parallel, may contribute to this improvement; however, a more analytical investigation should be performed in future research. It is worth noting that, in the case of small matrices \(\hat{{\textbf{X}}}_k\), least squares may be used instead of the generalized inversion assumed in Theorem 1, especially when one solves for many neurons with few samples per cluster, which may render the solution of the systems in Eq. 1 unstable.
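For reference, a small sketch of how the theoretical curves of Fig. 4 can be produced from Theorem 1 (the number of clustering iterations and the reference configuration used for scaling are assumptions for illustration):

```python
def annbn_ops(m, n, N, i_cl=10):
    """Operation count of Theorem 1, up to constant factors."""
    return m * N * n * i_cl + N * m * n**2 + N * n**3 + m * N**2 + N**3

def theoretical_time(m, n, N, t_ref, ref=(10_000, 100, 10)):
    """Scale the operation count to seconds using one measured time t_ref at the reference (m, n, N)."""
    return t_ref * annbn_ops(m, n, N) / annbn_ops(*ref)
```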

3 Validation results

3.1 1D function approximation and its geometric point of view

We consider a simple one-dimensional function f(x), with \(x\in {\textbf{R}}\), to present the basic functionality of ANNbNs. Because \({\sigma ^{-1}} (y)=\log \left( {\frac{y}{1-y}}\right)\) is unstable for \(y\rightarrow 0\) and \(y \rightarrow 1\), we normalize the responses in the domain [0.1, 0.9]. In Fig. 5, the approximation of \(f(x)=0.3\sin (e^{3x})+0.5\) is depicted, for a varying number of neurons utilized in the ANNbN. We may see that, by increasing the number of neurons from 2 to 8, the approximating ANNbN exhibits more curvature alterations. This complies with the Universal Approximation Theorem and offers a geometric point of view. Interestingly, the results are not affected by adding some random noise, \(\epsilon \sim {{\mathcal {U}}}(-\frac{1}{20},\frac{1}{20})\), as the Mean Absolute Error (MAE) for this noisy dataset was \(2.10\hbox{E}{-}2\) for the train set, and even smaller, \(1.48\hbox{E}{-}2\), for the test set, further indicating the capability of ANNbN to approximate the hidden signal and not the noise. We should note that, for noiseless data of 100 observations and 50 neurons, the MAE in the train set was \(6.82\hbox{E}{-}6\) and in the test set \(8.01\hbox{E}{-}6\). The approximation of the same function with a Gaussian RBF and shape parameter \(c=0.01\) results in an MAE of \(7.52\hbox{E}{-}8\) for the train set and \(1.07\hbox{E}{-}7\) for the test set.
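A short sketch of this setup (Python/NumPy): generating the test function and mapping the responses into [0.1, 0.9] (and back), so that \(\sigma^{-1}\) remains stable; the sample size is an assumption for illustration.

```python
import numpy as np

def scale_to_range(y, lo=0.1, hi=0.9):
    """Map responses into [lo, hi] so that sigma^{-1}(y) = log(y / (1 - y)) stays stable."""
    y_min, y_max = y.min(), y.max()
    return lo + (hi - lo) * (y - y_min) / (y_max - y_min), (y_min, y_max, lo, hi)

def scale_back(y_scaled, params):
    y_min, y_max, lo, hi = params
    return y_min + (y_scaled - lo) * (y_max - y_min) / (hi - lo)

# the 1D test function of this section, sampled at 100 points in [0, 1]
rng = np.random.default_rng(0)
x = np.sort(rng.random(100))
y = 0.3 * np.sin(np.exp(3 * x)) + 0.5
y_scaled, params = scale_to_range(y)
```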

Fig. 5

ANNbN with 2, 4 and 8 neurons, for the approximation of \(f(x)=0.3\sin (e^{3x}) + 0.5\)

3.2 Regression in \({\mathbb {R}}^n\)

We assess the performance of ANNbN by using four nonlinear functions with singularities and folding points, distorted with five cases of noise each (Fig. 6).

Fig. 6

Regression Errors for various Nonlinear Functions and Noise Distributions: 1: Uniform, 2: Normal, 3: Generalized Pareto, 4: Log-Normal, 5: Mixture

Particularly, the Gomez–Levy function

$$\begin{aligned} L(x_1,x_2)=4x_1^{2}-2.1x_1^{4}+{\frac{1}{3}}x_1^{6}+x_1x_2-4x_2^{2}+4x_2^{4} \end{aligned}$$

subject to \(-\sin (4\pi x_1)+2\sin ^{2}(2\pi x_2)\le 1.5\), in order to check irregularity at the boundaries, the polynomial of five variables,

$$\begin{aligned} P(\textbf{x})=-x_1+\frac{x_2^2}{2}-\frac{x_3^3}{3}+\frac{x_4^4}{4}-\frac{x_5^5}{5}, \end{aligned}$$

the Shekel function with \(m_a=10\) maxima in \(n=25\) dimensions, \({\textbf{c}}=(c_i)_{i \in [1,2,\ldots ,m_a]}\), \(c_i \sim {\mathcal {U}}(0,1)\), and \({\textbf{a}}=(a_{ij})_{i \in [1,2,\ldots ,m_a], j \in [1,2,\ldots ,n]}\), with \(a_{ij} \sim {{\mathcal {U}}}(-1/2,1/2)\)

$$\begin{aligned} S({{\textbf{x}}})=\sum _{i=1}^{m_a}\;\left( c_{i}+\sum \limits _{j=1}^{n}(x_{j}-a_{ji})^{2}\right) ^{-1} \end{aligned}$$

and the highly nonlinear Griewank function (Griewank 1981),

$$\begin{aligned} G({\textbf{x}})=1+{\frac{1}{4000}}\sum _{{i=1}}^{n}x_{i}^{2}-\prod _{{i=1}}^{n}\cos \left( {\frac{x_{i}}{{\sqrt{i}}}}\right) + \epsilon . \end{aligned}$$

In all cases we use \(m=10^4\) observations for the train as well as the test sets, while we add noise to the train set only, in order to assess the capability of the method to approximate the signal and not the noise. In particular, we use the following cases: (1) Uniform, (2) Normal, (3) Generalized Pareto, (4) Log-Normal, and (5) a Mixture of Log-Normal, Exponential, and Frechet distributions (by sub-sampling from each and concatenating the samples), in order to investigate a variety of noise distributions. All the noise vectors are normalized to have zero mean and a mean absolute value equal to 5% of the mean value of the target \({\textbf{y}}\). The target is normalized in [0.1, 0.9].
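A minimal sketch of this noise normalization, applied to the Griewank targets with log-normal noise as one of the five cases (Python/NumPy; the sampling domain of the inputs is an assumption for illustration):

```python
import numpy as np

def add_scaled_noise(y, raw_noise, level=0.05):
    """Center the noise and rescale it so that its mean absolute value equals
    `level` times the mean of the target, as described above."""
    e = raw_noise - raw_noise.mean()
    e *= level * np.abs(y.mean()) / np.abs(e).mean()
    return y + e

rng = np.random.default_rng(1)
X = rng.uniform(-5.0, 5.0, size=(10_000, 10))          # m = 10^4 observations, n = 10 features
y = 1 + (X**2).sum(axis=1) / 4000 - np.cos(X / np.sqrt(np.arange(1, 11))).prod(axis=1)
y_noisy = add_scaled_noise(y, rng.lognormal(size=y.size))
```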

For the approximation with ANNbN, we use a Gaussian kernel with shape parameter \(c=10\) for the RBFs, and \(N=\lfloor \frac{m}{50}\rfloor =200\) neurons, because a regression is performed in the output layer (Eqs. 4–7), and it is important to have a sufficient ratio of observations to neurons (i.e. 50). We compare the performance with other methods, i.e. Random Forests (Breiman 2001), as implemented in Sadeghi (2013), XGBoost (Xu and Chen 2014; Ruder 2016), and AdaBoost from ScikitLearn (Scikitlearn 2016). We use the Mean Absolute Percentage Error (MAPE) as a metric. The results are presented in Fig. 6, indicating the high accuracy attained with ANNbNs. We should note that the solution exhibited low sensitivity to the partitioning. In particular, by changing the initial seeding algorithm for clustering with k-means, for the polynomial function studied, the MAPE varied from 0.00277 (Kmpp Algorithm, Arthur and Vassilvitskii 2007) to 0.00268 (KmCentrality Algorithm, Park and Jun 2009), while with random seeding the MAPE was 0.00278.

3.3 Classification for computer vision

As highlighted in the introduction, the reproducibility of AI research is a major issue. We utilize ANNbN for the MNIST database (LeCun et al. 1998, 2010), obtained from (Shindo 2015), consisting of \(6\times 10^4\) handwritten digits \(\in [0,9]\) for training and \(10^4\) for testing. The investigation regards a variety of ANNbN formulations and the comparison with other methods. In particular, \(\text{erf}(x)=\frac{1}{\sqrt{\pi }}\int _{-x}^{x}{{{e}^{-{{t}^{2}}}}dt}\) and \(\sigma (x)=\frac{1}{1+e^{-x}}\) were utilized as activation functions, with the corresponding \(\text{erf}^{-1}(x)\) and \(\sigma ^{-1}(x)\) in Eq. 1. We constructed ANNbNs with one and multiple layers, varying the number of neurons and normalizing y in the domain \(\left[ \epsilon , 1-\epsilon \right]\). The results regard separate training for each digit. All results in Table 1 are obtained without any clustering. As an accuracy metric, we consider the percentage of Correctly Classified (CC) digits, divided by the number of observations m

$$\begin{aligned} \alpha =100\frac{CC}{m}\%. \end{aligned}$$

This investigation aimed to compare ANNbN with standard ANN algorithms such as Flux (Innes et al. 2018; Innes 2018), as well as with Random Forests as implemented in Sadeghi (2013), and XGBoost (Xu and Chen 2014). Table 1 presents the results in terms of accuracy and computing time. The models are trained on the raw data set, without any exploitation of spatial information. The results in Table 1 are exactly reproducible in terms of accuracy, as no clustering was utilized and the indices are taken in ascending order. For example, the running time to train 5000 neurons is 29.5 s on average for each digit, which is fast, considering that the training regards 3,925,785 weights, for 6E4 instances and 784 features. Also, the Deep ANNbNs with 10 layers of 1000 neurons each are trained in the considerably short timeframe of 91 s per digit on average (Table 1). Correspondingly, in Table 1, we compare the accuracy and running time with Random Forests (with \(261\approx 784/3\) trees) and XGBoost (200 rounds). Future steps may include data preprocessing and augmentation, the exploitation of spatial information as in CNNs, the utilization of clustering for the neighborhoods' training, Ensembles and other combinations of ANNbNs, as well as further training of the initial ANNbN with an optimizer such as stochastic gradient descent, which may achieve even higher accuracy. No GPU or parallel programming was utilized, which might also be a topic for future research. For example, the RBF implementation of ANNbN with clustering and \(1.2\times 10^4\) neurons exhibits a test set accuracy of 99.7% for digit 3. The accuracy results regard the out-of-sample test set with \(10^4\) digits. The running time was measured on an Intel i7-6700 CPU @3.40 GHz with 32 GB memory and an SSD hard disk. A computer code to feed the calculated weights into Flux (Innes et al. 2018) is provided.
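A small sketch of the per-digit setup and the accuracy metric (the aggregation of the ten per-digit models by taking the maximum response is an assumption for illustration, as the text above only specifies separate training per digit):

```python
import numpy as np

def per_digit_targets(labels, digit, eps=0.1):
    """Binary targets for one digit, mapped into [eps, 1 - eps] before training."""
    return np.where(labels == digit, 1.0 - eps, eps)

def classify(scores):
    """scores : (m, 10) outputs of the ten per-digit models; predict the digit with the largest response."""
    return scores.argmax(axis=1)

def accuracy(predicted, labels):
    """alpha = 100 * CC / m (%)"""
    return 100.0 * np.mean(predicted == labels)
```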

3.4 Solution of partial differential equations

We consider the Laplace Equation

$$\begin{aligned} \frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}+\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}=0, \end{aligned}$$

in a rectangle with dimensions (a, b), and boundary conditions \(f(0,y)=0\) for \(y \in [0,b]\), \(f(x,0)=0\) for \(x \in [0,a]\), \(f(a,y)=0\) for \(y \in [0,b]\), and \(f(x,b)=f_0\sin (\frac{\pi }{a}x)\) for \(x \in [0,a]\). In Fig. 7a, the numerical solution as well as the exact solution

$$\begin{aligned} f(x,y)=\frac{f_0}{\hbox{sinh}\left( \frac{\pi }{a}b\right) } \sin \left( \frac{\pi }{a}x\right) \hbox{sinh}\left( \frac{\pi }{a}y\right) , \end{aligned}$$

are presented. The MAE between the closed-form solution and the ANNbN numerical solution was found to be \(3.97\hbox{E}{-}4\). Interestingly, if we add some random noise to the zero source, i.e.

$$\begin{aligned} \frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}+\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}=\epsilon \sim {\mathcal {U}}\left( 0,\frac{1}{10}\right) , \end{aligned}$$
(18)

the MAE remains small, and in particular \(2.503\hbox{E}{-}3\), for \(a=b=1\), over a rectangular grid of points with \(dx=dy=0.02\). It is important to underline that numerical methods for the solution of partial differential equations are highly sensitive to noise (Mai-Duy and Tran-Cong 2003; Bakas 2019), as noise distorts the derivatives. However, with the ANNbN solution, the results change only slightly, as reflected in the above errors. This is further highlighted if we utilize the calculated weights of the ANNbN approximation to compute the second-order partial derivatives of the solution f of Eq. 18, \(\frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}\) and \(\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}\): the corresponding MAE is \(6.72\hbox{E}{-}4\) (Fig. 7b), which is about two orders of magnitude smaller than the added noise \(E({{\mathcal {U}}}(0,\frac{1}{10}))=0.05\), implying that ANNbN approximates the signal and not the noise, even in PDEs and even with a stochastic source.
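For reference, a short sketch of the grid and the closed-form benchmark used to compute the reported MAE (Python/NumPy; `f_num` stands for a numerical solution obtained with the scheme of Sect. 2.4 and is not computed here):

```python
import numpy as np

a = b = 1.0
f0 = 1.0
xs = np.arange(0.0, a + 1e-9, 0.02)                  # dx = 0.02
ys = np.arange(0.0, b + 1e-9, 0.02)                  # dy = 0.02
Xg, Yg = np.meshgrid(xs, ys)

# closed-form solution of the noise-free problem, used as the reference
f_exact = f0 / np.sinh(np.pi * b / a) * np.sin(np.pi * Xg / a) * np.sinh(np.pi * Yg / a)

# mae = np.abs(f_num - f_exact).mean()   # f_num: ANNbN numerical solution on the same grid
```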

Fig. 7

ANNbN solution of Laplace’s Equation with stochastic source

4 Discussion and conclusions

As described in the formulation of the proposed method, we may use a variety of ANNbNs, such as the sigmoid or Radial Basis Function schemes, Ensembles of ANNbNs, Deep ANNbNs, etc. The method adheres to the theory of function approximation with ANNs, as per visual representations of ANNs' capability to approximate continuous functions (Nielsen 2015; Rojas 2013). We explained the implementation of the method in the presented illustrative examples, which may be reproduced with the provided computer code. In general, sigmoid functions are faster, RBFs are more accurate, and Ensembles of either sigmoids or RBFs handle noisy data sets better. RBFs may use matrices of size smaller than \(N=\lfloor \frac{m}{n+1}\rfloor\), and hence approximate data sets with limited observations and many features. The overall results are stimulating in terms of speed and accuracy, compared to state-of-the-art methods in the literature.

The approximation of partial derivatives and the solution of PDEs, with or without a noisy source, in a fast and accurate setting, offers a solid step towards the unification of Artificial Intelligence algorithms with Numerical Methods and Scientific Computing. Future research may consider the implementation of ANNbNs in specific AI applications such as pattern recognition in environmental data and remote sensing observations, hydrometeorological predictions, regression analysis, and the solution of other types of PDEs for environmental modelling and risk assessment. Furthermore, the investigation of sigmoid functions other than the logistic, such as \(\tanh , \arctan , \text{erf}, \text{softmax}\), etc., as well as of other RBFs, such as multiquadric, integrated, etc., and the selection of an optimal shape parameter for even higher accuracy, are also of interest. Finally, while the computation of the weights is already very fast, the algorithm may easily be parallelized, as the weights' computation requires solving N independent linear systems with matrices \({\textbf{X}}_k\).

Interpretable AI is a modern demand in science, and ANNbNs are inherently suitable for this purpose, as by checking the approximation errors of the neurons in each cluster, one may retrieve information on the local accuracy, as well as on local and global non-linearities in the data. Furthermore, as demonstrated in the examples, the method performs well on small data sets without overfitting, by approximating the signal and not the noise, which is a common problem of ANNs.