1 Introduction

Although Artificial Intelligence (AI) has broadened its numerical methods and extended its fields of application (Shahiri Tabarestani and Afzalimehr 2021; Shaibani et al. 2021; Mohebbi Tafreshi et al. 2020; Kasiviswanathan and Sudheer 2017), empirical rigor has not kept pace with such advancements (Sculley et al. 2018; Bakas et al. 2022), with researchers questioning the accuracy of iterative algorithms (Hutson 2018a), as the results obtained for a certain problem are not always reproducible (Hutson 2018b; Belthangady and Royer 2019). In theory, Artificial Neural Networks (ANN) are capable of approximating any continuous function (Hassoun et al. 1995) but, beyond guaranteeing existence, the theory alone does not provide a universal approach for calculating an optimal set of ANN model parameters, also referred to as weights, and a variety of algorithmic implementations have been developed for this purpose (Li et al. 2016; Yang and Wu 2016; Lin et al. 2019). Along these lines, iterative optimization algorithms (Ruder 2016) are usually applied to reach an optimal set of ANN weights, which minimize the total error of the model estimates. Note, however, that apart from trivial cases rarely met in practice, the optimization problem has more than one local minimum, and its solution requires multiple iterations that significantly increase the computational load. To resolve this issue, enhanced optimization methods, such as stochastic gradient descent (Bottou 2010; Johnson and Zhang 2013), have been proposed. Another common issue in ANN applications is overfitting, which relates to the selection of a weighting scheme that approximates a given set of data well, while failing to generalize the accuracy of the predictions beyond the training set. To remedy overfitting, several methods have been proposed and effectively applied, such as dropout (Srivastava et al. 2014). Additional, and probably more important, concerns regarding the effective application of ANN algorithms are: (a) the arbitrary selection of the number of computational neurons, which may result in an unnecessary increase of the computational time, and (b) the optimization of the hyper-parameters of the selected ANN architecture (Bergstra et al. 2011; Bergstra and Bengio 2012; Feurer and Hutter 2019), which corresponds to solving an optimization problem whose objective function values are determined by the solution of another optimization problem, namely the calculation of the ANN weights for a given training set.

The purpose of this work is to develop a numerical scheme for the calculation of the optimal weights, the number of neurons, and other parameters of ANN algorithms, which relies on theoretical arguments, in our case the Universal Approximation Theorem, and, at the same time, is fast and precise. This has been attained without deviating from the classical ANN representation, by utilizing a novel numerical scheme: first dividing the studied data set into small neighborhoods and subsequently performing matrix manipulations for the calculation of the sought weights. Local approximation with the Heaviside activation function cannot be constructed in a Euclidean space of dimension higher than 1 (Chui et al. 1994); thus we propose a scheme with other sigmoid functions such as the logistic, tanh, etc. The numerical experiments exhibit high accuracy, attaining a remarkably low number of errors in the test set of well-known data sets such as the MNIST database for computer vision (LeCun et al. 2010), and of complex nonlinear functions for regression, while the computational time is kept short. Interestingly, the same algorithmic scheme may be applied to approximate the solution of Partial Differential Equations (PDEs), appearing in Physics, Engineering, Environmental Sciences, etc. The paper is organized as follows: In Sect. 2, we present the general formulation of the suggested method, hereafter referred to as ANNbN (Artificial Neural Networks by Neighborhoods). More precisely, the basic formulation of the ANNbN approach is progressively developed in Sects. 2.1.1 and 2.1.2. Section 2.2 extends the method to the case when radial basis functions are utilized, while Sects. 2.3 and 2.4 implement the method for the approximation of derivatives and the solution of PDEs, respectively. Section 2.5 transforms the original scheme to Deep Networks, and Sect. 2.6 to Ensembles of ANNs. The results of the conducted numerical experiments are presented and discussed in Sect. 3. Conclusions and future research directions are presented in Sect. 4. An open-source computer code written in the Julia (Bezanson et al. 2017) and Python (Python 2001-2021) programming languages is available at https://github.com/nbakas/ANNbN.jl.

2 Artificial neural networks by neighborhoods (ANNbN)

Let \({{x}_{ij}}\) denote the given data for \(j\in \left\{ 1,2,\ldots ,n \right\}\) input variables over \(i\in \left\{ 1,2,\ldots ,m \right\}\) observations, with corresponding responses \(y_i\). The Universal Approximation Theorem (Cybenko 1989; Tadeusiewicz 1995) ensures the existence of an integer N, such that

$$\begin{aligned} y_i\cong {} {{f}_{i}}({{x}_{i1}},{{x}_{i2}},\ldots ,{{x}_{in}})= \sum \limits _{k=1}^{N}{{{v}_{k}}}\sigma \left( \sum \limits _{j=1}^{n}{{{w}_{jk}}{{x}_{ij}}}+{{b}_{k}} \right) +{{b}_{0}}, \end{aligned}$$

with approximation errors \(\epsilon _i=y_i-f_i\) between the given responses \(y_i\) and the corresponding simulated values \(f_i\) that can be made arbitrarily small. N is the number of neurons, \({{w}_{jk}}\) and \({{b}_{k}}\) denote the local approximation weights and bias terms, respectively, of the linear summation conducted for each neuron k, and \({{v}_{k}},{{b}_{0}}\) correspond to the global approximation weights and bias term, respectively, of the linear summation over all neurons. \(\sigma\) is any sigmoid function, as presented below in Sect. 2.1.1.
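For concreteness, the following minimal sketch (in Python/NumPy, with hypothetical randomly generated weights; not the ANNbN training procedure itself, which is developed below) evaluates this single-hidden-layer representation for all observations at once.

```python
import numpy as np

def ann_forward(X, W, b, v, b0):
    """Evaluate f_i = sum_k v_k * sigma(sum_j w_jk * x_ij + b_k) + b0 for every row of X.

    X : (m, n) inputs, W : (n, N) local weights, b : (N,) biases,
    v : (N,) global weights, b0 : scalar global bias.
    """
    sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # logistic sigmoid
    H = sigma(X @ W + b)                         # (m, N) neuron outputs
    return H @ v + b0                            # (m,) approximations f_i

# toy usage with random (hypothetical) weights
rng = np.random.default_rng(0)
X = rng.random((5, 3))                                  # m = 5 observations, n = 3 features
W, b = rng.normal(size=(3, 4)), rng.normal(size=4)      # N = 4 neurons
v, b0 = rng.normal(size=4), 0.1
print(ann_forward(X, W, b, v, b0))
```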

The suggested ANNbN (Artificial Neural Networks by Neighborhoods) method is based on the segmentation of a given data set into smaller clusters of data, so that each cluster k is representative of the local neighborhood of the \(y_{ik}\) responses; subsequently, the weights \({{w}_{jk}}\) calculated for each cluster are used to derive the global weights v of the overall approximation. To determine the neighborhoods (i.e. the proximity clusters) of the response observations \(y_{i}\), we use the well-known k-means clustering algorithm (see e.g. MacQueen et al. 1967; Hartigan and Wong 1979) and k-means++ for the initial seed (Arthur and Vassilvitskii 2007). Any other clustering algorithm can be utilized as well, and by fixing the initial seed, the obtained results are always reproducible. It is worth mentioning that the method works well even without clustering the data; however, clustering increases the accuracy and is more compatible with a strict implementation of the Universal Approximation Theorem, as we present in Sect. 3.1. Clustering adds significant computational load, especially for large data sets; however, as presented in Table 1 for the MNIST dataset, ANNbN yields competitive results even without clustering.

Table 1 Computer vision (MNIST)

2.1 Basic formulation for shallow networks

Figure 1 illustrates the calculation process for the ANNbN weights. Contrary to the regular ANN approach where all responses \(y_{i}\) are treated in a single step as a whole, the ANNbN method first splits the responses into proximity clusters, calculates the weights \(w_{jk}\) in each cluster k using the responses \(y_{ik}\) and corresponding input data \(x_{ijk}\), and subsequently uses the derived weights \({{w}_{jk}}\) for each cluster to calculate the global weights \(v_k, b_0\) of the overall approximation. The aforementioned two-step approach is detailed in Sects. 2.1.1 and 2.1.2 below.

Fig. 1

Illustration of the numerical procedure to calculate ANNbN local and global weights: Initial calculation of local weights \(w_{jk}\) for each neuron k (left panel), and subsequent calculation of the global weights \(v_k, b_0\) of the entire network (right panel)

2.1.1 Calculation of \(w_{jk}\) and \(b_k\) in the kth cluster

Let \(m_k\) be the number of observations in the kth cluster, with \(\sum _{k=1}^N m_k=m\), and \(\sigma\) the sigmoid function, which may be selected among a variety of sigmoids, such as \(\sigma (x)={\frac{1}{1+e^{-x}}}\), with the inverse sigmoid \(\sigma ^{-1}\) being \(\sigma ^{-1}(y)=\log \left( {\frac{y}{1-y}}\right)\).

By defining \([n] {:}{=}\{1,2, \ldots , n\}\) as the iterator over features, \([m] {:}{=}\{1,2, \ldots , m\}\) as the iterator over samples, \([m_k] {:}{=}\{1,2, \ldots , m_k\}\) as the local sample indices, and \({\textbf{X}}_k\) as the \(m_k \times n\) matrix

$$\begin{aligned} {\textbf{X}}_k {:}{=}(x_{ijk})_{i \in [m_k], j \in [n]}, \end{aligned}$$

\({\textbf{w}}_k {:}{=}\{w_{1,k},w_{2,k}, \ldots , w_{n,k},b_{k}\}^{T}\) the weights’ vector, with \(A^{T}\) denoting the transpose of matrix or vector A, and \({\textbf{y}}_k {:}{=}\{y_{1,k},y_{2,k}, \ldots , y_{m_k,k}\}^{T}\) the target values found in each cluster, we may write

$$\begin{aligned} \sigma \odot \Big (\big ({\textbf{X}}_k \Big |{\textbf{1}}\big ) \times {\textbf{w}}_k\Big ) = {\textbf{y}}_k, \end{aligned}$$

where \(\big ({\textbf{X}}_k \Big | {\textbf{1}}\big )\) denotes the matrix \({\textbf{X}}_k\) with a column of 1s appended. The symbol \(\odot\) implies the element-wise application of \(\sigma\) on \(\big ({\textbf{X}}_k \Big | {\textbf{1}}\big ) \times {\textbf{w}}_k\), and the symbol \(\times\) denotes the matrix product. By applying the inverse sigmoid function \(\sigma ^{-1}\) to both sides, we deduce that

$$\begin{aligned} \big ({\textbf{X}}_k \Big | {\textbf{1}}\big ) \times {\textbf{w}}_k = \sigma ^{-1} \odot ({\textbf{y}}_k ) \end{aligned}$$
(1)

Hence, because the dimensions of \({\textbf{X}}_k\) are small \((m_k\ll m)\), we may rapidly calculate the approximation weights \({\textbf{w}}_k\) (Fig. 1 left) in the kth cluster (corresponding to the kth neuron) by solving the linear system

$$\begin{aligned} \big ({\textbf{X}}_k \Big | {\textbf{1}}\big ) \times {\textbf{w}}_k= {\hat{{\textbf{y}}}}_k, \end{aligned}$$
(2)

with respect to \({\textbf{w}}_k\), with \({\hat{{\textbf{y}}}}_k {:}{=}\sigma ^{-1} \odot {\textbf{y}}_k\).

In cases with \(m_k = n+1\) and linearly independent columns of \({\textbf{X}}_k\), we may solve Eq. 2 with Gaussian Elimination, while when \({\textbf{X}}_k\) is not of full column rank, least squares or generalized inversion may be applied (Marlow 1993; Chapra et al. 2010).
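As a minimal sketch of this local step (in Python/NumPy; the responses are assumed to have been scaled into (0, 1), e.g. [0.1, 0.9] as in Sect. 3.1, so that the inverse logistic sigmoid is well defined):

```python
import numpy as np

def local_weights(Xk, yk):
    """Solve (X_k | 1) w_k = sigma^{-1}(y_k) for one cluster (Eq. 2).

    Xk : (m_k, n) cluster inputs, yk : (m_k,) responses in (0, 1).
    Returns w_k of length n + 1, whose last entry is the bias b_k.
    """
    A = np.hstack([Xk, np.ones((Xk.shape[0], 1))])   # append the column of 1s
    y_hat = np.log(yk / (1.0 - yk))                  # inverse logistic sigmoid
    # least squares covers both the square case m_k = n + 1 and rank-deficient X_k
    wk, *_ = np.linalg.lstsq(A, y_hat, rcond=None)
    return wk
```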

2.1.2 Calculation of \(v_k\) and \(b_0\) exploiting all the given observations

Following the computation of the weights \({\textbf{w}}_k\), for each neuron k in the hidden layer, and concatenating the weights’ vectors \({\textbf{w}}_k\), we obtain the matrix of the weights for all neurons N, \({\textbf{w}} {:}{=}[{\textbf{w}}_1 \, {\textbf{w}}_2 \, \cdots \, {\textbf{w}}_N]\). Let

$$\begin{aligned} {\hat{{\textbf{X}}}} {:}{=}\bigg [ \sigma \odot \Big (\big ({\textbf{X}} \Big | {\textbf{1}}\big )\times {\textbf{w}} \Big ) \bigg | {\textbf{1}}\bigg ], \end{aligned}$$
(3)

where \({\textbf{X}}\) corresponds to the entire sample, in contrast to the previous step, which utilized \({\textbf{X}}_k\) containing only the observations in cluster k. Hence, in order to compute the weights of the output layer \({\textbf{v}} {:}{=}\{v_1,v_2, \ldots , v_{N}, b_0 \}^{T}\), for all the neurons connected to the output layer, we solve the linear system

$$\begin{aligned} {\hat{{\textbf{X}}}} \times {\textbf{v}} = {\textbf{y}}, \end{aligned}$$
(4)

for \({\textbf{v}}\), where \({\textbf{y}}=\{y_1, y_2, \ldots , y_m\}^{T}\) is the entire vector of observations.

In the numerical experiments, the local approximation weight vectors \({\textbf{w}}_k\) are distinct, while the number of neurons is usually smaller than the number of observations (\(N<m\)); hence, one can obtain the entire representation of the ANNbN by solving:

$$\begin{aligned} {\textbf{v}}=({\hat{{\textbf{X}}}}^{T} {\hat{{\textbf{X}}}})^{-1} {\hat{{\textbf{X}}}}^{T}{\textbf{y}}, \end{aligned}$$
(5)

As with the linear systems in Eq. 2, if \({\hat{{\textbf{X}}}}\) is not of full column rank, we may use least squares or other solvers for dense, tall matrices. The ANNbN approximation scheme is concisely described in Algorithm 1 below.

Algorithm 1 The ANNbN approximation scheme
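The following sketch (in Python with NumPy and scikit-learn's KMeans; a simplified illustration under the above assumptions, not the reference ANNbN.jl implementation) puts the two steps together for a shallow network with the logistic sigmoid:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_annbn(X, y, N, seed=0):
    """Shallow ANNbN sketch: local weights per cluster (Eq. 2), then global weights (Eqs. 4-5).

    X : (m, n) inputs, y : (m,) responses scaled into (0, 1), N : number of neurons/clusters.
    """
    m, n = X.shape
    labels = KMeans(n_clusters=N, random_state=seed, n_init=10).fit_predict(X)
    W = np.zeros((n + 1, N))                     # column k holds w_k (last entry is b_k)
    for k in range(N):
        Xk, yk = X[labels == k], y[labels == k]
        A = np.hstack([Xk, np.ones((len(yk), 1))])
        W[:, k], *_ = np.linalg.lstsq(A, np.log(yk / (1.0 - yk)), rcond=None)
    H = 1.0 / (1.0 + np.exp(-np.hstack([X, np.ones((m, 1))]) @ W))   # Eq. 3, all samples
    Xhat = np.hstack([H, np.ones((m, 1))])
    v, *_ = np.linalg.lstsq(Xhat, y, rcond=None)                     # Eqs. 4-5
    return W, v

def predict_annbn(X, W, v):
    H = 1.0 / (1.0 + np.exp(-np.hstack([X, np.ones((len(X), 1))]) @ W))
    return np.hstack([H, np.ones((len(X), 1))]) @ v
```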

2.2 ANNbN with radial basis functions as kernels

In what follows, the method is further expanded by using Radial Basis Functions (RBFs) for the approximation, \(\varphi (r)\), which depend on the distances r among the observations, instead of their raw values (Fig. 2). The operation is conducted in the identified clusters of data, as per Sect. 2.1, instead of the entire sample. A variety of studies exist on the approximation efficiency of RBFs (Yiotis and Katsikadelis 2015; Babouskos and Katsikadelis 2015); however, they refer to noiseless data and to the entire sample, instead of neighborhoods. We should also distinguish this approach of RBFs implemented as ANNbN from Radial Basis Function Networks (Schwenker et al. 2001; Park and Sandberg 1991), with \(\varphi ({{\textbf{x}}})=\sum _{{i=1}}^{N}a_{i}\varphi (||{\textbf{x}}-{{\textbf{c}}}_{i}||)\), where the centers \({{\textbf{c}}}_{i}\) are the clusters' means instead of collocation points, N is the number of neurons, and the \(a_{i}\) are calculated by training instead of matrix manipulation. In the proposed formulation, the representation regards the pairwise distances \(r_{ijk}\) (Fig. 2) between the observations \({\textbf{x}}_{ik}=\{x_{i1k},x_{i2k},\ldots ,x_{ink}\}\), with \({i} \in \{1,2,\ldots ,m_k\}\), and \({\textbf{x}}_{jk}=\{x_{j1k},x_{j2k},\ldots ,x_{jnk}\}\), with \({j} \in \{1,2,\ldots ,m_k\}\), in cluster k with n features (dimensions). For each cluster \(k \in [N]\), we define

$$\begin{aligned} \pmb \varphi _k {:}{=}\big (\varphi (r_{ijk})\big )_{i \in [m_k], j \in [m_k]}, \end{aligned}$$

while, when using RBFs, the local weights of cluster k do not comprise a bias term; hence,

$$\begin{aligned} {\textbf{w}}_k {:}{=}\{w_{1,k},w_{2,k}, \ldots , w_{m_k,k}\}^{T}. \end{aligned}$$
Fig. 2

kth cluster of Radial ANNbN

In most cases matrix \(\pmb \varphi _k\) is invertible (see below for further elaboration), thus, one may approximate the responses in the kth cluster \(y_{ik}\), as

$$\begin{aligned} \pmb \varphi _k \times {\textbf{w}}_k = {\textbf{y}}_k, \end{aligned}$$
(6)

and compute \({\textbf{w}}_k\), by

$$\begin{aligned} {\textbf{w}}_k=\pmb \varphi ^{-1}_k {\textbf{y}}_k. \end{aligned}$$

The elements \(\varphi _{ijk}=\varphi (\Vert {\textbf{x}}_{jk}-\textbf{x}_{ik}\Vert )\) of the symmetric matrix \(\pmb \varphi _k\) denote the application of the function \(\varphi\) to the pairwise Euclidean distances (or norms) of the observations in the kth cluster. Note that the vector \({\textbf{w}}_{k}\) has length \(m_k\) for each cluster k, instead of \(n+1\) for the sigmoid approach. Afterwards, similar to the case of the sigmoid kernels in the previous section (Eqs. 4–5), and by using

$$\begin{aligned} {\textbf{w}}&{:}{=}[{\textbf{w}}_1 \, {\textbf{w}}_2 \, \cdots \, {\textbf{w}}_N],\\ \pmb \varphi&{:}{=}[\pmb \varphi _1 \, \pmb \varphi _2 \, \cdots \, \pmb \varphi _N], \end{aligned}$$

one can obtain the entire representation for all clusters, for the weights of the output layer \({\textbf{v}}\), by solving

$$\begin{aligned} \big [\pmb \varphi \times {\textbf{w}} \,\big |\, {\textbf{1}}\big ] \times {\textbf{v}} = {\textbf{y}}. \end{aligned}$$
(7)

The rows of matrices \(\pmb \varphi _1, \pmb \varphi _2, \ldots\) contain the observations of the entire sample, while the columns contain the collocation points found in each cluster. For calculation of the weights, we use \(\varphi _{ijk}=\varphi (\Vert {\textbf{x}}_{jk}-\textbf{x}_{ik}\Vert )\). After computing the \({\textbf{w}}_k\) and \({\textbf{v}}\), one may interpolate for any new \({\textbf{x}}\) (out-of-sample), using

$$\begin{aligned} \varphi _{jk}({\textbf{x}})=\varphi (\Vert {\textbf{x}}_{jk}-{\textbf{x}}\Vert ), \end{aligned}$$

where \({\textbf{x}}_{jk}\) are the RBF collocation points for the approximation, same as in Eq. 6. Hence, we may predict for out-of-sample observations by using

$$\begin{aligned} f({\textbf{x}})=\sum ^N_{k=1}\left( \sum ^{m_k}_{j=1} w_{jk} \varphi _{jk}({\textbf{x}})\right) v_k + b_0. \end{aligned}$$
(8)

As the weights \({\textbf{w}}_k\) are applied directly by multiplication, the calculation of the inverse function \(\varphi ^{-1}\) (corresponding to \(\sigma ^{-1}\) in Eq. 1) is not needed. This results in a convenient formulation for the approximation of derivatives, as well as for the solution of PDEs.

In case the matrix \(\pmb \varphi _k\) turns out to be singular [see the Mairhuber–Curtis theorem (Mairhuber 1956)], one should select an alternative kernel for the data under consideration, in order to obtain an invertible \(\pmb \varphi _k\), as well as to increase accuracy (Fasshauer and Zhang 2007; Fasshauer and McCourt 2012). Some examples of radial basis kernels are the Gaussian \(\varphi (r)=e^{-r^2/c^2}\), the Multiquadric \(\varphi (r)={\sqrt{1+(c r)^{2}}}\), etc., where \(r=\Vert {\textbf{x}}_{j}-{\textbf{x}}_{i}\Vert\), and the shape parameter c controls the width of the function. c may take a specific value or be optimized, to attain higher accuracy for the particular dataset studied. Similar to the sigmoid functions, after the computation of \({\textbf{w}}_k\), we use Eq. 7 to compute \({\textbf{v}}\) and obtain the entire representation.
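A minimal sketch of the local RBF step, with a Gaussian kernel and the cluster's own observations as collocation points (Python/NumPy/SciPy; the shape parameter value is an arbitrary illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_rbf(r, c=1.0):
    return np.exp(-(r / c) ** 2)

def rbf_local_weights(Xk, yk, c=1.0):
    """Solve phi_k w_k = y_k (Eq. 6) for one cluster; Xk are both samples and collocation points."""
    Phi_k = gaussian_rbf(cdist(Xk, Xk), c)   # symmetric m_k x m_k kernel matrix
    # if Phi_k turns out singular, switch kernel or fall back to least squares, as noted above
    return np.linalg.solve(Phi_k, yk)

def rbf_local_predict(X_new, Xk, wk, c=1.0):
    """Local contribution of cluster k at new points, to be combined via v as in Eqs. 7-8."""
    return gaussian_rbf(cdist(X_new, Xk), c) @ wk
```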

2.3 ANNbN for the approximation of derivatives

Equation 8 offers an approximation to the sought solution, by applying algebraic operations on \(\varphi _j (r)\), where

$$\begin{aligned} r=\Vert {\textbf{x}}_{j}-{\textbf{x}}\Vert =\sqrt{\sum ^{n}_{p=1}{(x_{jp}-x_p)^2}}, \end{aligned}$$
(9)

and \(\varphi _j\) is a differentiable function with respect to any out-of-sample \({\textbf{x}}\), considering the n-dimensional collocation points \({\textbf{x}}_{j}\) as constants.

Accordingly, one may compute any higher-order derivative of the approximated function, by utilizing Eq. 8, simply differentiating the kernel \(\varphi\), and multiplying by the computed weights \({\textbf{w}}_k=(w_{jk})\), for all \(\textbf{x}_{j}\). In particular, we may approximate the lth derivative with respect to the pth dimension, at the location of the ith observation, by

$$\begin{aligned} \frac{ {\partial }^l f_{i}}{\partial {x^l_{ip}}}=\begin{pmatrix} \frac{ {\partial }^l \varphi _{{i}{1}}}{\partial {x^l_{ip}}} \quad \frac{ {\partial }^l \varphi _{{i}{2}}}{\partial {x^l_{ip}}} \quad \cdots \quad \frac{ {\partial }^l \varphi _{{i}{m_k}}}{\partial {x^l_{ip}}} \end{pmatrix}\mathbf {w_k}, \end{aligned}$$
(10)

where

$$\begin{aligned} \varphi _{{i}{j}}=\varphi _j({\textbf{x}}_i)=\varphi (\Vert \textbf{x}_{jk}-{\textbf{x}}_i\Vert ), \end{aligned}$$

and

$$\begin{aligned} \frac{ {\partial } \varphi _{{i}{j}}}{\partial {x_{ip}}}= \frac{ {\partial } \varphi _{{i}{j}}}{\partial {r_{{i}{j}}}} \frac{ {\partial } r_{{i}{j}}}{\partial {x_{ip}}}, \end{aligned}$$
(11)

where \({\textbf{x}}_{jk}\) denote the collocation points of cluster k, and \({\textbf{x}}_i\) the points where \(f_i=f({\textbf{x}}_i)\) is computed. Since the vector \({\textbf{v}}\) applies by multiplication and summation to all N clusters (Eq. 8), one may obtain the entire approximation for each partial derivative, by differentiating \(\varphi\) and applying all \({\textbf{w}}_k\) and the vector \({\textbf{v}}\) to \(\frac{ {\partial }^l \varphi _{ij}}{\partial {x^l_{ip}}}\). The weights remain the same for the function and its derivatives. We should underline that the differentiation in Eq. 11 holds for any dimension \(p\in \left\{ 1,2,\ldots ,n \right\}\) of \({\textbf{x}}_i\); hence, with the same formulation, we derive the partial derivatives with respect to any variable, in a concise setting.

For example, to approximate a function \(f(x_1,x_2)\) and later compute its partial derivatives with respect to \(x_1\), one can utilize the collocation points \({\textbf{x}}_{j}\) and write the squared distance

$$\begin{aligned} r=\left( {\textbf{x}}_{j1} - x_{1}\right) ^{2} + \left( {\textbf{x}}_{j2} - x_{2}\right) ^{2}, \end{aligned}$$

and using as kernel

$$\begin{aligned} \varphi (x_1,x_2)=-\frac{r^{4}}{4}, \end{aligned}$$

one obtains

$$\begin{aligned} \frac{ {\partial } \varphi (x_1,x_2)}{\partial {x_{1}}} =2 \left( {\textbf{x}}_{j1} - x_{1}\right) r^{3}, \end{aligned}$$

and hence

$$\begin{aligned} \frac{ {\partial }^2 \varphi (x_1,x_2)}{\partial {x^2_{1}}} =-2\left( {\textbf{x}}_{j1} - x_{1}\right) 6 \left( {\textbf{x}}_{j1} - x_{1}\right) r^{2} - 2 r^{3}. \end{aligned}$$

The variable \(x_1\) may take values from the collocation points or any other intermediate point, after the weights' calculation, in order to produce predictions for out-of-sample observations. In practice, one may select among the RBFs available in the literature, try new ones, or optimize their shape parameter c. In Appendix I, we also provide a simple computer code for the symbolic differentiation of any selected RBF, using the SymPy (Meurer et al. 2017) package.
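A minimal sketch of such symbolic differentiation, applied to the example kernel above (an illustration written here, not the Appendix I code itself):

```python
import sympy as sp

x1, x2, xj1, xj2 = sp.symbols('x1 x2 xj1 xj2', real=True)

r = (xj1 - x1) ** 2 + (xj2 - x2) ** 2     # squared distance to a collocation point
phi = -r ** 4 / 4                         # the example kernel of this section

dphi_dx1 = sp.diff(phi, x1)               # first partial derivative with respect to x1
d2phi_dx1 = sp.diff(phi, x1, 2)           # second partial derivative
print(sp.simplify(dphi_dx1))
print(sp.simplify(d2phi_dx1))

# turn the symbolic derivative into a fast numerical function for the weighted sums of Eq. 10
d2_fun = sp.lambdify((x1, x2, xj1, xj2), d2phi_dx1, 'numpy')
```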

Of particular interest are the Integrated RBFs (IRBFs) (Bakas 2019; Babouskos and Katsikadelis 2015; Yiotis and Katsikadelis 2015), which are formulated by indefinite integration of the kernel, such that their derivative is the RBF \(\varphi\). Accordingly, we may integrate the kernel more than once, to approximate higher-order derivatives. For example, by utilizing \(\text{erf}(x)=\frac{1}{\sqrt{\pi }}\int _{-x}^{x}{{{e}^{-{{t}^{2}}}}dt}\), and the twice-integrated Gaussian RBF for \(\varphi\) at the collocation points \(x_j\),

$$\begin{aligned} {{\varphi }_{j}}(x) =\frac{{{\text{c}}^{2}}{{\text{e}}^{\frac{-{{(x-{{x}_{j}})}^{2}}}{{{c}^{2}}}}}+\text{c}\sqrt{\pi }(x-{{x}_{j}})\,\text{erf}\frac{(x - {{x}_{j}})}{c}}{2}, \end{aligned}$$

we deduce that

$$\begin{aligned} \frac{ {d} \varphi _{{j}}}{d {x}}=\frac{\text{c}\sqrt{\pi }\,\text{erf}\frac{(x - {{x}_{j}})}{c}}{2}, \end{aligned}$$

and hence

$$\begin{aligned} \frac{ {d}^2 \varphi _{{j}}}{d {x^2}}={{e}^{-\frac{{{(x-x_{j})}^{2}}}{{{c}^{2}}}}}, \end{aligned}$$

which is the Gaussian RBF, approximating the second derivative \(\ddot{f}(x)\), instead of f(x).
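This relation can be verified symbolically; a short sketch with SymPy (assuming a positive shape parameter c):

```python
import sympy as sp

x, xj = sp.symbols('x x_j', real=True)
c = sp.symbols('c', positive=True)

# twice-integrated Gaussian IRBF from the text
phi = (c**2 * sp.exp(-(x - xj)**2 / c**2)
       + c * sp.sqrt(sp.pi) * (x - xj) * sp.erf((x - xj) / c)) / 2

print(sp.simplify(sp.diff(phi, x)))      # c*sqrt(pi)*erf((x - x_j)/c)/2
print(sp.simplify(sp.diff(phi, x, 2)))   # exp(-(x - x_j)**2/c**2), the Gaussian RBF
```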

2.4 ANNbN for the solution of partial differential equations

Similar to the numerical differentiation, we may easily apply the proposed scheme to numerically approximate the solution of Partial Differential Equations (PDEs). We consider a generic Differential operator

$$\begin{aligned} T=\sum _{l=1}^{p}g_{l}({\textbf{x}})D^{l}, \end{aligned}$$

depending on the \(D^{l}\) partial derivatives of the sought solution f, for some coefficient functions \(g_{l}({\textbf{x}})\), which satisfy

$$\begin{aligned} Tf=h, \end{aligned}$$

where h may be any function in the form of \(h(x_1,x_2,\ldots ,x_n)\). We may approximate f by

$$\begin{aligned} f=\sum \limits _{j=1}^{n}{{{w}_{j}}} \varphi _j({{x}}) \end{aligned}$$
(12)

By utilizing Eq. 10, we obtain a system of linear equations. Hence, the weights \(w_{jk}\) may be calculated by solving the resulting system, as per Eq. 6.

For example, consider the following generic form of the Laplace equation

$$\begin{aligned} \nabla ^{2}f&=h, \nonumber \\ \frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}+\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}&=h(x,y). \end{aligned}$$
(13)

The weights \({w}_{j}\) in Eq. 12 are constant, hence the differentiation regards only function \(\varphi\). Thus, by using

$$\begin{aligned} \mathbf {D^2}\pmb \varphi _k {:}{=}\Big (\frac{{{\partial }^{2}}\varphi _{i,j}}{\partial {{x}^{2}}}+ \frac{{{\partial }^{2}}\varphi _{i,j}}{\partial {{y}^{2}}}\Big )_{i \in [m_k], j \in [n]} \end{aligned}$$

and writing Eq. 13 for all \(h_{ik}=h(\textbf{x}_{ik})=y_{ik}\) found in cluster k, we obtain

$$\begin{aligned} \mathbf {D^2}\pmb \varphi _k \times {\textbf{w}}_k = {\textbf{y}}_k. \end{aligned}$$
(14)

Because the weights \(w_{jk}\) are the same for the approximated function and its derivatives, we may apply some boundary conditions for the function or its derivatives \(D^l\), at some boundary points \([b] {:}{=}\{1,2,\ldots ,m_b\}\),

$$\begin{aligned} \frac{{{\partial }^{l}}f({\textbf{x}}_b)}{\partial {{x}^{l}_p}}=y_b \end{aligned}$$

by defining

$$\begin{aligned} {\mathbf {D^l}\pmb \varphi _k} {:}{=}\Big (\frac{{{\partial }^{l}}\varphi _{ij}}{\partial {{x}^{l}_p}}\Big )_{i \in [m_b], j \in [b]} \end{aligned}$$

and using

$$\begin{aligned} {\mathbf {D^l}\pmb \varphi _k} \times {\textbf{w}}_k = {\textbf{y}}_b \end{aligned}$$
(15)

Hence, we may compute \({\textbf{w}}_k\) by solving the resulting system of equations

$$\begin{aligned} \begin{pmatrix} \mathbf {D^2}\pmb \varphi _k \\ \mathbf {D^l}\pmb \varphi _k \\ \end{pmatrix} {\textbf{w}}_{k} = \begin{pmatrix} {\textbf{y}}_k \\ {\textbf{y}}_b \\ \end{pmatrix}, \end{aligned}$$
(16)

similar to Eq. 6 for cluster k. Afterwards, we may obtain the entire representation for all clusters, by using Eq. 7 for the computation of \({\textbf{v}}\). Finally, we obtain the sought solution by applying the computed weights \({\textbf{w}},{\textbf{v}}\) in Eq. 8, for any new \({\textbf{x}}\).
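As a minimal single-cluster sketch of assembling the stacked system of Eq. 16 for the above Laplace/Poisson example, using a Gaussian kernel and its analytic Laplacian (the collocation strategy and the shape parameter value are assumptions made for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian(r, c):
    return np.exp(-(r / c) ** 2)

def gaussian_laplacian(r, c):
    # (d^2/dx^2 + d^2/dy^2) of exp(-r^2/c^2), with r the Euclidean distance in 2D
    return (4.0 * r**2 / c**4 - 4.0 / c**2) * gaussian(r, c)

def solve_poisson_cluster(X_in, h_in, X_bc, f_bc, c=0.5):
    """Interior rows impose the PDE (Eq. 14), boundary rows impose Dirichlet values (Eq. 15);
    the interior points of the cluster serve as collocation points."""
    A = np.vstack([gaussian_laplacian(cdist(X_in, X_in), c),   # D^2 phi_k
                   gaussian(cdist(X_bc, X_in), c)])            # boundary conditions on f
    rhs = np.concatenate([h_in, f_bc])
    wk, *_ = np.linalg.lstsq(A, rhs, rcond=None)               # stacked system of Eq. 16
    return wk   # f is then approximated by gaussian(cdist(X_new, X_in), c) @ wk
```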

2.5 Deep networks

A method for the transformation of shallow ANNbNs to Deep Networks is also presented. Although Shallow Networks exhibit very high accuracy even for unstructured and complex data sets, Deep ANNbNs may be utilized for research purposes, for example at the intersection of neuroscience and artificial intelligence. After the calculation of the weights \(w_{jk}\) for the first layer, we use them to create a second layer (Fig. 3a), where each node corresponds to the given \(y_i\). We then use the same procedure for each neuron k of layer \([l] {:}{=}\left\{ 2,3,\ldots ,L \right\}\), by solving:

$$\begin{aligned} \bigg [ \sigma \odot \Big (\big ({\textbf{X}} \Big |{\textbf{1}}\big ) \times {\textbf{w}}_l\Big ) \bigg ] \times {\textbf{v}}_l = \sigma ^{-1} \odot {\textbf{y}} \end{aligned}$$
(17)

with respect to \({\textbf{v}}_l\). For each layer l, \({\textbf{w}}_l\) corresponds to \({\textbf{w}}\) of Eq. 3. We may arbitrarily select any number of neurons within each layer l, without additional computational cost, as the solution \({\textbf{v}}_l\) of Eq. 17 depends only on the weights \({\textbf{w}}_l\), corresponding to the previously computed layer. We should note that although Eq. 17 is similar to Eqs. 3 and 4, we now use both the sigmoid (left-hand side) and the inverse sigmoid (right-hand side) functions. This procedure is iterated for all neurons k of layer l. The matrix \({\textbf{w}}_l\) corresponds to the weights of layer \(l-1\). Finally, for the output layer, we calculate the linear weights \(v_k\), as per Eq. 4. This procedure results in a good initialization of the weights, close to the optimal solution, and if we normalize \(y_i\) in a range close to the linear part of the sigmoid function \(\sigma\) (say [0.4, 0.6]), we rapidly obtain a deep network with errors approximately equal to those of the shallow one. Afterwards, any optimization method may be supplementarily applied to compute the final weights; however, the accuracy is already very high.

Fig. 3

Transformation of the basic Numerical Scheme

Alternatively, we may utilize the layer \({\hat{{\textbf{X}}}}\) obtained from the shallow implementation of ANNbN (see Eq. 4) as input \(x_{ij}\) for a second layer (without computing \({\textbf{v}}\)), then for a third, and sequentially up to any desired number of layers.
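A minimal sketch of this stacking alternative (the `train_layer` callback is a hypothetical stand-in for the shallow procedure of Sect. 2.1, assumed to return the layer weights and the sigmoid activations of all samples):

```python
import numpy as np

def stack_layers(X, y, train_layer, n_layers=3):
    """Feed the hidden activations of each shallow ANNbN forward as inputs of the next layer."""
    inputs, weights = X, []
    for _ in range(n_layers):
        W_l, H_l = train_layer(inputs, y)   # e.g. the shallow two-step procedure of Sect. 2.1
        weights.append(W_l)
        inputs = H_l                        # activations become the next layer's features
    # final linear output layer, solved by least squares as in Eq. 4
    Xhat = np.hstack([inputs, np.ones((len(inputs), 1))])
    v, *_ = np.linalg.lstsq(Xhat, y, rcond=None)
    return weights, v
```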

2.6 Generalization in terms of ensembles

A generalisation of the approach presented in Sects. 2.1–2.5 can be obtained by averaging the results of different ANNbN models, each one fitted to a different portion of the data (Dietterich et al. 2002; Li et al. 2016). This can be done by randomly sub-sampling a percentage \(\alpha \%\) of the observations, running the ANNbN algorithm multiple times \(i_f\in \left\{ 1,2,\ldots ,n_f \right\}\), and averaging the results with weights inversely proportional to the errors \(\epsilon _{i_f}\) over all \(n_f\) folds:

$$\begin{aligned} y_i=\frac{\sum ^{n_f}_{i_f=1} y_{i,i_f}\frac{1}{\epsilon _{i_f}}}{\sum ^{n_f}_{i_f=1} \frac{1}{\epsilon _{i_f}}}, \end{aligned}$$

and using \(y_i\) to constitute an Ensemble of ANNbN models (Fig. 3b). Ensembles of ANNbNs exhibit increased accuracy and generalization properties for noisy data, as per the following Numerical Experiments.
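A minimal sketch of the error-weighted averaging (Python/NumPy; the individual model predictions and their training errors are assumed to be available):

```python
import numpy as np

def ensemble_predict(predictions, errors):
    """Average n_f ANNbN predictions, each weighted by 1/epsilon_{i_f} as in the scheme above.

    predictions : (n_f, m) array of per-model predictions, errors : (n_f,) positive errors.
    """
    w = 1.0 / np.asarray(errors, dtype=float)
    return (w[:, None] * np.asarray(predictions)).sum(axis=0) / w.sum()

# usage sketch with three hypothetical fitted models:
# y_hat = ensemble_predict([pred_1, pred_2, pred_3], [eps_1, eps_2, eps_3])
```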

2.7 Time complexity of the ANNbN algorithm

The training of an ANN with two layers and only three nodes has been proven to be NP-complete (Blum and Rivest 1989), if the nodes compute linear threshold functions of their inputs. Even simple cases, and approximating hypotheses, result in NP-complete problems (Engel 2001). Apart from the theoretical point of view, the slow speed of learning algorithms is a major flaw of ANNs. On the contrary, ANNbNs are fast, because the main part of the approximation regards operations with small square matrices of size \((n+1)\times (n+1)\), with n being the number of features. Below we provide a theoretical investigation of the ANNbNs' time complexity, which has been empirically validated by running the supplementary code.

Statement 1: ANNbN Training is conducted in three distinct steps: a) Clustering, b) Solution of linear systems of equations with small-sized matrices \({\textbf{X}}_k\) (Eq. 1) for the calculation of \(w_{jk}\) weights, and c) Calculation of \(v_k\) weights (Eq. 5).

Statement 2: Let m be the number of observations, and n the number of features. We assume that \(m\gg n\) (e.g. \(m>10n\), as usual for regression problems), which corresponds to the case where there are more samples than features. We note that the number of clusters N is also equal to the number of neurons (see Eq. 1, and Fig. 1).

Lemma 1

Time complexity of step (a) is \({\mathcal {O}}(mNni_{cl})\).

Proof

The running time of Lloyd's algorithm (and most variants) is \({\mathcal {O}}(mNni_{cl})\) (Hartigan and Wong 1979; Manning et al. 2008), with \(i_{cl}\) denoting the number of iterations necessary to converge. We should note that, in the worst case, the complexity of Lloyd's algorithm is super-polynomial (Blömer et al. 2016; Arthur and Vassilvitskii 2006), with \(i_{cl}=2^{\varOmega ({\sqrt{m}})}\), and the same holds for the ANNbN algorithm. However, in practice, the algorithm converges for small \(i_{cl}\). For the case when the \(k-\)means\(++\) algorithm with \(D^2-\)weighting is used, the expected clustering error is within a factor of \({\mathcal {O}}(\log {N})\) of the optimal (Arthur and Vassilvitskii 2007). For Theorem 1, we use the more conservative approximation, that is, Lloyd's algorithm with complexity \({\mathcal {O}}(mNni_{cl})\). \(\hfill\square\)

Lemma 2

Time complexity of step (b) is \({\mathcal {O}}(Nmn^2+Nn^3)\)

Proof

Time complexity of step (b) regards the solution of linear systems with matrices of size \((n+1) \times (n+1)\) (Eq. 1). However, in the general case, the clustering algorithm may result in clusters of unequal sizes. Hence, one needs to solve a system with non-square matrices \({\textbf{X}}_k\) of size \(m_k \times n\) (Eq. 1). In the worst-case scenario, a cluster contains \(m-N+1\) samples, while each of the remaining clusters comprises a single sample. Accordingly, the complexity of step (b) regards the solution of a linear system with \({\textbf{X}}_k\) having dimensions \((m-N+1) \times n\). When solving this system, and assuming that \({\textbf{X}}_k\) is of full column rank, the multiplication \({\hat{{\textbf{X}}}}_k^{T}{\hat{{\textbf{X}}}_k}\) results in complexity \({\mathcal {O}}(nmn)={\mathcal {O}}(n^2m)\), as \(m>(m-N+1)\). Then, one proceeds to solve the corresponding square linear system with complexity \({\mathcal {O}}(n^3)\), as well as the multiplication of \(({\hat{{\textbf{X}}}_k}^{T}{\hat{{\textbf{X}}}_k})^{-1}\) with \({\hat{{\textbf{X}}}_k}^{T}\), resulting in complexity \({\mathcal {O}}(nnm)\), and of \(({\hat{{\textbf{X}}}_k}^{T}{\hat{{\textbf{X}}}_k})^{-1} {\hat{{\textbf{X}}}_k}^{T}\) with \(\sigma ^{-1} \odot ({\textbf{y}}_k )\), resulting in complexity \({\mathcal {O}}(nm)\). Thus, the total complexity is \({\mathcal {O}}(mn^2+n^3+mn^2+mn)\). This is repeated N times, hence the complexity is \({\mathcal {O}}(Nmn^2+Nn^3)\). \(\hfill\square\)

Lemma 3

Time complexity of step (c) is \({\mathcal {O}}(mN^2+N^3)\)

Proof

Step (c) regards the solution of an \(m\times N\) system of equations (Eq. 4). We assume that the linear systems to be solved in Eq. 5 are of full column rank. Hence, the complexity regards the multiplication \({\hat{{\textbf{X}}}}^{T}{\hat{{\textbf{X}}}}\) with complexity \({\mathcal {O}}(NmN)={\mathcal {O}}(N^2m)\), obtaining the solution of the corresponding linear system with complexity \({\mathcal {O}}(N^3)\), as well as the multiplication of \(({\hat{{\textbf{X}}}}^{T}{\hat{{\textbf{X}}}})^{-1}\) with \({\hat{{\textbf{X}}}}^{T}\), with complexity \({\mathcal {O}}(NNm)\), and of \(({\hat{{\textbf{X}}}}^{T}{\hat{{\textbf{X}}}})^{-1} {\hat{{\textbf{X}}}}^{T}\) with \({\textbf{y}}\), with complexity \({\mathcal {O}}(Nm)\). Thus, the total complexity is \({\mathcal {O}}(mN^2+N^3+mN^2+mN)={\mathcal {O}}(mN^2+N^3)\).

\(\hfill\square\)

Theorem 1

(ANNbN Complexity) The running time of ANNbN algorithm is \({\mathcal {O}}(mNni_{cl} + Nmn^2+Nn^3 + mN^2+N^3)\).

Proof

By considering the time complexity of each step (Lemmas 1–3), we deduce that the total computational complexity of the ANNbN algorithm is \({\mathcal {O}}(mNni_{cl} + Nmn^2+Nn^3 + mN^2+N^3)\). \(\hfill\square\)

In practice, the clustering algorithm converges after a few iterations \(i_{cl}\), and the computing time is a third-order polynomial in the number of neurons, second-order in the number of features, and linear in the number of samples. A comparison between the theoretical and experimental computing times is illustrated in Fig. 4.

Fig. 4

Experimental vs Theoretical Time of the ANNbN algorithm, when varying the samples-m (a), features-n (b), and neurons-N (c)

Particularly, Fig. 4 depicts the Experimental and Theoretical Computing Times for the case of Griewank Function. Figure 4a corresponds to \(n=10^2\) features, \(N=10\) neurons, and a variation of samples m ranging from \(10^3\) to \(10^4\) with step \(10^3\). Figure 4b regards \(m=10^4\) samples, \(N=10^2\) neurons and a variation of features n ranging from 10 to \(10^2\) with step 10. Finally, Fig. 4c corresponds to \(m=10^5\) samples, \(n=10\) features, and a variation of neurons N ranging from \(10^2\) to \(10^3\) with step \(10^2\). Each case has been run 10 times with random samples for input and the Average, Minimum and Maximum Experimental Times are reported. The Theoretical Times presented in Figs. 4a–c have been obtained by standardising the resulting complexity of Theorem 1, based on the average Experimental Times measured on Cyclone Facility (https://hpcf.cyi.ac.cy/), using a single node with 40 cores, for cases (a), (b) and (c), respectively.

Within the limits of this experiment, we notice a linear pattern (with slope less than 1) in the average Experimental Time in Fig. 4a, which is expected from Theorem 1 for the case when we vary m. In Fig. 4b, we may see a nonlinear growth of the experimental time with n. A second-order polynomial fit to the average experimental time results in an \(R^2=0.9942\), which is consistent with Theorem 1 as well. Finally, we would expect a third-order polynomial in the number of neurons N; however, the trend of the experimental time exhibits a sublinear stabilization pattern. This may be a result of the actual logarithmic complexity of the first step (clustering). Additionally, the BLAS operations, which run in parallel, may contribute to this improvement; however, a more analytical investigation should be performed in future research. It is worth noting that, in the case of small matrices \(\hat{{\textbf{X}}}_k\), least squares may be used instead of the generalized inversion assumed in Theorem 1, especially when one solves for many neurons with few samples per cluster, which may render the solution of the systems in Eq. 1 unstable.
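For reference, a small sketch of how the theoretical curves of Fig. 4 can be produced from Theorem 1 (the number of clustering iterations and the reference configuration used for scaling are assumptions for illustration):

```python
def annbn_ops(m, n, N, i_cl=10):
    """Operation count of Theorem 1, up to constant factors."""
    return m * N * n * i_cl + N * m * n**2 + N * n**3 + m * N**2 + N**3

def theoretical_time(m, n, N, t_ref, ref=(10_000, 100, 10)):
    """Scale the operation count to seconds using one measured time t_ref at the reference (m, n, N)."""
    return t_ref * annbn_ops(m, n, N) / annbn_ops(*ref)
```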

3 Validation results

3.1 1D function approximation and its geometric point of view

We consider a simple one-dimensional function f(x), with \(x\in {\textbf{R}}\), to present the basic functionality of ANNbNs. Because \({\sigma ^{-1}} (y)=\log \left( {\frac{y}{1-y}}\right)\) is unstable for \(y\rightarrow 0\) and \(y \rightarrow 1\), we normalize the responses in the domain [0.1, 0.9]. In Fig. 5, the approximation of \(f(x)=0.3\sin (e^{3x})+0.5\) is depicted, for a varying number of neurons utilized in the ANNbN. We may see that, by increasing the number of neurons from 2 to 8, the approximating ANNbN exhibits more curvature alterations. This complies with the Universal Approximation Theorem and offers a geometric point of view. Interestingly, the results are not affected by adding some random noise, \(\epsilon \sim {{\mathcal {U}}}(-\frac{1}{20},\frac{1}{20})\), as the Mean Absolute Error (MAE) for this noisy dataset was \(2.10\hbox{E}{-}2\) for the train set, and even smaller, \(1.48\hbox{E}{-}2\), for the test set, further indicating the capability of ANNbN to approximate the hidden signal and not the noise. We should note that, for noiseless data of 100 observations and 50 neurons, the MAE in the train set was \(6.82\hbox{E}{-}6\) and in the test set \(8.01\hbox{E}{-}6\). The approximation of the same function with a Gaussian RBF and shape parameter \(c=0.01\) results in an MAE of \(7.52\hbox{E}{-}8\) for the train set and \(1.07\hbox{E}{-}7\) for the test set.
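A short sketch of this setup (Python/NumPy): generating the test function and mapping the responses into [0.1, 0.9] (and back), so that \(\sigma^{-1}\) remains stable; the sample size is an assumption for illustration.

```python
import numpy as np

def scale_to_range(y, lo=0.1, hi=0.9):
    """Map responses into [lo, hi] so that sigma^{-1}(y) = log(y / (1 - y)) stays stable."""
    y_min, y_max = y.min(), y.max()
    return lo + (hi - lo) * (y - y_min) / (y_max - y_min), (y_min, y_max, lo, hi)

def scale_back(y_scaled, params):
    y_min, y_max, lo, hi = params
    return y_min + (y_scaled - lo) * (y_max - y_min) / (hi - lo)

# the 1D test function of this section, sampled at 100 points in [0, 1]
rng = np.random.default_rng(0)
x = np.sort(rng.random(100))
y = 0.3 * np.sin(np.exp(3 * x)) + 0.5
y_scaled, params = scale_to_range(y)
```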

Fig. 5

ANNbN with 2, 4 and 8 neurons, for the approximation of \(f(x)=0.3\sin (e^{3x}) + 0.5\)

3.2 Regression in \({\mathbb {R}}^n\)

We assess the performance of ANNbN by using four nonlinear functions with singularities and folding points, distorted with five cases of noise each (Fig. 6).

Fig. 6

Regression Errors for various Nonlinear Functions and Noise Distributions: 1: Uniform, 2: Normal, 3: Generalized Pareto, 4: Log-Normal, 5: Mixture

Particularly, the Gomez–Levy function

$$\begin{aligned} L(x_1,x_2)=4x_1^{2}-2.1x_1^{4}+{\frac{1}{3}}x_1^{6}+x_1x_2-4x_2^{2}+4x_2^{4} \end{aligned}$$

subject to \(-\sin (4\pi x_1)+2\sin ^{2}(2\pi x_2)\le 1.5\), in order to check irregularity at the boundaries, the polynomial of five variables,

$$\begin{aligned} P(\textbf{x})=-x_1+\frac{x_2^2}{2}-\frac{x_3^3}{3}+\frac{x_4^4}{4}-\frac{x_5^5}{5}, \end{aligned}$$

the Shekel function with \(m_a=10\) maxima in \(n=25\) dimensions, \({\textbf{c}}=(c_i)_{i \in [1,2,\ldots ,m_a]}\), \(c_i \sim {\mathcal {U}}(0,1)\), and \({\textbf{a}}=(a_{ij})_{i \in [1,2,\ldots ,m_a], j \in [1,2,\ldots ,n]}\), with \(a_{ij} \sim {{\mathcal {U}}}(-1/2,1/2)\)

$$\begin{aligned} S({{\textbf{x}}})=\sum _{i=1}^{m_a}\;\left( c_{i}+\sum \limits _{j=1}^{n}(x_{j}-a_{ji})^{2}\right) ^{-1} \end{aligned}$$

and the highly nonlinear Griewank function (Griewank 1981),

$$\begin{aligned} G({\textbf{x}})=1+{\frac{1}{4000}}\sum _{{i=1}}^{n}x_{i}^{2}-\prod _{{i=1}}^{n}\cos \left( {\frac{x_{i}}{{\sqrt{i}}}}\right) + \epsilon . \end{aligned}$$

In all cases we use \(m=10^4\) observations for the train as well as the test sets, while we add noise to the train set only, in order to assess the capability of the method to approximate the signal and not the noise. In particular, we use the following cases: (1) Uniform, (2) Normal, (3) Generalized Pareto, (4) Log-Normal, and (5) a Mixture of Log-Normal, Exponential, and Frechet distributions (by sub-sampling from each and concatenating the samples), in order to investigate a variety of noise distributions. All the noise vectors are normalized to have zero mean and a mean absolute value equal to 5% of the mean value of the target \({\textbf{y}}\). The target is normalized in [0.1, 0.9].
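A minimal sketch of this noise normalization, applied to the Griewank targets with log-normal noise as one of the five cases (Python/NumPy; the sampling domain of the inputs is an assumption for illustration):

```python
import numpy as np

def add_scaled_noise(y, raw_noise, level=0.05):
    """Center the noise and rescale it so that its mean absolute value equals
    `level` times the mean of the target, as described above."""
    e = raw_noise - raw_noise.mean()
    e *= level * np.abs(y.mean()) / np.abs(e).mean()
    return y + e

rng = np.random.default_rng(1)
X = rng.uniform(-5.0, 5.0, size=(10_000, 10))          # m = 10^4 observations, n = 10 features
y = 1 + (X**2).sum(axis=1) / 4000 - np.cos(X / np.sqrt(np.arange(1, 11))).prod(axis=1)
y_noisy = add_scaled_noise(y, rng.lognormal(size=y.size))
```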

For the approximation with ANNbN, we use a Gaussian kernel with shape parameter \(c=10\) for the RBFs, and \(N=\lfloor \frac{m}{50}\rfloor =200\) neurons, because a regression is performed in the output layer (Eqs. 4–7), and it is important to have a sufficient ratio of observations to neurons (i.e. 50). We compare the performance with other methods, i.e. Random Forests (Breiman 2001), as implemented in Sadeghi (2013), XGBoost (Xu and Chen 2014; Ruder 2016), and AdaBoost from ScikitLearn (Scikitlearn 2016). We use the Mean Absolute Percentage Error (MAPE) as a metric. The results are presented in Fig. 6, indicating the high accuracy attained with ANNbNs. We should note that the solution exhibited low sensitivity to the partitioning. In particular, by changing the initial seeding algorithm for clustering with k-means, for the polynomial function studied, the MAPE varied from 0.00277 (Kmpp Algorithm, Arthur and Vassilvitskii 2007) to 0.00268 (KmCentrality Algorithm, Park and Jun 2009), while with random seeding the MAPE was 0.00278.

3.3 Classification for computer vision

As highlighted in the introduction, the reproducibility of AI research is a major issue. We utilize ANNbN for the MNIST database (LeCun et al. 1998, 2010), obtained from (Shindo 2015), consisting of \(6\times 10^4\) handwritten digits \(\in [0,9]\) for training and \(10^4\) for testing. The investigation regards a variety of ANNbN formulations and the comparison with other methods. In particular, \(\text{erf}(x)=\frac{1}{\sqrt{\pi }}\int _{-x}^{x}{{{e}^{-{{t}^{2}}}}dt}\) and \(\sigma (x)=\frac{1}{1+e^{-x}}\) were utilized as activation functions, with the corresponding \(\text{erf}^{-1}(x)\) and \(\sigma ^{-1}(x)\) in Eq. 1. We constructed ANNbNs with one and multiple layers, varying the number of neurons and normalizing y in the domain \(\left[ \epsilon , 1-\epsilon \right]\). The results regard separate training for each digit. All results in Table 1 are obtained without any clustering. As an accuracy metric, we consider the percentage of Correctly Classified (CC) digits, divided by the number of observations m

$$\begin{aligned} \alpha =100\frac{CC}{m}\%. \end{aligned}$$

This investigation aimed to compare ANNbN with standard ANN algorithms such as Flux (Innes et al. 2018; Innes 2018), as well as with Random Forests as implemented in Sadeghi (2013), and XGBoost (Xu and Chen 2014). Table 1 presents the results in terms of accuracy and computing time. The models are trained on the raw data set, without any exploitation of spatial information. The results in Table 1 are exactly reproducible in terms of accuracy, as no clustering was utilized and the indices are taken in ascending order. For example, the running time to train 5000 neurons is 29.5 s on average for each digit, which is fast, considering that the training regards 3,925,785 weights, for 6E4 instances and 784 features. Also, the Deep ANNbNs with 10 layers of 1000 neurons each are trained in the considerably short timeframe of 91 s per digit on average (Table 1). Correspondingly, in Table 1, we compare the accuracy and running time with Random Forests (with \(261\approx 784/3\) trees) and XGBoost (200 rounds). Future steps may include data preprocessing and augmentation, the exploitation of spatial information as in CNNs, the utilization of clustering for the neighborhoods' training, Ensembles and other combinations of ANNbNs, as well as further training of the initial ANNbN with an optimizer such as stochastic gradient descent, which may achieve even higher accuracy. No GPU or parallel programming was utilized, which might also be a topic for future research. For example, the RBF implementation of ANNbN with clustering and \(1.2\times 10^4\) neurons exhibits a test set accuracy of 99.7% for digit 3. The accuracy results regard the out-of-sample test set with \(10^4\) digits. The running time was measured on an Intel i7-6700 CPU @3.40 GHz with 32 GB memory and an SSD hard disk. A computer code to feed the calculated weights into Flux (Innes et al. 2018) is provided.
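A small sketch of the per-digit setup and the accuracy metric (the aggregation of the ten per-digit models by taking the maximum response is an assumption for illustration, as the text above only specifies separate training per digit):

```python
import numpy as np

def per_digit_targets(labels, digit, eps=0.1):
    """Binary targets for one digit, mapped into [eps, 1 - eps] before training."""
    return np.where(labels == digit, 1.0 - eps, eps)

def classify(scores):
    """scores : (m, 10) outputs of the ten per-digit models; predict the digit with the largest response."""
    return scores.argmax(axis=1)

def accuracy(predicted, labels):
    """alpha = 100 * CC / m (%)"""
    return 100.0 * np.mean(predicted == labels)
```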

3.4 Solution of partial differential equations

We consider the Laplace Equation

$$\begin{aligned} \frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}+\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}=0, \end{aligned}$$

in a rectangle with dimensions (a, b), and boundary conditions \(f(0,y)=0\) for \(y \in [0,b]\), \(f(x,0)=0\) for \(x \in [0,a]\), \(f(a,y)=0\) for \(y \in [0,b]\), and \(f(x,b)=f_0\sin (\frac{\pi }{a}x)\) for \(x \in [0,a]\). In Fig. 7a, the numerical solution as well as the exact solution

$$\begin{aligned} f(x,y)=\frac{f_0}{\hbox{sinh}\left( \frac{\pi }{a}b\right) } \sin \left( \frac{\pi }{a}x\right) \hbox{sinh}\left( \frac{\pi }{a}y\right) , \end{aligned}$$

are presented. The MAE between the closed-form solution and the ANNbN numerical solution was found to be \(3.97\hbox{E}{-}4\). Interestingly, if we add some random noise to the zero source, i.e.

$$\begin{aligned} \frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}+\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}=\epsilon \sim {\mathcal {U}}\left( 0,\frac{1}{10}\right) , \end{aligned}$$
(18)

the MAE remains small, and in particular \(2.503\hbox{E}{-}3\), for \(a=b=1\), over a rectangular grid of points with \(dx=dy=0.02\). It is important to underline that numerical methods for the solution of partial differential equations are highly sensitive to noise (Mai-Duy and Tran-Cong 2003; Bakas 2019), as noise distorts the derivatives. However, with the ANNbN solution, the results change only slightly, as reflected in the above errors. This is further highlighted if we utilize the calculated weights of the ANNbN approximation to compute the second-order partial derivatives of the solution f of Eq. 18, \(\frac{{{\partial }^{2}}f}{\partial {{x}^{2}}}\) and \(\frac{{{\partial }^{2}}f}{\partial {{y}^{2}}}\): the corresponding MAE is \(6.72\hbox{E}{-}4\) (Fig. 7b), which is about two orders of magnitude smaller than the added noise \(E({{\mathcal {U}}}(0,\frac{1}{10}))=0.05\), implying that ANNbN approximates the signal and not the noise, even in PDEs and even with a stochastic source.
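For reference, a short sketch of the grid and the closed-form benchmark used to compute the reported MAE (Python/NumPy; `f_num` stands for a numerical solution obtained with the scheme of Sect. 2.4 and is not computed here):

```python
import numpy as np

a = b = 1.0
f0 = 1.0
xs = np.arange(0.0, a + 1e-9, 0.02)                  # dx = 0.02
ys = np.arange(0.0, b + 1e-9, 0.02)                  # dy = 0.02
Xg, Yg = np.meshgrid(xs, ys)

# closed-form solution of the noise-free problem, used as the reference
f_exact = f0 / np.sinh(np.pi * b / a) * np.sin(np.pi * Xg / a) * np.sinh(np.pi * Yg / a)

# mae = np.abs(f_num - f_exact).mean()   # f_num: ANNbN numerical solution on the same grid
```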

Fig. 7

ANNbN solution of Laplace’s Equation with stochastic source

4 Discussion and conclusions

As described in the formulation of the proposed method, we may use a variety of ANNbNs, such as the sigmoid or Radial Basis Function schemes, Ensembles of ANNbNs, Deep ANNbNs, etc. The method adheres to the theory of function approximation with ANNs, as per visual representations of ANNs' capability to approximate continuous functions (Nielsen 2015; Rojas 2013). We explained the implementation of the method in the presented illustrative examples, which may be reproduced with the provided computer code. In general, sigmoid functions are faster, RBFs are more accurate, and Ensembles of either sigmoids or RBFs handle noisy data sets better. RBFs may use matrices of size smaller than \(N=\lfloor \frac{m}{n+1}\rfloor\), and hence approximate data sets with limited observations and many features. The overall results are stimulating in terms of speed and accuracy, compared to state-of-the-art methods in the literature.

The approximation of partial derivatives and the solution of PDEs, with or without a noisy source, in a fast and accurate setting, offers a solid step towards the unification of Artificial Intelligence algorithms with Numerical Methods and Scientific Computing. Future research may consider the implementation of ANNbNs in specific AI applications such as pattern recognition in environmental data and remote sensing observations, hydrometeorological predictions, regression analysis, and the solution of other types of PDEs for environmental modelling and risk assessment. Furthermore, the investigation of sigmoid functions other than the logistic, such as \(\tanh , \arctan , \text{erf}, \text{softmax}\), etc., as well as of other RBFs, such as multiquadric, integrated, etc., and the selection of an optimal shape parameter for even higher accuracy, are also of interest. Finally, while the computation of the weights is already very fast, the algorithm may easily be parallelized, as the weights' computation requires solving N independent linear systems with matrices \({\textbf{X}}_k\).

Interpretable AI is a modern demand in science, and ANNbNs are inherently suitable for this purpose, as by checking the approximation errors of the neurons in each cluster, one may retrieve information on the local accuracy, as well as on local and global non-linearities in the data. Furthermore, as demonstrated in the examples, the method performs well on small data sets without overfitting, by approximating the signal and not the noise, which is a common problem of ANNs.