Gradient-free stochastic training of ANNs, with local approximation in partitions

We present a numerical scheme for computation of Artificial Neural Network (ANN) weights, which stems from the Universal Approximation Theorem, avoiding costly iterations. The proposed algorithm adheres to the underlying theory, is highly fast, and results in remarkably low errors when applied to regression and classification problems of complex data sets with x ∈ R^n (e.g. Griewank, Gomez-Levy, Shekel, and Polynomial functions) with random noise addition (i.e. Uniform, Normal, Generalized Pareto, Log-Normal, and a mixture of Log-Normal, Exponential, and Frechet), as well as the database for handwritten digits recognition MNIST (Modified National Institute of Standards and Technology) with 7×10^4 images. The same mathematical formulation was found capable of approximating highly nonlinear functions in multiple dimensions, with low errors (e.g. 10^{-10}) for the test set of the unknown functions, their higher-order partial derivatives, as well as numerically solving Partial Differential Equations, such as those appearing in Physics, Engineering, Environmental Sciences, etc. The method is based on the calculation of the weights of each neuron in small neighbourhoods of the data.
Accordingly, optimization of hyperparameters is not necessary, as the number of neurons stems directly from the dimensionality of the data, further improving the algorithmic speed. Under this setting, overfitting is inherently avoided, and the results are interpretable and reproducible. The complexity of the proposed algorithm is of class P with O(mNn i_cl + Nmn² + Nn³ + mN² + N³) computing time, with respect to the observations m, features n, and neurons N, contrary to the NP-Complete class of standard algorithms for ANN training. The performance of the method is high, irrespective of the size of the data set, and the test set errors are similar or smaller than the training errors, indicating the generalization efficiency of the algorithm. A supplementary computer code in the Julia and Python languages is provided, which can be used to reproduce the validation examples, and/or apply the algorithm to other data sets.

x: Out-of-sample point in R^n
X_k: Local matrix of observations (in kth cluster)
b_k: Bias for kth neuron
CC: Correctly Classified observations
i, j, k: Iterators for observations, features, and clusters
m_k: Observations in the kth cluster
y_ik: Local observations in kth cluster
v_k; b_0: Approximation weights and bias for the external layer
w_jk: Weights of first layer for jth neuron, in kth cluster
x_ij: Given data (m × n)

Introduction
Although Artificial Intelligence (AI) has broadened its numerical methods and extended its fields of application (Shahiri Tabarestani and Afzalimehr 2021; Shaibani et al. 2021; Mohebbi Tafreshi et al. 2020; Kasiviswanathan and Sudheer 2017), empirical rigor has not followed such advancements (Sculley et al. 2018; Bakas et al. 2022), with researchers questioning the accuracy of iterative algorithms (Hutson 2018a), as the results obtained for a certain problem are not always reproducible (Hutson 2018b; Belthangady and Royer 2019). In theory, Artificial Neural Networks (ANN) are capable of approximating any continuous function (Hassoun et al. 1995) but, apart from existence, the theory alone cannot conclude on a universal approach to calculate an optimal set of ANN model parameters, also referred to as weights, and a variety of algorithmic implementations have occurred for this purpose (Li et al. 2016; Yang and Wu 2016; Lin et al. 2019). Along these lines, iterative optimization algorithms (Ruder 2016) are usually applied to reach an optimal set of ANN weights, which minimize the total error of model estimates. Note, however, that apart from trivial cases rarely met in practice, the optimization problem has more than one local minimum, and its solution requires multiple iterations that significantly increase the computational load. To resolve this issue, enhanced optimization methods, such as stochastic gradient descent (Bottou 2010; Johnson and Zhang 2013), have been proposed. Another common issue in ANN applications is that of overfitting, which relates to the selection of a weighting scheme that approximates a given set of data well, while failing to generalize the accuracy of the predictions beyond the training set. To remedy overfitting problems, several methods have been proposed and effectively applied, such as dropout (Srivastava et al. 2014).
Additional, and probably more important, concerns regarding the effective application of ANN algorithms are: a) the arbitrary selection of the number of computational neurons, which may result in an unnecessary increase of the computational time, and b) the optimization of the hyper-parameters of the selected ANN architecture (Bergstra et al. 2011; Bergstra and Bengio 2012; Feurer and Hutter 2019), which corresponds to solving an optimization problem with objective function values determined by the solution of another optimization problem, that is, the calculation of ANN weights for a given training set. The purpose of this work is to develop a numerical scheme for the calculation of the optimal weights, the number of neurons, and other parameters of ANN algorithms, which relies on theoretical arguments, in our case the Universal Approximation Theorem, and, at the same time, is fast and precise. This has been attained without deviating from the classical ANN representation, by utilizing a novel numerical scheme: firstly dividing the studied data set into small neighborhoods and, subsequently, performing matrix manipulations for the calculation of the sought weights. Local approximation with the Heaviside activation function cannot be constructed in a Euclidean space with dimension higher than 1 (Chui et al. 1994); thus we propose a scheme with other sigmoid functions, such as the logistic, tanh, etc. The numerical experiments exhibit high accuracy, attaining a remarkably low number of errors in the test set of known data sets such as those included in the MNIST database for computer vision (LeCun et al. 2010), and complex nonlinear functions for regression, while the computational time is kept short. Interestingly, the same algorithmic scheme may be applied to approximate the solution of Partial Differential Equations (PDEs), appearing in Physics, Engineering, Environmental Sciences, etc. The paper is organized as follows: In Sect.
2, we present the general formulation of the suggested method, hereafter referred to as ANNbN (Artificial Neural Networks by Neighborhoods). More precisely, the basic formulation of the ANNbN approach is progressively developed in the subsections that follow. A supplementary computer code in the Julia (Bezanson et al. 2017) and Python (Python 2001-2021) programming languages is available at https://github.com/nbakas/ANNbN.jl.

Artificial neural networks by neighborhoods (ANNbN)
Let x_ij be some given data of j ∈ {1, 2, …, n} input variables in i ∈ {1, 2, …, m} observations of y_i responses. The Universal Approximation Theorem (Cybenko 1989; Tadeusiewicz 1995) ensures the existence of an integer N, such that a representation of the form f_i = b_0 + Σ_{k=1}^{N} v_k σ(Σ_{j=1}^{n} w_jk x_ij + b_k) attains approximation errors ε_i = y_i − f_i, among the given responses y_i and the corresponding simulated f_i values, arbitrarily low. N is the number of neurons, w_jk and b_k denote the local approximation weights and bias terms, respectively, of the linear summation conducted for each neuron k, and v_k, b_0 correspond to the global approximation weights and bias term, respectively, of the linear summation upon all neurons. σ is any sigmoid function, as presented below in Sect. 2.1.1.
The suggested ANNbN (Artificial Neural Networks by Neighborhoods) method is based on the segmentation of a given data set into smaller clusters of data, so that each cluster k is representative of the local neighborhood of the y_ik responses and, subsequently, uses the weights w_jk calculated for each cluster to derive the global weights v of the overall approximation. To conclude on the neighborhoods (i.e. the proximity clusters) of the response observations y_i, we use the well-known k-means clustering algorithm (see e.g. MacQueen et al. 1967; Hartigan and Wong 1979) and k-means++ for the initial seed (Arthur and Vassilvitskii 2007). Any other clustering algorithm can be utilized as well, while by supplying the initial seed, the obtained results are always reproducible. It is worth mentioning that the method works well even without clustering the data; however, clustering increases the accuracy and is more compatible with a strict implementation of the Universal Approximation Theorem, as we present in Sect. 3.1. Clustering adds significant computational load, especially for large data sets; however, as presented in Table 1 for the MNIST dataset, ANNbN yields strong results even without clustering. Figure 1 illustrates the calculation process for the ANNbN weights. Contrary to the regular ANN approach, where all responses y_i are treated in a single step as a whole, the ANNbN method first splits the responses into proximity clusters, calculates the weights w_jk in each cluster k using the responses y_ik and corresponding input data x_ijk, and subsequently uses the derived weights w_jk for each cluster

to calculate the global weights v_k, b_0 of the overall approximation. The aforementioned two-step approach is detailed in Sects. 2.1.1 and 2.1.2 below.

(Table 1, recovered from the extraction, lists the tested ANNbN designs: (1) 1 hidden layer, AF erf(x) = (1/√π) ∫_{−x}^{x} e^{−t²} dt, and … = 0.00; (2) 1 hidden layer with 5000 neurons, AF σ(x) = 1/(1 + e^{−x}), and … = 0.01; (3) 1 hidden layer with 7000 neurons, AF σ, and … = 0.01; (4) 1 hidden layer with 7000 neurons, AF σ, and … = 0.02; (5) 10 hidden layers with 1000 neurons each, AF σ, and … = 0.02. Footnotes mark the fastest design and the highest accuracy. The ANNbN accuracy results are exactly reproducible with the supplementary computer code. All training examples utilize the raw MNIST database, without any preprocessing or data augmentation. The accuracy (%) regards the out-of-sample test set of the MNIST database with 10^4 handwritten digits.)

Calculation of w_jk and b_k in the kth cluster
Let m_k be the observations found in the kth cluster, with Σ_{k=1}^{N} m_k = m, and σ the sigmoid function, which may be selected among the variety of sigmoids, such as σ(x) = 1/(1 + e^{−x}), with the inverted sigmoid σ^{−1} being σ^{−1}(y) = log(y/(1 − y)).
By defining [n] := {1, 2, …, n} the features' iterator, [m] := {1, 2, …, m} the iterator of samples, [m_k] := {1, 2, …, m_k} the local samples' indices, X_k the m_k × n matrix X_k := (x_ijk)_{i∈[m_k], j∈[n]}, w_k := {w_1,k, w_2,k, …, w_n,k, b_k}^T the weights' vector, with A^T denoting the transpose of a matrix or vector A, and y_k := {y_1,k, y_2,k, …, y_{m_k,k}}^T the target values found in each cluster, we may write

σ ∘ ((X_k 1) × w_k) = y_k,   (1)

where (X_k 1) denotes the matrix X_k with a column of 1s appended, the symbol ∘ implies the element-wise application of σ, and the symbol × denotes the matrix product. By utilizing the inverse sigmoid function σ^{−1}, and writing the left part of the equation in matrix form, we deduce that

(X_k 1) × w_k = σ^{−1}(y_k) =: ŷ_k.   (2)

Hence, because the dimensions of X_k are small (m_k ≪ m), we may rapidly calculate the approximation weights w_k (Fig. 1, left) in the kth cluster (corresponding to the kth neuron) by solving the linear system of Eq. 2 with respect to w_k, with ŷ_k := σ^{−1}(y_k). In cases with m_k = n + 1 and linearly independent columns of X_k, we may solve Eq. 2 with Gaussian elimination, while when X_k is not of full column rank, least squares or generalized inversion may be applied (Marlow 1993; Chapra et al. 2010).
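As a concrete illustration of the local solve of Eq. 2, the following sketch computes one cluster's weights with NumPy least squares, assuming the logistic sigmoid. The names (`sigmoid`, `inv_sigmoid`, `local_weights`) are illustrative and are not taken from the paper's supplementary code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y):
    # sigma^{-1}(y) = log(y / (1 - y)); valid for y strictly inside (0, 1)
    return np.log(y / (1.0 - y))

def local_weights(Xk, yk):
    """Solve (Xk 1) w_k = sigma^{-1}(y_k) by least squares (Eq. 2)."""
    A = np.hstack([Xk, np.ones((Xk.shape[0], 1))])  # append a column of 1s
    wk, *_ = np.linalg.lstsq(A, inv_sigmoid(yk), rcond=None)
    return wk  # n weights followed by the bias b_k

# Tiny usage example: responses generated exactly by one neuron are recovered.
rng = np.random.default_rng(0)
Xk = rng.uniform(-1.0, 1.0, size=(6, 2))   # m_k = 6 observations, n = 2 features
w_true = np.array([0.4, -0.3, 0.1])        # w_1k, w_2k, b_k
yk = sigmoid(Xk @ w_true[:2] + w_true[2])
wk = local_weights(Xk, yk)
```

Since m_k = 6 exceeds n + 1 = 3 and the responses are consistent with a single neuron, the least-squares solution recovers the generating weights exactly.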

Calculation of v_k and b_0 exploiting all the given observations
Following the computation of the weights w_k for each neuron k in the hidden layer, and concatenating the weights' vectors w_k, we obtain the matrix of the weights for all neurons N, w := [w_1 w_2 ⋯ w_N]. Let

X̃ := σ ∘ ((X 1) × w),   (3)

where X corresponds to the total samples, contrary to the previous step that utilized X_k, containing the observations in cluster k. Hence, in order to compute the weights of the output layer v := {v_1, v_2, …, v_N, b_0}^T, for all the neurons connected with the external layer, we solve the linear system

(X̃ 1) × v = y,   (4)

for v, where y = {y_1, y_2, …, y_m}^T is the entire vector of observations. In the numerical experiments, the local approximation weights w_j are distinct, while the number of neurons is usually smaller than the number of observations (N < m); hence one can obtain the entire representation of the ANNbN by solving

v = ((X̃ 1)^T (X̃ 1))^{−1} (X̃ 1)^T × y.   (5)

As with the previous linear systems in Eq. 2, if X̃ is not of full column rank, we may use a solution with least squares or other solvers for dense, tall matrices. The ANNbN approximation scheme is concisely described in the following Algorithm 1.
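The two-step procedure of this section can be sketched end to end as below. For brevity, the neighbourhoods are formed by sorting the responses into N equal-sized chunks, a simple stand-in for the k-means clustering used in the paper; all names and the toy target are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y):
    return np.log(y / (1.0 - y))

def fit_annbn(X, y, N):
    """Two-step ANNbN fit: local systems per neighbourhood, then the output layer."""
    clusters = np.array_split(np.argsort(y), N)   # neighbourhoods of responses
    W = []
    for idx in clusters:                          # one neuron per cluster (Eq. 2)
        A = np.hstack([X[idx], np.ones((len(idx), 1))])
        wk, *_ = np.linalg.lstsq(A, inv_sigmoid(y[idx]), rcond=None)
        W.append(wk)
    W = np.column_stack(W)                        # (n+1) x N first-layer weights
    H = sigmoid(np.hstack([X, np.ones((len(y), 1))]) @ W)  # X-tilde over all samples
    B = np.hstack([H, np.ones((len(y), 1))])
    v, *_ = np.linalg.lstsq(B, y, rcond=None)     # output weights and b_0
    return W, v

def predict_annbn(X, W, v):
    H = sigmoid(np.hstack([X, np.ones((X.shape[0], 1))]) @ W)
    return np.hstack([H, np.ones((X.shape[0], 1))]) @ v

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = 0.1 + 0.8 * (X[:, 0]**2 + 0.5 * np.sin(np.pi * X[:, 1]) + 1.5) / 3.0
W, v = fit_annbn(X, y, N=20)
mae = np.abs(predict_annbn(X, W, v) - y).mean()
```

Note that the target is kept strictly inside (0, 1), as required by the inverted sigmoid.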

ANNbN with radial basis functions as kernels
In what follows, the method is further expanded by using Radial Basis Functions (RBFs) for the approximation, φ(r), depending on the distances r among the observations, instead of their raw values (Fig. 2). The operation is conducted in the identified clusters of data, as per Sect. 2.1, instead of the entire sample. A variety of studies exist on the approximation efficiency of RBFs (Yiotis and Katsikadelis 2015; Babouskos and Katsikadelis 2015); however, they refer to noiseless data and the entire sample, instead of neighborhoods. We should also distinguish this approach of RBFs implemented as ANNbN from the Radial Basis Function Networks (Schwenker et al. 2001; Park and Sandberg 1991), with u(x) = Σ_{i=1}^{N} a_i φ(||x − c_i||), where the centers c_i are the clusters' means, instead of collocation points, N is the number of neurons, and the a_i are calculated by training, instead of matrix manipulation. In the proposed formulation, the representation regards the distances r_ijk (Fig. 2) among all the observations x_ik = {x_i1k, x_i2k, …, x_ink} in cluster k with dimension (features) n, and i ∈ {1, 2, …, m_k}, and another observation in the same cluster, x_jk = {x_j1k, x_j2k, …, x_jnk}, with j ∈ {1, 2, …, m_k}. For each cluster k ∈ [N], we define the symmetric m_k × m_k matrix φ_k, whose elements φ_ijk = φ(||x_jk − x_ik||) denote the application of the function φ to the pairwise Euclidean distances (or norms) of the observations in the kth cluster, while by using RBFs, the local weights of cluster k do not comprise a bias term; hence w_k := {w_1,k, w_2,k, …, w_{m_k,k}}^T. In most cases, matrix φ_k is invertible (see below for further elaboration); thus, one may approximate the responses y_ik in the kth cluster as

φ_k × w_k = y_k,   (6)

and compute w_k by w_k = φ_k^{−1} × y_k. Note that vector w_k has length m_k for each cluster k, instead of n for the sigmoid approach.
Afterwards, similar to the case of the sigmoid kernels in the previous section (Eqs. 4, 5), and by using

X̃ := [φ_1 × w_1  φ_2 × w_2  ⋯  φ_N × w_N],   (7)

one can obtain the entire representation for all clusters, for the weights of the output layer v, by solving X̃ × v = y. The rows of the matrices φ_1, φ_2, … contain the observations of the entire sample, while the columns contain the collocation points found in each cluster; for the calculation of the weights, we use φ_ijk = φ(||x_jk − x_ik||). After computing the w_k and v, one may interpolate for any new x (out-of-sample) using the kernel values φ(||x − x_jk||), where the x_jk are the RBF collocation points for the approximation, same as in Eq. 6. Hence, we may predict for out-of-sample observations by using

f(x) = Σ_{k=1}^{N} v_k Σ_{j=1}^{m_k} w_jk φ(||x − x_jk||).   (8)

As the weights w_k are applied directly by multiplication, the calculation of the inverted function φ^{−1} (corresponding to σ^{−1} in Eq. 1) is not needed. This results in a convenient formulation for the approximation of the derivatives, as well as the solution of PDEs.
In case matrix φ_k results to be singular [see the Mairhuber-Curtis theorem (Mairhuber 1956)], one should select an alternative kernel for the data under consideration, in order to obtain an invertible φ_k, as well as increase accuracy (Fasshauer and Zhang 2007; Fasshauer and McCourt 2012). Some examples of radial basis kernels are the Gaussian φ(r) = e^{−r²/c²}, the Multiquadric φ(r) = √(1 + (cr)²), etc., where r = ||x_j − x_i||, and the shape parameter c controls the width of the function. c may take a specific value or be optimized, to attain higher accuracy for the particular dataset studied. Similar to the sigmoid functions, after computation of w_k, we use Eq. 7 to compute v and obtain the entire representation.
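A minimal sketch of the RBF variant on a single cluster, assuming the Gaussian kernel: the kernel matrix over pairwise distances is solved for the local weights, and out-of-sample prediction evaluates the kernel at the same collocation points (Eqs. 6, 8). The names and the 1-D test function are illustrative.

```python
import numpy as np

def gaussian(r, c):
    return np.exp(-r**2 / c**2)

def pairwise_dist(A, B):
    # Euclidean distances between every row of A and every row of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

c = 0.3
Xk = np.linspace(0.0, 1.0, 15)[:, None]      # collocation points of one cluster
yk = np.sin(2 * np.pi * Xk[:, 0])            # cluster responses
Phi = gaussian(pairwise_dist(Xk, Xk), c)     # symmetric m_k x m_k kernel matrix
wk = np.linalg.solve(Phi, yk)                # local weights: Phi_k w_k = y_k

x_new = np.linspace(0.05, 0.95, 50)[:, None] # out-of-sample points
f_new = gaussian(pairwise_dist(x_new, Xk), c) @ wk
err = np.abs(f_new - np.sin(2 * np.pi * x_new[:, 0])).max()
```

Note that no inverse kernel is ever needed: the weights act by multiplication only, which is what makes the derivative and PDE extensions below convenient.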

ANNbN for the approximation of derivatives
Equation 8 offers an approximation to the sought solution, by applying algebraic operations on φ_j(r), where r = ||x − x_j||, and φ_j is a differentiable function with respect to any out-of-sample x, considering the n-dimensional collocation points x_j as constants. Accordingly, one may compute any higher-order derivative of the approximated function, by utilizing Eq. 8, simply differentiating the kernel φ, and multiplying by the computed weights w_k = (w_jk), for all x_j. In particular, we may approximate the lth derivative with respect to the pth dimension, at the location of the ith observation, by

∂^l f_i / ∂x_ip^l = Σ_j w_jk (∂^l φ_ij / ∂x_ip^l),   (10)

where ∂^l φ_ij / ∂x_ip^l denotes the lth-order partial derivative of φ_ij = φ(||x_i − x_jk||) with respect to the pth coordinate of x_i,   (11)

where the x_jk denote the collocation points of cluster k, and x_i the points where f_i = f(x_i) is computed. Since vector v applies by multiplication and summation to all N clusters (Eq. 8), one may obtain the entire approximation for each partial derivative (i.e. by differentiating φ, and applying all w_k and the vector v, to ∂^l φ_ij / ∂x_ip^l). The weights remain the same for the function and its derivatives. We should underline that the differentiation in Eq. 11 holds for any dimension p ∈ {1, 2, …, n} of x_i; hence, due to Eq. 11, with the same formulation, we derive the partial derivatives with respect to any variable, in a concise setting.
For example, to approximate a function f(x_1, x_2) = Σ_j w_j φ(||x − x_j||) and later compute its partial derivatives with respect to x_1, one can utilize the collocation points x_j, use ∂φ/∂x_1 as the kernel, and hence obtain ∂f/∂x_1 = Σ_j w_j ∂φ_j/∂x_1 with the same weights w_j. The reverse direction is offered by integrated RBFs (… 2015), which are formulated from the indefinite integration of the kernel, such that its derivative is the RBF φ. Accordingly, we may integrate the kernel more than once, to approximate higher-order derivatives. For example, by utilizing erf(x) = (1/√π) ∫_{−x}^{x} e^{−t²} dt and the twice-integrated Gaussian RBF for φ at the collocation points x_j, we deduce that the second derivative of the approximant recovers the Gaussian RBF, approximating the second derivative f̈(x), instead of f(x).
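The key point above, that the fitted weights are reused unchanged while only the kernel is differentiated analytically, can be sketched in one dimension with a Gaussian kernel; function names and the test function are illustrative.

```python
import numpy as np

c = 0.2
xj = np.linspace(0.0, 1.0, 20)               # collocation points

def phi(x, centers):
    # Gaussian RBF values phi(|x - x_j|)
    d = x[:, None] - centers[None, :]
    return np.exp(-d**2 / c**2)

def dphi(x, centers):
    # analytic first derivative of the kernel with respect to x
    d = x[:, None] - centers[None, :]
    return -2.0 * d / c**2 * np.exp(-d**2 / c**2)

w = np.linalg.solve(phi(xj, xj), np.sin(2 * np.pi * xj))  # fit f once
x = np.linspace(0.1, 0.9, 40)
df_approx = dphi(x, xj) @ w                  # derivative with the SAME weights
err = np.abs(df_approx - 2 * np.pi * np.cos(2 * np.pi * x)).max()
```

Higher-order derivatives follow the same pattern, replacing `dphi` with higher analytic derivatives of the kernel.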

ANNbN for the solution of partial differential equations
Similar to the numerical differentiation, we may easily apply the proposed scheme to numerically approximate the solution of Partial Differential Equations (PDEs). We consider a generic differential operator, depending on the D^l partial derivatives of the sought solution f, for some coefficient functions g_l(x), which satisfy Σ_l g_l(x) D^l f(x) = h(x), where h may be any function in the form of h(x_1, x_2, …, x_n). We may approximate f by

f(x) = Σ_j w_j φ(||x − x_j||).   (12)

By utilizing Eq. 10, we constitute a system of linear equations; hence, the weights w_jk may be calculated by solving the resulting system, as per Eq. 6. For example, consider the following generic form of the Laplace equation

∂²f/∂x_1² + ∂²f/∂x_2² = h(x_1, x_2).   (13)

The weights w_j in Eq. 12 are constant; hence the differentiation regards only function φ. Thus, by using the second derivatives of the kernel, and writing Eq. 13 for all h_ik = h(x_ik) = y_ik found in cluster k, we obtain a linear system for the local weights w_k. Because the weights w_jk are the same for the approximated function and its derivatives, we may apply some boundary conditions for the function or its derivatives D^l, at some boundary points [b] := {1, 2, …, m_b}, by appending the corresponding kernel rows; hence, we may compute w_k by solving the resulting system of equations, similar to Eq. 6, for cluster k. Afterwards, we may obtain the entire representation for all clusters, by using Eq. 7 for the computation of v. Finally, we obtain the sought solution by applying the computed weights w, v in Eq. 8, for any new x.
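The collocation idea of this section, differentiated-kernel rows for the PDE plus kernel rows for the boundary conditions, can be sketched for a 1-D Poisson problem f''(x) = h(x), f(0) = f(1) = 0, on a single cluster with a Gaussian kernel; this is a simplified one-cluster sketch, not the paper's multi-cluster scheme.

```python
import numpy as np

c = 0.15
xj = np.linspace(0.0, 1.0, 25)               # collocation points

def phi(x, ctr):
    d = x[:, None] - ctr[None, :]
    return np.exp(-d**2 / c**2)

def d2phi(x, ctr):
    # analytic second derivative of the Gaussian kernel
    d = x[:, None] - ctr[None, :]
    return (4.0 * d**2 / c**4 - 2.0 / c**2) * np.exp(-d**2 / c**2)

h = lambda x: -(2 * np.pi)**2 * np.sin(2 * np.pi * x)  # exact solution: sin(2 pi x)
M = np.vstack([d2phi(xj[1:-1], xj),          # interior rows: f'' = h
               phi(np.array([0.0, 1.0]), xj)])  # boundary rows: f(0) = f(1) = 0
rhs = np.concatenate([h(xj[1:-1]), [0.0, 0.0]])
w, *_ = np.linalg.lstsq(M, rhs, rcond=None)
x = np.linspace(0.0, 1.0, 101)
err = np.abs(phi(x, xj) @ w - np.sin(2 * np.pi * x)).max()
```

The same weights then evaluate the solution (and, via the kernel derivatives, its derivatives) at any new x.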

Deep networks
A method for the transformation of shallow ANNbNs to Deep Networks is also presented. Although shallow networks exhibit very high accuracy, even for unstructured and complex data sets, deep ANNbNs may be utilized for research purposes, for example in the intersection of neuroscience and artificial intelligence. After the calculation of the weights for the first layer, w_jk, we use them to create a second layer (Fig. 3a), where each node corresponds to the given y_i. We then use the same procedure for each neuron k of layer [l] := {2, 3, …, L}, by solving

σ ∘ ((X̃_{l−1} 1) × w_l) × v_l = σ^{−1}(y),   (17)

with respect to v_l. For each layer l, w_l corresponds to w of Eq. 3. We may arbitrarily select any number of neurons within each layer l, without additional computational cost, as the solution v_l of Eq. 17 depends on the weights w_l only, corresponding to the previously computed layer. We should note that although Eq. 17 is similar to Eqs. 3 and 4, we now use both the sigmoid (left part) and the inverted sigmoid (right part) functions. This procedure is iterated for all neurons k of layer l. Matrix w_l corresponds to the weights of layer l − 1. Finally, for the output layer, we calculate the linear weights v_k, as per Eq. 4. This procedure results in a good initialization of the weights, close to the optimal solution, and if we normalize y_i in a range close to the linear part of the sigmoid function σ (say [0.4, 0.6]), we rapidly obtain a deep network with approximately equal errors to those of the shallow one. Afterwards, any optimization method may be supplementarily applied to compute the final weights; however, the accuracy is already very high. Alternatively, we may utilize the obtained layer of the shallow implementation of ANNbN, X̃ (see Eq. 3), as an input x_ij for the second layer (without computing v), then for a third, and sequentially up to any desired number of layers.
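The alternative construction at the end of this section, feeding the hidden representation of one ANNbN layer as the input of the next and solving the output weights only once at the end, can be sketched as follows; the layer-forming routine, the equal-chunk neighbourhoods, and the toy target are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y):
    return np.log(y / (1.0 - y))

def layer_weights(X, y, N):
    # one ANNbN layer: N neurons, each fitted on a neighbourhood of responses
    W = []
    for idx in np.array_split(np.argsort(y), N):
        A = np.hstack([X[idx], np.ones((len(idx), 1))])
        wk, *_ = np.linalg.lstsq(A, inv_sigmoid(y[idx]), rcond=None)
        W.append(wk)
    return np.column_stack(W)

def forward(X, W):
    return sigmoid(np.hstack([X, np.ones((X.shape[0], 1))]) @ W)

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = 0.4 + 0.2 * sigmoid(X @ np.array([1.0, -2.0, 0.5]))  # targets near the linear zone

H = X
for _ in range(3):                            # stack three ANNbN layers
    H = forward(H, layer_weights(H, y, N=10))
B = np.hstack([H, np.ones((len(y), 1))])
v, *_ = np.linalg.lstsq(B, y, rcond=None)     # linear output layer, as per Eq. 4
mae = np.abs(B @ v - y).mean()
```

Keeping the targets inside [0.4, 0.6], the near-linear zone of the sigmoid, follows the normalization advice given above.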

Generalization in terms of ensembles
A generalisation of the approach presented in Sects. 2.1-2.5 can be obtained by averaging the results of different ANNbN models, each one fitted to a different portion of the data (Dietterich et al. 2002; Li et al. 2016). This can be done by randomly sub-sampling a percentage a% of the observations, running the ANNbN algorithm multiple times i_f ∈ {1, 2, …, n_f}, averaging the results with respect to the errors ε_{i_f} over all n_f folds,

ȳ_i = (1/n_f) Σ_{i_f=1}^{n_f} ŷ_{i,i_f},

and using ȳ_i to constitute an Ensemble of ANNbN models (Fig. 3b). Ensembles of ANNbNs exhibit increased accuracy and generalization properties for noisy data, as per the following numerical experiments.
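The sub-sample-and-average scheme can be sketched as below. A plain linear least-squares model stands in for the full ANNbN fit to keep the sketch short; the fraction `a`, the fold count `n_f`, and all names are illustrative.

```python
import numpy as np

def fit_linear(X, y):
    A = np.hstack([X, np.ones((len(y), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linear(X, w):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(500, 4))
y = X @ np.array([0.3, -0.2, 0.1, 0.05]) + 0.5 + rng.normal(0.0, 0.05, 500)

a, n_f = 0.8, 10                     # subsample fraction a and number of folds n_f
models = []
for _ in range(n_f):                 # each model sees a different random a% portion
    idx = rng.choice(len(y), size=int(a * len(y)), replace=False)
    models.append(fit_linear(X[idx], y[idx]))
y_bar = np.mean([predict_linear(X, w) for w in models], axis=0)  # ensemble average
mae = np.abs(y_bar - y).mean()
```

With noisy targets, the residual of the averaged prediction is dominated by the injected noise, which is the generalization behaviour the ensemble is meant to exhibit.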

Time complexity of the ANNbN algorithm
The training of an ANN with two layers and only three nodes has been proven to be NP-Complete (Blum and Rivest 1989), if the nodes compute linear threshold functions of their inputs. Even simple cases, and approximating hypotheses, result in NP-complete problems (Engel 2001). Apart from the theoretical point of view, the slow speed of learning algorithms is a major flaw of ANNs. On the contrary, ANNbNs are fast, because the main part of the approximation regards operations with small-sized square matrices (n + 1) × (n + 1), with n being the number of features. Below we provide a theoretical investigation of the ANNbNs' time complexity, which has been empirically validated by running the supplementary code. Statement 1: ANNbN training is conducted in three distinct steps: a) clustering, b) solution of linear systems of equations with small-sized matrices X_k (Eq. 1) for the calculation of the w_jk weights, and c) calculation of the v_k weights (Eq. 5).
Statement 2: Let m be the number of observations, and n the number of features. We assume that m ≫ n (e.g. m > 10n, as usual for regression problems), which corresponds to the case when the samples are more than the features. We note that the number of clusters N is also equal to the number of neurons (see Eq. 1 and Fig. 1).
Lemma 1 Time complexity of step (a) is O(mNn i_cl).
Proof The running time of Lloyd's algorithm (and most variants) is O(mNn i_cl) (Hartigan and Wong 1979; Manning et al. 2008), with i_cl denoting the number of iterations necessary to converge. We should note that, in the worst case, the complexity of Lloyd's algorithm is super-polynomial (Blömer et al. 2016; Arthur and Vassilvitskii 2006), with i_cl = 2^{Ω(√m)}, and the same holds for the ANNbN algorithm. However, in practice, the algorithm converges for small i_cl. For the case when the k-means++ algorithm with D²-weighting is used, the total error is at most O(log N)-competitive (Arthur and Vassilvitskii 2007). For Theorem 1, we use the more conservative approximation, that is, Lloyd's algorithm with complexity O(mNn i_cl). □

Lemma 2 Time complexity of step (b) is O(Nmn² + Nn³).

Proof Time complexity of step (b) regards the solution of linear systems with matrices (n + 1) × (n + 1) (Eq. 1). However, in the general case, the clustering algorithm may result in clusters with unequal sizes. Hence, one needs to solve a system with non-square matrices X_k of sizes m_k × n (Eq. 1). In the worst-case scenario, a cluster contains m − N + 1 samples, while the rest of the clusters comprise a single sample each. Accordingly, the complexity of step (b) regards the solution of a linear system with X̃_k having dimensions (m − N + 1) × n. When solving this system, and assuming that X̃_k is of full column rank, the multiplication X̃_k^T X̃_k results in complexity O(nmn) = O(n²m), as m > (m − N + 1). Then, one proceeds to solve the corresponding square linear system with complexity O(n³), as well as the multiplication of (X̃_k^T X̃_k)^{−1} with X̃_k^T, resulting in complexity O(nnm), and of (X̃_k^T X̃_k)^{−1} X̃_k^T with σ^{−1}(y_k), resulting in complexity O(nm). Thus, the total complexity is O(mn² + n³ + mn² + mn). This is repeated N times; hence, the complexity is O(Nmn² + Nn³). □

Lemma 3 Time complexity of step (c) is O(mN² + N³).
Proof Step (c) regards the solution of an m × N system of equations (Eq. 4). We assume that the linear systems to be solved in Eq. 5 are of full column rank. Hence, the complexity regards the multiplication X̃^T X̃, with complexity O(N²m), and the solution of the resulting N × N square system, with complexity O(N³); thus, step (c) is O(mN² + N³). □

Theorem 1 By considering the time complexity of each step (Lemmas 1, 2, 3), we deduce that the total computational complexity of the ANNbN algorithm is O(mNn i_cl + Nmn² + Nn³ + mN² + N³).

In practice, the clustering algorithm converges after a few iterations i_cl, and the computing time is a third-order polynomial in the number of neurons, second order in the features, and linear in the samples. A comparison between the theoretical and experimental computing times is illustrated in Fig. 4. Particularly, Fig. 4 depicts the experimental and theoretical computing times for the case of the Griewank function. Figure 4a corresponds to n = 10² features, N = 10 neurons, and a variation of samples m ranging from 10³ to 10⁴ with step 10³. Figure 4b regards m = 10⁴ samples, N = 10² neurons, and a variation of features n ranging from 10 to 10² with step 10. Finally, Fig. 4c corresponds to m = 10⁵ samples, n = 10 features, and a variation of neurons N ranging from 10² to 10³ with step 10². Each case has been run 10 times with random samples for input, and the average, minimum and maximum experimental times are reported. The theoretical times presented in Fig. 4a-c have been obtained by standardising the resulting complexity of Theorem 1, based on the average experimental times measured on the Cyclone facility (https://hpcf.cyi.ac.cy/), using a single node with 40 cores, for cases (a), (b) and (c), respectively.
Within the limits of this experiment, we notice a linear pattern (with slope less than 1) in the average experimental time in Fig. 4a, which is expected from Theorem 1 for the case when we vary m. In Fig. 4b, we may see a nonlinear growth of the experimental time with n. A second-order polynomial fit to the average experimental time results in an R² = 0.9942, which is consistent with Theorem 1 as well. Finally, we would expect a third-order polynomial for the number of neurons N; however, the trend of the experimental time exhibits a sublinear stabilization pattern. This may be a result of the actual logarithmic complexity of the first step (clustering). Additionally, perhaps the BLAS operations, which run in parallel, offer this improvement; however, a more analytical investigation should be performed in future research. It is worth noting that in the case of small matrices X̃_k, instead of the generalized inversion assumed in Theorem 1, least squares may be used, especially when one solves for many neurons with few samples per cluster, which renders the solution of the systems of Eq. 1 unstable.
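The operation count of Theorem 1 can be written as a plain function to check the scaling claims numerically; treating i_cl as a fixed small constant is an assumption, matching the empirical observation that clustering converges in few iterations.

```python
# Total operation count from Theorem 1; i_cl is the (empirically small)
# number of k-means iterations, fixed here as an assumption.
def annbn_ops(m, n, N, i_cl=10):
    return m*N*n*i_cl + N*m*n**2 + N*n**3 + m*N**2 + N**3

# Doubling m with n, N fixed roughly doubles the count: linear in the samples.
r_m = annbn_ops(2 * 10**4, 10, 10) / annbn_ops(10**4, 10, 10)

# Doubling N grows the count faster than linearly, while the N^3 term only
# dominates once N approaches m.
r_N = annbn_ops(10**4, 10, 20) / annbn_ops(10**4, 10, 10)
```

This reproduces the qualitative behaviour of Fig. 4: near-proportional growth in m, and super-linear but far-from-cubic growth in N for N ≪ m.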

1D function approximation and its geometric point of view
We consider a simple one-dimensional function f(x), with x 2 R, to present the basic functionality of ANNbNs.
Because σ^{−1}(y) = log(y/(1 − y)) is unstable for y → 0 and y → 1, we normalize the responses in the domain [0.1, 0.9]. In Fig. 5, the approximation of f(x) = 0.3 sin(e^{3x}) + 0.5 is depicted, by varying the number of neurons utilized in the ANNbN. We may see that by increasing the number of neurons from 2 to 8, the approximating ANNbN exhibits more curvature alterations. This complies with the Universal Approximation Theorem and offers a geometric point of view. Interestingly, the results are not affected by adding some random noise ~ U(−1/20, 1/20), as the Mean Absolute Error (MAE) in this noisy dataset was 2.10E−2 for the train set, and for the test set it was even smaller, 1.48E−2, further indicating the capability of ANNbN to approximate the hidden signal and not the noise. We should note that for noiseless data of 100 observations and 50 neurons, the MAE in the train set was 6.82E−6 and in the test set 8.01E−6. The approximation of the same function with a Gaussian RBF and shape parameter c = 0.01 results in 7.52E−8 MAE for the train set and 1.07E−7 for the test set.
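The 1-D setting above can be sketched as follows, with the 100-observation, 50-neuron configuration mentioned in the text. Here the neighbourhoods are consecutive pairs of points along x, so each 2×2 local system is determined exactly; this pairing and all names are illustrative simplifications, not the supplementary code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigmoid(y):
    return np.log(y / (1.0 - y))

x = np.linspace(0.0, 1.0, 100)
y = 0.3 * np.sin(np.exp(3 * x)) + 0.5          # responses already inside (0.1, 0.9)

N = 50
W = []
for idx in np.array_split(np.arange(100), N):  # consecutive pairs: 2x2 local systems
    A = np.column_stack([x[idx], np.ones(len(idx))])
    wk, *_ = np.linalg.lstsq(A, inv_sigmoid(y[idx]), rcond=None)
    W.append(wk)
W = np.column_stack(W)                         # 2 x N first-layer weights
H = sigmoid(np.column_stack([x, np.ones(100)]) @ W)
B = np.column_stack([H, np.ones(100)])
v, *_ = np.linalg.lstsq(B, y, rcond=None)
mae = np.abs(B @ v - y).mean()
```

Each neuron contributes one sigmoid "bend", which is the geometric picture of Fig. 5: more neurons allow more curvature alterations.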

Regression in R n
We assess the performance of ANNbN by using four nonlinear functions with singularities and folding points, distorted with five cases of noise each (Fig. 6).
Particularly, we use the Gomez-Levy function L(x_1, x_2) = 4x_1² − 2.1x_1⁴ + (1/3)x_1⁶ + x_1x_2 − 4x_2² + 4x_2⁴, subjected to the constraint −sin(4πx_1) + 2 sin²(2πx_2) ≤ 1.5, in order to check the irregularity in the boundaries, a polynomial of five variables, the Shekel function with m_a = 10 maxima in n = 25 dimensions, c = (c_i)_{i∈[1,2,…,m_a]}, c_i ~ U(0, 1), and a = (a_ij)_{i∈[1,2,…,m_a], j∈[1,2,…,n]}, with a_ij ~ U(−1/2, 1/2), and the highly nonlinear Griewank function (Griewank 1981), G(x) = 1 + Σ_{i=1}^{n} x_i²/4000 − Π_{i=1}^{n} cos(x_i/√i). In all cases, we use m = 10⁴ observations for the train as well as the test sets, while we add noise in the train set only, in order to assess the capability of the method to approximate the signal and not the noise. Particularly, we use the following cases: (1) Uniform, (2) Normal, (3) Generalized Pareto, (4) Log-Normal, and (5) a mixture of Log-Normal, Exponential, and Frechet distributions (by sub-sampling from each and concatenating the samples), in order to investigate a variety of noise distributions. All the noise vectors are normalized to have zero mean and mean absolute value equal to 5% of the mean value of the target y. The target is normalized in [0.1, 0.9]. For the approximation with ANNbN, we use a Gaussian kernel with shape parameter c = 10 for the RBFs, and N = ⌊m/50⌋ = 200 neurons, because in the output layer a regression is performed (Eqs. 4, 7), and it is important to have a high ratio of observations vs neurons (i.e. 50). We compare the performance with other methods, i.e. Random Forests (Breiman 2001), as implemented in Sadeghi (2013), XGBoost (Xu and Chen 2014; Ruder 2016), and AdaBoost from Scikit-Learn (Scikitlearn 2016). We use the Mean Absolute Percentage Error (MAPE) as a metric. The results are presented in Fig. 6, indicating the high accuracy attained with ANNbNs. We should note that the solution exhibited low sensitivity regarding the partitioning.
In particular, by changing the initial seeding algorithm for k-means clustering, for the polynomial function studied, the MAPE varied from 0.00277 (Kmpp algorithm; Arthur and Vassilvitskii 2007) to 0.00268 (KmCentrality algorithm; Park and Jun 2009), while with random seeding the MAPE was 0.00278.
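For concreteness, the benchmark set-up described above (test functions, noise normalization to 5% of the target mean, target scaling to $[0.1, 0.9]$, and the MAPE metric) can be sketched in Python. This is an illustrative sketch only: the function and variable names are ours, and the Shekel parameters are drawn as described in the text.

```python
import numpy as np

def gomez_levy(x1, x2):
    # objective of the Gomez-Levy problem, as given in the text
    return 4*x1**2 - 2.1*x1**4 + x1**6/3 + x1*x2 - 4*x2**2 + 4*x2**4

def griewank(x):
    # Griewank (1981); x is an (m, n) array
    i = np.arange(1, x.shape[1] + 1)
    return 1 + np.sum(x**2, axis=1)/4000 - np.prod(np.cos(x/np.sqrt(i)), axis=1)

def shekel(x, a, c):
    # Shekel-type function with m_a maxima; a: (m_a, n), c: (m_a,)
    return np.sum(1.0/(np.sum((x[:, None, :] - a[None])**2, axis=2) + c[None]),
                  axis=1)

rng = np.random.default_rng(0)
m, n, ma = 10_000, 25, 10
a = rng.uniform(-0.5, 0.5, size=(ma, n))   # a_ij ~ U(-1/2, 1/2)
c = rng.uniform(0.0, 1.0, size=ma)         # c_i ~ U(0, 1)
X = rng.uniform(-0.5, 0.5, size=(m, n))
y = shekel(X, a, c)

# noise normalized to zero mean and mean |noise| = 5% of mean(y)
eps = rng.normal(size=m)
eps -= eps.mean()
eps *= 0.05*np.abs(y.mean())/np.abs(eps).mean()
y_train = y + eps

# target normalized to [0.1, 0.9]
y01 = 0.1 + 0.8*(y_train - y_train.min())/(y_train.max() - y_train.min())

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error
    return np.mean(np.abs((y_true - y_pred)/y_true))
```

The same recipe applies to the other noise distributions (Uniform, Generalized Pareto, Log-Normal, and the mixture) by swapping the sampler for `eps`.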

Classification for computer vision
As highlighted in the introduction, the reproducibility of AI research is a major issue. We utilize ANNbN for the MNIST database (LeCun et al. 1998, 2010), obtained from Shindo (2015), consisting of $6 \times 10^4$ handwritten integers in $[0, 9]$ for training and $10^4$ for testing. The investigation regards a variety of ANNbN formulations, and the comparison with other methods. In particular, $\mathrm{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2}\,dt$ and $\sigma(x) = \frac{1}{1+e^{-x}}$ were utilized as activation functions, with the corresponding inverses $\mathrm{erf}^{-1}(x)$ and $\sigma^{-1}(x)$ in Eq. 1. We constructed ANNbNs with one and multiple layers, varying the number of neurons and the normalization domain of $y$. The results regard separate training for each digit. All results in Table 1 are obtained without any clustering. We consider as an accuracy metric the percentage of the Correctly Classified (CC) digits, divided by the number of observations $m$: $a = 100\,\frac{CC}{m}\,\%$. This investigation aimed to compare ANNbN with standard ANN algorithms such as Flux (Innes 2018), as well as Random Forests as implemented in Sadeghi (2013), and XGBoost (Xu and Chen 2014). Table 1 presents the results in terms of accuracy and computing time. The models are trained on the raw data set, without exploiting any spatial information. The results in Table 1 are exactly reproducible in terms of accuracy, as no clustering was utilized and the indexes are taken into account in ascending order. For example, the running time to train 5000 neurons is 29.5 s on average for each digit, which is fast, considering that the training regards 3,925,785 weights, for $6 \times 10^4$ instances and 784 features (200 rounds). Future steps may include data preprocessing and augmentation, as well as exploitation of spatial information as in CNNs. Furthermore, we may achieve higher accuracy by utilizing clustering for the training of the Neighborhoods, Ensembles, and other combinations of ANNbNs.
Higher accuracy may also be attained by exploiting data preprocessing and augmentation, spatial information, and further training of the initial ANNbN with an optimizer such as stochastic gradient descent. No GPU or parallel programming was utilized, which might also be a topic for future research. For example, the RBF implementation of ANNbN with clustering and $1.2 \times 10^4$ neurons exhibits a test-set accuracy of 99.7% for digit 3. The accuracy results regard the out-of-sample test set with $10^4$ digits. The running time was measured on an Intel i7-6700 CPU @ 3.40 GHz with 32 GB memory and an SSD hard disk. A computer code to feed the calculated weights into Flux (Innes 2018) is provided.
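The activation functions and the accuracy metric used above can be written compactly; the following is an illustrative Python sketch (names are ours; `scipy.special` supplies erf and its inverse), not the released Julia implementation.

```python
import numpy as np
from scipy.special import erf, erfinv  # erf and its inverse, used in Eq. 1

def sigmoid(x):
    # logistic activation sigma(x) = 1 / (1 + e^{-x})
    return 1.0/(1.0 + np.exp(-x))

def sigmoid_inv(y):
    # inverse (logit); y must lie strictly in (0, 1),
    # hence the normalization of the target away from 0 and 1
    return np.log(y/(1.0 - y))

def accuracy(y_true, y_pred):
    # a = 100 * CC / m (%), with CC the number of correctly classified digits
    return 100.0*np.mean(np.asarray(y_true) == np.asarray(y_pred))
```

The inverse activations map the normalized targets back to the pre-activation domain, which is why the targets must stay strictly inside the open unit interval.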

Solution of partial differential equations
We consider the Laplace equation in a rectangle with dimensions $(a, b)$, and boundary conditions $f(0, y) = 0$ for $y \in [0, b]$, $f(x, 0) = 0$ for $x \in [0, a]$, $f(a, y) = 0$ for $y \in [0, b]$, and $f(x, b) = f_0 \sin(\frac{\pi}{a} x)$ for $x \in [0, a]$. In Fig. 7a, the numerical as well as the exact solution are presented. The MAE between the closed-form solution and the numerical solution with ANNbN was found to be $3.97 \times 10^{-4}$. Interestingly, if we add some random noise to the zero source, i.e. $\nabla^2 f = \varepsilon$ with $\varepsilon \sim U(0, \frac{1}{10})$,
the MAE remains small, in particular $2.503 \times 10^{-3}$, for $a = b = 1$, over a rectangular grid of points with $dx = dy = 0.02$. It is important to underline that numerical methods for the solution of partial differential equations are highly sensitive to noise (Mai-Duy and Tran-Cong 2003; Bakas 2019), as it corrupts the derivatives. However, utilizing the ANNbN solution, the results change only slightly, as the above errors show. This is further highlighted if we utilize the calculated weights of the ANNbN approximation and compute the second-order partial derivatives of the solution $f$ of Eq. 18, $\frac{\partial^2 f}{\partial x^2}$ and $\frac{\partial^2 f}{\partial y^2}$: the corresponding MAE for the second-order partial derivatives is $6.72 \times 10^{-4}$ (Fig. 7b), which is about two orders of magnitude less than the added noise, $E(U(0, \frac{1}{10})) = 0.05$, implying that ANNbN approximates the signal and not the noise, even in PDEs, and even with a stochastic source.
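For reference, the closed-form solution of this boundary-value problem, against which the MAE above is computed, is the standard separable solution $f(x, y) = f_0 \sin(\pi x/a)\,\sinh(\pi y/a)/\sinh(\pi b/a)$. A short Python sketch evaluates it on the grid used in the text (names are ours):

```python
import numpy as np

def laplace_exact(x, y, a=1.0, b=1.0, f0=1.0):
    # separable solution of the Laplace equation matching the four
    # boundary conditions: zero on three sides, f0*sin(pi*x/a) at y = b
    return f0*np.sin(np.pi*x/a)*np.sinh(np.pi*y/a)/np.sinh(np.pi*b/a)

# grid with dx = dy = 0.02 over the unit square, as in the text
a = b = 1.0
x = np.linspace(0.0, a, int(round(a/0.02)) + 1)
yv = np.linspace(0.0, b, int(round(b/0.02)) + 1)
X, Y = np.meshgrid(x, yv)   # rows index y, columns index x
F = laplace_exact(X, Y, a, b)

def mae(f_num, f_exact):
    # mean absolute error between a numerical and the exact solution
    return np.mean(np.abs(f_num - f_exact))
```

A numerical ANNbN solution evaluated on the same grid would be compared through `mae`.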

Discussion and conclusions
As described in the formulation of the proposed method, we may use a variety of ANNbNs, such as Sigmoid or Radial Basis Function schemes, Ensembles of ANNbNs, Deep ANNbNs, etc. The method adheres to the theory of function approximation with ANNs, as per visual representations of ANNs' capability to approximate continuous functions (Nielsen 2015; Rojas 2013). We explained the implementation of the method in the presented illustrative examples, which may be reproduced with the provided computer code. In general, Sigmoid functions are faster, RBFs are more accurate, and Ensembles of either sigmoid or RBF schemes handle noisy data sets better. RBFs may use matrices smaller than $N = \lfloor \frac{m}{n+1} \rfloor$, and hence approximate data sets with limited observations and many features. The overall results are stimulating in terms of speed and accuracy, compared with state-of-the-art methods in the literature.
The approximation of the partial derivatives and solution of PDEs, with or without a noisy source, in a fast and accurate setting, offers a solid step towards the unification of Artificial Intelligence algorithms with Numerical Methods and Scientific Computing. Future research may consider the implementation of ANNs in specific AI applications such as pattern recognition in environmental data and remote sensing observations, hydrometeorological predictions, regression analysis, and the solution of other types of PDEs for environmental modelling and risk assessment. Furthermore, the investigation of sigmoid functions other than the logistic, such as $\tanh$, $\arctan$, $\mathrm{erf}$, softmax, etc., as well as other RBFs, such as multiquadric, integrated, etc., and the selection of an optimal shape parameter for even higher accuracy, are also of interest. Finally, while the computation of weights is very fast, the algorithm may easily be converted to a parallel one, as the weights' computation for each neuron requires solving $N$ linear systems with matrices $\textbf{X}_k$.
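To illustrate why the per-neuron solves parallelize trivially, the following Python sketch fits one small least-squares system per neighbourhood. It is a simplified stand-in (1-D sorting instead of k-means clustering, plain linear features instead of the activation mapping of Eqs. 4, 7), not the authors' implementation; all names are ours.

```python
import numpy as np

def local_weights(X, y, k):
    """Partition the data into k neighbourhoods and solve one small
    least-squares system X_k w_k = y_k per neuron. Each solve is
    independent of the others, so the loop parallelizes trivially."""
    idx = np.argsort(X[:, 0])        # crude 1-D partitioning stand-in for k-means
    parts = np.array_split(idx, k)
    weights = []
    for p in parts:
        Xk = np.column_stack([np.ones(len(p)), X[p]])   # bias + features
        wk, *_ = np.linalg.lstsq(Xk, y[p], rcond=None)
        weights.append(wk)
    return parts, weights

# toy data: an exactly linear target, recovered exactly in every neighbourhood
rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
parts, W = local_weights(X, y, k=20)
```

Distributing the loop body over threads, processes, or GPU kernels is straightforward precisely because no solve reads another solve's result.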
Interpretable AI is a modern demand in Science, and ANNbNs are inherently suitable for this purpose: by checking the approximation errors of the neurons in each cluster, one may retrieve information on the local accuracy, as well as on local and global nonlinearities in the data. Furthermore, as demonstrated in the examples, the method performs well on small data sets without overfitting, approximating the signal and not the noise, which is a common problem of ANNs.
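Such a per-cluster diagnostic can be sketched in a few lines of Python (a hypothetical helper, not part of the released code): clusters with large local errors flag regions of strong nonlinearity or poor local fit.

```python
import numpy as np

def per_cluster_errors(y_true, y_pred, labels):
    # mean absolute error of the fit within each cluster/neighbourhood;
    # labels[i] is the cluster index of observation i
    return {c: np.mean(np.abs(y_true[labels == c] - y_pred[labels == c]))
            for c in np.unique(labels)}
```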

Appendix I: Computer code
The presented method is implemented in the Julia (Bezanson et al. 2017) and Python (Python 2001-2021) languages. The corresponding computer code is available at https://github.com/nbakas/ANNbN.jl

Author Contributions All authors contributed equally to the formulation of the method, as well as the numerical experiments and paper writing.

Funding Open access funding provided by HEAL-Link Greece. The contribution of Andreas Langousis has been conducted within the project PerManeNt, which has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE (Project code: T2EDK-04177). The contribution of Nikolaos Bakas has been conducted within the EuroCC Project (GA 951732) and EuroCC 2 Project (101101903) of the European Commission. Parts of the runs were performed on the MeluXina (https://docs.lxp.lu/) as well as Cyclone (https://hpcf.cyi.ac.cy/) supercomputers.

Declarations
Competing Interests The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.