Computational characteristics of feedforward neural networks for solving a stiff differential equation

Feedforward neural networks offer a possible approach for solving differential equations. However, the reliability and accuracy of the approximation still represent delicate issues that are not fully resolved in the current literature. Computational approaches are in general highly dependent on a variety of computational parameters as well as on the choice of optimisation methods, a point that has to be seen together with the structure of the cost function. The intention of this paper is to make a step towards resolving these open issues. To this end, we study here the solution of a simple but fundamental stiff ordinary differential equation modelling a damped system. We consider two computational approaches for solving differential equations by neural forms. These are the classic but still actual method of trial solutions defining the cost function, and a recent direct construction of the cost function related to the trial solution method. Let us note that the settings we study can easily be applied more generally, including solution of partial differential equations. By a very detailed computational study, we show that it is possible to identify preferable choices to be made for parameters and methods. We also illuminate some interesting effects that are observable in the neural network simulations. Overall we extend the current literature in the field by showing what can be done in order to obtain useful and accurate results by the neural network approach. By doing this we illustrate the importance of a careful choice of the computational setup.


Introduction
Differential equations occur in many fields of science and engineering and represent a useful description of many physical phenomena.They are usually formulated as initial or boundary value problems, where conditions at the beginning of a process or at boundary points are given to obtain one specific solution.It is useful to approximate differential equations by numerical methods [1], [2] like finite difference methods, and also neural networks have been applied to this end, see e.g., [3,4,5].
The variety of possible neural network architectures is immense [6].Already in classic works in the field, feedforward neural networks have proven to be useful for solving differential equations [7,8].Within this framework of feedforward neural networks (from now on denoted here simply as neural networks), two particular approaches have been investigated in the literature within the last decades that appear to be very promising.
The trial solution (TS) method has been proposed for the approximation of a given differential equation, which we abbreviate here as TSM [9].The TS, also called neural form in [9,10], has to contain the neural network output and has to satisfy given initial or boundary conditions by construction.Under the latter conditions there are multiple different possible forms of the TS for the approximation of a differential equation.Recently a systematic construction approach for the TS has been proposed [10].However, as indicated in the latter work, the TS construction may become difficult to realise for complex problems.We will follow here the original approach proposed in [9].Let us note that the same TS structure is used for example in the recent Legendre neural network [11].It appears evident that our investigation may also be useful in the context of such extensions.
Recently published in 2019, an approach has been proposed to avoid finding a TS [12] as this may be an intricate ingredient of the TSM.Because the corresponding method is motivated and technically related to the TSM, we call it here modified trial solution method (mTSM).Instead of building the cost function by use of the TS that meets conditions imposed on a differential equation, the approximating solution function is set in [12] to be the neural network output directly.The latter does not satisfy given initial or boundary conditions by construction as in TSM, but these are added as additional terms in the cost function.
In the mentioned works, both TSM and mTSM have proven to be capable of solving ordinary (ODEs) and partial differential equations (PDEs) as well as systems of ODEs and PDEs.This has been demonstrated for several examples and even complex simulations, showing the potential of the methods to obtain high-quality results.This has motivated us to consider in higher detail some of the computational issues that arise in the application of these methods in a first study [13].Let us mention here also a recent complementary work where the activation functions are subject of a computational study [14].
Despite these promising developments, there are still many open questions related to both TSM and mTSM.First of all, in the original work [12] an emphasis was layed on the new construction principle and the application of the proposed method in a cosmological context.However, one may wonder about the direct comparison of TSM and mTSM in terms of quality of results as well as in the related computational aspects.Let us stress in this context, that the original works [9,10,12] mainly describe the network architecture and elaborate on the TS and mTSM construction, but they do not contain more details of the computational characteristics of the methods.Yet it turns out that it is not trivial to define a computational framework that gives competitive results.
Our contribution.In this work we build upon our first parameter study in [13] and extend the investigations.Following the basic line of the first work, we study here the variance between the exact solution of an ordinary differential equation and the approximations provided by TSM and mTSM.As one apparent difference to the proceeding in our previous conference paper, we extend here the investigation w.r.t. the number of training points, and we give many more details in the evaluation.We also give here additional and as it turns out meaningful experiments concerned with the roles of neural network weight initialisation, number of hidden layers and number of hidden layer neurons.We perform several experiments on the variety of parameters related to the differential equation, neural network and optimisation methods.
Let us stress that the amount of parameters for the differential equation, neural network and optimisation is numerous.Our contribution in the main part of this paper is a study of the variation on (i) Weight initialisation methods, (ii) Number of hidden layer neurons, (iii) Number of hidden layers, (iv ) Number of training epochs, (v ) Stiffness parameter and domain size, (vi) Optimisation methods, and especially their mutual dependence.Let us note that it has turned out to be a nontrivial task to set up a meaningful proceeding that gives an account of the latter aspect.We consider the evaluation presented here as a number of carefully chosen experiments that are in many respects related to each other.
For investigating the computational characteristics, we consider a simple while important stiff ODE model equation [15] with a damping behaviour for studying the stability and reliability of both methods.We also present here as another contribution a detailed study of the influence of the stiffness parameter contained in the ODE.Let us note that a similar solution behaviour is to be expected when resolving for instance parabolic PDEs.

Neural network architecture
Neural networks are usually pictured as neurons (circles) and connecting weights (lines).Fig. 1 shows the standard neural network architecture for our experiments.It consists of three layers and features one input layer neuron for x ∈ D ⊂ R (where D denotes the domain) with one bias neuron which can be considered as an offset, five hidden layer neurons and one linear output layer neuron.In experiments on the number of hidden layers and the number of hidden layer neurons, the architecture is extended.Each neuron is connected with every single neuron in the Neural network architecture as usually in many of our experiments featuring one input layer, one hidden layer with sigmoid activation functions and one output layer next layer by the weights w j , u j and v j , j = 1, . . ., 5, which are stored in the weight vector p.The input layer passes the domain data x, weighted by w j and u j to the hidden layer for processing.The processed data is then, now weighted by v j , sent to the output layer in order to generate the neural network output N (x, p).That means in detail, the hidden layer receives the weighted sum z j = w j x + u j as input and processes this data by the sigmoid activation function σ j = σ(z j ) = 1/(1 + e −zj ).Since the output layer consists of a linear neuron, the neural network output is generated by the linear combination The sigmoid activation function is a continuous and arbitrarily often differentiable function with values between 0 and 1.
Let us note, that in order to solve differential equations of order n with neural networks, it is important to choose an activation function, which is at least (n + 1) times continuously differentiable, since the later shown solution approaches require the n-th activation function derivative and the optimisation methods require another differentiation.
The universal approximation theorem [16] states, that one hidden layer with a finite number of sigmoidal activation functions is able to approximate every continuous function on a subset of R.
In general, it is common to initialise the weights with small random values [17], therefore the first computation of N (x, p) returns a random value.This value is used to compute the cost or loss function E[ p] which is then subject to optimisation.With the first random output of N (x, p), the optimisation will return different weight updates when starting several computations with exactly the same computational parameters (but random weight initialisation).
Another option is to choose the initialisation to be constant when starting several computations.That is, N (x, p) first returns always the same value, and therefore with the weight updates to be constant as well, computations with same parameter setting return equal results.
For supervised learning, where both input data x i (representing the discrete domain or grid) and correct output data d i , i = 1, . . ., n, are known, the cost function may be chosen as the squared l 2 -norm while in case of unsupervised learning, where no correct output data is known, the cost function is part of the modelling process.Our approach follows the latter track.

Solution approaches
In this section, we will describe the trial solution construction for TSM and mTSM more in detail, as well as the approaches on how to make use of neural networks in order to solve ordinary differential equations (ODEs) in form of with given initial or boundary conditions.In Eq. (1), u(x) denotes the exact solution function with x as independent variable.Although G denotes a first order ODE, let us note again that it is also possible to solve higher order ordinary or partial differential equations (PDEs), as well as systems of ODEs or PDEs, cf.[9,12].

Trial solution method (TSM)
Let us now recall the approach from [9].In order to satisfy initial or boundary conditions, the TS is constructed to satisfy these conditions and is therefore written as a sum of two terms In Eq. (2), A(x) is supposed to satisfy the initial or boundary conditions at the initial or boundary points, while B(x) is constructed to become zero at these points to eliminate the impact of N (x, p) there.That is, the TS may be defined in many possible forms for one differential equation, satisfying the mentioned conditions.Especially the choice of B(x) determines the impact of N (x, p) over the domain.Now, the TS transforms Eq. ( 1) into so that the partial derivative of the trial solution with respect to input x which we have to consider is In order to generate training points for the neural network, we discretise the domain D by a uniform grid with n grid points x i .Over this discrete domain, Eq. ( 3) is now solved by an unconstrained optimisation problem using the cost function .

Modified trial solution method (mTSM)
This method, proposed in [12], introduces as a TS directly for all differential equations.Therefore u t does not satisfy initial or boundary conditions by construction as in (2), they rather appear in the cost function as additional terms where K(x m ), m = 1, . . ., l, denote the initial or boundary conditions.The modelled cost function is now subject to optimisation with respect to the adjustable neural network weights p.

Optimisation
For cost function minimisation we use first order methods, based on gradient descent.A commonly employed, simple optimisation technique is backpropagation, which uses the cost function gradient with respect to the neural network weights to determine their influence on N (x, p) and to update them.It is well-known that backpropagation enables to find a local minimum in the weight space.The training is usually done several times with all training points.After one complete iteration through all input data, one epoch of training is done and for efficient training (finding a minimum in the weight space), several epochs of training are performed.For the k -th epoch, backpropagation with momentum update rule [18] reads as .
Since only the neural network output N (x, p) and the derivative w.r.t.x depend on p and as both expressions are given, the corresponding derivatives used in the gradient of the cost function are computed as and The momentum term in Eq. ( 4), with momentum parameter β, uses impact from last epoch to reduce the chance of getting stuck too early during training in a local minimum or at a saddle point.The learning rate α in general, is a scaling factor for the gradient and has major influence on the update.A very basic approach is to choose α as a constant learning rate (cBP).In order to prevent the optimiser from oscillating around a minimum one may employ a variable learning rate (vBP) as an alternative.Different approaches for learning rate control exist [19], we opt to employ the linear decreasing model with an initial learning rate α 0 , a final learning rate α e and an epoch cap k c .
In our experiments we also consider Adam (adaptive moment estimation) which is an adaptive optimisation method.It uses estimations of first (mean) and second (uncentered variance) moments of the gradient, see [20] for details.An advantage of Adam is the potential for achieving rapid training speed.While backpropagation scales the gradient uniformly in every direction in weight space (by α), Adam computes an individual learning rate for every weight.

Experiments and results
For experiments on both solution approaches with different parameter variations, as well as optimisation with Adam and backpropagation, we make use of the model problem a homogeneous first order ordinary differential equation with λ ∈ R, λ < 0. The ODE (5) has the exact solution u(x) = e λx and respresents a simple model for stiff phenomena involving a damping mechanism.The numeric error ∆u shown in subsequent diagrams is defined as the l 1 -norm of the difference between the exact solution and the corresponding trial With u t (x, p) = 1 + xN (x, p) we take the form of the trial solution for TSM proposed in [9] to construct the cost function .
For mTSM the trial solution u t (x, p) = N (x, p) results in the cost function In subsequent experiments we study ∆u with respect to several, meaningful variations of computational parameters.
The main parameters and abbreviations of the computational settings are defined as in Table 1.We use our own Fortran implementation for the neural network, the solution approaches and the optimiser, by following the proposed methods in the corresponding papers, without the use of deep learning libraries.Therefore we have total control over the computations and are able to perform investigations related to every aspect of the methods and the code.
Concerning the following experiments, let us stress again that these are not considered to be separate or independent of each other.We will consequently follow a line of argumentation that enables us (i) to reduce step by step the degrees of freedom in the choice of computational settings, and (ii) to clarify the influence of individual computational parameters.In doing this we also demonstrate how to achieve tractable results.We consider this as an important part of our work since this makes the whole approach more meaningful.

Experiment 1: Weight initialisation
This experiment illustrates differences between the two weight initialisation methods, employing either p init const or p init rnd .We averaged 100 iterations for displaying each point in the graph depicting ∆u rnd , implying that for the values given at the lower axis we perform computations with 100 overlaid random perturbations, with random numbers in range of 1e-2, as initialisation around each point.The averaging is important to mention, because every iteration with p init rnd and exactly the same parameter setup, is expected to return different results.
Let us first comment on our choice of p init const .Evidently, one has to choose here some fixed value, and by further experiments not documented here in detail, the value zero appears to be a suitable generic choice for mTSM.
Let us now consider the experiments documented in Fig. 2. In general, TSM with both cBP and Adam (see illustrations (a)-(f)) does not return helpful results for p init const with the current parameter setup.All experiments for TSM with p init const give here uniformly a very high error (depicted by orange/solid lines), even when increasing the number of training points.
Turning to mTSM, the overall clearly best results for p init const are provided by using Adam and ntP=40 in terms of the largest stable region.The results demonstrated in all of the experiments for mTSM with p init const show that the Adam solver provides a desirable proceeding, virtually independently of the number of training points.When considering p init rnd , the Adam solver gives also for TSM reasonable results in terms of the numerical error with a large stable region.A similar but less clear error behaviour can be observed for using Adam with mTSM.As a general trend in all experiments with p init rnd , we observe that weight initialisation in a small range around zero seems to work best.
Let us also comment on illustrations (j)-(l), that we observe here the behaviour that both p init const and p init rnd around zero seem to work reasonably with cBP.One may conjecture for other example ODEs, that there could be some constant initialisation and a range of random fluctuations around it that may work well.
Since a suitable choice of p init const and p init rnd is important in all subsequent experiments, we decided as a consequence of the experiments discussed here to initialise p init const with zeros and p init rnd with random values in range of 0 to 1e-2 from now on.

Experiment 2: Number of hidden layer neurons
The behaviour of ∆u const and ∆u rnd when increasing the number of hidden layer neurons is subject to this experiment, where ∆u rnd is averaged over 100 computations for every tested number of hidden layer neurons.
There is almost no difference between the experiments for TSM in Fig. 3 (a)-(f), they all show a similar saturating behaviour.As discussed in the previous experiment, it is clear that we have to focus here on the random initialisation, and for this setup we observe here desirable results for about five or more neurons.
Turning to mTSM, a higher number of hidden layer neurons leads to an increase in accuracy for Adam for larger numbers of training points explored here (ntP=40).For smaller numbers of training points (ntP=10,20) we observe here that the number of hidden layer neurons and thus the degrees of freedom introduced by the neural network should be in a relatively small range, e.g., about half the amount of training points.
Also for cBP the saturation value of the error is affected by increasing the amount of hidden layer neurons.The general trend for ∆u rnd is that slightly higher accuracy is provided in this way, and that the saturation level is visible already when using a small number of neurons.
Generally in all cases, one can clearly observe the benefit of introducing two to three or more neurons, as this leads to a significant drop in all computed numerical errors.
As a consequence of these investigations, we employ five hidden layer neurons in the other experiments (note that this setting has also been used in the previous experiment) as this appears to be justified by the stable solutions and the amount of computational time.

Experiment 3: Number of hidden layers
In order to focus on the impact of the number of hidden layers, we decided here to keep the number of neurons in the hidden layers constant, employing five neurons plus an additional bias neuron in each layer.As in previous experiments, ∆u rnd is averaged by 100 iterations.
Results in Table 2 show that one hidden layer is not always enough to provide useful results, especially for ∆u const and TSM.Increasing the number of training points (ntP) changes the number of hidden layers that give the best approximation in some cases, but it does not seem to have in general a highly beneficial influence.
Turning to the most important aspect of our investigation in this experiment, one has to distinguish the effect of increasing the number of hidden layers with respect to the individual methods mTSM and TSM.For the mTSM we find that one or two layers are sufficient to obtain -together with Adam optimisation -accurate and convenient results.Considering TSM our study shows a very different result, namely that each increase in the number of hidden layers up to about three or four makes up one order of accuracy gain, for p init const and p init rnd .The latter result appears to be to some degree surprising, as the universal approximation theorem should imply that one hidden layer could be enough to give here experimentally an accurate approximation of our solution function.Let us recall in this context Experiment 5.2, where we have seen that an increase of the number of neurons in one hidden layer leads to a saturation in the accuracy for p init rnd , while we observe here a clear improvement.Increasing the number of neurons and using p init const did not lead to reasonable results there, while p init const here in combination with more hidden layers gives good results plus a significant improvement in the current study.
As a consequence of this investigation, we decided to use one hidden layer for all computations in the

Experiment 4: Number of epochs
In this experiment we aim to investigate if it is possible to fix the maximal number of training epochs to a convenient value.This relates to the question if one could bound the computational load by employing in  2.33e-2 2.17e-3 2.37e-4 3.01e-4 1.55e-3 5.84e-5 4.68e-4 5.23e-5 3 2.01e-3 1.16e-1 3.28e-4 1.16e-1 2.25e-3 1.80e-4 4.91e-4 1.19e-4 4 3.15e-1 1.16e-1 3.15e-1 1.16e-1 3.14e-1 2.29e-4 3.13e-1 2.94e-2 5 3.15e-1 1.16e-1 3.15e-1 1.16e-1 3.14e-1 9.46e-2 3.13e-1 1.03e-2 general a small number of training cycles.To this end, we consider the convergence of the training as a function of an increasing maximal number of epochs k max .In addition we illuminate the influence of the number of training points.More precisely, we increased k max from 1 to 1e5 and averaged 100 iterations for one and the same k max .Put in other words, and to make clear the meaning of the lower axis in Fig. 4, one entry of the number k max relates to 100 corresponding complete optimisations of the neural network.Let us note again, that in the case of p init rnd , the convergence behaviour can only be evaluated by average values, and that each computation was done with a new p init rnd .As can be seen in Fig. 4, best results are returned by mTSM with Adam for both p init rnd (especially ntP=20) and p init const (especially ntP=40).Except for TSM and ntP=10, the Adam optimiser clearly reaches a saturation regime showing convergence for TSM and mTSM with p init rnd .For cBP, ∆u const and ∆u rnd , still may decrease for even higher k max as evaluated here.However, let us note here that we employed in cBP a constant learning rate, for decreasing learning rates as often used for training we may expect that a saturation regime may be observed.However, with Adam, ∆u rnd shows a small fluctuat-In conclusion, we find that k max =1e5 as used for all other experiments is suitable to obtain useful approximations.Let us now investigate the solution behaviour with respect to interesting choices of the stiffness parameter λ, and it turns out that it makes sense to do this together with an investigation of the domain size of D. Let us note that informally speaking, these parameters also impact the general trend of the exact solution in a similar way so that it appears also from this point of view natural to evaluate them together in one experiment.
As shown in Fig. 6, the influence of different domains with increasing ntP is the objective of this experiment.Intervals used for computations are given in terms of x ∈ [0, x end ], with the smallest interval being x ∈ [0, 5e−2] and then increasing in steps of 5e-2.As also in the first experimental part here, ∆u rnd is averaged by 100 iterations for each domain.
Turning to the results, first we want to point out that for TSM, cBP and ntP=20 there are values displayed as ∆u rnd = 9e0, to visualise them.In reality, these values were Not a Number (NaN), which means, that at this point at least one of the 100 averaged iterations diverged for small values of λ in Fig. 5 (c), or large domains.
Furthermore, the solution accuracy for TSM and cBP is strictly decreasing for smaller λ and larger domains until it saturates in unstable regions.While increasing the number of training points from ntP=10 to ntP=20 some iterations diverged, another increase to ntP=40 enlarges the unstable region with a stabilisation in between.
In the total, we observe that there seems to be a relation between the experiments that one may roughly formulate as a relation between λ and domain size given by x end as a factor of −2.We also conjecture, that the higher the values of −λ and x end , the more neurons or layers are required for a convenient solution.As a consequence of these experiments, we decided to fix λ = −5 and x ∈ [0, 2] for all computations in the other experiments.

Experiment 6: Optimisation methods
The final experiment in this paper compares Adam, cBP and vBP optimisation for TSM and mTSM, depending on ntP=10,20,40 with the other computational parameters fixed to one hidden layer, five hidden layer neurons, k max = 1e5, λ = −5 and x ∈ [0, 2].Fig. 7 and 8 show 1e5 (non-averaged) computed results for each parameter setup and weight initialisation.
Previous experiments led to the conclusion, that TSM in combination with p init const only provides unstable solutions for the chosen parameter setup.Therefore, when evaluating TSM, we will only refer to the non-averaged numeric error ∆u rnd for p init rnd .To start the evaluation with TSM and Adam, there are almost no visible differences between ntP=10 and ntP=40, with a large difference between the best and the least good approximation, see first row in Fig. 7.Only for ntP=20 the solutions tend to be more similar.
In contrast, the difference between the best and the least good approximation for TSM and cBP grows by one order of magnitude with a higher number of training points while simultaneously the accuracy for the best approximations increases, cf.second row in the figure .The reason we show results on vBP only in this final experiment (see third row in the figure) is, that the efficiency of an adaptive step size method may be in general highly dependent on the used step size model and parameters.However, the results turn out to be interesting.In combination with ntP=10, vBP and TSM reveal several minima far away from the best approximation.Even more minima appear for a training points increase to ntP=20.However, another increase to ntP=40 stabilises the solutions.In addition, ntP=40 provides the best approximations for TSM and vBP.One may conjecture here, that either one has here to reach a critical number of training points, or that the weight initialisation here is not adequate together with lower ntP.Now we turn to mTSM and Adam, see first row in Fig. 8.We find p init const to show useful results (∆u const ) and a small gain in accuracy for higher ntP.For p init rnd , we find the best approximations throughout the whole experiment to be provided by ntP=10.However, most of the 1e5 computed results appear around a less good (but still reasonable) accuracy with only a few results peaking further in accuracy.Increasing the number of training points to ntP=20 and ntP=40 results in a drop of accuracy from the former best solutions, while overall the results become more similar.Turning to Table 3, we now discuss the stochastic quantities for p init rnd , related to the results shown in Fig. 7 and 8.We focus on the results for random weight initialisation as the diagrams have shown the constant weight initialisation to always return the same numerical error.The 1e5 complete computations (optimisations) should sufficiently support the meaning of the analysed data.
Regarding the mean value, Adam has the overall smallest value and seems to be the best choice.However, for TSM and ntP=40, cBP almost equals Adam with vBP outperforming Adam in this specific setting.That result is particular interesting, since vBP shows for TSM and both ntP=10 and ntP=20 very limited approximations.In contrast to TSM, Adam dominates for mTSM the lowest mean value without any exception.
The former statement however does only hold partially when it comes to the standard deviation.Here, TSM seems to favour cBP over Adam, again with vBP for ntP=40 to pass downwards.Excluding vBP for ntP=10 and ntP=20, the standard deviation in the other cases are in an acceptable range.That is, the mean value and standard deviation could be suitable when evaluating stability and reliability.However, it can be difficult to specify the term reliability.The lower the numerical error, the better the approximation.Nonetheless, defining a threshold needs justification and discussion on how the neural network a variety of parameters.We find the weight initialisation to have a major influence.While the initialisation with zeros does not provide reasonable appromixations for TSM with one hidden layer, it is capable to work reasonably well for mTSM.First setting the weights to small random values shows the best results with Adam and mTSM, although the use of more training points may yield less suitable results.This may indicate an overfitting and could be resolved by employing more neurons or other adjustments.This may be a subject for a future study.However, our work also indicates that all the investigated issues may have to be considered together as a complete package, i.e., the investigated aspects may not be evaluated completely independent of each other.Even after a detailed investigation as provided here it seems not to be possible to single out an individual aspect that dominates the overall accuracy and reliability.
We tend to favour the combination of Adam and mTSM in further computationally oriented research, since it provides the best approximations for both weight initialisation methods.Future research may also include theoretical work, e.g., on sensitivity and different trial solution forms for TSM.One main goal in this context is to decrease the variation of possible solutions together with an increase of the solution accuracy.
Moreover, our third experiment has shown that it may make sense to investigate deep networks, since these could result in a significant accuracy gain, reminding of higher order effects in classic numerical analysis.
Furthermore, our future work will include more difficult differential equations with a focus on initial value problems and the improvement of constant weight initialisation.

Table 1 :
Abbreviations in the experiment section