Preprocessing algorithms for the estimation of ordinary differential equation models with polynomial nonlinearities

The data analysis task of determining a model for an ordinary differential equation (ODE) system from given noisy solution data is addressed. Since modeling with ODEs is ubiquitous in science and technology, finding ODE models from data is of paramount importance. Based on a previously published parameter estimation method for ODE models, four related model estimation algorithms were developed. The algorithms are tested on 22 different polynomial ODE systems comprising 60 equations at various noise levels. Two algorithms frequently compute the correct model. They are compared to those algorithms of the prominent SINDy family that have simple default hyperparameters. This demonstrates that the two algorithms are comparable to SINDy and more resilient towards noise than the tested SINDy algorithms.

Modeling with ordinary differential equations (ODE) is fundamental for the natural sciences as well as for many research fields in engineering. Typically an ODE model is constructed by thoughtfully deriving formulas from preexisting knowledge of the field of interest. The data analysis task of inferring an ODE model from given noisy solution curves is still a grand challenge, although in the decades since an early publication [1] large and broad research efforts have been made. These are reviewed in the next paragraphs.
A huge amount of literature exists in the field of biochemical systems theory (BST) alone; it is reviewed by Engl et al. [2] and Voit et al. [3], and a text mining assisted overview is given by Loskot et al. [4]. The latter deals mainly with parameter estimation. Since data gathering in BST is extremely difficult, the prevailing incompleteness of the data typically allows parameter and state estimation only at a qualitative level, e.g. by using the underlying bifurcation scenario [2]. Therefore often only a qualitative judgment on the model is possible. Systematic procedures of model estimation are sparsity enforcing regularization methods with Lagrange parameters, like Tikhonov, maximum entropy and bounded variation regularization [2]. They are used to overcome the ill-posedness of the parameter estimation (i.e. overfitting) by eliminating terms from the ODE that are unnecessary to explain the data. However, many other approaches, such as evolutionary and genetic algorithms, particle-swarm optimization and Kalman filters, to mention only a few, are employed with parameter pruning strategies to determine the model [5]. Another line of research is gene regulatory network detection [6]. But due to the complications of the subject, models are still aggregated using preexisting knowledge about smaller submodels (bottom-up approach) or derived theoretically from a holistic consideration of the problem (top-down) [7]. The parameters are then inferred by maximum likelihood estimates or related methods.
ODE model estimation also has a long history in engineering under the label system identification [8], where normally exogenous inputs and observed signals are given. However, in system identification one rarely considers ODEs as in [8]; rather, one considers linear and nonlinear time series models, which arise if one multiplies a finite time difference form of a first-order ODE system in standard form by the time step. Then methods like NARMAX and many others [9] are applied. In order to reduce complexity and computational burden, one frequently tries to reduce the number of degrees of freedom by model order reduction [10]. There, mostly linear or linearizable systems are considered. For non-linearizable dynamical systems, which have multiple solutions for a fixed set of parameters, the spectral submanifold method (SSM) was recently developed [11].
A further line of development could be grouped under the name of computationally intense methods. The paper of Bongard [12] may be placed under this label, since it uses stochastic methods to generate models and subsequent automated testing and snipping methods to select the model. Particle filters are also employed to estimate nonlinear ODEs for periodic orbits [13]. Approximate Bayesian computation was successfully applied to select a model and its parameters out of five candidate models for mechanical vibration insulators of the wire rope type [14].
Likelihood-related methods outside BST should also be mentioned. Among them is the marginal likelihood method in [15]. The ODEion software [16] uses a log-likelihood corrected by an extra term for the number of parameters. Another method is to use a predefined restricted number of models and a special version of the Bayesian information criterion [17] to determine the best model. The degree of prior knowledge is also discussed outside the context of likelihood methods in a review written by the Saratov group [18].
In recent years the SINDy method developed by Brunton et al. [19] has gained considerable attention. It relies on LASSO regression [20], which constrains the regression problem by the L1-norm of the parameter vector using a Lagrange multiplier. It therefore belongs to the category of sparsity enforcing regularization methods with Lagrange parameters. Additional quantities used there in the process of model assessment are the Akaike information criterion (AIC) and Pareto analysis [21]. Recently, Michaelis-Menten terms were also included besides polynomials [22].
Artificial intelligence related methods are also employed for model estimation, but there often qualitative information or conceptual insight rather than precise knowledge of an equation system is sought. The treated problems, to mention only a few research lines, are dynamical model classification by support vector machines [23], the construction of neural networks that mimic human reasoning about dynamics [24] and the approximation of dynamical systems by deep neural networks [25]. Neural networks are, however, also used to select or create coordinates from high-dimensional data sets. Using the LASSO and a neural network, governing equations are determined that form a parsimonious dynamical model for high-dimensional data [26]. A review of the incorporation of physical equations into machine learning algorithms is given in [27]. This technique, called physics informed machine learning, is also combined with the SINDy method in [28]. To address the problem of a very low density of data points over time, E-SINDy was developed, which uses various bagging methods and was successfully applied to the lynx and hare data of the Hudson's Bay Company [29].
Compressed sensing [30] has had a huge impact on data evaluation in the past fifteen years and therefore also on model estimation for ODEs. A review is given in [31]. Frequently it is used in a way that minimizes the L1-norm term of the LASSO under the side condition of the linear regression objective function, and it then belongs to the class of sparsity enforcing regularization methods with Lagrange parameters.
The Lagrange parameter of the sparsity enforcing regularization methods mentioned here affects the result and is hard to determine. A recent paper concerning the problems with Lagrange multipliers and the LASSO is [32]. In the case of the LASSO the Lagrange parameter works as a cutoff [33] that eliminates terms of the ODE which have small parameters. But in nonlinear ODEs even terms with small parameters can affect the qualitative behavior of the model, so that without such terms the solution goes to infinity or spirals into a fixed point.
The method of this paper searches precise systems of equations for given data and builds upon a previously published parameter estimation method [34]. It uses the poor condition [35] of the overfitted regression problem that incorporates all monomials of a search space (Appendix A). By a special pruning strategy, which eliminates one superfluous term after the other, the terms that are affected most by the overfitted situation are removed first. Moreover, the method operates scale-free, so that terms in the model which have small parameters have a chance to survive. The algorithms use several hyperparameters, which are, however, easy to determine. This is demonstrated by using identical hyperparameter values for all examples of this paper.

The algorithms
The N-dimensional ODE systems considered in this paper have coordinate vectors x = (x_1, ..., x_N) and polynomial right-hand sides in N variables up to degree 3. Like all regressions with polynomial regressors, the algorithms presented here suffer from the fact described in [36] that the regression matrix comes closer to an extremely poorly conditioned Hilbert matrix the higher the degree of the polynomial approximation is. The admissible monomials g_i(x) in an N-dimensional search space and their parameter numbers are given in Appendix A. Let the integer α run from 1 to N. M_α is the set of the indexes i of the g_i(x) occurring in the ODE. Let p_{α,i} be the parameter associated with the monomial g_i(x(t)) of component α. Then one can write a component of the N-dimensional ODE system as

$$\dot{x}_\alpha(t) = \sum_{i \in M_\alpha} p_{\alpha,i}\, g_i(x(t)) \qquad (1)$$

or, integrated as in [37],

$$x_\alpha(t) - x_\alpha(0) = \sum_{i \in M_\alpha} p_{\alpha,i} \int_0^{t} g_i(x(\tau))\, d\tau . \qquad (2)$$

In order to be more in line with the statistical terminology [38], the linear sampling method (LIS) of [34] is relabeled as the linear subsampling method (LISS). For subsampling estimations as in [34], one chooses R randomly selected trial points x_α(t_r) from the data. For the standard form in Eq. 1 the tangent has to be estimated by linear regression as in [34], which yields denoised estimates $\hat{x}_\alpha(t_r)$ and $\dot{\hat{x}}_\alpha(t_r)$. With these results Eq. 1 can be transformed into a linear regression problem with a regression matrix A containing the elements $A_{ri} = g_i(\hat{x}(t_r))$ and a regressand vector $b_\alpha = (\dot{\hat{x}}_\alpha(t_1), \ldots, \dot{\hat{x}}_\alpha(t_R))$, i.e.

$$A p_\alpha = b_\alpha . \qquad (3)$$

For the integrated form in Eq. 2 one finds, without denoising,

$$A_{ri} = \int_0^{t_r} g_i(x(\tau))\, d\tau , \qquad b_\alpha = \left(x_\alpha(t_1) - x_\alpha(0), \ldots, x_\alpha(t_R) - x_\alpha(0)\right) . \qquad (4)$$

Pictorial representations of these linear regression equations can be found in [19,37]. The integrations were performed numerically with the trapezoid rule [39]. The regression matrix A is often called the feature matrix, since it contains all the monomial terms that can be selected during model estimation.
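The construction of the two regression problems can be sketched in a few lines of Python. This is an illustrative sketch, not the published implementation: the function names are hypothetical, and `np.gradient` serves only as a simple stand-in for the local linear-regression tangent estimation of [34].

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(X, degree=3):
    """All monomials g_i(x) of the coordinates up to the given total
    degree, ordered by degree with the constant term first."""
    n_pts, n_dim = X.shape
    cols = [np.ones(n_pts)]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(n_dim), d):
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(cols)

def standard_form_regression(t, X, alpha):
    """Regression matrix A and regressand b for the standard form
    (Eq. 1).  np.gradient is a simple stand-in for the local
    linear-regression tangent estimation of [34]."""
    A = monomial_features(X)
    b = np.gradient(X[:, alpha], t)
    return A, b

def integrated_form_regression(t, X, alpha):
    """Regression matrix and regressand for the integrated form
    (Eq. 2), using the trapezoid rule for the running integrals."""
    G = monomial_features(X)
    dt = np.diff(t)[:, None]
    increments = 0.5 * (G[1:] + G[:-1]) * dt
    A = np.vstack([np.zeros((1, G.shape[1])), np.cumsum(increments, axis=0)])
    b = X[:, alpha] - X[0, alpha]
    return A, b
```

For a two-dimensional system the feature matrix has ten columns, one per monomial of the cubic search space.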
Regression model selection using parameter distributions is not uncommon in statistical science and was proposed in a forward selection method by Taylor et al. [40]. For the present publication, however, the parameter distribution is gathered by subsampling as in [34]. The notation of the latter publication is also used here. Based on the parameter distributions arising from subsampling, like the one in Fig. 3 of [34], backward elimination [36] is performed. The distributions reflect the poor condition of the regression problem due to the overfitted situation that occurs when all monomials of a search space are present in the regression problem. From the best estimates $\hat{p}_{\alpha,s}$ of the sample with number s, s = 1, ..., S, one can form the arithmetic mean and standard deviation over all samples for the i-th component of the parameter vector $\hat{p}_{\alpha,s}$:

$$\bar{p}_{\alpha,i} = \frac{1}{S}\sum_{s=1}^{S}\hat{p}_{\alpha,i,s} , \qquad \sigma_{\alpha,i} = \sqrt{\frac{1}{S-1}\sum_{s=1}^{S}\left(\hat{p}_{\alpha,i,s}-\bar{p}_{\alpha,i}\right)^{2}} . \qquad (5)$$

In the spirit of Cohen's effect size [41] a ratio can be formed to check how strongly a parameter p_{α,i} differs from zero:

$$\frac{\left|0-\bar{p}_{\alpha,i}\right|}{\sigma_{\alpha,i}} < \rho := 3.0 . \qquad (6)$$
Throughout this paper a value of ρ = 3.0 is used, although slightly smaller values often give better results. Values close to ρ = 2.0 still work to some extent, but values of ρ = 4.0 do not work. Inequality 6 means that $\bar{p}_{\alpha,i}$ is inside the ±3σ_{α,i} interval around 0. A violation of inequality 6 corresponds to a 3σ-effect, which is a deterministic effect. Note that the quantity in Eq. 6 differs by a factor of the square root of the sample size from the pivot quantity used in hypothesis testing. Using such pivot quantities for the selection of model terms in deterministic ODEs turns out to be problematic, since extremely large confidence levels (larger than 99.999...%) would have to be used.
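The effect-size check of inequality 6 applied to subsampled least-squares estimates can be sketched as follows. This is a minimal illustration with hypothetical function names; the full LISS procedure of [34] additionally denoises tangents and coordinates.

```python
import numpy as np

def subsample_distribution(A, b, R=100, S=100, rng=None):
    """Subsampled parameter estimates: S least-squares fits, each on
    R randomly chosen rows of the regression problem A p = b."""
    rng = np.random.default_rng(rng)
    est = np.empty((S, A.shape[1]))
    for s in range(S):
        rows = rng.choice(len(b), size=R, replace=False)
        est[s], *_ = np.linalg.lstsq(A[rows], b[rows], rcond=None)
    return est

def deterministic_effects(est, rho=3.0):
    """Indices whose mean parameter lies outside the +/- rho*sigma
    interval around zero, i.e. violations of inequality 6."""
    mean = est.mean(axis=0)
    sigma = est.std(axis=0, ddof=1)
    return np.flatnonzero(np.abs(mean) > rho * sigma)
```

On a synthetic well-conditioned problem, only the truly nonzero parameters show such a deterministic effect.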
Covariance information can also be used to determine the effect of a monomial in an ODE from the subsampled parameter distribution. The scale-free covariance matrix Σ of the joint distribution of the parameter estimates can be formed. An element of this matrix is

$$\Sigma_{ij} = \frac{\operatorname{cov}(\hat{p}_{\alpha,i},\hat{p}_{\alpha,j})}{\sigma_{\alpha,i}\,\sigma_{\alpha,j}} . \qquad (7)$$

If one approximates the k-dimensional joint distribution of the parameters by a multivariate Gaussian with covariance matrix Σ, its entropy is given by

$$H(\Sigma) = \tfrac{1}{2}\ln\!\left((2\pi e)^{k}\det\Sigma\right) . \qquad (8)$$

Let Σ^{(n)} be the matrix that can be formed from Σ by removing the n-th row and n-th column, which reflect the parameter p_{α,i}. Then information is gained by excluding the parameter p_{α,i} if the entropy change caused by the removal exceeds a threshold governed by the elimination parameter ρ (inequality 9). The labels of the four algorithms are now specified in Table 1.
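The ingredients of the information-gain criterion can be sketched as follows: the scale-free (correlation) covariance matrix, the entropy of a multivariate Gaussian, and the entropy of the reduced distribution with one parameter removed. The threshold comparison of inequality 9 itself is not reconstructed here, and the function names are hypothetical.

```python
import numpy as np

def correlation_matrix(est):
    """Scale-free covariance (correlation) matrix of the subsampled
    parameter estimates, one estimate per row."""
    return np.corrcoef(est, rowvar=False)

def gaussian_entropy(C):
    """Differential entropy of a k-dimensional multivariate Gaussian
    with covariance C: 0.5 * ln((2*pi*e)^k * det C)."""
    k = C.shape[0]
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + logdet)

def entropy_without(C, n):
    """Entropy of the reduced Gaussian with the n-th row and column
    of C removed (the reduced matrix of the text)."""
    keep = [i for i in range(C.shape[0]) if i != n]
    return gaussian_entropy(C[np.ix_(keep, keep)])
```

For a strongly multicollinear correlation matrix the log-determinant becomes very negative, which mirrors the "exceedingly small entropy" failure mode discussed for INT_EN in the Discussion.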
With these labels the family of algorithms, which are labeled as model-LISS, is given as follows:
1. Perform a subsampling estimation as in [34] with all monomials of the search space present to obtain the parameter distributions.
2. Eliminate the parameter with the weakest effect according to the elimination criterion (inequality 6 or 9), one elimination at a time, and repeat the subsampling estimation, until every remaining parameter violates the criterion.
3. Output the numbers of the not excluded parameters, whose numbering is described in Appendix A, as well as the parameter means and standard deviations.
Note that these algorithms work only for deterministic data with small noise and not for typical statistical data. The main idea of the algorithms is that monomials that are present in a deterministic ODE have parameters that appear at a deterministic level in the subsampling distribution. This results in a violation of inequalities 6 and 9. The algorithms are pure parameter elimination procedures. Adding back already eliminated parameters was avoided, since this led to oscillating behavior of the algorithms and did not improve the correctness of the model estimation.
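A minimal backward-elimination loop in this spirit might look as follows. This is one plausible reading using only the effect-size criterion of inequality 6, not the published implementation [42]; the function name and the stopping details are assumptions.

```python
import numpy as np

def model_liss_sketch(A, b, rho=3.0, R=100, S=100, seed=0):
    """Backward elimination over subsampled effect sizes: after each
    elimination the subsampling is repeated; the surviving parameter
    with the weakest |mean|/sigma ratio is dropped until every
    remaining parameter shows a deterministic (>= rho*sigma) effect."""
    rng = np.random.default_rng(seed)
    active = list(range(A.shape[1]))          # indices of surviving columns
    while True:
        est = np.empty((S, len(active)))
        for s in range(S):
            rows = rng.choice(len(b), size=R, replace=False)
            est[s], *_ = np.linalg.lstsq(A[rows][:, active], b[rows],
                                         rcond=None)
        mean, sigma = est.mean(axis=0), est.std(axis=0, ddof=1)
        ratio = np.abs(mean) / sigma
        weakest = int(np.argmin(ratio))
        if ratio[weakest] >= rho or len(active) == 1:
            return active, mean, sigma        # all survivors are deterministic
        del active[weakest]
```

On synthetic data with two active and four superfluous columns, the loop recovers the active columns and their parameter values.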

Aggregated results for all test systems
The 22 test equation systems in Appendix B comprise 60 individually estimable equations. These were estimated with all four algorithms of Table 1 at various noise levels. The noise is applied to the entire solution curve by adding the output of a Gaussian random number generator independently to each coordinate x_α(t_k) of a data point x(t_k). For a component α of the ODE system and a noise percentage parameter q, the random generator had mean 0.0 and a standard deviation proportional to q and to the amplitude of the component (Eq. 10), where k runs over all points in time in the data. This ensures equal relative strength of the noise for all equations. The parameter q is varied from 0.1 to 2.0 percent in steps of 0.1. It should be noted that a noise parameter value of q = 2.0 turns the data points of a chaotic solution curve into a quite diffuse point cloud.
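A sketch of the noise generation. The maximum absolute value of a component is assumed here as the amplitude measure; the exact measure of Eq. 10 is defined in the paper's code [42].

```python
import numpy as np

def add_noise(X, q, rng=None):
    """Add component-wise Gaussian noise of equal relative strength.
    The standard deviation is q percent of a per-component amplitude
    taken over all recorded time points k; the maximum absolute value
    is used as one plausible amplitude measure (assumption, see Eq. 10
    in the paper and its code [42] for the exact definition)."""
    rng = np.random.default_rng(rng)
    amp = np.max(np.abs(X), axis=0)      # amplitude of each component
    sigma = (q / 100.0) * amp
    return X + rng.normal(size=X.shape) * sigma
```

Each component then carries noise whose standard deviation is the same fixed percentage of its own amplitude.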
In Fig. 1 the percentage of correct model estimations of all 60 equations is given over the noise parameter q of Eq. 10 for all four algorithms. All ODE solution curve data and the code that produces Fig. 1 can be found in [42] together with an application manual. The parameters for the LISS-algorithm were in the terminology of [34] for all 60 equations: I = 10, R = 100 and S = 100. The elimination parameter ρ for the effect size or the information gain is 3.0 for all equations.
For the same ODE systems, search spaces (see Appendix A) and noise levels, those algorithms of the SINDy family of algorithms [43,44] are evaluated that have simple default parameters. The rationale behind this decision is that the model-LISS algorithms have a default hyperparameter set for all estimations, so that the conditions are comparable. The results are presented in Fig. 2, and the labels for the differentiation methods are explained in Table 2. All ODE solution curve data and the code that produces Fig. 2 can be found in [42] together with an application manual.
Fig. 2 Percentage of correctly estimated equations for those SINDy algorithms that have a simple default initialization, over the noise parameter q defined in Eq. 10. The algorithms differ by the differentiation method (for details see the supplemental material [42]). They are less resilient towards noise than the DIF_EF and DIF_EN methods of the current paper
In Fig. 3 the average correctness of the estimated parameters is examined for those SINDy and model-LISS estimations in which the estimated model was correct. For a given noise percentage parameter q, the relative difference of the given parameter p_{α,i} used for the calculation of the data and the estimated parameter value $\hat{p}_{\alpha,i}$ was formed as $d_{\alpha,i} = (p_{\alpha,i} - \hat{p}_{\alpha,i})/p_{\alpha,i}$. The mean parameter value deviation ⟨d⟩ depicted in Fig. 3 was then evaluated by taking the arithmetic mean of d_{α,i} over all parameters of all of the 60 equations that were estimated successfully.
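The quantity ⟨d⟩ is then a plain signed mean of the relative deviations, e.g. (hypothetical helper name):

```python
import numpy as np

def mean_relative_deviation(p_true, p_est):
    """Signed mean of d = (p_true - p_est) / p_true over all given
    parameters, as used for the mean deviation <d> of Fig. 3."""
    p_true = np.asarray(p_true, dtype=float)
    p_est = np.asarray(p_est, dtype=float)
    return np.mean((p_true - p_est) / p_true)
```

Being a signed mean, over- and underestimates of different parameters can cancel.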

The real-world perspective: a truly unknown system
In the previous subsection it was demonstrated that the correct low-order polynomial model can be found with high probability if the data stem from such a model. For real-world data, however, this is typically unknown.
In order to explore a truly unknown model, an astable multivibrator was tamed with capacitors and resistors to produce low-amplitude oscillations with low curvature of the trajectory. The details are in Appendix C. Moreover, this little example with experimental data is instructive for understanding why methods like those of Table 1 should be labeled preprocessing methods. Application of the DIF_EF algorithm with the standard hyperparameters of this paper to the data shown as a black line in Fig. 4 yields a cubic polynomial model whose parameter values, together with the values obtained by an improvement of the parameters by a Bock-type algorithm [45] (for details see [42]), are given in Table 3. The refinement through Bock's algorithm changes the parameter values by a few percent and brings the curve calculated by the model closer to the experimental data, as can be seen in Fig. 4. Note that the model is highly unstable in the sense that data curves quite close to the presented one result in a slightly different model (i.e. some terms are different), indicating that the data have no model with polynomials of order 3. Also, a rapid approach of the solution generated by the model towards ±∞ may occur for data close to the presented ones.
Fig. 4 In black the experimental data obtained from the oscillator described in Appendix C. In green the results of the DIF_EF preprocessing method and in red the improvement of the model found with DIF_EF by adjusting the parameters using a simple algorithm of the Bock type [45]. Details and code to create this figure automatically from the data are in [42]

Discussion

Performance of the model-LISS algorithms
The estimations of the algorithms in standard form (DIF_EF and DIF_EN in Fig. 1) are quite successful, while the integrated versions are much less effective. In particular, the integrated version based on parameter elimination using the information gain (INT_EN) is not a usable algorithm at all, since it never succeeds. This is due to the multicollinearity in the covariance matrix of the parameters, which leads to an exceedingly small entropy. Then no information is gained by the elimination of parameters and the algorithm does not even start.
As can be seen in the upper part of Fig. 3, the algorithms in standard form typically have parameter deviations of less than one percent, because the tangent and coordinate estimation by linear regression removes the noise for these algorithms. In contrast, for the integrated algorithm INT_EF no noise is removed, which is reflected in an increase of the mean parameter deviation with the noise parameter q.
The INT_EF algorithm typically finds the correct solution in fewer than 20% of the studied equations.
Often very few spurious terms with small parameters occur in the estimated model. One reason is that no noise is eliminated by tangent and coordinate estimation through linear regression, as is done in the algorithms for the standard ODE form. Also, due to integration, structures in the data vanish, as can be seen in Fig. 5. By Eq. 18 the curve x_2(t) is the integral curve of x_3(t), and x_1(t) is the integral of x_2(t). The disappearance of structures in the data is clearly discernible. Moreover, the integral formulation integrates over the entire time interval of the data. If some coordinate oscillates mainly or entirely below (above) zero, the time integral of that coordinate has a huge downward (upward) trend with tiny oscillations around this trend. Thus the column in the regression matrix representing this coordinate is dominated by this artificial trend due to integration. This hampers the selection of columns in the regression matrix, since the feature almost disappears on a huge trend that merely represents the progress of time. The same argument applies to columns of coordinate polynomials in the regression matrix. In contrast, the weak formulation discussed in Sect. 4.2 integrates over short time intervals, so that no such trend emerges in the columns of the regression matrix.
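The trend argument can be illustrated numerically: integrating an oscillation that lies entirely below zero yields a running integral dominated by a linear drift, beside which the oscillatory part almost disappears. The offset and frequency below are illustrative.

```python
import numpy as np

# an oscillation that stays entirely below zero
t = np.linspace(0.0, 50.0, 5001)
x = -2.0 + 0.3 * np.sin(t)                 # oscillates within [-2.3, -1.7]

# running integral with the trapezoid rule, as in the integral form
dt = np.diff(t)
integral = np.concatenate([[0.0], np.cumsum(0.5 * (x[1:] + x[:-1]) * dt)])

# split off the drift contributed by the constant offset
trend = -2.0 * t                            # artificial trend due to integration
ripple = integral - trend                   # what remains of the oscillation
ratio = np.max(np.abs(trend)) / np.max(np.abs(ripple))
```

Here the trend exceeds the remaining oscillatory part by more than two orders of magnitude, so the corresponding feature-matrix column mostly encodes the progress of time.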
When one applies the algorithm INT_EF, the regressand b_α is by Eq. 4 the solution curve shifted by a constant, since it has the elements x_α(t_r) − x_α(0). It contains fewer structures than the regressand of the DIF_EF and DIF_EN algorithms, which contains the tangent estimates. The same is true for the corresponding regression matrix A defined in Eq. 4. Since by regression one expands the regressand into a linear combination of the columns of the regression matrix, the loss of structures or the appearance of a huge trend due to integration impedes the elimination of the regression matrix columns that should not be incorporated into the model. As a consequence, the percentage of correctly estimated models is degraded for the INT_EF algorithm, as can be seen in Fig. 1. The combination of the integrated algorithm INT_EF with the profile likelihood method [46] could be helpful to eliminate the few spurious terms by their flat profile likelihood, if one subsequently applies Bock's algorithm [45] to verify the result.
As can be seen in Fig. 1, the performance of the DIF_EF and DIF_EN algorithms degrades for too large values of the noise percentage parameter q. This is exactly what one expects. For values of q ≤ 10⁻³, the percentage of correctly estimated models decreases with decreasing q for these two algorithms. Thus the elimination procedure needs a minimum level of noise in the data in order to work properly. A too low noise level in the data is, however, very rarely a problem in a real-world estimation setup.

Discussion in the context of recent developments
The discussion is mainly confined to those methods from the introduction that, like the algorithms of this publication, aspire to a precise estimation of an equation system. The algorithms that use Lagrange multipliers to enforce a sparse model have the difficulty that the multiplier must be determined (see the discussion in [37]). In contrast, the hyperparameters of this publication can easily be determined, which was demonstrated by using the same hyperparameters for all 22 test systems.
The LASSO also works as a cutoff for small parameters [33]. This leads to the increased sensitivity of the SINDy algorithms towards noise, as can be seen from a comparison of Figs. 1 and 2. In contrast, the algorithms presented here probe the scale of the parameters by subsampling and eliminate parameters in a scale-free manner. In an exceptional case, i.e. the first equation of the plasma edge dynamics in Sect. B.4, the parameter values differed by over four orders of magnitude. However, depending on the quality of the data, differences of two orders of magnitude may already lead to a discrimination of small parameters. Figure 3 shows that the SINDy estimates of the parameter values are slightly better for smaller values of the noise parameter q, while the DIF_EF and DIF_EN algorithms are better than SINDy for larger values of q.
Recently, numerical weak differentiation was applied to model estimation of differential equations [47,48]. This should not be confused with estimation in the integral form following [37], as in Eq. 4. However, preliminary results on a few data sets and noise levels used in this paper [42] rarely yielded correct models. Typically the condition number [35] of the regression matrix is increased by a factor of 10² to 10³ by the weak formulation of the regression problem. SINDy tries to cure this by scaling the columns of the matrix to unit length. But the weak formulation also affects the angles between the column vectors and thereby increases the condition number of the regression matrix. In the cases of correct model estimation, however, the extraordinary accuracy of the estimated parameter values claimed in [48] emerged. Thus, weak numerical derivatives should be investigated by systematic mathematical research in the future.
Finally, it should be mentioned why the problems of oscillations and ripples in the cost function reported in [49] for ODE system parameter estimations do not occur here. As stated in [49], these phenomena are due to the time integration over the cost function. The time integration over a cost function is, however, avoided by the algorithms presented here. They set up a system of equations using only local points in time that can be solved by linear regression, as emphasized in [34].

Coverage of the coordinate space by data
This point is rarely addressed for ODE model estimation, but applies to all methods in the literature. In the environmental sciences it is discussed within the context of equifinality [50], i.e. several distinct models fit the data equally well, because the data are insufficient to distinguish between the models. Another variation of this theme is the fact that regression models found for data in one region of the coordinate space cannot necessarily be extrapolated to another region, due to the complex dependency of the model on coordinates that are not covered by data.
A special problem arises for periodic solutions in coordinate spaces with a dimension greater than or equal to two. There, large parts of the coordinate space are not covered by data, and the model cannot necessarily be extrapolated to coordinates not covered by data. As a consequence, the periodic orbits of the FitzHugh-Nagumo equations and the Brusselator in Sect. B.5 as well as the plasma edge dynamics of Sect. B.4 give correct estimates quite rarely. In contrast, solution curves for the Hamiltonian systems of Sect. B.1 and the driven systems of Sect. B.3 cover the coordinate space quite densely due to their irregular behavior. Therefore these systems can be estimated correctly even for large noise levels. For the Hénon-Heiles system in Sect. B.1 the supplemental material contains an example where correct estimates emerged for noise percentage parameters q close to 10 percent. From this discussion it is clear that interventions on the system, like the application of external driving forces as in Sect. B.3, allow one to cover the coordinate space quite densely with data points, which facilitates the model estimation. The calculation of numerical derivatives also requires a high density of data points in time.
For data with unobserved coordinates, successful parameter estimation for a given model was performed in [51]. For model estimation, however, an even worse situation than for periodic solutions occurs, since an entire subspace of the coordinate space is not covered by data points. Then model estimation can become exceedingly difficult.

The real-world perspective
As demonstrated with a simple experimental data example in Sect. 3.2, the real world cannot always be explained with ODEs having low-order polynomial right-hand sides. The real-world dynamics can also stem from a process obeying a differential algebraic equation, an equation that is nonlinear in the parameters, a piecewise defined system, a non-smooth system or something else. The instability of the model, discussed in Sect. 3.2, shows that the data cannot be modeled with polynomials of degree three. Even then, however, the algorithm frequently gives a reasonable approximation of the tangent field, resulting in a fitting integral curve of the model, as can be seen in Fig. 4.
The reason for labeling these algorithms as preprocessing methods is as follows. Although some algorithms frequently yield the correct result, there is no guarantee or check by the algorithm that the estimated model gives solution curves which fit the data on a long time scale. This must be verified independently, for instance by applying Bock's algorithm [45], which was demonstrated in Sect. 3.2 for experimental data of a simple electronic oscillator. There it resulted in a change of the parameters of a few percent. Observe that nonlinear systems can undergo dozens or hundreds of, for example, flip bifurcations in such a parameter range, altering the solution curve completely. Also, a single missing term in the model can yield a solution curve that approaches ±∞ quickly, causing a severe mismatch between model and data. Therefore the true model check is a comparison of the integral curve of the model with the experimental data. As a consequence, the methods of this paper are labeled preprocessing methods [34].

Conclusion
A family of novel preprocessing algorithms for the estimation of polynomial ODE models from given solution curves was developed. They were tested against a set of 22 ODE equation systems comprising 60 estimable equations at 20 noise levels per equation. A comparison to those algorithms of the widely cited SINDy family [19] that have simple default hyperparameters, like the algorithms of the current paper, shows that the model-LISS algorithms are more resilient towards noise. Thus model-LISS opens a promising novel line of development in the field of ODE model estimation that avoids constraints involving Lagrange parameters.
Funding The authors have not disclosed any funding.

Data availability
The data and code utilized for the automatic generation of the key Figs. 1 to 4 are available at [42].

Conflict of interest The author declares that he has no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Polynomial search spaces
The search space for a component α of an N-dimensional ODE system is a cubic polynomial C_α(x) in N = dim(x) variables. Here N ∈ {2, 3, 4} is used. The monomials are numbered by total degree and, within each degree, in lexicographic order of the coordinates. For an ODE component α the parameter numbers of the monomials in the search space in the case of N = 2 are:

0: 1, 1: x1, 2: x2, 3: x1², 4: x1x2, 5: x2², 6: x1³, 7: x1²x2, 8: x1x2², 9: x2³.  (12)

In the case of N = 3 the parameter numbers for the monomials are

0: 1, 1: x1, 2: x2, 3: x3, 4: x1², 5: x1x2, 6: x1x3, 7: x2², 8: x2x3, 9: x3², 10: x1³, 11: x1²x2, 12: x1²x3, 13: x1x2², 14: x1x2x3, 15: x1x3², 16: x2³, 17: x2²x3, 18: x2x3², 19: x3³.  (13)

In the case of N = 4 the parameter numbers 0 to 34 follow the same scheme for the 35 monomials up to degree 3 in four variables.  (14)

B ODE systems, parameters and initial conditions
The solution curve data used for model estimation of the ODE systems in this appendix are in the supplemental material attached to this paper [42], together with the python code necessary to reproduce Figs. 1 and 3. In the following, the ODE systems, their parametrization and the integration procedures used to produce the data are given. The choice of initial conditions and parameters was made as follows. For those systems already used in [34] the same parameters and initial conditions were taken. For the jerky dynamics [52] in Sect. B.2 the parameters are given in that paper, and the initial conditions were chosen such that the trajectory stayed bounded. The bounded-trajectory rationale was also used for the driven elementary catastrophes of Kuznetzova [53] in Sect. B.3 and the two Contopoulos systems in Sect. B.1. In addition, a reasonable coverage of the coordinate space was aimed at.

B.1 Hamiltonian systems
Here, for each equation system, all four components were estimated in the 35-dimensional search space of Eq. 14. Hénon-Heiles [54]: The step size for ODE integration using the symplectic Euler algorithm [55] was h = 10⁻³, while data were recorded at step size h_r = 10⁻². The trajectory used for estimation started at t = 0.0 and ended at t = 399.99.
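The symplectic Euler scheme for the Hénon-Heiles system can be sketched as follows: the momentum is updated with the old position and the position with the new momentum. The standard Hénon-Heiles potential is used; the initial condition below is illustrative, not the one of [42].

```python
import numpy as np

def hh_force(q):
    """Henon-Heiles forces -dV/dq for the standard potential
    V = 0.5*(q1^2 + q2^2) + q1^2*q2 - q2^3/3."""
    q1, q2 = q
    return np.array([-q1 - 2.0 * q1 * q2,
                     -q2 - q1 * q1 + q2 * q2])

def symplectic_euler(q0, p0, h, n_steps):
    """Symplectic Euler integration: momentum update with the old
    position, position update with the new momentum."""
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    traj = np.empty((n_steps + 1, 4))
    traj[0] = np.concatenate([q, p])
    for n in range(n_steps):
        p = p + h * hh_force(q)
        q = q + h * p
        traj[n + 1] = np.concatenate([q, p])
    return traj

def hh_energy(q1, q2, p1, p2):
    """Total energy of the Henon-Heiles system."""
    return (0.5 * (p1**2 + p2**2) + 0.5 * (q1**2 + q2**2)
            + q1**2 * q2 - q2**3 / 3.0)
```

For a bounded orbit (energy below 1/6) the scheme keeps the energy error small and free of drift over many steps, which is the reason for preferring it over the plain Euler scheme for Hamiltonian systems.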

B.2 Jerky dynamics
Seven such systems are described in Table II of [52], and the labels for the systems in the following paragraphs are taken from that table. They are three-dimensional ODE systems whose first two equations are identical for all seven systems. All three equations of a system are estimated by the algorithm. The search space and parameter numbers are given in Eq. 13. JD1: The initial conditions were x_1(0) = 0.0, x_2(0) = 0.5 and x_3(0) = 0.0. The parameters are p_{1,2} = 1.0, p_{2,3} = 1.0, p_{3,0} = −0.08, p_{3,1} = −0.4, p_{3,3} = −1.0 and p_{3,5} = 1.0. The step size for ODE integration using the Euler algorithm [57] was h = 10⁻³, while data were recorded at step size h_r = 10⁻². The trajectory used for estimation started at t = 200.0 and ended at t = 599.99.

B.3 Periodically driven systems
In this subsection, second-order ODEs driven periodically by a third coordinate are investigated. The search space of admitted polynomials for the first two coordinates consists of the cubic polynomials in three variables of Eq. 13. The ODE of the third coordinate is not estimated. Van der Pol: The initial conditions were x_1(0) = 1.0, x_2(0) = 1.0 and x_3(0) = 1.0. The parameters are p_{1,2} = 1.0, p_{2,1} = −1.0, p_{2,2} = 0.1, p_{2,3} = −0.32 and p_{2,11} = −0.1. The step size for ODE integration using the Euler algorithm [57] was h = 10⁻³, while data were recorded at step size h_r = 10⁻². The trajectory used for estimation started at t = 0.0 and ended at t = 99.99. The driving frequency and the phase were considered to be known and had the values ω = 1.15 and φ = 0.0, while the driver amplitude is estimated as p_{2,3}.

C Experimental data generation
In Fig. 6 the tamed astable multivibrator is shown, which was used to produce the data in Fig. 4. The supply voltage of the LM741 op-amp was ±22 V. The data were recorded using a Picoscope 3404A oscilloscope with the corresponding software. For reproduction of the data, start with both potentiometers in the 1 MΩ position and reduce the resistance of both. In the 200 kΩ range and below, oscillations with a small amplitude and small curvature emerge.