Interpretable Scientific Discovery with Symbolic Regression: A Review

Symbolic regression is emerging as a promising machine learning method for learning succinct, interpretable mathematical expressions directly from data. While it has traditionally been tackled with genetic programming, it has recently gained growing interest in deep learning as a data-driven model discovery method, achieving significant advances in application domains ranging from fundamental to applied sciences. This survey presents a structured and comprehensive overview of symbolic regression methods and discusses their strengths and limitations.


Introduction
Symbolic Regression (SR) is a rapidly growing subfield within machine learning (ML) that aims to infer symbolic mathematical expressions from data [1,2]. Interest in SR is driven by the observation that it is not sufficient to have accurate predictive models; it is often also necessary that the learned models be interpretable [3]. A model is interpretable if the relationship between its input and output can be logically or mathematically traced in a succinct manner. In other words, learned models are interpretable if they can be expressed as mathematical equations. As disciplines become increasingly data-rich and adopt ML techniques, the demand for interpretable models is likely to grow. For example, in the natural sciences (e.g., physics), mathematical models derived from first principles make it possible to reason about the underlying phenomenon in a way that is not possible with predictive models like deep neural networks. In critical disciplines like healthcare, non-interpretable models may never be allowed to be deployed, however accurate they may be [4].
Example: Consider a data set consisting of samples (q1, q2, r, F), where q1 and q2 are the charges of two particles, r is the distance between them, and F is the measured force between the particles. Assume q1, q2, and r are the input variables and F is the output variable. Suppose we model the input-output relationship as F = θ0 + θ1 q1 + θ2 q2 + θ3 r. Then, using the data set, we can infer the model's parameters (θi). The model will be interpretable because we will know the impact of each variable on the output. For example, if θ3 is negative, then as r increases, the force F decreases. From physics, we know that this form of the model is unlikely to be accurate. On the other hand, we could model the relationship using a neural network, F = NN(q1, q2, r, θ). We expect the model to be highly accurate and predictive because neural networks (NNs) are universal function approximators. However, the model is uninterpretable because the input-output relationship is not easily apparent. The input feature vector undergoes several layers of nonlinear transformations, i.e., y = σ(W_idx x), where σ is a nonlinear activation function and W_idx are the learnable parameters of the NN layer of index idx. Such models, called "blackbox" models, do not have an internal logic that lets users understand how inputs are mathematically mapped to outputs. Explainability is the application of other methods to explain model predictions and to understand how a model is learned; it refers to why the model makes the decisions it does. What distinguishes explainability from interpretability is that interpretable models are transparent [3]. For example, the predictions of a linear regression model can be interpreted by evaluating the relative contribution of individual features to the predictions using their weights. An ideal SR model will return the relationship as F = k q1 q2 / r^2, which is the definition of the Coulomb force between two charged particles, with the constant k = 8.98 × 10^9. However, learning the SR model is highly non-trivial, as it involves searching over a large space of mathematical operations and identifying the right constant (k) that fits the data. SR models can be directly inferred from data or can be used to "whitebox" a "blackbox" model such as a neural network.
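As a toy illustration of the example above, the following sketch compares the linear model F = θ0 + θ1 q1 + θ2 q2 + θ3 r against a regression on the single symbolic feature q1 q2 / r^2 on synthetic Coulomb data. The charge and distance ranges are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8.98e9

# Hypothetical synthetic data for the Coulomb example
q1, q2 = rng.uniform(1e-6, 5e-6, (2, 200))
r = rng.uniform(0.1, 1.0, 200)
F = k * q1 * q2 / r**2

# Linear model F = th0 + th1*q1 + th2*q2 + th3*r, fitted by least squares
A = np.column_stack([np.ones_like(r), q1, q2, r])
theta, *_ = np.linalg.lstsq(A, F, rcond=None)
resid = F - A @ theta
r2_linear = 1 - resid.var() / F.var()

# The symbolic form k*q1*q2/r**2 fits exactly: regress F on the single
# feature q1*q2/r**2 and recover the constant k
feat = (q1 * q2 / r**2).reshape(-1, 1)
k_hat = np.linalg.lstsq(feat, F, rcond=None)[0][0]
print(f"linear R^2 = {r2_linear:.3f}, recovered k = {k_hat:.3e}")
```

The linear model leaves a substantial residual, while the correct symbolic feature recovers k essentially exactly; of course, a real SR method must discover the feature q1 q2 / r^2 itself rather than being given it.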
The ultimate goal of SR is to bridge data and observations following the Keplerian trial-and-error approach [5]. Kepler developed a data-driven model for planetary motion using the most accurate astronomical measurements of the era, which resulted in elliptic orbits described by a power law. In contrast, Newton developed a dynamic relationship between physical variables that described the underlying process at the origin of these elliptic orbits. Newton's approach [6] led to three laws of motion later verified by experimental observations. Whereas both methods fit the data well, Newton's approach could be generalized to predict behavior in regimes where no data were available. Although SR is regarded as a data-driven model discovery tool, it aims to find a symbolic model that simultaneously fits the data well and generalizes to uncovered regimes. SR is deployed as an interpretable and predictive ML model or as a data-driven scientific discovery method. SR was investigated as early as 1970 in research works [7,8,9] aiming to rediscover empirical laws. Such works iteratively apply a set of data-driven heuristics to formulate mathematical expressions. The first AI system meant to automate scientific discovery is called BACON [10,11]. It was developed by Patrick Langley in the late 1970s and was successful in rediscovering versions of various physical laws, such as Coulomb's law and Galileo's laws for the pendulum and constant acceleration, among many others. SR was later studied by Koza [12,13,1], who proposed that genetic programming (GP) can be used to discover symbolic models by encoding mathematical expressions as computational trees, where GP is an evolutionary algorithm that iteratively evolves an initial population of individuals via biology-inspired operations. SR has since been tackled with GP-based methods [1,14,15,16,17,18,19,20,21,22,23,24,25]. Moreover, it was popularized as a data-driven scientific discovery tool with the commercial software Eureqa [26], based on the research work of [2]. Whereas GP-based methods achieve high prediction accuracy, they do not scale to high-dimensional data sets and are sensitive to hyperparameters [19]. More recently, SR has been addressed with deep learning-based methods [27,28,19,29,30,31,32,33], which leverage neural networks (NNs) to learn accurate symbolic models. SR has been applied in fundamental and applied sciences such as astrophysics [34], chemistry [35,36], materials science [37,38], semantic similarity measurement [39], climatology [40], and medicine [41], among many others. Many of these applications are promising, showing the potential of SR. A recent SR benchmarking platform, SRBench, was introduced by La Cava et al. [42]. It comprises 14 SR methods (among which ten are GP-based) applied on 252 data sets. The goal of SRBench is to provide a benchmark for rigorous evaluation and comparison of SR methods.
This survey aims to help researchers effectively and comprehensively understand the SR problem and how it can be solved, as well as to present the current status of the advances made in this growing subfield. First, we define the SR problem, present a structured and comprehensive review of methods, and discuss their strengths and limitations. Furthermore, we discuss the adoption of these SR methods across various application domains and assess their effectiveness. Along with this survey, a living review [25] aims to group state-of-the-art SR methods and applications and to track advances made in the SR field; the objective is to update this list often to incorporate new research works. This paper is organized as follows. The SR problem definition is presented in Section 2. We present an overview of methods deployed to solve the SR problem in Section 3, and the methods are discussed in detail in Sections 4, 5 and 6. Selected applications are described and discussed in Section 7. Section 8 presents an overview of existing benchmark data sets. Finally, we summarize our conclusions and discuss perspectives in Section 10.

Problem Definition
The problem of symbolic regression can be defined in terms of classical Empirical Risk Minimization (ERM) [43].
Data Set: Consider a data set D = {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d is the input vector and y_i ∈ R is a scalar output.
Function Class: Let F be a function class consisting of mappings f : R^d → R.
Loss Function: Define a loss function for every candidate f ∈ F. A common choice is the squared difference between the output and the prediction, i.e., l(f(x_i), y_i) = (y_i − f(x_i))^2.
Optimization: The optimization task is to find the function f* over the set of functions F that minimizes the empirical risk:

f* = argmin_{f ∈ F} (1/n) Σ_{i=1}^n l(f(x_i), y_i)

As discussed below, what distinguishes SR from conventional regression problems is the discrete nature of the function class F; different methods for solving the SR problem reduce to different characterizations of the function class.

Class of Function
In SR, to define F, we specify a library L of elementary arithmetic operations, mathematical functions, and variables; F is then the set of all functions f that can be obtained by composing elements of the library [44]. For example, consider the library L = {1, x, +, −, ×}. Then the set of all polynomials in one variable x with integer coefficients can be derived from L using function composition.
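The composition process can be sketched by enumerating string-encoded expressions built from a toy library; the terminal and operator names below are illustrative, not a fixed convention:

```python
from itertools import product

# A toy library: terminals and binary operators
terminals = ["x", "1"]
binary_ops = ["+", "*", "-"]

def grow(exprs):
    """One round of composition: combine existing expressions with every binary op."""
    new = set(exprs)
    for a, b in product(exprs, repeat=2):
        for op in binary_ops:
            new.add(f"({a} {op} {b})")
    return new

level0 = set(terminals)
level1 = grow(level0)
level2 = grow(level1)
print(len(level0), len(level1), len(level2))
# integer-coefficient polynomials such as x^2 + 1 are reachable at depth 2
assert "((x * x) + 1)" in level2
```

The combinatorial growth visible in the printed set sizes is precisely what makes the SR search space so large.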

Expression representation
It is convenient to express symbolic expressions in sequential form using either a unary-binary expression tree or Polish notation [45]. For example, the expression f(x) = x1 x2 − 2x3 can be derived by function composition from L (Eq. 3) and represented as the tree-like structure illustrated in Figure 1a. By traversing the (binary) tree top to bottom and left to right in a depth-first manner, we can represent the same expression as a unique sequence called the Polish form, as illustrated in Figure 1b. In practice, the library L includes many other common elementary mathematical functions, such as the basic trigonometric functions (sine, cosine), logarithm, exponential, square root, and power laws. Prior domain knowledge is advantageous for library definition because it reduces the search space to the mathematical operations most relevant to the studied problem. Furthermore, it should be possible to express a large range of numeric constants. For example, numbers in base-10 floating-point notation rounded to four significant digits can be represented as a triple (sign, mantissa, exponent) [31]. The function sin(3.456x), for example, can be represented as [sin, mul, 3456, E−3, x].
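The tree-to-sequence conversion described above can be sketched with a minimal node class and a pre-order (depth-first, left-to-right) traversal, shown here for the example f(x) = x1 x2 − 2x3:

```python
# Minimal expression-tree node and its pre-order (Polish) serialization
class Node:
    def __init__(self, token, children=()):
        self.token = token
        self.children = children

def to_polish(node):
    """Depth-first, top-to-bottom, left-to-right traversal -> prefix sequence."""
    seq = [node.token]
    for child in node.children:
        seq += to_polish(child)
    return seq

# f(x) = x1*x2 - 2*x3
tree = Node("-", (
    Node("*", (Node("x1"), Node("x2"))),
    Node("*", (Node("2"), Node("x3"))),
))
print(to_polish(tree))  # ['-', '*', 'x1', 'x2', '*', '2', 'x3']
```

Because every operator has a fixed arity, the prefix sequence can be decoded back into a unique tree, which is what makes this encoding convenient for sequence models.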

Symbolic regression methods overview
In this survey, we categorize SR methods as follows: regression-based methods, expression tree-based methods, and physics- and mathematics-inspired methods, as presented in Figure 2. For each category, a summary of the mathematical tool, the expression form, the set of unknowns, and the search space is presented in Table 1.
The linear method defines the functional form as a linear combination of nonlinear functions of x drawn from the predefined library L. Linear models are expressed as f(x) = Σ_j θ_j f_j(x), where j spans the basis functions of L. The optimization problem reduces to finding the set of parameters {θ_j} that minimizes the loss function over a continuous parameter space Θ = R^M, i.e., θ* = argmin_{θ ∈ Θ} L(θ). This method has the advantage of being deterministic, and the disadvantage of imposing a single model structure that remains fixed during training while the model's parameters are learned.
The nonlinear method defines the model structure by a neural network. Nonlinear models can thus be expressed as y = σ(W_L σ(W_{L−1} · · · σ(W_1 x))), where σ is a nonlinear activation function and W_idx are the learnable parameters of the NN layer of index idx.
Similarly to the linear method, the optimization problem reduces to finding the set of parameters {W, b} of the neural network layers that minimizes the loss function over the space of real values.
Expression tree-based methods treat mathematical expressions as unary-binary trees whose internal nodes are operators and whose terminals are operands (variables or constants). This category comprises GP-based, deep neural transformer-based, and reinforcement learning-based methods. In GP-based methods, a set of transition rules (e.g., mutation, crossover) is defined over the tree space and applied to an initial population of trees throughout many iterations until the loss function is minimized.

Figure 2: Taxonomy based on the type of symbolic regression methods. φ denotes a neural network function, W denotes the set of learnable parameters in the NN. x denotes the input data, z denotes a reduced representation of x, and x̃ denotes a new representation of x, e.g., obtained by defining new features based on the original ones. T represents the final population of selected expression trees in genetic programming.

Transformers [46] are a neural network architecture (encoder and decoder) built on the attention mechanism, which was primarily introduced to capture long-range dependencies in a sentence. Transformers were designed to operate on sequential data and to perform sequence-to-sequence (seq2seq) tasks. For their use in SR, input data points (x, y) and symbolic expressions f are encoded as sequences, and transformers perform set-to-sequence tasks. The unknowns are the weight parameters of the encoder and the decoder. Reinforcement learning (RL) is a machine learning method that seeks to learn a policy π(a|s; θ) by training an agent to perform a task by interacting with its environment in discrete time steps. An RL setting requires four components: a state space, an action space, state transition probabilities, and a reward. The agent selects an action that is sent to the environment; a reward and a new state are sent back to the agent and used to improve its policy at the next time step. In the context of SR, a partial symbolic expression (sequence) represents a state, predicting the next element of the sequence represents an action, the parent and sibling nodes represent the environment, and the reward is commonly based on the mean squared error (MSE). RL-based SR methods are commonly hybrid and use various ML tools (e.g., NN, RNN, etc.) jointly with RL.

Linear symbolic regression
The linear approach assumes, by definition, that the target symbolic expression f(x) is a linear combination of nonlinear functions of the feature attributes, f(x) = Σ_j θ_j h_j(x). Here x denotes the input feature vector, θ_j denotes a weight coefficient, and h_j(·) denotes a unary operator from the library L. This approach predefines the model's structure and reduces the SR problem to learning only the model's parameters by solving a system of linear equations. The particular case where f(x) is a linear combination of monomials reduces to a conventional (polynomial) linear regression problem, i.e., f(x) = Σ_j θ_j x^j = θ0 + θ1 x + θ2 x^2 + · · ·. There exist two cases for this problem: (1) a unidimensional case defined by f : R^d → R; and (2) a multidimensional case defined by f : R^d → R^m, with d the number of input features and m the number of variables required for a complete description of a system; for example, the Lorenz system for fluid flow is defined in terms of three physical variables which depend on time.

Unidimensional case
Given a data set D = {(x_i, y_i)}_{i=1}^n, the mathematical expression can be either univariate (x_i ∈ R) or multivariate (x_i ∈ R^d with d > 1). The methodology of linear SR is presented in detail for the univariate case in Section 4.1.1 for simplicity and is extended to the multivariate case in Section 4.1.2.

Univariate function
Library: L can include any number of mathematical operators, provided that the number of data points remains greater than the number of columns of the library matrix (see the discussion below).
In this approach, a coefficient θ_j is assigned to each candidate function f_j(·) ∈ L as an activeness criterion, such that:

f(x) = θ_0 f_0(x) + θ_1 f_1(x) + · · · + θ_k f_k(x)    (Eq. 8)

Applying Eq. 8 to the input-output pairs (x_i, y_i) yields a system of n linear equations, which can be represented in matrix form (Eq. 10) and written in the compact form:

Y = U(X) Θ    (Eq. 11)

where Θ ∈ R^(k+1) is the sparse vector of coefficients and U ∈ R^(n×(k+1)) is the library matrix, which can be expressed as a function of the input vector X (Eq. 12): each row of index i in Eq. 12 is the vector of the (k+1) library functions evaluated at x_i.

Example: For the library L = {1, x, x^2, x^3, · · ·} (Eq. 13), the i-th row of U(X) becomes [1, x_i, x_i^2, x_i^3, · · ·]. The vector of coefficients, i.e., the model's parameters, is obtained by solving Eq. 11 in the least-squares sense, Θ = U^+(X) Y, where U^+ denotes the pseudo-inverse of U. The magnitude of a coefficient θ_k effectively measures the size of the contribution of the associated function f_k(·) to the final prediction. Finally, the prediction vector Ŷ can be evaluated using Eq. 11.
An exemplary schematic is illustrated in Figure 3 for the univariate function f(x) = 1 + αx^3. Only the coefficients associated with the functions {1, x^3} of the library are non-zero, with values equal to 1 and α, respectively.
Figure 3: Schematic of the system of linear equations of Eq. 11 for f(x) = 1 + αx^3. A library matrix U(X) of nonlinear functions of the input is constructed, where L = {1, x, x^2, x^3, · · ·}. The marked entries in the Θ vector denote the non-zero coefficients determining which functions of the library are active.
In the following, linear SR is tested on synthetic data. In each experiment, training and test data sets are generated. Each set consists of twenty data points randomly sampled from a uniform distribution U(−1, 1), and y is evaluated using a univariate function, i.e., D = {(x_i, f(x_i))}_{i=1}^n. Two libraries, L1 and L2, are considered in these experiments. The results are reported in terms of the output expression (Equation 7) and the coefficient of determination R^2. SR problems are grouped into (i) pure polynomial functions and (ii) mixed polynomial and trigonometric functions. In each experiment, the parameters are learned using the training data set, and results are reported for the test data set in Table 2. For polynomial functions, an exact output is obtained using L1 with R^2 = 1.0, whereas only an approximate output is obtained using L2. In the latter case, the quality of the fit depends on the size of the training data set. An exemplary result is shown in Figure 4 for f(x) = x + x^2 + x^3. Points represent the (test) data of the input file, i.e., X; the red curve represents f(x) as a function of x, and the blue and black dashed curves represent the predicted function obtained using L1 and L2, respectively. An exact match between the ground-truth function and the predicted one is found using L1, whereas a significant discrepancy is obtained using L2. This discrepancy can be explained by the fact that various functions in L2 exhibit the same x-dependence over the covered x-range.
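The least-squares recovery described above can be sketched in a few lines for the benchmark f(x) = x + x^2 + x^3; the polynomial library used here is a hypothetical stand-in for L1:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
y = x + x**2 + x**3          # noiseless ground truth

# Library of basis functions: {1, x, x^2, x^3, x^4}
library = [np.ones_like, lambda t: t, lambda t: t**2, lambda t: t**3, lambda t: t**4]
U = np.column_stack([f(x) for f in library])   # n x (k+1) library matrix

theta, *_ = np.linalg.lstsq(U, y, rcond=None)  # Theta = U^+ Y
print(np.round(theta, 6))                      # coefficients ~ [0, 1, 1, 1, 0]
```

Because the target lies exactly in the span of the library, the recovered coefficient vector is sparse and exact; with a mismatched library, the same procedure returns only the best approximation in the span.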
For mixed polynomial and trigonometric expressions, neither library choice produces the exact expression; however, a better R^2 coefficient is obtained using L1. In the case of the Nguyen-5 benchmark, for example, f(x) = sin(x^2) cos(x) − 1, the resulting function is a truncated Taylor expansion of f. In conclusion, this approach cannot learn the ground-truth function when the latter is a product of two functions (e.g., f(x) = f1(x) · f2(x)) or when it has a multiplicative or an additive factor applied to the variable (e.g., sin(α + x), exp(λx), etc.). In the best case, it outputs an approximation of the ground-truth function. Furthermore, this approach fails to predict the correct mathematical expression when the library is extended to include a mixture of polynomial, trigonometric, exponential, and logarithmic functions.

Multivariate function
For a given data set D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d, where d is the number of features, the same equations presented in Section 4.1.1 are applicable. However, the dimension of the library matrix U changes to account for the dimension of the features vector. For example, for the same library shown in Eq. 13 and a two-dimensional features vector, i.e., X ∈ R^2, each row of U(X) contains the library functions evaluated in both variables together with the mixed polynomial terms. Here, X^{Pq} denotes the polynomials in X of order q.
Table 3 presents the results of experiments performed on functions of two variables, i.e., f(x1, x2). Similarly to Section 4.1.1, training and test data sets are generated by randomly sampling twenty pairs of points, and the same choices of library are considered. An exact match between the ground-truth and the predicted function is obtained using L1 for any polynomial function, whereas only approximate solutions are obtained for trigonometric functions. Using L2, the results are approximations of the ground-truth function.
Table 3: Results for multivariate functions using linear SR.
T and F refer to True and False.

Furthermore, linear SR is tested on a data set generated using a two-dimensional multivariate normal distribution N(µ, Σ), as shown in Fig. 5. Different analytic expressions for f(x1, x2) were tested with the different library bases summarized in Table 4, including pure polynomial basis functions, polynomial and trigonometric basis functions, and a mixed library:

U1 = {1, X, X^{P2}, X^{P3}, X^{P4}}
U2 = {1, X, X^{P2}, cos(X), sin(X)}
U3 = {1, X, X^{P2}, cos(X), sin(X), tan(X), exp(X), sigmoid(X)}

The function y1 = cos(x1) + sin(x2) is explored with all three bases. In the case of the pure polynomial basis, the correct terms of the Taylor expansions of both cos(x1) and sin(x2) are identified with only approximate values of their coefficients, i.e., ŷ1 = (0.88 − 0.3x1^2 + 0.01x1^4) + (0.97x2 − 0.2x2^3), which is reflected in a significantly high reconstruction error of the order of 30%. In both bases that include trigonometric functions, the correct terms cos(x1) and sin(x2) are identified with an excellent reconstruction error of ≤ 10^{-7}. Note that the lowest reconstruction error is obtained for the library U2, which has the smallest number of operations and, consequently, the lowest number of coefficients. The function y2 = x1^2 + cos(x2) is also tested. For the pure polynomial basis, the reconstructed function ŷ2 = x1^2 + (0.83 + 0.49x2 − x2^2) gives approximate values with a reconstruction error of ≤ 1%. An excellent prediction is made for the other bases, which include both operations appearing in y2(x1, x2).
In the same exercise, a more complicated functional form is tested that includes mixed terms, i.e., y3 = x1(1 + x2) + cos(x1) · sin(x2). The difference between the true and the predicted function is illustrated in Fig. 6. The linear approach performs similarly for all three library bases. A high reconstruction error is obtained because the product term cos(x1) · sin(x2) in y3 is not included in any of the libraries, showing an important limitation of this approach.

Multidimensional case
The target mathematical expression comprises m components, i.e., Y = [y1, · · · , ym], and the goal is to learn the coefficients of a system of linear equations rather than of a single mathematical expression. Each component y_j is described by y_j = U(X) θ_j, where θ_j is a sparse vector of coefficients; there thus exist m such vectors, i.e., Θ = [θ1 · · · θm]. Consider the Lorenz system, a set of ordinary differential equations that captures nonlinearities in the dynamics of fluid convection. It consists of three variables {x1, x2, x3} and their first-order derivatives with respect to time {dx1/dt, dx2/dt, dx3/dt}, which we will refer to as {y1, y2, y3}. Using the library of Eq. 13, the system of linear equations is represented in the compact matrix form (Eq. 17): Y = U(X) Θ. Here, Y ∈ R^{n×3}, U(X) ∈ R^{n×k}, and Θ ∈ R^{k×3}, where n is the size of the input data and k is the number of columns of the library matrix U. The application presented in [33] uses this approach, where the authors aim to learn the differential equations that govern the dynamics of a given system, such as a nonlinear pendulum or the Lorenz system. The approach successfully learned the exact weights, allowing the correct governing equations to be recovered.
An exemplary schematic is illustrated in Figure 7 for the Lorenz system defined by ẋ = σ(y − x), ẏ = x(ρ − z) − y, ż = xy − βz. Here x, y, and z are physical variables and ẋ, ẏ, and ż are their respective time derivatives. Only the coefficients associated with the functions {x, y, z, xy, xz} should be non-zero and equal to the factors appearing in the Lorenz system's equations.
Figure 7: Schematic of the system of Eq. 11 for the Lorenz system. A library U(X) of nonlinear functions of the input is constructed. The marked entries in the θ vectors denote the non-zero coefficients determining which library functions are active for each of the three variables {y1, y2, y3}.
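A minimal sketch of this multidimensional recovery for the Lorenz system follows. To keep the example self-contained, states are sampled at random and the derivatives Y are evaluated directly from the right-hand side of the equations, rather than numerically differentiating a simulated trajectory as in [33]:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

# Sample states and evaluate the Lorenz right-hand side (plays the role of Y)
X = rng.uniform(-20, 20, (500, 3))
x1, x2, x3 = X.T
Y = np.column_stack([sigma * (x2 - x1),
                     x1 * (rho - x3) - x2,
                     x1 * x2 - beta * x3])

# Library: constant, linear and quadratic monomials in (x1, x2, x3)
names = ["1", "x1", "x2", "x3", "x1x2", "x1x3", "x2x3"]
U = np.column_stack([np.ones(500), x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])

Theta, *_ = np.linalg.lstsq(U, Y, rcond=None)   # one sparse column per variable
for j in range(3):
    active = {n: round(c, 3) for n, c in zip(names, Theta[:, j]) if abs(c) > 1e-8}
    print(f"dx{j+1}/dt =", active)
```

Because every term of the Lorenz right-hand side lies in the span of the library, least squares recovers the exact sparse columns (e.g., −10 and 10 for dx1/dt); with noisy derivative estimates, a sparsity-promoting solver would be needed instead.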
In summary, the linear approach is only successful in particular cases and cannot be generalized. Its main limitation is that it predefines the model's structure as a linear combination of nonlinear functions, reducing the SR problem to solving a system of linear equations, whereas the main mission of SR is to learn both the model's structure and its parameters. A direct consequence of this limitation is that the linear approach fails to learn expressions in many cases: (i) products of functions (e.g., f(x) = f1(x) · f2(x)); (ii) multivariate compositions (e.g., exp(x · y), tan(x + y), etc.); and (iii) functions including multiplicative or additive factors applied to their arguments (e.g., exp(λx)). Finally, the dimension of the library matrix can be challenging in terms of computing resources for extended libraries and high-dimensional data sets.

Nonlinear symbolic regression
The nonlinear method uses deep neural networks (DNNs), which are known for their ability to detect and learn complex patterns directly from data.
DNNs have the advantage of being fully differentiable in their free parameters, allowing end-to-end training using back-propagation. This approach searches for the target expression by replacing the standard activation functions of a neural network with elementary mathematical operations. Figure 8 shows an NN-based architecture for SR, called the Equation Learner (EQL) network, proposed by Martius and Lampert [47], in comparison with a standard NN. Only two hidden layers are shown for simple visualization; the network's depth is controlled per case study.
The EQL network uses a multi-layer feed-forward NN with one output node. A linear transformation z^[l] = W^[l] a^[l−1] + b^[l] is applied at every hidden layer l, followed by a nonlinear transformation a_i^[l] = f_i(z^[l]) using unary (i.e., one-argument) and binary (i.e., two-argument) activation functions, where {W, b} denote the weight parameters and f_i denotes an individual activation function from the library L = {identity, (·)^n, cos, sin, exp, log, sigmoid}. In a standard NN, the same activation function is applied to all hidden units and is typically chosen among {ReLU, tanh, sigmoid, softmax, etc.}.
The problem reduces to learning the correct weight parameters {W^[l], b^[l]}, while the operators of the target mathematical expression are selected during training. To overcome the interpretability limitation of neural network-based architectures and to promote simple over complex solutions, as is typical of formulas describing physical processes, sparsity is enforced by adding an l1 regularization term to the l2 loss function:

L = (1/N) Σ_{i=1}^{N} ||ŷ(x_i) − y_i||^2 + λ Σ_{l=1}^{L} ||W^[l]||_1

where N denotes the number of data entries and L denotes the number of layers. Whereas this method is end-to-end differentiable in the NN parameters and scales well to high-dimensional problems, back-propagation through activation functions such as division or logarithm requires simplifications of the search space, thus limiting the method's ability to produce simple expressions involving divisions (e.g., sin(x/y)/x). An extended version, EQL÷ [48], includes only the division, whereas exponential and logarithm activation functions are not included because of numerical issues.
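The forward pass of one EQL-style layer can be sketched numerically. This is a simplified illustration, not the published architecture: the choice of three unary units (identity, sin, cos) and a single binary product unit is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

# One EQL-style hidden layer: a linear map followed by a mix of elementary
# activations (identity, sin, cos) and one binary (two-argument) product unit.
def eql_layer(x, W, b):
    z = x @ W + b                                 # z^[l] = W^[l] a^[l-1] + b^[l]
    a_unary = np.column_stack([z[:, 0],           # identity unit
                               np.sin(z[:, 1]),   # sin unit
                               np.cos(z[:, 2])])  # cos unit
    a_binary = (z[:, 3] * z[:, 4])[:, None]       # product of two pre-activations
    return np.hstack([a_unary, a_binary])

x = rng.uniform(-1, 1, (5, 2))        # batch of 5 samples, 2 features
W, b = rng.normal(size=(2, 5)), np.zeros(5)
out = eql_layer(x, W, b)
print(out.shape)                      # (5, 4): three unary outputs + one product
```

Since every unit is differentiable, gradients flow through the operator choices via the weights W, which is exactly what the l1 penalty then sparsifies.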

Tree expression
This section discusses SR methods in which a mathematical expression is regarded as a unary-binary tree consisting of internal nodes and terminals. Every internal node represents a mathematical operation (e.g., +, −, ×, sin, log, etc.) drawn from a predefined function class F (Section 2.1), and every terminal node (or leaf) represents an operand, i.e., a variable or a constant, as illustrated in the example shown in Figure 9. Expression tree-based methods include genetic programming, transformers, and reinforcement learning.

Genetic Programming
Genetic programming (GP) is an evolutionary algorithm that searches the space of computer programs to solve a given problem. Starting with a randomly generated "population" (set) of "individuals" (trees), GP evolves the initial population using a set of evolutionary "transition rules" (operations) {r_i : f → f | i ∈ N} defined over the tree space. GP evolutionary operations include mutation, crossover, and selection. The mutation operation introduces random variations into an individual by replacing one subtree with another randomly generated subtree (Figure 10). In each generation, each individual has a probability of undergoing a mutation operation and a probability of undergoing a crossover operation. Selection is applied so that the size of the current population matches that of the previous one. At each iteration k, the following steps are undertaken: (1) the transition rules are applied to the function set; (2) the loss function F_k is evaluated for the set; and (3) an elite set of individuals is selected for the next iteration. The GP algorithm repeats this procedure until a predetermined accuracy level is achieved. Whereas GP allows for large variations in the population, resulting in improved performance on out-of-distribution data, GP-based methods do not scale well to high-dimensional data sets and are highly sensitive to hyperparameters [19].
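The mutation and crossover operations can be sketched on nested-list expression trees. This is a minimal illustration; the library, depth limits, and the 0.3/0.5 branching probabilities are arbitrary choices:

```python
import random

random.seed(0)
OPS = ["+", "*", "-"]
TERMS = ["x", "1"]

def random_tree(depth):
    """Grow a random expression tree: either a terminal or [op, left, right]."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    return [random.choice(OPS), random_tree(depth - 1), random_tree(depth - 1)]

def mutate(tree, depth=2):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, list) or random.random() < 0.3:
        return random_tree(depth)
    new = list(tree)
    i = random.choice([1, 2])
    new[i] = mutate(new[i], depth)
    return new

def crossover(a, b):
    """Graft a randomly chosen subtree of b onto a random position in a."""
    if not isinstance(a, list) or random.random() < 0.3:
        sub = b
        while isinstance(sub, list) and random.random() < 0.5:
            sub = sub[random.choice([1, 2])]
        return sub
    new = list(a)
    i = random.choice([1, 2])
    new[i] = crossover(new[i], b)
    return new

parent1, parent2 = random_tree(3), random_tree(3)
child = crossover(mutate(parent1), parent2)
print(parent1, "->", child)
```

A full GP loop would score each individual against the data, keep an elite set, and repeat; the two operators above are what generate the variation that selection then filters.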

Transformers
Transformer neural networks (TNNs) are a neural network architecture introduced by Vaswani et al. [46] in natural language processing (NLP) to model sequential data. TNNs are based on the attention mechanism, which aims to model long-range dependencies in a sequence. Consider the English-to-French translation of the two following sentences:

En: The kid did not go to school because it was closed. Fr: L'enfant n'est pas allé à l'école parce qu'elle était fermée.
En: The kid did not go to school because it was cold. Fr: L'enfant n'est pas allé à l'école parce qu'il faisait froid.
The two sentences are identical except for the last word, where "it" refers to the school in the first sentence (i.e., "closed") and to the weather in the second one (i.e., "cold"). Transformers create a context-dependent word embedding that pays particular attention to the terms of the sequence with high weights. In this example, the noun that the adjective of each sentence refers to receives a significant weight and is therefore taken into account when translating the word "it". Technically, an embedding x_i is assigned to each element of the input sequence, and a set of m key-value pairs is defined, i.e., S = {(k1, v1), · · · , (km, vm)}. For each query, the attention mechanism computes a linear combination of values Σ_j ω_j v_j, where the attention weights ω_j ∝ q · k_j are derived from the dot product between the query q and all keys k_j. Here, q = x W_q is a query, k_j = x_j W_k is a key, v_j = x_j W_v is a value, and W_q, W_k, W_v are the learnable parameters. The architecture of the self-attention mechanism is illustrated in Figure 11.
Figure 11: Evaluation of Attention(q, S) (Eq.22) for a query q i , computed using the input vector embedding x i .
In the context of SR, both the input data points {(x_i, y_i)} and the mathematical expression f are encoded as sequences of symbolic representations, as discussed in Section 2.2. The role of the transformer is to create dependencies at two levels: between the numerical and symbolic sequences, and between the tokens of the symbolic sequence. Consider the mathematical expression f(x, y, z) = sin(x/y) − sin(z), which can be written as the following sequence of tokens in Polish notation: [−, sin, ÷, x, y, sin, z]. Each symbol is associated with an embedding such that: x1: −, x2: sin, x3: ÷, x4: x, x5: y, x6: sin, x7: z. In this particular example, for the query (x7: z), the attention mechanism will give a higher weight to the binary operator (x1: −) than to the variable (x5: y) or the division operator (x3: ÷).
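The attention computation described above can be sketched in a few lines of numpy; the embedding dimension and random weight matrices are illustrative, and the seven rows stand in for the token sequence of the example:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # attention weights: each row sums to 1
    return w @ V, w

rng = np.random.default_rng(4)
# 7 token embeddings for the sequence [-, sin, /, x, y, sin, z], dimension 8
X = rng.normal(size=(7, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                         # (7, 8): one context-dependent vector per token
```

Each output row is a weighted mixture of all value vectors, which is how every token's new representation can depend on any other token in the sequence.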
Transformers consist of an encoder-decoder structure; each block comprises a self-attention layer and a feedforward neural network. A TNN inputs a sequence of embeddings {x_i} and outputs a "context-dependent" sequence of embeddings {y_i}, one at a time, through a latent representation z_i. The TNN is an auto-regressive model, i.e., sampling each symbol is conditioned on the previously sampled symbols and the latent sequence. An example of a TNN encoder is shown in Figure 12.
Figure 12: Structure of a TNN encoder [46]. It comprises an attention layer and a feed-forward neural network.
In the symbolic regression case, the encoder and the decoder do not share the same vocabulary, because the decoder handles a mixture of symbolic and numeric representations while the encoder handles only numeric representations. There exist two approaches to solving SR problems with transformers. The first is the skeleton approach [32,49], where the transformer conducts a two-step procedure: (1) the decoder predicts a skeleton f_e, a parametric function that defines the general shape of the target expression up to a choice of constants, using the function class F, and (2) the constants are fitted using optimization techniques such as the non-linear optimization solver BFGS. For example, if f = cos(2x_1) − 0.1 exp(x_2), then the decoder predicts f_e = cos(• x_1) − • exp(x_2), where • denotes an unknown constant. The second is the end-to-end (E2E) approach [31], where both the skeleton and the numerical values of the constants are predicted simultaneously. Both approaches are further discussed in Section 7.
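The second step of the skeleton approach can be sketched as follows (a toy fit of the example skeleton above with SciPy's BFGS; the data-generation details are our own assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# Fit the constants of the predicted skeleton f_e = cos(c1*x1) - c2*exp(x2)
# to data generated from the target f = cos(2*x1) - 0.1*exp(x2).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.cos(2 * X[:, 0]) - 0.1 * np.exp(X[:, 1])

def skeleton(c, X):
    return np.cos(c[0] * X[:, 0]) - c[1] * np.exp(X[:, 1])

def loss(c):
    # mean squared error between skeleton output and data
    return np.mean((skeleton(c, X) - y) ** 2)

res = minimize(loss, x0=np.array([1.0, 1.0]), method="BFGS")
c1, c2 = res.x   # expected to land near (2, 0.1)
```

Since cos is even, BFGS may equally well return c1 ≈ −2; both choices complete the skeleton into the same expression.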

Reinforcement learning
Reinforcement learning (RL) provides a framework for learning and decision-making by trial and error [50]. An RL setting consists of the four components (S, A, P, R) of a Markov decision process. In this setting, an agent observes a state s ∈ S of the environment and, based on that, takes an action a ∈ A, which results in a reward r = R(s, a); the environment then transitions to a new state s′ ∈ S according to the transition dynamics P. The interaction goes on in time steps until a terminal state is reached. The aim of the agent is to learn a policy, a mapping from states to actions that maximizes the expected cumulative reward. An exemplary sketch of an RL-based SR method is illustrated in Figure 13. The SR problem can be framed in RL as follows: the agent (an NN) observes the environment (the parent and sibling of a node in the expression tree) and, based on that observation, takes an action (predicting the next token of the sequence) and transitions into a new state. In this view, the NN model plays the role of the policy, the parent and sibling are the observations, and the sampled symbols are the actions.

Applications
Most existing algorithms for solving SR are GP-based, whereas many of the more recent ones are deep learning (DL)-based. There exist two different strategies to solve SR problems, as illustrated in the taxonomy of Figure 14. The first is a one-step approach, where data points are directly fed into an SR algorithm. The second is a two-step approach involving a process that either learns a new representation of the data or learns a "blackbox" model, which is then fed into the SR algorithm, as described below: 1. Learn a new representation of the original data set, by defining new features (reducing the number of independent variables) or by learning a reduced representation using techniques such as principal component analysis and autoencoders.
2. Learn a "blackbox" model, using either a regular NN or a conceptual NN such as a graph neural network (GNN). In this case, an SR algorithm is applied to the learned model or to parts of it.
We group the applications based on the categories presented in Section 3 and summarize them in Table 5. GP-based applications will not be reviewed here; they are listed in the living review [25], along with DL-based applications. State-of-the-art GP-based methods are discussed in detail in [54]. Among GP-based applications is the commercial software Eureqa [26], the most well-known GP-based method, which uses the algorithm proposed by Schmidt and Lipson in [2]. Eureqa is used as a baseline SR method in several research works. SINDY-AE [33] is a hybrid SR method that combines an autoencoder network [55] with linear SR [51]. The novelty of this approach is in simultaneously learning a sparse dynamical model and the reduced coordinate representation that defines the model, using snapshot data. Given a data set x(t) ∈ R^n, this method seeks to learn coordinate transformations from the original to intrinsic coordinates, z = φ(x) (encoder), and back via x = ψ(z) (decoder), along with the dynamical model associated with the set of reduced coordinates z(t).

The model is learned through a customized loss function L, defined as a sum of four terms: the decoder reconstruction loss, a SINDy loss on the derivatives in the reduced coordinates ż, a SINDy loss on the derivatives in the original coordinates ẋ, and a sparsity regularization on the coefficients. Here the derivatives of the reduced variables z are computed from the derivatives of the original variables x, i.e., ż = ∇_x φ(x) ẋ. The predicted derivatives, ż_pred, are NN outputs expressed in terms of the coefficient vector Θ and the library matrix U following Eq. 19, i.e., ż_pred = U(z^T)Θ = U(φ(x)^T)Θ. The library is specified before training, and the coefficients Θ are learned together with the NN parameters as part of the training procedure.
A case study is the nonlinear pendulum, whose dynamics are governed by a second-order differential equation, ẍ = − sin(x). The data set is generated as a series of snapshot images from a simulated video of a nonlinear pendulum. After training, the SINDy autoencoder correctly identified the equation z̈ = −0.99 sin(z), which is the dynamical model of the nonlinear pendulum in the reduced representation. This approach is particularly efficient when the dynamical model would be dense in terms of functions of the original measurement coordinates x. This method and similar works [56] pave the way toward "GoPro physics", where researchers point a camera at an event and an algorithm returns an equation capturing the underlying phenomenon.
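The sparse-regression step at the core of SINDy can be sketched in isolation (ignoring the autoencoder and working directly in a single measurement coordinate; the sequential-thresholding loop below is a simplified stand-in for the actual training procedure):

```python
import numpy as np

# Toy SINDy-style sparse regression: recover x_dot = -sin(x) from samples
# by sequentially thresholded least squares over a fixed candidate library.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
x_dot = -np.sin(x)                       # "measured" derivatives

# Candidate library U(x); column names listed for readability.
names = ["1", "x", "x^2", "sin(x)", "cos(x)"]
U = np.column_stack([np.ones_like(x), x, x**2, np.sin(x), np.cos(x)])

theta, *_ = np.linalg.lstsq(U, x_dot, rcond=None)
for _ in range(5):                       # zero out small terms, then refit
    small = np.abs(theta) < 0.1
    theta[small] = 0.0
    active = ~small
    theta[active], *_ = np.linalg.lstsq(U[:, active], x_dot, rcond=None)
# Expect theta ~ [0, 0, 0, -1, 0]: only the sin(x) term survives.
```

Because sin(x) is in the library, the sparse solution is exact; the limitation discussed next arises when no library column matches a term of the target model.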
Despite successful applications involving partial differential equations, one main limitation of this method lies in its linear SR part. For example, a model expressed as f(x) = x_1 x_2 − 2 x_2 exp(−x_3) + (1/2) exp(−2 x_1 x_3) is discovered only if each term of this expression is comprised in the library, e.g., exp(−2 x_1 x_3). The presence of the exponential function alone, i.e., exp(x), is insufficient to discover the second and third terms.
Symbolic metamodeling [30] (SM) is a model-of-a-model method for interpreting "blackbox" model predictions. It inputs a learned "blackbox" model and outputs a symbolic expression. Available post-hoc methods aim to explain ML model predictions, i.e., they can explain some aspects of a prediction but cannot offer a full model interpretation. In contrast, SM is interpretable because it uncovers the functional form that underlies the learned model. The symbolic metamodel is based on the Meijer G-function [57,58], a special univariate function characterized by a set of indices, G^{m,n}_{p,q}(a_p, b_q | x), where a_p and b_q are two sets of real-valued parameters. An instance of the Meijer G-function is specified by (a, b); for example, the function G^{1,2}_{2,2}(a, a; a, b | x) takes different forms for different settings of the parameters a and b, as illustrated in Figure 15.
In the context of SR problem solving, the target mathematical expression is defined as a parameterization of the Meijer G-function, i.e., {g(x) = G(θ, x) | θ = (a, b)}, thus reducing the optimization task to a standard parameter optimization problem that can be efficiently solved using gradient descent algorithms. The parameters a and b are learned during training, and the indices (m, n, p, q) are regarded as hyperparameters of the model. SM was tested on both synthetic and real data and was deployed in two modes, spanning (1) only polynomial expressions (SM_p) and (2) closed-form expressions (SM_c), in comparison to a GP-based SR method. SM_p produces accurate polynomial expressions for three out of the four tested functions (the exception being the Bessel function), whereas SM_c produces the correct ground-truth expression for all four functions and significantly outperforms GP-based SR.
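The key idea, replacing a discrete search over symbolic forms by continuous optimization over a parametric family, can be illustrated with a toy family (our own stand-in g(x; a, b) = x^a exp(−b x), deliberately NOT the Meijer G-function used in the actual method):

```python
import numpy as np
from scipy.optimize import minimize

# Recover the "blackbox" behavior x * exp(-2x) by fitting the parameters
# (a, b) of the family g(x; a, b) = x**a * exp(-b*x) via gradient-based
# optimization, mirroring how SM fits the Meijer G-function parameters.
x = np.linspace(0.1, 3.0, 100)
target = x * np.exp(-2.0 * x)            # "blackbox" predictions to mimic

def loss(p):
    a, b = p
    return np.mean((x**a * np.exp(-b * x) - target) ** 2)

res = minimize(loss, x0=np.array([0.5, 1.0]), method="BFGS")
a, b = res.x                             # expected to land near (1, 2)
```

Once the continuous optimum is found, the fitted parameters select a member of the family that can then be read off as a closed-form expression, which is what makes the metamodel interpretable.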
More generally, consider a problem in a critical discipline such as healthcare. Assume a feature vector comprising age, gender, weight, blood pressure, temperature, disease history, profession, etc., with the aim of predicting the risk of a given disease. Predictions made by a "blackbox" could be highly accurate. However, the learned model does not provide insights into why the risk is high or low for a patient, or which parameter is the most critical in the prediction. Applying the symbolic metamodel to the learned model yields a symbolic expression in a few features, say x_1 and x_2, where x_1 is the blood pressure and x_2 is the age.
Here, we learn that only two features (out of many) are crucial for the prediction, and that the risk increases with high blood pressure and decreases with age. This is an ideal example of the difference between "blackbox" and interpretable models. In addition, it is worth mentioning that methods applied for model interpretation only exploit part of the prediction and cannot unveil how the model captures nonlinearities in the data. Thus, model interpretation methods are insufficient to provide full insight into why and how model predictions are made, and they are by no means equivalent to interpretable models.
End-to-end symbolic regression [31] (E2ESR) is a transformer-based method that uses end-to-end learning to solve SR problems. It is made up of three components: (1) an embedder that maps each input point (x_i, y_i) to a single embedding, (2) a fully-connected feedforward network, and (3) a transformer that outputs a mathematical expression. What distinguishes E2ESR from other transformer-based applications is the use of an end-to-end approach without resorting to skeletons, thus using symbolic representations for the operators and variables together with numeric representations for the constants. Both the input data points {(x_i, y_i) | i ∈ N_n} and the mathematical expressions f are encoded as sequences of symbolic representations following the description in Section 2.2. E2ESR was tested against several GP-based and DL-based applications on SR benchmarks. Results are reported in terms of mean accuracy, formula complexity, and inference time, and it was shown that E2ESR achieves very competitive results and outperforms previous applications.
AIFeynman [27] is a physics-inspired SR method that recursively applies a set of solvers, i.e., dimensional analysis, polynomial fit, and brute-force search, to solve an SR problem. If the problem is not solved, the algorithm searches for simplifying intrinsic properties in the data (e.g., invariance, factorization) using an NN and deploys them to recursively split the dataset into simpler sub-problems with fewer independent variables. Each sub-problem is then tackled by a symbolic regression method of choice. The authors created the Feynman SR database (see Section 8) to test their approach. Their algorithm solved all the basic equations and 90% of the bonus equations, outperforming Eureqa.
Deep Symbolic Regression (DSR) [19] is an RL-based search method for symbolic regression that uses a generative recurrent neural network (RNN). The RNN defines a probability distribution p(τ|θ) over mathematical expressions τ, and batches of expressions T = {τ^(i)}, i = 1, ..., N, are stochastically generated. An exemplary sketch of how the RNN generates an expression (e.g., x^2 − cos(x)) is shown in Figure 16. Starting with the first node in the pre-order traversal (Section 2.2) of the expression tree, the RNN is initially fed with empty placeholder tokens (a parent and a sibling) and produces a categorical distribution, i.e., it outputs the probability of selecting each token from the defined library L = {+, −, ×, ÷, sin, cos, log, etc.}. A token is sampled and placed in the first node, and the number of children is determined by whether the operation is unary (one child) or binary (two children). The second node is then visited, and the RNN, fed with that node's parent and sibling tokens, outputs a new (and potentially different) categorical distribution. This procedure is repeated until the expression is complete. Expressions are then scored with a reward function R(τ) that measures the goodness of fit of each candidate expression f to the data D via the normalized root-mean-square error (NRMSE), e.g., R(τ) = 1/(1 + NRMSE). To generate better expressions, the probability distribution p(τ|θ) needs to be optimized. A gradient-based approach would require the reward function R(τ) to be differentiable with respect to the RNN parameters θ, which it is not. Instead, the learning objective is defined as the expectation of the reward over expressions sampled from the policy, i.e., J(θ) = E_{τ∼p(τ|θ)}[R(τ)], and reinforcement learning is used to maximize J(θ) by means of the standard policy gradient: ∇_θ J(θ) = E_{τ∼p(τ|θ)}[R(τ) ∇_θ log p(τ|θ)]. This reinforcement learning trick, called REINFORCE [59], can be derived using the definition of the expectation E[·] and the derivative of the log(·) function. The importance of this result is that it allows the expectation to be estimated using samples from the distribution. More explicitly, the gradient of J(θ) is estimated by computing the mean over a batch of N sampled expressions: ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} R(τ^(i)) ∇_θ log p(τ^(i)|θ). The standard policy gradient (Eq. 25) optimizes a policy's average performance over all samples from the distribution. Since SR requires maximizing best-case performance, i.e., optimizing over only the top fraction ε of samples found during training, a new learning objective is defined as a conditional expectation of the rewards above the (1 − ε)-quantile of the reward distribution: J_risk(θ; ε) = E_{τ∼p(τ|θ)}[R(τ) | R(τ) ≥ R_ε(θ)], where R_ε(θ) denotes the (1 − ε)-quantile of the distribution of rewards under the current policy. The gradient of this new learning objective is given by: ∇_θ J_risk(θ; ε) = E_{τ∼p(τ|θ)}[(R(τ) − R_ε(θ)) ∇_θ log p(τ|θ) | R(τ) ≥ R_ε(θ)]. DSR was evaluated mainly on the Nguyen SR benchmark and several additional variants of this benchmark. An excellent recovery rate was reported for each set, and DSR solved all mysteries except the Nguyen-12 benchmark, given by x^4 − x^3 + (1/2)y^2 − y. More details on SR data benchmarks can be found in Section 8.
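To make the REINFORCE update concrete, here is a minimal sketch (a single-token "expression" over a toy library with a hand-crafted reward; none of this is the actual DSR architecture):

```python
import numpy as np

# Minimal REINFORCE: a softmax policy over a toy token library is trained
# so that sampling the rewarded token becomes more likely.
# Update: theta += lr * R(tau) * grad log p(tau|theta).
rng = np.random.default_rng(0)
library = ["+", "-", "sin", "x"]
logits = np.zeros(len(library))           # policy parameters theta

def reward(token_idx):
    # Hand-crafted stand-in for 1/(1 + NRMSE): only "sin" is rewarded.
    return 1.0 if library[token_idx] == "sin" else 0.0

lr = 0.5
for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(len(library), p=probs)         # sample an action tau
    grad_logp = -probs                            # d log p_i / d logits ...
    grad_logp[i] += 1.0                           # ... = one_hot(i) - probs
    logits += lr * reward(i) * grad_logp          # REINFORCE update

probs = np.exp(logits) / np.exp(logits).sum()
# After training, the policy concentrates its mass on "sin".
```

Even though the reward is non-differentiable (a lookup), the policy improves because the gradient is taken through log p(τ|θ), exactly the trick DSR relies on.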
Neural-guided genetic programming population seeding [29] (NGPPS) is a hybrid method that combines GP and the RNN of [19], leveraging the strengths of each of the two components. Whereas standard GP begins with random starting populations, the authors in [29] propose to use the batch of expressions sampled by the RNN as the starting population for GP.
NeSymReS [32] is a transformer-based method trained on a large set of synthetically generated expressions, where the number of data points n can vary across examples and the number of independent input variables is at most three. In the test phase, a set of input-output pairs {x_i, y_i} is fed into the encoder, which maps it into a latent vector z, and the decoder iteratively samples candidate skeletons. What distinguishes this method is the learning task, i.e., it improves over time with experience and does not need to be retrained from scratch for each new experiment. It was shown that NeSymReS outperforms selected baselines (including DSR) in time and accuracy by a large margin on all datasets (AI-Feynman, Nguyen, and strictly out-of-sample equations (SOOSE) with and without constants). NeSymReS is more than three orders of magnitude faster at reaching the same maximum accuracy as GP, while running only on a CPU.
GNN [52] is a hybrid scheme that performs SR by training a graph neural network (GNN) and applying SR algorithms to the GNN's components to find mathematical equations.
A case study is Newtonian dynamics, which describes the dynamics of particles in a system according to Newton's laws of motion. The data set D consists of an N-body system with a known interaction (a force law F such as electric, gravitational, spring, etc.), where particles (nodes) are characterized by their attributes (mass, charge, position, velocity, and acceleration) and their interactions (edges) are assigned attributes of dimension 100. The GNN functions are trained to predict the instantaneous acceleration of each particle using the simulated data and are then applied to a different data sample. The study shows that the most significant edge attributes, say {e_1, e_2}, fit a linear combination of the true force components {F_1, F_2} used in the simulation, showing that the edge attributes can be interpreted as force laws. The most significant edge attributes were then passed into Eureqa to uncover analytical expressions, which turned out to be equivalent to the simulated force laws. The proposed approach was also applied to datasets in the field of cosmology, where it discovered an equation that fits the data better than the existing hand-designed equation.
The same group recently succeeded in inferring Newton's law of gravitation using a GNN together with PySR for the symbolic regression task [34]. The GNN was trained using observed trajectories (positions) of the Sun, planets, and moons of the solar system collected over 30 years. The SR algorithm correctly inferred Newton's formula describing the interaction between masses, i.e., F = −G M_1 M_2 / r^2, as well as the masses and the gravitational constant.

Datasets
For symbolic regression purposes, there exist several benchmark data sets that can be categorized into two main groups: (1) ground-truth problems (synthetic data) and (2) real-world problems (real data), as summarized in Figure 18. In this section, we describe each category and discuss its main strengths and limitations.

Ground-truth problems

Ground-truth regression problems are characterized by known mathematical equations; they are listed in Table 6. These include (1) physics-inspired equations [27,60] and (2) real-valued symbolic equations [1,14,15,16,17,18,19,61]. The Feynman Symbolic Regression Database [62], proposed in [27], is the largest SR database; it originates from the Feynman Lectures on Physics series [63,64]. It consists of 119 physics-inspired equations that describe static physical systems and various physics processes. The equations depend on at least one and at most nine variables. Each benchmark (corresponding to one equation) is generated by randomly sampling one million entries. Each entry is a row of randomly generated input variables, sampled uniformly between 1 and 5. This sampling range was slightly adjusted for some equations to avoid unphysical results (e.g., division by zero or the square root of a negative number). The output is evaluated using the function f.
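A minimal sketch of how one such benchmark table can be generated (here for a Coulomb-type form q_1 q_2 / r^2, omitting the constant k for simplicity; the sampling recipe follows the description above):

```python
import numpy as np

# Generate one Feynman-style benchmark: sample the input variables
# uniformly in [1, 5] and evaluate the ground-truth equation per row.
rng = np.random.default_rng(0)
n = 1_000_000                       # one million rows per benchmark
q1, q2, r = (rng.uniform(1, 5, size=n) for _ in range(3))
F = q1 * q2 / r**2                  # output column of the table
table = np.column_stack([q1, q2, r, F])
```

An SR method is then handed only the rows of `table` and asked to recover the generating equation.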
This benchmark is rich in theoretical formulae. Still, it suffers from a few limitations: (1) There is no distinction between variables and constants, i.e., constants are randomly sampled, in some cases in domains extremely far from their physical values. For example, the speed of light is sampled from a uniform distribution U(1, 20), whereas its physical value is orders of magnitude higher, i.e., c = 2.998 × 10^8 m/s, and the gravitational constant is sampled from U(1, 2), whereas its physical value is orders of magnitude smaller, G = 6.6743 × 10^−11 m^3 kg^−1 s^−2, among others (e.g., vacuum permittivity ∼ 10^−12, Boltzmann constant k_B ∼ 10^−23, Planck constant h ∼ 10^−34). (2) Some variables are sampled in nonphysical ranges. For example, the gravitational force between two masses at distance r is F = G m_1 m_2 / r^2. This force is weak unless it acts between significantly massive objects (e.g., the mass of the Earth is M_e = 5.9722 × 10^24 kg), whereas m_1 and m_2 are sampled in U(1, 5) in the Feynman database. (3) Some variables are treated as floats although they are integers. (4) Many equations are duplicates of each other (e.g., a multiplicative function of two variables f(x, y) = x · y) or have similar functional forms.
The ODE-Strogatz repository [60] consists of ten physics equations that describe the behavior of dynamical systems, which can exhibit chaotic and/or non-linear behavior. Each dataset is one state of a two-state system of ordinary differential equations.
Within the same category, there exist several benchmarks [1,14,15,16,17,18,19] consisting of real-valued symbolic functions. The majority of these benchmarks were proposed for GP-based methods and comprise polynomial, trigonometric, logarithmic, exponential, and square-root functions, as well as combinations of univariate and bivariate functions. The suggested functions do not have any physical meaning, and most depend on one or two independent variables. Datasets are generally generated by randomly sampling either 20 or 100 points in narrow ranges. The most commonly known is the so-called Nguyen benchmark, which consists of 12 symbolic functions taken from [65,66,67]. Only four of its equations contain constants, namely the scalars {1, 2, 1/2}. Each benchmark is defined by a ground-truth expression, a training dataset, and a test dataset. Since the equations proposed in these benchmarks cannot be found in a single repository, we list them in the appendix, in Tables 7-9 and Tables 10-11, for completeness and easy comparison.
Real-world problems are characterized by an unknown model underlying the data. This category comprises two groups: observations and measurements. Data sets in the observations category can originate from any domain, such as health informatics, environmental science, business, commerce, etc. Data may be collected online or offline from reports or studies. A wide range of problems can be accessed from the following repositories: PMLB [68], OpenML [69], and UCI [70]. An exemplary application in this category is wind speed forecasting [71]. Measurements are sets of data points that are collected (and sometimes analyzed) in physics experiments. Here the target model either is an underlying theory that can be derived from first principles or is unknown. In the first case, symbolic regression should infer the correct model structure and parameters or contribute to the theory development of the studied process, whereas in the second case, the symbolic regression output could itself be the awaited theory.

Discussion
SR is a growing area of ML and is gaining more attention as interpretability is increasingly promoted [3] in AI applications. SR is propelled by the fact that ML models are growing enormous in parameter count in the pursuit of accurate predictions, at the expense of interpretability. An exemplary application is ChatGPT-4, a large language model comprising hundreds of billions of parameters and trained on hundreds of terabytes of textual data. Such big models are very complicated networks. ChatGPT-4, for example, is accomplishing increasingly complicated and intelligent tasks, to the point that it is showing emergent properties [72]. However, it is not straightforward to understand when it works and, more importantly, when it does not. In addition, its performance improves with an increasing number of parameters, highlighting that its prediction accuracy depends on the size of the training data set. Therefore, a new paradigm is needed, especially in scientific disciplines, such as the physical sciences, where problems are of a causal, hypothesis-driven nature. SR is by far the most promising candidate to fulfill the interpretability requirements and is expected to play a central role in the future of ML.
Despite the significant advances made in this subfield and the high performance of most deep learning-based SR methods proposed in the literature, SR methods still fail to recover relatively simple relationships. A case in point is the Nguyen-12 expression, i.e., f(x, y) = x^4 − x^3 + y^2/2 − y, where x and y are uniformly sampled in the range [0, 1]. The NGPPS method could not recover this particular expression using the library basis L = {+, −, ×, ÷, sin, cos, exp, log, x, y}. A variant consisting of the same equation but defined over a larger domain, i.e., with data points sampled in [0, 10], was successfully recovered using the same library, but with a recovery rate of only 12%. This result is significantly below the perfect performance on all other Nguyen expressions. A similar observation is made for Livermore-5, whose expression is f(x, y) = x^4 − x^3 + x^2 − y.
We ran NGPPS on Nguyen-12 with two libraries: a pure polynomial basis L_1 = {+, −, ×, ÷, (·)^2, (·)^3, (·)^4, x, y} and a mixed basis L_2 = L_1 ∪ {sin, cos, exp, log, sqrt, expneg}. The algorithm succeeds in recovering Nguyen-12 only with the pure polynomial basis, and with a recovery rate of 3%. The same observation is made when applying linear SR to Nguyen-12. This highlights how strongly the predicted expression depends on the set of allowable mathematical operations. A practical way to counter this limitation is to incorporate basic domain knowledge into SR applications whenever possible. For example, astronomical data collected from the light curves of astronomical objects exhibit periodic behavior; in such cases, periodic functions such as the trigonometric functions should be part of the library basis.
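The dependence on the allowed library can be illustrated even with plain linear-in-parameters regression (a simplified stand-in for a full SR run; the sample sizes and library choices below are our own):

```python
import numpy as np

# Fit Nguyen-12, f(x, y) = x^4 - x^3 + y^2/2 - y, on [0, 10] by linear
# least squares over a polynomial library versus a trigonometric one.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 10, size=(2, 500))
f = x**4 - x**3 + y**2 / 2 - y

poly = np.column_stack([x**4, x**3, x**2, x, y**2, y, np.ones_like(x)])
trig = np.column_stack([np.sin(x), np.cos(x), np.sin(y), np.cos(y),
                        np.ones_like(x)])

rmse = {}
for name, U in [("polynomial", poly), ("trigonometric", trig)]:
    theta, *_ = np.linalg.lstsq(U, f, rcond=None)
    rmse[name] = np.sqrt(np.mean((U @ theta - f) ** 2))
# rmse["polynomial"] is essentially zero; rmse["trigonometric"] is enormous.
```

With the polynomial library the target is an exact linear combination of the columns, so the fit is perfect; with the trigonometric library no combination of bounded periodic functions can track a quartic on [0, 10].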
Most SR methods are only applied to synthetic data for which the input-output relationship is known. This is justified because the methods must be cross-checked and their performance evaluated against ground-truth expressions. However, the reported results then hold for synthetic data only. To the best of our knowledge, only one physics application [34] has succeeded in extracting Newton's law of gravitation by applying SR to astronomical data. The scarcity of such applications leads us to state that SR is still a relatively nascent area with the potential to make a big impact. Physics in general, and the physical sciences in particular, represent a very broad field for SR development and are very rich both in data and in expressions; e.g., areas such as astronomy and high-energy physics are very rich in data. In addition, much of our acquired knowledge in physics can be used for testing SR methods, because the underlying phenomena and equations are well known. All that is needed is greater effort and investment.

Conclusion
This work presents an in-depth introduction to the symbolic regression problem and an expansive review of its methodologies and state-of-the-art applications. It also highlights a number of conclusions that can be drawn about symbolic regression methods: (1) linear symbolic regression suffers many limitations, all originating from predefining the model structure; (2) neural network-based methods lead to numerical issues, and the library cannot include all mathematical operations; (3) expression tree-based methods, in particular transformer-based ones, are as yet the most powerful in terms of model performance on synthetic data; (4) model predictions strongly depend on the set of allowable operations in the library basis; and (5) generally, deep learning-based methods perform better than other ML-based methods.
Symbolic regression is a powerful tool for learning interpretable models in a data-driven manner. Its application is likely to grow in the future because it balances prediction accuracy and interpretability. Despite the limited number of SR applications to real data, the few existing ones are very promising. A potential path to boost progress in this subfield is to apply symbolic regression to experimental data in physics.

Figure 1 :
Figure 1: (a) Example of a unary-binary tree that encodes f(x) = x_1 x_2 − 2 x_3. (b) Sequence representation of the tree-like structure of f(x).

Figure 4 :
Figure 4: Result of linear SR for the Nguyen-1 benchmark, i.e., f(x) = x + x^2 + x^3. Red points represent the (test) data set. The red curve represents the true function. The blue and black dashed curves represent the learned functions using L_1 and L_2, respectively.

Figure 5 :
Figure 5: Two-dimensional multivariate normal distribution used in test applications.

Figure 8 :
Figure 8: Exemplary setup of a standard NN (8a) and an EQL-NN (8b) with input x, output ŷ, and two hidden layers. In (a), f denotes the activation function, usually chosen among {ReLU, tanh, sigmoid}, while in EQL each node has a specific activation function drawn from the function class F.

Figure 9 :
Figure 9: (a) Expression-tree structure of f(x) = x^2 − cos(x). (b) f(x) as a function of x (blue curve) and data points (red points) generated using f(x).

Figure 10 :
Figure 10: Crossover (left) and mutation (right) operations on exemplary expression trees in genetic programming.

Figure 13 :
Figure 13: Exemplary sketch of a general RL-based SR method. s_t, a_t, and r_t = R(s_t, a_t) denote the state, action, and reward at time step t; (t + 1) denotes the next time step.

Figure 14 :
Figure 14: Strategies for solving the SR problem. An SR algorithm has three types of input: data (x), a new or reduced representation of the data (x or z), or a model (f(x)) learned from the data.

Figure 15 :
Figure 15: Example of a Meijer G-function G^{1,2}_{2,2}(a, a; a, b | x) for different values of a and b [30].
T^0_GP = T_RNN. Each iteration of the proposed algorithm consists of the following steps: (1) the batch of expressions sampled by the RNN is passed as a starting population to GP; (2) S generations of GP are performed, resulting in a final GP population T^S_GP; (3) an elite set of top-performing GP samples T^E_GP is selected and passed to the gradient update of the RNN.

Figure 18 :
Figure 18: Taxonomy based on the type of SR benchmark problems.

Table 1 :
Table summarizing symbolic regression methods. The mathematical tool, the expression form, the set of unknowns, and the search space are specified for each method. Set2seq abbreviates "set-to-sequence".

Table 2 :
Results of linear SR in the case of univariate functions.

Table 4 :
Library bases used in the test problems of Sec. 4.1.2.

Table 5 :
Table summarizing symbolic regression applications. D refers to data input, 1 refers to data-representation input, and 2 refers to model input to SR.

Table 6 :
Table summarizing ground-truth problems for symbolic regression.