Efficient generator of mathematical expressions for symbolic regression

We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathematical expressions and can accurately encode expressions into a smooth low-dimensional latent space. The latter can be efficiently explored with various optimization methods to address the task of symbolic regression. Indeed, random search through the latent space of HVAE performs better than random search through expressions generated by manually crafted probabilistic grammars for mathematical expressions. Finally, EDHiE system for symbolic regression, which applies an evolutionary algorithm to the latent space of HVAE, reconstructs equations from a standard symbolic regression benchmark better than a state-of-the-art system based on a similar combination of deep learning and evolutionary algorithms.


Introduction
Symbolic regression (also known as equation discovery) aims at discovering closed-form equations in collections of measured data [1,2].Methods for symbolic regression explore the vast space of candidate equations to find those that fit the given data well.They often employ modeling knowledge from the domain of use to constrain the search space of candidate equations.The knowledge is usually formalized into grammars [3] or libraries of model components, such as entities and processes [4].Knowledge-based equation discovery methods have successfully solved practical modeling problems in various domains [5,6].
Grammars and libraries of model components are used to generate candidate expressions that might appear in the discovered equations.However, they must be manually crafted, which is a severe obstacle to their broader use.The central aim of this article is to develop a novel generative model of mathematical expressions that can be used for efficient symbolic regression.The model can be trained from a corpus of mathematical expressions from the domain of interest, thus automatically tailoring the space of candidate equations to the application at hand.The developed generative model must have two essential properties to be applicable in such a scenario.First, it should be trainable from a small number of mathematical expressions, e.g., collected from a textbook or from scientific literature in the application domain.Second, the model should encode the expressions in a low-dimensional latent space.The latter space can then be efficiently explored by optimization methods to solve the task of symbolic regression.Lowering the dimensionality of the latent space will significantly increase the efficiency of symbolic regression.
Recently, several variational autoencoders (VAEs) have been shown to be efficient generative models.CVAE [7] employs a VAE based on recurrent neural networks to encode discrete expressions into a continuous latent space and then decode points from the latent space back into discrete mathematical expressions.This decoder can be used to generate expressions.However, CVAE still generates invalid sequences and requires extensive training data to reduce the likelihood of generating invalid expressions [8].The grammar variational autoencoder, GVAE [9], and its successor, SD-VAE [10], employ a context-free grammar to ensure the syntactic validity of the generated expressions.Instead of directly training models on sequences, they model the distribution of parse trees that are produced by the grammar while deriving syntactically (and, in the case of SD-VAE, semantically) valid expressions.
We claim that grammars are an unnecessarily powerful and too general formalism for generating mathematical expressions.Grammars add syntactic categories to the expression symbols rendering the parse trees, i.e., the structures modeled with the autoencoder, more complex than the original sequences.This overhead on training expressions inevitably translates to a requirement for more extensive training data and a latent space with larger dimensionality, reducing the efficiency of optimization methods for symbolic regression operating in that latent space.
We propose a novel variational autoencoder for hierarchical data objects, HVAE, to address these issues.It builds upon the ideas of variational autoencoders for hierarchical data [11] and gated recursive convolutional neural networks [12].HVAE combines simple atomic units with shared weights to encode and decode the individual nodes in the hierarchy.The atomic units are extensions of the standard gated recurrent unit (GRU) cells.The encoding units are stacked into a tree that follows the hierarchy of the training object, and they encode the hierarchy bottom-up, compiling the codes of the descendants to encode the ancestor nodes.The decoding units proceed top-down and use the decoded symbols of the ancestor nodes to decide upon the need to extend the hierarchy with descendant nodes.We claim that HVAE can be efficiently trained to generate valid mathematical expressions from a training set of modest size, while using a low-dimensional latent space.
We exploit these expected properties of our HVAE to implement a novel approach for symbolic regression, EDHiE.It performs an evolutionary search through the latent space of a HVAE trained on mathematical expression trees as shown in Figure 1.The genetic operations utilize the HVAE encoder to obtain the expressions' latent codes, generate new individuals with crossover and mutation in the latent space, and decode the latter back to mathematical expressions.EDHiE can then evaluate the fit of the obtained expressions against the measurements.We conjecture that the performance of EDHiE on standard benchmarks [13,14] would compare favorably to that of a state-of-theart symbolic regression methods [15].The results of our empirical evaluation of HVAE and EDHiE confirm our conjectures.HVAE can achieve better reconstruction of the training expressions with order-of-magnitude fewer training examples while using latent spaces with fewer dimensions.EDHiE outperforms alternative methods for symbolic regression on the task of reconstructing the ten equations in the Ngyuen benchmark.
We can summarize the contributions of this work as follows: • We propose HVAE, a variational autoencoder for hierarchical data, that can be efficiently trained to generate mathematical expressions from modest amounts of data, while using a low-dimensional latent space.• We introduce EDHiE, a symbolic regression approach that exploits HVAE to efficiently search through the space of candidate equations.
The remainder of the paper is organized as follows.Section 2 reviews related work on generative models and symbolic regression.We introduce the hierarchical variational autoencoder HVAE and the symbolic regression approach EDHiE in Section 3. Section 4 presents the results of the empirical evaluation of HVAE and EDHiE.Finally, Section 5 summarizes and discusses the contributions of the presented work and outlines directions for further research.
Fig. 1: A schematic representation of the EDHiE approach.In the first step, we train a HVAE model.In the second step, we explore the latent space of the HVAE model with an evolutionary algorithm.The red dot represents the best expression in a given iteration.

Related work
Most of the early successful applications of generative models have been in the domains of text, speech, images, and video, i.e., they have been mainly used for generating unstructured data objects composed of continuous data elements.The discrete data structures that generative models have most often tackled are strings or sequences of characters, where the data elements are discrete symbols.The models that use strings as input (and output) usually do so by training a recurrent neural network [16], most commonly using seq2seq autoencoders [17].
A major problem of sequence-to-sequence autoencoders is that they do not guarantee the syntactic correctness of the generated expressions.One way to solve this problem is to learn an additional validation model for checking the correctness of the generated sequence [18].Grammar variational autoencoders (GVAE) [9] use context-free grammars for specifying the space of valid structured data objects.Each data object can be then represented as a sequence of grammar productions (rewrite rules) that derives it.In turn, GVAE encode sequences of rewrite rules that derive objects instead of the objects themselves.The structure of the decoder is constrained to generate valid sequences of rewrite rules that are then used together with the grammar to generate valid expressions.
Dai et al. [10] propose the use of attribute grammars, i.e., context-free grammars that attach attributes to the grammar's syntactic categories.By prescribing properties and relationships between the attributes, such grammars can also encode semantic constraints on the derived data objects.The attribute grammars, together with SD-VAE, i.e., syntax-directed VAE, can generate expressions that are consistent with a set of both syntactic and semantic constraints.Alternative generalizations of grammars have been used for generative modeling of program source code in high-level languages [19].
Most of the above approaches can also generate mathematical expressions.However, they need the complex formalism of grammars to generate more complex data structures, most often molecular structures [7,10].Since mathematical expressions can be represented as simpler structures, i.e., binary trees, our work concerns generative models for hierarchical (tree-structured) data.
Hierarchical data have been tackled by generative models in several ways.By making a node depend on its parent and previous sibling, DRNN [20] combines representations obtained from the depth-wise and width-wise recurrent cells to generate new nodes, which proves useful for recovering the structure of a tree.On the other hand, Tree-LSTM [21] and JT-VAE [11] focus on adapting equations for recurrent cells to encode (and decode) hierarchical structures more efficiently.Tree-LSTM proposes a generalization of the LSTM cell for encoding trees into a representation that proves effective for classification tasks and semantic relatedness of sentence pairs.JT-VAE adapts recurrent cells for tree message passing.Trees are used as scaffolding for the graph that represents molecules.Encoding and decoding are thus split into four parts: encoding of the graph, encoding of the tree, decoding of the tree, and decoding of the graph.While these adaptations are similar to the ones presented in our work, their focus is on encoding more general structures that are unrelated to mathematical expressions.
Note that our model falls into the general framework of gated recursive convolutional neural networks [12] that combine atomic units with shared parameters in a hierarchy.The output of the root node produces a fixed-length encoding of a data object with an arbitrarily complex structure.Another model, marginally related to ours, is the one of equivalence neural networks [22].The encoding produced by these networks follows the expressions' semantic similarity and equivalence, in contrast to their syntactic similarity, which is followed by all the other approaches, including ours.
Finally, our work is also related to algorithms for equation discovery and symbolic regression.Most of them generate candidate expressions for equations first and then estimate the values of their constant parameters by matching the equations against data in the second phase.Classical symbolic regression approaches [1,23,24], based on evolutionary algorithms, use stochastic generators of expression trees: At the beginning, the expression trees are randomly sampled, and later on, they are transformed using the evolutionary stochastic operations of mutation and cross-over.In contrast, process-based modeling approaches [4] generate equations by following domain-specific knowledge (provided by the user) that specifies a set of entities (variables) and processes (interactions among entities).Grammar-based approaches to equation discovery employ user-specified context-free grammars (which can also be based on domain knowledge), deterministic [25] or probabilistic [3], as efficient generators of expressions.
Recently, many symbolic regression approaches based on neural networks have been proposed [14,[26][27][28][29][30].In particular, Deep Symbolic Optimization, DSO approaches symbolic regression (among other optimization tasks [15]) by combining neural networks and reinforcement learning with evolutionary algorithms.The neural networks are used to sample the individuals in the initial population of the evolutionary algorithm and are retrained at each iteration to focus on expressions leading to better fit.It is closely related to our work, since it combines similar methods.Yet our focus here is on efficient neural networks for generating mathematical expressions that are trained before the beginning of the evolutionary process.

Methodology
We start this section by briefly introducing the task of symbolic regression and the search space of mathematical expressions (Section 3.1).After this, we introduce variational autoencoders and the structure of the hierarchical variational autoencoder, HVAE (Section 3.2).We finish the section by explaining how to use HVAE for generating mathematical expressions and how to combine it with an evolutionary algorithm for symbolic regression (Section 3.3).

Symbolic regression and expression trees
Symbolic regression (SR) is the machine learning task of discovering equations in collections of measured data.Symbolic regression methods take a data set S consisting of multiple measurements of a set of real-valued variables V = {x 1 , x 2 , . . ., x p , y}, where y is a designated target variable.The output of SR is an equation of the form y = f (x 1 , x 2 , . . ., x p ), where the right-hand side of the equation is a closed-form mathematical expression.The equation should provide an optimal fit against the measurements from S, i.e., minimize the discrepancy between the observed values of the target variable y and values calculated by using the equation.Symbolic regression methods usually follow the parsimony principle, preferring simpler expressions over more complex ones.
Symbolic regression methods search through the space of candidate mathematical expressions for the right-hand side of the equation to find the one that optimally fits the measurements.Mathematical expressions can be represented in different ways.We commonly use the infix notation, where operators are placed between two sub-expressions they operate on, e.g., A + B, where A and B are sub-expressions.Infix notation uses parentheses to indicate the order in which the operations need to be performed.Prefix (Polish) or postfix (reverse Polish) notations do not need parentheses since the operators are written before or after the two sub-expressions, e.g., +AB or AB+.The three notations correspond to different traversals of the nodes in an expression tree.
The latter is a hierarchical data structure, where the inner nodes correspond to mathematical operators and functions, while the leaf nodes correspond to variables and constants.
In symbolic regression, the constants' values are fitted against the measured data from S, while variables include elements from V without the target variable.We assume binary expression trees since standard arithmetic operators are binary.We take that the second descendant node is null in the inner nodes corresponding to single-argument functions.We define the height of an expression tree as the number of nodes on the longest path from the root node to one of the leaves.Figure 2 depicts an example expression tree with a height of four, along with the corresponding mathematical expression in different notations.
Fig. 2: An expression tree with a height of four and three sequence-based representations of the corresponding mathematical expression.
Our model generates expression trees, as they have several advantages over sequences (strings).Firstly, it is easy to achieve syntactic correctness, since operators and functions are in the inner nodes, while variables and constants are in the leafs.Secondly, information needs to travel at most log n steps up the tree (up to the tree's height) instead of n steps along the sequence (up to the length of the sequence).Lastly, sub-expressions can be encoded independently of each other during the encoding process.

Hierarchical variational autoencoder
In recent years, variational autoencoders [31] have emerged as one of the most popular generative models.The reason for this is that, when trained correctly, variational autoencoders map the observed data with an unknown distribution into a latent representation with a known distribution.This results in a continuous latent space, from which one can sample and synthesize new data.In contrast to a (deterministic) autoencoder, where the encoder outputs a latent representation z that is directly fed into the decoder, the encoder in the variational autoencoder outputs the parameters for an approximate posterior distribution, e.g., µ and σ in the case of a latent space parameterized by a multivariate Gaussian distribution.
Thus, a representation z that is fed into the decoder is sampled from the underlying distribution with the learned parameters (µ, σ).The loss of  the variational autoencoder is the reconstruction error, i.e., the difference between the input to the encoder and the output of the decoder.Additionally, variational autoencoders typically use Kullback-Leibler (KL) divergence [32] as the regularization term for the loss.The loss can thus be calculated as: where J rec (x) is the reconstruction loss of x and λ ≥ 0 the regularization cost parameter.In case the underlying distribution is Gaussian, KL divergence to an isotropic unit Gaussian can be estimated as We use cost annealing [33] to focus on the reconstruction error (i.e., use small values of λ) at the beginning and then gradually shift the focus towards the smoothness of the latent space by increasing the value of λ.

Model overview
Our approach uses a variational autoencoder structure that consists of an encoder and a decoder.The encoder takes tree-structured data as input and outputs a distribution in the latent vector space, represented with the mean (µ z ) and the logarithm of the variance (log σ z ) vectors.The decoder works in the opposite direction, sampling a point from the latent vector space as input and transforming it into a binary expression tree.To make the backward propagation possible, we sample points with the reparametrization trick.
Trees are encoded recursively, starting from leafs and ending at root nodes.To encode a subtree with a root in n, we first encode its left and right subtrees.We then pass their codes, along with the symbol in the node n, as inputs to the encoding cell (further described in Section 3.2.2).This cell outputs the code of the subtree rooted in n.At the beginning of the recursion, in each leaf node, the codes corresponding to the (missing) children are assumed to be vectors of zeros.Once the root of the tree is encoded, its code is passed through two fully connected layers that give the mean and log-variance vectors that form the latent representation of the tree.Figure 3a illustrates the recursive encoding process on the expression x + cos x.
The first layer of the decoder transforms the sampled point from the latent space into the code of the hierarchy.After this, the tree is generated recursively by passing the code of the current node (subtree) through the decoding cell (further described in section 3.2.3).This cell takes the code of the node (subtree) as input and generates a node symbol, along with the codes of the two child nodes.There are three possible symbol types.If we encounter an operator, both child nodes are generated recursively.On the other hand, if the symbol represents a function, we only generate the left child.Lastly, if the symbol is either a variable or a constant, no further child nodes are generated in this branch.This process is shown in Figure 3b, where the expression x + cos x is decoded.
During training, we follow the structure of the encoded tree and try to predict the correct node symbols.In turn, we jointly learn to predict the structure of the expression tree and the symbols inside the node, since the structure is determined by the symbols.We calculate the loss using cross-entropy on a sequence of symbols obtained with the in-order traversal of the expression tree.Some additional implementation details are explained in Appendix D.

Encoder
The encoding proceeds in two phases.The first follows the hierarchy of the input and applies the encoding cell to each node of the hierarchy as described above.In the second phase, the code of the root node is transformed into the mean and log-variance vectors of the input's latent representation.
Encoding comprises a GRU21 cell, which we have adapted from the GRU cell [34].The (output) code h in GRU21 is computed from the input vector x, and codes h l of the left and h r of the right child with the following equations: where φ S denotes the Sigmoid activation function.In these expressions, r, u, and n represent the standard reset gate, update gate and candidate activation vectors from a GRU cell.When compared to the original equations of the GRU cell, Equations ( 3),( 4),( 5) exhibit two differences.First, instead of the code of the previous symbol in the sequence, the concatenation of the codes h l and h r of the child nodes is used (denoted by (h l + + h r )).Second, the dimension of the weight matrices W hr , W hu , W hn must be dim(h l ) + dim(h r ) instead of dim(h).Thus, while Equation ( 6) remains similar to the original one, we change the second term (from its usual form u * h t−1 ) to u 2 * h l + u 2 * h r , to retain information from the codes of the two child nodes.Recall that * denotes the element-wise multiplication of vectors.
In the second phase, the model transforms the code of the root node into the latent representation of the input expression through two fully-connected layers.

Decoder
The decoding also comprises two phases.In the first, a fully-connected layer transforms a point from the latent vector space into the code of the root node.In the second phase, the decoding cell is recursively deployed to decode each of the nodes in the expression tree. Figure 4 depicts the structure of the decoding cell.The cell is composed of a fully connected layer, a softmax layer, and the GRU12 cell, an adaptation of the original GRU cell.The input code is first passed through the sequence of a fullyconnected and a softmax layer.The latter creates the vector of probabilities, from which the most probable output symbol is chosen.If the output symbol is either a constant or a variable, the decoding stops.Otherwise, the output vector is also used as an input to the GRU12 cell, together with the code that is given as input into the decoding cell.The GRU12 cell produces two codes, one for the left and one for the right child.
GRU12 computes the two codes h l and h r for the child nodes using the input vector x and the code h with the following equations: There are two major differences between GRU12 and the original GRU cell.First, the vectors r,u, and n in Equations ( 7),( 8),( 9) are of dimension 2 • dim(h) instead of dim(h).Consequently, all bias vectors are of dimension 2 • dim(h), and all weight matrices have an output dimension of 2 • dim(h).Second, in Equation ( 10), the code h is concatenated with itself to make the dimensions in the equation match.Vector d is then split in half in Equation 11.The first part is used as a code for the left child, while the second is used as a code for the right child.

Generating expressions for symbolic regression
Recall that the goal of symbolic regression is to efficiently search through the space of mathematical expressions and find the one that, when used on the right-hand of an equation, fits given measurements well.In this section, we explain how to use HVAE for generating expressions.

HVAE as a generative model
We can generate expressions in two ways, corresponding to two different symbolic regression scenarios.The first way, which aims at discovering equations from data, samples random vectors from the standardized Gaussian distribution N (0, I) in the latent space and passes them through the decoder.
On the other hand, we might want to generate expressions in a scenario that corresponds to the revision of existing equations to fit newly gathered data.Here, we want to generate mathematical expressions that are similar to the one given as input.Our approach achieves this by encoding an expression and sampling its immediate neighborhood in the latent space.We expect these points to be decoded into expressions similar to the one given as input.We will show that HVAE meets this expectation in Section 4.1.4.

Evolutionary algorithm operators
Finally, we can search the latent space spanned by our model with evolutionary algorithms [35], one of the most commonly used paradigms for symbolic regression.Evolutionary algorithms explore the search space by first randomly sampling individuals for the initial population.Then they repetitively generate new populations by combining pairs of individuals from the current population with the genetic operators of mutation and crossover.
An individual in a population is in our case a real-valued vector z, corresponding to the code of an expression tree in the latent vector space.Using the HVAE model, z can be decoded into an expression tree.To calculate the individual's fit against the training data, we first fit the values of the constant parameters in the decoded expression tree and then measure the error of the equation with the resulting expression on the right-hand side (with respect to the training data).We generate the initial population by randomly sampling individuals from the Gaussian distribution N (0, I).
Crossover combines two individuals, referred to as parents z A and z B , into an offspring z O .We generate the latter as a convex combination of z A and z B , i.e. z O = (1 − a) • z A + a • z B , where a is sampled from the uniform distribution on the interval [0, 1].For values of a close to 0 and 1, the offspring is close to one of the parents, while values of a close to 0.5 lead to an offspring equally dissimilar to both parents.
The mutation operator transforms an individual z into a mutated individual z M .To perform a mutation, we first decode z into an expression tree and immediately encode it back into its latent space representation to obtain the value of σ z .Now, we can mutate z into an individual with a syntactically similar expression by sampling from N (µ z , σ z ) or into a random individual by sampling the offspring z O from N (0, I).Similarly to the case of crossover, we interpolate between these two extremes by sampling from N (a , where a is randomly sampled from the uniform distribution on the interval [0, 1].When a is close to 0, the offspring z O is chosen at random (see the first paragraph of Section 3.3.1).On the other hand, when a is close to 1, z O is syntactically similar to z (second paragraph of Section 3.3.1).
We implement the EDHiE (Equation Discovery with Hierarchical variational autoEncoders) approach for symbolic regression by combining HVAE with evolutionary algorithms using these operators.Our implementation uses pymoo [36] for evolutionary algorithms and ProGED [3] functionality for evaluating the fit of a candidate equation.

Evaluation
In this section, we will investigate the validity of our hypothesis that the hierarchical variational autoencoder is a more efficient generator of mathematical expressions than the alternative VAEs for sequences by conducting two series of computational experiments.In the first series, we are going to evaluate the performance and efficiency of HVAE on the task of generating mathematical expressions.In the second series, we will evaluate the performance of EDHiE on the symbolic regression downstream task.

The performance of HVAE
We start this section by introducing the experimental setup (Section 4.1.1).We continue with reporting the experimental results of evaluating HVAE with respect to the reconstruction error (Section 4.1.2),efficiency in terms of the size of training data needed, the dimensionality of the latent space (Section 4.1.3),and finally the smoothness of the latent space (Section 4.1.4).In Appendix B, we further justify our claim that points close in the latent space of HVAE are decoded into similar expressions.

Experimental setup
Data sets.We estimate the reconstruction error of the variational autoencoders on a collection of six synthetic data sets, ranging from small ones, including simple expressions, to large ones, including complex expressions.The data sets are as follows: AE4-2k, AE5-15k, and AE7-20k have 2, 15, and 20 thousand mathematical expressions with trees with a maximum height of four, five, and seven.These expressions can contain constants, variables, and the operators +,−,•,/, and ˆ.Trig4-2k, Trig5-15k, and Trig7-20k are the same as above, but the expressions also contain the sine and cosine functions.
We create these data sets with the ProGED [3] system by randomly sampling mathematical expressions from a given probabilistic context-free grammar.The generated expressions are simplified using the Python library SymPy [37].The context-free grammars that constrain the output of GVAE and the ones used to generate the data sets are documented in Appendix A.
Parameter setting.We train GVAE and CVAE for 150 epochs with the following values of their hyper-parameters: latent dimension = 128, hidden dimension = 128, batch size = 64, kernel sizes of the convolution layers = 2, 3, 4, and the ADAM optimizer [38].For reconstruction results created with our approach (HVAE), the hyper-parameters are: latent size = 128, batch size = 32, and the ADAM optimizer with the default learning rate.For the first 1,800 iterations i, we calculate the regularization cost parameter λ using λ i = 0.5 • (tanh i−4,500 2 + 1), after this, we set λ i to λ 1,800 .
Estimating the reconstruction error.The Levenshtein distance [39] (often referred to as the edit distance) quantifies the dissimilarity of two strings in terms of the number of insertion, removal, and substitution operations that are needed to transform one string into the other.We use this distance to test how well our autoencoder recreates expressions.
We first pass the expression through the VAE to get the predicted expression.If needed, we validate the syntactical correctness of the latter and transform it into an expression tree.We then traverse the input and the output trees in post-order (left child, right child, node symbol) to obtain the input and the output expressions in the postfix notation (which does not require parentheses and is hence more suited for calculating the distance between expressions).Finally, we calculate the edit distance between those two strings.
To estimate the reconstruction error on unseen expressions, we use five-fold cross-validation with the same splits across all methods.GVAE and CVAE sometimes produce invalid expressions, which we discard from the evaluation.Because of this, the results in Section 4.1.2and 4.1.3might be biased in favor of CVAE due to many syntactically incorrect expressions being discarded.Note that GVAE has fixed-size input (and output) that might be too short for encoding all the grammar rules needed to derive an expression.In those cases, GVAE returns empty strings, which we consider invalid expressions.CVAE, on the other hand, produces syntactically incorrect expressions such as xc(/x)c) sin sin(c)), • • x − c • / sin(x)), or (/x(−x)c) (presented here in infix notation).

Out-of-sample reconstruction error
Table 1 compares the out-of-sample reconstruction error and the ratio of invalid expressions for the three variational autoencoders.Our hierarchical VAE significantly outperforms the other two methods on all data sets.An interesting observation is that GVAE works consistently better on expressions involving trigonometric functions, while HVAE and CVAE perform worse.The reason for the opposite effect is probably the following: for GVAE, functions only represent yet another production rule in the grammar, while for HVAE and CVAE they drastically change the structure of the expression (tree).This translates to better performance of GVAE, as expressions with trigonometric functions are usually shorter, given that the nodes corresponding to the trigonometric functions have only one descendant instead of the usual two.

Table 1:
The out-of-sample reconstruction error and the percentages of syntactically incorrect expressions generated by the three variational autoencoders.The percentages of invalid expressions generated by the approaches show that our approach always produces syntactically correct expressions, while GVAE and CVAE sometimes fail to produce valid outputs.The fraction of such expressions is quite small when the GVAE approach is used (see the explanation above) but quite significant when CVAE is used.Lastly, we can notice that, as expected, longer expressions are harder to recreate and thus have higher edit distance and a higher percentage of invalid expressions than shorter ones, provided enough training data is used.

Training efficiency and the latent space dimensionality
We proceed to test our conjectures about the efficiency of training the generators of mathematical expressions.We expect that HVAE would require less training data and a lower dimensionality of the latent space to achieve the same levels of reconstruction error in comparison to other approaches.The latter is especially important because of the exploration of the latent space, which is more efficient in low-dimensional latent spaces.Figure 5b shows the impact of the dimensionality of the latent space on the reconstruction error across different VAEs.In line with the previous results, HVAE significantly outperforms both CVAE and GVAE.HVAE with latent space of dimension 16 performs on par or better than GVAE and CVAE with latent spaces of 256 dimensions.We can see that the reconstruction error quickly raises when the latent space dimension is less than 32, but otherwise, the reconstruction error is consistently low.Even with a latent space size of 16, our approach is still comparable to the other two methods with a latent space of dimension 256.This allows us to reduce the dimensionality of the latent space by two orders of magnitude, which makes HVAE an excellent candidate for generating expressions for symbolic regression.
The reason for the superior efficiency of HVAE is the use of expression trees, as subexpressions are always encoded into the same code, regardless of their position in the expression.This significantly reduces the space of possible codes and allows for training the model in a way that better generalizes to the repetitive subexpressions (subpatterns) it encounters.

Latent space smoothness
Finally, we expect the latent space of HVAE to be smooth in the sense that samples close to the latent representation of the input expression are decoded into expressions similar to the one given as input.We investigate the validity of this conjecture by applying linear interpolation (performing a homotopic transformation) between two expressions in the latent space.Assume that we are given two expressions, A and B. Using the model, we encode them into their latent representations z A and z B .We can then generate new latent representations z α by combining the two representations with the formula z α = (1 − α) • z A + α • z B , where α ∈ {i/n : i ∈ N ∧ i ≤ n}.Decoding the latent representations z α in a smooth latent space should produce intermediate expressions that represent a smooth transition from A to B in n steps.
Table 2: Examples of linear interpolation between two expressions in the latent spaces of the three VAEs.Expressions that are colored red are syntactically incorrect.Here, we set n = 4 and α = i/4, 0 ≤ i ≤ 4.
Table 2 shows the results of such a linear interpolation in the latent spaces of the different VAEs.HVAE and GVAE produce continuous latent spaces where the transition from expression A to expression B is indeed smooth.CVAE also produces a smooth transition, but some of the intermediate expressions might be syntactically incorrect.The second interpolation in the lower part of the table is an example of a smooth transition in the HVAE latent space.We can see that at each step only a few expression symbols change and that these changes are rarely redundant.Appendix B provides further examples of interpolations with visualizations of the expression trees.

Evaluating EDHiE
In the second series of experiments, we evaluate the performance of EDHiE.We start the section by introducing the experimental setup (Section 4.2.1).We then report on the impact of the dimensionality of the latent space on the performance of symbolic regression (Section 4.2.2).Furthermore, we compare the performance of EDHiE with that of other methods for symbolic regression on the Nguyen benchmark (Section 4.2.3) and report the performance of EDHiE on the Feynman benchmark (Section 4.2.4).

Experimental setup
Data sets.The Nguyen [13] benchmark contains eight equations with one nontarget variable and two equations with two non-target variables.The right-hand sides of these equations are shown in the second column of Table 3.We generate two data sets (train and test) with five thousand simulated measurements for each equation.We use the train set to select the best expressions and the test set to evaluate their performance with the metrics described below.We sample points from the interval [−20, 20] for equations 1-6, the interval [0, 40] for equation 7, [0, 80] for equation 8, and [0, 20] for equations 9 and 10.
We further evaluate our approach on the 16 equations involving up to two variables from the Feynman benchmark [14].The right-hand sides of these equations are shown on in the last column of Table 3.Because equations in the Feynman benchmark represent real-world equations, each of the equations FM-3, FM-4, FM-5, and FM-7 contains two entries.Each entry comes with its own variables and data sets of measurements.
Table 3: Expressions from the Nguyen (first two columns on the left-hand side) and Feynman (last three columns on the right-hand side) benchmarks.

ID Expression ID ID-Feynman Expression
NG-1 Evaluation process.We compare the performance of EDHiE on the Nguyen benchmark equations to the performance of three other symbolic regression systems.ProGED [3] uses probabilistic grammars as generators of mathematical expressions.DSO [30] combines deep neural networks with evolutionary algorithms.PySR [24] employs evolutionary optimization with operators directly applied to the expression trees.We run each system ten times on each equation and evaluate at most 100,000 sampled expressions.All approaches use the same library of tokens and/or grammars, further described in Appendix A. When running PySR, we allow fitting the values of the constant parameters since it can not be turned off in the implementation.The dimensionality of the latent space of HVAE is set to 32; the ADAM optimizer uses the default learning rate.We elaborate on the setting of the dimensionality of the latent space in the next section.Appendix C gives the complete report on the experiments in latent spaces with varying dimensions.
Estimating the performance.We use three metrics: the number of successful reconstructions, i.e., runs leading to an equation equivalent to the original one; the mean R 2 of the best equation; and the number of expressions sampled to achieve reconstruction.We consider a run successful if we find an expression where the RMSE between the target and predicted values is lower than 10 −10 .To guarantee accurate reporting, we manually check if the original and expression with the lowest RMSE are equivalent.In each run, we use the expression with the lowest RMSE to calculate the bounded R 2 metric on the test set using the formula where ŷi denotes the predicted value of the target variable (calculated by using the equation), y i is the measured value of the target variable, and y is the mean value of y in the training data set.Lastly, we show the average number (across the ten runs) of unique expressions considered before reconstructing the original equation.To this end, we count the unique expressions that the symbolic regression system has considered before the reconstructed equation is encountered in the generation process for all the methods that report this.

The impact of the dimensionality of the latent space
Let us start with a series of computational experiments exploring the latent space for encoding mathematical expressions with random sampling.Here, we perform symbolic regression by taking randomly sampled points in the latent space and decoding them into expressions that are then evaluated on the measurements/data.The expression that fits the data best is selected as the candidate for the discovered equation.Table 4 shows the number of successful runs of the random sampling approaches based on the three VAEs, CVAE, GVAE, and HVAE.In the further discussion of results, we use the name HVAR for HVAE with random sampling.We can see here a typical example of the curse of dimensionality at work.When the symbolic regression algorithm explores high-dimensional latent spaces, it can easily slip into parts of those spaces that do not lead to optimal equations.Table 4: The performance of symbolic regression (number of successful reconstructions) by randomly sampling with CVAE, GVAE, and HVAE on the Nguyen benchmark.Table 5: Comparison of the performance of symbolic regression (number of successful reconstructions, R 2 , and number of evaluated equations) with random sampling on the Nguyen benchmark.We compare sampling from a manuallycrafted probabilistic grammar (ProGED) with sampling using a trained HVAE (HVAR).This shows that the ability of HVAE to encode mathematical expressions in lowdimensional latent spaces is crucial for the performance of symbolic regression with HVAR.
Based on the results of the experiments in Table 4 and Appendix C, in the remainder of this section, we use 32-dimensional latent space for EDHiE.

Comparison on the Nguyen equations
In the next series of experiments, we compare the performance of HVAR, the random sampling method using HVAE, to the one of ProGED-the latter samples mathematical expressions using manually crafted probabilistic grammar.Table 5 reports the results of the comparison.The results show that the generator used within HVAE is not worse than the probabilistic grammar.To our surprise, HVAR outperforms ProGED significantly.First, it successfully reconstructs five (of the ten) equations from the Nguyen benchmark in ten runs, one more than ProGED.Second, for the three equations of NG-2, NG-6, and NG-9, the reconstruction is achieved faster, i.e., by evaluating fewer candidate expressions.
Furthermore, we check the contribution of the evolutionary approach in EDHiE over the random sampling method HVAR.To this end, we compare the last three columns of Table 5 with the last three columns of Table 6.EDHiE successfully reconstructs all ten equations from the Nguyen benchmarks in at least one of the ten runs.In three cases, the equations are reconstructed in every run.Note also that the successful reconstructions with EDHiE require fewer evaluations of candidate equations than the random sampling approaches.
Table 6 compares EDHiE with PySR, which uses evolutionary operators on expression trees directly (i.e., without embedding them into a latent space), and DSO, that similarly to our approach, combines deep learning with evolutionary optimization.Overall, EDHiE performs better than the other two methods across all metrics1 : it achieves the highest total number of successful reconstructions.EDHiE has more successful reconstructions for five equations than PySR and less for a single equation, NG-9.The superior performance of EDHiE relative to PySR indicates that evolutionary optimization is more efficient in the latent space than in the space of expression trees.For four equations, EDHiE achieves successful reconstruction more often than DSO.
In the two instances of reconstructing NG-3 and NG-4, DSO achieves success twice as often as our method.Finally, note that the experiments on the Nguyen benchmark were performed on noise-free synthetic data.The results of the experiments on synthetic data with added noise, reported in Appendix C.3, show that EDHiE is robust to noise: The increasing noise level has little effect on the reconstruction success rate while significantly increasing the rank of the successfully reconstructed equation in the list of evaluated equations, sorted with respect to increasing RMSE.Appendix C also includes additional results on the Nguyen benchmark by random sampling of CVAE, GVAE, and HVAE latent space with varying number of dimensions.

Results on the Feynman equations
In this section, we evaluate the ability of EDHiE to reconstruct real equations from the domain of physics included in the Feynman database.Table 7 presents the results of symbolic regression on a subset of 16 equations from the database with up to two non-target variables.EDHiE successfully reconstructs 13 equations in all the runs.Most of these equations are simple; thus, EDHiE explores small search spaces comprising less than two hundred evaluated expressions.A more complex equation FM-10 is reconstructed in five out of ten runs exploring more than 20 thousand expressions on average.The equation FM-6 could not be reconstructed in any of the runs since it includes the function arcsin that has not been included in our token library.Finally, EDHiE fails to reconstruct the most complex equation FM-2.

Discussion and conclusion
We introduce a novel variational autoencoder for hierarchies, HVAE, that can be efficiently trained to generate valid mathematical expressions represented as expression trees.Compared to generators based on variational autoencoders for sequences, HVAE has three significant advantages.First, it consistently generates valid expressions.Second, its performance is robust even for small training sets: HVAE trained from only two thousand expressions achieves much lower reconstruction error than sequential VAEs trained from 12 thousand examples.Third, the HVAE operating in 32-dimensional latent space has a lower reconstruction error than sequential VAEs with comparable latent spaces.The ability of HVAE to encode mathematical expressions in a lowdimensional latent space makes it an excellent proxy for exploring the search space of candidate expressions in symbolic regression.Indeed, when performing a random search through the latent space, we achieve comparable performance with a random search through the space of candidate expressions defined by a manually crafted probabilistic grammar.EDHiE, a symbolic regression system that performs evolutionary optimization in the latent space of the HVAE, significantly outperforms methods based on random search and achieves performance comparable to the state-of-the-art symbolic regression system DSO based on a similar combination of evolutionary algorithms and deep learning.The comparison of EDHiE with PySR, a genetic programming approach operating on expression trees directly, shows the benefit of performing evolutionary optimization in the latent space.
HVAE has been used here for symbolic regression, but its potential to efficiently generate and encode hierarchies makes it useful in many different contexts, e.g., generating molecular structures or more general symbolic expressions.Analysis of its performance in these application domains is a promising direction for further research.Moreover, the ability of HVAE to learn from small corpora of expressions might prove helpful in retraining the generator after each generation of the evolutionary search, much like the iterative learning in DSO.This will narrow its focus to generating better expressions, leading to more accurate equations.In general, training the generator on expressions involved in mathematical models that have proved useful in a domain of interest will enable seamless integration and transfer of background knowledge in symbolic regression.While we do not explicitly have the power operator in the grammars to be used during the generation of the data sets, exponentiation (and thus the power operator) can occur during the simplification of the generated expressions.Because of this, expressions in the data sets also contain the power operator.
In addition, GVAE also needs grammars for generating valid expressions.When applied to the data sets with a name prefix of AE, GVAE uses the following grammar: For the data sets with the name prefix of Trig, GVAE uses the same grammar as above with different productions for the non-terminal symbol T : CVAE uses the token library {+, −, •, /,ˆ, x, c, (, ), "} for data sets with the name prefix AE, and {+, −, •, /,ˆ, sin, cos, x, c, (, ), "} for data sets with the name prefix Trig.
For experiments on the Nguyen benchmark, we use the grammar: for ProGED, GVAE, and to generate training examples for HVAE.HVAE and DSO use the token library {x, +, −, •, /,ˆ2,ˆ3,ˆ4,ˆ5, sin, cos, exp, log, sqrt}, while CVAE uses token {(, ), "} in addition to the tokens used by HVAE and DSO.For expressions with two non-target variables, we add token y and change the non-terminal symbol V to:

C.1 Dimensions
Table C2 shows the results of evaluation using HVAE with different dimensions of the latent vector space.Here new expressions are generated by randomly sampling points from the standardized Gaussian distribution and decoding them.We can see overall, models, where the dimension of the latent vector space is either 16 or 32, perform the best.HVAE 32 produces expressions that usually have a slightly higher mean R 2 , while HVAE 16 usually needs to evaluate less unique expressions to generate the desired one.Because we prefer the number of successful runs and the mean R 2 metrics, we select the model with the latent vector space dimension 32 for experiments where the HVAE model is coupled together with evolutionary algorithms.

C.2 Learning curves
Optimization algorithms continually enhance solutions over time by iteratively exploring the input space to minimize an objective function or maximize performance.The learning curve serves as a valuable measure for evaluating algorithmic performance, illustrating how the chosen metric evolves as optimization progresses.Steep improvements in the learning curve indicate rapid convergence towards better solutions, while plateaus or slow convergence suggest challenges in finding superior solutions.By analyzing the learning curve, we can be gain an insight into the algorithm's effectiveness, convergence, stability, and potential for further improvement.
Figure C2 shows the learning curves of HVAR, ProGED, and EDHiE on six equations from the Nguyen benchmark.We can see that overall EDHiE performs the best as it achieves the highest R 2 score and needs to test the least expression to do so.This is not true on equation NG-8, where the other two approaches find the desired expressions quicker.This happens because all approaches find the desired expression quickly and the evolution part of EDHiE does not yet come into effect.

C.3 Robustness to noise
In practical scenarios, working with noisy data is common, making it crucial for symbolic regression approaches to perform well in the presence of noise.The performance of EDHiE on noisy data is demonstrated in Table C3.To generate noisy data sets, we sample values ϵ from a Gaussian distribution N (0, I) and add them to the target values y using the formula ỹ = y • (1 + ηϵ), where η represents the noise level.
To evaluate our approach on noisy data, we employ two metrics: the number of successful runs and the mean rank.We execute our approach on noisy data, rank all the generated expressions based on their RMSE, and evaluate these expressions using noiseless data.A run is considered successful if we find an expression that achieves an RMSE below 10 −10 on noiseless data.In the case of a successful run, we record the rank of the first expression with an RMSE below 10 −10 and use it to calculate the mean rank.The results demonstrate that the number of successful runs remains relatively consistent across different noise levelsm indicating the robustness of our approach.Additionally, we can see that the mean rank increases as the amount of noise rises.This outcome is expected, as higher noise levels allow more expressions to overfit the noisy data.In practical applications, expressions that overfit can be eliminated by assigning complexity scores to each expression and selecting less complex expressions from the Pareto front.

C.4 Performance of CVAE and GVAE on symbolic regression
Table C4 shows the results of the CVAE baseline.Models presented in the table use the same parameters as they do in Section 4.1.1apart from the dimension of the latent vector space.We can see that this baseline performs very poorly, as it finds only the two simplest equations.The main reason for this is the high number of invalid expressions (more than 96.8%) the baseline produces.When training the model, we restrict the output tree's structure to match the input structure.Specifically, we utilize the input batched node and incorporate a prediction matrix into each node.We then calculate the reconstruction error using the target and prediction sequences obtained through an in-order traversal of the batched node.While computing the cross-entropy loss, we apply a masking technique to exclude nodes that do not occur in an expression tree from the loss calculation, effectively removing them from the training process.

Fig. 3 :
Fig.3: The processes of (a) encoding and (b) decoding the expression tree of x + cos x.The acronyms EC and DC stand for "encoding cell" (introduced in Section 3.2.2) and "decoding cell" (introduced in Section 3.2.3).

Fig. 4 :
Fig. 4: The structure of the decoding cell.

Fig. 5 :
Fig. 5: The impact of the (a) training data set size and (b) dimensionality of the latent space on the reconstruction error of the three autoencoders.

Figure
Figure 5a depicts the impact of the number of expressions in the training set on the reconstruction error for the three different generative models.Again, HVAE significantly outperforms the other two VAEs.Its reconstruction error is estimated to be consistently lower than 0.25, even when trained on 2 thousand examples only.This error is an order of magnitude lower than the lowest error of 1.5 achieved by the second best model, GVAE, when trained on the whole data set of 12 thousand examples.Figure5bshows the impact of the dimensionality of the latent space on the reconstruction error across different VAEs.In line with the previous results, HVAE significantly outperforms both CVAE and GVAE.HVAE with latent space of dimension 16 performs on par or better than GVAE and CVAE with latent spaces of 256 dimensions.We can see that the reconstruction error quickly raises when the latent space dimension is less than 32, but otherwise, the reconstruction error is consistently low.Even with a latent space size of 16, our approach is still comparable to the other two methods with a latent space of dimension 256.This allows us to reduce the dimensionality of the latent space by two orders of magnitude, which makes HVAE an excellent candidate for generating expressions for symbolic regression.The reason for the superior efficiency of HVAE is the use of expression trees, as subexpressions are always encoded into the same code, regardless of their position in the expression.This significantly reduces the space of possible codes and allows for training the model in a way that better generalizes to the repetitive subexpressions (subpatterns) it encounters.
Linear interpolation of examples in the HVAE latent space.The first row shows examples from the AE4-2k data set, the second examples from the AE7-20k data set, and the third from the Trig5-15k data set.Here n = 4 and α = i 4 , 0 ≤ i ≤ 4.

Fig. C2 :
Fig. C2: Learning curves for HVAR, ProGED, and EDHiE on the selected equations from the Nguyen benchmark.Curves for the equations NG-2, NG-3, NG-4, and NG-7 are omitted since they resemble the curve for the NG-1 equation.

Table 6 :
Comparison of the performance of the symbolic regression systems EDHiE, DSO, and PySR on the Nguyen benchmark.

Table 7 :
Results of EDHiE on the 16 equations from the Feynman database that include at most two non-target variables.

Table C2 :
The performance of HVAR (number of successful reconstructions, R 2 , and number of evaluated equations) with varying number of latent dimensions on the Nguyen benchmark.

Table C3 :
The performance of EDHiE with varying level of noise added to synthetic data from the Nguyen benchmark.

Table C4 :
Results of symbolic regression by random sampling of the CVAE latent space with varying number of dimensions on the Nguyen benchark.