1 Introduction

Neural networks on trees have made significant progress in recent years, achieving unprecedented performance with new models such as tree echo state networks (Gallicchio & Micheli, 2013), tree LSTMs (Tai et al., 2015), code2vec (Alon et al., 2019), or models from the graph neural network family (Kipf & Welling, 2017; Micheli, 2009; Scarselli et al., 2009). However, these achievements are mostly limited to tasks with numeric output, such as classification and regression. By contrast, much less research has focused on tasks that require trees as output, such as molecular design (Kusner et al., 2017; Jin et al., 2018) or hint generation in intelligent tutoring systems (Paaßen et al., 2018). For such tasks, autoencoder models are particularly attractive because they support both encoding and decoding, thus enabling tree-to-tree tasks (Kusner et al., 2017). In this paper, we propose a novel autoencoder for trees, which is the first model to combine grammar information, recursive processing, and deep learning. Hence, we name it recursive tree grammar autoencoder (RTG-AE).

Our core claim in this paper is that combining these three features—recursive processing, grammar knowledge, and deep learning—performs better than combining only two of the three (refer to Fig. 1). Conveniently, baseline models already exist in the literature which combine two of the features but not the third. In particular, the grammar variational autoencoder (Kusner et al., 2017, GVAE) represents strings as sequences of context-free grammar rules and is trained via deep learning, but does not use recursive processing. Instead, it encodes the rule sequence via a series of 1D convolutions and a fully connected layer, and it decodes a vector back to a rule sequence via a multi-layer gated recurrent unit (Cho et al., 2014, GRU). We believe that this sequential scheme makes sense for string data but becomes a limitation for trees because sequential processing introduces long-term dependencies that do not exist in the tree. For example, when processing the tree \(\wedge (x, \lnot (y))\), a sequential representation would be \(y, \lnot , x, \wedge\). When processing \(\wedge\), we want to take the information from its children x and \(\lnot\) into account, but \(\lnot\) is already two steps away (refer to Fig. 1a). If we replace x with a large subtree, this distance can become arbitrarily large.

Recursive neural networks (Tai et al., 2015; Pollack, 1990; Sperduti & Starita, 1997) avoid this problem. They compute the representation of a parent node based on the representations of all its children, meaning that the information flow follows the tree structure and the length of dependencies is bounded by the depth of the tree. There also exist autoencoding models in the recursive neural network tradition, such as the model of Pollack (1990) or the directed acyclic graph variational autoencoder (M. Zhang, Jiang, Cui, Garnett, & Chen, 2019, D-VAE). Unfortunately, the autoencoding capability of recursive networks is limited due to the enormous number of trees one could potentially decode from a vector. More precisely, Hammer (2002) showed that one needs exponentially many neurons to represent all trees of a certain depth with a recursive network. We believe that grammars help to avoid this limitation because the set of grammatical trees is usually much smaller than the set of possible trees over an alphabet. For example, the tree \(\wedge (x, \lnot (y))\) represents a valid Boolean expression, whereas the tree \(\wedge (x, \lnot , y)\) does not (refer to Fig. 1b). Without a grammar as inductive bias, models like D-VAE need to learn to avoid trees such as \(\wedge (x, \lnot , y)\). D-VAE serves as our baseline model which uses recursive processing and deep learning but no grammar knowledge.

Finally, in prior research we proposed tree echo state autoencoders (Paaßen, Koprinska, & Yacef, 2020, TES-AE), a recursive, grammar-based autoencoder for trees which does not use deep learning. Instead, this model randomly initializes its network parameters and only trains the final layer, which chooses the next grammar rule for decoding. This shallow learning scheme follows the (tree) echo state network paradigm, which claims that a sufficiently large, randomly wired network is expressive enough to represent any input (Gallicchio & Micheli, 2013). However, a fixed representation may need more dimensions than one that adjusts to the data. Consider again our example of Boolean formulae. Let's code x as 0, y as 1, \(\lnot\) as 2, and \(\wedge\) as 3. We can then encode trees as sequences of these numbers, padding with zeros (i.e. x) wherever needed. In two dimensions, we can thus represent all trees with up to three nodes. In particular, (0, 0) corresponds to x, (1, 0) to y, (2, 0) to \(\lnot (x)\), (3, 0) to \(\wedge (x, x)\), (2, 1) to \(\lnot (y)\), and (3, 1) to \(\wedge (y, x)\). However, we can also adapt our encoding by using the first dimension to encode the x variable and the second dimension to encode the y variable, such that (1, 0) decodes to x, (0, 1) to y, (1, 1) to \(\wedge (x, y)\), and \((1,-1)\) to \(\wedge (x, \lnot (y))\) (refer to Fig. 1c). Adjusting the encoding to the data enables us to represent larger trees with the same number of dimensions and to better take the semantics of the domain into account. Further, learning enables us to enforce smoothness in the coding space, which may be helpful for downstream tasks.

Fig. 1: An illustration of the advantages (a) of recursive over sequential processing, (b) of utilizing grammatical knowledge, and (c) of learning the encoding end-to-end. In (c), each point represents the encoding of a tree and color indicates some semantic attribute with respect to which the encoding space should be smooth (right)

The key contributions of our work are:

  • We develop a novel autoencoder for trees which is the first to combine recursive processing, grammar knowledge, and deep learning, whereas prior models combined only two of the three. We call our model RTG-AE.

  • We provide a correctness proof for the encoding scheme of RTG-AE.

  • Experimentally, we compare RTG-AE to models which combine two of the three features, namely GVAE, which combines grammar knowledge and deep learning, but not recursive processing; D-VAE, which combines recursive processing and deep learning, but not grammar knowledge; and TES-AE, which combines recursive processing and grammar knowledge, but not deep learning. We observe that RTG-AE has the lowest autoencoding error and runtime—except for TES-AE, which has lower autoencoding error on the smallest dataset and is always faster because it does not use deep learning.

  • In a further experiment, we evaluate the capability of a CMA-ES optimizer to find an optimal tree in the encoding space of each model. We find that RTG-AE yields the best median scores.

We begin by discussing background and related work before we introduce the RTG-AE architecture and evaluate it on four datasets, including two synthetic and two real-world ones.

2 Background and related work

Our contribution relies on substantial prior work, both from theoretical computer science and machine learning. We begin by introducing our formal notion of trees and tree grammars, after which we continue with neural networks for tree representations.

2.1 Regular tree grammars

Let \(\Sigma\) be some finite alphabet of symbols. We recursively define a tree \(\hat{x}\) over \(\Sigma\) as an expression of the form \(\hat{x} = x(\hat{y}_1, \ldots , \hat{y}_k)\), where \(x \in \Sigma\) and where \(\hat{y}_1, \ldots , \hat{y}_k\) is a list of trees over \(\Sigma\). We call \(\hat{x}\) a leaf if \(k = 0\), otherwise we call \(\hat{y}_1, \ldots , \hat{y}_k\) the children of \(\hat{x}\). We define the size \(|\hat{x}|\) of a tree \(\hat{x} = x(\hat{y}_1, \ldots , \hat{y}_k)\) as \(1 + \vert \hat{y}_1\vert + \ldots + \vert \hat{y}_k\vert\).

Next, we define a regular tree grammar (RTG) (Brainerd, 1969; Comon et al., 2008) \({\mathcal {G}}\) as a 4-tuple \({\mathcal {G}} = (\Phi , \Sigma , R, S)\), where \(\Phi\) is a finite set of nonterminal symbols, \(\Sigma\) is a finite alphabet as before, \(S \in \Phi\) is a special nonterminal symbol which we call the starting symbol, and R is a finite set of production rules of the form \(A \rightarrow x(B_1, \ldots , B_k)\) where \(A, B_1, \ldots , B_k \in \Phi\), \(k \in {\mathbb {N}}_0\), and \(x \in \Sigma\). We say a sequence of rules \(r_1, \ldots , r_T \in R\) generates a tree \(\hat{x}\) from some nonterminal \(A \in \Phi\) if applying all rules to A yields \(\hat{x}\), as specified in Algorithm 1. We define the regular tree language \({\mathcal {L}}({\mathcal {G}})\) of grammar \({\mathcal {G}}\) as the set of all trees that can be generated from S via some (finite) rule sequence over R.

Algorithm 1 (generating a tree from a rule sequence)
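To make the generation procedure concrete, the following Python sketch applies a rule sequence in pre-order (depth-first, left-to-right) to expand a nonterminal into a tree. It is a minimal illustration under this pre-order assumption, not the authors' released implementation; the `Tree` and `Rule` classes are our own hypothetical data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tree:
    symbol: str                      # terminal symbol x
    children: List["Tree"] = field(default_factory=list)

@dataclass
class Rule:
    lhs: str                         # nonterminal A on the left-hand side
    symbol: str                      # terminal symbol x
    rhs: List[str]                   # child nonterminals B_1, ..., B_k

def generate(rule_seq: List[Rule], start: str = "S") -> Tree:
    """Expand `start` into a tree by consuming the rule sequence in pre-order."""
    it = iter(rule_seq)

    def expand(A: str) -> Tree:
        r = next(it)                 # next rule in the sequence
        if r.lhs != A:               # the rule must expand the current nonterminal
            raise ValueError(f"rule {r} does not expand nonterminal {A}")
        return Tree(r.symbol, [expand(B) for B in r.rhs])

    tree = expand(start)
    if next(it, None) is not None:   # all rules must be consumed
        raise ValueError("rule sequence is longer than needed")
    return tree

# Example: generate the Boolean tree and(x, not(y)) from a toy grammar
# with a single nonterminal S.
seq = [Rule("S", "and", ["S", "S"]), Rule("S", "x", []),
       Rule("S", "not", ["S"]), Rule("S", "y", [])]
tree = generate(seq)
```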

The inverse of generation is called parsing. In our case, we rely on the bottom-up parsing approach of Comon et al. (2008), as shown in Algorithm 2. For the input tree \(\hat{x} = x(\hat{y}_1, \ldots , \hat{y}_k)\), we first parse all children, yielding a nonterminal \(B_j\) and a rule sequence \({\bar{r}}_j\) that generates child \(\hat{y}_j\) from \(B_j\). Then, we search the rule set R for a rule of the form \(r = A \rightarrow x(B_1, \ldots , B_k)\) for some nonterminal A, and finally return the nonterminal A as well as the rule sequence \(r, {\bar{r}}_1, \ldots , {\bar{r}}_k\), where the commas denote concatenation. If we don’t find a matching rule, the process fails. Conversely, if the algorithm returns successfully, this implies that the rule sequence \(r, {\bar{r}}_1, \ldots , {\bar{r}}_k\) generates \(\hat{x}\) from A. Accordingly, if \(A = S\), then \(\hat{x} \in {\mathcal {L}}({\mathcal {G}})\).

Algorithm 2 (bottom-up parsing)
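Continuing the sketch above (and reusing the hypothetical `Tree` and `Rule` classes), a bottom-up parser in the spirit of Algorithm 2 could look as follows. Assuming the deterministic grammars discussed below, rules can simply be indexed by their right-hand side.

```python
from typing import Dict, List, Tuple

# Index a grammar by right-hand side: (terminal symbol, child nonterminals) -> rule.
RuleIndex = Dict[Tuple[str, Tuple[str, ...]], Rule]

def parse(tree: Tree, rules_by_rhs: RuleIndex) -> Tuple[str, List[Rule]]:
    """Return the nonterminal that generates `tree` and the pre-order rule sequence."""
    results = [parse(child, rules_by_rhs) for child in tree.children]
    key = (tree.symbol, tuple(B for B, _ in results))
    if key not in rules_by_rhs:      # no rule with this right-hand side: parsing fails
        raise ValueError(f"no rule generates {tree.symbol} from {key[1]}")
    r = rules_by_rhs[key]
    rule_seq = [r]
    for _, child_seq in results:
        rule_seq += child_seq        # concatenate r, r_bar_1, ..., r_bar_k
    return r.lhs, rule_seq

# The returned sequence is exactly the pre-order sequence consumed by `generate`.
rules_by_rhs = {(r.symbol, tuple(r.rhs)): r for r in seq}
A, rule_seq = parse(tree, rules_by_rhs)
assert A == "S" and rule_seq == seq  # the tree is in L(G) iff the start symbol is returned
```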

Algorithm 2 can be ambiguous if multiple nonterminals A exist such that \(A \rightarrow x(B_1, \ldots , B_k) \in R\) in line 5. To avoid such ambiguities, we impose that our regular tree grammars are deterministic, i.e. no two grammar rules have the same right-hand-side. This is sufficient to ensure that any tree corresponds to a unique rule sequence.

Theorem 1

(mentioned on page 24 of Comon et al. (2008)) Let \({\mathcal {G}} = (\Phi , \Sigma , R, S)\) be a deterministic regular tree grammar. Then, for any \(\hat{x} \in {\mathcal {L}}({\mathcal {G}})\) there exists exactly one sequence of rules \(r_1, \ldots , r_T \in R\) which generates \(\hat{x}\) from S.

Proof

Refer to Appendix A.1. \(\square\)

This does not restrict expressiveness, as any regular tree grammar can be transformed into an equivalent, deterministic one.

Theorem 2

(Theorem 1.1.9 by Comon et al. (2008)) Let \({\mathcal {G}} = (\Phi , \Sigma , R, S)\) be a regular tree grammar. Then, there exists a regular tree grammar \({\mathcal {G}}' = (\Phi ', \Sigma , R', {\mathcal {S}})\) with a set of starting symbols \({\mathcal {S}}\) such that \({\mathcal {G}}'\) is deterministic and \({\mathcal {L}}({\mathcal {G}}) = {\mathcal {L}}({\mathcal {G}}')\).

Proof

Refer to Appendix A.2. \(\square\)

It is often convenient to permit two further concepts in a regular tree grammar, namely optional and starred nonterminals. In particular, the notation B? denotes a nonterminal with the production rules \(B? \rightarrow B\) and \(B? \rightarrow \varepsilon\), where \(\varepsilon\) is the empty word. Similarly, \(B^*\) denotes a nonterminal with the production rules \(B^* \rightarrow B, B^*\) and \(B^* \rightarrow \varepsilon\). To maintain determinism, one must ensure two conditions: First, if a rule generates two adjacent nonterminals that are starred or optional, then these nonterminals must be different, so \(A \rightarrow x(B^*, C^*)\) is permitted but \(A \rightarrow x(B^*, B?)\) is not, because we would not know whether to assign an element to \(B^*\) or B?. Second, the languages generated by any two right-hand-sides for the same nonterminal must be disjoint. For example, if the rule \(A \rightarrow x(B^*, C)\) exists, then the rule \(A \rightarrow x(C?, D^*)\) is not allowed, because the right-hand-side x(C) could be generated by either of them (refer to Appendix A.2 for more details). In the remainder of this paper, we generally assume that we deal with deterministic regular tree grammars that may contain starred and optional nonterminals.

2.2 Tree encoding

We define a tree encoder for a regular tree grammar \({\mathcal {G}}\) as a mapping \(\phi : {\mathcal {L}}({\mathcal {G}}) \rightarrow {\mathbb {R}}^n\) for some encoding dimensionality \(n \in {\mathbb {N}}\). While fixed tree encodings do exist, e.g. in the form of tree kernels (Aiolli et al., 2015; Collins & Duffy, 2002), we focus here on learned encodings via deep neural networks. A simple tree encoding scheme is to list all nodes of a tree in depth-first-search order and encode this list via a recurrent or convolutional neural network (Paaßen et al., 2020). However, one can also encode the tree structure more directly via recursive neural networks (Gallicchio & Micheli, 2013; Tai et al., 2015; Pollack, 1990; Sperduti & Starita, 1997; Sperduti, 1994). Generally speaking, a recursive neural network consists of a set of mappings \(f^x : {\mathcal {P}}({\mathbb {R}}^n) \rightarrow {\mathbb {R}}^n\), one for each symbol \(x \in \Sigma\), which receive a (possibly ordered) set of child encodings as input and map it to a parent encoding. Based on such mappings, we define the overall tree encoder recursively as

$$\begin{aligned} \phi \big ( x(\hat{y}_1, \ldots , \hat{y}_k) \big ) := f^x\big (\{ \phi (\hat{y}_1), \ldots , \phi (\hat{y}_k) \}\big ). \end{aligned}$$
(1)

Traditional recursive neural networks implement \(f^x\) with single- or multi-layer perceptrons. More recently, recurrent neural networks have been applied, such as echo state nets (Gallicchio & Micheli, 2013) or LSTMs (Tai et al., 2015). In this work, we extend the encoding scheme by defining the mappings \(f^x\) not over terminal symbols x but over grammar rules r, thereby tying the encoding closely to parsing. This circumvents a typical problem in recursive neural nets, namely handling the order and number of children (Sperduti & Starita, 1997).

Recursive neural networks can also be related to more general graph neural networks (Kipf & Welling, 2017; Micheli, 2009; Scarselli et al., 2009). In particular, we can interpret a recursive neural network as a graph neural network which transmits messages from child nodes to parent nodes until the root is reached. Thanks to the acyclic nature of trees, a single pass from leaves to root is sufficient, whereas most graph neural net architectures would require as many passes as the tree is deep (Kipf & Welling, 2017; Micheli, 2009; Scarselli et al., 2009). In other words, graph neural nets only consider neighboring nodes in a pass, whereas recursive nets incorporate information from all descendant nodes. Another reason why we choose to consider trees instead of general graphs is that graph grammar parsing is NP-hard (Turán, 1983), whereas regular tree grammar parsing is linear (Comon et al., 2008).

For the specific application of encoding syntax trees of computer programs, three further strategies have been proposed recently, namely: Code2vec considers paths from the root to single nodes and aggregates information across these paths using attention (Alon et al., 2019); AST-NN treats a syntax tree as a sequence of subtrees and encodes these subtrees first, followed by a GRU which encodes the sequence of subtree encodings (Zhang et al., 2019); and CuBERT treats source code as a sequence of tokens which are then plugged into a big transformer model from natural language processing (Kanade et al., 2020). Note that these models focus on encoding trees, whereas we wish to support encoding as well as decoding.

2.3 Tree decoding

We define a tree decoder for a regular tree grammar \({\mathcal {G}}\) as a mapping \(\psi : {\mathbb {R}}^n \rightarrow {\mathcal {L}}({\mathcal {G}})\) for some encoding dimensionality \(n \in {\mathbb {N}}\). In early work, Pollack (1990) and Sperduti (1994) already proposed decoding mechanisms using 'inverted' recursive neural networks, i.e. mapping from a parent representation to a fixed number of children, including a special 'none' token for missing children. Theoretical limits of this approach have been investigated by Hammer (2002), who showed that one requires exponentially many neurons to decode all possible trees of a certain depth. More recently, multiple works have considered the more general problem of decoding graphs from vectors, where a graph is generated by a sequence of node and edge insertions, which in turn is generated via a deep recurrent neural net (Zhang et al., 2019; Bacciu et al., 2019; Liu et al., 2018; Paaßen et al., 2021; You et al., 2018). From this family, the variational autoencoder for directed acyclic graphs (D-VAE) (Zhang et al., 2019) is most suited to trees because it explicitly prevents cycles. In particular, the network generates nodes one by one and then decides which of the earlier nodes to connect to the new node, thereby preventing cycles. We note that there is an entire branch of graph generation devoted specifically to molecule design which we cannot cover here (Sanchez-Lengeling & Aspuru-Guzik, 2018). However, tree decoding may serve as a subroutine, e.g. to construct a junction tree (Jin et al., 2018).

Another thread of research concerns the generation of strings from a context-free grammar, guided by a recurrent neural network (Kusner et al., 2017; Dai et al., 2018). Roughly speaking, these approaches first parse the input string, yielding a generating rule sequence, then convert this rule sequence into a vector via a convolutional neural net, and finally decode the vector back into a rule sequence via a recurrent neural net. This rule sequence, then, yields the output string. Further, one can incorporate additional syntactic or semantic constraints via attribute grammars in the rule selection step (Dai et al., 2018). We follow this line of research but use tree instead of string grammars and employ recursive instead of sequential processing. This latter change is key because it ensures that the distance between the encoding and decoding of a node is bounded by the tree depth instead of the tree size, thus decreasing the required memory capacity from linear to logarithmic in the tree size.

A third thread of research attempts to go beyond known grammars and instead tries to infer a grammar from data, typically using stochastic parsers and grammars that are controlled by neural networks (Allamanis et al., 2017; Dyer et al., 2016; Li et al., 2019; Kim et al., 2019; Yogatama et al., 2017; Zaremba et al., 2014). Our work is similar in that we also control a parser and a grammar with a neural network. However, our task is conceptually different: We assume a grammar is given and are solely concerned with autoencoding trees within the grammar’s language, whereas these works attempt to find tree-like structure in strings. While this decision constrains us to known grammars, it also enables us to consider non-binary trees and variable-length rules which are currently beyond grammar induction methods. Further, pre-specified grammars are typically designed to support interpretation and semantic evaluation (e.g. via an objective function for optimization). Such an interpretation is much more difficult for learned grammars.

Finally, we note that our own prior work (Paaßen et al., 2020) already combines tree grammars with recursive neural nets (in particular tree echo state networks (Gallicchio & Micheli, 2013)). However, in this paper we combine such an architecture with an end-to-end-learned variational autoencoder, thus guaranteeing a smooth latent space, a standard normal distribution in the latent space, and smaller latent spaces. It also yields empirically superior results, as we see later in the experiments.

2.4 Variational autoencoders

An autoencoder is a combination of an encoder \(\phi\) and a decoder \(\psi\) that is trained to minimize some form of autoencoding error, i.e. some notion of dissimilarity between an input \(\hat{x}\) and its autoencoded version \(\psi (\phi (\hat{x}))\). In this paper, we consider the variational autoencoder (VAE) approach of Kingma and Welling (2019), which replaces the deterministic encoder and decoder with probability distributions from which we can sample. More precisely, we introduce a probability density \(q_\phi (\mathbf {z} \vert \hat{x})\) for encoding \(\hat{x}\) into a vector \(\mathbf {z}\), and a probability distribution \(p_\psi (\hat{x}\vert \mathbf {z})\) for decoding \(\mathbf {z}\) into \(\hat{x}\).

Now, let \(\hat{x}_1, \ldots , \hat{x}_m\) be a training dataset. We train the autoencoder to minimize the loss:

$$\begin{aligned} \ell (\phi , \psi ) = \sum _{i=1}^m \Big ( {\mathbb {E}}_{q_\phi (\mathbf {z}_i \vert \hat{x}_i)}\Big [-\log p_\psi \big ( \hat{x}_i \big \vert \mathbf {z}_i \big )\Big ] + \beta \cdot {\mathcal {D}}_\mathrm {KL}\big (q_\phi (\cdot \vert \hat{x}_i) \big \Vert {\mathcal {N}}\big ) \Big ), \end{aligned}$$
(2)

where \({\mathcal {D}}_\mathrm {KL}\) denotes the Kullback-Leibler divergence between two probability densities and where \({\mathcal {N}}\) denotes the density of the standard normal distribution. \(\beta\) is a hyper-parameter to weigh the influence of the second term, as suggested by Burda et al. (2016).

Typically, the loss in (2) is minimized over tens of thousands of stochastic gradient descent iterations, such that the expected value over \(q_\phi (\mathbf {z}_i \vert \hat{x}_i)\) can be replaced with a single sample (Kingma & Welling, 2019). Further, \(q_\phi\) is typically modeled as a Gaussian with diagonal covariance matrix, such that the sample can be re-written as \(\mathbf {z}_i = \rho (\mathbf {\mu }_i + \mathbf {\epsilon }_i \odot \mathbf {\sigma }_i)\), where \(\mathbf {\mu }_i\) and \(\mathbf {\sigma }_i\) are deterministically generated by the encoder \(\phi\), where \(\odot\) denotes element-wise multiplication, and where \(\mathbf {\epsilon }_i\) is Gaussian noise with mean zero and standard deviation s. Here, s is a hyper-parameter which regulates the noise strength we impose during training.
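As an illustration, a minimal PyTorch sketch of the reparameterized sample and the Kullback-Leibler term from Eq. (2) could look as follows. This is generic VAE code under the Gaussian assumptions above, not the authors' exact implementation; the noise scale `s` corresponds to the hyper-parameter from the text.

```python
import torch

def sample_latent(mu: torch.Tensor, log_sigma2: torch.Tensor, s: float = 1.0) -> torch.Tensor:
    """Reparameterization trick: z = mu + eps * sigma with eps ~ N(0, s^2 I)."""
    sigma = torch.exp(0.5 * log_sigma2)
    eps = s * torch.randn_like(sigma)
    return mu + eps * sigma

def kl_to_standard_normal(mu: torch.Tensor, log_sigma2: torch.Tensor) -> torch.Tensor:
    """KL divergence between N(mu, diag(sigma^2)) and the standard normal, per sample."""
    return 0.5 * torch.sum(torch.exp(log_sigma2) + mu ** 2 - 1.0 - log_sigma2, dim=-1)

# The overall loss is then reconstruction + beta * KL, accumulated over the training data.
```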

We note that many extensions to variational autoencoders have been proposed over the years (Kingma & Welling, 2019), such as Ladder-VAE (Sønderby et al., 2016) or InfoVAE (Zhao et al., 2019). Our approach is generally compatible with such extensions, but our focus here lies on the combination of autoencoding, grammatical knowledge, and recursive processing, such that we leave extensions of the autoencoding scheme for future work.

Fig. 2: An illustration of the recursive tree grammar autoencoder (RTG-AE) for the tree \(\hat{x} = \wedge (x, \lnot (y))\). Steps a–d encode the tree as the vector \(\phi (\wedge (x, \lnot (y)))\) (also refer to Algorithm 3). Step e maps it to the VAE latent space vector \(\mathbf {z}\). Steps f–i decode the vector back to the tree \(\wedge (x, \lnot (y))\) (also refer to Algorithm 4)

3 Method

Our proposed architecture is a variational autoencoder for trees, where we construct the encoder as a bottom-up parser, the decoder as a regular tree grammar, and the reconstruction loss as the crossentropy between the true rules generating the input tree and the rules chosen by the decoder. An example autoencoding computation is shown in Fig. 2. Because our encoding and decoding schemes are closely related to recursive neural networks (Pollack, 1990; Sperduti & Starita, 1997; Sperduti, 1994), we call our approach recursive tree grammar autoencoders (RTG-AEs). We now introduce each of the components in turn.

3.1 Encoder

Our encoder is a bottom-up parser for a given regular tree grammar \({\mathcal {G}} = (\Phi , \Sigma , R, S)\), computing a vectorial representation in parallel to parsing. In more detail, we introduce an encoding function \(f^r : {\mathbb {R}}^{k \times n} \rightarrow {\mathbb {R}}^n\) for each grammar rule \(r = (A \rightarrow x(B_1, \ldots , B_k)) \in R\), which maps the encodings of all children to an encoding of the parent node. Here, n is the encoding dimensionality. As such, the grammar guides our encoding and fixes the number and order of inputs for our encoding functions \(f^r\). Note that, if \(k = 0\), \(f^r\) is a constant.

Next, we apply the functions \(f^r\) recursively during parsing, yielding a vectorial encoding of the overall tree. More precisely, our encoder \(\phi\) is defined by the recursive equation

$$\begin{aligned} \phi \Big (x(\hat{y}_1, \ldots , \hat{y}_k), A\Big ) = f^r\big (\phi (\hat{y}_1, B_1), \ldots , \phi (\hat{y}_k, B_k)\big ), \end{aligned}$$
(3)

where \(r = A \rightarrow x(B_1, \ldots , B_k)\) is the first rule in the sequence that generates \(x(\hat{y}_1, \ldots , \hat{y}_k)\) from A. As initial nonterminal A, we use the grammar's starting symbol S. Refer to Algorithm 3 for details. Figure 2a–d shows an example of the scheme.

We implement \(f^r\) as a single-layer feedforward neural network of the form \(f^r(\mathbf {y}_1, \ldots , \mathbf {y}_k) = \tanh (\sum _{j=1}^k \varvec{U}^{r, j} \cdot \mathbf {y}_j + \mathbf {a}^r )\), where the weight matrices \(\varvec{U}^{r, j} \in {\mathbb {R}}^{n \times n}\) and the bias vectors \(\mathbf {a}^r \in {\mathbb {R}}^n\) are parameters to be learned. For optional and starred nonterminals, we further define \(f^{B? \rightarrow \varepsilon } = f^{B^* \rightarrow \varepsilon } = \mathbf {0}\), \(f^{B? \rightarrow B}(\mathbf {y}) = \mathbf {y}\), and \(f^{B^* \rightarrow B, B^*}(\mathbf {y}_1, \mathbf {y}_2) = \mathbf {y}_1 + \mathbf {y}_2\). In other words, the empty string \(\varepsilon\) is encoded as the zero vector, an optional nonterminal is encoded via the identity, and starred nonterminals are encoded via a sum, following the recommendation of Xu et al. (2019) for graph neural nets.

Algorithm 3 (encoder)
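To give a flavor of the implementation, the following simplified PyTorch sketch (our own illustration, not the released code) implements the encoding layers \(f^r\) and the recursive encoding along the parse, ignoring optional and starred nonterminals. It reuses the hypothetical `Tree` and `Rule` classes from the sketches in Sect. 2.1; `rules_by_rhs` and `encoders` are dictionaries keyed by a rule's right-hand side.

```python
import torch
from torch import nn

class RuleEncoder(nn.Module):
    """Encoding layer f^r for one rule r = A -> x(B_1, ..., B_k)."""
    def __init__(self, rule: Rule, dim: int):
        super().__init__()
        self.U = nn.ParameterList(
            [nn.Parameter(torch.randn(dim, dim) / dim ** 0.5) for _ in rule.rhs])
        self.a = nn.Parameter(torch.zeros(dim))

    def forward(self, child_codes):
        # f^r(y_1, ..., y_k) = tanh(sum_j U^{r,j} y_j + a^r); a constant if k = 0
        h = self.a
        for U, y in zip(self.U, child_codes):
            h = h + U @ y
        return torch.tanh(h)

def encode(tree: Tree, rules_by_rhs, encoders):
    """Recursive encoding along the bottom-up parse (Algorithm 3, simplified).
    `encoders` maps each right-hand-side key to its RuleEncoder; in a full model
    these would be registered in an nn.ModuleDict so their parameters are trained."""
    nonterminals, codes = [], []
    for child in tree.children:
        B, code = encode(child, rules_by_rhs, encoders)
        nonterminals.append(B)
        codes.append(code)
    key = (tree.symbol, tuple(nonterminals))
    r = rules_by_rhs[key]            # unique rule, since the grammar is deterministic
    return r.lhs, encoders[key](codes)
```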

We can show that Algorithm 3 returns without error if and only if the input tree is part of the grammar’s tree language.

Theorem 3

Let \({\mathcal {G}} = (\Phi , \Sigma , R, S)\) be a deterministic regular tree grammar. Then, it holds: \(\hat{x}\) is a tree in \({\mathcal {L}}({\mathcal {G}})\) if and only if Algorithm 3 returns the nonterminal S as first output. Further, if Algorithm 3 returns with S as first output and some rule sequence \({\bar{r}}\) as second output, then \({\bar{r}}\) uniquely generates \(\hat{x}\) from S. Finally, Algorithm 3 has \({\mathcal {O}}(|\hat{x}|)\) time and space complexity.

Proof

Refer to Appendix A.3. \(\square\)

3.2 Decoder

Our decoder is a stochastic version of a given regular tree grammar \({\mathcal {G}} = (\Phi , \Sigma , R, S)\), controlled by two kinds of neural network. First, for any nonterminal \(A \in \Phi\), let \(L_A\) be the number of rules in R with A on the left hand side. For each \(A \in \Phi\), we introduce a linear layer \(h_A : {\mathbb {R}}^n \rightarrow {\mathbb {R}}^{L_A}\) with \(h_A(\mathbf {x}) = \varvec{V}^A \cdot \mathbf {x} + \mathbf {b}^A\). To decode a tree from a vector \(\mathbf {x} \in {\mathbb {R}}^n\) and a nonterminal \(A \in \Phi\), we first compute rule scores \(\mathbf {\lambda } = h_A(\mathbf {x})\) and then sample a rule \(r_l = (A \rightarrow x(B_1, \ldots , B_k)) \in R\) from the softmax distribution \(p_A(r_l\vert \mathbf {x}) = \exp (\lambda _l) / \sum _{l' = 1}^{L_A} \exp (\lambda _{l'})\). Then, we apply the sampled rule and use a second kind of neural network to decide the vectorial encodings for each generated child nonterminal \(B_1, \ldots , B_k\). In particular, for each grammar rule \(r = (A \rightarrow x(B_1, \ldots , B_k)) \in R\), we introduce k feedforward layers \(g^r_1, \ldots , g^r_k : {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) and decode the vector representing the jth child as \(\mathbf {y}_j = g^r_j(\mathbf {x})\). Finally, we decode the children recursively until no nonterminal is left. More precisely, the tree decoding is guided by the recursive equation

$$\begin{aligned} \psi (\mathbf {x}, A) = x\Big (\psi (\mathbf {y}_1, B_1), \ldots , \psi (\mathbf {y}_k, B_k) \Big ), \end{aligned}$$
(4)

where the rule \(r = A \rightarrow x(B_1, \ldots , B_k)\) is sampled from \(p_A\) as specified above and \(\mathbf {y}_j = g^r_j(\mathbf {x})\). As initial nonterminal argument we use the grammar's starting symbol S. For details, refer to Algorithm 4. Figure 2f–i shows an example of the scheme. Note that the time complexity is \({\mathcal {O}}(|\hat{x}|)\) for output tree \(\hat{x}\) because each recursion step adds exactly one terminal symbol; since the entire tree needs to be stored, the space complexity is \({\mathcal {O}}(|\hat{x}|)\) as well. Also note that Algorithm 4 is not generally guaranteed to halt (Chi, 1999). In practice, we solve this problem by imposing a maximum number of generated rules.

Algorithm 4 (decoder)
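A corresponding sketch of the decoding step is given below (again our own simplification, not the released code). The containers are hypothetical: `classifiers` maps each nonterminal A to its layer \(h_A\), `rules_by_lhs` maps A to the list of its rules, and `child_decoders` maps a rule (identified by its nonterminal and rule index) to its layers \(g^r_j\). A rule budget guards against the non-termination issue mentioned above.

```python
def decode(x: torch.Tensor, A: str, classifiers, rules_by_lhs,
           child_decoders, max_rules: int = 100) -> Tree:
    """Sample a tree from vector x and nonterminal A by recursively applying
    grammar rules drawn from the softmax distribution p_A(. | x)."""
    budget = [max_rules]

    def step(x: torch.Tensor, A: str) -> Tree:
        if budget[0] == 0:
            raise RuntimeError("maximum number of generated rules exceeded")
        budget[0] -= 1
        probs = torch.softmax(classifiers[A](x), dim=-1)   # softmax over scores h_A(x)
        l = int(torch.multinomial(probs, 1))               # sample rule index l
        r = rules_by_lhs[A][l]
        children = [step(g_j(x), B)                        # y_j = g^r_j(x)
                    for g_j, B in zip(child_decoders[(A, l)], r.rhs)]
        return Tree(r.symbol, children)

    return step(x, A)
```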

For optional nonterminals, we introduce a classifier \(h_{B?}\) which decides whether to apply \(B? \rightarrow B\) or \(B? \rightarrow \varepsilon\), and we define \(g^{B? \rightarrow B}_1(\mathbf {y}) = \mathbf {y}\). For starred nonterminals, we introduce \(h_{B^*}\) which decides whether to apply \(B^* \rightarrow B, B^*\) or \(B^* \rightarrow \varepsilon\), and we introduce new decoding layers \(g^{B^* \rightarrow B, B^*}_1\) and \(g^{B^* \rightarrow B, B^*}_2\).

An interesting special case is trees that implement lists. For example, consider a carbon chain CCC from chemistry. In the SMILES grammar (Weininger, 1988), this is represented as a binary tree of the form single_chain(single_chain(single_chain(chain_end, C), C), C), i.e. the symbol ‘single_chain’ acts as a list operator. In such a case, we recommend using a recurrent neural network, such as a gated recurrent unit (GRU) (Cho et al., 2014), to implement the decoding function \(g^{\text {Chain} \rightarrow \text {single\_chain}(\text {Chain}, \text {Atom})}_1\). In all other cases, we stick with a simple feedforward layer. We consider this issue in more detail in Appendix D.

Algorithm 5 (loss computation)

3.3 Training

We train our recursive tree grammar autoencoder (RTG-AE) in the variational autoencoder (VAE) framework, i.e. we try to minimize the loss in Eq. 2. More precisely, we define the encoding probability density \(q_\phi (\mathbf {z}\vert \hat{x})\) as the Gaussian with mean \(\mu (\phi (\hat{x}))\) and covariance matrix \(\mathrm {diag}\big [\sigma (\phi (\hat{x}))\big ]\), where the functions \(\mu : {\mathbb {R}}^n \rightarrow {\mathbb {R}}^{n_\mathrm {VAE}}\) and \(\sigma : {\mathbb {R}}^n \rightarrow {\mathbb {R}}^{n_\mathrm {VAE}}\) are defined as

$$\begin{aligned} \mu (\mathbf {x})&= \varvec{U}^\mu \cdot \mathbf {x} + \mathbf {a}^\mu , \nonumber \\ \sigma (\mathbf {x})&= \exp \left(\frac{1}{2} \cdot [\varvec{U}^\sigma \cdot \mathbf {x} + \mathbf {a}^\sigma ] \right), \end{aligned}$$
(5)

where \(\varvec{U}^\mu , \varvec{U}^\sigma \in {\mathbb {R}}^{n_\mathrm {VAE} \times n}\) and \(\mathbf {a}^\mu , \mathbf {a}^\sigma \in {\mathbb {R}}^{n_\mathrm {VAE}}\) are additional parameters.

To decode, we first transform the encoding vector \(\mathbf {z}\) with a single feedforward layer \(\rho : {\mathbb {R}}^{n_{\mathrm {VAE}}} \rightarrow {\mathbb {R}}^n\) and then apply the decoding scheme from Algorithm 4. As the decoding probability \(p_\psi (\hat{x}\vert {\mathbf {z}})\), we use the product over all probabilities \(p_A(r_t\vert \mathbf {x}_t)\) from line 5 of Algorithm 4, i.e. the probability of always choosing the correct grammar rule during decoding, provided that all previous choices have already been correct. The negative logarithm of this product can also be interpreted as the crossentropy loss between the correct rule sequence and the softmax probabilities from line 5 of Algorithm 4. The details of our loss computation are given in Algorithm 5. Note that the time and space complexity is \({\mathcal {O}}(|\hat{x}|)\) because the outer loop in lines 7–17 runs \(T = |\hat{x}|\) times, and the inner loop in lines 12–15 runs \(|\hat{x}|-1\) times in total because every node takes the role of a child exactly once (except for the root). Because the loss is differentiable, we can optimize it using gradient descent schemes such as Adam (Kingma & Ba, 2015). The gradient computation is performed by the PyTorch autograd system (Paszke et al., 2019).
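For illustration, the teacher-forced crossentropy of Algorithm 5 can be sketched as follows, using the same hypothetical containers as in the decoder sketch: we follow the correct rule sequence returned by the parser and accumulate the crossentropy between the rule scores \(h_A(\mathbf {x}_t)\) and the index of the correct rule.

```python
def reconstruction_loss(x: torch.Tensor, rule_seq, classifiers,
                        rules_by_lhs, child_decoders) -> torch.Tensor:
    """Crossentropy between the decoder's rule scores and the correct rule
    sequence (teacher forcing), accumulated along the tree structure.
    `classifiers`, `rules_by_lhs`, `child_decoders` are the hypothetical
    containers from the decoder sketch; `x` is the decoded vector rho(z)."""
    it = iter(rule_seq)

    def step(x: torch.Tensor, A: str) -> torch.Tensor:
        r = next(it)                                       # correct rule at this node
        l = rules_by_lhs[A].index(r)
        loss = nn.functional.cross_entropy(
            classifiers[A](x).unsqueeze(0), torch.tensor([l]))
        for g_j, B in zip(child_decoders[(A, l)], r.rhs):
            loss = loss + step(g_j(x), B)                  # recurse with y_j = g^r_j(x)
        return loss

    return step(x, "S")
```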

Table 1 The statistics for all four datasets
Table 2 The number of parameters for all models on all datasets

4 Experiments and discussion

We evaluate the performance of RTG-AEs on four datasets, namely:

  • Boolean Randomly sampled Boolean formulae over the variables x and y with at most three binary operators, e.g. \(x \wedge \lnot y\) or \(x \wedge y \wedge (x \vee \lnot y)\).

  • Expressions Randomly sampled algebraic expressions over the variable x of the form \(3 * x + \sin (x) + \exp (2 / x)\), i.e. a binary operation plus a unary operation plus a unary operation applied to a binary operation. This dataset is taken from Kusner et al. (2017).

  • SMILES Roughly 250k chemical molecules as SMILES strings (Weininger, 1988), as selected by Kusner et al. (2017).

  • Pysort 29 Python sorting programs and manually generated preliminary development stages of these programs, resulting in 294 programs overall.

Table 1 shows dataset statistics; Appendix C lists the grammars for each dataset as well as the detailed sampling strategies for Boolean and Expressions.

We compare RTG-AEs to the following baselines.

  • GVAE Grammar variational autoencoders (Kusner et al., 2017) are grammar-based auto-encoders for strings. A string is first parsed by a context-free grammar, yielding a sequence of grammar rules. Next, the rule sequence is encoded via one-hot-coding, followed by three layers of 1D convolutions, followed by a fully connected layer. Note that this requires the maximum sequence length to be fixed in advance. Decoding occurs via a three-layer gated recurrent unit (Cho et al., 2014, GRU), followed by masking out invalid rule sequences according to the grammar. In this paper, we slightly adapt GVAE because we apply it to rule sequences of a regular tree grammar instead of a context-free grammar (otherwise, GVAE could not be applied to trees). Note that GVAE is different from RTG-AE because it uses sequential processing instead of recursive processing.

  • GRU-TG-AE Even though GVAE uses sequential instead of recursive processing, it is not a strict ablation of RTG-AE because it uses different architectures for encoding (conv-nets) and decoding (GRUs). Therefore, we also introduce an ablation of RTG-AE which uses GRUs for both encoding and decoding, but otherwise the same architecture. We call this baseline GRU-TG-AE.

  • D-VAE Directed acyclic graph variational autoencoders (Zhang et al., 2019) encode an input directed acyclic graph via a graph neural net that computes encodings following the structure of the graph, and hence becomes equivalent to the recursive neural net used by RTG-AE. However, the encoding does not use any grammatical information. It simply encodes the symbol at each node via one-hot coding and uses it as an additional input to the recursive net. For decoding, D-VAE uses the following recurrent scheme: Update the graph representation \({\mathbf {h}}\) with a GRU. Sample a new node from a softmax distribution over node types, where the scores for each type are computed via an MLP from \({\mathbf {h}}\). Then, for each previously sampled node, compute an edge probability via an MLP based on the encoding of the current node and the encoding of the previous node, and sample the edge from a Bernoulli distribution with the computed probability. Continue sampling nodes and edges until a special end token is sampled. Note that this scheme is different from RTG-AE because it does not use grammar knowledge. D-VAE thus serves as the ablation of our model with recursive processing and deep learning but without grammar knowledge.

  • TES-AE Tree echo state autoencoders (Paaßen et al., 2020) are a predecessor of RTG-AE. TES-AEs use the same recursive, grammar-based encoding and decoding scheme as presented in Sect. 3. However, the training scheme is entirely different: TES-AEs set the parameters of the encoding layers \(f^r\) and decoding layers \(g^r_j\) randomly. Only the rule classifiers \(h_A\) are trained. This implies that the encoding for each tree is generated by an untrained function, as usual in the echo state network paradigm. Accordingly, the model needs a very high-dimensional encoding space to make sure that the untrained encodings still retain sufficient information to represent the tree. On the upside, treating \(f^r\) and \(g^r_j\) as fixed massively reduces the number of trainable parameters and speeds up training. TES-AE serves as an ablation of RTG-AE because it uses recursive processing and grammar knowledge but a random, untrained encoding.

The number of parameters for all models on all datasets is shown in Table 2. We trained all neural networks using Adam (Kingma & Ba, 2015) with a learning rate of \(10^{-3}\) and a ReduceLROnPlateau scheduler with minimum learning rate \(10^{-4}\), following the learning procedure of Kusner et al. (2017) as well as Dai et al. (2018). We sampled 100k trees for the first two and 10k trees for the latter two datasets. To obtain statistics, we repeated the training ten times with different samples (by generating new data for Boolean and Expressions and by performing a 10-fold crossvalidation for SMILES and Pysort). Following Zhang et al. (2019), we used a batch size of 32 for all approaches. For each approach and each dataset, we optimized the regularization strength \(\beta\) and the sampling standard deviation s in a random search with 20 trials over the range \([10^{-5}, 1]\), using separate data. For the first two datasets, we set \(n = 100\) and \(n_{\mathrm {VAE}} = 8\), whereas for the latter two we set \(n = 256\) and \(n_{\mathrm {VAE}} = 16\). For TES-AE, we followed the protocol of Paaßen et al. (2020), training on a random subset of 500 training data points, and we optimized the sparsity, spectral radius, and regularization strength with the same hyper-parameter optimization scheme.
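As a brief sketch of this training setup (with hypothetical arguments: `model` is assumed to expose the per-tree loss from Algorithm 5 as `model.loss`, `training_batches` yields batches of 32 trees, and `val_loss` computes a validation loss), the optimizer and scheduler could be configured as follows.

```python
import torch

def train(model, training_batches, val_loss, num_epochs: int = 100):
    """Adam with learning rate 1e-3 and ReduceLROnPlateau down to 1e-4, as in the text.
    `model`, `training_batches`, and `val_loss` are hypothetical stand-ins."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, min_lr=1e-4)
    for _ in range(num_epochs):
        for batch in training_batches():
            optimizer.zero_grad()
            loss = sum(model.loss(tree) for tree in batch) / len(batch)
            loss.backward()
            optimizer.step()
        scheduler.step(val_loss())        # reduce the learning rate on plateaus
    return model
```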

The SMILES experiment was performed on a computation server with a 24-core CPU and 48GB RAM, whereas all other experiments were performed on a consumer-grade laptop with a 4-core Intel i7 CPU and 16GB RAM. All experimental code, including all grammars and implementations, is available at https://gitlab.com/bpaassen/RTGAE.

We measure the autoencoding error on test data in terms of the root mean square tree edit distance (Zhang & Shasha, 1989). We use the root mean square error (RMSE) rather than the log likelihood because D-VAE measures log likelihood differently from GVAE, GRU-TG-AE, and RTG-AE, and TES-AE does not measure log likelihood at all. By contrast, the RMSE is agnostic to the underlying model. Further, we use the tree edit distance as a tree metric because it is defined on all possible labeled trees without regard for the underlying distribution or grammars (Bille, 2005) and hence does not favor any of the models.

Table 3 The average autoencoding RMSE (± SD)
Table 4 The average runtime in seconds (± SD) as measured by Python

The RMSE results are shown in Table 3. We observe that RTG-AE achieves significantly lower errors compared to all baselines on the first two datasets (p < 0.001 in a Wilcoxon signed rank test), significantly lower than all but D-VAE on the SMILES dataset (p < 0.01), and significantly lower than all but TES-AE on the Pysort dataset (p < 0.001). On SMILES, we note that we used a GRU in RTG-AE to represent chains of atoms, as described in Sect. 3. The vanilla RTG-AE performs considerably worse, with an RMSE of 594.92 (the same as GVAE). On Pysort, TES-AE performs best, which is likely due to the fact that Pysort is roughly 100 times smaller than the other datasets while also having the most grammar rules. For this dataset, the amount of data was likely too small to fit a deep model.

Training times in seconds are shown in Table 4. We note that TES-AE is always the fastest because it only needs to fit the last layer. All other methods use deep learning. Among these models, RTG-AE had the lowest training times, likely because it consists of feedforward layers (except for one GRU layer in the SMILES case) whereas all other models use GRUs.

Table 5 The rate of syntactically correct trees decoded from standard normal random vectors

To evaluate the ability of all models to generate syntactically valid trees, we sampled 1000 standard normal random vectors and decoded them with all models. Then, we checked syntactic correctness against the regular tree grammar of the respective dataset (refer to Appendix C). The percentage of correct trees is shown in Table 5. Unsurprisingly, D-VAE has the worst results across the board because it does not use grammatical knowledge for decoding. In principle, the other models should always reach 100% because their architecture guarantees syntactic correctness. Instead, we observe that the rates drop far below 100% for GRU-TG-AE on Pysort and for all models on SMILES. This is because the decoding process can fail if it gets stuck in a loop. Overall, RTG-AE fails the least with an average of 83.74% across datasets, whereas D-VAE has 13.53%, GVAE has 72.53%, and GRU-TG-AE has 51%.

We also evaluated the utility of the autoencoder for optimization, in line with Kusner et al. (2017). The idea of these experiments is to find an optimal tree according to some objective function by using a gradient-free optimizer, such as Bayesian optimization, in the latent space of our tree autoencoder. If the optimizer is able to achieve a better objective function value, this indicates that the latent space behaves smoothly with respect to the objective function and, thus, may be a useful representation of the trees.

Kusner et al. (2017) considered two datasets, namely Expressions and SMILES. For the Expressions dataset, Kusner et al. (2017) suggest as objective function the log mean square error compared to the ground truth expression \(\frac{1}{3} + x + \sin (x * x)\) for 1000 linearly spaced values of x in the range \([-10, +10]\). Any expression which produces outputs similar to the ground truth thus achieves a good objective function value, whereas expressions that behave very differently achieve a worse one. For the SMILES dataset, Kusner et al. (2017) suggest an objective function which includes a range of chemically relevant properties, such as logP, synthetic accessibility, and cycle length. Here, higher values are better. For details, please refer to the original paper and/or our source code at https://gitlab.com/bpaassen/RTG-AE. We used both objective functions exactly as specified in the original work. For SMILES, we also re-trained all models with \(n_\mathrm {VAE} = 56\) and 50k training samples to be consistent with Kusner et al. (2017).

We tried to use the same optimizer as Kusner et al. (2017), namely Bayesian optimization, but were unable to get the original implementation to run. Instead, we opted for a CMA-ES optimizer, namely the cma implementation in Python. CMA-ES is a well-established method for high-dimensional, gradient-free optimization and has shown results competitive with Bayesian optimization in some cases (Loshchilov & Hutter, 2016). Conceptually, CMA-ES fits well with variational autoencoders because both VAEs and CMA-ES use Gaussian distributions in the latent space. In particular, we initialize CMA-ES with the standard Gaussian and then let it adapt the mean and covariance matrix to move towards samples with better objective function values. As hyper-parameters, we set 15 iterations and a budget of 750 objective function evaluations, which is the same as in Kusner et al. (2017). To obtain statistics, we performed the optimization 10 times for each autoencoder and each dataset.
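For illustration, a minimal sketch with the cma package could look as follows; `decode_to_tree`, `objective`, and `n_vae` are hypothetical stand-ins for the respective autoencoder's decoder, the dataset's objective function (assumed here to be minimized), and the latent dimensionality.

```python
import cma
import numpy as np

# decode_to_tree, objective, n_vae: hypothetical stand-ins (see text).
def latent_objective(z: np.ndarray) -> float:
    """Objective evaluated through the decoder; failed decodings get the worst score."""
    try:
        return objective(decode_to_tree(z))
    except RuntimeError:                 # e.g. the rule budget was exceeded
        return float("inf")

es = cma.CMAEvolutionStrategy(np.zeros(n_vae), 1.0,      # start at the standard Gaussian
                              {"maxiter": 15, "maxfevals": 750})
while not es.stop():
    candidates = es.ask()                # sample a population from the current Gaussian
    es.tell(candidates, [latent_objective(z) for z in candidates])
best_tree = decode_to_tree(es.result.xbest)
```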

Table 6 The optimization scores for the Expressions (lower is better) and SMILES (higher is better) with median tree and median score (\(\pm \frac{1}{2}\) IQR)
Fig. 3: A 2D t-SNE reduction of the codes of 1000 random molecules from the SMILES dataset by GVAE (left), TES-AE (center), and RTG-AE (right). Color indicates the objective function value. Rectangles indicate the 30-means cluster with highest mean objective function value ± std

The median results (± inter-quartile ranges) are shown in Table 6. We observe that RTG-AEs significantly outperform all baselines on both datasets (p < 0.01 in a Wilcoxon rank-sum test) by a difference of several inter-quartile ranges. On the SMILES dataset, D-VAE failed with an out-of-memory error during re-training and GVAE failed because CMA-ES could not find any semantically valid molecule in the latent space, such that both methods receive an n.a. score. We note that, on both datasets, our results for GVAE are worse than the ones reported by Kusner et al. (2017), which is likely because Bayesian optimization is a stronger optimizer in these cases. Still, our results show that even the weaker CMA-ES optimizer can consistently achieve good scores in the RTG-AE latent space. We believe there are two reasons for this: First, RTG-AE tends to have a higher rate of syntactically correct trees (refer to Table 5); second, recursive processing tends to cluster similar trees together (Paaßen et al., 2020; Tiňo & Hammer, 2003) in a fractal fashion, such that an optimizer only needs to find a viable cluster and optimize within it. Figure 3 shows a t-SNE visualization of the latent spaces, indicating a cluster structure for TES-AE and RTG-AE, whereas GVAE yields a single blob. We also performed a 30-means clustering in the latent space, revealing that the cluster with the highest objective function value for TES-AE and RTG-AE had a higher mean and lower variance compared to the one for GVAE (black rectangles in Fig. 3).

5 Conclusion

In this contribution, we introduced the recursive tree grammar autoencoder (RTG-AE), a novel neural network architecture that combines variational autoencoders with recursive neural networks and regular tree grammars. In particular, our approach encodes a tree with a bottom-up parser and decodes it with a tree grammar, both learned via neural networks and variational autoencoding. Experimentally, we showed that the unique combination of recursive processing, grammatical knowledge, and deep learning generally improves autoencoding error, training time, and optimization performance beyond existing models that use only two of these features, but not all three. The lower autoencoding error can be explained by three conceptual observations: First, recursive processing follows the tree structure, whereas sequential processing introduces long-range dependencies between children and parents in the tree; second, grammatical knowledge avoids obvious decoding mistakes by limiting the terminal symbols we can choose; third, deep learning allows all model parameters to be adjusted to the data instead of merely the last layer. The lower training time can be explained by the fact that RTG-AE uses (almost) only feedforward layers, whereas other models require recurrent layers. The improved optimization performance is likely because recursive processing yields a clustered latent space that helps optimization, grammar knowledge avoids many trees with low objective function values, and variational autoencoding encourages a smooth encoding space that is easier to optimize over.

Nonetheless, our work still has notable limitations that provide opportunities for future work. First, we need a pre-defined, deterministic regular tree grammar, which may not be available in all domains. Second, we only consider optimization as downstream task, whereas the utility of our model for other tasks—such as time series prediction—remains to be shown. Third, on the (small) Pysort dataset, RTG-AE was outperformed by the much simpler TES-AE model, indicating that a sufficient dataset size is necessary to fit RTG-AE. This data requirement ought to be investigated in more detail. Finally, RTG-AE integrates syntactic domain knowledge (in form of a grammar) but currently does not consider semantic constraints. Such constraints could be integrated in future extensions of the model.