Evolving graphs with semantic neutral drift

We introduce the concept of Semantic Neutral Drift (SND) for genetic programming (GP), where we exploit equivalence laws to design semantics preserving mutations guaranteed to preserve individuals’ fitness scores. A number of digital circuit benchmark problems have been implemented with rule-based graph programs and empirically evaluated, demonstrating quantitative improvements in evolutionary performance. Analysis reveals that the benefits of the designed SND reside in more complex processes than simple growth of individuals, and that there are circumstances where it is beneficial to choose otherwise detrimental parameters for a GP system if that facilitates the inclusion of SND.


Introduction
In genetic programming the ability to escape local optima is key to finding globally optimal solutions. Neutral drift, a mechanism whereby individuals with fitness-equivalent phenotypes to the existing population may be generated by mutation [10], offers the search new neighborhoods to sample, thus increasing the chance of leaving local optima. A number of studies on neutrality in Cartesian Genetic Programming (CGP) [23,37,33] find it to be an almost always beneficial property for the studied problems. In general, comparative studies [22] find that CGP using only mutation and neutral drift is able to compete with traditional tree-based Genetic Programming (GP), which uses more familiar crossover operators (see [18]) to introduce genetic variation. [33] makes a distinction between implicit neutral drift (where a genetic operator yields a semantically equivalent child) and explicit neutral drift (where a genetic operator only modifies intronic code). We note that many comparative studies largely focus on the role of both types of neutral drift as byproducts of existing genetic operators and neutrality within the representation [23,37,33,3] rather than as deliberately designed features of an evolutionary system. We propose the opposite: to employ domain knowledge of equivalence laws to specify mutation operators on the active components of individuals which always induce neutral drift. Hence our work can be viewed as an attempt to explicitly induce additional implicit neutral drift in the sense of [33].
We build on our approach EGGP (Evolving Graphs by Graph Programming) [1] by implementing semantics-preserving mutations to directly achieve neutral drift on the active components of individual solutions. Here, we implement logical equivalence laws as mutations on the active components of candidate solutions to digital circuit problems to produce semantically equivalent, equally fit children. While our semantics-preserving mutations produce semantically equivalent children, they do not guarantee preservation of size; our fitness measures evaluate semantics only, not, for example, size or complexity.
We describe and implement Semantic Neutral Drift straightforwardly by using rule-based graph programs, here in the probabilistic language P-GP 2 [2]. This continues from [1], where we use a probabilistic variant of the graph programming language GP 2 to design acyclicity-preserving edge mutations for digital circuits that correctly identify the set of all possible valid mutations. The use of P-GP 2 here enables concise description of complex transformations such as DeMorgan's laws by identifying and rewriting potential matches for these laws in the existing formalism of graph transformation. This reinforces the notion that the direct encoding of solutions as graphs is useful, as it allows immediate access to the phenotype of individual solutions and makes it possible to design complex mutations by using powerful algorithmic concepts from graph programming.
We investigate four sets of semantics-preserving mutations for digital circuit design, three built upon logical equivalence laws and a fourth taken from term-graph rewriting. We run EGGP with each rule-set on a set of benchmark problems and establish statistically significant improvements in performance for most of our visited problems. An analysis of our results reveals evidence that it is the semantic transformations, beyond simple 'neutral growth', which are aiding performance. We then combine our two best performing sets of mutation operators and evaluate this new set under the same conditions, achieving further improvements. We also provide evidence that, although operators implementing semantics-preserving mutations may be more difficult to use, the inclusion of those semantics-preserving mutations may allow evolution to out-perform equivalent processes that use 'easier' operators.
The rest of this paper is organised as follows. In Section 2 we review existing literature on Neutral Drift in Genetic Programming. In Sections 3 and 4 we describe the graph programming language GP 2 and our existing approach EGGP. In Section 5 we describe our extension to EGGP where we incorporate deliberate neutral drifts into the evolutionary process. In Section 6 we describe our experimental setup and in Section 7 we give the results from these experiments. In Section 8 we provide in-depth analysis of these results to establish precisely what components of our approach are aiding performance. In Section 9 we conclude our work and propose potential future work on this topic.

Neutral Drift in Genetic Programming
Neutral drift remains a controversial subject in Evolutionary Computation. See [10] for a survey. Here, we focus on neutrality in the context of genetic programming as the most relevant area to our own work; there is also literature on, for example, genetic algorithms [13] and landscape analysis [4].
The process of neutral drift might be described as the mutation of individual candidate solutions to a given problem without advantageous or deleterious effect on their fitness. This exposes the evolutionary algorithm to a fitness 'plateau' with each fitness-equivalent individual offering a different portion of the landscape to sample. Neutral drift can be viewed as random walks on the neighborhoods of surviving candidate solutions. In a system with neutral drift, an apparently local optimum might be escaped by 'drifting' to some other fitness-equivalent solution that has advantageous mutations available.
The most apparent demonstration of neutral drift in genetic programming literature occurs in Cartesian Genetic Programming (CGP) [24]. In CGP, individuals encode directed acyclic graphs; some portion of a genome may be 'inactive', contributing nothing to the phenotypic fitness, because it represents a subgraph that is not connected to the phenotype's main graph. These inactive genes can mutate without influencing an individual's fitness and then, at some later point, may become active. Early work on CGP has found that by allowing neutral drift to take place (by choosing a fitness-equivalent child over its parent in the 1 + λ algorithm), the success rate of experiments significantly improves [37]. A later claim that neutrality in CGP aids search in needle-in-haystack problems [39] has been contested by a counter-claim that better performance can be achieved by random search [7]. [23] finds that better performance can be achieved with neutral drift enabled by increasing the amount of redundant material present in individuals. [33] establishes a distinction between explicit and implicit neutral drift. Explicit neutral drift occurs on inactive components of the individual, whereas implicit neutral drift occurs when active components of the individual are mutated but the fitness does not change. The authors were able to isolate explicit neutral drift and demonstrate that it offers additive benefits beyond those of implicit neutral drift.
Outside of CGP, [3] proposes a form of Linear Genetic Programming where programs are decoded from bit-strings, and redundancy exists, in that certain operations have multiple representations. A study of evolvability in Linear GP [14] found that neutrality cooperates with 'variability' (the ability of a system to generate phenotypic changes) to generate adaptive phenotypic changes which aid the overall ability of the system to respond to the landscape. Recent work [15] studying the role of neutrality in small Linear GP programs found that the robustness of a genotype (the proportion of its neighbours within the landscape which are neutral changes) has a complex and non-monotonic relationship with the overall evolvability of the genotype.
In [8], binary decision diagrams are evolved with explicit neutral mutations. Although those neutral mutations are not isolated for their advantages and disadvantages, a later work has found that a higher rate of neutral drift on binary decision diagrams is advantageous [9]. Koza also makes some reference to the ideas we employ in Section 5 when he describes the editing of digital circuits by applying DeMorgan's laws to them [18, Ch. 6]. A study of neutrality in tree-based GP for boolean functions [35] found a correlation between using a more effective function set and the existence of additional neutrality when using that function set.
While not directly related to neutrality, a number of investigations have been carried out exploring the notion of semantically aware genetic operators to improve the locality of mechanisms such as crossover in tree-based GP [25,26]. We refer the reader to the extensive survey [34] on this field of research. Whereas neutrality is the process whereby phenotypically identical and genotypically distinct individuals are visited by the evolutionary process, semantically aware genetic operators attempt to produce phenotypically 'close' individuals to improve the locality of the search neighbourhood. It should be noted that employing semantically aware genetic operators may sometimes lead to a loss of diversity [27]. It could be argued that the deliberate neutral operators we propose in this work are a form of semantically aware mutation operators designed to explicitly exploit neutrality.
Neutral drift has some parallels with work on biological evolution. Kimura's Neutral Theory of Molecular Evolution [17] posits that most mutations in nature are neither advantageous nor deleterious, instead introducing 'neutral' changes that do not affect phenotypes but account for much of the genetic variation within and between species. While Kimura's theory remains controversial (see [12]), it appears to loosely correspond to the notions of neutral mutation described in genetic programming literature.
Throughout the literature we have covered, neutrality is mostly considered in the sense of explicit neutral drift as defined in [33]. Conversely, in our work here we focus on neutral drift on the active components of individual solutions, with some relationship therefore to the neutral mutations on binary decision diagrams in [8].

Graph Programming with P-GP 2
Here we give a brief introduction to the graph programming language GP 2; see [29] for a detailed account of the syntax and semantics of the language.
A graph program consists of declarations of graph transformation rules and a main command sequence controlling the application of the rules. Graphs are directed and may contain loops and parallel edges. The rules operate on host graphs whose nodes and edges are labelled with integers, character strings or lists of integers and strings. Variables in rules (relevant for this paper) are of type int, string or list. Integers and strings are considered as lists of length one, hence every label in GP 2 is a list. For example, in Figure 1, the list variables a, c and e are used as node labels while b and d serve as edge labels. The small numbers attached to nodes are identifiers that specify the correspondence between the nodes in the left and the right graph of the rule.
Besides carrying list expressions, nodes and edges can be marked. For example, in the program of Figure 3, blue and red node marks are used to prevent the rule mutate edge from creating a cycle. In rules, a magenta colour can be used as a wildcard for any mark. For example, in the rules remove edge, unmark edge and unmark node of Figure 6, pairs of magenta nodes with the same identifier on the left and the right represent nodes with the same green, blue or grey mark.
The principal programming constructs in GP 2 are conditional graph-transformation rules labelled with expressions. To apply a rule to a host graph, the rule is first instantiated by replacing all variables with values and evaluating the expressions. The rule's condition, if present, has to evaluate to true. Then the left graph of the instantiated rule is matched (injectively) with a subgraph of the host graph. Finally the subgraph is replaced with the right graph of the instantiated rule. This means that the nodes corresponding to the numbered nodes of the left graph are preserved (but possibly re-labelled), any other nodes and all edges of the left graph are deleted, and any unnumbered nodes and all edges of the rule's instantiated right graph are inserted.
For example, given any host graph G, the program in Figure 1 produces the smallest transitive graph that results from adding unlabelled edges to G. (A graph is transitive if for each directed path from a node v1 to another node v2, there is an edge from v1 to v2.) The program applies the single rule link as long as possible to a host graph. In general, any subprogram can be iterated with the postfix operator "!". Applying link amounts to non-deterministically selecting a subgraph of the host graph that matches link's left graph, and adding to it an edge from node 1 to node 3 provided there is no such edge (with any label). The application condition where not edge(1, 3) ensures that the program terminates and extends the host graph with a minimal number of edges.
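The effect of iterating link can be imitated outside GP 2. The following Python sketch is an illustration only, not the GP 2 program of Figure 1: the function name transitive_closure_by_link and the representation of a graph as a set of (source, target) pairs are assumptions made here, labels are ignored, and the non-deterministic choice of match is replaced by a seeded random choice. Node identifiers are assumed to be sortable.

```python
import random

def transitive_closure_by_link(edges, seed=0):
    """Apply a link-style rule as long as possible.

    A match is a path x -> y -> z over three distinct nodes (matching is
    injective) such that no edge x -> z exists yet. Applying the rule adds
    the edge (x, z). The loop stops when no match remains.
    """
    rng = random.Random(seed)
    edges = set(edges)
    while True:
        matches = sorted({
            (x, z)
            for (x, y) in edges
            for (y2, z) in edges
            if y == y2 and len({x, y, z}) == 3 and (x, z) not in edges
        })
        if not matches:
            return edges
        # One match is chosen and rewritten per step, as in a single rule application.
        edges.add(rng.choice(matches))

closure = transitive_closure_by_link({(1, 2), (2, 3), (3, 4)})
```

Whatever order the matches are chosen in, the loop ends in the same edge set, because each step only adds an edge that transitivity requires.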
Besides applying individual rules, a program may apply a rule set {r1, ..., rn} to the host graph by non-deterministically selecting a rule ri among the applicable rules and applying it. Further control constructs [...]
In general, the execution of a program on a host graph may result in different graphs, fail, or diverge. The semantics of a program P maps each host graph to the set of all possible outcomes [28]. GP 2 is computationally complete in that every computable function on graphs can be programmed [29]. GP 2's inherent non-determinism is useful as many graph problems are naturally multi-valued, for example the computation of a shortest path or a minimum spanning tree. The results described in the rest of this paper have been obtained with a probabilistic extension of GP 2, called P-GP 2. This provides a rule-set command [r1, ..., rn] which chooses a rule uniformly at random among the applicable rules and applies the rule with a match selected uniformly at random among all matches of that rule [2].
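The two-stage random choice made by the probabilistic rule-set command can be sketched in Python. This is a hedged illustration rather than the P-GP 2 implementation: the function apply_rule_set, the encoding of a rule as a (find_matches, rewrite) pair, and the toy rules over tuples of integers are all hypothetical names introduced here.

```python
import random

def apply_rule_set(rules, host, rng):
    """Mimic a probabilistic rule-set call [r1, ..., rn].

    Each rule is a pair (find_matches, rewrite). A rule is drawn uniformly
    from those with at least one match, then one of its matches is drawn
    uniformly. None is returned when no rule is applicable (failure).
    """
    applicable = []
    for find_matches, rewrite in rules:
        matches = find_matches(host)
        if matches:
            applicable.append((rewrite, matches))
    if not applicable:
        return None
    rewrite, matches = rng.choice(applicable)
    return rewrite(host, rng.choice(matches))

# Toy rules over a tuple of integers (hypothetical, for illustration only).
def even_positions(host):
    return [i for i, v in enumerate(host) if v % 2 == 0]

def increment_at(host, i):
    return host[:i] + (host[i] + 1,) + host[i + 1:]

def no_matches(host):
    return []

def unchanged(host, match):
    return host

rng = random.Random(1)
result = apply_rule_set(
    [(no_matches, unchanged), (even_positions, increment_at)], (1, 2, 4), rng
)
```

Note that the rule is chosen before the match, so a rule with a single match is as likely to fire as a rule with many matches.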

Introduction to EGGP
In [1] we introduce EGGP, an evolutionary algorithm that evolves graphs (specifically, in that case, digital circuits) using graph programming. We have found that by evolving graphs directly and designing mutation operators that respect the constraints of the problem, we are able to significantly outperform CGP under similar conditions on a number of digital circuit benchmark problems. In this section we formally describe this approach.
Our approach is justified by two observations: (i) the use of graphs as a representation is beneficial, as it directly addresses a number of motivating problems within computer science such as neural network topology, Bayesian network topology, digital circuit design, program design, and quantum circuit design; (ii) with graphs as a representation it is necessary to have a language to describe the neighborhoods (mutations) on individuals. Graph programming readily lends itself to this endeavour due to its computational completeness over functions on graphs.
Our approach is not alone in addressing the issue of evolving graphs and graph-like programs. CGP [24], where individuals encode directed acyclic graphs, is a primary candidate for related work and is used as a benchmark here. Parallel Distributed Genetic Programming [30,31] introduces a 'graph on a grid' representation for genetic programming in a similar manner to the grid-like description of CGP, allowing the evolution of programs with multiple outputs and sharing. MIOST [20] also extends traditional genetic programming to these same concepts of multiple outputs and sharing. For a more detailed discussion of related approaches, see [1]. Our approach differs from these in that (i) we deal with graphs directly rather than through an encoding or some subset of graphs; and (ii) our mutation operators are domain-specific and may be changed to suit the constraints of a problem and to exploit domain-specific knowledge.
Here we address the problems of digital circuits, primarily because they suit our discussion of neutral drift by design. For this reason, the rest of this paper focuses on the evolution of digital circuits as a concrete case study.

Evolving Digital Circuits as Graphs
We directly encode digital circuits as graphs such that the graph contains input and output nodes (corresponding to the inputs and outputs of the intended problem) and function nodes. In P-GP 2, we identify input nodes and output nodes by labels of the form "IN" : x and "OUT" : y respectively, where x and y are integers that identify which particular input or output the node corresponds to. Function nodes are labelled as "[fi]" : a, where [fi] is a string uniquely identifying function fi ∈ F and a is the arity of fi. In this work our functions are symmetrical, but an extension is available to associate each edge with an integer to identify which particular input of a function it corresponds to. Fig. 2 shows a digital circuit encoded in this form.
For a specific i input, o output problem over function set F, we must evolve graphs that satisfy the following constraints:
- Individual solutions are acyclic.
- Individual solutions have i input nodes.
- Individual solutions have o output nodes.
- All other nodes that are neither inputs nor outputs must be function nodes associated with some function fi ∈ F and have exactly a outgoing edges, where a is the arity of fi.
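The constraints listed above can be checked mechanically. The sketch below is one possible Python rendering and is not part of EGGP: the function is_valid_circuit, the dictionary labels (node to label pair) and the edge list are assumptions made for illustration. Only the four stated constraints are tested.

```python
def is_valid_circuit(labels, edges, n_inputs, n_outputs, arities):
    """Check the four constraints on an individual.

    labels maps node -> ("IN", k), ("OUT", k) or (function_name, arity);
    edges is a list of (source, target) pairs; arities maps each function
    name in the function set F to its arity.
    """
    kinds = [label[0] for label in labels.values()]
    if kinds.count("IN") != n_inputs or kinds.count("OUT") != n_outputs:
        return False
    out_degree = {v: 0 for v in labels}
    for source, _ in edges:
        out_degree[source] += 1
    for v, (name, value) in labels.items():
        if name in ("IN", "OUT"):
            continue
        # Function nodes must belong to F and have exactly `arity` outgoing edges.
        if arities.get(name) != value or out_degree[v] != value:
            return False
    # Acyclicity: repeatedly peel off nodes that have no incoming edge.
    remaining, live = set(labels), list(edges)
    while remaining:
        no_incoming = remaining - {t for _, t in live}
        if not no_incoming:
            return False  # every remaining node has an incoming edge: a cycle
        remaining -= no_incoming
        live = [(s, t) for s, t in live if s in remaining]
    return True

ARITIES = {"AND": 2, "OR": 2, "NOT": 1}
labels = {0: ("IN", 0), 1: ("IN", 1), 2: ("AND", 2), 3: ("NOT", 1), 4: ("OUT", 0)}
edges = [(2, 0), (2, 1), (3, 2), (4, 3)]  # the circuit NOT(AND(in0, in1))
valid = is_valid_circuit(labels, edges, 2, 1, ARITIES)
```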
We use three graph programs to induce a landscape: InitCircuit, MutateFunction and MutateEdge. The first is the initialisation program for generating individual graphs, and the others are mutation operators. InitCircuit and MutateFunction are given in Appendix A; it should be clear that they satisfy the constraints described above. Here we describe in more detail the MutateEdge operator, which is the mutation operator primarily responsible for the topological changes to individual solutions.
The MutateEdge operator is shown in Fig. 3. It works by first picking an edge to mutate at random using the pick edge rule, marking that edge red, its source blue and its target red. Then mark output is applied as long as possible, marking blue every node for which there is a directed path to the source of the edge we wish to mutate. The rule mutate edge can then be safely applied to redirect the edge to target some unmarked node (chosen at random); this cannot introduce a cycle as the new target is unmarked and therefore does not have a directed path to the existing source of the mutating edge. Finally unmark is applied as long as possible to return the graph to an unmarked state. This P-GP 2 program uses a uniform random distribution to choose the edge to mutate and a uniform distribution over all possible edge mutations that preserve acyclicity. It clearly respects the other constraints mentioned above, as it does not relabel any nodes or change the number of outgoing edges of any node. In [1] we argue that this edge mutation generalises the order-preserving mutations of CGP and offers additional possible mutations. A visual step-by-step execution of this mutation operator is shown in Figure 4.
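The logic of MutateEdge can be paraphrased in ordinary code. The following Python sketch is an approximation written for this explanation, not the P-GP 2 program of Fig. 3: the function mutate_edge, the edge-list representation and the is_output predicate are assumptions, and marks are replaced by a set of blocked nodes.

```python
import random

def mutate_edge(nodes, edges, is_output, rng):
    """Redirect one uniformly chosen edge without creating a cycle.

    edges is a list of (source, target) pairs. A node is blocked as a new
    target if it has a directed path to the source of the chosen edge
    (the nodes the graph program would mark blue), if it is the current
    target, or if it is an output node.
    """
    index = rng.randrange(len(edges))
    source, old_target = edges[index]
    blocked = {source}
    changed = True
    while changed:  # analogue of applying mark output as long as possible
        changed = False
        for s, t in edges:
            if t in blocked and s not in blocked:
                blocked.add(s)
                changed = True
    candidates = [v for v in nodes
                  if v not in blocked and v != old_target and not is_output(v)]
    if not candidates:
        return list(edges)  # this edge has no valid redirection
    mutated = list(edges)
    mutated[index] = (source, rng.choice(candidates))
    return mutated

nodes = [0, 1, 2, 3, 4]  # 0 is an input, 4 is the output in this toy graph
edges = [(4, 3), (3, 2), (3, 1), (2, 1), (2, 0), (1, 0), (1, 0)]
child = mutate_edge(nodes, edges, lambda v: v == 4, random.Random(0))
```

Because only the target of one edge changes, every node keeps its number of outgoing edges, which is why the arity constraint is untouched.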
In general, we use the 1 + λ evolutionary algorithm with EGGP. The 1 + λ algorithm has been used extensively with CGP, with favourable comparisons against large-population GP systems (see [22]). A comparative study of crossover in CGP [16] found that there is no currently known universal crossover operator for CGP and that 1 + λ is sometimes the best known approach for certain problems. Current advice [22,32] is to use 1 + λ as the 'standard' CGP approach. The comparative study between EGGP and CGP [1] exclusively used the 1 + λ strategy, with EGGP performing favourably on many digital circuit benchmark problems. In combination, these points appear to justify the exclusive use of 1 + λ with EGGP in our study. Additionally, the use of 1 + λ has the added effect of 'isolating' our notion of semantic neutral drift, in that we can apply logical equivalence laws to the single surviving individual in each generation knowing that their application is not disrupting other processes, e.g. crossover or non-elitist selection.
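A minimal Python sketch of a minimising 1 + λ loop follows, assuming that a fitness-equal child replaces its parent (the neutral-drift acceptance rule described for CGP) and that a drift step is applied to the survivor once per generation. The function one_plus_lambda, its parameters and the bit-string toy problem are hypothetical and stand in for the graph-based operators.

```python
import random

def one_plus_lambda(parent, fitness, mutate, drift, lam, generations, rng):
    """Minimising 1 + lambda strategy with neutral acceptance.

    `drift` is assumed to preserve fitness, so the stored fitness of the
    survivor is not recomputed after it is applied.
    """
    parent_fit = fitness(parent)
    for _ in range(generations):
        if parent_fit == 0:
            break
        scored = [(fitness(c), c)
                  for c in (mutate(parent, rng) for _ in range(lam))]
        best_fit = min(f for f, _ in scored)
        best = next(c for f, c in scored if f == best_fit)
        if best_fit <= parent_fit:  # '<=' rather than '<' lets equal children replace the parent
            parent, parent_fit = best, best_fit
        parent = drift(parent, rng)
    return parent, parent_fit

# Toy stand-ins: minimise the number of ones in a bit tuple.
def count_ones(bits):
    return sum(bits)

def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

def rotate(bits, rng):
    return bits[1:] + bits[:1]  # changes the individual, never its fitness

best, best_fit = one_plus_lambda((1, 0, 1, 1, 0, 1), count_ones, flip_one,
                                 rotate, lam=4, generations=200,
                                 rng=random.Random(0))
```

With a single survivor there is no crossover or non-elitist selection for the drift step to interfere with, which is the isolation property noted above.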

The Concept
Semantic Neutral Drift (SND) is the augmentation of a GP system with semantics-preserving mutations. These mutations are added to the standard mutation and crossover operators, which are intended to introduce variation to search. In this section we refer to mutation operators and individuals generally, not just our specific operation. For individual solutions i, j and mutation operator m, we write $i \rightarrow_m j$ to mean that j can be generated from i by using mutation m. A semantics-preserving mutation is one that guarantees that the semantic meaning of a child generated by that mutation is identical to that of its parent, for any choice of parents and a given semantic model. This definition is adequate for our domain of GP, where there is no distinction between the genotype and phenotype.
Fig. 3 A P-GP 2 edge mutation MutateEdge for digital circuits. This edge mutation preserves acyclicity. The rule pick edge is used to probabilistically choose an edge to mutate. Then mark output is applied as long as possible, marking blue every node with a path to the source of the edge we wish to mutate. mutate edge can then be applied safely, redirecting the edge to target some unmarked node which does not have a path to the source of the mutating edge. Finally unmark is applied as long as possible to return the graph to an unmarked state.
For our digital circuits case study, this semantic equivalence is well-defined: two circuits are semantically equivalent if they describe identical truth tables. Therefore, semantics-preserving mutations in this context are ones which preserve an individual's truth table. As we will be evaluating individuals by the number of incorrect bits in their truth tables, there may be individuals with equivalent fitness but different truth tables. Therefore, semantic equivalence is distinct from, but related to, fitness equivalence. Additionally, semantics-preserving mutations do not necessarily induce neutral drift. In the circumstance that a fitness function considers more than the semantics of an individual, there is no guarantee that the child of a parent generated by a semantics-preserving mutation has equal fitness to its parent. For example, if a fitness function penalized the size of an individual, a semantics-preserving mutation which introduces additional material (e.g. increases size) would generate children less fit than their parents under this measure.
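The difference between semantic equivalence and fitness equivalence can be made concrete with a small Python sketch. It assumes circuits are modelled as plain Python functions and takes 2-input XOR as a hypothetical target; the helper names truth_table and fitness are introduced here for illustration.

```python
from itertools import product

def truth_table(circuit, n_inputs):
    """Output bit of `circuit` for every input row, in lexicographic order."""
    return tuple(int(bool(circuit(*bits)))
                 for bits in product((0, 1), repeat=n_inputs))

def fitness(circuit, target, n_inputs):
    """Number of incorrect bits in the truth table (to be minimised)."""
    return sum(a != b for a, b in zip(truth_table(circuit, n_inputs), target))

target = truth_table(lambda a, b: a != b, 2)  # hypothetical target: XOR

or_gate = lambda a, b: a or b
nand_gate = lambda a, b: not (a and b)
nand_rewritten = lambda a, b: (not a) or (not b)  # DeMorgan rewrite of NAND

same_fitness = fitness(or_gate, target, 2) == fitness(nand_gate, target, 2)
same_semantics = truth_table(or_gate, 2) == truth_table(nand_gate, 2)
rewrite_preserves = truth_table(nand_gate, 2) == truth_table(nand_rewritten, 2)
```

Here OR and NAND each get one bit wrong against XOR, so they are fitness-equivalent without being semantically equivalent, whereas the DeMorgan rewrite of NAND has an identical truth table and therefore identical fitness.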
We identify a special class of fitness functions, where fitness depends only on semantics, and so where semantics-preserving mutations are guaranteed to preserve fitness. In this circumstance, any use of semantics-preserving mutations is a deliberate, designed-in form of neutral drift. The fitness function in our case study is an example of this; the fitness of an individual depends only on its truth table. Formally we have the following: a set of semantics-preserving mutation operators M over search space S with respect to a fitness function f that considers only semantics guarantees that $\forall i, j \in S, m \in M : (i \rightarrow_m j) \Rightarrow f(i) = f(j)$.
Consider a GP run that has reached a local optimum; no available mutations or crossover operators offer positive improvements with respect to the fitness function. It may be the case that there is a solution elsewhere in the landscape that is equally fit as the best found solution but has a neighborhood with positive mutations available. By applying a semantics-preserving mutation to transform the best found solution into this other, semantically equivalent, solution, the evolutionary process gains access to this better neighborhood and can continue its search. Hence the proposed benefit of Semantic Neutral Drift is the same as conventional neutral drift: that by transforming discovered solutions we gain access to different parts of the landscape that may allow the population to escape local optima. The distinction here is that we are employing domain knowledge to deliberately preserve semantics, rather than accessing neutral drift as a byproduct of other evolutionary processes. The hypothesis we are investigating is that this deployment of domain knowledge yields more meaningful neutral mutations than simple rewrites of intronic code, and that this leads the evolutionary algorithm to more varied (and therefore useful) neighborhoods.
A simple visualization of Semantic Neutral Drift is given in Figure 5. Here the landscape exists in one dimension (the x-axis) with the fitness of individuals given on the y-axis. In this illustration, the individual has reached a local optimum; a semantics-preserving mutation then moves it to a different 'hill' from which it is able to reach the global optimum.
While our experiments will focus on the role of semantic neutral drift when evolving graphs with EGGP, we argue that the underlying concept is extendable to other GP systems. For example, Koza noted the possibility of applying DeMorgan's laws to GP trees [18, Ch. 6] which, if used in a continuous process rather than as a solution optimiser, would induce semantic neutral drift. It is also plausible to apply similar operators to CGP [24] representations, although the ordering imposed on the representation raises some technical difficulties with respect to where newly created nodes should be placed. The potential for Embedded CGP [38] to effectively grow and shrink the overall size of the genotype offers some hope in this direction.
Fig. 4 A step-by-step execution of the edge mutation operator given in Figure 3. For visual simplicity, node labels have been omitted. The edge e is mutated to target some randomly chosen unmarked (non-output) node, preserving acyclicity. The new target has been marked with a star for visual clarity. Finally, all marks are removed.

Designing Semantic Neutral Drift
We extend EGGP by applying semantics-preserving mutations to members of the population each generation.
Here we focus on digital circuits as a case study, and design mutations which modify the active components of the individual by exploiting domain knowledge of logical equivalence.
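For the function set {AND, OR, NOT} there are a number of known logical equivalences. Here we use DeMorgan's laws:
DeMorgan F1: $\neg(a \wedge b) \rightarrow \neg a \vee \neg b$
DeMorgan F2: $\neg(a \vee b) \rightarrow \neg a \wedge \neg b$
DeMorgan R1: $\neg a \vee \neg b \rightarrow \neg(a \wedge b)$
DeMorgan R2: $\neg a \wedge \neg b \rightarrow \neg(a \vee b)$
and the identity and double negation laws.
Here we investigate different subsets of these semantics-preserving rules. We encode them as graph transformation rules to apply to the active component of an individual. In the context of the 1 + λ evolutionary algorithm, we apply one of the rules from the subset to the surviving individual of each generation.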
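That a DeMorgan rewrite leaves the truth table untouched can be confirmed exhaustively. The Python sketch below works on expression trees encoded as nested tuples, which is a simplification assumed here: it ignores the sharing present in graph individuals, and the helpers evaluate, demorgan_forward and equivalent are hypothetical names rather than part of EGGP.

```python
from itertools import product

def evaluate(expr, env):
    """Evaluate a nested-tuple expression; leaves are variable names."""
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    values = [evaluate(a, env) for a in args]
    if op == "NOT":
        return not values[0]
    if op == "AND":
        return values[0] and values[1]
    if op == "OR":
        return values[0] or values[1]
    raise ValueError(op)

def demorgan_forward(expr):
    """Rewrite NOT(AND(a, b)) or NOT(OR(a, b)) at the root, if it matches."""
    if isinstance(expr, tuple) and expr[0] == "NOT":
        inner = expr[1]
        if isinstance(inner, tuple) and inner[0] in ("AND", "OR"):
            dual = "OR" if inner[0] == "AND" else "AND"
            return (dual, ("NOT", inner[1]), ("NOT", inner[2]))
    return expr  # no match: the expression is returned unchanged

def equivalent(e1, e2, names):
    """True if both expressions agree on every assignment to `names`."""
    return all(
        evaluate(e1, dict(zip(names, bits))) == evaluate(e2, dict(zip(names, bits)))
        for bits in product((False, True), repeat=len(names))
    )

original = ("NOT", ("AND", "a", ("OR", "b", "c")))
rewritten = demorgan_forward(original)
```

The rewritten tree is larger than the original, which illustrates why such mutations preserve semantics but not size.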
Encoding these semantics-preserving rules is non-trivial for our individuals as they incorporate sharing: multiple nodes may use the same node as an input, and therefore rewriting or removing that node, e.g. as part of DeMorgan's laws, may disrupt the semantics elsewhere in the individual. To overcome this, we need a more sophisticated rewriting program. The graph program in Fig. 6 is designed for the logical equivalence laws DeMorgan F1|F2 and DeMorgan R1|R2; analogous programs are used for other operators.
The program Main in Fig. 6 works as follows.
{mark out, mark active}!: Mark all active nodes with the given rule-set applied as long as possible. Once this rule-set has no matches, all inactive nodes must be unmarked: these are 'neutral' nodes that do not contribute to the semantics of the individual.
mark neutral!: Mark these neutral nodes grey with the rule applied as long as possible. We can then rewrite the individual while preserving semantics with respect to shared nodes by incorporating neutral nodes into the active component rather than overwriting existing nodes.
try [demorgan f1, demorgan f2, demorgan r1, demorgan r2]: pick some rule with uniform probability from the subset of the listed rules that have valid matches. When a rule has been chosen, a match is chosen for it from the set of all possible matches with uniform probability. The probabilistic rule-set call is surrounded by a try statement to catch the fail case that none of the rules have matches.
In Fig. 6 we show one of the 4 referenced rules, demorgan f1, which corresponds to the logical equivalence law DeMorgan F1; the others may be given analogously. On the left-hand side is a match for the pattern ¬(a ∧ b) in the active component and 2 neutral nodes. If the matched pattern were directly transformed, any nodes sharing use of the matches for node 2 or node 3 could have their semantics disrupted. Instead, the right-hand side of demorgan f1 changes the syntax of node 1 to correspond to ¬a ∨ ¬b by absorbing the matched neutral nodes (preserving its semantics) without rewriting nodes 1 or 2 and disrupting their semantics. Nodes 3 and 4 are marked green and their newly created outgoing edges are marked red. These marks are used later in the program to clean up any previously existing outgoing edges they have to other parts of the graph.
remove edge: once a semantics-preserving rule has been applied, remove edge is applied as long as possible to remove the other outgoing edges of the green-marked absorbed nodes.
unmark edge!; unmark node!: return the graph to an unmarked state, where nodes and edges with any mark (indicated by magenta edges and nodes in the rules) have their marks removed.
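The idea of absorbing neutral nodes instead of overwriting shared ones can be sketched in Python. This is a simplified, deterministic illustration and not the graph program of Fig. 6: the dictionary encoding (node to (operation, inputs)), the function names, and the choice of the first two neutral nodes rather than a uniformly random match are all assumptions made here. Overwriting a neutral node's input list plays the role of removing its old outgoing edges.

```python
from itertools import product

def active_nodes(graph, outputs):
    """Nodes reachable from the outputs; every other node is neutral."""
    seen, stack = set(), list(outputs)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph[v][1])
    return seen

def demorgan_f1(graph, outputs, n):
    """Rewrite active node n = NOT(AND(a, b)) into OR(NOT a, NOT b) in place.

    The AND node is left untouched because other nodes may share it; two
    neutral function nodes are absorbed to hold NOT a and NOT b instead.
    Returns False, leaving the graph unchanged, when there is no match.
    """
    op, args = graph[n]
    if op != "NOT" or graph[args[0]][0] != "AND":
        return False
    active = active_nodes(graph, outputs)
    neutral = [v for v in graph if v not in active and graph[v][0] != "IN"]
    if n not in active or len(neutral) < 2:
        return False
    a, b = graph[args[0]][1]
    p, q = neutral[:2]
    graph[p] = ("NOT", [a])
    graph[q] = ("NOT", [b])
    graph[n] = ("OR", [p, q])
    return True

def evaluate(graph, v, env):
    op, args = graph[v]
    if op == "IN":
        return env[v]
    vals = [evaluate(graph, u, env) for u in args]
    if op == "NOT":
        return not vals[0]
    return (vals[0] and vals[1]) if op == "AND" else (vals[0] or vals[1])

def truth_table(graph, outputs, inputs):
    return [tuple(evaluate(graph, o, dict(zip(inputs, bits))) for o in outputs)
            for bits in product((False, True), repeat=len(inputs))]

graph = {
    "x": ("IN", []), "y": ("IN", []),
    "g": ("AND", ["x", "y"]),    # shared: input of n and also an output
    "n": ("NOT", ["g"]),
    "s1": ("OR", ["x", "x"]),    # neutral
    "s2": ("AND", ["y", "s1"]),  # neutral
}
outputs = ["n", "g"]
before = truth_table(graph, outputs, ["x", "y"])
applied = demorgan_f1(graph, outputs, "n")
after = truth_table(graph, outputs, ["x", "y"])
```

In this toy graph the shared AND node g keeps its definition, so the second output is unaffected while the first is re-expressed through the two absorbed nodes.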
This program highlights the helpfulness of graph programming for this task. The probabilistic application of complex transformations, such as DeMorgan's law, to only the active components of a graph-like program with sharing is non-trivial, but can be concisely described by a graph program.

Variations on our approach
We identify 3 sets of logical equivalence rules to study, alongside another example of semantics-preserving transformation taken from term-rewriting theory. These sets are detailed in Table 1. The first 3 sets comprise the logical equivalence laws already discussed. The last, CC, refers to collapsing and copying from term graph rewriting (see [11]). Collapsing is the process of merging semantically equivalent subgraphs, and copying is the process of duplicating a subgraph. The rules collapse 2 and copy 2 are shown in Fig. 7. These collapse and copy, respectively, function nodes of arity 2 without garbage collection. We only require rules for arity 1 and arity 2 as the functions in our experiments have arity at most 2. This final set is included for several reasons: it takes a different form from the domain-specific logical equivalence laws in the other 3 sets; it allows us to investigate if the apparent overlap between term-graph rewriting and evolutionary algorithms bears fruit; and it appears to resemble gene duplication, which is a natural biological process believed to aid evolution [40].

Experimental Setup
To evaluate our approach, we study the same digital circuit benchmark problems as in [1], listed in Table 2. We perform 100 runs of each of our 4 neutral drift sets (Table 1) on each problem (Table 2). We use the 1 + λ evolutionary algorithm with λ = 4. We use a mutation rate of 0.01 and fix all individuals to use 100 function nodes. The fitness function used is the number of incorrect bits in an individual's truth table compared to the target truth table, hence we are minimizing the fitness. We are able to achieve 100% success rate in finding global optima in our evolutionary runs, so we compare the number of evaluations required to find perfect fitness.
The function set used here is {AND, OR, NOT}, rather than the set {AND, OR, NAND, NOR} used in [1] and [22, Ch. 2]. Our function set is chosen to directly correspond to the logical equivalence laws used. To give context to the results in Section 7, and to highlight that the chosen function set is the harder of the two, we run EGGP with both function sets and detail the results in Table 3. For additional context, the comparative study in [1] has shown EGGP to perform favourably in comparison to CGP on these problems with the {AND, OR, NAND, NOR} function set.
We use a two-tailed Mann-Whitney U test [21] to establish a statistically significant difference between the median number of evaluations using the two different function sets. When a result is statistically significant (p < 0.05) we also use a Vargha-Delaney A test [36] to measure the effect size. On every problem, using {AND, OR, NOT} takes significantly (p < 0.05) more effort (in terms of evaluations) than when using {AND, OR, NAND, NOR}. This justifies our assertion that the former function set is 'harder' to evolve.
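As a rough illustration of the effect size measure, the Vargha-Delaney A statistic for two samples can be computed directly from its definition, A = P(X > Y) + 0.5 P(X = Y). The sketch below uses made-up numbers, not results from this work. The p values themselves would come from a Mann-Whitney U implementation, for example `scipy.stats.mannwhitneyu` with `alternative="two-sided"`, which is one possible choice rather than the tool used here.

```python
def vargha_delaney_a(xs, ys):
    """Probability that a value drawn from xs exceeds one drawn from ys, ties counting half."""
    greater = sum(x > y for x in xs for y in ys)
    equal = sum(x == y for x in xs for y in ys)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))

# Illustrative samples only (not data from the experiments).
a_value = vargha_delaney_a([3, 4, 5], [1, 2, 3])
```

A value of 0.5 indicates no effect, and the tables in this work treat A > 0.71 as a large effect. The Mann-Whitney U statistic of the first sample is related to A by U = A · m · n for sample sizes m and n.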

Results
The results from our experiments are given in Table 4. Each neutral rule-set is listed with the median evaluations (ME) required to solve each benchmark problem.
We use a two-tailed Mann-Whitney U test to demonstrate statistical significance in the difference of the median evaluations for these runs and the unmodified EGGP results given in Table 3.

Fig. 7 The rules copy 2 and collapse 2. The rule copy 2 matches a 2-arity function node that is shared by 2 active nodes and absorbs a neutral node to effectively copy that 2-arity function node and redirect one of the original node's shared incoming edges to that copy. The rule collapse 2 attempts the reverse of copy 2 by matching 2 active identical 2-arity function nodes and redirecting one of those nodes' incoming edges to the other. The node which has lost an incoming edge, if it was shared by no other nodes, may now become neutral.

Table 2 Digital Circuit benchmark problems. Columns: Digital Circuit, No. Inputs, No. Outputs.

For most problems and neutral rule-sets, the inclusion of semantic neutral drift yields statistically significant improvements in performance. There are some exceptions: for the 4×1-bit comparator (COMP) problem, the inclusion of neutral rule-sets leads either to insignificant differences or to significantly worse performance for every rule-set except ID, which performs significantly better. The DeMorgan's rule-set (DM) and Copy/Collapse rule-set (CC) appear to yield the smallest benefit, finding significant improvement on only 8 and 9 of the 13 benchmark problems respectively. Additionally, both of these rule-sets yield significantly worse performance for the 4×1-bit comparator (COMP) problem. The DeMorgan's and Negation rule-set (DMN) offers the best performance on the 2-bit and 3-bit adder problems (2-Add and 3-Add), in terms of median evaluations, p value and effect size. The Identity rule-set (ID) achieves the best performance on the 2-bit and 3-bit multiplier problems (2-Mul and 3-Mul) but fails to achieve significant improvements on the 3:8-bit demultiplexer problem (DeMux).
Our results show that, for some problems and certain neutral rule-sets, the inclusion of neutral drift may improve performance with respect to the effort (measured by the number of evaluations) required. Additionally, they offer strong evidence for the claim that there are some neutral rule-sets which may generally improve performance for a wide range of problems, particularly evidenced by the DMN and ID rule-sets.
We identify DMN and ID as the best performing rule-sets; each of these yields significant improvements in performance across all but one problem (the exceptions being Comp and DeMux, respectively), and on the single problem that each fails to improve upon, its inclusion does not lead to significant detriment in performance. For this reason, these rule-sets are the subject of further analysis in Section 8.

Neutral Drift or Neutral Growth?
Analysis of the runtime of EGGP augmented with the DMN and ID neutral rule-sets reveals their bias towards searching the space of larger solutions. When we refer to larger solutions, given that EGGP uses fixed-size representations, we refer to the proportion of the individual graph which is active, defined by the number of nodes to which there is a path from an output node. We demonstrate this with the results given in Table 5. Here, we measure the average (mean) size of the single surviving member throughout evolutionary runs on the 3-Add and Comp problems and give the median and interquartile range of these average sizes over 100 runs. The size of an individual is the number of active function nodes (those which are reachable from output nodes) contained within it. We give these values for DMN, ID and EGGP alone. We use a two-tailed Mann-Whitney U test to measure for statistical differences between these observations. On both problems, DMN has a higher median average size (MAS) than both ID and EGGP alone (p < 0.05) and ID also has a higher MAS than EGGP alone (p < 0.05).
This observation challenges existing ideas that increasing the proportion of inactive code aids evolution [23]. We are able to achieve improvements in performance while effectively reducing the proportion of inactive code. It may be the case that high proportions of inactive code are helpful only when other forms of neutral drift are not available.
The result that DMN and ID increase the active size of individuals initially appears to challenge our hypothesis that it is semantic neutral drift that aids evolution. An alternative explanation could be that it is 'neutral growth', where our neutral rule-sets increase the size of individuals, that biases search towards larger solutions, which then happen to be better candidates for the problems we study. However, the CC neutral rule-set exclusively features neutral growth and neutral shrinkage, exploiting no domain knowledge beyond the notion that identical nodes in identical circumstances perform the same functionality, and featuring no meaningful semantic rewriting. We therefore compare how CC and DMN perform with different numbers of nodes available, to determine whether larger solutions are indeed better candidates for the studied problems.
We run DMN, CC and standard EGGP on the 2-Add, 3-Add and Comp problems, with fixed representation sizes of 50, 100 and 150 nodes. If it is the case that larger solutions are better candidates, and that our neutral rule-sets bias towards neutral growth, then we would expect to see degradation of performance (more evaluations needed) with a size of 50, and improvements (fewer evaluations needed) with a size of 150, over a baseline size of 100.
The results of these runs are shown in Fig. 8. For 2-Add and 3-Add with the DMN neutral rule-set, performance actually degrades when increasing the fixed size from 100 to 150, while remaining relatively similar when decreasing the size to 50. For EGGP alone and for the CC neutral rule-set, performance remains relatively similar when increasing the fixed size from 100 to 150, but degrades when decreasing the size to 50. These observations imply that the DMN rule-set is not simply growing solutions to a more beneficial search space, since it performs better when limited to a smaller space. Therefore, on these problems, there is some other property of the DMN rule-set that is benefiting performance.
For the Comp problem, trends remain similar for EGGP alone and the CC neutral rule-set. However, the performance of the DMN rule-set degrades when the fixed size is decreased from 100 to 50. This suggests that the Comp problem is in some way different from the other problems. Further, when DMN is run on the Comp problem, the average proportion of active code is nearly 100%. This may offer an explanation as to why the DMN rule-set struggles to outperform standard EGGP on the Comp problem, which has more than twice as many outputs (18) as the next nearest problem (8, DeMux). DMN's bias towards growth paired with the high number of outputs may give some of the problem's many outputs little room to change and configure to a correct solution.
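The size measure used above (the number of function nodes reachable from the outputs) can be sketched as a simple graph traversal. The dictionary encoding and the names below are assumptions for illustration only.

```python
def active_nodes(circuit, output_targets):
    """Function nodes reachable from the outputs by following argument edges."""
    seen = set()
    stack = list(output_targets)
    while stack:
        node = stack.pop()
        # Primary inputs are not keys of `circuit`, so they are never counted.
        if node in circuit and node not in seen:
            seen.add(node)
            stack.extend(circuit[node][1])
    return seen

# Hypothetical individual: node id -> (op, [argument ids]); n3 is neutral.
circuit = {"n1": ("AND", ["i0", "i1"]),
           "n2": ("NOT", ["n1"]),
           "n3": ("OR", ["i0", "i1"])}
active = active_nodes(circuit, ["n2"])
active_proportion = len(active) / len(circuit)
```

Averaging `len(active)` for the surviving individual over the generations of a run gives the per-run average size summarised in Table 5.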

DMN and ID in Combination
We investigate the effect of using DMN and ID, our two best performing neutral rule-sets, in combination. This combined set, which we refer to as DMID, consists of the following logical equivalence laws: ID-AND_F, ID-AND_R, ID-OR_F, ID-OR_R, ID-NOT_F and ID-NOT_R.
We use this set under the same experimental conditions described in Section 6 to produce the results given in Table 6. In Table 6 we provide p and A values in comparison to the DMN and ID results in Table 4 and the EGGP results in Table 3.
The DMID rule-set significantly outperforms DMN on 7 of the 12 problems, and shows no significant difference for the other 5 problems. DMID significantly outperforms ID on 5 problems (notably the n-Bit Adder problems), shows no significant difference on 3 problems, and is significantly outperformed by ID on 4 problems (notably the 3-Mul, Comp and 7-EP). DMID significantly outperforms EGGP without neutral rule-sets on all but 1 problem, with the exception being the Comp problem that DMN also fails to find significant benefits on. These results position DMID and ID on a Pareto front of studied problems, with DMID effectively dominating DMN but neither DMID nor ID universally outperforming each other.

{AND, OR, NOT}: A Harder Function Set?
In Table 3 we show that solving problems with the function set {AND, OR, NOT} is significantly more difficult than when using the function set {AND, OR, NAND, NOR}.
We justify using the former function set over the latter in our experiments as it lends itself to known logical equivalence laws despite costing performance. When we introduce these logical equivalence laws to the evolutionary process with the {AND, OR, NOT} function set, this 'cost' no longer universally holds. We identify 3-Add, 3-Mul, Comp and 7-EP as the 4 hardest problems, based on the median number of evaluations required to solve them (Table 3). EGGP with the {AND, OR, NOT} function set and augmented with the DMID neutral rule-set significantly (p < 0.05) outperforms EGGP with the {AND, OR, NAND, NOR} function set on two of these problems.
Fig. 9 shows the number of evaluations across 100 runs for the 3-Mul and Comp problems, for (A) EGGP with the {AND, OR, NOT} function set and augmented with the DMID neutral rule-set and (B) EGGP with the {AND, OR, NAND, NOR} function set. Here the difference in medians and interquartile ranges for these two evolutionary algorithms can be clearly seen: for the 3-Mul problem, EGGP with the DMID neutral rule-set requires a median number of evaluations outside the interquartile range of EGGP with the {AND, OR, NAND, NOR} function set. In stark contrast, for the Comp problem the third quartile of evaluations required by EGGP with the {AND, OR, NAND, NOR} function set lies below the first quartile of EGGP with the DMID neutral rule-set.
This offers an interesting secondary result: there are circumstances and problems where it may be beneficial to choose representations that on their own would yield detrimental results, if that decision then facilitates the inclusion of semantic neutral drift, which may in combination provide enhanced performance over the original representation.

Conclusions and Future Work
We have investigated the augmentation of a genetic programming system for learning digital circuits with semantic neutral drift. From our experimental results, we can draw a number of conclusions both for our own specific setting and for the broader evolutionary community.
Firstly, we offer further evidence that there are circumstances where neutral drift aids evolution, building upon existing works that offer evidence in this direction. Additionally, the precise nature of our neutral drift by design offers evidence that neutral drift on the active component of individuals, rather than the intronic components, can aid evolution. For every benchmark problem studied, at least one neutral rule-set was able to yield significant improvements in performance.
Secondly, we have shown that by using graphs as a representation and graph programming as a medium for mutation, it is possible to directly inject domain knowledge into an evolutionary system to improve performance. The application of DeMorgan's logical equivalence laws to graphs with sharing is non-trivial, but becomes immediately accessible in our graph evolution framework. Our ability to design complex domain-specific mutation operators supports the view that the choice of representation of individuals in an evolutionary algorithm matters. This injection of domain knowledge has been shown to offer benefits beyond simple 'neutral growth'.
Thirdly, while the approach we have proposed here offers promising results, the specific design of neutral drift matters. There are neutral rule-sets that appear to dominate each other, as is found comparing the DMID rule-set to the DMN rule-set. There are also neutral rule-sets which outperform each other on different problems, as is demonstrated comparing the DMID rule-set to the ID rule-set. As we highlighted in comparing DMID to EGGP with what initially appeared to be a preferable function set, there are circumstances where a GP practitioner may want to deliberately degrade the representation in order to access beneficial neutral drift techniques. There are also other circumstances where the cost of incorporating these techniques may outweigh their immediate benefits.
There are a number of immediate extensions to our work that we believe should be investigated. Firstly, the use of the complete function set {AND, OR, NAND, NOR, NOT} alongside the DMID semantics preserving mutations and additional mutations for converting between AND and OR gates and their negations via NOT should be investigated. It may be the case that this overall combination yields better results than either of the function sets and semantics preserving mutations we have covered in this work. Additionally, while semantics preserving mutations have generally improved performance with respect to the number of evaluations required to solve problems, it would be worthwhile to measure the clock-time cost of executing these transformations in every generation. Then it would be possible to study the trade-off between gained efficiency and additional overhead. Future work should also investigate the potential use of our proposed approach in CGP and tree-based GP as discussed in Section 5.1.
While we do not address theoretical aspects of SND here, it may be possible to prove convergence of evolutionary algorithms equipped with SND under certain properties, such as the completeness of the semantics preserving mutations used with respect to equivalence classes.
There are a number of application domains to investigate for future work: hard search problems where individual solutions may be represented by graphs and where there are known semantics-preserving laws.A primary candidate is the evolution of Bayesian Network topologies, a well-studied field [19], as there are known equivalence classes for Bayesian Network topologies [5].A secondary candidate is learning quantum algorithms using the ZX-calculus, which represents quantum computations as graphs [6], and is equipped with graphical equivalence laws that preserve semantics.

A.1 InitCircuit
The program InitCircuit in Figure 10 generates EGGP individuals for the digital circuit problems described in this work. This program is defined abstractly, for some function set F. The actual form of the first rule-set call is instantiated for a specific function set F where each function fx has a corresponding version of the rule add_node_fx shown in Figure 11.
This program expects a problem-specific variant of the graph given in Figure 12, where there are i input nodes and o output nodes and the blue node is labelled with n where n is an integer representing the number of nodes generated individuals should contain.

Fig. 11 A P-GP 2 rule for adding a node of some function fx (rule condition: where n > 0). For the label of node 2 on the right-hand-side and a specific function fx, a unique string representation of fx replaces '[fx]' and the arity of fx replaces '[ax]'. The blue marked node counter is decreased, and the created function node is marked red so that its edges can be inserted.

Fig. 12 The initial graph to be used as input to the program in Figure 10. Applying the program InitCircuit to this graph will generate acyclic graphs with 3 inputs, 2 outputs and 100 function nodes.

Fig. 13 A P-GP 2 program MutateNode for mutating function nodes of digital circuits (a rule condition in the figure reads: where outdeg(1) > x). The program probabilistically applies a mutate node fx rule (see Figure 14) to mutate a node's function and mark that node red. In a similar manner to the edge mutation program in Figure 3, all nodes with a directed path to the mutating node are marked blue by mark output applied as long as possible. Then add edge and delete edge can be applied as long as possible to ensure that the node's outgoing edges respect its new function's arity. Additionally, the fact that all nodes which would introduce a cycle if targeted are now marked blue ensures that applying add edge cannot introduce a cycle. Finally unmark node is used to return the graph to an unmarked state.

Fig. 1 A GP 2 program computing the transitive closure of a graph.

(1) pick edge: An edge to mutate is chosen at random and marked (red) alongside its source node s (blue) and target node t (red).
(2) mark output!: Invalid candidate nodes for redirection are identified. If a node v has a directed path to s it is marked blue, as targeting it would introduce a cycle.
(3) mutate edge; unmark!:
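The edge mutation steps listed here can be approximated in Python as follows. This is a sketch under assumptions: the adjacency-list encoding and function names are hypothetical, and since the description of step (3) is not preserved above, the sketch assumes the chosen edge is redirected to a uniformly chosen unmarked node that is not an output.

```python
import random

def nodes_reaching(edges, s):
    """s together with every node that has a directed path to s (the 'blue' nodes)."""
    blue = {s}
    changed = True
    while changed:
        changed = False
        for v, targets in edges.items():
            if v not in blue and any(t in blue for t in targets):
                blue.add(v)
                changed = True
    return blue

def mutate_edge(edges, outputs, rng):
    """Redirect one randomly chosen edge without introducing a cycle."""
    # (1) pick an edge, identified by its source s and its position i.
    all_edges = [(s, i) for s, ts in edges.items() for i in range(len(ts))]
    s, i = rng.choice(all_edges)
    # (2) mark nodes that must not be targeted.
    blue = nodes_reaching(edges, s)
    candidates = [v for v in edges if v not in blue and v not in outputs]
    # (3) assumed behaviour: redirect to a uniformly chosen valid node.
    mutated = {v: list(ts) for v, ts in edges.items()}
    if candidates:
        mutated[s][i] = rng.choice(candidates)
    return mutated

# Hypothetical individual: each node lists the nodes its outgoing edges target.
edges = {"i0": [], "i1": [],
         "n1": ["i0", "i1"], "n2": ["n1", "i0"], "n3": ["n2", "n1"],
         "o0": ["n3"]}
outputs = {"o0"}
mutated = mutate_edge(edges, outputs, random.Random(1))
```

Because the new target has no directed path back to s, the redirected edge cannot close a cycle, which is the purpose of the marking in step (2).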

Fig. 5 A simple visualization of Semantic Neutral Drift. Individuals exist in one dimension along the x-axis. On the y-axis, each individual has an associated fitness. Normal mutations (black arrows) allow the evolutionary algorithm to hill-climb by sampling from adjacent points. A semantics preserving mutation (red arrow) allows the EA to leave a local optimum to move to a different slope where it can then climb to the global optimum.

Fig. 6 A P-GP 2 program for performing semantics preserving mutations to digital circuits.

Fig. 8 Results of running DMN, CC and EGGP on (A) 2-Add, (B) 3-Add and (C) Comp problems. The y axis gives the median evaluations required to solve each problem across 100 runs. The x axis groups setups by algorithm and then lists the observed median evaluations when running that algorithm with 50, 100 or 150 nodes as the fixed representation size.

Fig. 9 Box-plots showing observed evaluations required to solve (A) 3-Bit Multiplier and (B) 4×1-Bit Comparator problems using EGGP augmented with the DMID neutral rule-set (DMID) and EGGP with the {AND, OR, NAND, NOR} function set (AONN). Vertical jitter is included for visual clarity.

Fig. 10 A P-GP 2 program InitCircuit for generating digital circuits. The program repeatedly probabilistically applies an add node fx rule (see Figure 11) as long as possible, probabilistically connecting each newly added function node to the existing graph with the connect node rule until the node's function arity is satisfied. Once the add node rules are no longer applicable, the connect output rule is applied as long as possible to connect the outputs to the rest of the graph. Finally remove counter cleans the graph up, removing the blue marked counter node. The generated graph must be acyclic, as edges are only created outgoing from nodes with no incoming edges.

Fig. 14 A generic P-GP 2 rule for mutating a function node to some function fx.For the label of node 1 on the right-hand-side and a specific function fx, a unique string representation of fx replaces '[fx]' and the arity of fx replaces '[ax]'.
include the sequential composition P ; Q of programs P and Q, and the branching construct try T then P else Q. To execute the latter, test T is executed on the host graph G and if this results in some graph H, program P is executed on H. If T fails (because a rule or set of rules cannot be matched), program Q is executed on G. The variant try T of this construct executes T on G and if this results in graph H, returns H. If the execution fails, G is returned unmodified.
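The control constructs described in this passage can be modelled in Python with programs as functions from a graph to either a new graph or None (failure). This is a deterministic sketch for intuition only, not the GP 2 semantics: rule matching in GP 2 is nondeterministic, the host graph is reduced here to a frozenset of labels, and all names are hypothetical. The `as_long_as_possible` helper reflects the "applied as long as possible" wording used in the figure captions.

```python
def seq(p, q):
    """P ; Q: run P, then Q on its result; fails if either fails."""
    def run(g):
        h = p(g)
        return None if h is None else q(h)
    return run

def try_then_else(t, p, q):
    """try T then P else Q: P runs on T's result H, Q runs on the original G."""
    def run(g):
        h = t(g)
        return p(h) if h is not None else q(g)
    return run

def try_(t):
    """try T: return T's result, or G unmodified if T fails."""
    def run(g):
        h = t(g)
        return h if h is not None else g
    return run

def as_long_as_possible(p):
    """Apply P repeatedly until it fails, returning the last graph obtained."""
    def run(g):
        while True:
            h = p(g)
            if h is None:
                return g
            g = h
    return run

# Toy rules over a frozenset of labels standing in for a host graph.
def remove(label):
    return lambda g: g - {label} if label in g else None

def add(label):
    return lambda g: g | {label}

program = try_then_else(remove("a"), add("x"), add("y"))
matched = program(frozenset("ab"))
unmatched = program(frozenset("b"))
```

Here `matched` takes the then-branch on the graph produced by the test, while `unmatched` takes the else-branch on the original graph.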

Table 1
The studied semantics preserving rule-sets.

Table 3
Baseline results from Digital Circuit benchmarks for EGGP on the {AND, OR, NOT} and {AND, OR, NAND, NOR} function sets. ME/IQR: the median/inter-quartile range of the number of evaluations used to solve the problem. The p value is from the two-tailed Mann-Whitney U test. Where p < 0.05, the effect size from the Vargha-Delaney A test is shown; large effect sizes (A > 0.71) are shown in bold.
10^−12 0.79 103,393 10^−5 0.68
3-Add 255,003 10^−19 0.87 186,647 10^−25 0.93 279,140 10^−18 0.86 592,815 0.09 -

Table 4
Results from Digital Circuit benchmarks for the various proposed neutral rule-sets. The p value is from the two-tailed Mann-Whitney U test. Where p < 0.05, the effect size from the Vargha-Delaney A test is shown; large effect sizes (A > 0.71) are shown in bold.

Table 5
Observed average solution size of the surviving population for the DMN rule-set, ID rule-set and EGGP without a neutral rule-set. Results are for the 3-Bit Adder (3-Add) and 4×1-Bit Comparator (Comp) problems. For each result, the Median Average Size (MAS) and Interquartile Range (IQR) are given. The p value is from the two-tailed Mann-Whitney U test.

Table 6
Results from Digital Circuit benchmarks for the DMID neutral rule-set. The p value is from the two-tailed Mann-Whitney U test. Where p < 0.05, the effect size from the Vargha-Delaney A test is shown; large effect sizes (A > 0.71) are shown in bold. Statistics are given in comparison to the DMN and ID neutral rule-sets and EGGP.