1 Introduction

An evening at the 2022 GECCO conference was devoted to celebrating the thirtieth anniversary of the publication of John Koza’s book “Genetic Programming: On the programming of computers by means of natural selection” [1].Footnote 1 Indeed that is the purpose of this special issue of Genetic Programming and Evolvable Machines. I hope to put my own spin on, and fill out, points raised in that panel discussion (which was recorded and is available onlineFootnote 2). I should stress that this is not a survey of GP and that many valuable contributions are omitted. Similarly, many digressions are placed in footnotes and there are hyperlinks to online articles in Wikipedia etc.

Fig. 1

Prof. Dr. Wolfgang Banzhaf holding his copy of “Genetic Programming: On the programming of computers by means of natural selection” (Jaws) 834 pages [1] at the GECCO 2022 celebration of 30 years after its publication (Wolfgang says he was told that his copy was the first one sold by the bookshop in Boston.)

Fig. 2

At 834 pages, the first genetic programming book [1] weighs in at 4lb 2oz

1.1 The book

Dr. Amy Brand, Director of The MIT Press, was clearly delighted that John Koza had chosen MIT Press to publish the first book on genetic programming [1] (see Figs. 1 and 2). She says it “was one of the seeds from which sprang a whole ecosystems of books and journals at the intersection of computer and biological sciences for the MIT Press.” She added that it “is still available and selling in print-on-demand. That’s quite solid for a specialized and ground-breaking work in computer science from 1992.”Footnote 3

John Koza said that the motivation for the book was that, in the preceding five years, his team had published GP solutions to 81 diverse problems common to artificial intelligence, machine learning and knowledge based systems. They had shown that instead of using a solution technique devoted to each benchmark, as had previously been done, a single evolutionary computing technique (now named Genetic ProgrammingFootnote 4) could solve them allFootnote 5Footnote 6. However, the GP solutions were published in widely dispersed conference venues. The goal of the book was to convince everyone that 1) a single technique could solve many diverse problems and 2) they could all be recast as the problem of searching for (and finding) a computer program. In contrast, previous solutions had often used (non-evolutionary) search with a representation, e.g. graph, grammar or network, often purpose built for each benchmark. The size of the bookFootnote 7 stems from the need to convince people that GP is a general solution. Everyone who first comes to programming knows that programming languages are exceedingly picky, insisting they get everything, every comma, every semicolon, in the right place: so how could random change stand a chance? Hence a substantial book, backed by a video, would be necessary to convince a skeptical public.Footnote 8

1.2 The man

John R. Koza was born in 1944 and did both his undergraduate degree and PhD at the University of Michigan in Ann Arbor, studying mathematics and the then newfangled computer science. He reports great interest in playing games, including computer games, with students and faculty, for example John H. Holland. As with John Holland’s other studentsFootnote 9, he was well versed in John Holland’s genetic algorithms.

1.3 The millionaire

John Koza graduated from the University of Michigan in December 1972 and, using his mathematical skills in combinatorics, probability and game playing, he joined a lottery company which printed games on paper which were sold at petrol stations and supermarkets. In 1974 he and a colleague formed their own company, Scientific Games Inc., to exploit John Koza’s invention of a secure way of printing scratch off lottery tickets. They successfully lobbied various US states to allow them to run the state’s lotteryFootnote 10. By 1978 the technology of printing had moved on and they jettisoned their own technique in favour of more flexible computer based printing. In 1987, having made his fortune, he returned to research.

1.4 The researcher

From about 1987 until 2005, John Koza devoted himself to research, applying genetic algorithms to the discovery of computer programs (GP). He published some 208 items, predominantly papers but also book chapters, technical reports, proceedings, etc. and of course Jaws [1] and the three follow-up door stoppers [4,5,6] and the four accompanying videos [7,8,9,10]. Initially the genetic programming systems were written in Lisp, although later implementations were in C, e.g. [22].

There were GP workshops associated with the International Conference on Genetic Algorithms, ICGA-93Footnote 11 and again in the summer of 1995 at ICGA-95 and ICML-95. In the fall, John R. Koza and Eric V. Siegel organised a GP event with the 1995 Fall Symposium of the AAAI at MIT. In 1994 Kim Kinnear had launched the “Advances in Genetic Programming” edited book series published by MIT Press [23,24,25]. But, since ICGA was a biennial conference, there was no ICGA conference in 1996, and instead it was the right time to launch the first GP conference [26]. One of the rules laid down at GP-96 was the absolute need for independent peer review.

July 1997 saw the return of ICGA-97, carefully scheduled a few days after the second GP conference, GP-97 [27], so attendance at both was encouraged. Again there was no ICGA in 1998; instead at GP-98 [28] there were serious discussions about combining the growing number of evolutionary computing conferences. John Koza in particular felt that the separate EC events were splitting EC into separate communities, that the balkanisation of EC did not make sense to people outside, particularly to funding bodies, and that this divergence was hurting the field. So at GP-98 there were negotiations about unifying, in particular, the Evolutionary Programming Society conference (EP), the IEEE’s WCCI/ICEC, GP, ICGA, and the International Workshop on Learning Classifier Systems (IWLCS). These were only partially successful, leading in 1999 to the formation of the duopoly of CEC 1999 [29] and GECCO 1999 [30]. Of the European evolutionary computing conferences, only the IEE’s Galesia elected to join CEC. PPSN,Footnote 12 ICANNGA and the newly established EuroGPFootnote 13 [31] continued as beforeFootnote 14.

Again John Koza’s organisational skills came to the fore, with him helping to draft the byelaws for GECCO. These ensure it has a federal “big tent” structure, whereby none of its constituent groups would feel left out or put down by the others.

Having progressed genetic programming to the point where it could be described as a routine invention machine [6, 32, 33], John Koza turned to public service and electoral reform and in 2006 founded National Popular Vote.

1.5 The public benefactor

In 2004 John Koza started the annual “Humies” awards for human-competitive results produced by genetic and evolutionary computation. He continues to fund the cash prizes. The finals are held each year as part of the GECCO conference.

Since 2016 he has endowed Michigan State University with the first chair in genetic programming in the United States (held by Prof. Dr. Wolfgang Banzhaf).

1.6 Pre-history

At GECCO-2022 the question of research before genetic programming was raised. John Koza pointed out that by 1987 the field of Genetic Algorithms was already well establishedFootnote 15. There had been early experiments on machine learning at Columbia [34] and Manchester [35]Footnote 16 universities. However, John Koza traced Evolutionary Computing back to Alan Turing. He said Turing’s 1948 paper on machine intelligence [36, 37] suggested three routes to machine intelligence: 1) knowledge based, 2) based on logic (as would be expected of a mathematician), but John Koza highlighted the third: 3) in which machine intelligence was based on evolution, although he pointed out that it did not use crossover (which was added by John Holland).

1.7 Advice for the future

Another question raised at GECCO-2022 was whether John Koza had advice for new researchers. His answer was that researchers must keep current, i.e., keep up to date with research, not just in their own area but with research in general. Take an interdisciplinary approach. He stressed being open to ideas from elsewhere, particularly from Biology.

John Koza’s heuristic (perhaps common to all John Holland’s students) was to ask himself “What would John Holland do?” to which the answer was often: John Holland would respond with his own question, “What does Nature do?” John Koza’s particular example was: how did Nature evolve from microscopic organisms (like bacteria), which have genes for creating maybe about 500 proteins, to multicellular organisms (e.g. us), which have genes for creating about 20 000 proteins? He reported asking this question around the Stanford School of Medicine.

The example John Koza quoted was the evolution of Myoglobin and Hemoglobin, which is thought to have occurred via gene duplication and subsequent specialisation. The idea being: “accidental” copying of parts of DNA sequences is common.Footnote 17 Once a species has two copies of a vital gene, it may be free to tinker with one. Since the other gene remains functional, the children with the duplicated gene remain viable and so some can survive long enough to carry both the working gene and the tinkered copy to the grandchildren. Over subsequent generations the two genes may diverge, allowing the species to find new proteins which may help it survive. Susumu Ohno in his 1970 book [40] suggested that such gene duplication is a powerful mechanism in natural evolution. Indeed John Koza used it as inspiration [41] for his architecture-altering operations. These GP operations allow not just the code within automatically defined functions (ADFs) [4] to evolve, but also their structure (e.g. which ADF calls which ADF) [5, 42, 9, minute 10]. In terms of traditional AI, this can be thought of as dividing the whole problem into subcases and having an evolvable representation which facilitates not just the solution of the sub-problems but also their subsequent combination into a complete solution. Some form (or indeed many forms) of automatic problem decomposition is essential if any AI technique is to scale.
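The gene duplication idea can be illustrated with a minimal sketch of duplicating an ADF. This is an illustration only, not John Koza’s implementation: the representation (nested tuples, a dictionary of ADFs), the naming scheme `ADF<n>`, the 50% retargeting probability and the function name `duplicate_adf` are all assumptions made here. Immediately after duplication the program behaves as before, but the two copies are now free to diverge under later mutation and crossover.

```python
import random

def duplicate_adf(individual, name, rng=random):
    """Return a copy of `individual` in which ADF `name` has been duplicated.

    Assumed (hypothetical) representation:
      individual = {"adfs": {adf_name: tree}, "main": tree}
    where a tree is a nested tuple such as ("ADF0", ("x",), ("y",)).
    """
    adfs = dict(individual["adfs"])
    new_name = f"ADF{len(adfs)}"      # assumes ADFs are named ADF0, ADF1, ...
    adfs[new_name] = adfs[name]       # tuples are immutable, so sharing is safe

    def retarget(tree):
        head, *args = tree
        if head == name and rng.random() < 0.5:
            head = new_name           # this call site now invokes the copy
        return (head, *map(retarget, args))

    return {"adfs": adfs, "main": retarget(individual["main"])}
```

Because the copy is initially identical to the original, redirecting calls to it does not change fitness, which mirrors the biological argument that offspring carrying a duplicated gene remain viable.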

John Koza felt that in the 1960s the University of Michigan had had a wide ranging curriculum. He said computer scientists need to know about biology, language processing, psychology, information theory, electronic circuits, etc. However, this breadth has been lost from modern computer science curricula. Instead people should seek ideas from many places. He cited successful start-ups in Silicon Valley, such as Adobe, which had come from co-working between two people with experience of newspaper publishing and another with a computer science background. Often in Silicon Valley success had come from partnerships of individuals with different experience. Alternatively, success may arise when different experience or many odd ideas are held by one person.

I would like to add: be ambitious in the problems you tackle. John Koza’s impact, the impact of his book [1], stems from showing something widely viewed as impossible could be done. Before his work, the idea of automatically evolving a computer program was clearly ludicrous. Similarly, the idea of a computer fixing computer bugs was clearly impossible, until Stephanie Forrest et al. showed GP could do it [43]. Readers may remember Lewis Carroll’s Alice and the White Queen [44] (Fig. 3): Alice reproaches the White Queen for some nonsense, saying it is clearly impossible, to which the White Queen responds that Alice should practice believing the impossible. My suggestion to an ambitious researcher would be that she should do the impossible. Claire Le Goues was a PhD student in 2009 [45, 46]. Fortunately her adviser did not tell her that her idea was impossible. And so she and the team are famous, not because they completely solved the problem, but because they took something impossible and partially solved it. So today the argument is not whether it can be done, but what is the best way [12] to solve the previously impossible problem [47,48,49].

Fig. 3

When I was your age I could think of six impossible things before breakfast

1.8 The ones that got away: missing gaps

John Koza was asked to muse on his less successful experiments. Two came to mind: FPGAs and GPUs.

1.8.1 Genetic programming and field programmable gate arrays, FPGAs

John Koza had hoped to create a field programmable gate array (FPGA) which had all the program operations likely to be useful pre-loaded. An ultra fast evolved GP program would then simply be an evolvable way of linking these together.

In some ways this seems similar to Juille’s [50] way of running a GP interpreter on the hugely parallel MasPar MP-2 computer. Although it had thousands of processing units, they each did the same one thing at the same time. Juille’s brainwave was to say: since computing is cheap, we will discard most of it. (Simplifying) Juille built a tiny interpreter which ran on all processing elements, executing one of a handful of GP operations at a time. The different members of the GP population were spread across the processing elements, each with its own program counter. If the interpreter was currently executing a GP op code that was not the one the GP individual wanted, that processing element did nothing but wait. However, the interpreter cycled round all possible GP op codes. When it did reach the desired op code, that processor executed it and moved that GP individual’s program counter on by one. (The right hand side of Fig. 4 shows the same idea in the context of GPUs.)
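The wait-for-your-op-code scheme can be sketched in a few lines. This is ordinary serial Python, not MasPar or GPU code, and it is a simplification under stated assumptions: programs are assumed to be non-empty, well formed lists of reverse Polish tokens, and the op code names and the function `simd_interpret` are hypothetical. The inner loop over individuals stands in for what the hardware would do in parallel.

```python
# Hypothetical op code set; each program is a list of these tokens in RPN order.
OPCODES = ["PUSH_X", "PUSH_1", "ADD", "MUL"]

def simd_interpret(programs, x):
    """Run all programs in lock step and return each program's final value."""
    pcs = [0] * len(programs)          # one program counter per individual
    stacks = [[] for _ in programs]    # one data stack per individual
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        for op in OPCODES:             # the interpreter cycles round every op code
            for i, prog in enumerate(programs):   # conceptually in parallel
                if pcs[i] >= len(prog) or prog[pcs[i]] != op:
                    continue           # not the op this individual wants: wait
                stack = stacks[i]
                if op == "PUSH_X":
                    stack.append(x)
                elif op == "PUSH_1":
                    stack.append(1.0)
                elif op == "ADD":
                    b, a = stack.pop(), stack.pop()
                    stack.append(a + b)
                elif op == "MUL":
                    b, a = stack.pop(), stack.pop()
                    stack.append(a * b)
                pcs[i] += 1            # move this individual's program counter on
    return [stack[-1] for stack in stacks]
```

For example, `simd_interpret([["PUSH_X", "PUSH_1", "ADD"], ["PUSH_X", "PUSH_X", "MUL"]], 3)` evaluates x+1 and x*x together. Most (individual, op code) pairs do nothing on any given step, which is exactly the computation that is deliberately discarded.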

It sounds hideously inefficient, but bear in mind that GP is getting useful work done, whereas most human programmers could not handle the MasPar MP-2’s SIMD architecture efficiently at all. Secondly, in many high performance computers (HPCs) the processing elements spend most of their time spinning in idle loops, waiting for data to arrive. This turns on its head our common conception of computers. In HPC (and indeed GPUs, see Sect. 1.8.2), computing is often cheap compared to moving data. Indeed sometimes it can be more efficient to compute a value a second time, rather than store it and retrieve it later when it is neededFootnote 18.

Fig. 4

Left: Avoid compilation overhead by interpreting GP trees. Run single SIMD interpreter on GPU’s stream processors (SP) on many trees. Right: Programs wait for the interpreter to offer an instruction they need evaluating. For example an addition. When the interpreter wants to do an addition, everyone in the whole population who is waiting for addition is evaluated. The operation is ignored by everyone else. The interpreter moves on to its next operation. The interpreter runs round its loop until the whole population has been interpreted. Fitness values can also be calculated in parallel

In many cases FPGAs form the bedrock of evolvable hardware (EHW) [51, 52]. As well as offering a flexible alternative to dedicated integrated circuits (also known as application-specific integrated circuits, ASICs), they can be cost effective, particularly when only a limited number of chips will be needed. There are several examples where FPGAs have been used to run GP, e.g. [53,54,55].

1.8.2 Genetic programming and graphics cards, GPUs

In the early 2000s it was noticed that the graphics cards (GPUs) used to drive computer screens were becoming increasingly powerful parallel computing devices in their own right and so people started using them for other things.

Initially GPUs were designed just to rapidly render images on the computer’s screen. To do this quickly (in real time) they comprised many parallel components all doing the same thing but for different parts of the screen. As the computer video games market took off, GPUs rapidly ramped up their processing abilities and power. Each parallel component became a fully functional processor, often with special support for operations common in graphics applications (such as reciprocal square root [56]). This was so that more of the parallel aspects of generating, rather than simply displaying, real time video could be devolved from the (serial) CPU to the (parallel) graphics card. As GPUs were often somewhat independent of the end users’ computer mother board, keen video gamers could easily upgrade their GPU. This promoted rapid technological improvement, as rival GPU manufacturers sought sales by offering better and/or cheaper hardware than their rivals. However, even today GPUs essentially (like the SIMD MasPar, page 8) require their parallel processing elements to do the same thing at the same time.

Initially GPUs were very hard to program and their support software was only designed to be used by dedicated programmers employed by video game companies. However, the abundant and cheap parallel processing the GPUs offered was taken up by scientific programming, leading to the field of General-Purpose Computing on GPUs (GPGPU) [57]. As GPGPU became more popular, the GPU manufacturers, particularly nVidia, provided much better software support.

At first in genetic programming GPUs were only used to speed up fitness evaluation, e.g. work by Simon Harding [58] and Darren Chitty [59]. Indeed it was said that, due to the GPU’s peculiar SIMD architecture, running the GP interpreter on the GPU was impossible (cf. Fig. 3). Of course this was not true and, inspired by Juille’s work with the MasPar SIMD supercomputer [60] (page 8), I built a SIMD interpreter for nVidia’s GPUs (see Fig. 4) [61, 62]Footnote 19. See also [64,65,66,67,68,69,70].Footnote 20

As the memory available on the GPU cards increased, it became possible to work with huge populations of small GP trees. In [71] I used a cascade of GP populations to winnow useful bioinformatic data from more than a million GeneChip features. The top level GP populations contained more than five million individual trees. This GPU application could scale from a $50 GPU to a top 500 supercomputer [72]. Figure 1 in [73] shows the dramatic improvement in nVidia GPU speed (2003 to 2012, which still continues), whilst Table 3 in [74] shows some high performance parallel GP implementations, almost all running on GPUs.Footnote 21

1.8.3 Deep learning and accelerators: GPUs and TPUs

Due to the availability of internet scale data sets and GPGPU processing power, since 2010 the field of deep learning has taken off [77]. It is generally accepted that researchers need a GPU (possibly a whole cluster of GPUs) to do any form of competitive deep neural net learning. Even with the availability of cloud computing, this may soon have the effect of “pricing out” individual academic researchers from the future of deep learning [78].

Sometimes the whole notion of using a GPU to drive a computer’s screen (also called the computer’s monitor) may be disregarded. To save space and power, some GPUs, often called “headless” GPUs, dispense with the screen interface altogether. An extreme example of this is Google’s TPU, which is totally specialised to Artificial Neural Network (ANN) processing.

As gaming and now AI have become more important, the notion of a GPU as a cheap alternative to the computer’s CPU has also faded, and now a top end GPU can cost more than a CPU.

1.9 Other gaps: memory, theory, bloat

John Koza mentioned that even though Jaws [1] did not include much work on evolving memory, he regarded it as important because it provides another route to re-use: a value stored in memory can be re-used, potentially many times, without the code for it having to be evolved more than once. He mentioned my book [79], although the use of indexed memory in GP is due to Teller [80]. Surprisingly, there has been a steady stream of research on evolving memory within GP [81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133].
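A minimal sketch of indexed memory, in the spirit of Teller’s read and write primitives, is given here. The class name, the default memory size, wrapping out-of-range addresses with modulus, and having write return the cell’s previous contents (so that it can be nested inside an expression) are assumptions of this sketch rather than details taken from [80].

```python
class IndexedMemory:
    """A small array of cells that evolved programs can read and write."""

    def __init__(self, size=20):           # the size is an arbitrary choice
        self.cells = [0] * size

    def _address(self, index):
        # Any numeric value is a legal address: truncate and wrap round.
        return int(index) % len(self.cells)

    def read(self, index):
        return self.cells[self._address(index)]

    def write(self, index, value):
        i = self._address(index)
        old, self.cells[i] = self.cells[i], value
        return old                          # returning a value keeps closure
```

With such primitives in the function set, a value computed once can be stored and then read many times during a fitness evaluation, instead of the subtree which computes it having to be evolved (and run) repeatedly.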

Genetic programming theory has a variety of forms [134]. Jaws [1] starts with adapting the then current explanations of how linear bit string genetic algorithms work, due to John Holland and Dave Goldberg. Such schema theories were also analysed by Una-May O’Reilly [135], Justin Rosca [136] and most notably by Riccardo Poli [137]. Another popular thread is to take ideas from biology about how evolution works and use them to understand GP [138], e.g. Price’s theorem [139, 140], population convergence [76, 141, 142] and neutral networks (plateaus) [143] in fitness landscapes [144,145,146,147,148]. Similarly biology has been an inspiration for other search operators, such as homologous crossover [149]. In recent years there has been a flowering of formal or rigorous run time analysis in evolutionary computing and some success applying mathematical techniques to GP problems [150,151,152,153,154,155]. Of course it is difficult to make such theorems widely applicable and when using results we must remember the inevitable assumptions they require. For example, SAT has been proved to be NP-complete. Nevertheless in the last decade considerable progress has been made with practical SAT solvers and they are now routinely applied, e.g. in software engineering. Similarly, the No Free Lunch theorem [156] applies to GP (as with all optimisers) but fortunately (as in other branches of AI) that has not inhibited development of the field. Although, as noted above, there are exceptions, genetic programming as a whole remains a deeply empirical endeavour with many new ideas being reported. However, it is difficult to persuade authors to carefully analyse their evolving populations of programs so as to be able to explain why their experiment succeeded (or even why it failed).

Although John Koza reports [1] bloatFootnote 22 from the start of genetic programming, the tendency, indeed the name, for programs to be bigger than necessary is not unique to GP. Bloated human written programs are common. Indeed people writing computer programs with unnecessary instructions goes back to the very beginning of electronic digital computers, with bloat reported in programs run on the first stored program digital computer, the Manchester Mark I [35]. This human tendency is rampant, with some Internet code bases having grown to over a billion lines of code in less than 20 years. Bloat continues to be a well studied topic in GP with 426 entries in the GP bibliography mentioning it.

Although there are potential ways of mitigating bloat’s impact on runtime [157] and reducing its memory requirements with DAGs [158] (indeed bloated trees produced by crossover [159] should be highly compressible), in practice bloated populations can quickly overwhelm the available computer resources and so the common approach is to shut bloat down, for example by enforcing either depth or size limits on the evolving programs. However, this is not risk free [160] and more sophisticated approaches may be wanted, for example controls on selection, such as using multiple fitness objectives (e.g. a size versus performance Pareto trade-off [161, 162]), or tighter controls on offspring generation [163,164,165,166]. In many cases bloat appears to be an unexpected aspect of early (even premature) convergence and so has some similarity with the overfitting sometimes seen with artificial neural networks (ANNs), where prolonged search drives locally improved performance on the training data. This gives a more convoluted mapping between the ANN’s inputs and outputs but at the possible expense of the ANN’s ability to generalise to unseen data. Where the goal is to explain or predict, such complexity or overfitting is clearly unhelpful. In ANNs, anti-overfitting techniques are essential. These include stopping training early (i.e. in GP terms using fewer generations), regularization [167,168,169], changing the training data during training [170, 171] and even expression simplification [172], either during evolution [173] or, to increase comprehensibility and explainability (cf. XAI), after GP has finished [174]. Whilst Dale Hopper [173] and other authors ensure their automatic rewrite of GP individuals gives a semantically equivalent but smaller replacement, in many cases this is not wanted. When a 100% correct program is not realistic, e.g. on many prediction tasks, it may be better to accept (or allow evolution to find) a similar but much simpler program, rather than spending a lot of effort creating an exactly equivalent program to what is essentially only an approximation.
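Two of the simpler anti-bloat controls mentioned above, a depth limit on offspring and a size versus performance Pareto comparison, might be sketched as follows. The nested-tuple tree representation, the function names, the default limit of 17 and the policy of falling back to the parent when a child is too deep are assumptions of this sketch, not a description of any particular GP system.

```python
def depth(tree):
    """Depth of a nested-tuple tree; a leaf such as ("x",) has depth 1."""
    return 1 + max((depth(arg) for arg in tree[1:]), default=0)

def size(tree):
    """Number of nodes (functions plus leaves) in the tree."""
    return 1 + sum(size(arg) for arg in tree[1:])

def enforce_depth_limit(child, parent, max_depth=17):
    """Keep the child only if it is within the limit, else keep the parent."""
    return child if depth(child) <= max_depth else parent

def dominates(a, b):
    """Pareto dominance for (error, size) pairs, both to be minimised."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

A multi-objective selection scheme would then prefer individuals whose (error, size) pair is not dominated by any other member of the population, so a larger program survives only if it is also more accurate.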

However, bear in mind that evolution is a hacker. It builds on what was there before. In biology evolution overfits. Classic examples include: 1) the giraffe’s left laryngeal nerve, which runs the whole length of its neck from its head, round the aorta in its chest and then returns to its throat at the top of its neck, because evolution did not find a shorter path, 2) the male peacock’s heavy tail, which helps secure a mate but impedes flight, and 3) the human brain, which consumes 20% of our food [175] but made our ancestors more appealing as mates to other members of their tribe [176].

2 A brief selection of other genetic programming work

In addition to continuing with evolving Lisp-like trees, major branches of genetic programming include linear genetic programming [177], cartesian genetic programming (CGP) [178] and grammatical evolution (GE) [179], all of which use a linear chromosome. Following John Koza’s automatically defined functions, ADFs, see page 7, there were several attempts to encourage the evolution of modular programs using individuals with multiple trees or libraries of subtrees [180,181,182,183]. However, these seem not to have taken hold.

As with evolutionary computation in general, the major computational cost of GP is usually evaluating fitness [1, p783]. In tree GP this is usually the cost of interpreting the trees. When members of the population are going to be run many times,Footnote 23 it may be worth the cost of compiling the population and then running the compiled programs,Footnote 24 rather than interpreting them [188]. However, as Ronald Crepeau showed [189], for GP it is not essential to run a full blown compiler. Instead, knowing the restricted set of primitives used by GP, he constructed a dedicated fast compiler which converted the evolved code into machine code and ran that directly. Peter Nordin eliminated the compilation step entirely by using GP to evolve firstly Sun 32 bit SPARC RISC architecture machine code [85] and later Intel x86 binaries [190] (which in turn later became Discipulus [191]). He used tailored mutation operations which respected the layout of the machine code. Although perhaps first motivated by speed and simplicity, the idea of evolving variable length linear programs has taken off [192, 193].

Grammatical Evolution (GE) [194, 195] shows the virtues of trying ideas out. Michael O’Neill and Conor Ryan took the idea of a variable length linear chromosome, simplified it to become just an ordered list of byte sized integers (0..255) and married it to another favourite of computer scientists: the Backus-Naur form grammar (BNF). Pretty much anything which can be run on a computer can be expressed in a BNF grammar. They disregarded that BNF is essentially tree shaped and trusted in evolution to find a way of putting them together. The linear stream of bytes is mapped using modulus to say which branch to take next in the grammar. If there are not enough bytes, we simply wrap round and start again from the first. If there are too many, we ignore the excess. The resulting sentence of the grammar is treated as the individual’s phenotype and, in a problem dependent way, converted into a trial solution with a fitness value. The sloppiness of the mapping from genotype to phenotype offended some and provoked wide discussion in a peer commentary issue of “Genetic Programming and Evolvable Machines” [196]. But as Conor Ryan says “GE works” [197]. Indeed the separation of genotype from BNF grammar makes grammatical evolution flexible and it has been widely used. (The GP bibliography contains well over seven hundred entries relating to grammatical evolution.)
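The genotype to phenotype mapping just described can be sketched as follows. This is a simplified illustration, not the reference GE implementation: the toy grammar, the function name `ge_map`, the limit on the number of wraps and the choice to consume one byte for every non-terminal expanded are assumptions made here.

```python
# Toy BNF-style grammar: each non-terminal maps to a list of alternative productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["1"]],
}

def ge_map(codons, grammar=GRAMMAR, start="<expr>", max_wraps=2):
    """Map a list of integers (0..255) to a sentence of the grammar.

    Returns None if the sentence is still incomplete after max_wraps wraps.
    """
    symbols = [start]                  # always expand the leftmost symbol first
    output = []
    used = 0
    limit = len(codons) * (max_wraps + 1)
    while symbols:
        sym = symbols.pop(0)
        if sym not in grammar:         # terminal: emit it
            output.append(sym)
            continue
        if used >= limit:              # ran out of bytes, even after wrapping
            return None
        codon = codons[used % len(codons)]            # wrap round if needed
        used += 1
        choices = grammar[sym]
        production = choices[codon % len(choices)]    # modulus picks the branch
        symbols = list(production) + symbols
    return " ".join(output)            # any unused bytes are simply ignored
```

For instance, with this grammar the bytes `[0, 1, 0, 1, 1, 1]` map to the sentence `x * 1`, and appending further bytes does not change the result. A genotype which keeps choosing the recursive production never terminates, which is why some cap on wrapping is needed.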

With Cartesian Genetic Programming (CGP) [178, 198,199,200,201], Julian Miller turned to a fixed representation, more akin to traditional bit string genetic algorithms (GAs). However, the chromosome is a fixed sized two dimensional rectangle, rather than a single string, where each cell contains a digital computational unit, such as an XOR gate. Both the contents of the cells and crucially the connections between them are evolvable.Footnote 25 Notice, like linear GP (but unlike GE), evolution directly sets the contents and connections of each cell (i.e. evolution acts directly on the phenotype). Also there is no explicit left-right flow of control. In CGP the chromosome is treated as a circuit and so its evaluation has to take note of where data enters and leaves. It is also not necessary to evaluate cells which are not connected. Cartesian GP has been widely used, including in the evolution of approximate computing [202, 203], where evolution can be well suited to finding good trade-offs between conflicting objectives, such as fidelity, size, number of components, power consumption and speed.
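Evaluating such a circuit, while skipping cells which are not connected to any output, can be sketched as below. The encoding is an assumption of this sketch: the grid is flattened into a list of `(function, input_a, input_b)` cells, addresses below the number of circuit inputs refer to those inputs, and each cell may only connect to lower addresses (a feed-forward circuit). The names `FUNCS`, `active_cells` and `cgp_eval` are hypothetical.

```python
import operator

# Hypothetical gate set operating on 0/1 integers.
FUNCS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}

def active_cells(cells, n_inputs, outputs):
    """Addresses of the cells actually connected (directly or not) to an output."""
    active = set()
    todo = [o for o in outputs if o >= n_inputs]
    while todo:
        addr = todo.pop()
        if addr in active:
            continue
        active.add(addr)
        _, a, b = cells[addr - n_inputs]
        todo.extend(src for src in (a, b) if src >= n_inputs)
    return active

def cgp_eval(cells, outputs, inputs):
    """Evaluate only the active cells and return the values at the outputs."""
    n = len(inputs)
    values = list(inputs) + [None] * len(cells)
    for addr in sorted(active_cells(cells, n, outputs)):   # feed-forward order
        func, a, b = cells[addr - n]
        values[addr] = FUNCS[func](values[a], values[b])
    return [values[o] for o in outputs]
```

As a usage example, the cells `[("XOR", 0, 1), ("AND", 0, 1), ("OR", 2, 3)]` with outputs `[2, 3]` form a half adder; the third cell is present in the chromosome but never evaluated because nothing reads it.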

2.1 Inspired by computer science

In order for subtree crossover to freely mix subtrees from parents to create children, John Koza required the components of his GP trees to have closure [1, Sect. 6.1.1]. This means: 1) Any leaf or function in the tree can be an argument to any other function. Since components typically communicate via function return values, this often means GP trees use a single type, often float. 2) To ensure each function can deal with any combination of inputs, many functions have protected GP versions, such as protected log RLOG [1, p83], which returns a defined value (rather than raising an exception) even if its input is zero or negative. Alternatives might be to allow evolution to deal with the exception, or simply to assign poor fitness to individuals with illegal combinations. However, notice that ruling such an individual out prevents GP exploring not only this tree but all the trees that might have evolved from it.
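A minimal sketch of protected primitives follows. The particular return values used here (zero for log of zero, the log of the absolute value otherwise, and one for division by zero) follow a commonly used convention and are an assumption of this sketch rather than something stated above; the names `rlog` and `pdiv` are illustrative, and protected division is included only as a second familiar example.

```python
import math

def rlog(x):
    """Protected log: defined for every float input, never raises."""
    return 0.0 if x == 0 else math.log(abs(x))

def pdiv(a, b):
    """Protected division: returns 1.0 instead of dividing by zero."""
    return 1.0 if b == 0 else a / b
```

Because every primitive returns a float for any float arguments, any subtree can be swapped into any argument position by crossover and the resulting program can still be run.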

Perhaps the most famous extensions to closure are Dave Montana’s strongly typed GP [204] and Tina Yu’s polymorphic GP [205, 206] which allow multiple types but ensure evolution explores only type safe expressions. Another approach is to use various types of grammar to try and keep evolution in the most productive parts of the search space [207]. For example, using context free grammars [208, 209], using grammars to ensure the evolution of expressions which are dimensionally consistent [210], using tree-adjunct grammars to guide GP (TAG3) [211] and using GP with Lindenmayer Systems (often abbreviated to L-Systems) [212,213,214,215].

Whereas Lisp and most GP systems implicitly use the system stack, programs which explicitly use a stack [216, 217], e.g. to pass vectors and matrices [218], are also possible. An explicit stack allows the evolution of Reverse Polish Notation (RPN) [62] and even infix expressions [219]. In PushGP [220] there are multiple stacks, one per type. These may include a code stack, so allowing GP to manipulate code, thus permitting GP to evolve its own genetic operators.
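A single-stack RPN evaluator can be sketched as below. The token set, the function name `run_stack_program` and, in particular, the policy of silently skipping an operator when the stack holds too few arguments are assumptions of this sketch; numeric tokens are assumed to be valid numerals.

```python
def run_stack_program(tokens, x):
    """Evaluate a linear list of RPN tokens using an explicit stack."""
    stack = []
    for tok in tokens:
        if tok == "x":
            stack.append(x)
        elif tok in ("+", "*"):
            if len(stack) < 2:
                continue               # too few arguments: treat as a no-op
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a * b)
        else:
            stack.append(float(tok))   # anything else is a numeric constant
    return stack[-1] if stack else None
```

With the skip policy every token sequence is a runnable program, so linear crossover and mutation can cut and splice anywhere, for example `run_stack_program(["x", "1", "+", "x", "*"], 3)` computes (x+1)*x.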

2.2 Non genetic GP

John Koza’s GP [1] is clearly strongly influenced by his PhD supervisor, John Holland, and GP [1] is essentially the application of John Holland’s genetic algorithms to the evolution of Lisp s-expressions, i.e. tree shaped programs. But, as we have seen, the programs need not be trees, and similarly the search algorithm does not have to be a genetic algorithm. Other techniques include: local search, Simulated Annealing [221, 222], Differential Evolution [223], Bayesian probability search [224], Estimation-of-Distribution Algorithms (EDAs) [225, 226], Ken Stanley’s NEAT [227,228,229] and even deterministic search, e.g. Trent McConaghy’s FFX [230]. Indeed search does not have to be guided only by fitness but can “look inside” the program [231] and its execution [232]. SRBench [233] compares many GP and non-GP approaches to symbolic regression, including MRGP [234], M3GP [235], FEW [236] and Operon [237].

2.3 Less explored

2.3.1 Assembly code, byte code

In human terms assembly code is usually viewed as intermediate between high level languages and machine code. It offers the potential advantages of machine code (speed and compactness) together with the ease of use and readability of high level source code. There has been very little GP work on evolving assembly code. Exceptions include microcontroller assembly [238], nVidia GPU PTX [239, 240] and the intermediate representation (IR) code used by LLVM [241], the latter also on GPUs [242].

Java, and some other interpreted languages, compile the source code into byte code which they then interpret. Eduard Lukschandl showed it is possible to run GP at the level of Java byte code [243].

2.3.2 Modularity, recursion, loops

Some of the work on encouraging the evolution of modular code was mentioned on page 13. In Jaws, John Koza described GP solving the Fibonacci problem [1, pp. 473–477] as an example requiring the evolution of recursion, and gave several examples where GP evolved do-until loops and other forms of iteration. But again there has been relatively little work on either by others. A few exceptions include work by Peter Whigham [244, 245] and Tom Castle [246].

2.3.3 Coevolution

As with many topics, there are examples of coevolution [247, 248] in Jaws [1] and many elsewhere in genetic programming [81], for example in agent learning [249]. However, it does feel as though coevolution has not yet fulfilled its potential. In deep artificial neural networks there is interest in antagonistic adversarial learning, and so perhaps this will stimulate renewed interest in coevolution in genetic programming.

3 The future

At GECCO 2022 Erik Goodman asked if there were any applications of GP that had surprised John Koza. Amongst the many human competitive [6] results, perhaps one of the most encouraging is quantum computing. As with quantum physics, quantum computing has a deserved reputation for being difficult for people. However, the rules about quantum computing gates can be coded for GP to use by someone who is not an expert quantum physicist, and then GP can be left to evolve novel quantum circuit designs incorporating them [250,251,252]. Riccardo Poli, Leonardo Vanneschi and others have previously reported on the state of GP and in particular what remains to be done [253, 254].

In genetic improvement [13] existing (human written) software is optimised (typically by using GP). Notice genetic improvement does not start from primordial ooze [1]. Instead search automates the potentially labour intensive, tedious and error prone task of finding modifications. Examples include repairing bugs [12, 43, 47, 49, 255], including energy bugs [256], reducing memory consumption [257], reducing run time [174, 258,259,260,261,262,263,264,265], improving existing functionality (e.g. to give better predictions [266]), porting to new hardware [267], including improving GPU applications [242, 262,263,264,265, 268], and even incorporating existing functionality from outside the existing code base [269].

The idea of mixing evolutionary computing (including GP) with other optimisation tools to give hyperheuristics [270] has a long history. In particular, with the recent explosion of interest in deep artificial neural networks, combining evolutionary learning and artificial neural networks seems set to continue. One particularly encouraging trend is AutoML tools such as TPOT [271, 272] which automatically tune existing machine learning pipelines.

In GP, as in most optimisation problems, most of the computational effort is spent on evaluating how good the proposed solutions are. Various ideas for speeding up fitness evaluation have been proposed, for example surrogate fitness functions [273]. Colin Johnson’s Learned Guidance Functions [274] seem a particularly elegant approach to making best use of previously gained knowledge. It would be interesting to see Learned Guidance Functions applied to genetic programming or when using genetic improvement to adapt existing human written programs.

Since all digital computing progressively loses information, information about crossover and mutation gets progressively washed out the further it has to travel. In nested functions without side effects, deep genetic changes become invisible to the fitness function. I therefore propose that large complex programs be evolved not as deep trees but as many shallow trees, within a strong low entropy-loss data interconnect to and from the environment. This should ensure that the good and bad effects of most genetic code changes are externally measurable [275].
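
The washing out of deep changes can be illustrated with a toy Python sketch. It is not the analysis of [275]. The information losing operation (integer halving), the names lossy, nest and visible_fraction, and the modelling of a genetic change as adding one to the value leaving the changed subtree are all assumptions made for illustration.

```python
def lossy(v):
    """An assumed information losing operation: integer halving drops the low bit."""
    return v // 2


def nest(v, depth):
    """Pass v up through `depth` nested applications of lossy()."""
    for _ in range(depth):
        v = lossy(v)
    return v


def visible_fraction(depth, inputs=range(256)):
    """Fraction of inputs for which a +1 change made `depth` levels down
    still alters the value seen at the root."""
    inputs = list(inputs)
    changed = sum(nest(v + 1, depth) != nest(v, depth) for v in inputs)
    return changed / len(inputs)


for depth in (0, 1, 2, 4, 8):
    print(depth, visible_fraction(depth))
```

In this toy the fraction of test inputs on which the change is visible at the root halves with each extra level of nesting, so with 256 test inputs a change eight levels down is visible on only one of them. A shallow tree keeps every change close to the output, where the fitness function can measure it.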

At GECCO John Koza pointed out that in both biology and in human design, modularity and reuse are ever present. Biology scales from a single cell to individuals containing billions of cells. It does this, like human engineers, not by solving many billions of individual problems but by reusing existing designs. We need to revisit the scaling problem.

4 Conclusions

We have seen that in the thirty years since John Koza published his first GP book, the field has blossomed. The genetic programming bibliography contains some 16 367 entries by 16 342 authorsFootnote 26. Many of the genetic and evolutionary computation papers judged to be the best human competitive work of each year have used genetic programming. Clearly GP is doing well in its mission to help the world.

As mentioned at the end of the last section, although GP continues to flourish, perhaps we need to tackle the scaling problem. Are we evolving small things? Do we need to be more ambitious? Following Stephanie Forrest’s recent questions [276]: what could GP do with Google Deep AI scale resources?

As John Koza foresaw, 30 years of Moore’s Law [277] (with component count doubling every 18 months) means 20 lots of doubling (\(2^{20} = 1\,048\,576\)). That is, since the genetic programming field started, the computer power available to us has increased a million fold. What of the next 30 years? Perhaps Moore’s Law will end? Certainly the death of Moore’s Law has been confidently predicted many times. What seems certain is we will not see dramatic increases in silicon computing’s clock speeds. Instead we anticipate the future of computing will be ever more parallel. But as John Koza says, GP is embarrassingly parallel. Indeed the use of distributed parallel GP populations not only makes good use of current and future compute resources but is also in keeping with Sewall Wright’s [278] model of natural evolution. As John Koza reports, by maintaining population diversity the distributed demes of the island model improve GP results as well as speeding GP up.
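
The doubling arithmetic above can be checked with a few lines of Python. The function names doublings and growth_factor are hypothetical, and the 18 month doubling period is the figure assumed in the text.

```python
def doublings(years, months_per_doubling=18):
    """Number of doublings in the given number of years."""
    return years * 12 / months_per_doubling


def growth_factor(years, months_per_doubling=18):
    """Overall growth after repeated doubling for the given number of years."""
    return 2 ** doublings(years, months_per_doubling)


print(doublings(30))      # 20.0
print(growth_factor(30))  # 1048576.0
```

The result is sensitive to the assumed doubling period, which is why it is exposed as a parameter.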

In 2052 will genetic programming researchers be using computers a million times faster than they use today? Certainly GP seems well placed to exploit them.