On sampling error in genetic programming

The initial population in genetic programming (GP) should form a representative sample of all possible solutions (the search space). While large populations accurately approximate the distribution of possible solutions, small populations tend to incorporate a sampling error. This paper analyzes how the size of a GP population affects the sampling error and contributes to answering the question of how to size initial GP populations. First, we present a probabilistic model of the expected number of subtrees for GP populations initialized with full, grow, or ramped half-and-half. Second, based on our frequency model, we present a model that estimates the sampling error for a given GP population size. We validate our models empirically and show that, compared to smaller population sizes, our recommended population sizes largely reduce the sampling error of measured fitness values. Increasing the population sizes even more, however, does not considerably reduce the sampling error of fitness values. Last, we recommend population sizes for some widely used benchmark problem instances that result in a low sampling error. A low sampling error at initialization is necessary (but not sufficient) for a reliable search since lowering the sampling error means that the overall random variations in a random sample are reduced. Our results indicate that sampling error is a severe problem for GP, making large initial population sizes necessary to obtain a low sampling error. Our model allows practitioners of GP to determine a minimum initial population size so that the sampling error is lower than a threshold, given a confidence level.


Introduction
In optimization, evaluating all solutions for a problem instance (complete enumeration) is often too difficult, expensive, or time-consuming (Rothlauf 2011). Therefore, population-based heuristic search methods like genetic programming (GP; Koza 1992) start with a small sample taken from the set of all solutions and improve these solutions through the application of variation operators and selection. 1 When using a sample, there are usually differences between the properties of the statistical population and the information obtained from the sample. These differences are called errors (Lee et al. 2013). Non-systematic errors, describing random variations caused by observing only a subset of the statistical population are called sampling error (Lee et al. 2013;Cochran 1977;Särndal et al. 1992). The expected amount of sampling error can be reduced by using larger samples (Lee et al. 2013;Cochran 1977;Särndal et al. 1992).
Sampling error is a problem in evolutionary algorithms (EAs), leading to unreliable search results due to random variations. The problem has been discussed in the genetic algorithm (GA) literature. For example, Goldberg and Segrest (1987) note that small initial populations in a genetic algorithm (GA) can be problematic when relevant building blocks (BBs) are not represented by the sample. 2 Goldberg et al. (2001) and Reeves (1993) argue that an initial supply of BBs is necessary for the search to allow for the possibility that high-quality BBs will take over the population in later generations (BB growth). Recent work of Burlacu et al. (2015Burlacu et al. ( , 2018aBurlacu et al. ( , 2018b evaluates this hypothesis for GP. The authors show that building blocks have a large influence on the evolutionary process. Thus, if the sampling error in an initial GP population is large, random variations can have a negative effect on the search performance. In addition, having a low sampling error is also relevant for estimation of distribution genetic programming (EDA-GP), where standard GP variation operators such as crossover and mutation are replaced by model building and sampling from the learned model (Kim et al. 2014;Shan et al. 2006). In EDA-GP algorithms, early sampling errors are learned by the model and, as a consequence, finding favorable solutions can be more difficult. Therefore, we argue that an initial (EDA-)GP population should form a representative sample of the statistical population of possible solutions and that the sampling error in the initial GP population should be low.
This article studies how the initial GP population size affects the sampling error of subtree frequencies. We present a model that estimates the minimum size of a GP population that is required for a sampling error to be below a certain value that can be specified a priori by a GP user. Therefore, we first present a probabilistic model of the expected frequencies of subtrees in GP populations initialized with full, grow, or ramped half-and-half (Koza 1992). We use n-grams of ancestors (Hemberg et al. 2012) as a possible measure to describe subtrees in GP. An ngram of ancestors in a GP parse tree is the sequence of the values represented by a node i and its n À 1 ancestor nodes on the same branch (parent, grandparent, greatgrandparent, etc.; Hemberg et al. 2012). The difference between the expected and the observed frequencies of n-grams in a sample is the sampling error. Thus, our model allows us to measure and investigate sampling error in initial GP populations.
Based on the model of the expected frequencies of ngrams of ancestors, we present a model to estimate the size of an initial GP population, given the desired degree of sampling error and a confidence level. Our model allows GP practitioners to estimate a minimum initial GP population size in such a way that the sampling error is lower than a threshold. Furthermore, we empirically validate our model. First, we measure the frequencies of subtrees in very large GP populations and compare these with the expected frequencies calculated with our model. As expected, there are no differences between the expected and measured frequencies for all BBs and thus, the reliability of our model is good. Second, we use our model to estimate the GP population sizes for different desired degrees of sampling error. Then, we sample (random) GP populations of the respective estimated population sizes and compare the resulting empirical sampling errors with our predictions. We find that our model accurately estimates the sampling error. Our results indicate that the estimated sampling error calculated with expected frequencies of n-grams is a good proxy for the empirical variance of fitness values in our experiments. Last, we recommend minimum population sizes for benchmark problems that are often used in the literature to avoid problems with sampling error. We make our code publicly available in the form of a GP population size calculator, so that users of GP can calculate population sizes for other problem instances as well. 3 In summary, our results indicate that sampling error is a severe problem for GP, making large initial population sizes necessary to obtain a low sampling error. Our model allows to estimate a minimum population size that is necessary to reduce sampling error to a given amount. Lowering the sampling error means that the overall random variations in a random sample are reduced. As a consequence, the reliability of the results of a GP run increases with larger population sizes.
In Sect. 2, we present a short summary of related work on population sizing and BB supply in GAs and GP. In Sect. 3 we describe initialization in GP with full, grow, and ramped half-and-half and introduce our model of the expected frequencies of n-grams of ancestors. Section 4 presents the model to estimate a GP population size, given a threshold of sampling error and a confidence level. We validate the proposed models in Sect. 5 with experimental results and recommend minimum population sizes for benchmark problems that are often used in the literature. The article ends with concluding remarks.

Related work
We summarize previous work on population sizing and BB supply in GAs and GP. Some researchers (Goldberg et al. 1992(Goldberg et al. , 2001Goldberg and Segrest 1987) argue that a successful GA needs to ensure a sufficient supply of relevant BBs. An adequate supply of BBs can be ensured by 1. a high diversity in the BBs of the initial population (spatial approach) and/or by 2. generating BB diversity during runtime, e.g., by applying a suitable mutation operator (temporal approach; Goldberg et al. 2001).
A spatial approach for GAs was first proposed by Holland (1975). He discussed the issue of initial BB supply and proposed a model to estimate the expected number of observations of BBs in a population, given the size of the population. Later, Goldberg (1989b) improved Holland's model. Reeves (1993) developed a model to estimate the minimum population size needed to ensure the presence of at least one instance of every BB. However, Reeves' model only considered BBs with a length of one. Therefore, Goldberg et al. (2001) proposed a more general model that also considers larger BBs of a fixed size.
Other work analyzed the probability that high-quality BBs will take over the population in later generations (BB growth). E.g., Holland proposed to use the two-armed bandit problem to model the decision between competing BBs (Holland 1973(Holland , 1975. Later models also considered decision errors due to genetic drift (De Jong 1975) and variance of BB fitness (collateral noise; Goldberg and Rudnick 1991). The population sizing model presented in (Goldberg et al. 1992) permits the inclusion of other sources of decision errors to estimate population sizes that minimize these errors. Harik et al. (1999) proposed a model to predict the convergence quality of GAs based on the size of the population. They considered the initial supply of BBs as well as the selection of the best BB(s) over competing BBs.
In summary, previous research on GAs found that the initial supply of relevant BBs leads to improved BB growth (Goldberg et al. 2001;Reeves 1993). However, at the beginning of a search run, it is not known if a BB is relevant or not. Therefore, the authors argue that the initial GA population should be large enough to ensure that at least one copy of each BB is present in the initial population.
Following the GA literature, papers about population sizing in GP focus on BB supply. In the context of GP, BBs describe relationships between nodes in GP parse trees. BBs in GP were usually defined as subtrees of a GP parse tree (Poli and Langdon 1998;Poli 2001;Walsh and Ryan 1996;Koza 1992;O'Reilly and Oppacher 1994;Whigham 1995;Sastry et al. 2003Sastry et al. , 2005Hemberg et al. 2012). GP subtrees can be described by using n-grams of ancestors (Sastry et al. 2003(Sastry et al. , 2005Hemberg et al. 2012). An n-gram of ancestors in a GP parse tree is the sequence of the values represented by a node i and its n À 1 ancestor nodes on the same branch (parent, grandparent, greatgrandparent, etc.). Hemberg et al. (2012) found that n-grams of ancestors represent relevant relationships between nodes of a GP parse tree. Sastry et al. (2003) proposed a model of the initial supply of BBs in GP based on n-grams of ancestors. The authors estimate the population size required to ensurewith a given error-the initial presence of at least one copy of all possible BBs in the initial GP population. Sastry et al. (2005) improved the population sizing model by incorporating decision-making errors among competing BBs in the population sizing model. However, both models only hold for simple test problems and assume full trees, binary functions, and knowledge about the size of the trees (i.e., a given tree size distribution). Therefore, the models are only generalizable to a limited extent (Sastry et al. 2003(Sastry et al. , 2005Hemberg et al. 2012).
In the GP community, there is still a debate on the role of building blocks. Following GA literature, O'Reilly and Oppacher (1994) formulated a GP schema theorem and a GP building block hypothesis. In their analysis they focused on dynamic aspects of the search (i.e., BB growth). They concluded that due to many reasons the probability of the recombination of building blocks is difficult to predict. This is identified as a major obstacle in formulating a model to verify the GP building block hypothesis. Based on their results they question whether building blocks are relevant for GP.
The early GA as well as GP work on building blocks has been criticized due to the strong assumptions that it requires to work and the large simplifications that-at least initially-were made. As a consequence, more recent work (e.g., by Poli and McPhee 2003) proposed exact schema theorems for GP that improve on previous definitions of a schema. Since Poli and McPhee (2003) use a temporal approach, we do not discuss the details but refer the interested reader to the original article.
Recently, Burlacu et al. (2015Burlacu et al. ( , 2018aBurlacu et al. ( , 2018b) evaluated the BB hypothesis for GP empirically. They performed schema analyses on GP populations and identified schemata with an above-average quality as well as an increasing frequency in the populations over multiple generations. They found that GP is able to effectively evolve BBs at least for some problem instances.

Expected frequencies of n-grams of ancestors in initial GP populations
We present a model to calculate the expected frequencies of n-grams of ancestors in GP populations initialized by the GP initialization methods full, grow, and ramped half-andhalf. First, we introduce the relevant notation and describe tree initialization in GP. Then, we develop a model of the expected number of nodes representing functions and terminals in a GP parse tree. Based on this, we model the expected frequencies of n-grams of ancestors.

Initialization of GP populations
We describe the GP initialization methods full, grow, and ramped half-and-half, using the following notation: The leaf nodes of GP parse trees are terminals t from a terminal set T (t 2 T) and all remaining nodes (''inner nodes'') are functions f from a function set F (f 2 F) (Koza 1992;Poli et al. 2008). Let a(f) be the arity (number of parameters) of f 8 f 2 F. Then all nodes in a GP parse tree that represent a function f have a(f) child nodes. Nodes representing a terminal do not have any child nodes. In full, grow, and ramped half-and-half, the user specifies two hyperparameters that determine the depth of the generated trees, where the tree depth is defined as the length of the longest non-backtracking path from the root of the tree to any tree node (Koza 1992). The user specifies a minimum allowed tree depth d min ! 0 and a set of allowed (maximum) tree depths D. When sampling a tree with full, grow, or ramped half-and-half, we first randomly sample a maximum tree depth d max from D with uniform probability (Koza 1992;Poli et al. 2008;Fortin et al. 2012). Note that d max defines a maximum tree depth for one particular tree that is sampled.
After d max has been determined, a GP parse tree is sampled. In full, grow, and ramped half-and-half, sampling starts with the root node of the GP parse tree at depth 0. We first decide whether the root node represents a function f 2 F or a terminal t 2 T. If the root node represents a function, the respective number of child nodes are created, depending on the arity of the selected function. Sampling continues by deciding for each child node if it represents a function f 2 F or a terminal t 2 T. Afterwards, the appropriate number of child nodes are created. This process is repeated until no more decisions have to be made (all leaf nodes of the GP parse tree represent terminals).
The probability for a node to represent either a function or a terminal depends on the initialization method and the depth d of the node in the GP parse tree. Similar to the tree depth, the depth of a node is the length of the longest nonbacktracking path from the root to the respective node.
The full method creates GP parse trees where all nodes with a depth d\d max are only allowed to represent functions f, randomly chosen with uniform probability from F. Thus, all leaf nodes are sampled at depth d max and represent terminals (Koza 1992). 4 Grow creates GP parse trees where the leaf nodes can be sampled at different depths in the GP parse tree (Koza 1992). For each node with a depth d\d min , a function f is randomly chosen from F. After that, the nodes at depth d min d\d max are sampled and each of these nodes can represent either a function or a terminal. If a terminal is chosen, the sampling process stops for the respective branch. For each node at depth d ¼ d max , a terminal t is randomly chosen from T. Therefore, a parse tree created with grow can have a depth that is less than or equal to d max and greater than or equal to d min .
Ramped half-and-half is a combination of full and grow where half of the population is initialized with trees created using full and half is initialized with trees constructed using grow (Koza 1992).

Expected number of functions and terminals
We develop a model of the expected number of nodes representing a specific type of function or terminal in a GP parse tree. We begin by determining the probability that, during initialization, a node will be selected to represent a function. We assume the condition that either the parent node represents a function or that we sample the root node.
Let d min and d max 2 D be the minimum and maximum tree depth for a tree to be sampled. Furthermore, let d be the depth of a node in the parse tree (0 d d max ). Then, a function is always sampled for all nodes with a depth d\d max in full and for all nodes with a depth d\d min in grow. Thus, the probability to sample a function is 1. In grow, for all nodes where d min d\d max , we sample a function or a terminal from F [ T with uniform probability. Therefore, the probability to sample a function in these cases is the number of functions |F| divided by the overall number of functions and terminals jF [ Tj. In full and grow, we always sample a terminal when d ¼ d max with uniform probability from T. Thus, at depth d max , the probability to sample a function is 0. The probability to sample a function in F with the full method is and with the grow method it is In the following equations, we do not differentiate between P full ðFÞ and P grow ðFÞ since the equations are the same for both initialization methods. Therefore, we will use P(F) to represent both possibilities. Since all functions of the function set are sampled with uniform probability, the expected number of child nodes over all functions in F is the average function arity a, which is defined as Luke (2000) showed that E nodes ðdÞ, the expected number of nodes at depth d in a parse tree, is Thus, the number of nodes at a given depth d [ 0 in the GP parse tree depends on • E nodes ðd À 1Þ, the number of nodes at depth d À 1, • P(F), the conditional probability that these nodes represent functions, and • a, the expected number of child nodes of a node that represents a function in F.
Let E tree ðd max Þ be the expected size of a parse tree, given a maximum tree depth d max . Then, E tree ðd max Þ is the sum of the expected number of nodes E nodes ðdÞ over all depths 0 d d max (Luke 2000): Equation (5) calculates the expected tree size for one particular tree depth d max 2 D and therefore requires that d max has already been determined. d max is uniformly sampled from D and, therefore, the expected tree size over all depths in D is the average expected tree size over the possible (maximum) tree depths d max 2 D For all s 2 F, let E s ðdÞ be the expected number of nodes representing a function in a parse tree at depth d. Then, analogously to the expected tree size, E s ðdÞ can be determined by first multiplying E nodes ðdÞ, the expected number of nodes at depth d, by the probability P(F) that these nodes represent functions. Then, we divide by |F| since all functions in F are sampled with uniform probabilities. Therefore, Since a node represents either a function or a terminal, the probability to sample a terminal from T for a node at a given depth d is Terminals are sampled with uniform probability from T. Thus-analogously to Eq. (7)-given a maximum tree depth d max 2 D, the expected number of nodes at depth d in a parse tree, representing a terminal s 2 T, is We are interested in the expected number of nodes that represent s in a particular depth interval of a parse tree. Let E s ðd max ; l; uÞ be the expected number of nodes representing s in depths d in the parse tree where l d u. Then, E s ðd max ; l; uÞ is the sum of E s ðdÞ, the expected number of nodes representing s, over all depths l d u Then, given a set of possible (maximum) depths D, the expected number of nodes representing s 2 F [ T in a parse tree is E s ðD; l; uÞ ¼ because depths are uniformly sampled from D. Let E node s ðD; l; uÞ be the expected frequency of a node to represent s 2 F [ T. Then E node s ðD; l; uÞ is the expected number of nodes representing s in a parse tree, E s ðD; l; uÞ, divided by the expected tree size E tree ðDÞ: In ramped half-and-half, 50% of the trees are created by the full method and 50% by the grow method (Koza 1992). Thus, the expected number of nodes representing functions and terminals in parse trees sampled with ramped half-andhalf can be calculated by averaging the respective equations for full and grow. For example, let E full s ðD; l; uÞ and E grow s ðD; l; uÞ be the expected number of nodes representing s 2 F [ T, calculated by using P full ðFÞ and P grow ðFÞ, respectively. Then, the expected number of nodes representing s in a parse tree created with the ramped half-andhalf method is 3.3 Expected frequency of n-grams of ancestors n-grams of ancestors can be described as follows. Each node i of the GP parse tree represents a function f from the function set F (f 2 F) or a terminal t from the terminal set T (t 2 T). Let s i be the function or terminal that is represented by an arbitrary node i in a parse tree. Furthermore, let the nodes i À 1; i À 2; . . .; i À ðn À 1Þ be the ancestors of the node i on the same branch in the parse tree and let s iÀ1 ; s iÀ2 ; . . .; s iÀðnÀ1Þ be the function values represented by the respective nodes. Then an n-gram of ancestors in a parse tree is the sequence of the node values s i and the values of its n À 1 ancestor nodes on the same branch (parent, grandparent, great-grandparent, etc.; Hemberg et al. 2012). Therefore, one specific n-gram of ancestors can be expressed using an ordered list such as ½s i ; s iÀ1 ; . . .; s iÀðnÀ1Þ . Root nodes do not have ancestor nodes and for these cases the values of (non-existent) ancestor nodes are defined as s ¼ £. This is done to also represent root nodes as child nodes in some of the n-grams of ancestors since root nodes are usually very important for the semantics of a GP parse tree. The definition of n-grams of ancestors in a GP parse tree implies that s i 2 F [ T and s iÀ1 ; . . .; s iÀðnÀ1Þ 2 F [ f£g. Therefore an n-gram in a GP population can be expressed with an ordered list of the form All n-grams that are observed in a GP population together form a multiset. 5 Figure 1a shows an example of a GP parse tree sampled by using the function set F ¼ fþg and the terminal set T ¼ fxg. The respective n-grams of ancestors are visualized for n ¼ 1 (Fig. 1b, c), for n ¼ 2 ( Fig. 1d-f), and for n ¼ 3 (Fig. 1g-j). The n-grams of ancestors shown in Fig. 1b-j can also be expressed by using ordered lists. For example, the 3-grams of ancestors are ½þ; £; £ (Fig. 1g), ½þ; þ; £ (Fig. 1h), ½x; þ; £ (Fig. 1i), and ½x; þ; þ (Fig. 1j). An n-gram of ancestors can occur several times and at different positions within one individual. For example, the 2-gram ½x; þ has a frequency of 3 in the exemplary parse tree.
The expected frequency over all n-grams of ancestors where the child node value is s 1 and the ancestor node values represent arbitrary functions in F is given by the expected frequency of s 1 . The functions s 2 ; . . .; s n are picked with uniform probabilities, but can have different arities. Nodes representing functions with higher arities have more child nodes and therefore the expected frequencies for n-grams where these functions are ancestor nodes is also higher. Thus, to calculate the expected frequency of an n-gram of ancestors ½s 1 ; s 2 ; . . .; s n in a GP parse tree, we first determine the expected frequency of s 1 and weight this frequency by the arities of s 2 ; . . .; s n .
We will explain this idea on the example of 1-grams, 2grams, and 3-grams. For the n-gram of ancestors ½s 1 ; s 2 ; . . .; s n , let K ½s 1 ;s 2 ;...;s n be the expected frequency of s 1 depending on the ancestor nodes s 2 ; . . .; s n . Furthermore, let W ½s 1 ;s 2 ;...;s n be a weighting factor, depending on the arities of s 2 ; s 3 ; . . .; s n . Then, the expected frequency of an n-gram ½s 1 ; s 2 ; . . .; s n in a parse tree is E ½s 1 ;s 2 ;...;s n ¼ K ½s 1 ;s 2 ;...;s n W ½s 1 ;s 2 ;...;s n : For n ¼ 1, K ½s 1 is independent from any ancestor nodes and therefore can be calculated using Eq. (12): 1-grams do not take ancestor nodes into account and therefore, W ½s 1 ¼ 1. Thus, we can calculate the expected frequency for any 1-gram in a GP tree by A 2-gram of ancestors is the combination of the values represented by a node and its ancestor node. Since the root node of a GP parse tree has no ancestor nodes, we define the value of a (non-existent) ancestor node as s iÀ1 ¼ £.
The value of K ½s 1 ;s 2 strongly depends on s 2 . If s 2 ¼ £, K ½s 1 ;s 2 is the expected frequency of s 1 in the root node (d ¼ 0). Otherwise s 1 has an ancestor node that represents a function f 2 F and so s 1 needs to be at a depth greater than 0 in the tree (d ! 1). Thus, using Eq. (12), we define If s 2 is a function f 2 F, we need to weight K ½s 1 ;s 2 , depending on the arity of s 2 compared to the arities of other functions in F. Functions with higher arities have more child nodes and therefore the expected frequencies for 2grams where these functions are parent nodes is also higher. Let a sum be the sum of all function arities of the functions in the function set Then W ½s 1 ;s 2 is 5 A multiset is a set that allows multiple instances for each of its elements.
W ½s 1 ;s 2 ¼ For the frequencies of 3-grams of ancestors we use Expected frequencies of n-grams of ancestors with n [ 3 can be calculated analogously.

Estimating sampling error of n-grams of ancestors in GP populations
The Cochran formula is a standard method in statistics to estimate a minimum sample size N for a large statistical population. The Cochran formula needs an estimate of the relative frequency p of the property that is evaluated (e.g., the relative frequency of an n-gram of ancestors; Cochran 1977). In general, it is a problem to estimate p. In our case, we already know the expected relative frequency of ngrams of ancestors (see Sect. 3). However, we need to assume that p is normally distributed (Cochran 1977). Furthermore, we need to choose an acceptable confidence level. For this, the Cochran formula uses z-scores of a normal distribution. For example, if a confidence level of 95% is chosen, the corresponding z-score is 1.96.
Last, we define a desired margin r of the relative statistical error e, so that e r, where e is the absolute difference between the expected frequency p and the measured frequency p 0 relative to p e ¼ jp 0 À pj p : The Cochran formula (Cochran 1977) is where p is the expected frequency, r is a margin of the relative error, and the confidence level is determined by a z-score. Thus, if we take a sample of size N, the value of p will be in the interval with a probability equal to the confidence level. For example, we decide to use a confidence level of 95% (z ¼ 1:96) and it is known that p ¼ 7% of a statistical population have the respective property; the desired level of precision is r ¼ 10%. Then, using Eq. (24), we estimate N ¼ 5103:84. As a result, if we take a random sample of size N, with a probability of 0.95 we measure p with 0:063 p 0:077 (Pð0:063 p 0:077Þ ¼ 0:95). The decision for a confidence level and a relative error is, to some extent, arbitrary (Cochran 1977). Values widely used in the literature and also recommended by Cochran (1977) are a confidence level of at least 95% (z ! 1:96) and a relative error of not more than 5%. Estimated sample sizes calculated by using these values have a high precision and a high confidence. Given the expected frequency of an n-gram (Sect. 3.3) as the value for p, we can estimate the size of a GP population.
So far, we are only able to estimate the necessary GP population size for one n-gram. However, in a GP population, we typically expect a large number of different ngrams. Therefore, we have more than one statistical item, for which we need to estimate a proper sample size. For such a case, Cochran recommends to first identify the most important items and afterwards estimate the sample size separately for each of these items. Then, Cochran's pragmatic recommendation is to simply select the largest estimate for a sample size of any of the items (Cochran 1977).  ¼ 1 (b, c), n ¼ 2 (d-f), and n ¼ 3 (g-j) On sampling error in genetic programming 179

Experiments
We empirically validate our model of expected n-gram frequencies and the model of the estimated sampling error in a GP population. Furthermore, we recommend population sizes for some widely used benchmark problem instances.

Frequencies of n-grams of ancestors
First, we validated the model of the expected frequencies of n-grams of ancestors in GP parse trees. We initialized five different large GP populations, each with a size of 100,000,000 individuals, measured the resulting frequencies for all n-grams of ancestors, and compared them with the expected frequencies calculated with our model. The GP populations were initialized with ramped-halfand-half because it includes both, trees initialized with full and grow. The minimum tree depth was set to different values (d min 2 f0; 1; 2; 3; 4g) for each of the five populations to take into account different scenarios. The set of allowed maximum tree depths used is D ¼ fd max 2 N 0 jd min d max 4g. We used three different terminals and four different functions to be able to create a large variety of different trees. The function set included two functions with an arity of one, a binary function, and a ternary function. Since we do not evolve the initial population, it is not necessary to define a fitness function, variation operators, or a selection method.
We measured the error for n-gram frequencies with n 2 f1; 2; 3g in each of the five GP populations. The results are presented in Table 1. The table shows the mean and maximum relative error by d min and n. The mean relative error is the mean over the relative errors for each n-gram frequency, measured separately for each value of n in the respective populations. Analogously, the maximum relative error was measured.
The values shown in Table 1 indicate that both, the mean and maximum relative error, are very small in all settings as expected. The error is larger for larger values of n. This is expected, since the population size is constant in all experiments, but there are much more different 3-grams than there are 1-grams. We used Pearson's chi-squared test to investigate the null hypothesis that the expected and measured frequencies of n-grams of ancestors are statistically different in their distributions. The p-values are very high (p ) 0:05), strongly indicating that the expected and measured n-gram frequencies are not statistically different in their distributions. Therefore, our model is able to reliably estimate the expected frequencies for n-grams of ancestors in large GP populations.

Sampling error
We estimate the GP population size for ramped half-andhalf with d min ¼ 2, D ¼ f2; 3; 4; 5; 6g, desired margins of sampling error r 2 f0:01; 0:05; 0:1g, and a confidence level of 95% (z ¼ 1:96) based on the expected frequencies of 1-grams, 2-grams, and 3-grams, respectively. We used the same function and terminal sets as in the first experiment (three different terminals as well as two unary, one binary, and a ternary function).
We calculated the expected n-gram frequencies of all ngrams and used these to estimate the respective expected population sizes. Note that we estimate the GP population size on the expected frequency of an n-gram per node (not per parse tree) using Eq. (12). The result is an estimate of the number of nodes. Since we want to obtain an estimate of the population size, we divide the estimate number of nodes by the expected tree size (which is the expected number of nodes in one tree). From the resulting list of estimated population sizes, we get three different values that take the most important n-grams into account with different degrees: • The minimum estimated population size which corresponds to the estimate based on the highest expected frequency of all n-grams (max), • the mean of the five lowest estimated population sizes which is based on the five most frequent n-grams (top 5), • and the median over the population size estimates (median).
Overall, we estimated 27 different population sizes using the above settings. The used settings and the resulting population size estimates are presented in Table 2. We can see that the estimated population sizes for larger values of n are much higher compared to smaller values of n (with otherwise unchanged variables). This is expected since the number of n-grams grows exponentially and thus, the expected frequencies of these n-grams is lower, which then leads to higher estimated population sizes. The desired margin of error also has a high influence on the population size estimates: high values of r lead to lower population size estimates and vice versa.
If we take only the most important n-gram frequencies into account (settings max and top 5), the estimated population sizes are lower compared to the setting median. This is because in median many n-grams with low frequencies are taken into account, resulting in large population size estimates. The difference between the settings can be large, e.g., in the case of 3-grams where the population size estimates with median are about 5 times as high as with max.
Next, we empirically analyzed the resulting error using the estimated population sizes from Table 2. For each estimate, we initialized 100 GP populations with the respective population size and measured the resulting relative sampling error by comparing the measured with the expected n-gram frequencies. In total, we initialized 2700 GP populations. The results are presented in Fig. 2.
Each of the 27 box plots visualizes the relative sampling errors that were measured in 100 GP populations. In Fig. 2, we differentiate between n-grams (columns), margin of error r (rows), and the number of n-gram frequencies taken into account when estimating the population size (each horizontal axis). The vertical axes show the corresponding values of sampling error. The desired margin of error is also depicted by a red line in each plot. As expected, we can see that the majority of values is below the margin of error. Interestingly, this is also the case if we only take into account the most important ngrams (settings max and top 5). For both of these cases, the estimated and measured sampling error are close to each other. When we estimate the GP population size using the median, we take more n-grams with a lower frequency into account. Therefore, we overestimate the GP population size. This leads to a sampling error that is well below the margin of error. In other words, it is only necessary to take into account the most important n-gram frequencies, which is in line with Cochran's general recommendation (Cochran 1977).
Our results show that the Cochran formula together with the results of the model of expected n-gram frequencies reliably estimate the GP population size for a desired margin of error.

Variance of fitness values
We analyze the variance of fitness values in generation 0 for different population sizes of N 2 f10; 20; . . .; 100; 200; . . .; 1000; 2000; . . .; 10;000; 15;000; . . .; 30;000g for four benchmark problem instances )-6-Multiplexer, 11-Multiplexer, Koza-1, and Pagie-1. These four problem instances were chosen because their primitive sets are widely known in the community and interesting differences between the four primitive sets exist (e.g., different number of functions, different arities of the functions, different number of terminals). We use ramped half-and-half with d min ¼ 2 and D ¼ f2; 3; 4; 5; 6g. The results are presented in Figs. 3 and 4. For the 6-Multiplexer (Fig. 3a) and the 11-Multiplexer (Fig. 3b) we plot the median, 25-, and 75-quartile of the mean fitness in generation 0 over population sizes. Since there are infinite fitness values in the symbolic regression problem instances, we plot the median, 25-, and 75-quartile of the median fitness for Koza-1 (Fig. 4a) and Pagie-1 (Fig. 4b). The x-axes are logscaled for better visibility of the results of small population sizes. We can see that the variance of mean and median fitness values is very high with small population sizes and asymptotically gets lower with higher population sizes.
To further analyze the variance of fitness values over population sizes, we plot the quartile coefficient of dispersion (QCD) (Figs. 5, 6). The QCD is calculated using the first (Q 1 ) and third (Q 3 ) quartiles of the data set: High values of the QCD indicate that the data has large variance. Similar to the results in Figs. 3 and 4, we calculate the QCD of the mean fitness (Fig.5) and median fitness (Fig. 6). For comparison, we plot the estimated sampling error calculated by using the five highest expected frequencies of 1-, 2-, and 3-grams for each of the population sizes. To estimate the error we use the Cochran formula (Eq. 24) and transform it to For a better comparison, we chose the scales in such a way that the QCD starts at about the same point as the estimated sampling error using expected 2-gram frequencies. In Figs. 5 and 6 we can see that the decrease of the QCD is analogous to the decrease of the estimated sampling error. Using Pearson's correlation coefficient, we measure the correlation between estimated sampling errors and QCD for estimates with 1-, 2-, and 3-grams on all four problem instances. The correlation coefficients are between 0.992 and 0.999, indicating a strong correlation. This means that the estimated sampling error calculated with expected

Estimated GP population sizes for common benchmark problem instances
We use our models to recommend reasonable population sizes for eight widely used benchmark problem instances . In general, we used the function  and terminal sets proposed by the authors of the benchmarks. However, in our analysis we ignored ephemeral random constants, e.g., used in (Pagie and Hogeweg 1997). Furthermore, in the model of expected frequencies, we used d min ¼ 2 and D ¼ f2; 3; 4; 5; 6g. In the sampling error model, we used r ¼ 0:05, z ¼ 1:96 (confidence level of 95%), and the estimates are based on the five highest expected 2-and 3-gram frequencies (top 5).
The results are presented in Table 3. The estimated population sizes using expected 2-gram frequencies are between 337 and 3440. Using expected 3-gram frequencies, the estimated population sizes are between 839 and 12,180.
Practitioners are usually faced with strict CPU time constraints. As a result, there is a trade-off between either choosing a larger population size or running the search for more generations. The population sizes indicated by our model help to make an informed decision of the population size. Increasing the population size beyond the indicated size would not help much. Instead, it would be better to increase the number of generations.

Conclusions
We developed a model of the expected frequencies of ngrams of ancestors in GP. We used the model of expected n-gram frequencies and Cochrans formula to determine a minimum size of an initial GP population, given a desired degree of sampling error and a confidence level. Then, we used our models to estimate initial GP population sizes for common benchmark problems, giving a recommendation to avoid sampling error. Furthermore, we find that the estimated sampling error calculated with expected frequencies of n-grams is a good proxy for the empirical variance of fitness values in our experiments. Last, we find for selected benchmark problems that the initial population sizes should be between 800 and 12,200, depending on the problem instance to reduce the amount of sampling error below 5%.
Our results show that GP and variants like EDA-GP benefit from high population sizes to avoid problems with sampling error. However, our model does not consider that-in addition to BBs being present-these BBs have an effect on the fitness (i.e., by definition introns do not influence fitness; Sastry et al. 2003).
Furthermore, our analysis focuses on subtree frequencies, where subtrees are represented by n-grams of ancestors. Other forms of n-grams could be interesting as well, e.g., n-grams that use sibling nodes (Hemberg et al. 2012). Also, it could be relevant to analyze other population statistics to evaluate whether our initial population is sufficiently representative (i.e., has a low sampling error). Examples are the distribution of tree depths or the distribution of tree shapes.
Of course, GP search is not only influenced by the initial population but also by other factors. Therefore, exploring a combination of our initialization model and an adaptive population size approach, e.g., the one presented by Hu and Banzhaf (2009), is promising.
We cannot guarantee a certain solution quality with our model as competing BBs or expressions are not considered. Thus, future studies need to extend our models, taking variation and selection into account (temporal models). Table 3 Estimated GP population sizes for common benchmark problem instances (r ¼ 0:05, z ¼ 1:96, ramped half-and-half, estimates are based on the five highest expected 2-and 3-gram frequencies) Acknowledgements Parts of this research were conducted using the supercomputer Mogon offered by Johannes Gutenberg University Mainz (https://hpc.uni-mainz.de). The authors gratefully acknowledge the computing time granted on Mogon. Also, we thank the anonymous reviewers whose suggestions helped us to improve and clarify this manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.