Accessibility percolation on Cartesian power graphs

A fitness landscape is a mapping from a space of discrete genotypes to the real numbers. A path in a fitness landscape is a sequence of genotypes connected by single mutational steps. Such a path is said to be accessible if the fitness values of the genotypes encountered along the path increase monotonically. We study accessible paths on random fitness landscapes of the House-of-Cards type, on which fitness values are independent, identically and continuously distributed random variables. The genotype space is taken to be a Cartesian power graph $\mathcal{A}^L$, where $L$ is the number of genetic loci and the allele graph $\mathcal{A}$ encodes the possible allelic states and mutational transitions on one locus.
The probability of existence of accessible paths between two genotypes at a distance linear in $L$ displays a transition from 0 to a positive value at a threshold $\beta_\text{c}$ for the fitness difference between the initial and final genotype. We derive a lower bound on $\beta_\text{c}$ for general $\mathcal{A}$ and show that this bound is tight for a large class of allele graphs. Our results generalize previous results for accessibility percolation on the biallelic hypercube, and compare favorably to published numerical results for multiallelic Hamming graphs.


Introduction
In the strong-selection weak-mutation (SSWM) regime evolutionary dynamics reduces to an adaptive walk on what is known as a fitness landscape, the map from genotypes to fitness values [7,25]. For low mutation rates the nearly monomorphic population can be represented by a single majority genotype moving through the space of genotypes by individual mutations that fix with a probability depending on the fitness of the mutant relative to the parental genotype [9,22]. Under strong selection, the movement of such a walker is additionally constrained towards increasing fitness values, making it an adaptive walk [12]. This limits the number of selectively accessible paths a population can take through the genotype space [5,8,26].
Fig. 1 The two allele graphs are shown on top with the genotype space below. While the second factor graph represents a locus with three possible alleles, all of which may mutate freely from one to another, the first factor graph represents a locus on which not all mutations between the alleles are considered possible. Specifically, mutations between 0 and 2 must pass through the intermediate allele 1, and additionally the mutation from 0 to 1 is considered irreversible and does not allow backstepping to 0. Although in this work we define the genotype graph as the direct Cartesian power of a single allele graph, different allele graphs as shown here can still be modeled without loss of generality by assuming that $\mathcal{A}$ is the disjoint sum of the individual graphs. Since the individual constituent graphs are not connected in this sum, this does not increase the accessibility (see Remark 1)
Here we investigate the impact that the mutational structure of the genotype space has on the number of evolutionary paths available to SSWM dynamics. We use a simple stochastic model for fitness landscapes known as the House-of-Cards (HoC) model, in which each genotype $\mathbf{g}$ is assigned an i.i.d. continuous random fitness value $F_{\mathbf{g}}$ [12,13]. Continuity of the distribution assures that ties in fitness values almost surely do not happen. Then a path is accessible if these fitness values are in increasing order. Accessibility therefore is a property purely of the ordering of the i.i.d. random variables. As a consequence accessibility is independent of the actual distribution chosen and we are free to choose any representative distribution. We will use the standard uniform distribution which has certain properties that make it easier to work with.
Fig. 2 Example of a fitness graph generated from the genotype space in Figure 1 according to the HoC model. The opacity of nodes indicates the randomly chosen fitness value. Only arrows representing mutations that were originally allowed in the genotype graph and also point towards increasing fitness remain, resulting in an acyclic directed graph on genotypes. The global minimum and maximum in this realization are (10) and (12) respectively and the latter is accessible from the former by multiple accessible paths, for example the direct one (10) → (12), but also (10) → (20) → (22) → (12). As a counter-example to accessibility consider (01) and (12). Although (12) has higher fitness than (01), it is not accessible from (01)
A genotype is made up of many individual sites or loci, which can be found in some given number of states called alleles and can be mutated individually. For simplicity we will assume that all loci have the same set of possible states. Therefore genotypes are sequences $\mathbf{g} = (g_1, \dots, g_L)$, with $L$ determining the number of loci. Individual (point) mutations, which are the only ones to be considered here, mutate only one of the loci. The mutational structure of the system determines whether every state of one locus is able to mutate to any other or whether some restrictions apply. For example, whereas point mutations in the DNA sequence can mutate any nucleotide base into any other, the genetic code constrains the possible one-step transitions between amino acids. To accommodate general mutational structures we describe the loci by a simple directed graph.
Definition 1 An allele graph is a finite simple directed graph.
We will denote the allele graph under consideration $\mathcal{A}$, its vertex set $\mathbb{A}$, the size of its vertex set $A = |\mathbb{A}|$ and its adjacency matrix $\mathsf{A}$. The vertex set $\mathbb{A}$ is the set of all alleles and the graph's arrows indicate possible one-step mutations between alleles. The assumption of finiteness is not strictly necessary but allows for more focused proofs. Extension to infinite graphs is straightforward given sufficient regularity properties (e.g. bounded degrees). Again for simplicity, we assume the allele graph to be the same on all loci with the vertices identified by natural numbers. We will give a justification for this restriction after introduction of the genotype space in the following paragraph.
Definition 2 A genotype space (over $L$ loci with allele graph $\mathcal{A}$) is the Cartesian power graph $\mathcal{A}^L = \mathcal{A}^{\square L}$ [20], where the Cartesian product $U \square V$ of two directed graphs $U$ and $V$ is a graph over the Cartesian product of their vertex sets with an arrow from $(u, v)$ to $(u', v')$ iff either $v = v'$ and there is an arrow from $u$ to $u'$ in $U$, or $u = u'$ and there is an arrow from $v$ to $v'$ in $V$.
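As an illustration of Definition 2, the following minimal Python sketch enumerates the arrows of a Cartesian power graph from the arrows of an allele graph. The function name `cartesian_power_arrows` and the example allele graph (modeled on the restricted locus described for Figure 1, here used on both loci) are assumptions made for this illustration only.

```python
from itertools import product


def cartesian_power_arrows(allele_arrows, num_alleles, L):
    """Return the set of arrows (g, h) of the L-fold Cartesian power graph.

    allele_arrows: set of (v, w) pairs, the arrows of the allele graph.
    Genotypes are tuples of length L over range(num_alleles).
    """
    arrows = set()
    for g in product(range(num_alleles), repeat=L):
        for locus in range(L):
            for (v, w) in allele_arrows:
                if g[locus] == v:
                    # Exactly one locus changes, along an allele-graph arrow.
                    h = g[:locus] + (w,) + g[locus + 1:]
                    arrows.add((g, h))
    return arrows


# Hypothetical allele graph: 0 -> 1 is irreversible, 1 <-> 2 is reversible.
example_arrows = {(0, 1), (1, 2), (2, 1)}
space = cartesian_power_arrows(example_arrows, num_alleles=3, L=2)
```

Each genotype has one outgoing arrow per locus in this example, since every allele has out-degree one; arrows that would change two loci at once never appear.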
With this definition the genotype space's vertex set forms the genotypes, with arrows between genotypes that can be reached via one-step mutations (Figure 1). The fitness landscape constrains which of these arrows may be taken by an adaptive walker and we call the directed sub-graph of the genotype graph obtained by removing arrows which do not point towards increasing fitness the fitness graph [6] (Figure 2). For conciseness we usually do not specify the number of loci and the allele graph for the genotype space under consideration, assuming these parameters to be $L$ and $\mathcal{A}$ instead.
Remark 1 Although we assume the allele graph to be the same on all loci, this still implicitly covers situations in which different loci use different numbers of alleles, as long as the number of alleles is bounded by a constant. This can be seen by considering an allele graph constructed as the disjoint union of all allele graphs with size at most the constant bound. The resulting genotype space then also separates into many disjoint components, one of which is the original genotype space with the varying sequence of allele graphs. Because mutations between the components are impossible, we can consider, without loss of generality, this larger genotype space without affecting the accessibility property.
Definition 3 A fitness landscape on a genotype space $\mathcal{A}^L$ is an assignment of real numbers $F_{\mathbf{g}}$, called fitness values, to each genotype $\mathbf{g}$. The fitness graph of the fitness landscape is the directed graph over all genotypes with an arrow from $\mathbf{g}$ to $\mathbf{h}$ iff there is an arrow from $\mathbf{g}$ to $\mathbf{h}$ in $\mathcal{A}^L$ and $F_{\mathbf{h}} > F_{\mathbf{g}}$.
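The construction of the fitness graph in Definition 3 can be sketched in a few lines of Python. The helper names `hypercube_arrows` and `fitness_graph`, the restriction to the biallelic case, and the hand-picked fitness values are assumptions chosen for this illustration.

```python
from itertools import product


def hypercube_arrows(L):
    """All arrows of the biallelic genotype space {0,1}^L (both directions)."""
    arrows = set()
    for g in product((0, 1), repeat=L):
        for locus in range(L):
            h = g[:locus] + (1 - g[locus],) + g[locus + 1:]
            arrows.add((g, h))
    return arrows


def fitness_graph(arrows, fitness):
    """Keep exactly those arrows that point towards strictly higher fitness."""
    return {(g, h) for (g, h) in arrows if fitness[h] > fitness[g]}


# Hypothetical landscape on L = 2 loci, chosen by hand for illustration.
fitness = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.3, (1, 1): 0.9}
graph = fitness_graph(hypercube_arrows(2), fitness)
```

In the absence of ties, each pair of mutually reachable neighbors keeps exactly one of its two arrows, so the fitness graph is acyclic.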
As seen in this definition we will use bold face for genotypes, i.e. vertices of the genotype space, while using normal face for alleles, i.e. vertices on the allele graph. Throughout, quantities defined over the allele graph will be written in normal face and analogous quantities defined over the genotype space will be written in bold face.
Definition 4 A House-of-Cards (HoC) model over a genotype space $\mathcal{A}^L$ is a random distribution over the set of all fitness landscapes on $\mathcal{A}^L$, such that each fitness value is chosen i.i.d. from a standard uniform distribution.
Definition 5 Given genotypes $\mathbf{g}$ and $\mathbf{h}$ on a fitness landscape over a genotype space $\mathcal{A}^L$,
1. a walk on $\mathcal{A}^L$ from $\mathbf{g}$ to $\mathbf{h}$ is called accessible if it is also a walk on the fitness graph of the fitness landscape and
2. $\mathbf{h}$ is said to be accessible from $\mathbf{g}$ if at least one such walk exists.
We write $Z_{\mathbf{g}\mathbf{h}}$ for the number of accessible walks from $\mathbf{g}$ to $\mathbf{h}$, considering it as a random variable over the HoC distribution.

Definition 6
The (HoC) accessibility of a genotype $\mathbf{h}$ from a genotype $\mathbf{g}$ on a genotype space is the probability that $\mathbf{h}$ is accessible from $\mathbf{g}$ in a HoC model over the genotype space.
In other words, accessibility of $\mathbf{h}$ from $\mathbf{g}$ is the quantity $P[Z_{\mathbf{g}\mathbf{h}} \ge 1]$.
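To make the quantities $Z_{\mathbf{g}\mathbf{h}}$ and $P[Z_{\mathbf{g}\mathbf{h}} \ge 1]$ concrete, the following Python sketch counts accessible walks on the binary hypercube by recursion over the fitness graph and estimates the accessibility by plain Monte Carlo sampling. The function names, the restriction to two alleles and the sampling scheme are assumptions made for illustration; this is not the method used in the proofs.

```python
import random
from functools import lru_cache
from itertools import product


def count_accessible_walks(L, fitness, start, end):
    """Number of accessible walks from start to end on the hypercube {0,1}^L."""

    @lru_cache(maxsize=None)
    def walks_from(g):
        if g == end:
            return 1
        total = 0
        for locus in range(L):
            h = g[:locus] + (1 - g[locus],) + g[locus + 1:]
            if fitness[h] > fitness[g]:  # this arrow is in the fitness graph
                total += walks_from(h)
        return total

    # Fitness strictly increases along walks, so the recursion terminates.
    return walks_from(start)


def estimate_accessibility(L, trials, rng):
    """Monte Carlo estimate of P[Z >= 1] between antipodal genotypes."""
    genotypes = list(product((0, 1), repeat=L))
    start, end = (0,) * L, (1,) * L
    hits = 0
    for _ in range(trials):
        fitness = {g: rng.random() for g in genotypes}  # HoC landscape
        if count_accessible_walks(L, fitness, start, end) >= 1:
            hits += 1
    return hits / trials
```

For a single locus the estimate should be close to 1/2, since the end point is accessible exactly when it is the fitter of the two genotypes.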
Our goal is to determine the accessibility between pairs of genotypes $\mathbf{a}_L$ and $\mathbf{b}_L$ defined on genotype spaces with $L$ loci as $L$ becomes large (with fixed allele graph). This question is in particular non-trivial if the directed distance $d_{\mathbf{a}_L\mathbf{b}_L}$ from $\mathbf{a}_L$ to $\mathbf{b}_L$ on the genotype space is of linear order in $L$. Here we understand directed distance $d_{\mathbf{g}\mathbf{h}}$ between two genotypes $\mathbf{g}$ and $\mathbf{h}$ as the length of the shortest (not necessarily accessible) path from $\mathbf{g}$ to $\mathbf{h}$ on the genotype space. Of special importance for this question is the value $\beta$, which we define as the fitness difference $F_{\mathbf{b}_L} - F_{\mathbf{a}_L}$.

Definition 7
For values $0 \le \beta \le 1$, the β-(HoC-)accessibility of a genotype $\mathbf{h}$ from a genotype $\mathbf{g}$ on a genotype space is the probability that $\mathbf{h}$ is accessible from $\mathbf{g}$ in the HoC model, given that the HoC distribution is conditioned on
$$F_{\mathbf{h}} - F_{\mathbf{g}} = \beta.$$
We typically still write $P[Z_{\mathbf{g}\mathbf{h}} \ge 1]$ to refer to β-accessibility. If there is ambiguity between the two in a given context, we add notation to indicate the conditioning.
It is known from previous work on the case of two alleles, $\mathbb{A} = \{0, 1\}$, and linear distance $d_{\mathbf{a}_L\mathbf{b}_L} \sim \delta L$, that there is a critical value $\beta_\text{c}$, depending on $\delta$, such that for constant choices of $\beta$ above or below $\beta_\text{c}$, asymptotically the probability of $\mathbf{b}_L$ being β-accessible from $\mathbf{a}_L$ converges to 1 or 0, respectively, as $L \to \infty$ [3,4,10,16,18]. The transition occurring at $\beta = \beta_\text{c}$ has been referred to as accessibility percolation [15,21]. Apart from a computational study [28], so far accessibility percolation has been studied only for the biallelic case for which the genotype space $\mathcal{A}^L$ is the $L$-dimensional (binary) hypercube.
Remark 2 Results related to those presented here have been obtained in the context of first-passage percolation [14,19,20]. The accessibility percolation problem (with any continuous distribution) can be mapped to an equivalent first-passage percolation problem with uniformly distributed weights as described in [18]. This mapping does however map accessibility percolation with fitness values on vertices to first-passage percolation with weights on vertices as well, while traditionally weights are put on edges in that context. In [20] Martinsson considers a first-passage percolation model that would map to the HoC accessibility percolation problem if fitness values were assigned to edges rather than vertices. We will adapt and extend his methods to directly resolve the specific accessibility problem introduced here without requiring the mapping to first-passage percolation.
Remark 3 Another way of looking at the HoC β-accessibility problem is to consider it as a Bernoulli percolation on a certain ensemble of orientations of $\mathcal{A}^L$. Alternatively to the definition of β-accessibility above, the β-conditioning in the HoC model can also be applied by conditioning on $F_{\mathbf{a}_L} = 0$ and $F_{\mathbf{b}_L} = \beta$. This follows immediately from the fitness values being i.i.d. uniform random variables. If all fitness values are increased by a constant value, taking their remainder modulo 1, the resulting distribution is unchanged, so that the β-accessibility must be unaffected by further conditioning on the initial fitness value. With this, only genotypes with fitness values below $\beta$ are relevant in determining whether an accessible path exists. Furthermore, after removal of the ineligible genotypes, accessibility will depend only on the order of the fitness values on the remaining vertices. Because the fitness values are chosen i.i.d. this implies that β-accessibility in the original problem is equivalent to 1-accessibility after additional removal of each vertex (that is not $\mathbf{a}_L$ or $\mathbf{b}_L$) with probability $1 - \beta$, i.e. a Bernoulli site percolation with rate $\beta$. This perspective may be more suitable if one is to consider the effect of decreasing $\beta$.
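The equivalence described in Remark 3 can be checked numerically on small examples. The sketch below samples β-accessibility on the binary hypercube in two ways: by fixing the endpoint fitness values to 0 and β, and by site percolation with retention probability β followed by 1-accessibility. All function names and the choice of the hypercube with antipodal endpoints are assumptions made for this illustration.

```python
import random
from itertools import product


def accessible(L, fitness, start, end):
    """True if end is reachable from start along strictly increasing fitness.

    Genotypes missing from `fitness` are treated as removed vertices.
    """
    stack, seen = [start], {start}
    while stack:
        g = stack.pop()
        if g == end:
            return True
        for locus in range(L):
            h = g[:locus] + (1 - g[locus],) + g[locus + 1:]
            if h in fitness and h not in seen and fitness[h] > fitness[g]:
                seen.add(h)
                stack.append(h)
    return False


def beta_conditioned(L, beta, rng):
    """Direct conditioning: F_start = 0, F_end = beta, others i.i.d. uniform."""
    start, end = (0,) * L, (1,) * L
    fitness = {g: rng.random() for g in product((0, 1), repeat=L)}
    fitness[start], fitness[end] = 0.0, beta
    return accessible(L, fitness, start, end)


def site_percolation(L, beta, rng):
    """Keep each intermediate genotype with probability beta, then test
    1-accessibility on the surviving vertices."""
    start, end = (0,) * L, (1,) * L
    fitness = {g: rng.random() for g in product((0, 1), repeat=L)
               if rng.random() < beta}
    fitness[start], fitness[end] = 0.0, 1.0
    return accessible(L, fitness, start, end)
```

For $L = 2$ both procedures succeed exactly when at least one of the two intermediate genotypes is eligible, which happens with probability $1 - (1-\beta)^2$.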

Results
In this section we present our general results for (mostly) arbitrary allele graphs $\mathcal{A}$, of which specific applications will be demonstrated in Section 3. First we define some attributes of the problem more carefully and state the limitations imposed by our subsequent proofs.

Prerequisites
Our intent is to describe the limiting behavior of accessibility as $L \to \infty$ for fixed allele graphs between genotypes in distances which are large, i.e. linear in $L$. To make this setup precise we need to introduce some more notation. We also introduce some restrictions on the choices of the allele graph $\mathcal{A}$ and the choices for the sequence of endpoints $\mathbf{a}_L$ and $\mathbf{b}_L$ in order to avoid pathological and non-converging behavior. These restrictions and assumptions are summarized in Definition 10.
First we note that since allele graphs were defined to be finite, there is a maximal degree $\Delta < \infty$ among all vertices. When extending the results to infinite allele graphs it should be required that such an upper bound on the degree still exists in order for most results to remain valid. If the degrees are not sufficiently bounded in an infinite allele graph, then the number of walks from $\mathbf{a}_L$ to $\mathbf{b}_L$ may become so large that the problem results in trivial accessibility.
As $L$, and with it the graph $\mathcal{A}^L$, changes we need to define a sequence of endpoint pairs $(\mathbf{a}_L, \mathbf{b}_L)$ in $L$. Correspondingly we also intend $\beta$ to vary with $L$ as we consider β-accessibility. For convenience the dependence of $\beta$ on $L$ is taken to be implicit and usually not reflected in notation. For general sequences of endpoints calculations may become tediously complex and so we impose a few restrictions, described in the following, on the sequences for which our results will apply. Because the position of a locus in the sequence of the genotypes does not matter, all relevant properties of the pair $(\mathbf{a}_L, \mathbf{b}_L)$ for a given $L$ can be expressed as an integer-valued matrix:
Definition 8 The allele counting matrix of a pair of genotypes $(\mathbf{a}_L, \mathbf{b}_L)$ over a genotype space is the $(A, A)$-matrix with entries
$$M_{vw} = \left|\{\, l \in \{1, \dots, L\} : a_l = v,\ b_l = w \,\}\right|,$$
where $v, w \in \mathbb{A}$ are alleles and $a_l$, $b_l$ denote the alleles of $\mathbf{a}_L$ and $\mathbf{b}_L$ at locus $l$.
This matrix counts for each pair of alleles the number of loci on which the path is required to move from $v$ to $w$, thereby dividing out the permutation symmetry of loci. A sequence of such matrices $M$ in $L$ is equivalent to a sequence of pairs $(\mathbf{a}_L, \mathbf{b}_L)$ up to the irrelevant symmetry of $\mathcal{A}^L$.
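The counting performed by the allele counting matrix can be sketched as follows. The function name `allele_counting_matrix` and the representation of genotypes as tuples of integer alleles are assumptions made for this illustration.

```python
def allele_counting_matrix(a, b, num_alleles):
    """M[v][w] = number of loci l with a[l] == v and b[l] == w."""
    M = [[0] * num_alleles for _ in range(num_alleles)]
    for v, w in zip(a, b):
        M[v][w] += 1
    return M


# Hypothetical pair of genotypes over three alleles and four loci.
a = (0, 0, 1, 2)
b = (1, 1, 1, 0)
M = allele_counting_matrix(a, b, num_alleles=3)
```

Permuting the loci of both genotypes simultaneously leaves the matrix unchanged, which is the symmetry that the definition divides out; the entries always sum to $L$.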
For each value of $L$ we have then a matrix $M$ corresponding to the pair $(\mathbf{a}_L, \mathbf{b}_L)$ of endpoints. Because we are interested in the behavior as $L$ diverges, we want to focus on cases where this sequence of matrices is sufficiently well-behaved. Borrowing the terminology from phylogeny studies [17], we introduce the following object:

Definition 9
The divergence matrix for a sequence of pairs of genotypes $((\mathbf{a}_L, \mathbf{b}_L))_L$ with allele counting matrices $M^{(L)}$ on a corresponding sequence of genotype spaces with $L$ loci and fixed allele graph is the $(A, A)$-matrix with elements
$$p_{vw} = \lim_{L \to \infty} \frac{M^{(L)}_{vw}}{L}$$
for all $v, w \in \mathbb{A}$, assuming all limits exist. We also define a sequence of divergence remainder matrices with elements
We want this divergence matrix to exist and the remainder terms to be sufficiently well-behaved that we can make convergence statements about the accessibility question. Specifically we require the following:
Definition 10 A (well-behaved) accessibility setup is a sequence $(\mathcal{A}^L)_L$ of genotype spaces with $L$ loci on the same allele graph $\mathcal{A}$ together with a sequence of pairs of genotypes $((\mathbf{a}_L, \mathbf{b}_L))_L$ on these genotype spaces such that
1. for all $L$: $M^{(L)}_{vw} = 0$ whenever there is no non-zero length walk from $v$ to $w$ in $\mathcal{A}$,
2. for all $L$: $\mathbf{a}_L \neq \mathbf{b}_L$,
3. the divergence matrix $p$ of $((\mathbf{a}_L, \mathbf{b}_L))_L$ exists,
4. at least one off-diagonal element of $p$ is non-zero.
The first requirement assures that there is always at least one candidate path from $\mathbf{a}_L$ to $\mathbf{b}_L$, so that the trivial case does not need to be considered and that it is actually possible to move on all loci. If there are only zero-length walks on a locus (i.e. start and end point are the same without any possibility of mutation), then effectively the locus cannot affect accessibility in any way, resulting in a pathological effective reduction of $L$.
The second requirement assures that we do not need to consider the pathological case in which the β-conditioning cannot be satisfied.
The third requirement assures that the "direction" between the pair of endpoints is asymptotically well-behaved without oscillations which would need to be reflected in the critical β.
The last requirement assures that the distance between the endpoints grows linearly in L, which is the only limit we consider here.
In the following we are always working with one such implied accessibility setup. In order to succinctly state our results, we introduce the following quantity for pairs of alleles $v, w \in \mathbb{A}$ and $t > 0$:
$$\Gamma_{vw}(t) = \ln \left( e^{t\mathsf{A}} \right)_{vw},$$
with $\mathsf{A}$ the adjacency matrix of the allele graph. The exponential is a matrix exponential from which the element representing alleles $(v, w)$ is extracted, rather than the exponential of the element of the matrix. In addition to the quantity $\Gamma_{vw}(t)$, also its first two derivatives $\Gamma'_{vw}(t)$ and $\Gamma''_{vw}(t)$ with respect to $t$ will be important. We indicate the derivative with respect to the $t$ argument by primes as shown.
Proposition 1 $\Gamma_{vw}(t)$ on $t \ge 0$ has the following properties:
1. If $v \neq w$ and there exists no walk from $v$ to $w$ on the allele graph, it is nowhere defined (or $-\infty$ everywhere),
2. if $v = w$ and there exists no walk of non-zero length from $v$ to $w$ on the allele graph, it is 0 everywhere,
3. otherwise it is strictly monotonically increasing, converging to positive infinity as $t \to \infty$ and to negative infinity as $t \to 0$.
In case 1. we formally interpret $e^{\Gamma_{vw}(t)}$ as $\left(e^{t\mathsf{A}}\right)_{vw}$, which would be zero everywhere.
Proof These properties are direct consequences of the behavior of the matrix exponential. □
We then define the same quantity for pairs of genotypes $\mathbf{v}, \mathbf{w} \in \mathbb{A}^L$ as averages over the per-locus quantity:
$$\boldsymbol{\Gamma}_{\mathbf{v}\mathbf{w}}(t) = \left\langle \Gamma_{v_l w_l}(t) \right\rangle_l, \tag{5}$$
where we use the general notation
$$\langle X_l \rangle_l = \frac{1}{L} \sum_{l=1}^{L} X_l$$
to mean an average over $l$ of the term $X_l$ containing $l$ as a variable. As we already did for genotypes, quantities acting on $\mathcal{A}^L$ will be written in boldface, while equivalent quantities acting on a single copy of $\mathcal{A}$ will be denoted in normal font-face. The canonical connection between the former and latter is averaging over loci. In particular, $\boldsymbol{\Gamma}_{\mathbf{v}\mathbf{w}}(t)$ can be interpreted as the exponential rate with which the expected number of accessible paths grows in $L$, or more precisely the number of quasi-accessible walks, a concept which will be introduced in Section 4.
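The quantity $\Gamma_{vw}(t)$ and its average over loci can be evaluated numerically for small allele graphs. The following Python sketch assumes the definition as the logarithm of an element of the matrix exponential of $t$ times the adjacency matrix, and uses a truncated Taylor series in pure Python. The function names and the truncation order are assumptions made for this illustration; a production implementation would rather rely on a dedicated linear algebra library.

```python
import math


def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(A, t, terms=60):
    """Truncated Taylor series of exp(t*A); adequate for small A, moderate t."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (t*A)^k / k!
        term = mat_mul(term, [[t * a / k for a in row] for row in A])
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def gamma(A, v, w, t):
    """Gamma_vw(t) = ln (exp(tA))_vw, with -inf if w is unreachable from v."""
    entry = mat_exp(A, t)[v][w]
    return math.log(entry) if entry > 0 else -math.inf


def gamma_genotype(A, M, t):
    """Average of Gamma over loci, expressed via an allele counting matrix M."""
    E = mat_exp(A, t)
    L = sum(map(sum, M))
    n = len(A)
    return sum(M[v][w] * math.log(E[v][w])
               for v in range(n) for w in range(n) if M[v][w]) / L
```

For the biallelic allele graph with arrows in both directions, the two distinct matrix elements are $\cosh t$ and $\sinh t$, which gives a simple check of the sketch.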

Proposition 2
In a (well-behaved) accessibility setup $\boldsymbol{\Gamma}_{\mathbf{a}_L\mathbf{b}_L}(t)$ satisfies the following properties on $t > 0$:
Proof The conditions on the allele counting matrix in an accessibility setup guarantee that all contributions to the mean in eq. (5) fall under the last point of Proposition 1. Furthermore, because the allele graph is finite and fixed, there is only a finite number of pairs $(v, w)$ of alleles. The properties in this proposition then follow immediately from $\boldsymbol{\Gamma}_{\mathbf{a}_L\mathbf{b}_L}(t)$ being a point-wise average of a finite number of continuous strictly monotonically increasing functions with the same properties and only the relative weighting dependent on $L$. □
As a consequence of the first property there exists exactly one value for each $L$ at which $\boldsymbol{\Gamma}_{\mathbf{a}_L\mathbf{b}_L}(t)$ becomes 0, which we denote $\hat\beta$. This is going to be our candidate for the threshold of β-accessibility.
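Since the averaged quantity is continuous and strictly increasing in $t$, its unique zero can be located by bisection. The sketch below does this for the biallelic hypercube, for which the average over loci takes the form $\delta \ln\sinh t + (1-\delta)\ln\cosh t$ at relative distance $\delta$. The function names, the bracketing interval and the tolerance are assumptions made for this illustration.

```python
import math


def find_root_increasing(f, lo=1e-9, hi=50.0, tol=1e-12):
    """Bisection for the unique zero of a strictly increasing function,
    assuming f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gamma_hypercube(t, delta):
    """Average of Gamma over loci for the biallelic hypercube at relative
    distance delta (fraction of loci on which the endpoints differ)."""
    return delta * math.log(math.sinh(t)) + (1 - delta) * math.log(math.cosh(t))


# Candidate threshold at full distance, delta = 1.
beta_hat = find_root_increasing(lambda t: gamma_hypercube(t, 1.0))
```

At full distance the zero is $\operatorname{arsinh}(1) = \ln(1+\sqrt{2})$, in agreement with the biallelic relation $\sinh(\hat\beta)^{\delta}\cosh(\hat\beta)^{1-\delta} = 1$.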
We also require the following function with domain $0 \le r, s \le 1$, which is a slight generalization of a function introduced by Martinsson in [20]:
Definition 11 Martinsson's function of an accessibility setup is the function on $[0, 1]^3$ defined by
$$\mathbf{M}(s, r, \beta) = \left\langle \left\langle \Gamma_{x_l y_l}(\beta s) \right\rangle^{s,r}_{x_l, y_l} \right\rangle_l,$$
where $\bar r = 1 - r$ and $\bar s = 1 - s$ and $\langle \cdot \rangle^{s,r}_{x_l, y_l}$ is the mean over $x_l, y_l \in \mathbb{A}$ weighted by
$$e^{\Gamma_{a_l x_l}(\beta \bar s r) + \Gamma_{x_l y_l}(\beta s) + \Gamma_{y_l b_l}(\beta \bar s \bar r)}.$$
Explicitly written out it reads
$$\mathbf{M}(s, r, \beta) = \left\langle \frac{\sum_{x_l, y_l \in \mathbb{A}} \Gamma_{x_l y_l}(\beta s)\, e^{\Gamma_{a_l x_l}(\beta \bar s r) + \Gamma_{x_l y_l}(\beta s) + \Gamma_{y_l b_l}(\beta \bar s \bar r)}}{\sum_{x_l, y_l \in \mathbb{A}} e^{\Gamma_{a_l x_l}(\beta \bar s r) + \Gamma_{x_l y_l}(\beta s) + \Gamma_{y_l b_l}(\beta \bar s \bar r)}} \right\rangle_l. \tag{9}$$
For the case that $y_l$ is not reachable from $x_l$ or that $s = 0$, the formula yields negative infinities for $\Gamma_{x_l y_l}(\beta s)$. We assume that in this case formally the natural choice
$$\Gamma_{x_l y_l}(\beta s)\, e^{\Gamma_{x_l y_l}(\beta s)} = 0 \tag{10}$$
holds. Martinsson's function can be interpreted as a refined version of $\boldsymbol{\Gamma}_{\mathbf{a}_L\mathbf{b}_L}(\beta)$ in which walks are segmented into three parts, which traverse fitness spans of $\beta \bar s r$, $\beta s$ and $\beta \bar s \bar r$ respectively, weighting the expected number of accessible walks on the middle segment by the number of accessible walks by which it can be reached from $\mathbf{a}_L$ and $\mathbf{b}_L$. In this way a positive value of Martinsson's function relative to $\boldsymbol{\Gamma}_{\mathbf{a}_L\mathbf{b}_L}(\beta)$ shows that the number of accessible walks is clustered around few initial and end segments with many alternative accessible middle segments, which in turn implies that the overall expected value is not a good indicator for the existence of at least one accessible walk. How this function arises from consideration of a variation of the second moment method is described in Section 6.
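A direct numerical evaluation of Martinsson's function can be sketched as follows, working with the elements of the matrix exponential rather than their logarithms so that unreachable pairs contribute zero weight. The sketch assumes that the three segments carry the fitness spans $\beta(1-s)r$, $\beta s$ and $\beta(1-s)(1-r)$, and that the setup is given through an allele counting matrix. The function names and the Taylor-series matrix exponential are assumptions made for illustration.

```python
import math


def mat_exp(A, t, terms=60):
    """Truncated Taylor series of exp(t*A) for a small square matrix A."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][m] * t * A[m][j] / k for m in range(n))
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def martinsson(A, M, s, r, beta):
    """Martinsson's function for allele graph A and allele counting matrix M."""
    n = len(A)
    E_in = mat_exp(A, beta * (1 - s) * r)          # segment a_l -> x_l
    E_mid = mat_exp(A, beta * s)                   # segment x_l -> y_l
    E_out = mat_exp(A, beta * (1 - s) * (1 - r))   # segment y_l -> b_l
    L = sum(map(sum, M))
    total = 0.0
    for a in range(n):
        for b in range(n):
            if M[a][b] == 0:
                continue
            num = den = 0.0
            for x in range(n):
                for y in range(n):
                    mid = E_mid[x][y]
                    weight = E_in[a][x] * mid * E_out[y][b]
                    den += weight
                    if mid > 0:
                        # Pairs with mid == 0 are skipped: Gamma * e^Gamma := 0.
                        num += math.log(mid) * weight
            total += M[a][b] * num / den
    return total / L
```

Two limiting cases give a sanity check: at $s = 0$ the middle segment is trivial and the sketch returns 0, while at $s = 1$ only $x_l = a_l$, $y_l = b_l$ carry weight and the value reduces to the average of $\Gamma_{a_l b_l}(\beta)$ over loci.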
Proposition 3 $\mathbf{M}(s, r, \beta)$ satisfies the following properties:
Proof The first two statements can be derived immediately by application of matrix multiplication. The third statement is an immediate consequence of the second given the definition of $\hat\beta$. □
The objects introduced so far are dependent on $L$ implicitly through the averaging process over loci. In order to be able to make statements about the limiting behavior, it is useful to consider the limits of these quantities as $L \to \infty$. We use the non-$L$-dependent mean
$$\left\langle X_{a_l b_l} \right\rangle_{p_{a_l, b_l}} = \sum_{v, w \in \mathbb{A}} p_{vw} X_{vw} \tag{11}$$
as a replacement for $\langle X_{a_l b_l} \rangle_l$. As the number of alleles is finite, $\langle X_{a_l b_l} \rangle_l$ will converge to $\langle X_{a_l b_l} \rangle_{p_{a_l, b_l}}$ as $L \to \infty$. In particular we write $\beta^*$ for the limit of $\hat\beta$ as $L \to \infty$. Similarly we write $\mathbf{M}^*(s, r, \beta)$ for the limit of Martinsson's function.
Proof Per the requirements on the divergence matrix for an accessibility setup the weights in the averaging over pairs of loci in $\boldsymbol{\Gamma}_{\mathbf{a}_L\mathbf{b}_L}(\beta)$ converge. Because the average is only over a finite number of terms, it consequently also converges. Continuity of the matrix exponential and the properties from Proposition 2 then immediately prove this proposition. □
With the necessary quantities defined we can classify accessibility setups as follows:
Definition 12 If not only at $s = 0$ and $s = 1$, but everywhere in its domain $\mathbf{M}^*(s, r, \beta^*) \le 0$, then we say that the accessibility setup is of semi-regular type. Otherwise we say that it is of irregular type. If $\mathbf{M}^*(s, r, \beta^*) < 0$ holds strictly everywhere except at $s = 0$ and $s = 1$ and if additionally the derivative $\partial_s \mathbf{M}^*(s, r, \beta^*)$ is not zero at $s = 1$, then we say that the accessibility setup is of regular type.
For our statements and in the following proofs we make use of Landau notation with the usual meanings of $O(\cdot)$, $o(\cdot)$, $\omega(\cdot)$ and $\Theta(\cdot)$. In our notation of arithmetic terms and equations these symbols are stand-ins for some function in the respective class. The limit variable to which these symbols apply should be evident from context, but is usually $L \to \infty$. Functions in these classes are not required to be non-negative. In particular, e.g. $|\Theta(1)|$ is used to enforce positiveness of a term that is of constant (non-zero) asymptotic order in $L$. If not stated otherwise, these symbols are assumed to be uniform in the sense that the functions represented depend only on the limit variable and model parameters, but not on other local variables.

Statements
We state our results in terms of (weak) threshold functions defined as follows.
Definition 13 Given a function $f(L)$ we say that a sequence $c_L$ in $L$ is an $f(L)$-threshold function for β-accessibility if for all $g(L) = \omega(f(L))$
1. $(c_L + |g(L)|)$-accessibility has a non-zero limit inferior as $L \to \infty$ and
2. $(c_L - |g(L)|)$-accessibility has a zero limit superior as $L \to \infty$.
We refer to the first condition as the upper side and the second condition as the lower side of the threshold. In other words, $c_L$ determines the asymptotic transition from zero accessibility to non-zero accessibility if we condition the fitness difference between initial and final genotype, with a window of uncertainty of the same order as $f(L)$. In particular if $c_L$ is an $f(L)$-threshold for some $f(L) = o(1)$, then the limit of $c_L$ is the critical value $\beta_\text{c}$.

Remark 4
The notion of threshold chosen here is weak in the sense that it doesn't imply a transition from zero to one, but only from zero to some nonzero probability. We do not think that our results are actually restricted to this weak bound and we expect that arguments analogous to those made in [20] may be used to extend our weak threshold result to a strong threshold with $\liminf_{L \to \infty} P[Z_{\mathbf{a}_L\mathbf{b}_L} > 0] = 1$, but we did not pursue this improvement here.
We can now state our main theorems:
Theorem 1 In an accessibility setup with the common notation used in previous definitions and the sequence
1. $c_L$ satisfies the lower side condition for a $\frac{1}{L}$-threshold function and
2. $c_L$ also satisfies the upper side condition and therefore is a $\frac{1}{L}$-threshold function if the accessibility setup is of regular type.
The second part of this theorem in particular implies that $\beta_\text{c} = \beta^*$ for the regular type, meaning that $\beta^*$ is indeed the critical value.

Theorem 2 In an accessibility setup of irregular type
In other words, the irregular type does not have $c_L$ as given in Theorem 1 as a threshold function.
Lastly we consider the length of accessible paths at the critical point:
In other words, if there are accessible paths at the candidate threshold function, then they have, up to fluctuations of order $\sqrt{L}$, length $\boldsymbol{\Gamma}'_{\mathbf{a}_L\mathbf{b}_L}(\hat\beta)\,\hat\beta L$, and since Theorem 1 guarantees that the candidate is actually the threshold function for regular setups, this then implies that the critical paths in the regular setup are of that length as well.
The lower side of the threshold functions in Theorem 1 can be derived directly from a consideration of the expected number of (quasi-)accessible walks and an application of Markov's inequality. This approach will be explained in Sect. 4, where we also introduce the notion of quasi-accessibility as a tool to simplify the counting of accessible paths. In addition, the first-moment approach allows us to prove Theorem 3 by consideration of the expected values separated by walk length (Sect. 4.3).
To prove the upper side of the threshold function, it is necessary to bound a higher moment of the expected number of (quasi-)accessible walks in relation to the mean. In particular, using a generalized version of the second moment method, it is sufficient to bound moments of the form to show asymptotic boundedness of the accessibility away from zero. The evaluation of this expected value will follow the general ideas used by Martinsson in [20] to bound for every given (quasi-)accessible focal walk the number of other (quasi-)accessible walks, through the deviating arcs on the focal walk that generate all such other walks. In the mean taken over $x_l$ and $y_l$ in Martinsson's function (9), the focal walk is represented by the walk sequence and the corresponding three $\Gamma$-terms in the weights, while the deviating arcs are represented by the additional term corresponding to $x_l \to y_l$ over which the average is performed. Because Martinsson considers a model that corresponds to putting weights on edges rather than nodes, our calculations need to be adjusted accordingly (see Sect. 6).
The lower bound on $\beta_\text{c}$ for the irregular case in Theorem 2 is again obtained following an approach used by Martinsson, by considering walks through pairs of edges $(x, x')$ and $(y, y')$, applying Markov's inequality separately, and union bounding the resulting probability to improve on Markov's inequality from the total expected number of (quasi-)accessible walks (see Sect. 5).

Asymptotic form
The theorems as stated in the previous section are dependent on $\hat\beta$ and $\langle \cdot \rangle_l$ averages, which are $L$-dependent quantities. From the definition of an accessibility setup we do however know that $\hat\beta$ converges to $\beta^*$ and that averages of the form $\langle \cdot \rangle_l$ are asymptotically of the form $\langle \cdot \rangle_{p_{a_l, b_l}}$, both of which are $L$-independent quantities. Depending on the specific choice of sequence of pairs $((\mathbf{a}_L, \mathbf{b}_L))_L$, the rate of convergence for these quantities may however differ and add additional significant terms in the threshold function, which we detail in this section.

Proposition 5 In Theorem 1 the sequence $c_L$ can be replaced by some sequence
for a suitable choice of $o(1)$, and under this change in Theorem 3 there is a suitable choice of $o(1)$ so that the interval becomes
with
Proof The leading order of the distance between the endpoints as $L \to \infty$ is given by the sum of off-diagonal terms of $p_{vw}$:
$$d_{\mathbf{a}_L\mathbf{b}_L} \sim \delta L,$$
where $\delta = \sum_{v \neq w} p_{vw}$. The definition of an accessibility setup enforces that $\delta > 0$, so the value of $\beta^*$ will be positive, i.e. not zero, and then we can expand $\hat\beta$ around $\beta^*$ in $L$, using that all $\Gamma$-terms are bounded by constants from below and above per Propositions 1 and 2:
The term is bounded from below by a positive constant, so solving for $\hat\beta$ results in
Inserting this into the candidate threshold function (13) we obtain the alternative threshold function in the proposition. Similarly, expanding $\boldsymbol{\Gamma}'_{\mathbf{a}_L\mathbf{b}_L}(c_L + \eta_L)$ in Theorem 3 gives, up to terms of order $\frac{\ln L}{L}$ or smaller
The remaining terms can contribute at most $\ln L$ to the walk length, which would be subsumed by the interval length of order $\sqrt{L}$. □
In general, the candidate critical value does not depend on the non-linear corrections in the behavior of $(\mathbf{a}_L, \mathbf{b}_L)$, but the leading correction to the critical value changes if the non-linear corrections are of order $\ln L$ or higher.
This condition describes the situation in which $\mathbf{a}_L$ and $\mathbf{b}_L$ are, up to discretization error, separated in a well-defined linear "direction" of a fixed allele counting matrix as $L$ increases. In this case the $O(1)$ contribution is irrelevant since $c_L$ represents a $\frac{1}{L}$-threshold function, so that additional contributions of order $\frac{1}{L}$ lie within the threshold window. In this linear separation case the threshold function is described fully by the two quantities $\beta^*$ and $\Gamma'^{*}$, both of which were derived from the averages over the matrix exponential of the allele graph, weighted by the divergence matrix.

Complete graph
The simplest application is to the complete graph on $A = |\mathcal{A}|$ alleles, as seen in the top-left of Fig. 3, which leads to genotype spaces known as Hamming graphs. By symmetry, in this case there are only two choices for the initial and final allele on a locus, either $a_l = b_l$ or $a_l \neq b_l$. Therefore the accessibility setup is fully determined by just the relative distance $\delta$, which is then also the relative Hamming distance. As shown in [20], this accessibility setup is (for converging $\delta > 0$) always of regular type for the complete graph. One obtains
$$\Gamma_{a_Lb_L}(\beta) = \delta \ln\frac{e^{(A-1)\beta} - e^{-\beta}}{A} + \bar\delta \ln\frac{e^{(A-1)\beta} + (A-1)e^{-\beta}}{A},$$
where $\bar\delta = 1 - \delta$. In the biallelic case $A = 2$ the condition $\Gamma_{a_Lb_L}(\hat\beta) = 0$ reduces to the relation
$$\sinh(\hat\beta)^{\delta} \cosh(\hat\beta)^{\bar\delta} = 1,$$
which was first conjectured in [3] and proved in [16,18]. At full distance $\delta = 1$ without any variation of $\delta$ with $L$, $\hat\beta = \beta^*$ and
$$\Gamma'_* = A - 1 + e^{-\beta^*}.$$
The values of $\beta^*$ and $\Gamma'_*$ for small $A$ are shown in Table 1 and numerically obtained values for $\beta^*$ as a function of distance are presented in Fig. 4.
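As an illustration of the biallelic relation above, the following minimal sketch solves $\delta \ln\sinh\beta + (1-\delta)\ln\cosh\beta = 0$ for $\beta$ by bisection. The function names and the bracketing interval are choices made for this example and are not taken from the paper; the sketch assumes $0 < \delta \le 1$.

```python
import math

def gamma_biallelic(beta, delta):
    """Gamma(beta) for the biallelic complete graph at relative distance delta."""
    return delta * math.log(math.sinh(beta)) + (1.0 - delta) * math.log(math.cosh(beta))

def solve_increasing(f, lo, hi, tol=1e-12):
    """Bisection for an increasing function f with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def beta_hat_biallelic(delta):
    """Root of Gamma for 0 < delta <= 1 (bracket [1e-9, 20] is an assumption)."""
    return solve_increasing(lambda b: gamma_biallelic(b, delta), 1e-9, 20.0)

if __name__ == "__main__":
    for delta in (0.25, 0.5, 0.75, 1.0):
        print(delta, beta_hat_biallelic(delta))
```

At full distance the root is $\operatorname{arsinh}(1) = \ln(1+\sqrt{2})$, and at $\delta = 1/2$ the relation becomes $\sinh(2\beta) = 2$, which gives two closed-form checks of the solver.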
In general $e^{\beta^*}$ is the unique positive solution of the polynomial equation
$$x^A - A x - 1 = 0.$$
Table 1 Results for the complete allele graph with 2-4 alleles, as well as 21 alleles, at full distance $\delta = 1$. The last column shows the prefactor of the asymptotic walk length at the critical point. In the biallelic case $A = 2$ the result for the walk length was also obtained in [14]
For $A \ge 5$ the solution of this equation cannot be expressed in closed form, however it can be expanded around $A \to \infty$. As the number of alleles increases, accessibility increases and the required fitness difference between the start and end point decreases. In fact this quantity vanishes for $A \to \infty$. At the same time the length of accessible walks close to the critical fitness difference increases, but slowly. The minimal length of a path covering the full distance $d_{a_Lb_L}$ is $L$, and hence $\beta^*\Gamma'_* - 1$ is the fraction of mutational reversions (where a mutated locus reverts to the allele it carried in the initial genotype $a_l$) and sideways steps (where a mutation occurs to an allele that is part of neither the initial nor the target genotype) [27]. The fraction of all alleles on a given locus that appear along an accessible path close to the critical point is given by $A^{-1}\beta^*\Gamma'_*$, which decreases with increasing $A$.
Zagorski, Burda and Waclaw carried out simulations of this model, giving $\beta^*$ with two digit precision for different values of $A$ [28]. Their results match the values derived here up to $\pm 0.01$.
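The full-distance values for the complete graph can be reproduced with a short numerical sketch. It assumes that $e^{\beta^*}$ is the root larger than 1 of $x^A - Ax - 1 = 0$ (which follows from setting the off-diagonal entry of the matrix exponential of the complete graph to 1) and that $\Gamma'_* = A - 1 + e^{-\beta^*}$; the function names are hypothetical.

```python
import math

def beta_star_complete(num_alleles):
    """beta* at full distance: log of the root > 1 of x^A - A x - 1 = 0."""
    A = num_alleles
    lo, hi = 1.0, 3.0  # the polynomial is negative at 1 and positive at 3 for A >= 2
    while hi - lo > 1e-14:
        mid = 0.5 * (lo + hi)
        if mid ** A - A * mid - 1.0 < 0.0:
            lo = mid
        else:
            hi = mid
    return math.log(0.5 * (lo + hi))

def walk_length_prefactor(num_alleles):
    """beta* Gamma'_* with Gamma'_* = A - 1 + exp(-beta*) (derived, see lead-in)."""
    b = beta_star_complete(num_alleles)
    return b * (num_alleles - 1 + math.exp(-b))

if __name__ == "__main__":
    for A in (2, 3, 4, 21):
        print(A, beta_star_complete(A), walk_length_prefactor(A))
```

For $A = 2$ the sketch returns $\beta^* = \ln(1+\sqrt{2})$ and the prefactor $\sqrt{2}\ln(1+\sqrt{2})$. For $A = 4$ and $A = 21$ it agrees with the values $\beta^* \approx 0.51$ and $\beta^* \approx 0.154$ quoted in the text.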

Complete graph without return to the wild type allele
We can modify the complete graph slightly to disallow mutations back to the allele that was present in the initial genotype (the wild type allele), while still allowing mutations between all other alleles, see top-right of Fig. 3. In this case the expressions simplify significantly to
$$\beta^* = \frac{\ln(A-1)}{A-2}, \qquad \Gamma'_* = A - 1.$$
The asymptotic behavior for large $A$ is the same as for the complete graph. For $A = 2$, the expressions are ill-defined, but the correct expressions coincide with the limits:
$$\beta^* = 1, \qquad \Gamma'_* = 1.$$
In the biallelic case $A = 2$ this describes accessibility percolation on the directed hypercube, which was considered by Hegarty and Martinsson in [10].
In this case $\beta^* = 1$, which implies that the directed hypercube is marginally accessible under the HoC model [8]. For the biallelic case not only the critical value, but also the leading order corrections in the threshold function are known [10] and coincide with the order $\frac{\ln L}{L}$ contribution in our candidate threshold function and the value of $\Gamma'_*$ given above.
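The critical value for an arbitrary small allele graph can be checked numerically by summing the exponential series of walk counts. The sketch below assumes that, at full distance with identical endpoints on every locus, $\beta^*$ solves $(e^{\beta \mathbf{A}})_{ab} = 1$ for the adjacency matrix $\mathbf{A}$ of the allele graph. The helper names, the truncation order and the search interval are choices made for this example.

```python
import math

def expm_entry(adj, a, b, beta, terms=120):
    """Entry (a, b) of exp(beta * adj) via the truncated series sum_k beta^k (adj^k)_ab / k!."""
    n = len(adj)
    vec = [0.0] * n
    vec[a] = 1.0          # row vector e_a, propagated to e_a adj^k
    total, coeff = 0.0, 1.0
    for k in range(terms):
        total += coeff * vec[b]
        vec = [sum(vec[i] * adj[i][j] for i in range(n)) for j in range(n)]
        coeff *= beta / (k + 1)
    return total

def beta_star(adj, a, b, hi=5.0):
    """Bisection for expm_entry(adj, a, b, beta) = 1; the entry is increasing in beta."""
    lo = 0.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if expm_entry(adj, a, b, mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def no_return_graph(num_alleles):
    """Complete graph on num_alleles vertices without arrows into vertex 0 (the wild type)."""
    return [[1 if (i != j and j != 0) else 0 for j in range(num_alleles)]
            for i in range(num_alleles)]

if __name__ == "__main__":
    for A in (2, 3, 5, 21):
        print(A, beta_star(no_return_graph(A), 0, 1))
```

For this graph the series sums to $(e^{(A-2)\beta} - 1)/(A-2)$, so the numerical root can be compared with $\ln(A-1)/(A-2)$, and with the value 1 of the directed hypercube for $A = 2$.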

Path graph
The complete graph is in some sense the best-case scenario for accessibility.
On the opposite side of the spectrum of possible (undirected) allele graphs one can choose the path graph on $A$ vertices, shown at the bottom of Fig. 3. In this case the distance between the two end points increases linearly with the number of alleles and there is a unique order in which mutations on a locus must be applied. This causes accessibility to become very low. For $A = 2$ the path graph is identical to the complete graph. However, already for $A = 3$ we find
$$\beta^* = \frac{\operatorname{arcosh}(3)}{\sqrt{2}} \approx 1.25.$$
Since $\beta^*$ represents a fitness quantile which must lie between 0 and 1, this value implies that the path graph on three vertices can never be accessible for any fitness difference if (almost) all loci need to mutate from one end of the graph to the other. For higher $A$ this effect becomes more pronounced. A possible biological application of the path graph is the description of copy-number variants of genes [2]. Since the complete graph on two vertices without reversions has $\beta^* = 1$ as shown before in (33) and adding edges can only decrease $\beta^*$, it is actually required that the distance between $a_l$ and $b_l$ on the allele graph is at least 2 in order for $\beta^* > 1$ to be possible.
Fig. 5 Example of an allele graph that leads to a problem setup of irregular type with $\beta^* < 1$
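The loss of accessibility on the path graph can be illustrated numerically. The sketch below assumes that $\beta^*$ for end-to-end mutation on every locus solves $(e^{\beta \mathbf{A}})_{ab} = 1$, where $\mathbf{A}$ is the adjacency matrix of the path graph and $a$, $b$ are its two end vertices; the helper names and the truncation order are choices made for this example.

```python
import math

def expm_entry(adj, a, b, beta, terms=200):
    """Entry (a, b) of exp(beta * adj) from the truncated exponential series."""
    n = len(adj)
    vec = [0.0] * n
    vec[a] = 1.0
    total, coeff = 0.0, 1.0   # coeff holds beta^k / k!
    for k in range(terms):
        total += coeff * vec[b]
        vec = [sum(vec[i] * adj[i][j] for i in range(n)) for j in range(n)]
        coeff *= beta / (k + 1)
    return total

def path_graph(num_alleles):
    """Adjacency matrix of the undirected path 0 - 1 - ... - (num_alleles - 1)."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(num_alleles)]
            for i in range(num_alleles)]

def beta_star_path(num_alleles, hi=10.0):
    """Bisection for the end-to-end entry of the matrix exponential being equal to 1."""
    adj, lo = path_graph(num_alleles), 0.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if expm_entry(adj, 0, num_alleles - 1, mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for A in (2, 3, 4, 5):
        print(A, beta_star_path(A))
```

For three vertices the end-to-end entry is $(\cosh(\sqrt{2}\beta) - 1)/2$, so the root is $\operatorname{arcosh}(3)/\sqrt{2}$, which exceeds 1. The values grow further with the number of alleles.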

Example of non-trivial irregular type
Many graphs seem to be of completely regular type in the sense that no matter which sequence of pairs $(a_L, b_L)$ is chosen, the problem is always of regular type. Martinsson [20] considered different sufficient conditions on graphs to have this property, but he also lists the smallest graph, of order 4, which does not have it. While this example demonstrates that it is possible to have problems of irregular type, it can also be used to generate semi-regular, but not regular, problem types by carefully interpolating the divergence matrix (2) between a regular and an irregular type pair of alleles.
However, the example shown by Martinsson turns out to have $\beta^* > 1$, which automatically implies asymptotic inaccessibility in the accessibility percolation context due to the defined range of $\beta = F_{b_L} - F_{a_L}$ as a difference of uniform random variables. We therefore searched numerically for the smallest graph without the regularity property and with $\beta^* \le 1$, and found the example in Fig. 5, which has $\beta^* \approx 0.983$.

Genetic code
While the complete graph with $A = 4$ may serve as a model for the allele graph of single-nucleotide mutations on DNA or RNA, the expected effect of such a substitution depends significantly on whether or not it changes the amino acid that is encoded by the corresponding three-nucleotide codon. Mutations not affecting the encoded amino acids are known as synonymous. To specifically model the fitness effects of non-synonymous point mutations we therefore consider the allele graph of all amino acids with edges representing the mutual reachability by single-nucleotide substitutions (Figure 6). This graph is considerably less symmetric than the complete graph and in particular the resulting quantities $\beta^*$ and $\Gamma'_*$ will depend on the particular choices of the path endpoints $a_L$ and $b_L$ rather than simply on their distance. We consider here all pairings of amino acids $a_l$ and $b_l$, assuming them to be equal for all loci. Other cases may be interpolated from these. The results are shown in Table 2. Whether the given values determine the asymptotic behavior of accessibility exactly depends on whether the regularity criteria relating to Martinsson's function (9) are satisfied. Due to the degree of the graph we limited ourselves to numerical tests, which did not indicate any violation of the criteria, although such violations may be more subtle than our tests could verify.
The critical point $\beta^*$ and in particular the expected walk length $\beta^*\Gamma'_* L$ are, as one would expect, strongly correlated with the distance between alleles. The only distance-3 pair of amino acids is Tyr/Met, which also corresponds to the largest walk length with a value of $\beta^*\Gamma'_* \approx 4.7567$. All other amino acids lie at mutual distance 1 or 2. Nonetheless, the critical point $\beta^* \approx 0.4527$ for the distance-3 pair lies slightly below that of the distance-2 pair Asp/Met with $\beta^* \approx 0.4570$, demonstrating that the overall structure of the allele graph can have a significant impact on accessibility beyond distance.
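The allele graph of amino acids can be built directly from the standard genetic code. The sketch below is one possible construction and not necessarily the one used for Figure 6 and Table 2: it assumes that the stop signal is kept as a 21st node and that two nodes are joined by a simple edge whenever any of their codons differ in a single nucleotide. All names are chosen for this example.

```python
from collections import deque
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated with bases in the order T, C, A, G; '*' is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def amino_acid_graph():
    """Neighbor sets: two amino acids are adjacent if a single-nucleotide substitution connects them."""
    neighbors = {aa: set() for aa in set(AMINO_ACIDS)}
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                mutant = codon[:pos] + base + codon[pos + 1:]
                other = CODON_TABLE[mutant]
                if other != aa:          # skip synonymous changes and the unchanged codon
                    neighbors[aa].add(other)
    return neighbors

def distance(graph, source, target):
    """Graph distance by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None

if __name__ == "__main__":
    graph = amino_acid_graph()
    print(distance(graph, "Y", "M"), distance(graph, "D", "M"))
```

With this construction Tyr (Y) and Met (M) lie at distance 3 and Asp (D) and Met (M) at distance 2, in line with the distances quoted in the text.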
For comparison, the accessibility of paths between any pair of codons can be obtained from the values for the complete graph with 4 alleles (Table 1). This gives $\beta^* \approx 0.51$, while accounting for the multiplication of three bases per codon yields the expected mean critical walk length per codon as $\beta^*\Gamma'_* \approx 5.5$. Relative to the complete graph on four alleles the codon graph allows arbitrary synonymous mutations without cost, causing the reduction in the critical fitness difference as well as in the length of walks at the critical point. However, when compared to the complete graph on 21 alleles with $\beta^* \approx 0.154$ and critical walk length factor 3.22 (Table 1), which would permit direct mutations between arbitrary amino acids, the fitness cost is still significantly higher on the codon graph, whereas the critical walk length on the codon graph is scattered around that of the 21-allele complete graph.

First moment bound and walk length
We start with a proof for the first part of Theorem 1 from an upper bound for accessibility based on the mean number of accessible paths, or rather the mean number of quasi-accessible walks.We define the term quasi-accessible as a generalization of the notion of accessibility used up to now as explained in the following.

Quasi-accessibility
In the original definition of accessibility, a non-self-avoiding walk is never accessible, because it would have to visit the same fitness value twice, which makes it impossible for the walk to have strictly increasing fitness. Handling self-avoidance is non-trivial. To remedy this in a simpler manner, instead of considering self-avoiding paths on $\mathcal{A}^L$, we consider an extension of $\mathcal{A}^L$ to ${\mathcal{A}^L}'$ as follows:
Definition 14 The extended genotype space of a genotype space $\mathcal{A}^L$ is the simple directed graph ${\mathcal{A}^L}'$ with vertex set ${\mathcal{A}^L}' = \mathcal{A}^L \times \mathbb{N}$ and an arrow from $(v, n)$ to $(w, m)$ iff there is an arrow from $v$ to $w$ in $\mathcal{A}^L$.
In other words we duplicate every genotype a countably infinite number of times in such a way that traversal of one of its copies can always be replaced by traversal of another copy. The 1-section containing all vertices of the form $(v, 1)$ can be identified with the vertices on $\mathcal{A}^L$. We then assign each of the vertices in ${\mathcal{A}^L}'$ i.i.d. fitness values:
Definition 15 An extended HoC model over a genotype space $\mathcal{A}^L$ is the HoC model over the extended genotype space of the genotype space $\mathcal{A}^L$.
The mentioned 1-section then corresponds to the original HoC model and we can identify realizations of the extended HoC model with the corresponding realizations of the original HoC model with equal fitness values on the 1-section. All other fitness values do not affect this underlying model. However, it is convenient to introduce these additional fitness values for the following reasons.
We define the following map of walks on $\mathcal{A}^L$ to ${\mathcal{A}^L}'$. Each self-avoiding walk is mapped to the corresponding walk on the 1-section of ${\mathcal{A}^L}'$. But instead of mapping non-self-avoiding walks from $\mathcal{A}^L$ to the 1-section of ${\mathcal{A}^L}'$, we can make use of the additional vertex copies to replace all vertices that are visited multiple times in $\mathcal{A}^L$ with distinct copies in ${\mathcal{A}^L}'$. To make this unique, we assume that the $n$-th visit of vertex $v$ in $\mathcal{A}^L$ is mapped to the vertex $(v, n)$ in ${\mathcal{A}^L}'$, except if $v$ is the final vertex of the walk, in which case we map the $n$-th visit in reverse order to $(v, n)$ in ${\mathcal{A}^L}'$. The resulting walk is always self-avoiding in ${\mathcal{A}^L}'$ and the special case assures that every walk in $\mathcal{A}^L$ is mapped to a walk with endpoints on the 1-section in ${\mathcal{A}^L}'$.
Definition 16 A walk on $\mathcal{A}^L$ is quasi-accessible if the corresponding mapped walk on ${\mathcal{A}^L}'$ per the rules above is accessible on ${\mathcal{A}^L}'$. A genotype is said to be quasi-accessible from another if there exists a quasi-accessible walk from the latter to the former.
Definition 17 A walk on ${\mathcal{A}^L}'$ is valid if there is a walk in $\mathcal{A}^L$ which is mapped to it according to the rules above. An (extended) genotype $h'$ on ${\mathcal{A}^L}'$ is said to be valid-accessible from a genotype $g'$ on ${\mathcal{A}^L}'$ if there exists a valid-accessible walk from the latter to the former on ${\mathcal{A}^L}'$.
With these definitions the probability of any walk on $\mathcal{A}^L$, whether self-avoiding or not, to be quasi-accessible is the same, only depending on the length of the walk. Furthermore the notion of valid-accessibility on ${\mathcal{A}^L}'$ coincides with quasi-accessibility on $\mathcal{A}^L$. We denote the number of valid-accessible walks from $(v, 1) \in {\mathcal{A}^L}'$ to $(w, 1) \in {\mathcal{A}^L}'$, or equivalently quasi-accessible walks from $v$ to $w$ with $v, w \in \mathcal{A}^L$, by $\tilde Z_{vw}$. Additionally, while quasi-accessibility is different from accessibility for individual walks on $\mathcal{A}^L$, accessibility and quasi-accessibility of one genotype from another on $\mathcal{A}^L$ coincide:
Lemma 1 For all $v, w \in \mathcal{A}^L$, $P[Z_{vw} \ge 1] = P[\tilde Z_{vw} \ge 1]$.
Proof If $w$ is accessible from $v$, then there exists a walk from $v$ to $w$ which is accessible and therefore also quasi-accessible, implying $P[Z_{vw} \ge 1] \le P[\tilde Z_{vw} \ge 1]$. If $w$ is quasi-accessible from $v$, then there exists a valid walk on ${\mathcal{A}^L}'$ from $(v, 1)$ to $(w, 1)$ which is accessible (with respect to ${\mathcal{A}^L}'$). Removing all vertices $(u, n)$ with $n \neq 1$ from this walk results in another walk on ${\mathcal{A}^L}'$ which is completely located on the 1-section due to the validity requirement. Because this walk is obtained only by removal of vertices, it is also valid-accessible, and because it must be self-avoiding on the 1-section, it is also an accessible walk from $v$ to $w$, implying the other side of the equality. □
This implies that we can restrict our investigation to quasi-accessibility.
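The map from walks on the genotype space to walks on the extended space can be written down in a few lines. The sketch below follows the rule stated above (the $n$-th visit of a vertex goes to copy $n$, with the visits of the final vertex numbered in reverse order) and assumes that the start vertex differs from the final vertex. The function names and the dictionary representation of fitness values are choices made for this example.

```python
from collections import Counter

def lift_walk(walk):
    """Map a walk (list of genotypes) to a self-avoiding walk on the extended space.

    Assumes walk[0] != walk[-1]. Returns a list of (genotype, copy) pairs.
    """
    final = walk[-1]
    visits_of_final = walk.count(final)
    seen = Counter()
    lifted = []
    for v in walk:
        seen[v] += 1
        if v == final:
            # reverse numbering: the last visit of the final vertex gets copy 1
            copy = visits_of_final - seen[v] + 1
        else:
            copy = seen[v]
        lifted.append((v, copy))
    return lifted

def is_quasi_accessible(walk, fitness):
    """True if fitness strictly increases along the lifted walk.

    fitness maps (genotype, copy) pairs to fitness values.
    """
    values = [fitness[x] for x in lift_walk(walk)]
    return all(a < b for a, b in zip(values, values[1:]))

if __name__ == "__main__":
    print(lift_walk(["a", "b", "a", "c", "b"]))
```

In this example both endpoints land on the 1-section and every lifted vertex is distinct, which is what makes the probability of quasi-accessibility depend only on the walk length.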
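In order to give an upper bound on $\beta$-quasi-accessibility and with that a proof of Theorem 1.1, we will consider the mean number of quasi-accessible walks from $a_L$ to $b_L$. Each walk of length $N$ from $a_L$ to $b_L$ on $\mathcal{A}^L$ is $\beta$-quasi-accessible with probability
$$\frac{\beta^{N-1}}{(N-1)!},$$
where the numerator accounts for the probability that all inner vertices of the walk are found inside the range of fitness values $F_{a_L}$ to $F_{b_L}$ and the denominator accounts for the increasing order required on these values. The number of walks taking $n$ steps from $a_l$ to $b_l$ on one locus $l$ is given by $(A^n)_{a_lb_l}$. A walk of length $N$ could take each step on any of the loci, so that the total number of walks of length $N$ can be written as
$$\sum_{n_1 + \dots + n_L = N} \binom{N}{n_1, \dots, n_L} \prod_{l=1}^{L} (A^{n_l})_{a_lb_l},$$
where $\binom{N}{n_1, \dots, n_L}$ is the multinomial coefficient accounting for the different orderings of steps on individual loci. Multiplication of this expression with the probability of quasi-accessibility of each such walk gives the mean number of quasi-accessible paths
$$E[\tilde Z_{a_Lb_L}] = \sum_{N} \frac{\beta^{N-1}}{(N-1)!} \sum_{n_1 + \dots + n_L = N} \binom{N}{n_1, \dots, n_L} \prod_{l=1}^{L} (A^{n_l})_{a_lb_l}.$$
The term $(N-1)!$ can be reduced to $N!$ by introduction of a derivative, and redistributing all the factorials and $\beta^N$ into the product yields
$$E[\tilde Z_{a_Lb_L}] = \partial_\beta \sum_{n_1, \dots, n_L} \prod_{l=1}^{L} \frac{\beta^{n_l} (A^{n_l})_{a_lb_l}}{n_l!}.$$
Finally the sums and the product can be interchanged and
$$E[\tilde Z_{a_Lb_L}] = \partial_\beta \prod_{l=1}^{L} (e^{\beta A})_{a_lb_l} = \partial_\beta e^{L\Gamma_{a_Lb_L}(\beta)} = L\,\Gamma'_{a_Lb_L}(\beta)\, e^{L\Gamma_{a_Lb_L}(\beta)}.$$
We now choose $\beta$ based on Theorem 1 as $\beta = c_L - |g(L)|/L$, where $g(L) = \omega(1)$. Because $\hat\beta$ converges to a positive value per Proposition 4, around which $\Gamma_{a_Lb_L}(\beta)$ and its derivatives are bounded, a Taylor expansion of $\Gamma_{a_Lb_L}(\beta)$ gives
$$\Gamma_{a_Lb_L}(\beta) = \Gamma_{a_Lb_L}(\hat\beta) + \Gamma'_{a_Lb_L}(\hat\beta)\,(\beta - \hat\beta) + O\big((\beta - \hat\beta)^2\big).$$
By definition $\Gamma_{a_Lb_L}(\hat\beta) = 0$ and by Proposition 4 $\Gamma'_{a_Lb_L}(\hat\beta)$ converges to a positive value $\Gamma'_*$, so that inserting the difference $\beta - \hat\beta$ then gives $E[\tilde Z_{a_Lb_L}] \to 0$. Then, by Markov's inequality we have $P[\tilde Z_{a_Lb_L} \ge 1] \le E[\tilde Z_{a_Lb_L}]$, implying that quasi-accessibility converges to zero, which then by Lemma 1 also implies that accessibility of $b_L$ from $a_L$ converges to zero, proving part 1 of Theorem 1. □
Corollary 1 By replacing $-|g(L)|$ by a constant value $\eta$ in the above, the expected number of quasi-accessible walks will converge to a non-zero value, and increasing $\eta$ allows the limit to be increased arbitrarily.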
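The first moment computation can be checked on a small example by counting walks on the product graph explicitly. The sketch below assumes the closed form $E[\tilde Z] = \partial_\beta \prod_l (e^{\beta \mathbf{A}})_{a_lb_l}$, which for the biallelic hypercube reduces to derivatives of products of $\sinh\beta$ and $\cosh\beta$. The function names and the truncation length are choices made for this example.

```python
import math

def walk_counts(adj, start, end, max_len):
    """Number of walks of each length 0..max_len from start to end on the Cartesian power graph.

    Genotypes are tuples of allele indices; one step changes one locus along an allele-graph arrow.
    """
    counts = {start: 1}
    result = []
    for _ in range(max_len + 1):
        result.append(counts.get(end, 0))
        nxt = {}
        for g, c in counts.items():
            for l, allele in enumerate(g):
                for new_allele, edge in enumerate(adj[allele]):
                    if edge:
                        h = g[:l] + (new_allele,) + g[l + 1:]
                        nxt[h] = nxt.get(h, 0) + c
        counts = nxt
    return result

def mean_quasi_accessible(adj, start, end, beta, max_len=60):
    """Truncated sum over N of (number of walks of length N) * beta^(N-1) / (N-1)!."""
    counts = walk_counts(adj, start, end, max_len)
    return sum(counts[n] * beta ** (n - 1) / math.factorial(n - 1)
               for n in range(1, max_len + 1))

if __name__ == "__main__":
    hypercube_allele_graph = [[0, 1], [1, 0]]
    beta = 0.7
    print(mean_quasi_accessible(hypercube_allele_graph, (0, 0, 0), (1, 1, 1), beta))
    print(3 * math.sinh(beta) ** 2 * math.cosh(beta))
```

The two printed values should agree up to the truncation error of the series, and by Markov's inequality either one bounds the probability that at least one quasi-accessible walk exists.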

Proof of Theorem 3
A more refined version of the previous argument can be used to prove Theorem 3. Specifically the expected number of $\beta$-quasi-accessible walks can be separated into intervals of walk lengths. Let $h_N$ be the expected number of $\beta$-quasi-accessible walks of length $N$ with $\beta$ as in Theorem 3 for some fixed value $\eta$. This number is an expectation value over realizations of fitness values, but in the following we will consider it as just a number indexed by some number $N$ representing the walk length. Summation of all of these numbers then yields the total expected number of $\beta$-quasi-accessible walks which we calculated already above:
$$\sum_N h_N = \partial_\beta \prod_{l=1}^{L} (e^{\beta A})_{a_lb_l}.$$
We can interpret this as the value $\varphi(1)$ of the function
$$\varphi(z) = \partial_{z\beta} \prod_{l=1}^{L} (e^{z\beta A})_{a_lb_l} = \sum_N h_N z^{N-1}.$$
This function can be viewed as an (ordinary) generating function for the sequence $h_N$ shifted by one. The generating function here is not related to the probability distribution of fitness values, but is rather to be understood as simply a counting tool that separates the total expectation value into slots for different walk lengths using the additivity of the expectation value. The effect of the derivative $\partial_{z\beta}$ in the generating function can be reversed by integration of each of the monomials, so that
$$\prod_{l=1}^{L} (e^{z\beta A})_{a_lb_l} = \sum_N \frac{\beta}{N} h_N z^N$$
is the generating function of the unshifted $h_N$ multiplied by $\frac{\beta}{N}$, which is another sequence that we define as $\tilde h_N$. Normalizing this integrated function through division by its value at $z = 1$, which is $e^{L\Gamma_{a_Lb_L}(\beta)}$, turns the generating function into a probability generating function over the parameter $N$ as random variable and this allows us to apply theorems from probability theory. Again, this probability is not related to the distribution of fitness values, but is introduced here artificially as a counting tool. The integrated (probability) generating function factorizes over loci as
$$\prod_{l=1}^{L} \frac{(e^{z\beta A})_{a_lb_l}}{(e^{\beta A})_{a_lb_l}}.$$
As a consequence the random variable $N$ under the generating function's distribution can be written as a sum $N = \sum_{l=1}^{L} n_l$, where the $n_l$ are independent random variables with probability generating functions
$$\frac{(e^{z\beta A})_{a_lb_l}}{(e^{\beta A})_{a_lb_l}}.$$
This can be seen going in the reverse direction, as the generating function of the sum of independent random variables is the product of the individual generating functions of the summands. Because the degree of $\mathcal{A}$ is bounded by $\Delta$, $(A^{n_l})_{a_lb_l} \le \Delta^{n_l}$ and the tail of the distribution is dominated by an exponential. This bound is also independent of the chosen alleles $a_l$ and $b_l$, and with the chosen $\beta$ converging in $L$, the central limit theorem applies to the sum $N$. The mean $L\mu$ and variance $L\sigma^2$ of $N$ under this distribution can be obtained from the first and second derivatives of the probability generating function as
$$\mu = \beta\,\Gamma'_{a_Lb_L}(\beta), \qquad \sigma^2 = \beta^2\,\Gamma''_{a_Lb_L}(\beta) + \beta\,\Gamma'_{a_Lb_L}(\beta),$$
and the central limit theorem implies that for constants $c > 0$: where $\Phi$ is the standard normal CDF. Since the sum's upper bound is asymptotic to $\mu L$, for all $\tilde h_N$ terms appearing in the sum $\tilde h_N \ge \frac{\beta}{\mu L} h_N$, so that In particular with the choice of the threshold function of Theorem 3 for $\beta$: This allows one to reduce the mean number of $\beta$-quasi-accessible walks of length outside the interval $\mu L \pm c\sigma\sqrt{L}$ to any arbitrarily small value by choosing $c$ large enough. In other words, if there are $\beta$-quasi-accessible walks at the suggested threshold function, then they are of length $\mu L$ with fluctuations of at most order $\sqrt{L}$. Since all $\beta$-accessible walks are also $\beta$-quasi-accessible walks, the same length constraint then also applies to accessible paths at the threshold function, completing the proof of Theorem 3.
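The per-locus walk length distribution can be made concrete for the biallelic hypercube at full distance. The sketch below assumes that the per-locus probability generating function is $\sinh(z\beta)/\sinh(\beta)$, so that only odd step counts $n$ occur, with weight proportional to $\beta^n/n!$. It compares the numerical mean and variance with $\beta\Gamma'(\beta)$ and $\beta^2\Gamma''(\beta) + \beta\Gamma'(\beta)$ for $\Gamma(\beta) = \ln\sinh\beta$. The function names and the truncation are choices made for this example.

```python
import math

def step_distribution(beta, max_steps=80):
    """Probabilities P(n_l = n), n = 0..max_steps-1, for one biallelic locus that must flip."""
    weights = [beta ** n / math.factorial(n) if n % 2 == 1 else 0.0
               for n in range(max_steps)]
    norm = sum(weights)  # approximates sinh(beta)
    return [w / norm for w in weights]

def mean_and_variance(probs):
    """Mean and variance of a distribution given as a list indexed by the value."""
    mean = sum(n * p for n, p in enumerate(probs))
    second = sum(n * n * p for n, p in enumerate(probs))
    return mean, second - mean ** 2

if __name__ == "__main__":
    beta = math.asinh(1.0)  # critical value of the biallelic hypercube at full distance
    mu, var = mean_and_variance(step_distribution(beta))
    coth = math.cosh(beta) / math.sinh(beta)
    print(mu, beta * coth)
    print(var, beta * coth - beta ** 2 / math.sinh(beta) ** 2)
```

At the critical value the mean per locus equals $\sqrt{2}\ln(1+\sqrt{2})$, which is the walk length prefactor of the biallelic case.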

Proof of Theorem 2
The upper bound on accessibility obtained from the expected value does not take into account any dependence between walks. We can improve the bound by including some of the dependencies. This will make it possible to prove our Theorem 2, i.e. to show that $\beta_\text{c} > \beta^*$ for irregular types.
Let in this section $0 < r < 1$ and $0 < s < 1$ be constant. The intention is to choose them later such that $M^*(s, r, \beta^*) > 0$ in the application to the irregular type. This choice is always possible in the irregular case since by continuity $M^*(s, r, \beta^*)$ cannot be strictly positive only on the boundaries.
Recall that the parameters $r$ and $s$ determine the fitness spanned by three segments of each walk, with $s$ determining the fitness fraction spanned by the middle segment, and $r$ determining the distribution of the remaining fitness span onto the first and last segment. More concretely, with $\bar s = 1 - s$ and $\bar r = 1 - r$, the intended fitness span of the first segment is $\beta\bar s r$, of the second $\beta s$ and of the third $\beta\bar s\bar r$, adding up to the full fitness span $\beta$ that needs to be crossed (see Sect. 2.2, eq. (14)).
For each arrow on the genotype space we can consider the interval formed by the fitness values of the two genotypes incident to it. If a walk from $a_L$ to $b_L$ is accessible, then it contains exactly one arrow with a fitness interval containing the fitness value $\beta\bar s r$. Let this arrow be $(x, x')$ and let $S^{12}_{x,x'}$ be an indicator variable for this fitness value falling onto the arrow $(x, x')$. Similarly there is exactly one arrow containing the fitness value $\beta(1 - \bar s\bar r)$. Let this arrow be $(y, y')$ and the corresponding indicator $S^{23}_{y,y'}$. These two arrows segment the walk in the closest possible way according to the intended fitness spans mentioned above. A walk is accessible only if each of the three segments $a_L \to x$, $x' \to y$ and $y' \to b_L$ is accessible. In the following we refer to these segments as segment 1, 2 and 3 respectively. To obtain an upper bound on the accessibility of $b_L$ from $a_L$ it is therefore sufficient to form a union bound of the form (63). We can now separate the expectation over the fitness of the intermediate genotypes $x$, $x'$, $y$ and $y'$ to bound $P[Z_{a_Lb_L} \ge 1]$ as in (65), where the dot in the probability indicates conditioning on the fitness values at the end points. The expectation is over all such fitness values satisfying the conditions $S^{12}_{x,x'}$ and $S^{23}_{y,y'}$. As a result of this conditioning, the quasi-accessibilities of the three segments mentioned in the equation are negatively dependent, so that the joint probability of the events can be upper bounded by the product of individual probabilities. In order for a segment to be accessible under the conditioning of the fitness values at the end points, all internal fitness values on the segment must fall into the fitness range between the end points and the internal fitness values must be increasingly ordered. Because the fitness values are i.i.d. in the HoC model these two properties are independent. Now suppose we condition on one (or two) of the segments being accessible with any particular choice of accessible walks. This is equivalent to conditioning all the internal fitness values of these walks in an appropriate manner as well. The accessibility requires these internal fitness values to be constrained to the fitness range of the segment's end points, which makes them unavailable as internal vertices of the remaining segment(s). However, all other fitness values are i.i.d. and unaffected by this conditioning. The probability that the remaining segment(s) is (are) then accessible is therefore smaller than if no conditioning of the internal vertices of the other accessible segments had been applied, effectively only removing walks through accessible vertices of the conditioned segments from the set of candidate walks. For example if $\tilde Z_{a_Lx}$ is at least 1, then $\tilde Z_{x'y} \ge 1$ becomes less likely, since the existence of a quasi-accessible walk from $a_L$ to $x$ implies that some fitness values of other genotypes fall in the range $[F_{a_L}, F_x]$, excluding them from consideration in the range $[F_{x'}, F_y]$ required for them to be part of a quasi-accessible walk from $x'$ to $y$.
Consequently: Since probabilities lie in $[0, 1]$, we can use the upper bound $x \le x^{1-\alpha}$ for $0 < \alpha < 1$ on the middle factor and afterwards we can apply Markov's inequality to all three terms to obtain The remaining inner expectation values depend only on the differences of the fitness values that they are conditioned on, not the actual placement of that difference. We introduce the following quantities: These quantities measure how much the fitness difference allocated to one of the three walk segments differs from what it would be assigned if $r$ and $s$ determined it exactly. For example the $(x, x')$ edge is required to contain the fitness value $\beta\bar s r$. Therefore the first walk segment can span a fitness distance of at most $\beta\bar s r$, but this happens exactly only if $F_x - F_{a_L} = \beta\bar s r$ is chosen. All other valid choices set the fitness value lower than this and $\epsilon_1$ measures the reduction of the segment's length. As it will turn out, only the point with all $\epsilon_i$ equal to zero contributes to the expectation value in leading order. Intuitively any constant offset from the intended segment length corresponds to an effective reduction of $\beta$ by a constant, resulting in an exponentially lower likelihood of walks being quasi-accessible. Nonetheless we will carry the $\epsilon_i$ through the calculation.
The remaining inner expectation values are of the same form as the simple expectation of walks from $a_L$ to $b_L$ calculated in the previous section: The expectation values are dominated by the exponential terms $e^{L\Gamma_{vw}(t)}$ with an additional linear factor $L$ resulting from the derivative. However, the derivative also adds the term $\Gamma'_{vw}(t)$. As an average over loci it can be seen that pointwise in $t$ and uniformly over $v$ and $w$, this quantity is bounded by a constant from above. However the bound is not uniform in $t$. At $t = 0$ it diverges, as can be seen from the expansion To avoid this issue we rewrite the expectation value including the sum resulting from application of the product rule of differentiation: In each summand the value is a product over terms, each of which depends only on quantities on a single locus, and the bulk of the loci contribute simply the exponential $e^{\Gamma_{v_lw_l}(t)} = (e^{tA})_{v_lw_l}$. Only the locus $l'$ gives a different contribution, namely the derivative of the exponential term, $(Ae^{tA})_{v_{l'}w_{l'}}$. Our goal is to bring eq. (68) into the form of a sum over products, such that the product factorizes in the same sense as it does for a single expectation value. In particular the current form is a sum of a product of three expectation values. If we expand each expectation as shown in eq. (74), we obtain three sums, each accounting for one special locus on which the corresponding derivative is taken. We name these special loci $l_1$, $l_2$ and $l_3$, corresponding to the means in eq. (68) in the order they appear there. The sum in the middle term can be taken out of the $(\cdot)^{1-\alpha}$ form to give an upper bound, because $1 - \alpha \in [0, 1]$ and therefore the form is subadditive. Having done so, the sum over the pair of edges on the genotype space may similarly be factorized over loci. Each edge on the genotype graph corresponds to a step on one locus. Therefore it is sufficient to sum over individual genotypes together with another special locus, and one edge on the allele graph corresponding to that locus. We denote the sum over loci for these two edges $l_{12}$ and $l_{23}$ respectively. The initial sum then factorizes over loci: Here $F_l$ is the resulting factor collecting all sums over quantities on locus $l$ and all factors of the product of the three expectations that are functions of quantities on locus $l$. Its form is implicitly dependent on $l_1$, $l_2$, $l_3$, since these three variables decide whether the contribution resulting from any of the three expectation values has the usual exponential form or that of its derivative. If $l \notin \{l_1, l_2, l_3, l_{12}, l_{23}\}$, then the contribution of all three expectation values and the edge sum is of the usual form, i.e. the exponential term of the expectation value and no sum over edges, and we give it the name $G_l$: If $l$ is equal to any of the set of special loci, then some of these exponential terms will be modified and there might be additional sums. For example if $l$ is equal to $l_3$ and $l_{12}$, but not equal to any of the other special loci, then By assumption $s \neq 0$ and also $0 \neq r \neq 1$. Then, with the fixed $s$ and $r$, $G_l$ is bounded away from zero everywhere except at the boundary, because as long as one of the matrix exponentials has non-zero argument, it contributes a finite term to the sum by adequate choice of $x_l$ and $y_l$ so that the indices of the matrix exponential become $(a_l, b_l)$. More generally $G_l$ is also uniformly bounded over $s$ and $r$, since by definition of $s$ and $r$ at least one of the matrix exponential arguments must be at least $\frac{\beta}{3}$, epsilon shifts notwithstanding. This then allows us to write each $F_l$ as a product $G_lH_l$, with $H_l$ bounded away from infinity except at the mentioned boundary. In the next section we will use the same approach with a more detailed handling of $H_l$, but here it is sufficient to apply such a simple uniform bound with a constant.
However first we consider the behavior at the boundary where all epsilon shifts force the matrix exponential arguments to become zero. Due to the bounded degree of the graph, as $t \to 0$, the diagonal terms of the matrix exponential with argument $t$ drop to 1 and the off-diagonal ones to 0 uniformly. If $a_l \neq b_l$, $F_l$ therefore falls to zero as all the $\epsilon_i$ reach their maximum boundary and similarly it falls to 1 for $a_l = b_l$. Since there is by assumption at least a finite fraction of loci with $a_l \neq b_l$, this then implies that eventually, at a finite distance to the boundary for some $C < 1$. The special loci on which $F_l \neq G_l$ are not relevant to this, since there are only finitely many of them and each one is bounded. The contribution to the probability from the boundary is therefore asymptotically zero, since the exponential decay in the integrand cannot be compensated by the additional $L^5$ factor from the special loci sum.
Returning to the general case away from the boundary, we can bound all $H_l$ with $l \in \{l_1, l_2, l_3, l_{12}, l_{23}\}$ by some constant $C$ uniformly, yielding a factor of at most $C^5$, while all other $H_l$ are 1. This removes the dependence of the product on the particular choice of the special loci: with The $\epsilon_i$ are always non-negative in the valid domain and $T$ is decreasing in all of them. Therefore we can give an upper bound by setting all of them to 0 and obtain the upper bound on accessibility: with $T_0$ expressed through the per-locus sums $\sum_{x_l, y_l} e^{\Gamma_{a_lx_l}(\beta\bar s r) + (1-\alpha)\Gamma_{x_ly_l}(\beta s) + \Gamma_{y_lb_l}(\beta\bar s\bar r)}$, where it is assumed that $\beta$ is constant in $L$. This value is then independent of $L$ and if it is negative, the probability that $b_L$ is accessible from $a_L$ asymptotically falls to zero exponentially. We may choose $\alpha \in (0, 1)$ as well as $s$ and $r$ freely except for their boundary values. But specifically for $\alpha$ close to zero, we obtain the following expansion by matrix multiplication: At $\beta^*$ the zeroth order term is simply zero. The coefficient of the linear order term is exactly $-M^*(s, r, \beta)$ and in the irregular type problem $r$ and $s$ can be chosen such that it is negative at $\beta^*$. With this choice there is then some suitable small $\alpha > 0$, so that $T_0$ is negative at $\beta^*$. $T_0$ is continuous as a function of $\beta$ and therefore we can then also find some $\beta > \beta^*$ such that $T_0$ is still negative at the same choice of $\alpha$, $r$ and $s$. This shows that the critical point $\beta_\text{c}$ is strictly larger than $\beta^*$ in the irregular case, if it exists at all.
In this section we derive a lower bound on accessibility, allowing us to show that the candidate threshold function in Theorem 1 does indeed satisfy the second side of the threshold requirement for accessibility setups of regular type.

Moment bounds
To prove the lower bound on (quasi-)accessibility, we use a generalization of the second moment method. The idea of the second moment method is to bound the second moment of $\tilde Z_{a_Lb_L}$ from above in order to apply the inequality [1, 10]
$$P[\tilde Z_{a_Lb_L} \ge 1] \ge \frac{E[\tilde Z_{a_Lb_L}]^2}{E[\tilde Z_{a_Lb_L}^2]}.$$
$\tilde Z_{a_Lb_L}$ is bounded from above through the maximum length of walks and the bounded degree limiting the possible choices in each step, and therefore the second moment always exists. In our proof method we do however find that, at least with our non-tight bounds on it, the second moment grows too quickly for some allele graphs to give a non-trivial bound. On the other hand, for some class of allele graphs this bound may be used to obtain a sufficient bound.
To generalize the applicability of the result, we will use a modification of the second moment method which relies on a lower order moment.
Lemma 2 Let $X$ be a random variable over the natural numbers (including zero) with finite moment $\mathbb{E}[X^{1+\xi'}]$ for some $\xi' > 0$, then where an evaluation of $0 \ln 0$ is to be taken as 0.
Proof We know using Hölder's inequality that for all $0 < \xi \le \xi'$: (89) and therefore Taking the limit $\xi \to 0$ then completes the proof because where the value of $X \ln X$ at $X = 0$ is taken to be 0 by analytic continuation. □

Because the number of walks of length $N$ is at most exponential in $N$ due to the bounded degree of $\mathcal{A}$, while the probability of a walk to be quasi-accessible falls as fast as $1/N!$, the tail of $Z_{a^L b^L}$ is dominated by an exponential decay. In particular all moments of $Z_{a^L b^L}$ exist. This allows us to apply the lemma: where

In order to prove Theorem 1.2 we need to show that $\liminf \mathbb{P}[Z_{a^L b^L} \ge 1] > 0$ with the proposed threshold function. In particular we will choose for this section $\beta = c_L + \eta/L$ with some constant $\eta$.
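The limit $\xi \to 0$ in the proof of Lemma 2 can be spelled out as follows. This is a sketch under the assumption that Hölder's inequality is applied to $X = X \cdot 1_{X \ge 1}$ with exponents $1+\xi$ and $(1+\xi)/\xi$, and that $\mathbb{E}[X] > 0$; the final form of the bound is a reconstruction from that assumption, not a quotation of the lemma.

```latex
% Hoelder with p = 1+\xi, q = (1+\xi)/\xi applied to X = X \cdot 1_{X \ge 1}:
\mathbb{E}[X] \le \mathbb{E}\!\left[X^{1+\xi}\right]^{\frac{1}{1+\xi}}
                  \,\mathbb{P}[X \ge 1]^{\frac{\xi}{1+\xi}}
\quad\Longrightarrow\quad
\ln \mathbb{P}[X \ge 1] \ge
  \frac{(1+\xi)\ln \mathbb{E}[X] - \ln \mathbb{E}\!\left[X^{1+\xi}\right]}{\xi}.

% First-order expansion in \xi (with 0 \ln 0 := 0):
\ln \mathbb{E}\!\left[X^{1+\xi}\right]
  = \ln \mathbb{E}[X] + \xi\,\frac{\mathbb{E}[X \ln X]}{\mathbb{E}[X]} + o(\xi).

% Inserting the expansion and letting \xi \to 0:
\mathbb{P}[X \ge 1] \ge \mathbb{E}[X]\,
  \exp\!\left(-\frac{\mathbb{E}[X \ln X]}{\mathbb{E}[X]}\right).
```

In this form the bound is non-trivial whenever $\mathbb{E}[X]$ stays away from zero and the exponent stays bounded. Identifying the exponent with the quantity called $K$ in the text is likewise an assumption.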
By Corollary 1, $\mathbb{E}[Z_{a^L b^L}]$ then converges to a non-zero value. This ensures that it is sufficient to show that $K$ does not diverge. The following method of bounding $K$ adapts the idea used in [20] to account for the correlations of accessible walks, using the notion of shortcuts or arcs to obtain alternative walks from a focal one.
Let $X_\pi$ be the indicator variable that the walk $\pi$ is quasi-accessible, then where the sum is over all walks from $a^L$ to $b^L$ on $\mathcal{A}^L$ or equivalently all valid walks on $\mathcal{A}^{L\prime}$. Similarly we can expand the right-hand $Z_{a^L b^L}$ over individual walks. A graphical example for the situation described in the following can be seen in Fig. 7. We will however intentionally over-count these in the following way: Each valid walk $\pi'$ from $(a^L, 1)$ to $(b^L, 1)$ in $\mathcal{A}^{L\prime}$ trivially crosses $\pi$ in at least two vertices, namely $(a^L, 1)$ and $(b^L, 1)$. Furthermore, if we list for each valid walk $\pi'$ the vertices it shares with $\pi$ in $\mathcal{A}^{L\prime}$, then the segment of $\pi'$ between two adjacent vertices $x \in \mathcal{A}^{L\prime}$ and $y \in \mathcal{A}^{L\prime}$ in that list does not intersect $\pi$ a third time in $\mathcal{A}^{L\prime}$. We call such a segment on $\mathcal{A}^{L\prime}$ an arc through $x$ and $y$ on $\pi$. An arc is said to be trivial if it is a segment of $\pi$ itself. It follows immediately from the definition that a trivial arc can only contain a single edge. We denote the number of accessible non-trivial arcs by $Z'_{xy\pi}$. Each walk $\pi'$ generates at most one arc through $x$ and $y$ on $\pi$. Also, each walk $\pi'$ is uniquely identified by the set of arcs it generates on $\pi$, and $\pi'$ is accessible on $\mathcal{A}^{L\prime}$ if and only if all of the arcs it generates on $\pi$ are accessible. Therefore we can bound for any valid walk $\pi$: With this we have where $I_{xy\pi}$ is an indicator variable which is 1 iff $\pi$ contains $x$ and $y$ in $\mathcal{A}^{L\prime}$ and 0 otherwise. Conditioned on the two fitness values $F_x$ and $F_y$, $Z'_{xy\pi}$ becomes independent of $X_\pi$, since $x$ and $y$ are the only vertices whose fitness values influence both the quasi-accessibility of candidate arcs and that of $\pi$: For convenience we also assume that the conditioning of the fitness values $F_{(a^L,1)}$ and $F_{(b^L,1)}$ to a difference of $\beta$ is contained in the outer expectation.

At this point $Z'_{xy\pi}$ is stochastically independent of $X_\pi$, but still explicitly dependent on $\pi$ through the choice of candidate arcs that need to be counted. We can remove this dependence by loosening the restriction that included arcs must not be trivial and must not intersect $\pi$ except at $x$ and $y$. Doing so, $Z'_{xy\pi}$ is bounded from above by $Z'_{xy}$, where the lack of a third index indicates the loosened restriction. The resulting bound is not in general good enough for all choices of $x$ and $y$ in the sum. We will later revisit and adjust it for these special cases.
Because the logarithm is concave, Jensen's inequality allows us to bound the expectation of the logarithm by the logarithm of the expectation. Let $\downarrow x$ be the projection of $x \in \mathcal{A}^{L\prime}$ on the first component or equivalently the 1-section of $\mathcal{A}^{L\prime}$. Compared to all walks on $\mathcal{A}^{L\prime}$ generated from walks on $\mathcal{A}^L$ from $\downarrow x$ to $\downarrow y$, arcs from $x$ to $y$ in $\mathcal{A}^{L\prime}$ are more restricted in the number of times vertices with projection $\downarrow x$ or $\downarrow y$ may or must be visited. Therefore the expectation over $Z'_{xy}$ may be bounded by the expectation over $Z_{\downarrow x \downarrow y}$.
Similarly all walks $\pi$ through $x$ and $y$ can be separated into three segments from $(a^L, 1)$ to $x$, from $x$ to $y$ and from $y$ to $(b^L, 1)$. Each walk is uniquely determined by these three segments, and for any choice of these segments forming a valid walk, their accessibility is independent under the conditioning since valid walks are self-avoiding in $\mathcal{A}^{L\prime}$. Taking all triples of walk segments from $a^L$ to $\downarrow x$, from $\downarrow x$ to $\downarrow y$ and from $\downarrow y$ to $b^L$, all valid walks from $a^L$ to $b^L$ through any copy of the genotypes $\downarrow x$ and $\downarrow y$ in $\mathcal{A}^{L\prime}$ are generated. Together with the previous arguments, this allows for the bound Further we use that the logarithm can be bounded from above by any power law, $\ln(x) \le \bar{\alpha}(x - 1)^{\alpha}$ for $x \ge 1$, $0 < \alpha < 1$ and a constant $\bar{\alpha}$ depending on $\alpha$. In particular $\bar{\alpha}$ as a function of $\alpha$ can be chosen so that it is bounded except around $\alpha = 0$, where it must diverge. Therefore, as long as we later choose any non-zero but constant $\alpha$, the additional factor $\bar{\alpha}$ will not change the asymptotic order of $K$. All in all we have the following bound, which is also represented graphically in Fig. 8: The remaining expectation values depend only on the differences of the fitness values that they are conditioned on, not the actual placement of that difference. It is therefore convenient to use the variables $s$ and $r$ with $\bar{s} = 1 - s$ and $\bar{r} = 1 - r$ introduced previously, such that with which the outer expectation value of $K$ can be expressed as an integral over the unit square $(s, r) \in [0, 1]^2$ with a surface element $\frac{\beta^2 \bar{s}}{2}\,\mathrm{d}r\,\mathrm{d}s$, which through the factor $\frac{1}{2}$ already conditions on $F_x$ and $F_y$ being correctly ordered. We need to show that this integral is asymptotically bounded by a constant in order to show that $K$ is asymptotically bounded from above by a constant, as we intend.
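The behavior of the constant in the power-law bound on the logarithm can be explored numerically. The sketch below estimates the smallest admissible constant as the supremum of $\ln(x)/(x-1)^{\alpha}$ on a logarithmic grid; the function name `alpha_bar` and the grid parameters are illustrative assumptions, and the grid estimate can slightly underestimate the true supremum.

```python
import math

def alpha_bar(alpha, t_min=1e-9, t_max=1e15, points=4000):
    """Grid estimate of sup_{x > 1} ln(x) / (x - 1)**alpha, i.e. the smallest
    constant c for which ln(x) <= c * (x - 1)**alpha holds for all x >= 1."""
    best = 0.0
    log_lo, log_hi = math.log(t_min), math.log(t_max)
    for k in range(points + 1):
        t = math.exp(log_lo + (log_hi - log_lo) * k / points)  # t = x - 1
        best = max(best, math.log1p(t) / t ** alpha)
    return best

for alpha in (0.9, 0.5, 0.2, 0.1):
    print(f"alpha = {alpha}: constant ~ {alpha_bar(alpha):.3f}")
```

The estimated constant stays of order one for moderate $\alpha$ and grows as $\alpha$ approaches zero, in line with the remark that it is bounded except around $\alpha = 0$.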
The expectation value $\mathbb{E}[Z_{a^L b^L}]$ may be bounded using eq. (72): The same bound does not in general apply to the other expectation values uniformly over the integration domain, due to the divergence of the constant term with vanishing fitness difference. For this reason, we split the integration region. For some sufficiently small constant $\epsilon > 0$ we will consider integration in the regions with $s \in [0, \epsilon]$ and $s \in [\epsilon, 1]$ separately and label the corresponding contributions to $K$ with an index.

Case $s \in [\epsilon, 1]$
In the interval $[\epsilon, 1]$, $s$ is bounded away from zero and therefore, using eq. (72), the expectation values $\mathbb{E}[Z_{xy} \mid \beta s]$ can be bounded uniformly by For the remaining expectation values we follow the procedure used in Sect. 5 and expand with eq. (74) to obtain a sum of locus-factorized terms. We name the special loci according to the walk segment's index. As we have already expanded the contributions for the second segment, only the first and third remain: Again, the usual form for loci $l \notin \{l_1, l_3\}$ can be given through the exponential terms in the expectation values,
$$G_l = \sum_{x_l, y_l \in \mathcal{A}} e^{\Gamma_{a_l x_l}(\beta \bar{s} r) + (1+\alpha)\Gamma_{x_l y_l}(\beta s) + \Gamma_{y_l b_l}(\beta \bar{s} \bar{r}) - \Gamma_{a_l b_l}(\beta)} \tag{107}$$
and again for loci $l \in \{l_1, l_3\}$ one or more of the exponential factors will be replaced by their derivatives. For the same reasons as used previously, in these cases $H_l$ is uniformly bounded by a constant and therefore where
$$T = \left\langle \ln \sum_{x_l, y_l \in \mathcal{A}} e^{\Gamma_{a_l x_l}(\beta \bar{s} r) + (1+\alpha)\Gamma_{x_l y_l}(\beta s) + \Gamma_{y_l b_l}(\beta \bar{s} \bar{r}) - \Gamma_{a_l b_l}(\beta)} \right\rangle_l. \tag{109}$$
$T$ can be considered a function of $\beta$, $s$, $r$ and $\alpha$. At $\alpha = 0$, it is always 0, as can be verified by matrix multiplication. The first derivative with respect to $\alpha$ at $\alpha = 0$ is found to be exactly $M(s, r, \beta)$. It is therefore possible to bound In the regular case, as $\beta$ converges to $\beta^*$, $M(s, r, \beta)$ is eventually bounded from above by a negative constant in the region $s \in [\epsilon, 1 - \epsilon]$, so that in this region the integrand falls exponentially quickly to zero for a suitable choice of $\alpha > 0$, resulting in no asymptotic contribution to $K$. In the interval $s \in [1 - \epsilon, 1]$ we need to account for the boundary term at $s = 1$. At $s = 1$, $T$ is exactly $\alpha \Gamma_{a^L b^L}(\beta)$. At the candidate threshold function, $\Gamma_{a^L b^L}(\beta)$ simply evaluates to $-\frac{\ln L}{L}$ up to irrelevant higher orders in $L$. By the assumptions for the regular case we also have that the derivative $\partial_\alpha \partial_s T$ is positive at $(s, \alpha, \beta) = (1, 0, \beta^*)$, so that for some $c > 0$.
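The statement that the locus term vanishes at $\alpha = 0$ by matrix multiplication can be checked numerically for a single locus. The sketch below assumes $\Gamma_{xy}(t) = \ln (e^{tA})_{xy}$ with $A$ the adjacency matrix of the allele graph, as suggested by the matrix exponentials appearing in Eq. (119), and assumes that the three segments carry the fitness differences $\beta \bar{s} r$, $\beta s$ and $\beta \bar{s} \bar{r}$. The complete graph on three alleles and all parameter values are arbitrary illustrative choices.

```python
import math

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, t, terms=40):
    """Taylor series for exp(t*A); adequate for small matrices and moderate t."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[t * v / k for v in row] for row in matmul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def locus_term(A, a, b, beta, s, r, alpha):
    """ln of sum_{x,y} (e^{beta*sbar*r*A})_ax * (e^{beta*s*A})_xy^(1+alpha)
    * (e^{beta*sbar*rbar*A})_yb / (e^{beta*A})_ab, under the assumed Gamma."""
    sbar, rbar = 1.0 - s, 1.0 - r
    e1 = expm(A, beta * sbar * r)
    e2 = expm(A, beta * s)
    e3 = expm(A, beta * sbar * rbar)
    n = len(A)
    total = sum(e1[a][x] * e2[x][y] ** (1.0 + alpha) * e3[y][b]
                for x in range(n) for y in range(n))
    return math.log(total / expm(A, beta)[a][b])

# Complete graph on three alleles (illustrative choice).
K3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
t0 = locus_term(K3, 0, 2, beta=1.3, s=0.4, r=0.7, alpha=0.0)
slope = (locus_term(K3, 0, 2, 1.3, 0.4, 0.7, 1e-6) - t0) / 1e-6
print(f"T(alpha=0) = {t0:.2e}, finite-difference slope in alpha = {slope:.4f}")
```

At $\alpha = 0$ the three exponents add up to $\beta$, so the double sum collapses to $(e^{\beta A})_{ab}$ and the logarithm vanishes up to rounding. Under the stated assumptions, the finite-difference slope is a numerical value for the single-locus quantity that the text identifies with $M(s, r, \beta)$.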
The term $-\alpha \frac{\ln L}{L}$ exactly compensates a factor $L^{\alpha}$ in the integrand of $K$, and with a factor $\bar{s}$ in the surface element of the integration, the contribution to $K$ from $s \in [1 - \epsilon, 1]$ is then for a suitably small constant $\alpha > 0$:

Case $s \in [0, \epsilon]$
For the integration interval $[0, \epsilon]$ we will fix $\alpha = 1$, and since we cannot apply the simple bound on the expectation $\mathbb{E}[Z_{xy}^{1+\alpha} \mid \beta s]$ used before uniformly in this region, we will expand it using the sum form of the expectation value. Since $1 + \alpha = 2$ now, there will effectively be two additional sums resulting from this, for which we label the corresponding locus variables $l_{21}$ and $l_{22}$. Again, we bring the contribution into the form Here, since all expectation values in the numerator of $K$ were expanded into sums, only a single factor $L^{-1}$ remains from the expectation value in its denominator. The usual form $G_l$ is unchanged from the region $[\epsilon, 1]$ except for the choice $\alpha = 1$. As before, $H_l$ can be bounded by a constant for all special $l$, but this will turn out not to be sufficient here. Suppose we used such a bound, then we would obtain where $T$ is unchanged from the previous integration region except for the choice $\alpha = 1$. At $s = 0$ only terms with $x_l = y_l$ can contribute to $T$, and so it becomes 0 by matrix multiplication. The first derivative with respect to $s$ can be formed directly, using that derivatives of matrix exponentials correspond to multiplication with the matrix exponent. Using that $(A)_{x_l y_l} I_{x_l y_l} = 0$ since the allele graph is simple, the derivative evaluates exactly to $-\beta \Gamma'_{a^L b^L}(\beta)$, so that: As $\beta \Gamma'_{a^L b^L}(\beta)$ is strictly positive and bounded away from zero asymptotically, this shows that $\epsilon$ can always be chosen such that $T$ is negative for $s \in (0, \epsilon]$ with negative first derivative at $s = 0$. Consequently the integration at the boundary is of the form for some constant $c > 0$.
In contrast to the boundary at $s = 1$, the surface element does not contribute here and does not yield an additional factor $L^{-1}$. The naive bound shown above is not sufficient and two powers of $L$ remain that we have to suppress.
To cancel these factors, we need to bound the terms $H_l$ more carefully around small $s$ instead of applying uniform constant bounds. In particular it would be sufficient to show that these terms introduce at least two factors $s$ into the integrand, since the integration over $s^n e^{-cLs}$ would result in a value of order $L^{-1-n}$ instead of just $L^{-1}$. Depending on the choices of distances between $x$ and $y$ and the choices of the special loci, it is possible to provide these two factors. However, not all combinations of these choices yield such a factor. The problematic cases will turn out to be marginal in the sense that they only apply to a fraction $1/L$ or $1/L^2$ of the summands in the sums over special loci. Each such factor $L^{-1}$ offsets the need for one factor $s$ in the integrand, allowing the total contribution to $K$ to still be constant. In the following we list all of the relevant combinations and show the orders of $s$ they contribute.
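The scaling of the boundary integral with the number of factors $s$ can be checked directly. This is a minimal numerical sketch with arbitrarily chosen values for $c$ and $\epsilon$. It compares a midpoint-rule evaluation of $\int_0^{\epsilon} s^n e^{-cLs}\,\mathrm{d}s$ with the value $n!/(cL)^{n+1}$ of the integral over the half-line, which it approaches for large $L$.

```python
import math

def boundary_integral(n, c, L, eps=0.1, steps=100000):
    """Midpoint-rule value of the integral of s**n * exp(-c*L*s) over [0, eps]."""
    h = eps / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        total += s ** n * math.exp(-c * L * s)
    return h * total

def half_line_value(n, c, L):
    """Exact value n! / (c*L)**(n+1) of the same integrand over [0, infinity)."""
    return math.factorial(n) / (c * L) ** (n + 1)

for n in (0, 1, 2):
    for L in (100, 1000):
        ratio = boundary_integral(n, 1.0, L) / half_line_value(n, 1.0, L)
        print(f"n = {n}, L = {L}: ratio to n!/(cL)^(n+1) = {ratio:.6f}")
```

Each additional factor $s$ in the integrand therefore lowers the order of the result by one power of $L$, which is the mechanism used in the case analysis that follows.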
The method of bounding $H_l$ is to consider the small-$s$ behavior of the factors $(e^{s\beta A})_{x_l y_l}$ and $(A e^{s\beta A})_{x_l y_l}$ which appear in it. Their orders in $s$ depend on the distance $d_{x_l y_l}$ and are given in eqs. (117) and (118); due to the bounded degree of the graph, all of these bounds are uniform. Using these bounds, we can obtain the necessary factors of $s$. First consider the case of all four special loci distinct. We then have for the two loci $l_{21}$ and $l_{22}$:
$$H_{l_{2i}} = \frac{\sum_{x_{l_{2i}}, y_{l_{2i}} \in \mathcal{A}} (e^{\beta \bar{s} r A})_{a_{l_{2i}} x_{l_{2i}}} (A e^{\beta s A})_{x_{l_{2i}} y_{l_{2i}}} (e^{\beta s A})_{x_{l_{2i}} y_{l_{2i}}} (e^{\beta \bar{s} \bar{r} A})_{y_{l_{2i}} b_{l_{2i}}}}{\sum_{x_{l_{2i}}, y_{l_{2i}} \in \mathcal{A}} (e^{\beta \bar{s} r A})_{a_{l_{2i}} x_{l_{2i}}} [(e^{\beta s A})_{x_{l_{2i}} y_{l_{2i}}}]^2 (e^{\beta \bar{s} \bar{r} A})_{y_{l_{2i}} b_{l_{2i}}}}. \tag{119}$$
From eq. (117) we can see that for all distances $(e^{\beta s A})_{x_l y_l} (A e^{\beta s A})_{x_l y_l} = O(s)$, and therefore each of $H_{l_{21}}$ and $H_{l_{22}}$ contributes at least one factor $s$, resulting in a sufficient contribution of $s^2$ as explained above. If not all of the four loci are distinct, the form of $H_{l_{2i}}$ will be different. However, the only modifications are in the placement of derivatives of matrix exponentials. As long as still $l_{21} \neq l_{22}$, the relevant terms which are small around $s = 0$, namely the exponentials for the second walk segment, remain unchanged. Therefore the remaining cases are those with $l_{21} = l_{22}$, for which we will write $l_2$. This equality reduces the number of summands to consider by a factor $L^{-1}$ as discussed before, and consequently we need to find only one factor $s$. In particular, if either $l_1$ or $l_3$ is equal to $l_2$ as well, then the weight of these cases is reduced by another factor $L^{-1}$, so that no factor $s$ is required anymore. Therefore, we can focus only on the case where $l_1$, $l_2$ and $l_3$ are all distinct. For this case the contribution of locus $l_2$ is
$$H_{l_2} = \frac{\sum_{x_{l_2}, y_{l_2} \in \mathcal{A}} (e^{\beta \bar{s} r A})_{a_{l_2} x_{l_2}} [(A e^{\beta s A})_{x_{l_2} y_{l_2}}]^2 (e^{\beta \bar{s} \bar{r} A})_{y_{l_2} b_{l_2}}}{\sum_{x_{l_2}, y_{l_2} \in \mathcal{A}} (e^{\beta \bar{s} r A})_{a_{l_2} x_{l_2}} [(e^{\beta s A})_{x_{l_2} y_{l_2}}]^2 (e^{\beta \bar{s} \bar{r} A})_{y_{l_2} b_{l_2}}}. \tag{121}$$
Following again eq. (117), the numerator is of order $O(s^2)$ except if $d_{x_{l_2} y_{l_2}} = 1$, in which case there is a zeroth-order contribution. The latter case requires additional considerations to resolve. First, we consider the subcase with $d_{xy} \ge 2$. In this case it is possible that $d_{x_{l_2} y_{l_2}} = 1$, but if this is the case we always have another locus $l'$ with $d_{x_{l'} y_{l'}} \ge 1$. Following the separation of edges in the previous section, we can handle one such locus as a special locus in exchange for another sum of order $L$. However, the $x_{l'} \to y_{l'}$ factor contributions in the numerator of $H_{l'}$ will then always be $[(e^{\beta s A})_{x_{l'} y_{l'}}]^2$ without any derivatives, since $l' \neq l_2$. From eq. (117), such a factor results in a factor $s^2$, compensating the additional sum of order $L$ as well as providing the required factor $s$ in the integrand. The only remaining case is then $d_{xy} = 1$. For this case the contribution to $K$ is indeed not bounded as we require. However, this contribution turns out to be an overcounting issue introduced by our loosening of the arc restrictions on $Z'_{xy\pi}$. Specifically, if $d_{xy} = 1$, we will enforce the restriction that $Z'_{xy\pi}$ should not count the direct walk segment $x \to y$ if $\pi$ is taking this direct step. Since the direct step is always accessible given that $F_x$ and $F_y$ are ordered correctly, this segment contributes exactly 1 to the expectation value $\mathbb{E}[Z_{xy} \mid \beta s]$, which we can therefore subtract from it. This is possible even if $\pi$ does not use this direct step, since $Z'_{xy\pi} \le Z'_{xy\pi'}$ if $\pi$ does not use the trivial arc, but $\pi'$ does (Fig. 9). With this modification the value of $H_{l_2}$ becomes (122) Since $(A)_{x_{l_2} y_{l_2}} = 1$, the leading order in the numerator is now $O(s)$, which is sufficient to obtain a bounded contribution to $K$.
All in all, the total contributions to $K$ are bounded in $L$ at the candidate threshold function for the regular type, implying that there is a constant $C > 0$ such that $\liminf \mathbb{P}[Z_{a^L b^L} \ge 1] \ge C$. (123) This completes the proof of Theorem 1.2, showing that for regular setups there is indeed a threshold function $c_L$ at which accessibility jumps in a window of width $1/L$ from 0 to a non-zero value of at least $C$.
It remains to improve this bound from non-zero C to C = 1, which we expect can be done as mentioned in Section 2.2, Remark 4.

Discussion
Cartesian power graphs provide a natural framework for describing genotype spaces composed of sequences of elements drawn from a finite set of alleles A. The allele graph A encodes the possible mutational transitions on this set. Once the genotype-fitness map is specified according to the HoC model, which assigns fitness values to genotypes as i.i.d. continuous random variables, the rank order properties of the resulting fitness landscape are uniquely determined by A.
Here we have focused on the existence of fitness-monotonic paths as a measure of evolutionary accessibility [8], and proved precise results for the critical fitness difference quantile $\beta_\text{c}$ above which accessible paths exist with positive probability. Our results quantify how accessibility increases with the number of alleles |A| [27,28], and decreases when mutational transitions are blocked or become unidirectional. For certain allele graphs, such as the path graph over three or more alleles, accessible paths do not exist for any fitness difference. Moreover, a criterion based on the behavior of Martinsson's function $M(s, r, \beta)$ identifies allele graphs for which the behavior of the expected number of accessible paths is not informative about the existence of paths. In the words of Berestycki et al. [4], for such allele graphs the expectation does not "tell the truth".
The HoC accessibility problem considered here is conceptually appealing, because under the assumption of i.i.d. random fitness values, landscape accessibility is determined solely by the structure of the genotype space. However, the HoC model is not biologically realistic, as empirical fitness landscapes display varying degrees of fitness correlations [24,25]. The accessibility properties of correlated fitness landscapes differ significantly from those of the HoC model [15]. For example, for the much studied class of NK fitness landscapes, accessibility is determined by the structure of the interaction graph, and is low for most common structures [11,23]. Previous work on accessibility of NK fitness landscapes has been restricted to the biallelic case, and exploring the interplay between the allele graph and the interaction graph in determining evolutionary accessibility constitutes an interesting problem for future research.

Fig. 1 Example of a genotype space as the Cartesian graph product of two allele graphs. The two allele graphs are shown on top with the genotype space below. While the second factor graph represents a locus with three possible alleles, all of which may mutate freely from one to another, the first factor graph represents a locus on which not all mutations between the alleles are considered possible. Specifically, mutations between 0 and 2 must take an intermediate mutation through 1, and additionally the mutation from 0 to 1 is considered irreversible and does not allow backstepping to 0. Although in this work we define the genotype graph as the direct Cartesian power of a single allele graph, different allele graphs as shown here can still be modeled without loss of generality by assuming that A is the disjoint sum of the individual graphs. Since the individual constituent graphs are not connected in this sum, this does not increase the accessibility (see Remark 1)

Theorem 3
In an accessibility setup at the candidate threshold function from Theorem 1, i.e. with $\beta = c_L + \eta/L$ for some constant $\eta$, the probability that all $\beta$-accessible paths have length in the interval $\left[\Gamma'_{a^L b^L}(\beta)\,\beta L \pm g(L)\sqrt{L}\right]$ converges to 1 for every function $g(L) = \omega(1)$.

Fig. 3 Allele graph structures described in Sect. 3. Top left: Complete graph on four alleles with backmutations to the wild-type allele 0. Top right: Complete graph on four alleles with backmutations to the wild-type allele 0 removed. Bottom: Path graph on four alleles. In each case the intended initial (wild-type) and final alleles as used in this section are indicated by the labels $a_l$ and $b_l$

Fig. 6 Allele graph constructed from possible point mutations on codons. Two amino acids are connected by an arrow iff there is a possible point mutation on a single nucleotide that changes one into the other

Fig. 7 Illustration of the estimation of $Z_{a^L b^L}$. The short horizontal arrows (red) indicate the focal walk $\pi$ from $a^L$ to $b^L$. Relative to $\pi$ there is one non-trivial arc from $x_1$ to $y_1$, one non-trivial arc from $x_2$ to $y_2$ and two non-trivial arcs from $x_2$ to $y_3$. Between any two adjacent vertices on $\pi$ there is furthermore one trivial arc. All other non-empty path subgraphs are not arcs, since they intersect $\pi$ more than twice. Each walk is fully specified by choice of one of the arcs for each pair of loci. In fact it would be sufficient to choose one out-going arc per site to account for all walks. By the construction of quasi-accessibility the graph is guaranteed to be free of cycles as in the example (color figure online)

Fig. 8 Graphical representation of the bound on $K$ given in Eq. (100). The fitness range $\beta$ is split into three segments corresponding to paths $a^L \to x$, $x \to y$ and $y \to b^L$, indicated by straight arrows (black). Each segment contributes the given expectation value, conditioned on the specified fitness difference between its endpoints, to a product. The curved arrow (blue) represents the contribution of all arcs from $x$ to $y$, which contribute the given $\alpha$-dependent factor (color figure online)

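The factors $(e^{s\beta A})_{x_l y_l}$ and $(A e^{s\beta A})_{x_l y_l}$ which appear in $H_l$ behave, depending on the distance $d_{x_l y_l}$, as
$$(e^{s\beta A})_{x_l y_l} = \begin{cases} 1 + O(s^2) & x_l = y_l \\ O\!\left(s^{d_{x_l y_l}}\right) & x_l \neq y_l \end{cases} \tag{117}$$
$$(A e^{s\beta A})_{x_l y_l} = \begin{cases} O(s) & x_l = y_l \\ O\!\left(s^{d_{x_l y_l} - 1}\right) & x_l \neq y_l \end{cases} \tag{118}$$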
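The small-$s$ orders in eqs. (117) and (118) can be checked numerically on a concrete allele graph. The sketch below uses an undirected path graph on four alleles and $\beta = 1.5$, both arbitrary illustrative choices, and estimates the exponent of $s$ from the ratio of an entry at $s$ and $s/2$. The helper names are hypothetical.

```python
import math

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, t, terms=30):
    """Taylor series for exp(t*A); adequate for small matrices and small t."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[t * v / k for v in row] for row in matmul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Undirected path graph 0 - 1 - 2 - 3 (illustrative allele graph).
PATH4 = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
BETA = 1.5

def exp_entry(x, y):
    """s -> (e^{s*beta*A})_xy, with the leading 1 removed on the diagonal."""
    return lambda s: expm(PATH4, s * BETA)[x][y] - (1.0 if x == y else 0.0)

def a_exp_entry(x, y):
    """s -> (A e^{s*beta*A})_xy."""
    return lambda s: matmul(PATH4, expm(PATH4, s * BETA))[x][y]

def order_in_s(f, s=1e-3):
    """Estimate k in f(s) ~ const * s**k from the values at s and s/2."""
    return math.log2(f(s) / f(s / 2))

orders = {
    "exp - 1, d = 0": order_in_s(exp_entry(0, 0)),    # eq. (117): 2
    "exp, d = 3": order_in_s(exp_entry(0, 3)),        # eq. (117): d = 3
    "A exp, d = 0": order_in_s(a_exp_entry(0, 0)),    # eq. (118): 1
    "A exp, d = 1": order_in_s(a_exp_entry(0, 1)),    # eq. (118): d - 1 = 0
    "A exp, d = 3": order_in_s(a_exp_entry(0, 3)),    # eq. (118): d - 1 = 2
}
for name, k in orders.items():
    print(f"{name}: estimated order {k:.3f}")
```

The estimated exponents agree with the stated orders because the leading term of each entry comes from the shortest walk between the two alleles, which has $d_{x_l y_l}$ steps.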

Fig. 9 Distance-1 case for $Z'_{xy\pi}$. The dots represent an arbitrary acyclic subgraph. The distance-1 path indicated by the curved arrow (blue) is always accessible, assuming that the initial and final fitness values for $x$ and $y$ are correctly ordered, which we enforce through the integral boundaries. The focal path is also conditioned on being accessible. Assuming that $\pi$ is the focal path indicated by the straight arrows (red) and $\pi'$ the distance-1 path (blue), then in $Z'_{a^L b^L \pi'}$ all paths except $\pi'$ are counted, while in $Z'_{a^L b^L \pi}$ at least one accessible path ($\pi$) is excluded, but more paths that are not arcs may also be excluded. Therefore $Z'_{a^L b^L \pi} \le Z'_{a^L b^L \pi'}$ under the stated conditioning (color figure online)