An estimator for the recombination rate from a continuously observed diffusion of haplotype frequencies

Recombination is a fundamental evolutionary force, but it is difficult to quantify because the effect of a recombination event on patterns of variation in a sample of genetic data can be hard to discern. Estimators for the recombination rate, which are usually based on the idea of integrating over the unobserved possible evolutionary histories of a sample, can therefore be noisy. Here we consider a related question: how would an estimator behave if the evolutionary history actually was observed? This would offer an upper bound on the performance of estimators used in practice. In this paper we derive an expression for the maximum likelihood estimator for the recombination rate based on a continuously observed, multi-locus, Wright–Fisher diffusion of haplotype frequencies, complementing existing work for an estimator of selection. We show that, contrary to selection, the estimator has unusual properties because the observed information matrix can explode in finite time whereupon the recombination parameter is learned without error. We also show that the recombination estimator is robust to the presence of selection in the sense that incorporating selection into the model leaves the estimator unchanged. We study the properties of the estimator by simulation and show that its distribution can be quite sensitive to the underlying mutation rates.


Introduction
Recombination is a fundamental evolutionary force which shuffles genetic variation along a chromosome and gives rise to new haplotypes not previously seen in a population.It is a major goal of population genetics to infer rates of recombination along the genome and to disentangle its effects from other evolutionary forces such as mutation, selection, migration, and genetic drift.However, the effects of recombination can be difficult to detect; generally the signal of recombination is weak and a single recombination event may leave no discernible trace in a sample of genetic data (Hayman et al., 2022).Typically one observes a sample from the state of the population only at the present day, while the evolutionary history of the population, which can be much more informative for recombination, is a latent, unobserved variable.A wide range of inferential methods tackle this problem by positing a generative reproductive model for the population and integrating over all possible evolutionary histories, or by approximating this idea.A popular model is the diffusion limit of a Cannings-type model for recombination, genetic drift, and mutation.Under this limit the evolution of haplotype frequencies follows the Wright-Fisher diffusion with recombination (Ohta and Kimura, 1969a,b) while the genealogical history of a sample is known as the ancestral recombination graph (ARG) (Griffiths and Marjoram, 1997).Reconstruction of ARGs is a major current endeavour (see Peñalba and Wolf, 2020, for recent review), and with the very large samples available in recent datasets it becomes ever more necessary to introduce computational and/or model heuristics.
In this paper we address a related question: in the idealised situation in which one observes the entire evolutionary history of a population, as defined via the trajectory of haplotype frequencies in the diffusion limit, can we define an estimator for the recombination rate based on this observation and derive its properties?Although observing the entire sample path of a diffusion is unrealistic in practice, we may regard the corresponding estimator as setting an upper bound on the information about recombination available to us.We note that statistical inference from a continuously observed diffusion is by now a standard problem; see Kutoyants (2004) for textbook treatment for scalar diffusions (though regularity conditions imposed throughout that work preclude most of it applying to the Wright-Fisher diffusion even in one dimension).
Further, advances in sequencing technologies are leading to growing availability of genetic data sampled from a population across different times, sometimes over very long timescales, and providing great potential for improved statistical inference (Dehasque et al., 2020); such datasets can be considered as discrete, noisy versions of the idealised setting studied in this paper.
A motivation for this work is Watterson (1979) who derived the maximum likelihood estimator ŝ for natural selection from an observation of the trajectory of a Wright-Fisher diffusion (here a diallelic, one-locus model comprising only selection and genetic drift).He found the complete distribution of the estimator.It is worth noting that in this model ŝ does not enjoy the usual desirable asymptotic properties such as consistency, since one of the alleles will almost surely go extinct in finite time and thus the total information available about the parameter up to time T remains finite as T → ∞.If we introduce bidirectional recurrent mutation to the model then it becomes ergodic, and Sant et al. (2022) have recently shown that in this situation the estimator enjoys the properties of consistency (uniformly over compact subsets of the parameter space) as well as asymptotic normality and asymptotic efficiency.We will see that, with or without mutation, the estimator for recombination behaves very differently to that of selection because the 'information' (defined formally below) can become infinite in finite time.
Essentially, in a model of selection the signal-to-noise ratio for the selection parameter remains finite on hitting a boundary of the simplex of possible frequencies, while for the recombination parameter it may not.We will see that if the information becomes infinite then the maximum likelihood estimator (MLE) for recombination becomes exact, ρMLE = ρ.
The paper is structured as follows.In Section 2 we summarise likelihood theory for a continuously observed diffusion and specialise it to the infinitesimal variance of a Wright-Fisher diffusion.In Section 3 we derive the MLE for a general Wright-Fisher diffusion with arbitrary infinitesimal drift subject only to the constraint that the drift is linear in its unknown parameters.
We then specialise this to the model of our primary interest, a multi-locus model with unknown recombination rate.Throughout we focus on the two-locus case which illustrates the main ideas without complicating the notation.Our main results are to derive an expression for the MLE and to show that if the information explodes then it is possible to learn the recombination parameter without error.Section 4 studies the impact of the presence of selection on this estimator, and in Section 5 we conduct a simulation study to investigate the empirical properties of the MLE.
We discuss some potential directions for future work in Section 6.

General case
We first give a summary of general likelihood-based inference for the parameters of a diffusion before specialising to the Wright-Fisher diffusion.Let {X(t) : t ≥ 0} be a d-dimensional diffusion process and suppose its path {X(t) : t ∈ [0, T ]} is observed up to time T .The generator of the diffusion has a form where the model has r parameters ϕ = (ϕ 1 , . . ., ϕ r ) in a parameter space Θ.We assume the drift µ = (µ 1 . . ., µ d ) can be written in the form with a(x; ϕ 0 ) ≡ 0 for a fixed reference parameter ϕ 0 , and c(•) does not contain any parameters to be estimated.(For example, later we will estimate the rate of recombination in the presence of recurrent mutation with the latter having rates fixed and known.Then a(•; ϕ) will correspond to the contribution of recombination while c(•) will correspond to the contribution of mutation, containing known mutation parameters.) We will denote the corresponding path measure on continuous functions from [0, T ] to R d by P ) is non-singular for almost all t ∈ [0, T ] and, for each ϕ ∈ Θ, where I T = (I ij ) is the r × r observed information matrix then the likelihood for ϕ takes the form of a Radon-Nikodym derivative given with respect to a dominating measure which here we have chosen to be the model parametrised by ϕ 0 so that P (T ) ϕ 0 is the distribution over paths with drift c.Under these conditions, the likelihood takes the form where The first integral in ( 5) is with respect to the path { X(t) : t ∈ [0, T ]} and the second is with respect to t.
The distribution of ∆X(t) given X(t) = x is taken as approximately normal with mean µ(x)∆t and covariance matrix V (x)∆t as ∆t → 0. If V (x) is non-singular, then the quadratic form in the exponent of the normal density of ∆X(t) is We are expressing this density with respect to another normal density with mean c(x)∆t and covariance matrix V (x)∆t, and thus we subtract the corresponding quadratic form After some rearrangement, letting ∆t → 0, and integrating from 0 to T , we recover the quadratic form appearing in (5).Note that the likelihood ignores the terms |V (x)| in the diffusion, since we assume that the likelihood is with respect to a parametric form only for µ. (Statistical inference for parameters of V would be trivial in this setting, since V is identifiable from the path via its quadratic variation.)See Basawa and Prakasa Rao (1980, Ch. 9) and Kloeden et al. (2003, Ch. 6) for further details on the general case.

Wright-Fisher diffusion
The family of Wright-Fisher diffusions has generator (1) with diffusion coefficient of the form where δ ij denotes the Kronecker delta (i.e.δ ij = 1 if i = j and δ ij = 0 if i = j).The diffusion takes values in the simplex and the domain of L is D(L) = C 2 (∆ d−1 ), twice continuously differentiable functions with domain ∆ d−1 .For now we continue to leave the drift in the form (2) but otherwise unspecified.
The matrix V (x) is singular since d i=1 x i = 1.Our first task, then, is to modify the results from Section 2.1 to accommodate this issue.We achieve this by studying the first d − 1 coordinates of X, whose infinitesimal covariance matrix V * (x) is non-singular.Fortunately, its inverse V * (x) −1 takes on a particularly simple form, as we now show.
3 Theory for maximum likelihood estimators

General Wright-Fisher diffusion
Our next goal is to derive an MLE for the parameters ϕ of a Wright-Fisher diffusion.This is found by differentiating the log-likelihood with respect to the parameters.In all the examples we encounter, the drift is a linear function of the parameters so for the remainder of this article we assume a(x; ϕ) to be of the form where does not depend on ϕ.To avoid issues of identifiability we suppose that the columns of Z = (Z ij ) are linearly independent functions.Then from Theorem 1 the log-likelihood is a quadratic function with a unique maximum, φ, in R r , which is the solution of the set of equations for k = 1, . . ., r: The equations ( 12) are familiar in regression theory.Now denote the (r × 1) vector Y = Y k with elements and let Σ(X(t)) = diag(X i (t)).Then the equations ( 12) can be written Continuing to assume (3), the matrix on the left-hand side of the previous equation is nonsingular (see Basawa and Prakasa Rao (1980, p223-224), Kloeden et al. (2003, p231)) and hence we arrive at the form Of course if Θ ⊂ R k then it is not guaranteed that φ ∈ Θ, and φ must be adjusted appropriately to ensure it is the MLE.An example of this adjustment is given later.
The observed information matrix ( 4) is a key quantity in telling us about how informative the data is for ϕ.For this model, the observed information matrix I T has elements using linearity of a i (x; ϕ); the information matrix does not depend on φ.The expression ( 13) for φ can be written

Deterministic model
As a check on the expression for φ, we can ask for the estimator we would obtain if the observed trajectory is that of the deterministic model Now from ( 2) and ( 6) we find and so we can substitute this expression for d x into the likelihood equation ( 12) to obtain that φ is a solution to Owing to the factor a i (x(t); ϕ) − a i (x(t); φ), it is clear that a solution to the likelihood equation is given by φ = ϕ.It is reassuring that the estimator is well behaved even in this crude level of approximation; the trajectory defined by ( 15) is not a realisation from the assumed model since it is a path of bounded variation.

Neutral two-locus model
We now turn to our main result, an expression for the MLE for the recombination parameter ρ ∈ [0, ∞) =: Θ.Consider a neutral two-locus model in which there are K possible alleles at the first locus, locus A, and L possible alleles at the second, locus B. The haplotype of an individual is denoted (i, j) ∈ {1, . . ., K} × {1, . . ., L}, and its frequency in the population is x ij .Note that to reconcile this double-index notation with previous sections we must implicitly stack the KL possible haplotypes in some agreed order into a vector of length d = KL.We will switch between the two notations as required.To emphasise when haplotypes have been stacked we will use a bold index, so x i denotes the frequency of haplotype i, i = 1, . . ., d.
The model is completed by specifying the drift.Here it is of the form where x i• := L l=1 x il and x •j := K k=1 x kj .Recombination occurs between the two loci at rate ρ; specifically this is a model of the homologous crossing-over that takes place during meiosis.To simplify later results, we omit the conventional factor of 1/2 in the recombination rate parameter.
In much of what follows the choice for c(x) is immaterial, but for concreteness we will set x il (P B lj − δ jl ).
Here mutation takes place at locus A and B at respective rates θ A /2 and θ B /2 on the timescale of the diffusion.When a mutation occurs, the change in allele is governed by the K × K and L × L mutation transition matrices P A and P B (i.e. if a mutation occurs at locus A on haplotype (k, j) then it mutates to haplotype (i, j) with probability P A ki , i = 1, . . ., K; similarly for P B ).We allow θ A , θ B ≥ 0, so the model may or may not be ergodic.
Note the separate roles for the two components of the drift: here it is only ρ to be estimated, with the other parameters appearing in c(•) considered known.The likelihood is expressed with respect to the parametrisation ρ 0 = 0, a model in which the two loci are completely linked but the mutation parameters are the same.
Using the results of Section 3.1, for this model the log-likelihood is where for the second equality we recall The estimator ρ is therefore and the observed information is The denominator in ρ and the information can be simplified to It is worth remarking on the functional form (17) for log L T (ρ).This is a polynomial in ρ and we can think of a trade-off between the order of the polynomial and the complexity of its coefficients.In this model we have a particularly simple quadratic polynomial in ρ, order only two, with the benefit of knowing that the function is convex with a unique finite maximum (since the coefficient of ρ 2 is negative).The price we pay is that the coefficients of the polynomial are highly cumbersome in the sense that they are given as integrals over the sample path of a diffusion.Contrast this with the dual coalescent model in which the likelihood for an observed sample path of an ARG would be a product of exponential waiting time densities times a product of rational functions for the transitions of the jump chain.With many possible jumps, these rational functions may be constructed from polynomials in ρ of very high order, though their coefficients are much simpler than the stochastic integrals encountered here.In a coalescent model, the shape of the likelihood curve as a function of ρ can be rather complicated, even exhibiting local minima when integrating over ARGs (Jenkins and Song, 2009).

Deterministic model
Is ρ in (18) a reasonable estimate?Again we can check what happens when X solves a deterministic model.Setting θ A = θ B = 0 for the moment, the deterministic model is and substituting dx ij directly into (18) again shows that ρ = ρ.
We can further describe the evolution of I T .Note that in this deterministic model (summing (20) over j): Therefore x i• (t) = x i• (0) for all t ≥ 0, and similarly for x •j (t).The solution to (20) is then We have that so (provided ρ > 0): The limit information is therefore lim As far as the deterministic model goes, the information is in the transient phase until the frequencies come to equilibrium.The accumulated information I T remains finite as T → ∞.In a stochastic model on the other hand, we will see that the injection of noise allows I T → ∞ as T → ∞.We note that one should regard this contrasting behaviour with caution: it does not mean that the estimator is consistent only in the stochastic setting.We have just seen that ρ = ρ in the deterministic setting, which is trivially consistent, and creates a paradox when we try to reconcile this fact with the asymptotic finiteness of I T .The paradox is resolved by noting that the data-generating mechanism differs from the one assumed in designing the estimator.
Had we assumed a deterministic model throughout our analysis then, since the parameter is simply a rate appearing in an observed ODE, the 'likelihood' would be a point mass on the true rate and the MLE would be equal to that true rate.The 'information' in this setting, being the curvature of the log-likelihood, is immediately infinite.The fact that ρ = ρ demonstrates that the estimator adapts automatically to a change in data-generating mechanism.The quantity I T could be regarded not as the information under the true model but as a way of quantifying 'the informativeness of the deterministic trajectory under stochastic assumptions'.It is this quantity that remains finite as T → ∞.
It is possible to repeat these calculations for a model with θ A , θ B > 0; that is, to solve the deterministic mutation-recombination equation.Again I T converges to a finite limit; see Appendix A.

Stochastic differential equation interpretation
We can find an expression for the error associated with ρ by regarding X(t) as the solution to a stochastic differential equation (SDE): where W is a (d − 1)-dimensional Brownian motion and σ There are various ways to define σ subject to this constraint.It is common to ask for σ to be lower triangular by applying a Cholesky decomposition to V .In the case of the covariance matrix of the Wright-Fisher diffusion, the Cholesky decomposition is given analytically by Sato (1976), though it is one that explodes at the boundaries.
Proposition 1.The error associated with ρ is where Proof.Rearranging (18) slightly and using d i=1 d X i (t) = 0, we have which leads to (24).
Thus the bias and mean squared error of ρ are given respectively by the expectation of the term on the right-hand side of (24) and the expectation of its square.Estimators of this form are not unbiased in general (Basawa and Prakasa Rao, 1980, p218).

Corrected MLE
There are two problems with the estimator ρ defined in (18).First, the parameter space is Θ = [0, ∞) but we cannot ensure ρ ≥ 0. (Although ρ < 0 is biologically unrealistic, mathematically it is nonetheless a valid model and a sample path may point to this region of the parameter space if d X ij (t) is sufficiently negative.)This is easily corrected by applying a rectified linear unit, max{0, ρ}.The second issue is more serious: it is not guaranteed that (3) holds.In other words, we have not ruled out the possibility that I T explodes in finite time.For observations for which I T < ∞, we can still interpret (17) as a quasi-log-likelihood function (Kloeden et al., 2003, p231), but otherwise we must treat L T (ρ) as a generalized density valid only until the stopping time See Liptser and Shiryaev (2001, Ch. 6) and Mijatović et al. (2012) for further discussion on this subtle point.Writing ρ = ρT for the estimator in (18), we define the following corrected estimator: A similar issue arises in the estimation of the immigration rate of the continuous branching with immigration (CBI) diffusion, where a related correction is proposed (Overbeck, 1998, in particular Theorem 2(iv)).The subscript in (26) rather suggestively posits this quantity as the MLE; this is proven shortly, in Corollary 1.Although taking lim t↑S ρt in (26) might seem to be unstable, that this is the appropriate correction to our estimator is justified by the following theorem.
Proof.From the definition (26) of ρMLE it suffices to show that ρ → ρ as T ↑ S. Let This is a continuous, stopped martingale with N 0 = 0 and quadratic variation where the second equality uses that •, • is a bilinear form and dW j , dW l = δ jl dt.Thus by the law of large numbers for local martingales (Revuz and Yor, 1999, Ch.V.1, Exercise 1.16, p186), lim with probability 1 on {I ∞ = ∞}.
The limit as T ↑ S is the same.But N T /I T ∧S is precisely the error ρ − ρ given in Proposition 1, so ρ → ρ as T ↑ S with probability 1.
Proof.This follows since we have separately verified that it is the MLE on {T < S} and on {S ≤ T }.In the latter case ρ is identifiable, so L T (ρ) is zero anywhere other than the true value.
Corollary 2. The error associated with ρMLE is where we recall that ρ is the uncorrected estimator given in (18).
Proof.This follows by combining Theorem 2 and Proposition 1.
The relevance of Theorem 2 is: If the sample path is such that I T = ∞, then we learn ρ without error.Inspecting the form of I T in ( 19), we see that its integrand is locally integrable in the interior of ∆ d−1 .Thus for I T = ∞ it is necessary to have at least one haplotype frequency X ij (t) → 0 before time T .We should expect the same phenomenon when inferring the mutation parameters in a one-locus model, where hitting one of the boundaries is completely informative for one of the mutation parameters.
The next result shows that having I T = ∞ is not a hypothetical concern, and the proof makes it clear that explosion of I T is intimately related with hitting a boundary of ∆ d−1 .
Proof.It is clear from the form of I T in (19) that {I T = ∞} will occur if for some i, j, (i) For some δ > 0 and for all t ∈ [0, T ], X(t) lies in Condition (iii) extracts the explosion of I T from a denominator of its integrand, while condition (i) controls the corresponding numerator.Condition (ii) ensures that such explosion takes place before time T .
To study the finiteness or otherwise of the integral in (iii), choose a decomposition σ(x)σ(x) = V (x) so that the component of the SDE (23) corresponding to X ij (t) has the form for a scalar Brownian motion W , where The idea is to show that this SDE behaves locally like a one-locus model of mutation only.More precisely we will compare X ij to another diffusion which solves the SDE for some ϑ ∈ [0, 1), P ∈ (0, 1).Choose ϑ so that ρ + θ A 2 + θ B 2 < ϑ 2 and choose P so that x i• (0)x •j (0) < P , P A i < P , and P B j < P .Then on the set and thus by a standard comparison theorem (see Theorem 1.1 and Remark 1.1 in Ikeda and Watanabe, 1977) we can construct a probability space on which Z where (For the comparison theorem to hold there is a required growth condition on the diffusion coefficient.That this holds follows from the fact that x(1 − x) is 1/2-Hölder continuous; see also Remark 3.9 on p298 of Ethier and Kurtz (1986).) Thus condition (iii) is implied by the a.s.divergence of dt as ε → 0, which in turn follows from Lemma 4.4 of Barton et al. (2004), noting that ϑ < 1 guarantees the 0-boundary for Z is accessible.[Some errors in the proof of Lemma 4.4 are corrected by Taylor (2007).]Tracing our steps backwards, we have shown that condition (iii) The conditions given in Theorem 3 simplify our proof, but it seems feasible to substantially weaken them.

Testing for the presence of recombination
It is possible to use ρMLE to design a likelihood ratio test for the null hypothesis that ρ 0 = 0.
Using (17), the appropriate likelihood ratio statistic is, for I T < ∞, Under standard assumptions, noting that ρ 0 = 0 lies on the boundary of Θ, this has an asymptotic distribution which is an equal mixture between a χ 2 1 distribution and a χ 2 0 distribution under the null hypothesis (Self and Liang, 1987).Denote the CDF of this distribution by F m .
In particular, to construct a level 5% test one should reject ρ 0 = 0 if Λ exceeds the 95th percentile of F m ; equivalently if it exceeds the 90th percentile of a χ 2 1 distribution.To account for the possibility that I T = ∞ we set The asymptotic null distribution for Λ is now less clear, though we note that continuing to assume F m would be conservative.We study the power of this test empirically in Section 5.

Multiple loci
It is possible to extend the above results to a general multi-locus model.The extension is straightforward and we omit many of the lengthy but straightforward calculations.
In a multi-locus model of loci with K j possible alleles at locus j, haplotypes are of the form Stacking the haplotypes, the diffusion coefficient has entries V ik (x) = x i (δ ik − x k ) as usual, and the unknown component of the drift is where ρ j is the recombination rate between locus j and j + 1, with each ρ j to be estimated; x i is the frequency of haplotype i; and we marginalize over a contiguous subset of loci by writing x (i 1 ,...,i ) .
An alternative model is to set ρ j = ρ for each j = 1, . . ., and to construct a single scalar estimator.Then the estimator is These estimators should be corrected as in ( 26).

The effects of natural selection
Methods for inference of recombination can be confounded by natural selection (Reed and Tishkoff, 2006;O'Reilly et al., 2008;Peñalba and Wolf, 2020).In this section we investigate the effect of selection on ρMLE , for simplicity returning to a two-locus model, though it should be straightforward to extend these results to general multi-locus models.

Confounding by selection
First consider the following: Suppose that, unknown to the investigator, the two loci are under selection-possibly a complicated type with epistatic interaction.What effect does this have on our estimator for ρ?More precisely, consider a model in which the component of the drift with parameters to be estimated is still s kl,mn x kl x mn , i = 1, . . ., K; j = 1, . . ., L.
This is a very general diploid, epistatic model of selection in which the selective advantage of an individual carrying haplotypes (i, j) and (k, l), relative to other individuals, is parametrised by s ij,kl .We are interested in the role of selection as a confounder, whereby inference is carried out using the incorrect selection parameters in the dominating measure.
From equation ( 18), c(•) has an effect on ρ only through the term d Therefore, in a model with selection we should adjust (18) by defining a new The last term, which is linear in the selection parameters, quantifies the error introduced by ignoring selection.However, it demonstrates a remarkable property in the absence of epistasis.
In that case we can write s ij,kl = s A ik + s B jl , where s A ik is the selection parameter associated with genotype ik at locus A, and similarly for s B jl .Then equation ( 27) simplifies to ρsel = ρ.
That is, we have an attractive robustness property: if an investigator uses the incorrect model for selection then the estimator ρ is unaffected provided selection is not epistatic.The observed information is also the same.Noting that Theorem 2 continues to hold when c(•) is altered, we conclude that ρMLE defined in Section 3.2.3 is still the MLE for ρ in the presence of (nonepistatic) selection.

General confounding
Returning to the general inference problem of Section 3.1, we can generalise the previous observations by asking: when does a contribution c(x) to the drift leave the estimator φ unchanged?From (5), its contribution to the estimator via d X(t) will be zero if and only if a(X(t); ϕ) V (X(t)) −1 c(X(t)) does not depend explicitly on ϕ.When the drift is linear in the parameters as in (11), this requirement becomes ( * ) ϕ Z V −1 c does not depend on ϕ.
If ( * ) holds we will say that the estimation problem for φ is robust to the contribution of c(x).
In the context of the Wright-Fisher diffusion we have the following result.
Proof.We determine ( * ) for the first d − 1 coordinates of the Wright-Fisher diffusion, with Since ∂a i ∂ϕ k (x) and c i (x) do not depend on ϕ, the above quantity is a linear combination of the ϕ k .It does not depend on any ϕ k if and only if each of its coefficients is zero, i.e. ( 28) holds.
To give another example of the applicability of Proposition 2, consider reversing the roles of selection and recombination, so that we are interested in designing an estimator for (nonepistatic) selection at locus A in the confounding presence of recombination.For simplicity we focus on a genic selection model without mutation: In this model we find and (28) holds; by Proposition 2 the estimator ŝA for (s A 1 , . . ., s A K ) is robust to recombination, as we might hope.
Using Proposition 2 it is also possible to show that the following problems are robust: (i) Estimation of mutation at locus A when there is selection at locus B, (ii) Estimation of genic selection at locus A when there is mutation at locus B; while the following problems are not robust: (iii) Estimation of recombination when there is mutation at either locus, (iv) Estimation of mutation at locus A when there is mutation at locus B, (v) Estimation of mutation at locus A when there is recombination, (vi) Estimation of mutation at locus A when there is selection at locus A, (vii) Estimation of genic selection at locus A when there is mutation at locus A; similarly for problems interchanging the two loci.We omit the straightforward calculations.We caution that in the estimation problems above, the likelihood, and thus the estimator φ, will be valid only up to time S as in (25).It is possible to have I T = ∞ even in models without recombination (in particular, we expect two path measures with different mutation parameters to be mutually singular if certain allele frequencies reach 0).

Joint estimation of recombination and selection
In contrast to Section 4.1, one might recognise the possible existence of selection and be interested in constructing a joint estimator for recombination and selection.How does the marginal estimator for ρ from this compare to those already developed?To illustrate the idea, we consider a simple genic selection model at locus A in which only allele k is under selection; that is, x il (P B lj − δ jl ), From ( 13) and ( 14) we find , and so φ = where ρ is the same estimator as we found in (18) and .
Setting θ A = θ B = 0 recovers the estimator for selection found by Watterson (1979), up to a choice of timescale.The key point is that the presence of selection as a 'known-unknown' leaves the estimator for ρ unaffected, as is clear from the diagonal nature of I T .In fact this might have been predicted even earlier: requiring I kl = 0 for the off-diagonal entry of an observed information matrix, corresponding to two parameters ϕ k , ϕ l , is essentially equivalent to the robustness condition ( * ).(To see this we identify the c i (x) term in ( * ) with Z il ϕ l , so ϕ l parametrises what would have been a confounder in ( * ).)

Simulation study
In this section we conduct an empirical study of the properties of ρMLE by simulation.Although there has been recent progress in the development of algorithms for exact simulation of certain classes of Wright-Fisher diffusion (Jenkins and Spanò, 2017;Griffiths et al., 2018;García-Pareja et al., 2021), these algorithms do not cover the non-reversible diffusions considered in this paper.
Instead we resort to simple Euler-Maruyama simulation; that is, to simulate small increments of the diffusion over a fixed, small timestep ∆t using the approximation where W (t) is the (d − 1)-dimensional Brownian motion in (23).Integrals involving the sample path of X can be approximated using Riemann sums constructed from the same set of gridpoints.
Because of its singularities at the boundaries of ∆ d−1 , the Cholesky decomposition of V (x) of Sato (1976) is perhaps not the best choice of σ(x) for the purposes of simulation, a point also noted in He et al. (2020).Instead we use a decomposition introduced by Pal (2011Pal ( , 2013)): This formulation has the advantage of being simple, symmetric, bounded in x, and vectorising easily via where I d is the identity matrix and 1 d is the d × d matrix of ones.A disadvantage is that it uses a d-dimensional Brownian motion, one dimension more than is necessary for simulation.
Using Euler-Maruyama simulation it is possible to obtain a realisation with X ij (t + ∆t) ≤ 0 for some (i, j).If this occurs we set X ij (t + ∆t) = 0, renormalize X(t + ∆t) so that X(t + ∆t) ∈ ∆ d−1 , and set I t+∆t = ∞.
As an illustration and a check that our implementation is accurate, examples of individual sample paths for ρ = 5 are shown in Figure 1 and Figure 2, together with the accumulated information, I t , and the evolving error, ρMLE − ρ, as functions of time.As is clear from the Figures, the error is stochastically converging towards 0, with erratic jumps towards 0 in regions where the information accumulates most rapidly.In the second example, in which θ A = θ B = 1, the trajectory for X 22 (t) wanders sufficiently closely to 0 that I t = ∞ for some t < T , whereupon ρMLE = ρ.The distribution of ρMLE across 100 experiments using these parameters are shown   Table 1: Distributional summaries of ρMLE for θ A = θ B = 5 (top) and θ A = θ B = 1 (bottom): mean, variance, 5th percentile, median, 95th percentile, frequency of zero error, and power to reject ρ 0 = 0 at level 5%.Each estimate is based on 100 independent replicates.
Because of the unusual behaviour of the estimator for ρ, we have refrained from providing a detailed description of its asymptotic properties such as local asymptotic normality (LAN).
Using the estimator for the immigration rate of the CBI diffusion as a guide, it should be possible to show that inference for ρ exhibits LAN along the sequence of random times T n := inf{t ∈ [0, T ] : I t = n}; see Overbeck (1998, §3.4).However, in the case I T < ∞ for each T , finding the asymptotic behaviour of ρMLE is more involved since we do not know the stationary distribution of X.
Finally, we observe that fundamental quantities appearing throughout this work are , and The role of X ij in the denominators has been to upweight the importance of those parts of the trajectories where the frequency of haplotype (i, j) is small, since these regions are more informative for ρ.In one sense this is unsatisfactory since the diffusion model is often regarded as an approximation of a discrete population of size N , and behaviour near the boundaries is inappropriate when true frequencies can only be a multiple of 1/N .One might prefer to replace the estimator ρ, which integrates each X ij (t) over [0, T ], with one that integrates X ij (t) only over some sub-region {t ∈ [0, T ] : ε ≤ X ij (t) ≤ 1 − ε}.
Even under this restriction, the quantities in (29) tell us to focus our attention on those regions where the haplotype frequency is far from 1/2.The normalizations inherent in (29) seem to offer 'natural' new normalizations for the coefficient of linkage disequilibrium, which do not correspond to the usual normalizations found in r 2 and D , for example (Sved and Hill, 2018).
Exploring the properties of these new summaries of LD will be the subject of future work.
Therefore, (30) can be written with F (t) known and θ := θ A + θ B .This can be solved via an integrating factor; the solution is where D ij (0) = x ij (0) − x i• (0)x •j (0).Since the integrand of I T decays exponentially in t, clearly I ∞ < ∞ as before.

Figure 1 :Figure 2 :Figure 3 :
Figure 1: Example trajectories in a two-locus model with two alleles at each locus, ρ = 5, and θ A = θ B = 5.Also shown in the lower plots are the trajectories of ρMLE − ρ and I t for this sample path.