Numerical evaluation of the transition probability of the simple birth-and-death process

The simple (linear) birth-and-death process is a widely used stochastic model for describing the dynamics of a population. When the process is observed discretely over time, despite the large amount of literature on the subject, little is known about formal estimator properties. Here we will show that its application to observed data is further complicated by the fact that numerical evaluation of the well-known transition probability is an ill-conditioned problem. To overcome this difficulty we will rewrite the transition probability in terms of a Gaussian hypergeometric function and subsequently obtain a three-term recurrence relation for its accurate evaluation. We will also study the properties of the hypergeometric function as a solution to the three-term recurrence relation. We will then provide formulas for the gradient and Hessian of the log-likelihood function and conclude the article by applying our methods for numerically computing maximum likelihood estimates in both simulated and real dataset.


Introduction
A birth-and-death process (BDP) is a stochastic model that is commonly employed for describing changes over time of the size of a population. Its first mathematical formulation is due to Feller (1939), followed by important mathematical contributions of Arley and Borchsenius (1944) and Kendall (1948Kendall ( , 1949. According to the basic assumptions of the model, when the population size at time t is j, the probability of a single birth occurring during an infinitesimal time interval (t, t+dt) is equal to λ j dt + o(dt) while the probability of a single death is µ j dt + o(dt), where λ j ≥ 0 and µ j ≥ 0 for j ≥ 0. If p j (t) is the probability of observing j individuals at time t then p j (t + dt) = λ j−1 dtp j−1 (t) + µ j+1 dtp j+1 (t) + (1 − (λ j + µ j )dt)p j (t) + o(dt) If we subtract p j (t) from both sides of the equation, divide by dt, and then take the limit of dt to zero, we obtain the well known BDP differential equation p j (t) = λ j−1 p j−1 (t) + µ j+1 p j+1 (t) − (λ j + µ j )p j (t) (1) By assuming that at time zero the size of the population was i ≥ 0, that is p i (0) = 1, we have the initial condition required to solve the differential equation (1). In this article we will focus on the simple (linear) BDP without migration (Kendall, 1949) defined by λ j = jλ and µ j = jµ. With this particular choice of parameters a starting size of zero implies λ 0 = µ 0 = 0, i.e. it remains zero for all t ≥ 0. The rate of growth does not increase faster than the population size and therefore ∞ j=0 p j (t) = 1 (Feller, 1968, Chapter 17, Section 4). When i > 0 the population becomes extinct if it reaches the size j = 0 at time t > 0. Obviously i, j, and t are not allowed to be negative, nor the basic birth and death rates. What makes this model particularly attractive for real applications is the fact that its transition probability is available in closed form (Bailey, 1964, Chapter 8) and we could, in principle, easily evaluate it. By defining φ(t, λ, µ) = e (λ−µ)t − 1 λe (λ−µ)t − µ , α(t, λ, µ) = µφ(t, λ, µ), β(t, λ, µ) = λφ(t, λ, µ) γ(t, λ, µ) = 1 − α(t, λ, µ) − β(t, λ, µ) = 1 − (λ + µ)φ (t, λ, µ) and assuming that at time 0 the size of the population was i > 0, the probability of observing j units at time t is (3) where t, λ, and µ are to be considered strictly positive if not otherwise specified and m = min(i, j). The probability of the population being extinct at time t is In the majority of applications direct evaluation of equations (2)-(12) is sufficient. However, for particular values of process parameters, equations (2) and (3) are numerically unstable ( Figure  1) and alternative methods are needed. A possible approach could be the algorithm introduced by Crawford and Suchard (2012) based on the continued fraction representation of Murphy and O'Donohoe (1975), but for this particular case where we know the exact closed form we will show that a simpler and more efficient method is available.
The remainder of the paper is organized as follows. In Section 2 we introduce the problem and find the parameter sets for which it is ill-conditioned. In Section 3 we rewrite the transition probability in terms of a Gaussian hypergeometric function and find in Section 4 a three-term recurrence relation (TTRR) for its computation. In Section 5 we extend the results to the loglikelihood function, its gradient, and its Hessian matrix. In Section 6 we apply our method to both simulated and real data. In Section 7 we conclude the article with a brief discussion.  (2) and (3). For this particular example we set i = 25, j = 35, t = 2, λ = 1 and evaluated the log-probability as a function of µ. We computed correct values in arbitrary precision with Maple TM 2018.2 (Monagan et al., 2005). Relative error is defined as |1 − logp j (t)/ log p j (t)| wherê p j (t) is the numerically evaluated transition probability.

Numerical stability
We will always assume that all basic mathematical operations (arithmetic, logarithmic, exponentiation, etc.) are computed with a relative error bounded by a value that is close to zero and small enough for any practical application (Press et al., 2007). Following this assumption and after taking into consideration floating-point arithmetic (Goldberg, 1991), equations (4)-(12) can be considered numerically stable and won't be discussed further. We will instead focus our attention on the series (2) and (3) assuming all variables to be strictly positive, including j.
Suppose to be interested in the value s m = m h=0 u h and use a naïve recursive summation algorithm for its computation, that is The relative condition number of this algorithm is (Stummel, 1980) where ρ A m is associated with perturbations in the the value of the addends while ρ R m is associated with rounding errors in the arithmetic operations. Note that when u h ≥ 0 for all h, ρ A m = 1 and the condition number depends only on rounding errors. With a compensated summation algorithm (Higham, 2002) we might significantly reduce the numerical error and evaluate accurately the sum. However, when the addends are alternating in sign, the condition number can be of large magnitude and the problem is numerically unstable even when compensating for rounding errors. In our case it is likely that the magnitude of the binomial coefficients make ρ A m a ratio between a very large number and a probability that is instead close to zero. We will now find the conditions under which the sums (2) and (3) are alternating in sign.
Proposition 1 For all λ ≥ 0 and µ ≥ 0 the function is zero if and only if t = 0. It is always positive otherwise.

2
Proof If µ = λ and t = 0 the numerator e (λ−µ)t − 1 is equal to zero but the denominator is not. When µ = λ the function becomes For all λ ≥ 0, it is equal to zero if t = 0 and positive otherwise. Assume now t > 0. When λ > µ we have e (λ−µ)t > 1 and µ/λ < 1 implying that the numerator and the denominator are always positive. When λ < µ the numerator e (λ−µ)t − 1 is negative. The denominator is also negative when e (λ−µ)t < µ/λ. Since λ < µ the left hand side is less than one while the right hand side is greater than one, proving the proposition.
2 Proof This is a direct consequence of Proposition (1) and the assumptions λ ≥ 0 and µ ≥ 0.
From Corollary 2 we have simple closed form solutions when γ(t, λ, µ) is zero and therefore we will not consider this case further. We know from Proposition 2 the conditions under which equations (2) and (3) are alternating in sign and might become numerically unstable. Looking at the example shown in Figure 1, where t = 2 and λ = 1, function γ(t, λ, µ) is non-negative when 0 < µ 0.2032. We clearly see from Figure 1 that the error steadily increases starting at the value µ ≈ 0.2032. In the next section we will find an alternative representation to equations (2) and (3) that will lead to an algorithm for their accurate evaluation.

Hypergeometric representation
Define Note that ω(i, j, t, λ, µ) is simply the first term in the summation (2). When µ = λ set Multiply and divide each term in the series (2) by ω(i, j, t, λ, µ) to get where (q) h is the rising Pochhammer symbol and 2 F 1 (a, b; c; y) is the Gaussian hypergeometric function (Slater, 1966, Chapter 1). To evaluate (14) is then sufficient to separately compute the functions ω(i, j, t, λ, µ) and 2 F 1 (−i, −j; −(i + j − 1); −z(t, λ, µ)). Partial derivatives of log p j (t) are required for computing partial derivatives of the log-likelihood function as we will explicitly show in Section 5. The following theorem proves that partial derivatives of 2 F 1 (−i, −j; −(i + j − 1); −z(t, λ, µ)) depend on hypergeometric functions of similar nature.
Theorem 1 Denote with u x (x, y) the first-order partial derivative of u(x, y) with respect to x. Similarly, denote with u xy (x, y) the second-order partial derivative with respect first to x and subsequently y. Then where m = min(i, j). Apply the same procedure to obtain the second-order partial derivatives.

Hypergeometric function
The following theorem can be considered the main result of this article.

Theorem 2 The hypergeometric function
, as a function of b, is a solution of the three-term recurrence relation (TTRR) Proof The recursion can be obtained by the method of creative telescoping (Petkovšek et al., 1996;Zeilberger, 1991). To prove that it holds, define Note that h t h is the left hand side of equation (16) By expanding the polynomial and collecting the terms with respect to h we get Doing the same with the right hand side we get proving the equality.
If we divide both sides of equation (16) by the coefficient of y b+1 , and shift the index by 1, we obtain the forward recursion Starting from we can, in principle, obtain all remaining solutions all the way up to the required term. Note that a ≥ 1 and k ≤ 1, therefore the denominator a + 1 − k is strictly positive and always well defined.
Knowing the values of y b+2 and y b+1 , for large b, we can travel the recursion also in a backward manner: Theorem 2 proves that the hypergeometric function 2 F 1 (−a, −b; −(a + b − k); −z) is a solution of a TTRR. However, equation (16) can also admit a second linearly independent solution.
Definition 1 A solution f b of a TTRR is said to be a minimal solution if there exists a linearly independent solution g b such that The solution g b is called a dominant solution.
2 It is well known that, regardless of the starting values, forward evaluation of a TTRR converges to the dominant solution while backward evaluation converges instead to the minimal solution (Gil et al., 2007, Chapter 4). We now need to find the conditions under which our hypergeometric function is either the dominant or the minimal solution.
Lemma 1 (Poincaré-Perron) Let y b+1 + v b y b + u b y b−1 = 0 and suppose that v b and u b are different from zero for all b > 0. Suppose also that lim b→∞ v b = v and lim b→∞ u b = u. Denote with φ 1 and φ 2 the (not necessarily distinct) roots of the characteristic equation φ 2 + vφ + u = 0. If f b and g b are the linearly independent non-trivial solutions of the difference equation, then and f b is the minimal solution while g b is the dominant solution. If |φ 1 | = |φ 2 | the lemma is inconclusive about the existence of a minimal solution.
Using Lemma 1 we can study the nature of our hypergeometric function as a solution of the TTRR. (16) when |z| < 1. It is a minimal solution when |z| > 1. The nature of the solution is unknown when |z| = 1.
2 Proof Our TTRR is Take the limit of the coefficients The characteristic equation is φ 2 − (1 − z)φ − z = 0 with solutions φ 1 = −z and φ 2 = 1. When |z| < 1 the solution associated with φ 1 is minimal and the one associated with φ 2 is dominant. The opposite is true when |z| > 1. We will now prove that 2 F 1 (−a, −b; −(a + b − k); −z) is associated with the characteristic root φ 2 = 1. The summation index h in equation (15) depends on b since the upper bound of the series is the minimum between a and b. Note, however, that variable a is considered known and fixed to a finite value. When b goes to infinity the summation index h in (15) does not depend on b any more and the series is always finite, so that we can safely exchange the limit of the sum with the sum of the limits: Using Stirling's approximation n! ∼ √ 2πn(n/e) n for large n, we obtain The solution is therefore dominant when |z| < 1 and minimal for |z| > 1. When |z| = 1 Lemma 1 is inconclusive about the nature of the solution.
Since Lemma 1 refers to asymptotic results, Theorem 3 is always valid for large values of b. For small values of b, instead, there is a possibility of anomalous behaviour as described by Deaño and Segura (2007). By Definition 1 we would expect that the sequence of ratios between a minimal and a dominant solution would be monotonically decreasing to zero for all b. There are cases, however, in which this is not necessarily true. A minimal solution might behave as a dominant solution up to a finite value b * and then switch to its asymptotic minimal nature only starting at b * + 1.
Definition 2 Let f b and g b be respectively the minimal and dominant solution of a TTRR as may be minimal). Let g b be any solution (not minimal) such that and let R b = |f b /g b |, then for b ≥ b − the following holds depending on the value ψ: (1) if ψ > 1, then Proof See Deaño and Segura (2007).
According to Lemma 2, if u b is negative and v b changes sign at index b * , then our asymptotic minimal solution behaves as a dominant solution up to b * − 1 or b * . We must then study the sign of the two coefficients It is dotted to represent the fact that we don't know if the solution becomes minimal at b * or b * + 1. The curve has an horizontal asymptote at b = a.
with b ≥ 1, a ≥ 1 and k ≤ 1. Since the denominators are strictly positive, we can simply study the signs of the associated quantities is negative when z > 0, positive when z < 0, and zero when b = k = 1. Define b * = (z − 1) −1 ((z + 1)a + 1 − k). v b is negative when z < 1 and b > b * or when z > 1 and b < b * . It is obviously positive in the complementary set. The point b * is the delimiter at which the coefficient v b switches from positive sign to negative sign or vice versa. When z > 1 we are under the conditions of Lemma 2, therefore the solution is surely minimal for b > b * + 1. It is pseudo-dominant for all b < b * . Not knowing the shape of the linearly independent solution g b , we don't know if the solution becomes minimal at b * or b * + 1. Interestingly, when z < −(a + 2 − k)/(a − 1), we have the opposite behaviour of a positive u b and v b changing sign from positive to negative at the same index b * . The regions are highlighted in Figure 2. Lemma 2 does not consider the case of a positive u b but we conjecture that it might be applied to this case as well. Nevertheless, as shown by the following proposition, we can simply ignore the problem altogether.
2 Proof Rewrite the function z(t, λ, µ) as It is straightforward to show that the function converges to −1 when λ, µ, or t go to infinity. The limit is never attained for finite λ, µ, or t. When any of the parameters approaches zero, instead, the function approaches positive infinity. We know from Corollary 1 that the denominator α(t, λ, µ)β(t, λ, µ) is always positive. The sign of the function z(t, λ, µ) is therefore equal to the sign of γ(t, λ, µ), which is given in Proposition 2. Same results apply when µ = λ.
Note that for z > 1, as clearly shown in Figure 2, we have to use either the forward or backward recursion depending on the value of b that we wish to evaluate. We can simplify our computations by applying the well known symmetric property 2 F 1 (−a, −b; −(a + b − k); −z) = 2 F 1 (−b, −a; −(a + b−k); −z). If b > a, swap the two variables to transform a minimal solution into a pseudo-dominant one. Using this trick we can apply the forward recursion for all z > −1.
All the previous results are summarized in Algorithm A.1 in Appendix A. Assuming a constant time for arithmetic operations the time complexity is simply O(m), where m = min(a, b), that is the total number of iterations required. Note that we only use basic arithmetic operations, saving computational time when compared to the more expensive functions found in equations (2)-(3), such as the Binomial/Gamma. Using the TTRR approach is better, from a computationally point of view, also when the problem is well-behaved.

Likelihood function
Let t = (t 0 , . . . , t S ) T be the vector of observation times with t S ≤ t, n = (n 0 , . . . , n S ) T be the corresponding observed population sizes, and τ = (τ 1 , . . . , τ S ) T = (t 1 − t 0 , . . . , t S − t S−1 ) T be the vector of inter-arrival times. When the process is observed continuously the log-likelihood function is (Darwin, 1956, Equation (25)) log L(λ, µ|t, n) = B t log λ + D t log µ − (λ + µ)X t + S−1 s=0 log n s where B t and D t are respectively the total number of births and deaths recorded during the time interval [0, t] while X t = S s=0 n s τ s+1 is the total time lived in the population during [0, t]. By convention we set τ S+1 = t − t S . From (19) we obtain the maximum likelihood estimators (MLEs) of λ and µ asλ from which follows that the MLE of the growth rate θ = λ − µ isθ =λ −μ = (B t − D t )/X t . A more challenging situation is encountered when the BDP is observed discretely at fixed time points. Rewrite the probability of transitioning from i to j in t time with birth rate λ and death rate µ as p(j|i, t, λ, µ). Since the BDP is a continuous time Markov chain (Kendall, 1949) we can write the likelihood function as L(λ, µ|t, n) = S s=1 p(n s |n s−1 , τ s , λ, µ) Note that the joint likelihood of M observations of stochastically independent processes, having the same birth and death rates, is simply the product of the M likelihoods associated with each process.
To the best of our knowledge, no known closed form solutions forλ andμ are currently available. However, in the case of equidistant sampling where τ s = τ for all s, we know that (Keiding, 1975) θ = 1 τ log n 1 + · · · + n S n 0 + · · · + n S−1 It is easy to show that the first moment ofθ does not exist. Starting with S = 1 we have The first term in the summation is not defined because the probability of extinction is strictly positive, unless the process is a pure birth process (see equations (9)- (12)). For a simple birth-anddeath process without migration the population stays extinct once its size reaches a value of zero, therefore the previous result can be extended to any value S > 1. To estimateλ,μ, andθ we must consider only observations in which the population is not immediately extinct at time point s = 1.
To find the maximum likelihood estimators we will use a numerical approach, that is the Newton-Raphson method (Bonnans et al., 2006, Chapter 4) applied to the log-likelihood function. To proceed we need its gradient and Hessian matrix, that are log p(n s |n s−1 , τ s , λ, µ) ∂ 2 ∂µ∂λ log p(n s |n s−1 , τ s , λ, µ) ∂ 2 ∂µ 2 log p(n s |n s−1 , τ s , λ, µ) with closed form solutions of partial derivatives of log-probabilities appearing in (22) and (23) given in Appendix B. They can be evaluated with our proposed TTRR approach. Note that (22) and (23) are sums of piecewise functions with sub-domains inherited from equations (2)-(8).

Applications
All results presented so far are implemented in a free Julia (Bezanson et al., 2017) package called "SimpleBirthDeathProcess". The package is released under a MIT software license and can be downloaded from https://github.com/albertopessia/SimpleBirthDeathProcess.jl.
Returning to the example shown in Figure 1, we can see from Figure 3 that our method improves significantly the accuracy of the computations. Interestingly, although not entirely unexpected, the algorithm has a higher numerical error in the neighbourhood of the special point µ = λ, that is the removable singularity of equation (2). Note that relative errors for this particular example are always less than 10 −10 and small enough for any practical application. In Figure 4 we can see a more general example where points near the line µ = λ are again associated with higher relative errors. Also in this case they are very small and always less than 10 −13 . Figure 3: Numerical relative error of the log-probability evaluated using the hypergeometric representation and the TTRR approach. Parameters are the same as in Figure 1, that is i = 25, j = 35, t = 2, and λ = 1. Relative error is always less than 10 −10 . Figure 4: Numerical relative error of the log-probability evaluated using the hypergeometric representation and the TTRR approach. Parameters for this example are i = 200, j = 100, and t = 1. Relative error is always less than 10 −13 . Table 1: Monte Carlo estimates from 10 5 simulations of a simple BDP where λ > µ. Growth rate θ = λ − µ = 0.0693 for each row, i.e. the expected population size at time t = 10 is set to be two times the initial population size n0. For each number of time points S, the three rows correspond respectively to a standard deviation of 1.25, 1.5, and 2.0 times the initial population size n0.

Simulated data
We will now study some properties of the maximum likelihood estimator of the birth rate λ, death rate µ, and growth rate θ = λ − µ. We will use our software package to perform simulations and apply standard Monte Carlo integration to approximate the bias and root mean square error (RMSE) of MLEs. The total number of simulations is set to 10 5 in each of the following synthetic experiments.

Constant growth rate
The first study mimics a situation in which both rate parameters are strictly positive. For simplicity we fix the total observation time to t = 10 and assume the process to be observed at S equidistant time points, that is every τ = t/S amount of time. To reduce the amount of possible combinations to test we choose birth and death rates so that the expected population size and standard deviation at time t is approximately proportional to the initial population size. In what follows we will always condition our estimators only to populations that are not immediately extinct, as explained in Section 5. Results of the simulations when λ > µ are shown in Table 1 while results of the simulations when λ < µ are shown in Table 2. Table 2: Monte Carlo estimates from 10 5 simulations of a simple BDP where λ < µ. Growth rate θ = λ − µ = −0.0693 for each row, i.e. the expected population size at time t = 10 is set to be half the initial population size n0. For each number of time points S, the three rows correspond respectively to a standard deviation of 0.25, 0.5, and 1.0 times the initial population size n0. Estimators are generally negatively biased but we also observe situations when λ < µ in which the bias is positive. The magnitude of the bias ofλ andμ is very large when only one time point is used, for which we have |Bias(λ)| ≈ RMSE(λ) and |Bias(μ)| ≈ RMSE(μ). Increasing the number of time points S help reducing both the bias and RMSE ofλ andμ. All estimators obviously perform worse when the standard deviation of the stochastic process is high. What is surprising to us is the observation thatθ has approximately the same performance regardless of the initial sample size. Increasing the number of time points has also the counter-intuitive effect of making the estimation worse.

Technical replicates
Following the results from the previous section we want to investigate the performance of the estimators when the stochastic process is observed more than once. As an example, this is a standard setting in dose-response drug screening experiments where cell counts are observed after a period of incubation and (usually) 3 to 5 technical replicates are produced under the same experimental conditions. For a fair comparison we will use the same simulation parameters from the previous simulation experiment with the only difference of now having three technical replicates instead of one. Results of the simulations when λ > µ are shown in Table 3 while results of the simulations  Table 3: Monte Carlo estimates from 10 5 simulations of three simple BDPs where λ > µ. Growth rate θ = λ − µ = 0.0693 for each row, i.e. the expected population size at time t = 10 is set to be two times the initial population size n0. For each number of time points S, the three rows correspond respectively to a standard deviation of 1.25, 1.5, and 2.0 times the initial population size n0. when λ < µ are shown in Table 4. As expected, we see a decrease in both bias magnitude and RMSE forλ andμ. A small improvement is obtained also forθ. Again, increasing the number of time points allow for a better estimation of λ and µ but make the estimation of θ worse. When increasing the number of time points S, the loss (gain) of performance is lower (higher) that in the single observation case of the previous section.

Real data
As an example application we will use real data from a cancer drug combination experiment originally performed and analysed by Liu et al. (2007). Briefly, two monoclonal antibodies were combined together at a concentration ratio of 1:1 to form a mixture. Tested concentrations of the mixture were 0 (no antibody), 0.025, 0.25, 2.5, and 10 µg/ml. Living cell counts were subsequently measured with a fluorescence microscopy at 1, 2, and 3 days. For each time point they performed six technical replicates for concentrations greater than zero and twelve replicates for the control dose of zero. Since the initial number of cells was not available, they estimated it from the data to be on average approximately equal to 23. Following previous studies (Crawford et al., 2014) we will fix for each and every observation an initial cell count of 23 as if it were known in advance. The complete dataset is visually represented in Figure 5. It is important to note that the dataset is Table 4: Monte Carlo estimates from 10 5 simulations of three simple BDPs where λ < µ. Growth rate θ = λ − µ = −0.0693 for each row, i.e. the expected population size at time t = 10 is set to be half the initial population size n0. For each number of time points S, the three rows correspond respectively to a standard deviation of 0.25, 0.5, and 1.0 times the initial population size n0. made of 108 independent observations, i.e. counts referring to the same concentration at different time points are not part of the same time series but are, instead, independent realizations of the same stochastic process observed at different times. In our notation, S = 1 and τ = t/S = t for each of the 108 measurements. The basic datum x i , i = 1, . . . , 108, is a vector (c i , t i , n i (0), n i (t i )) T where c i is the tested antibody concentration, t i is the time in days, n i (0) is the initial population size set to 23 for each and every observation, and n i (t i ) is the final cancer cell counts for observation i. For further details about the study and the experimental design we refer to the original article of Liu et al. (2007). To model the data we use a similar approach to that of Crawford et al. (2014), that is a linear model on the logarithm scale of the basic process rates. Formally we define Maximum likelihood estimates and their corresponding standard errors (SE) are shown in Table  5. We obtained estimates by numerically maximizing the log-likelihood function with the BFGS algorithm (Bonnans et al., 2006, Chapter 4). We applied the delta method to the observed Fisher information matrix in order to compute the standard error of all parameters. According to our model, increasing the antibody concentration has the double effect of reducing the birth rate and raising the death rate while maintaining the overall rate λ + µ approximately the Figure 5: Antibody dataset by Liu et al. (2007). All the observed counts are assumed to be originated from the same number of cells n 0 = 23. Increasing the antibody concentration reduces the growth rate of cancer cells. same. When the dose of the treatment increases the global growth rate θ decreases as a consequence, reaching a negative value at the maximum tested concentration. Interestingly, Crawford et al. obtained values that are slightly different from ours but still very close. In particular, the maximum absolute difference between our estimates of θ and theirs is just 0.054. Since their R package birth.death is not available for download any more we could not replicate the analysis and investigate the discrepancies more. We believe, however, that the observed differences are simply due to numerical errors or to a chosen solution that is a local optimum rather than a global.

Concluding remarks
Maximum likelihood estimators for the basic rates of a simple (linear) birth-and-process are available in closed form only when the process is observed continuously over time. Numerical methods are currently the only option to draw inferences for discretely observed processes. However, we showed that direct application of the well-known transition probability might be subject to large numerical error. We rewrote the probability in terms of a Gaussian hypergeometric function and found a three-term recurrence relation for its evaluation. Not only our approach led to very accurate approximations but also to a computational efficient algorithm when compared to the naïve direct summation method. By means of simulation we observed that MLEsλ andμ are largely negatively biased. We confirmed the intuition that to obtain better estimates it is important to employ a large initial population size, multiple time points, and multiple technical replicates. The actual values, as one would expect, depend on the magnitude of the basic rates, i.e. the process standard deviation. If only the growth parameter θ = λ − µ is of interest then multiple technical replicates with (surprisingly) only one time point provide the best results. Interestingly, the initial population size seems not to affect the bias nor the root mean square error ofθ.
We also released a free Julia package called "SimpleBirthDeathProcess". With the help of our tool it is possible to simulate, fit, or just evaluate the likelihood function of a simple BDP. Accurate evaluation of the log-likelihood function will create opportunities for future research, such as implementation of MCMC algorithms for Bayesian inference. As a final note, it might be worth investigating our conjecture that Lemma 2 can be extended to TTRRs with a positive coefficient.

B Gradient and Hessian of the log-transition probability
Partial derivatives of the log-transition probability are simple but cumbersome. To simplify notation we will drop function arguments (unless required for clarity) and denote the first and second order partial derivatives of a function f (x, y) with f x , f y , f xx , f xy , f yx , and f yy . We will also use the substitutions ; −z(t, λ, µ) ; −z(t, λ, µ) Note that functions u and v must be evaluated at the point z(t, λ, λ) = (λt) −2 − 1.
The same phenomenon can be observed with the second-order partial derivatives. When j = i the first-order partial derivatives converge to −it. Second-order partial derivatives (log p) λλ and (log p) µµ converge to 0 while (log p) λµ and (log p) µλ converge to i 2 t 2 . If we interpret the transition probability as the likelihood of a single time point observation, these results are intuitive. Indeed, if j = i the rates cannot be both equal to zero. If j = i, instead, the hypothesis λ = µ = 0 is plausible because it is compatible with the observation.