Another view of sequential sampling in the birth process with immigration

We explore properties of the family sizes arising in a linear birth process with immigration (BI). In particular, we study the correlation of the number of families observed during consecutive disjoint intervals of time. Letting S(a, b) be the number of families observed in (a, b), we study the expected sample variance and its asymptotics for p consecutive sequential samples S_p = (S(t_0, t_1), …, S(t_{p−1}, t_p)), for 0 = t_0 < t_1 < ⋯ < t_p. By conditioning on the sizes of the samples, we provide a connection between S_p and p sequential samples of sizes n_1, n_2, …, n_p drawn from a single run of a Chinese Restaurant Process. Properties of the latter were studied in da Silva et al. (Bernoulli 29:1166–1194, 2023. 10.3150/22-BEJ1494). We show how the continuous-time framework makes asymptotic calculations easier than in its discrete-time counterpart.
As an application, for a specific choice of t_1, t_2, …, t_p in which the lengths of the intervals are logarithmically equal, we revisit Fisher's 1943 multi-sampling problem and give another explanation of what Fisher's model could have meant in the world of sequential samples drawn from a BI process.


Introduction
The work we describe here arose from an attempt to model the arrival of covid DNA sequences at the GISAID database [Khare et al., 2021]. The sequences arrive sequentially through time: a first sequence, a second, and so on. One model might therefore be a discrete-time process in which sequences are labeled by their order of appearance, and the type of each sequence (for example, the sequence itself) is recorded. One standard model for such counts-of-counts data is the so-called Chinese Restaurant Process (CRP), which records the accumulation of different sequence types (referred to as families in what follows) as they arrive at the database: a newly arriving sequence is either a copy of an existing sequence or a novel sequence, with a probability determined by a parameter θ and the arrival number of the sequence. da Silva et al. [2022] describe such a model in some detail, and analyse, inter alia, the behavior of the number of distinct families, S_i, seen in sequential samples of size n_i, i = 1, 2, …, p.
In fact the samples arrive continuously through time, so that the samples are (for example) those obtained in week 1, week 2, and so on. This requires a continuous-time process as its generative model, and here we choose the Yule process with immigration. We note that this is not intended to provide a mechanistic model for the way the covid sequences have evolved, but rather a description of their arrival at the database. Thus we do not need to model deaths in our process.
This simplification leads to a model for which explicit results concerning the behavior of the number of families arising in p sequential time intervals may be derived.The analysis relies on simple Marked Poisson Process arguments, for which various marking probabilities may be calculated explicitly.
The work in da Silva et al. [2022] (see also Kendall [1948, 1949], Ewens et al. [2007], Barbour and Tavaré [2010]) was motivated by an interpretation of a much older problem of Fisher [1943], who was concerned with finding the expected sample variance of (in our phraseology) S_1 and S_2, namely EV_2 := E(S_1 − S_2)^2/2, when the two sample sizes are equal, to n say. Fisher surmised that when n is large, EV_2 ∼ θ log 2. The present results provide an elementary analysis of this problem in a more general setting, and give another view of Fisher's calculations.

The Yule process with immigration
We begin with some well-known results for the Yule process with immigration (cf. Tavaré [1987]). Immigration events occur at the points of a Poisson process of rate θ, each new immigrant initiating a family that grows according to a pure birth (Yule) process, B(·), with time scaled so that the birth rate is λ = 1. (This results in no loss of generality, since for arbitrary λ one replaces t by λt and θ by θ/λ.) The distribution of the number of members of a typical family that has grown for time t from a single individual is geometric:

P(B(t) = k | B(0) = 1) = e^{−t}(1 − e^{−t})^{k−1}, k = 1, 2, … .

We denote the transition probability P(B(t) = k | B(0) = j) by p_{jk}(t).
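The dynamics just described can be checked with a short Gillespie-style simulation. This is a minimal sketch under the construction stated above (immigration at rate θ, each current individual giving birth at rate 1); the function name `simulate_bi` is ours, not from the paper.

```python
import math
import random

def simulate_bi(theta, t_end, rng):
    """Simulate a Yule process with immigration on [0, t_end]:
    immigration at rate theta, each individual giving birth at rate 1.
    Returns the list of family sizes at time t_end."""
    t = 0.0
    families = []                       # current size of each family
    while True:
        n = sum(families)               # total population size Z(t)
        rate = theta + n                # immigration rate + total birth rate
        t += rng.expovariate(rate)      # time to the next event
        if t > t_end:
            return families
        # the next event is an immigration with probability theta / rate
        if rng.random() < theta / rate:
            families.append(1)          # new family founded by one immigrant
        else:
            # a birth: each individual is equally likely to reproduce, so a
            # family is chosen with probability proportional to its size
            i = rng.choices(range(len(families)), weights=families)[0]
            families[i] += 1
```

Averaging sum(simulate_bi(theta, t, rng)) over many runs should approach EZ(t) = θ(e^t − 1), which follows from the Poisson means of the C_i(t) below.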
The population size at time t is denoted by Z(t), and the number of families of size i at time t is denoted by C_i(t), i = 1, 2, … . It is well known (cf. Karlin and McGregor [1967], Tavaré [1987]) that for each t, the C_i(t) are independent Poisson random variables with means

EC_i(t) = θ(1 − e^{−t})^i/i, i = 1, 2, … .

What is less clear is the joint behavior of the number of families observed in different time intervals. This note addresses aspects of this problem.
We make use of the probability generating function (pgf) of B(t), which is given by

E(s^{B(t)} | B(0) = 1) = s e^{−t}/(1 − s(1 − e^{−t})), 0 ≤ s ≤ 1.

Families observable in a given period of time

If we were to watch the growth of a given family through time, there will be some intervals in which no new births are observed. In such an interval, all the members of the family are old: that is, born before the start of the interval; otherwise, the family is composed of both old members and new ones. It is the interplay between these two types that we uncover. We say a family is observable in (a, b), for 0 ≤ a < b, if it has at least one birth in (a, b). Letting I_1 = (a, b) and I_2 = (c, d), we notice that

S(I_1) = K(I_1, I_2) + T(I_1 \ I_2) and S(I_2) = K(I_1, I_2) + T(I_2 \ I_1),

where K(I_1, I_2) counts families observable in both I_1 and I_2, and T(I_i \ I_j) counts families observable in I_i but not in I_j, for {i, j} = {1, 2}. In fact, K(I_1, I_2) counts those points of the immigration Poisson process X on (0, d) that are marked if the corresponding family is observable in both I_1 and I_2. Similar markings apply to T(I_1 \ I_2) and T(I_2 \ I_1). From the Marking Theorem (cf. Kingman [1993]), K(I_1, I_2), T(I_1 \ I_2) and T(I_2 \ I_1) are independent Poisson random variables. To obtain their expected values, for J = (a_0, b_0) and I_i = (a_i, b_i) such that b_{i−1} ≤ a_i for i = 1, …, k, we define random variables V_J(I_1, …, I_k) = V_{a_0,b_0}(I_1, …, I_k) to count the number of families initiated in J and not observable in I_1, …, I_k. We need the following lemma.

Proof. For r ∈ N, let V_{J,r}(I) count all those families initiated in J, with exactly r members at c, which are not observable in I = (c, d). The probability that a family initiated at x ∈ J is not observable in I, while it has exactly r members at c, is

p_{1r}(c − x) e^{−r(d−c)} = e^{−(c−x)}(1 − e^{−(c−x)})^{r−1} e^{−r(d−c)}.

Hence, from the Marking Theorem, for r ∈ N, the V_{J,r}(I) are independent Poisson r.v.s with expected value

θ ∫_a^b e^{−(c−x)}(1 − e^{−(c−x)})^{r−1} e^{−r(d−c)} dx.

As a consequence, V_J(I) = Σ_{r≥1} V_{J,r}(I) is a Poisson r.v. with parameter given by the sum of these terms. We can establish the following lemma in a similar way.
Proof. The probability that a family initiated at x ∈ J is not observable in I_1 and I_2, while it has r and s members at a and c, respectively, 1 ≤ r ≤ s, is

p_{1r}(a − x) e^{−r(b−a)} p_{rs}(c − b) e^{−s(d−c)}.

Hence, from the Marking Theorem, for b < c, V_J(I_1, I_2) is a Poisson r.v., the second-to-last line coming from the fact that families evolve independently. Using (4) once more, and simplifying, we get the stated mean.

Proof. The Marking Theorem shows that U_J(I) and V_J(I) are independent Poisson r.v.s. The result follows since X(J) = U_J(I) + V_J(I).

Proof.
From Lemma 1 and Lemma 2, the expression for EK(I_1, I_2) follows after using Lemma 3 and simplifying.
On the other hand, applying Lemma 1 and Lemma 2 and simplifying gives the result.

Sampling in multiple intervals
We consider sampling from p intervals determined by the points 0 = t_0 < t_1 < ⋯ < t_p. For i = 1, …, p, let S_i := S(t_{i−1}, t_i) be the number of families observable in the time interval (t_{i−1}, t_i). The sample variance of the S_i is given by

V_p = (1/(p − 1)) Σ_{i=1}^p (S_i − S̄)^2, where S̄ = (S_1 + ⋯ + S_p)/p.

We can exploit the previous results to compute EV_p; we see from Corollaries 1 and 2 that EV_p has an explicit form, given in (5). If the time intervals are of equal length, say t_i = iτ/p, i = 0, 1, …, p, then (5) reduces to a simpler expression, where γ = e^{τ/p}.

Logarithmically equal interval lengths
There is one special case that results in the S_i being identically distributed, namely the setting in which e^{t_i} = iγ + 1 for i = 0, 1, …, p and some γ > 0. From Corollary 1, the S_i then have Poisson distributions with mean θ log(1 + γ) and, from Corollary 2, covariances given by

Cov(S_i, S_j) = θ log((1 + γ)^2/(1 + 2γ)), i ≠ j.

As a consequence, the correlation between S_i and S_j is given by ρ := corr(S_i, S_j) = 2 − log(2γ + 1)/log(γ + 1), i ≠ j, and, from (5),

EV_p = θ log((1 + 2γ)/(1 + γ)).

Given the counts S_i, i = 1, …, p, we let S̄ be their mean, so that E S̄ = θ log(1 + γ). This suggests a Watterson-type estimator of θ given by θ̂_S = S̄/log(1 + γ) (7) [Watterson, 1975]. The estimator is unbiased, but as p → ∞, Var(θ̂_S) converges to the positive constant θ log((1 + γ)^2/(1 + 2γ))/log^2(1 + γ), so that θ̂_S is not a consistent estimator of θ.
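The inconsistency of θ̂_S can be made concrete with a few lines of arithmetic. The sketch below (function name `var_theta_hat` is ours) computes Var(θ̂_S) from the Poisson variance θ log(1 + γ) and a pairwise covariance consistent with the correlation ρ = 2 − log(2γ + 1)/log(γ + 1) stated above, namely θ log((1 + γ)^2/(1 + 2γ)).

```python
import math

def var_theta_hat(theta, gamma, p):
    """Variance of the Watterson-type estimator theta_hat = S_bar/log(1+gamma)
    when the S_i are Poisson with mean theta*log(1+gamma) and pairwise
    covariance theta*log((1+gamma)**2/(1+2*gamma)) for i != j."""
    v = theta * math.log(1 + gamma)                           # Var S_i
    c = theta * math.log((1 + gamma) ** 2 / (1 + 2 * gamma))  # Cov(S_i, S_j)
    var_sbar = (p * v + p * (p - 1) * c) / p ** 2             # Var of the mean
    return var_sbar / math.log(1 + gamma) ** 2
```

Since Var(S̄) = c + (v − c)/p, the variance decreases in p but converges to c/log^2(1 + γ) > 0 rather than to zero, which is the inconsistency noted in the text.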

Fisher's problem revisited
Here we revisit Fisher's sampling problem [Fisher, 1943]. Several approaches have appeared in the literature, and we point the reader to da Silva et al. [2022] for an overview.
Here we provide an alternative setting, via the Yule process with immigration, which leads to rather transparent connections between the different views.
The setting described in da Silva et al. [2022] occurs in discrete time, where successive samples of individuals of sizes n_1, n_2, …, n_p are taken, and the observations are S*_i, i = 1, 2, …, p, with S*_i denoting the number of distinct types observed in the ith sample. The generative model is that of the CRP, a sequential model in which the distribution of the counts of family sizes follows the Ewens Sampling Formula [Ewens, 1972]. It is of interest to compare the two settings.
To do this, we choose the time points 0 = t_0, t_1, …, t_p in such a way that the cumulative number of individuals observed, l_i = n_1 + ⋯ + n_i for i = 1, …, p (with l_0 = 0), matches the expectation under the Yule model. To this end, recalling that EZ(t) = θ(e^t − 1), we solve θ(e^{t_i} − 1) = l_i for i = 1, …, p.
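The matching equation θ(e^{t_i} − 1) = l_i solves to t_i = log((θ + l_i)/θ), and is simple enough to code directly. A minimal sketch (the helper name `sampling_times` is ours):

```python
import math
from itertools import accumulate

def sampling_times(theta, sample_sizes):
    """Moment estimates of the sampling times t_i for given sample sizes n_i:
    match E Z(t_i) = theta*(e^{t_i} - 1) to the cumulative counts
    l_i = n_1 + ... + n_i, giving t_i = log((theta + l_i)/theta)."""
    return [math.log((theta + l) / theta) for l in accumulate(sample_sizes)]
```

For example, with θ = 1 and sample sizes (1, 2), the cumulative counts are l_1 = 1, l_2 = 3, giving t_1 = log 2 and t_2 = log 4.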
We remark that if θ were known, and we wish to estimate the time points t i given the counts n i , then (10) gives the moment estimators of the t i , and this in turn provides the maximum likelihood estimators of the t i .
Henceforth we assume that t_i = log(θ + l_i) − log θ, and define S̃_i = S(t_{i−1}, t_i), i = 1, …, p. The results of Section 3 translate into Theorem 2. For any i ∈ N, S̃_i is a Poisson random variable with mean

E S̃_i = θ log((θ + n_i)/θ).

Furthermore, for any p, i, j ∈ N, i ≠ j,

Cov(S̃_i, S̃_j) = θ log((θ + n_i)(θ + n_j)/(θ(θ + n_i + n_j))).

Proof. The results follow from substituting t_i = log(θ + l_i) − log θ in Theorem 1, Theorem 2 and (5), respectively, and simplifying.
The term on the right appears in [Fisher, 1943, p. 451, after (5)] for the case p = 2, so our model provides a unifying approach to Fisher's question.

Asymptotic behavior
To see the asymptotic behavior, let n_i = q_i n, i = 1, …, p, where the q_i ∈ [0, 1] satisfy q_1 + ⋯ + q_p = 1. The right-hand formula was derived originally in Barbour and Tavaré [2010] by a different Poisson argument, and as a limit in the discrete case in da Silva et al. [2022]. For p = 2 and equal sample sizes, EV_2 → θ log 2 as n → ∞, as found by Fisher [1943].
5 How long are gaps in arrivals?
In the covid setting it is of some interest to describe gaps in the appearance of particular variants. One setting for this is the following: conditional on at least one family arriving in (0, t), choose one of those families at random. What is the distribution of the length of time for which that family is unobservable after time t? That is, we seek the distribution of the waiting time, W_t, for that family to have its first birth after time t.
We begin by showing that N_t, the number of members at time t of a family randomly chosen in (0, t), conditional on there being at least one family, is log-series distributed with parameter q_t = 1 − e^{−t}. Since the arrival time in (0, t) of a typical family is uniform on (0, t), we see that

P(N_t = n) = (1/t) ∫_0^t e^{−s}(1 − e^{−s})^{n−1} ds = (1 − e^{−t})^n/(nt) = q_t^n/(n(−log(1 − q_t))), n = 1, 2, …,

as required.
Since each of the N_t individuals in the family at time t behaves independently, it follows that, given N_t = n, W_t is the minimum of n independent unit exponential lifetimes, which is exponential with parameter n. Hence the density f_t(s) of W_t is

f_t(s) = Σ_{n≥1} (q_t^n/(nt)) n e^{−ns} = q_t e^{−s}/(t(1 − q_t e^{−s})), s > 0.

This allows the moments of W_t to be written down immediately:

E(W_t^k) = k! Li_{k+1}(q_t)/t, k = 1, 2, …,

where Li_n(x) denotes the polylogarithm function.
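The mixture representation of W_t (a log-series draw of N_t followed by the minimum of N_t unit exponentials) is easy to simulate, and the mean can be checked against Li_2(q_t)/t. This is an illustrative sketch, with the polylogarithm approximated by a truncated series and all function names ours.

```python
import math
import random

def polylog(n, x, terms=1000):
    """Truncated-series approximation of the polylogarithm Li_n(x), 0 < x < 1."""
    return sum(x ** k / k ** n for k in range(1, terms + 1))

def mean_wait(t):
    """E W_t = Li_2(q_t)/t with q_t = 1 - e^{-t}: average E[W_t | N_t = n] = 1/n
    over the log-series distribution P(N_t = n) = q_t^n/(n t)."""
    q = 1.0 - math.exp(-t)
    return polylog(2, q) / t

def sample_wait(t, rng):
    """Draw W_t: sample N_t by inverting the log-series cdf, then take an
    exponential with rate N_t (the minimum of N_t unit exponentials)."""
    q = 1.0 - math.exp(-t)
    u, n, cum = rng.random(), 0, 0.0
    while cum < u:
        n += 1
        cum += q ** n / (n * t)   # log-series pmf; total mass is t/t = 1
    return rng.expovariate(n)
```

Averaging sample_wait over many draws should agree with mean_wait to Monte Carlo accuracy.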
6 Conditioning in the Yule process with immigration

da Silva et al. [2022] provide a discrete approach to understanding Fisher's multi-sampling problem. In this paper we tackle Fisher's problem with a continuous approach. This section discusses the connections between the two approaches, focusing primarily on various versions of embedding.
To set the scene, recall that the Ewens Sampling Formula (ESF) may be realised by conditioning independent Poisson random variables on a finite [Watterson, 1974] or an infinite [Shepp and Lloyd, 1966] weighted sum. Hence we discuss a unifying approach that connects both types of conditioning relations through embedding of the CRP into the Yule process with immigration. In particular, for θ > 0, let

E_n(c_1, …, c_n) = (n!/θ_{(n)}) Π_{i=1}^n (θ/i)^{c_i} (1/c_i!), for c_1, …, c_n ≥ 0 with Σ_{i=1}^n i c_i = n,

for n ∈ N, where θ_{(n)} = θ(θ + 1) ⋯ (θ + n − 1) and θ_{(0)} = 1. E_n is the ESF, originally derived as the distribution of the allelic partition of a sample of n genes sampled from a large population. To establish the conditioning relations, for x ∈ (0, 1], let π_1(x), π_2(x), … be independent Poisson random variables with Eπ_i(x) = θx^i/i, and let T_n(x) = Σ_{i=1}^n iπ_i(x) and T_∞(x) = Σ_{i=1}^∞ iπ_i(x). Extending the result of Shepp and Lloyd [1966], as T_∞(x) is almost surely finite for x ∈ (0, 1), we have

P(π_1(x) = c_1, …, π_n(x) = c_n | T_∞(x) = n) = E_n(c_1, …, c_n).

By conditioning on T_n(x) instead of the infinite sum, Watterson introduced a conditioning relation that holds for x ∈ (0, 1]; more precisely,

P(π_1(x) = c_1, …, π_n(x) = c_n | T_n(x) = n) = E_n(c_1, …, c_n). (15)

Notice from (2) that if we define π_i(x) = C_i(t) for t ∈ R_+ and x = 1 − e^{−t} ∈ (0, 1), and let Z_n(t) = Σ_{i=1}^n iC_i(t), by definition we have T_n(x) = Z_n(t) and T_∞(x) = Z(t).
Hence, for x ∈ (0, 1) and t ∈ R_+, the conditional distribution of the counts of family sizes, given Z(t) = n or Z_n(t) = n, is the ESF once more. Note that the C_i(t) are independent Poisson random variables with EC_i(t) = θ(1 − e^{−t})^i/i, and, as t → ∞, C_i(t) converges in distribution to C_i(∞) = π_i(1), which has a Poisson distribution with mean θ/i, for i ∈ N; hence one can derive (15) for x = 1 by letting t → ∞.
The conditioning relation provides one way to simulate observations from the ESF: simulate the C_i(t), and accept the realisation if Z(t) = n. To make the simulation as efficient as possible, we should choose t to make the probability of the conditioning event as large as possible. To do this, we note that the value of t may be chosen as a function of n, since this choice plays no role in the conditional distribution. Maximizing P(Z(t) = n) given in (8) gives t = log((n + θ)/θ) (and so x = x_n = n/(n + θ)). We note that this is the same choice of t as provided in (10). We also note that there are far more efficient ways to simulate the ESF; see Arratia et al. [2018] for example.
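The rejection scheme just described can be sketched in a few lines. Following Watterson's relation (15), it suffices to simulate π_1(x), …, π_n(x) and accept when T_n(x) = n, with x = n/(n + θ) as chosen above; the function names `esf_by_conditioning` and `poisson` are illustrative helpers, not from the paper.

```python
import math
import random

def poisson(mean, rng):
    """Poisson sample by cdf inversion (adequate for the small means here)."""
    u, k, p = rng.random(), 0, math.exp(-mean)
    cum = p
    while cum < u:
        k += 1
        p *= mean / k
        cum += p
    return k

def esf_by_conditioning(n, theta, rng):
    """Sample counts (c_1, ..., c_n) from the Ewens Sampling Formula:
    simulate independent Poisson counts with means theta*x^i/i for
    x = n/(n + theta), and accept when sum_i i*c_i = n  (relation (15),
    with t = log((n + theta)/theta) maximizing the acceptance chance)."""
    x = n / (n + theta)
    means = [theta * x ** i / i for i in range(1, n + 1)]
    while True:
        c = [poisson(m, rng) for m in means]
        if sum(i * ci for i, ci in enumerate(c, start=1)) == n:
            return c
```

Every accepted draw is an exact ESF sample; as the text notes, far more efficient samplers exist (Arratia et al. [2018]).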
The conditioning relation (15) is a result of the embedding of the CRP in the Yule process with immigration. Here we discuss embeddings at multiple time points. To this end, let C̃(n) denote the counts of family sizes generated by the first n arrivals in the CRP with parameter θ, and let C(t) = (C_1(t), C_2(t), …), t ≥ 0, be the family size counts at time t of the Yule process with immigration with birth rate 1 and immigration rate θ. To discretise the process C = (C(t), t ≥ 0), let Ψ(n) = (Ψ_1(n), Ψ_2(n), …) be the family size counts of the first n individuals born in the Yule process with immigration, i.e. Ψ_j(n) gives the number of families of size j, considering only the first n individuals. As a discrete process, Ψ := (Ψ(n), n ≥ 1) records the outcomes of jumps in the Yule process with immigration. This is a slightly different version of the jump Markov chain J = (J(n), n ≥ 1) used in [Tavaré, 1987, Section 3], in which, for each n, the families in J(n) are also sorted in order of their appearance in the population.
In other words, for any n ∈ N, one can easily obtain Ψ(n) from J(n) by forgetting the order (age) of the families in J(n) and grouping all families of the same size together. From Theorem 2 in Tavaré [1987], J is independent of Z; as a functional of J, so is Ψ. The independence is a consequence of the fact that at each jump time one new individual is added to the existing population, whether the new individual arrives as a result of an immigration (new family) or of a birth. The connection between C̃ = (C̃(n), n ∈ N) and C = (C(t), t ≥ 0) is given in the next theorem.
To better connect the sequential multi-sampling theory of the Yule process with immigration to its discrete-time counterpart, consider a population of size n_1 + n_2 + ⋯ + n_p, sampled from a single run of a CRP. da Silva et al. [2022] study the pairwise correlation and sample variance of S*_1, S*_2, …, S*_p, the number of types (or species) appearing in the first n_1 arrivals, the second n_2 arrivals, …, and the last n_p arrivals of the CRP sample. It is now straightforward from (16) that (S*_1, S*_2, …, S*_p) has the same distribution as (S(t_0, t_1), …, S(t_{p−1}, t_p)), conditional on observing, in the latter, exactly n_i individuals in (t_{i−1}, t_i), for i = 1, …, p. As mentioned in the discussion after (10), letting t_i = log((θ + l_i)/θ) for i = 1, …, p maximizes the chance of observing n_i new individuals (i.e. n_i births) in (t_{i−1}, t_i), and under this assumption, the asymptotic behavior of the sample variance of S̃_1, …, S̃_p and that of the sample variance of S*_1, …, S*_p coincide. In this case, in addition to providing a relatively simpler way to calculate things, the Yule process with immigration allows the sequential samples S̃_1, …, S̃_p to have a random number of individuals, and hence is more appropriate for population models with random sample sizes in which the individuals arrive one by one at random times.
Recall that a family is observable in (a, b), for 0 ≤ a < b, if it has at least one birth in (a, b), and that S(a, b) counts the families observable in (a, b). Of interest are the distribution of S(a, b) and the joint distribution of S(a, b) and S(c, d) for 0 ≤ a < b ≤ c < d. In particular, we compute Cov(S(a, b), S(c, d)).

Lemma 1. For 0 ≤ a < b ≤ c < d, let J = (a, b) and I = (c, d). Then V_J(I) is a Poisson r.v. with mean

EV_J(I) = θ log((e^d − e^c + e^b)/(e^d − e^c + e^a)),

obtained using (4) and simplifying. We highlight one special case of this result: for J = (0, a) and I = (a, b), Lemma 1 gives EV_J(I) = θb − θ log(e^b − e^a + 1).

Lemma 2. For J = (0, a), I_1 = (a, b) and I_2 = (c, d), V_J(I_1, I_2) is a Poisson r.v. with mean

EV_J(I_1, I_2) = θ log((e^d − e^c + e^b)/(e^d − e^c + e^b − e^a + 1)),

as required. The case b = c reduces to EV_J(I) for J = (0, a), I = I_1 ∪ I_2 = (a, d), given in Lemma 1.

To compute the expected values of K(I, J), T(I \ J) and T(J \ I), we also need the random variables U_J(I_1, …, I_k), counting the families initiated in J and observable in I_1, …, I_k.

Lemma 3. For 0 ≤ a < b ≤ c < d, let J = (a, b) and I = (c, d). Then U_J(I) is a Poisson r.v. with parameter

EU_J(I) = θ log(e^b(e^d − e^c + e^a)/(e^a(e^d − e^c + e^b))).

In particular, combining the lemmas, the expected number of families initiated in (0, a) and observable in both I_1 = (a, b) and I_2 = (c, d) is

θ(a − b) + θ log(e^b − e^a + 1) − θ log(e^d − e^c + e^a) + θ log(e^d − e^c + 1) + θ log(e^d − e^c + e^b) − θ log(e^d − e^c + e^b − e^a + 1).