Mathematical Characterization of Private and Public Immune Receptor Sequences

Diverse T and B cell repertoires play an important role in mounting effective immune responses against a wide range of pathogens and malignant cells. The number of unique T and B cell clones is characterized by T and B cell receptors (TCRs and BCRs), respectively. Although receptor sequences are generated probabilistically by recombination processes, clinical studies found a high degree of sharing of TCRs and BCRs among different individuals. In this work, we use a general probabilistic model for T/B cell receptor clone abundances to define “publicness” or “privateness” and information-theoretic measures for comparing the frequency of sampled sequences observed across different individuals. We derive mathematical formulae to quantify the mean and the variances of clone richness and overlap. Our results can be used to evaluate the effect of different sampling protocols on abundances of clones within an individual as well as the commonality of clones across individuals. Using synthetic and empirical TCR amino acid sequence data, we perform simulations to study expected clonal commonalities across multiple individuals. Based on our formulae, we compare these simulated results with the analytically predicted mean and variances of the repertoire overlap. Complementing the results on simulated repertoires, we derive explicit expressions for the richness and its uncertainty for specific, single-parameter truncated power-law probability distributions. Finally, the information loss associated with grouping together certain receptor sequences, as is done in spectratyping, is also evaluated. Our approach can be, in principle, applied under more general and mechanistically realistic clone generation models.


Introduction
A major component of the adaptive immune system in most jawed vertebrates is the repertoire of B and T lymphocytes.A diverse immune repertoire allows the adaptive immune system to recognize a wide range of pathogens [1].B and T cells develop from common lymphoid progenitors (CLPs) that originate from hematopoietic stem cells (HSCs) in the bone marrow.B cells mature in the bone marrow and spleen while developing T cells migrate to the thymus where they undergo their maturation process.After encountering an antigen, naive B cells may get activated and differentiate into antibody-producing plasma cells, which are essential for humoral (or antibody-mediated) immunity.In recognizing and eliminating infected and malignant cells, T cells contribute to cell-mediated immunity of adaptive immune response.
T-cell receptors bind to antigenic peptides (or epitopes) that are presented by major histocompatibility complex (MHC) molecules on the surface of antigen-presenting cells (APCs).T cells that each carry a type of TCR mature in the thymus and undergo V(D)J recombination, where variable (V), diversity (D), and joining (J) gene segments are randomly recombined [2,3].The receptors are heterodimeric molecules and mainly consist of an α and a β chain while only a minority, about 1-10% [4], of TCRs consists of a δ and a γ chain.The TCR α and γ chains are made up of VJ and constant (C) regions.Additional D regions are present in β and γ chains.During the recombination process, V(D)J segments of each chain are randomly recombined with additional insertions and deletions.After recombination, only about 5% or even less [5] of all generated TCR sequences are selected based on their ability to bind to certain MHC molecules ("positive selection") and to not trigger autoimmune responses ("negative selection").These naive T cells are then exported from the thymus into peripheral tissue where they may interact with foreign peptides that are presented by APCs.The selection process as well as subsequent interactions are specific to an individual.
The most variable parts of TCR sequences are the complementary determining regions (CDRs) 1, 2, and 3, located within the V region, among which the CDR3β is the most diverse [6].Therefore, the number of distinct receptor sequences, the richness R, of TCR repertoires is typically characterized in terms of the richness of CDR3β sequences.Only about 1% of T cells express two different TCRβ chains [7][8][9], whereas the proportion of T cells that express two different TCRα chains may be as high as 30% [9,10].
B cells can also respond to different antigens via different B cell receptors (BCRs) that are comprised of heavy and light chains.As with TCRs, the mechanism underlying the generation of a diverse pool of BCRs is VDJ recombination in heavy chains and VJ recombination in light chains.Positive and negative selection processes sort out about 90% of all BCRs that react too weakly or strongly with certain molecules [11].As a result of the various recombination and joining processes and gene insertions and deletions, the practical theoretical maximum repertoire size Ω 0 of the variable region of BCR and TCR receptors can be ∼ 10 14 − 10 20 [12][13][14][15].This value is comparable to the possible number of amino acid sequences of typical length ∼ 11 − 12.However, many of these sequences are not viable, are removed through thymic selection, or are have such low probability occuring that they are never expected to be produced in an organism's lifetime.Thus, the effective number of TCR variable regions that are produced and that can contribute to the organism's repertoire size, Ω, should be much less than Ω 0 .Estimating the true size of BCR and TCR repertoires realized in an organism is challenging since the majority of such analyses are based on small blood samples, leading to problems similar to the "unseen species" problem in ecology [16].Nonetheless, the number of unique TCRs realized in organisms has been estimated to be about 10 6 for mice [17] and about 10 8 for humans [18].B-cell repertoire size for humans is estimated to be 10 8 − 10 9 [19].These values are significantly smaller than Ω 0 and might be used as an effective Ω.
Each pool of BCR and TCR sequences realized in one organism i can be seen as a subset U i of the set of all possible species-specific sequences S. Sequences that occur in at least two different organisms i and j (i.e., sequences that are elements of U i ∩U j ) are commonly referred to as "public" sequences [16] while "private" sequences occur only in one of the individuals tested.The existence of public TCRβ sequences has been established in several previous works [18,[20][21][22].More recently, a high degree of shared sequences has been also observed in human BCR repertoires [23,24].
The notions of public and private clonotypes have been loosely defined.Some references use the term "public sequence" to refer to those sequences that "are often shared between individuals" [22] or "shared across individuals" [25].Recently, Elhanati et al. [26] and [27] have formulated mathematical and statistical framework to quantify "publicness" and "privateness."Building on these works, we derive a set of measures that enable us to quantify immune repertoire properties, including the expected total richness, the expected numbers of public and private clones, and their variances (confidence levels), all expressed in terms of the general set of clone generation probabilities or clone populations.One of our metrics is the expected "M -overlap" or "Mpublicness," defined as the expected number of clones that appear in samples drawn from M separate individuals.This quantity is a clinically interpretable limit of the expected "sharing number" defined in Elhanati et al. [26].Similarly, we define M -private clones as clones that are not shared by all M individuals, i.e., occurring in at most M − 1 individuals.
In the next section, we first give an overview of the mathematical concepts that are relevant to characterize TCR and BCR distributions.We then formulate a statistical model of receptor distributions in Sec. 3. In Secs.3.1 and 3.2, we derive quantities associated with receptor distributions in single organisms and across individuals, respectively.We will primarily focus on the overlap of repertoires across individuals and on the corresponding confidence intervals that can be used to characterize "public" and "private" sequences of immune repertoires.Formulae we derived are listed in Table 1.In Sec. 4, we use synthetic and empirical TCR amino acid sequence data and perform simulations to compare theoretical predictions of repertoire overlaps between different individuals with corresponding observations.Finally, when analyzing empirical sequence data, one may use continuous approximations [26,27] and averaging (i.e., coarse-graining) methods that change the information content in the underlying dataset.Coarse-graining of TCR and BCR data may also be a result of the employed sequencing techniques [28,29].In Sec. 6, we therefore briefly discuss the information loss associated with analyzing processed cell data.We discuss our results and conclude our paper in Sec. 7. Our source codes are publicly available at [30].

Mathematical concepts
Although receptor sequences and cell counts are discrete quantities, using continuous functions to describe their distribution may facilitate the mathematical analysis of the quantities that we derive in the subsequent sections.For example, a continuous approximation (i.e., a "density-of-states approximation") of receptor sequence distributions has been used in [26,27] to analyze properties of immune repertoires.We therefore briefly review some elementary concepts associated with continuous distributions and their discretization.
Let p(x) be the probability density associated with the distribution of traits, as depicted in Fig. 1(a).The probability that a certain trait occurs in [x, x+dx) is p(x) dx.The corresponding discretized distribution elements are where ∆ is the discretization step size of the support of p(x).If we discretize the values of probabilities, the number of clones with a certain relative frequency p i is given by the clone count where the indicator function 1 = 1 if its argument is satisfied and 0 otherwise.As shown in Fig. 1(b), the parameter δ defines an interval of frequency values and modulates the clone-count binning.Figures 1(b,c) show how p i and c k are constructed from a continuous distribution p(x).
If p(x) is not available from data or a model, an alternative representative starts with the number density n(x), which can be estimated by sampling a process which follows p(x).The probability that a continuous trait x is drawn twice from a continuous distribution p(x) is almost surely zero.Hence, the corresponding number counts n(x) are either 1 if X ∈ [x, x + dx) (i.e., if trait X is sampled) or 0 otherwise, as shown in Figs.1(d,e).We say that X is of clonotype i if X ∈ [i∆, (i + 1)∆) (1 ≤ i ≤ Ω) and we use n i to denote the number of cells of clonotype i.Then, if Ω denotes the effective number of different clonotypes, the total T-cell (or B-cell) population is N ≡ Ω i=1 n i .The relative empirical abundance of clonotype i is thus pi = n i /N (see Fig. 1e), satisfying the normalization condition i pi = 1.The corresponding empirical clone count derived from the number representation n i is defined as and shown in Fig. 1

Whole organism statistical model
Using the mathematical quantities defined above, we develop a simple statistical model for BCR and TCR sequences distributed among individuals.
Although our model is applicable to both BCR and TCR sequences, we will primarily focus on the characterization of TCRs for simplicity.B cells undergo an additional process of somatic hypermutation and class switching leading to a more dynamic evolution of the more diverse B cell repertoire [33].By focusing on naive T cells, we can assume their populations are generated by the thymus via a single, simple effective process.Assume a common universal recombination process (see Fig. 2) in T-cell development that generates a cell carrying TCR of type 1 ≤ i ≤ Ω 0 with probability p (0) i .Here, Ω 0 ≫ Ω is the the theoretical number of ways the full TCR sequence can be constructed which is itself much larger than the effective number Ω that appears in an individual after thymic selection.Although each new T cell produced carries TCR i with probability p (0) i , many sequences i are not realized given thymic selection (that eliminates ∼ 98% of them), the finite number of T cells produced over a lifetime [3,5,15], or the extremely low generation probability of some clones.These effects are invoked to truncate Ω 0 to Ω ≪ Ω 0 .However, we will see in Section 5, explicit scaling relastionships for the limit Ω ≫ 1 can be found for general power-law ordered probabilities p i .
Besides VDJ recombination [34], thymic selection and subsequent death, activation, and proliferation occur differently across individuals 1 ≤ m ≤ M and may be described by model parameters θ  , N (m) might be described by dynamical models, deterministic or stochastic, such as those presented in [35].
At any specific time, individual m will have a cell population configuration n (m) ≡ (n Ω ) with probability P(n (m) |θ (m) , N (m) ).Each individual can be thought of as a biased sample from all cells produced via the universal probabilities p (0) i .We can approximate individual probabilities p where and Thus, each individual can be thought of as a "sample" of the "universal" thymus.Neglecting genetic relationships amongst individuals, we can assume them to be independent with individual probabilities p (m) i .These probabilities describe the likelihood that a randomly drawn cell from individual m is a cell of clone i.
Repeated draws (with replacement) would provide the samples for the estimator p(m are counted and N (m) is also known or estimated.This representation allows us to easily express the probabilities of any configuration n (m) analytically.A dynamical model for n (m) i cannot be directly described by our simple probabilities p (m) i . A mechanistically more direct model could incorporate the production rate of clone i T cells from the thymus, the proliferation and apoptosis rates of clone i cells, and interactions manifested as, e.g., carrying capacity as model parameters.Probability distributions for n (m) , as a function of birth, death, and immigration rates, have been found in [36] and can also be used, instead of Eq. ( 4), to construct probabilities.

Single individual quantities
First, we focus on quantities intrinsic to a single individual organism; thus, we can suppress the "m = 1" label.Within an individual, we can use clone counts to define measures such as the richness where ĉk 3) (the number of clones that are of size k).Other diversity/entropy measures such as Simpson's indices, Gini indices, etc. [1,37] can also be straightforwardly defined.Given the clone populations n, the individual richness can be found by direct enumeration of Eq. (5).
We can also express the richness in terms of the underlying probabilities p associated with the individual by first finding the probability ρ i that a type-i cell appears at all among the N cells within said individual.This probability is and corresponds to that of a binary outcome, either appearing or not appearing.Higher order probabilities like ρ ij (both i-and j-type cells appearing in a specific individual) can be computed using the marginalized probability to construct Higher moments can be straightforwardly computed using quantities such as These expressions arise when we compute the moments of R [defined by Eq. ( 5)] in terms of the probabilities p.Using the single-individual multinomial probability P(n|p, N ) (Eq. ( 4)) allows us to express moments of the richness in a single individual in terms of the underlying system probabilities p.The first two are given by

Multi-individual quantities
Here, we consider the distribution n (m) across different individuals and construct quantities describing group properties.For example, the combined richness of all TCR clones of M individuals is defined as To express the expected multi-individual richness in terms of the underlying individual systems probabilities p (m) , we weight Eq. ( 11) over the M -individual probability and sum over all allowable n (m) .For computing the first two moments of the total-population richness in terms of p (m) , we will make use of the marginalized probability ρi that clone i appears in at least one of the M individuals Private and Public TCR statistics Note that ρi describes the probability that a type i cell occurs at all in the total population, while describes the probability that a type i cell appears in each of the M individuals.
We will also need the joint probability that clones i and j both appear in at least one of the } , which we can decompose as Upon using Eqs.( 4) and ( 7), we find allowing us to rewrite Eq. ( 14) as Using the above definitions, we can express the mean total-population richness as where the last approximation holds for p The second moment of the total M -population richness can also be found in terms of E[R (M) ] and Eq. ( 16), Given n (m) of all individuals, we can also easily define the number of distinct TCR clones that appear in all of M randomly selected individuals, the "M -overlap" or "M -publicness"  , m = 1, 2, 3.The richness R (3) is the total number of distinct number of different TCRs found across all individuals, and is defined in Eq. (11).The overlap K (3) is the number of TCR clones found in all three individuals, as defined in Eq. (19).For visual simplicity, the set of clones i present in each individual are drawn to be contiguous.When considering subsampling of cells from each individual, K (M ) will be reduced since some cell types i will be lost.The corresponding values, s , and R (M ) s can be constructed from Eqs. ( 23) and ( 24) reflecting the losses from subsampling.
Figure 3 provides a simple example of three individuals each with a contiguous distribution of cell numbers n (m) i that overlap.As with Eqs. ( 5) and (10), we can express the overlap in terms of the underlying individual probabilities p (m) by weighting Eq. ( 19) by the M -population probability As with M -publicness, we can identify the expected M -privateness as Ω − E K (M) , the expected number of clones that are not shared by all M individuals, i.e., that occur in at most M − 1 individuals.This "privateness" is related to a multi-distribution generalization of the "dissimilarity probability" of samples from two different discrete distributions [38].Variations in M -overlap associated with a certain cell-type distribution are captured by the variance Var If the total number of sequences Ω is very large, parallelization techniques (see Sec. 4) should be employed to evaluate the term A more specific definition of overlap or privateness may be that a clone must appear in at least some specified fraction of M tested individuals.To find the probability that M i ≤ M individuals share at least one cell of a single type i, we use the Poisson binomial distribution describing independent Bernoulli trials on individuals with different success probabilities ρ where F Mi is the set of all subsets of M i integers that can be selected from the set (1, 2, 3, . . ., M ) and A c is the complement of A. Equation ( 22) gives a probabilistic measure of the prevalence of TCR i across M individuals.For example, one can use it to define a mean frequency E[M i ]/M .One can evaluate Eq. ( 22) recursively or using Fourier transforms, particularly for M < 20 [39,40].

Subsampling
The results above are described in terms of the entire cell populations n (m) or their intrinsic generation probabilities p (m) .In practice, one cannot measure n (m) i or even N (m) in any individual m.Rather, we can only sample a much smaller number of cells S (m) ≪ N (m) from individual m, as shown in Fig. 2. Within this subsample from individual m, we can count the number s (m) i of type-i cells.Since only subsamples are available, we wish to define quantities such as probability of occurrence, richness, and overlap in terms of the cell counts s (m) ≡ {s (m) i } in the sample extracted from an individual.Quantities such as sampled richness and overlap can be defined in the same way except with s (m) as the underlying population configuration.To start, first assume that the cell count n in a specific individual is given.If that individual has N cells of which S are sampled, the probability of observing the population s = {s 1 , s 2 , . . ., s Ω } in the sample is given by (assuming all cells are uniformly distributed and randomly subsampled at once, without replacement) [41] P(s|n, S, N ) = 1 The probability that cell type j appears in the sample from an individual with population n can be found by marginalizing over all s j =i , giving This result can be generalized to more than one TCR clone present.For example, the probability that both clones i and j are found in a sample is Using Eq. ( 23) as the probability distribution, we can also find the probability that clone i appears in any of the M S (m) -sized samples and the joint probabilities that clones i and j appear in any sample Quantities such as richness and publicness measured within samples from the group can be analogously defined in terms of clonal populations s (m) : and For a given n (m) , these quantities can be first averaged over the sampling distribution Eq. ( 23) to express them in terms of n (m) and to explicitly reveal the effects of random sampling.The first two moments of R s , R (M) s , and expressed in terms of n (m) can be easily found by weighting Eqs. ( 28), (29), and (30) by P(s|n, S, N ) and All of the above quantities can also be expressed in terms of the underlying probabilities p (m) rather than the population configurations n (m) .To do so, we can further weight Eqs. ( 31), (32), and (33) over the probability Eq. ( 4) to render these quantities in terms of the underlying probabilities p (m) .However, we can also first convolve Eq. ( 23) with the multinomial distribution in Eq. ( 4) (suppressing the individual index m) P(s|p, S, N ) = n P(s|n, S, N )P(n|p, N ), (34) along with the implicit constraints which is a multinomial distribution identical in form to P(n|p, N ) in Eq. ( 4), except with n replaced by s and N replaced by S. Uniform random sampling from a multinomial results in another multinomial.Thus, if we use the full multi-individual probability to compute moments of the sampled richness and publicness, they take on the same forms as the expressions associated with the whole-organism quantities.For example, in the p representation, the probability that clone i appears in the sample from individual m is in analogy with Eq. ( 6), while the two-clone joint probability in the sampled from individual m becomes in analogy with Eq. ( 8).Similarly, for the overlap quantities, in analogy with Eqs. ( 13) and ( 16), we have ρi (S) .
In addition to simple expressions for the moments of K (M) s , we can also find expressions for the probability distribution over the values of K (M) s . In terms of n (m) , since the probability that s , the probability that exactly k clones are shared by all M samples is where F k is the set of all subsets of k integers that can be selected from the set {1, 2, 3, . . ., K (M) } and A c is the complement of A. Equation ( 41) is the Poisson binomial distribution, but this time the underlying success probabilities across all M individuals vary with TCR clone identity i.Finally, inference of individual measures from subsamples can be formulated.One can use the sampling likelihood function P(s|n, S, N ), Bayes' rule, and the multinomial (conjugate) prior P(n|p, N ) to construct the posterior probability of n given a sampled configuration s: P(n|s, S, N, p) = P(s|n, S, N )P(n|p, N ) n P(s|n, S, N )P(n|p, N ) The normalization in Eq. ( 42) has already been found in Eqs. ( 34) and (35).Thus, we find the posterior in terms of the hyperparameters p.Using this posterior, we can calculate the expectation of the whole organism richness R = k≥1 which depends on the sampled configuration only through the sample-absent clones j. Bayesian methods for estimating overlap between two populations from samples have also been recently explored [42].

Simulations
The sampling theory derived in the previous sections is useful for understanding the effect of different sampling distributions on measurable quantities such as the proportion of shared TCRs and BCRs among different individuals.Figures 4 and 5 show two examples of receptor distributions, along with the respective relative overlaps and Fano factors, for three individuals.To illustrate our methodology clearly and concisely, we utilize three shifted uniform distributions as models of synthetic sequence distributions in Fig. 4. In this example, the number of TCR or BCR sequences per individual is N (m) = 10 5 , (m = 1, 2, 3), and the sampled group richness R (3) s = 1500.Based on the abundance curves shown in Fig. 4(a), we can readily obtain the overlaps between individuals 1-3 (solid black, dashed blue, and dash-dotted red lines), as well as between all pairs of individuals.The maximum possible overlap, normalized by R (3) s , between all three individuals and between individuals 1 and 3 is 500/1500 ≈ 0.33.For the two remaining pairs, the corresponding maximum relative overlap, normalized by the richness associated with all three sampled individuals, is 750/1500 = 0.5.
Using S (m) < N (m) sampled cells from each individual, we observe in Fig. 4(b) that the increase of E[K with S (m) is well-described by Eq. ( 20).In Fig. 4(c), we plot the Fano factor Var K as a function of the number of sampled cells S (m) from each individual.For sample sizes of about 1000 (i.e., about 1% of the total sequence population), the Fano factor is between 0.4 (for overlaps between individuals 1 and 2 and between individuals 2 and 3) and 0.6 (for the overlap between individuals 1-3).As sample sizes reach about 5% of the total number of sequences (10 5 ), the variance Var K (M) s becomes negligible with respect to the expected overlap E K (M) s .
As an example of an application to empirical TRB CDR3 data, we used the SONIA package [43] to generate amino acid sequence data for three individuals, each with N (m) = 10 5 cells.The combined richness across all individuals is R (3) s = 284, 598.We show the abundances of all sequences in Fig. 5(a).The majority of sequences has an abundance of 1 while only very few sequences have abundances that exceed 5. Figure 5(b) shows the expected number of shared sequences as a function of the sampled number of cells S (m) .To evaluate Eq. ( 33), we compute the binomial terms in σi and σij by expanding them according to, e.g., where S (m) /N (m) is the sample fraction drawn from the m th individual.For large n i , other approximations, including variants of Stirling's approximations can be employed for fast and accurate evaluation of binomial terms.
We compare these number-representation results with the p-representation results by using the estimates p(m (S) and ρ (m) ij (S) to compute the quantities in Eqs.(20) and (21).If the number of sampled cells S (m) is not too large, the analytic approximation of using p(m) is fairly accurate, as shown by the dashed back curves in Fig. 5(b).Since the abundances of the majority of sequences are very small, finite-size effects lead to deviations from the naive approximation (37) as the numbers of sampled cells S (m) grows large.Of course, we can also extract associated with relative overlaps between individuals 1 and 2 (solid blue line), 2 and 3 (dash-dotted green line), 1 and 3 (dashed orange line), and 1-3 (solid red line) as a function of the number of sampled cells S (m) .generation probabilities from SONIA and directly use Eq. ( 20) and ρ (m) i (S) from Eq. ( 37) to find the p-representation M -overlap E K To examine the variance associated with a given expected number of shared empirical TRB CDR3 sequences, we show the Fano factor Var K as a function of the sample size S (m) in Fig. 5(c).For the shown sample sizes up to S (m) = 3 × 10 4 , the Fano factor is larger than about 0.85, indicating a relatively large variance Var K (M) s .In addition to reporting mean values of measures of sequence sharing (i.e., "overlap" or "publicness") when analyzing empirical receptor sequence data [26,27], we thus recommend to compute Var K (M) s to determine corresponding confidence intervals.Calculations were performed on an AMD ® Ryzen Threadripper 3970 using Numba to parallelize the calculation of Eqs.(33) and 5 Explicit forms for power-law probabilities All of our results thus far assume a model or estimate of p i or n i , as well as knowledge of at least Ω.For our formulae to be useful, the theoretical maximum richness Ω also needs to be estimated or modeled.Numerous parametric and nonparametric approaches have been developed in the statistical ecology literature [41,[44][45][46][47][48][49][50], as well as expectation maximization methods to self-consistently estimate richness and most likely clone population n [51].
To explore how our results depend on parameters such as Ω we derive approximate analytic expressions when the identical individual probabilities p (m) i = p i obey truncated power-law distributions: where If Ω is sufficiently large, we would like to show under what conditions the expectations of our diversity measures converge quickly to Ω-independent values.By approximating /(Hν (Ω)z ν ) dz in Eq. ( 6) we find in the large Ω limit where the exponential integral is defined by E(x, y) ≡ ∞ 1 t −x e −yt dt and ζ(ν) is the Riemann zeta function.Consistent with known biology and previous estimates [14,15], we take the large-Ω limit where x > 1.From Eqs. (47), we see that the expected richnesses converge to fixed values for large enough Ω and all values of ν ≈ 1.The rescaled expected richnesses are plotted as functions of x = Ω/N or x = Ω/N 1/ν in Fig. 6(a).
Analogous cutoff-insensitive results can be found for the variance Var[R 2 (ν)] as well as other quantities.A good approximation for the variance is Fig. 6 Expected richness and uncertainty under power law-distributed probabilities p i following Eq.( 46).(a) Expected richness for different values of ν that lead to simple scaling and dependence only on x = Ω/N, Ω/N 1/ν .For large x, the expected rescaled richnesses E[R(ν)]/N and E[R(ν > 0)]/N 1/ν converge.Since the normalization of the expected richness (by N 1/ν ) for ν > 1 is different than for ν = 0, 1  2 , (normalized by N ), we construct the squared coefficient of variation and plot where the power-law assumption in Eq. ( 46) is used in the last approximation.The normalized squared coefficients of variation CV Plots of the CV of the richness under power-law system probabilities are shown in Fig. 6(b).We see that the squared CVs converge in the large x limit to N −1 for ν = 0, 1/2 and vanish for ν > 1.
The behavior of the sampled M -overlap, E[K (M) s ({p (m) })], can also be quantified under the power-law probability distribution.By using Eq.(20) and ρ i (S) (Eq.(37), assuming equal probabilities p (m) i = p i and sample sizes S (m) = S across individuals), we find (ν = 1/2) as a function of the number M of individuals sampled with S = 10 4 , 10 5 , 10 6 .In all panels, the dashed curves plot the analytic approximation for ν 0.7 given in the second line of (51).In (c), the ν = 0 limit matches the expression given by the first line in (51).The approximations given in (51) are especially accurate for large S and larger M and ν.
This expression can be further simplified in the large Ω limit for specific ν, (51) The last approximation is most accurate for ν > 1 where H ν (Ω → ∞) converges and the prefactor in brackets is ≈ 1.For sufficiently large S, it still provides a rough estimate of M -overlap for smaller values of ν.Asymptotic expressions for even smaller values of ν can be found in the S/H ν (Ω) ≪ 1 limit, but this limit yields very low expected M -overlap and is typically less informative.(ν) as a function of sample size, power-law ν, and M .For comparison, the analytic approximation for ν 0.7 ( 51) is also plotted by the dashed curves.Eq. ( 51) and plots such as those in Fig. 7(a,b) could be useful for estimating the sample size S required in order to observe a specific overlap between the immune repertoires of M selected individuals.For instance, with M = 4 individuals, a repertoire size of Ω = 10 7 , and a sequence distribution exponent ν = 0.5, an expected M -overlap of approximately 1 can be achieved with a sample size of S = 10 4 .
Since the 1/H ν (Ω) (ν) , as shown in Fig. 7(c).Using Eqs.(33) and (38), we can also straightforwardly evaluate the variance of the M -overlap.For ν = 0 and uniform We find the variances of the M -overlap, as with our other metrics, are well approximated by that of a binomial process in the S → ∞ limit and when values of ν are modest: var . Modest deviations from this approximation arise for finite S and large values of ν.

Sampling resolution and information loss
We end with a brief discussion of information loss upon coarse-graining which arises when analyzing lower-dimensional experimental/biochemical classifications of clones that are commonly used.Such lower-dimensional representations can be obtained through spectratyping [28,29].For TCRs, spectratyping groups sequences together and produces compressed receptor representations describing CDR3 length, frequency, and associated beta variable (TRBV) genes [52].In addition to coarse-grained representations of sequencing data, some studies [26,27] use continuous approximations to describe the distribution of receptor sequences.Estimators of entropy and their errors have been developed for subsampling from discrete distributions [53,54].Therefore, in this section, we focus on quantifying differences in information content that are associated with (i) using continuous approximations of discrete sequencing data, and (ii) coarse-graining already-discretized (i.e., spectratyping) distributions.
Given a discrete random variable X describing Ω "traits" and taking on possible values {x 1 , x 2 , . . ., x Ω }, let p i = P(X = x i ).The entropy of this probability distribution is given by H p = − Ω i=1 p i log(p i ).Similarly, one It is well-known that the differential entropy is not a suitable generalization of the entropy concept to continuous variables [55] since it is not invariant under change of variables and can be negative.These issues can be circumvented by introducing the limiting density of discrete points.Here, we present a more direct approach that will be sufficient for our application.For a probability density function p : [a, b] → R + 0 we introduce a discretizing morphism D ∆ so that describes a random variable taking on values in each of the (b−a)/∆ = B bins.
To quantify the amount of information lost in this discretization step, consider the entropy H q = − B i=1 q i log q i ∼ log ∆ in the ∆ → 0 limit.If we want to evaluate any information loss as a difference between the (finite) differential entropy S p and the (diverging) entropy H q we need to account for this logarithmic contribution by defining the corresponding information loss as By absorbing the logarithmic contribution into the differential entropy, we find the correct continuous entropy according to Jaynes [55] using the limiting density of discrete points.
As an example, we compute the information loss associated with discretizing the truncated power law where γ(s, x) = x 0 t s−1 e −t dt is the lower incomplete gamma function.The distribution Eq. ( 54) gives rise to few high-abundance clones and many low-abundance clones, as typical for TCR receptor sequences [1].Analytic expressions for the discretized probabilities q i are lengthy, so we numerically compute q i to evaluate the information loss L(∆).Eq. ( 53) is plotted as a function of the number of bins B = 1/∆ in Fig. 8(a).The information loss decreases with the number of bins, as this results in the discrete distribution gathering more information about its continuous counterpart.
While the connection between continuous probability distributions and their discretized counterparts has important consequences for sampling, information loss also occurs in spectratyping when an already discrete random variable is coarse grained.In this scenario, the information loss can be quantified uniquely (up to a global multiplicative constant) by the entropy difference of the distributions [56].The difference between the full H p = − Ω i=1 p i log(p i ) and the coarse-grained H q = − B i=1 q i log q i can be explicitly evaluated for uniformly distributed probabilities.
For any number B < Ω we can define a coarse graining procedure that yields only B traits by defining the bin size ∆ = ceil(Ω/B) and grouping together ∆ traits into each bin.The last bin might be smaller than the other bins or even empty.The information loss of this procedure is shown in Fig. 8(b) for an initially uniform distribution of Ω = 1000 traits.Across certain ranges of B, plateaus can build since our coarse graining might add zero probabilities.However, we can instead start from a continuous distribution and compare the discretization with Ω = 1000 bins to any other binning with B ≤ Ω.
Comparing a coarse-grained uniform distribution with B bins to the discretized distribution with Ω bins yields the information loss with respect to the initial discrete distribution L = − log(B/Ω) ≥ 0. We plot this analytical prediction against the information loss L associated with coarse graining an already discrete distribution in Fig. 8(b), showing them to be well-aligned.

Discussion and conclusions
Quantifying properties of cell-type or sequence distributions is an important aspect of analyzing the immune repertoire in humans and animals.Different methods have been developed to estimate TCR and BCR diversity indices such as the total number of distinct sequences in an organism (i.e., species richness) [1,37,51].Another quantity of interest is the number of clones that are considered "public" or "private," indicating how often certain TCR or BCR sequences occur across different individuals.
Public TCRβ and BCR sequences have been reported in a number of clinical studies [18,[20][21][22][23][24].However, the terms "public" and "private" clonotypes are often based on different and ambiguous definitions.According to [22], a "public sequence" is a sequence that is "often shared between individuals" [22], while [25] refers to a sequence as public if it is "shared across individuals".In addition to ambiguities in the definition of what constitutes a private/public sequence, overlaps between the immune repertoires of different individuals are often reported without specifying confidence intervals, even though variations may be large given small sample sizes and heavy tailed sequence distributions.
In this work, we provided mathematical definitions for "public" and "private" clones in terms of the probabilities of observing a number of clones across M selected individuals, complementing related work that introduced the notion of "sharing number M " (i.e., the expected number of sequences which will be found in exactly M individuals) to quantify the expected overlap between cell-sequence samples [26,27].Besides defining individual repertoire probability distributions, our results include analytic expressions for individual and multi-individual expected richness and expected overlap as given by Eqs.(5), ( 10), ( 17), (19), and (20).Additionally, using Eqs.( 28) to (33), we derived expressions for the expected richness and expected overlap in subsamples.The variability of quantities (second moments) such as the M -overlap and subsampled overlap were also derived.Studies analyzing the similarities and differences associated with immune repertoires of different individuals (see, e.g., [18,26,27]) may utilize our results on second moments of overlap measures to quantify the statistical significance of their findings.Our results are summarized in Table 1 where we provide expectations and second moments of all quantities as a function the cell population configurations n (m)  or as a function of the underlying clone generation probabilities p (m) , as is generated by models such as SONIA [43].
Further inference of richness and overlap given sample configurations can be developed using our results.For example, the parametric inference of expected richness in an individual given a sampled configuration s can be found using the multinomial model and Bayes' rule, as presented in Eq. (44).
While our results depend on knowledge of Ω and N , we show using powerlaw probability distributions and explicit expressions in Eqs.(47) and ( 49) that the richness is insensitive to Ω in the large N and Ω/N limits.Therefore, even though Ω may be impossible to accurately estimate, power-law probability distributions generally render our results robust to uncertainty in Ω. Analytic or semi-analytic expressions for the overlap quantities can also be derived.We leave this exercise to the reader.
Finally, in the context of coarse-graining, or spectratyping [57], we have discussed methods that are useful to quantify the information loss associated with different levels of coarse graining TCR and BCR sequences.The results presented here are based on an assumption of simple multinomial distributions as the underlying population model.As we mentioned, a number of mechanistically more realistic probability distributions have been derived for neutral, noninteracting clone populations in steady state [36].These include log series and negative binomial distributions each requiring tailored calculations for the corresponding richness and overlap.6), ( 8), ( 13), ( 16), ( 24), ( 25), (26), and (27), respectively.The component probabilities ρ

Fig. 1
Fig. 1 Sampling from a continuous distribution, described in terms of an underlying probability density p(x) and number density n(x).The probability density p(x) (solid red line) and the Riemann sum approximation to the probability (red bars of width ∆) are shown in panel (a).The probability that a trait in the interval [x, x + dx) arises is p(x)dx.As shown in (b), this distribution can be discretized directly by the intervals [i∆, (i + 1)∆) (red bars) defining discrete traits and their associated probabilities p i (see Eq. (1)).The probabilities p i can be transformed into clone counts c k (the number of identities i that are represented by k individuals) using Eq.(2), which are shown in (c).A finite sample of a population described by p(x) yields the binary outcome shown in (d).In this example, the total number of samples is N = 41 and and since the trait space x is continuous, the probability that the exact same trait arises in more than one sample is almost surely zero.Light blue bars in panel (e) represent number counts n i binned according to ∆.The probabilities pi = n i /N provide an approximation of p i .Clone counts for the empirical pi are calculated according to Eq. (3) and shown in (f).
(m) i .Such a model translates the fundamental underlying recombination process into a

Fig. 2
Fig. 2 Schematic of sampling of multiple species from multiple individuals.A central process produces (through V(D)J recombination) TCRs.Individuals select for certain TCRs resulting in population n (m) i of T cells with receptor i in individual m, for a total T-cell count where the number of effective TCRs Ω for individual m might have as upper bound Ω ∼ 10 14 , if, for example, we are considering just the CDR3 region of the β chain.Assuming a time-independent model (e.g, a model in steady-state), we can describe the probability of a T-cell population n (m) in individual m by a multinomial distribution over individual probabilities p (m) ≡ {p (m) i }: 0

Fig. 3
Fig. 3 Three individuals with overlapping cell number distributions n (m) i

Fig. 4
Fig. 4 Sampling from shifted uniform distributions.(a) Synthetic TCR or BCR distributions of M = 3 individuals.The distributions in individuals 1, 2, and 3 are indicated by solid black, dashed blue, and dash-dotted red lines, respectively.Each individual has 10 5 cells uniformly distributed across 1000 clones (100 cells per clone).The sampled group richness R (3) s is 1500.(b) Samples of size S (m) have been generated to compute the relative overlaps between individuals 1 and 2 (blue disks), 2 and 3 (green inverted triangles), 1 and 3 (orange squares), and 1-3 (red triangles).The solid black lines show the corresponding analytical solutions E K (M ) s /R (3) s (see Eq. (20)).Dashed grey lines show the maximum possible relative overlaps 500/1500 ≈ 0.33 and 750/1500 = 0.5.(c) The Fano factor Var K (M ) s /E K (M ) s associated with relative overlaps between individuals 1 and 2 (solid blue line), 2 and 3 (dashdotted green line), 1 and 3 (dashed orange line), and 1-3 (solid red line) as a function of the number of sampled cells S (m) .

Fig. 5
Fig. 5 Sampling from empirical TRB CDR3 distributions and overlap measures in the number representation.(a) Distributions of TRB CDR3 cells in M = 3 individuals.We used the SONIA package [43] to generate 10 5 TRB CDR3 sequences for each individual.The sampled group richness R (3) s was found to be 284, 598.Equal sample sizes S (m) were then drawn.(b) Relative overlaps between individuals 1 and 2 (blue disks), 2 and 3 (green inverted triangles), 1 and 3 (orange squares), and 1-3 (red triangles).The solid black lines plot the corresponding analytical solutions E K (M ) s ({n (m) }) /R (3) s found in Eqs.(33).The

Fig. 7
Fig. 7 plots the M -overlap E K (M) s

Fig. 8
Fig. 8 The information loss L as a function of the number of discretization bins B. The loss is least as the number of integration bins B → ∞.(a) The solid black line shows the information loss associated with discretizing a truncated power law [see Eq. (54)], and the dashed grey line is a guide-to-the-eye (power law) with slope −0.6.(b) Grey dots show the information loss associated with coarse graining a discrete and uniform random variable with initially Ω = 1000 traits.The solid curve shows the corresponding analytical result for the difference in information L = − log(B/Ω) between discretizing a continuous uniform distribution of Ω = 1000 traits using B bins.

2
ij (S), ρi (S), ρij (S) associated with sampled quantities in the p-representation are given by Eqs.(37),(38),(39), and (40), respectively.nd moment, sampled richnessΩ i=1 σ i + Ω j =i σ ij Ω i=1 ρ i (S) + Ω j =i ρ ij (S) (51)erm in Eq.(51)increases with ν, we expect that an effectively smaller repertoire size (recall p i ∼ i −ν and larger ν leades to fewer larger-population clones), that the expected Moverlap increases with ν.However, the S/ log M 1/ν factor decreases with ν since larger S give rise to a larger number of ways clones sampled from different individuals can "avoid" each other.These features give rise to a maximum in E K

Table 1 Table
of mathematical results.We list our main mathematical derivations and expressions for richness and overlap, unsampled and sampled, in both the n-representation and the p-representation.The component probabilities ρ