Background

Dating species divergence has been studied extensively for the last few decades using approaches based on genetics, archaeological findings, and radiocarbon dating [1, 2]. Finding accurate timing is crucial in analyzing morphological and molecular changes in the DNA, in demographic research, and in dating key fossils. One approach for estimating the divergence times is based on the molecular clock hypothesis [3, 4] which states that the rate of evolutionary change of any specified protein is approximately constant over time and different lineages. Subsequently, statistical inference can be applied to a given phylogenetic tree to infer the dating of each node up to calibration.

Our work focuses on this estimation problem and proposes new statistical methods to date the TMRCA in a coalescent tree of two species given a detailed phylogenetic tree for one of the species with the same transition rates per site. Our work does not detect introgression events, and in cases of introgression [5] should be used alongside methods for introgression detection (e.g. [6]). We note that dating the TMRCA in a coalescent tree is different than finding the population tree divergence time. A discussion regarding the differences between the two is available in [7, 8]. Specifically, the discordance of nuclear and mtDNA histories [9] suggests the coalescent tree and population tree of humans may have a different topology.

We formulate the problem of dating the TMRCA by modeling the number of transitions (\(A\leftrightarrow G,C\leftrightarrow T\)) in each site using a Poisson process with a different rate per site; sites containing transversions are neglected due to their sparsity (indeed, we include sparse transversions in the simulations and show that our methods are robust to their occurrences). The phylogenetic tree is used for estimating the transition rates per site. Hence, when considering two representative sequences, one from each population, our problem reduces to a binary sequence where the parity of the number of transitions of each site is the relevant statistic from which we can infer the time difference between them.

We can roughly divide the approaches to solving this problem into two. The frequentist approach seeks to maximize the likelihood of the observed data. Most notable is the PAML [10] package of programs for phylogenetic analyses of DNA and the MEGA software [11]. Alternatively, the Bayesian approach considers a prior of all the problem’s parameters and maximizes the posterior distribution of the observations. Leading representatives of the Bayesian approach are BEAST2 [12] and MrBayes [13], which are publicly available programs for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models.

In this work, we developed several distinct estimators from frequentist and Bayesian approaches to find the TMRCA directly. The proposed estimators differ in their assumptions on the generated data, the approximations they make, and their numerical stability. We explain each estimator in detail and discuss its properties.

A critical difference between our proposed solutions and existing methods is that we seek to estimate only one specific problem parameter. At the same time, software packages such as BEAST2 and PAML optimize over a broad set of unknown parameters averaging the error on all of them (the tree structure, the timing of every node, the per-site substitution rates, etc.). Subsequently, the resources they require for finding a locally optimal instantiation of the tree and dating all its nodes can be very high in terms of memory and computational complexity. Consequently, the number of sequences they can consider simultaneously is highly limited. Thus, unlike previous solutions, we utilize transition statistics from all available sequences, in the form of a previously built phylogenetic tree.

We develop a novel approach to simulate realistic data to test our proposed solutions. To do so, we employ Phylotree [14, 15] – a complete, highly detailed, constantly updated reconstruction of the human mitochondrial DNA phylogenetic tree. In our work we assume all substitutions are specified by Phylotree (for an elaborate discussion regarding the correction of this claim see [16]). We sample transitions of similar statistics to Phylotree and use it to simulate a sequence at a predefined trajectory from Phylotree’s root.

We then empirically test the different estimators on simulated data and compare our results to the BEAST2 software. Our proposed estimators are calculated substantially faster while utilizing the transitions statistics from all available sequences (Phylotree considers 24,275 sequences), unlike BEAST2 which can consider only dozens of sequences due to its complexity. Comparing with the ground truth, we show that BEAST2 slightly overestimates the TMRCA, while our estimates provide more accurate results. For larger TMRCAs such as the human-Chimpanzee, BEAST2 also has a larger variance compared to our methods. Finally, we use our estimators to date the TMRCA (given in kya – kilo-years ago) of modern humans with Neanderthals, Denisovan and Chimpanzee based on their mtDNA. Surprisingly, the TMRCAs we find (human-Neanderthals \(\sim\)408 kya, human-Denisovans \(\sim\)841 kya, human-Chimpanzee \(\sim\)5,010 kya) – are considerably later than those accepted today.

Methods

Estimation methods

First, we describe an idealized reduced mathematical formulation for estimating TMRCAs and our proposed solutions. In Estimating ancient TMRCAs using a large modern phylogeny section, we describe the reduction process in greater detail.

Consider the following scenario: we have a set of n Poisson rates, denoted as \(\{{\lambda _i}\}_{i=1}^n\) where \(n \in \mathbb {N}\). Let \(\vec {X}\) be a vector of length n such that each element \(X_i\) is independently distributed as \(\text {Pois}(\lambda _i)\). Similarly, let \(\vec {Y}\) be a vector of length n such that each element \(Y_i\) is independently distributed as \(\text {Pois}(\lambda _i \cdot p)\) for a fixed unknown p. We denote \(\vec {Z}\) as the coordinate-wise parity of \(\vec {Y}\), meaning that \(Z_i = 1\) if \(Y_i\) is odd and \(Z_i = 0\) otherwise. Our goal is to estimate p given \(\vec {X}\) and \(\vec {Z}\).

Remark 1

Note that the number of unknown Poisson rate parameters n in the problem \(\{\lambda _i\}_{i=1}^n\) grows with the number of observations \(\{(X_i, Z_i)\}_{i=1}^n\). However, our focus is solely on estimating p, so additional observations do provide more information.

Remark 2

The larger the value of \(p \cdot \lambda _i\), the less information on p is provided in \(Z_i\) as it approaches a Bernoulli distribution with a probability of 0.5. On the other hand, the smaller \(\lambda _i\) is, the harder it will be to infer \(\lambda _i\) from \(X_i\). As a result, the problem of estimating p should be easier in settings where \(\lambda _i\) is high and p is low.

Preliminaries

First, we derive the distribution of \(Z_i\); All proofs are provided in the Supplementary material (Section 1).

Lemma 1

Let \(Y \sim \text {Pois}(\Lambda )\) and Z be the parity of Y. Then \(Z \sim Ber(\frac{1}{2}(1-e^{-2\Lambda }))\).

We use this result to calculate the likelihood and log-likelihood of p and \(\vec \lambda\) given \(\vec {X}\) and \(\vec {Z}\). The likelihood is given by:

$$\begin{aligned} L\left( {\vec X,\vec Z;p,\vec \lambda } \right) = \prod \limits _{i = 1}^n {{e^{ - {\lambda _i}}}\frac{{{\lambda _i}^{{X_i}}}}{{{X_i}!}}\frac{1}{2}\left( {1 + {{\left( { - 1} \right) }^{{Z_i}}}{e^{ - 2{\lambda _i}p}}} \right) }, \end{aligned}$$
(1)

and the log-likelihood is:

$$\begin{aligned} l\left( {\vec X,\vec Z;p,\vec \lambda } \right) = \sum \limits _{i = 1}^n {\left[ { - {\lambda _i} + {X_i}\log {\lambda _i} + \log \left( {1 + {{\left( { - 1} \right) }^{{Z_i}}}{e^{ - 2{\lambda _i}p}}} \right) } \right] + Const.} \end{aligned}$$
(2)

This result follows immediately from the independence of each coordinate.

Cramer-Rao bound

We begin our analysis by computing the Cramer-Rao bound (CRB; [17, 18]). In Comparative study on raw simulations section, we compare the CRB to the error of the estimators.

Theorem 1

Denote the Fisher information matrix for the estimation problem above by \(I \in \mathbb {R}^{(n+1, n+1)}\), where the first n indexes correspond to \(\{\lambda _i\}_{i=1}^n\) and the last index \((n+1)\) corresponds to p. For clarity denote \(I_{p, p} \doteq I_{n+1,n+1}, I_{i, p} \doteq I_{i,n+1}, I_{p, i} \doteq I_{n+1,i}\). Then:

$$\begin{aligned} \forall i\ne j,\; 1\le i,j\le n: \quad&{I_{i,j}} = 0 ,\; \quad {I_{i,i}} = \frac{1}{{{\lambda _i}}} + \frac{4{p^2}}{e^{ 4{\lambda _i}p} - 1}, \quad {I_{i,p}} = {I_{p,i}} = \frac{4p{\lambda _i}}{e^{ 4{\lambda _i}p} - 1}, \nonumber \\&{I_{p,p}} = 4\sum \limits _{i = 1}^n {\frac{\lambda _i^2}{e^{ 4{\lambda _i}p} - 1}}. \end{aligned}$$
(3)

Consequently, an unbiased estimator \(\hat{p}\) holds:

$$\begin{aligned} \mathbb {E}\left[ (p-\hat{p})^2\right] \ge \left[ 4\sum \limits _{i = 1}^n {\frac{ \lambda _i^2}{ e^{4\lambda _i p} - 1 + 4p^2 \lambda _i }} \right] ^{-1}. \end{aligned}$$
(4)

If \(\forall i=1..n: \lambda _i=\lambda\), we can further simplify the expression:

$$\begin{aligned} \mathbb {E}\left[ (p-\hat{p})^2\right] \ge \frac{e^{4\lambda p} - 1 + 4p^2 \lambda }{4n\lambda ^2 }. \end{aligned}$$
(5)

The CRB, despite its known looseness in many problems, provides insights into the sensitivity of the error to each parameter. This expression supports our previous observation that the error of an unbiased estimator increases exponentially with \(\min _i \{ \lambda _i \cdot p \}\). However, for constant \(\lambda _i \cdot p\), the error improves for higher values of \(\lambda _i\). We now proceed to describe and analyze several estimators for p.

Method 1 - maximum likelihood estimator

Proposition 1

Following equation 1, the maximum likelihood estimators \(\hat{p}, \hat{\lambda }_i\) hold:

$$\begin{aligned} \sum \limits _{i = 1}^n {{{\hat{\lambda }}_i} = } \sum \limits _{i = 1}^n {{X_i}} , \quad {X_i} = {{\hat{\lambda }}_i} + \frac{2\hat{p}{{\hat{\lambda }}_i}}{ {{{\left( { - 1} \right) }^{{Z_i}}}{e^{ 2{{\hat{\lambda }}_i}\hat{p}}}} + 1}, \quad \sum \limits _{i=1}^n {\frac{\hat{\lambda }_i}{{ { {{\left( { - 1} \right) }^{{Z_i}}}{e^{ 2{{\hat{\lambda }}_i}\hat{p}}} + 1} }}} = 0. \end{aligned}$$
(6)

Proposition 1 provides n separable equations for maximum likelihood estimation (MLE). Our first estimator sweeps over values of \(\hat{p}\) (grid searching in a relevant area) and then for each \(i=1..n\) finds the optimal \(\hat{\lambda }_i\) numerically. The solution is then selected by choosing the pair \((\hat{p}, \{\hat{\lambda }_i\}_{i=1}^n)\) that maximizes the log-likelihood calculated using equation 2.

The obtained MLE equations are solvable, yet, finding the MLE still requires solving n numerical equations, which might be time-consuming. More importantly, MLE estimation is statistically problematic when the number of parameters is of the same order as the number of observations [19]. Subsequently, we propose alternative methods that may yield better practical results.

Method 2 - \(\lambda _i\)-conditional estimation

We propose a simple estimate of \(\vec {\lambda }\) based solely on \(X_i\), followed by an estimate of p as if \(\vec {\lambda }\) is known, considering only \(\vec {Z}\). This method is expected to perform well when \(\lambda _i\) values are large, as in these cases, \(X_i\) conveys more information about \(\lambda _i\) than \(Z_i\). This approach enables us to avoid estimating both \(\vec {\lambda }\) and p simultaneously, leading to a simpler numerical solution.

When \(p \le 1\), we can mimic \(Y_i\)’s distribution as a sub-sample from \(X_i\), i.e. we assume that \(Y_i|X_i \sim Bin(n=X_i,p)\). Then, we find the maximum likelihood estimate of p:

Proposition 2

If \(Y_i|X_i \sim Bin(X_i,p)\), then:

  1. 1.

    \(Y_i \sim \text {Pois} (\lambda _i \cdot p)\), which justifies this approach.

  2. 2.

    \(Z_i|X_i \sim Ber\left( {\frac{1}{2}\left( {1 - {{\left( {1 - 2p} \right) }^{{X_i}}}} \right) } \right)\), so we can compute the likelihood of p without considering \(\lambda _i\).

  3. 3.

    The maximum likelihood estimate of p given \(\sum \limits _{i=1}^n Z_i\) holds:

    $$\begin{aligned} \sum \limits _{i = 1}^n {{\left( {1 - 2\hat{p}} \right) }^{X_i}} = n - 2\sum \limits _{i=1}^n {Z_i} \end{aligned}$$
    (7)

Remark 3

We use the maximum likelihood estimation of p given \(\sum _{i=1}^n Z_i\) by applying Le-Cam’s theorem [20]. This eliminates the need for a heuristic solution of the pathological case \(X_i=0, Z_i=1\).

Method 3 - Gamma distributed Poisson rates

The Bayesian statistics approach incorporates prior assumptions about the parameters. A common prior for the rate parameters \(\vec {\lambda }\) is the Gamma distribution, which is used in popular Bayesian divergence time estimation programs such as MCMCtree [10], BEAST2 [12], and MrBayes [13]. Specifically, we have \(\lambda _i \sim \Gamma (\alpha ,\beta )\), and for p, we use a uniform prior over the positive real line.

Proposition 3

Let \(\lambda _i \sim \Gamma (\alpha ,\beta )\), then the maximum a posteriori estimator of p holds:

$$\begin{aligned} \frac{{\partial l}}{{\partial p}} = \sum \limits _{i=1}^n{\frac{X_i + \alpha }{{{\left( { - 1} \right) }^{{Z_i}}}{{\left( 1 + {\frac{2p}{{\beta + 1}}} \right) }^{X_i + \alpha }}+1}}=0 \end{aligned}$$
(8)

Subsequently, given estimated values for \(\alpha\) and \(\beta\), we can find an estimator for p numerically to hold Equation 8. Unfortunately, the derivative with respect to \(\alpha\) does not have a closed-form expression, nor is it possible to waive the dependence on \(\vec {Z}, p\). Hence, we suggest using Negative-Binomial regression [21] to estimate \(\alpha\) and \(\beta\) given \(\vec {X}\) .

Estimating ancient TMRCAs using a large modern phylogeny

In this section, we apply the methods described in Estimation methods section to estimate the non-calibrated TMRCAs of humans and their closest relatives by comparing mitochondrial DNA (mtDNA) sequences. Our approach assumes the following assumptions:

  1. 1.

    Molecular clock assumption - the rate of accumulation of transitions (base changes) over time and across different lineages is constant, as first proposed by Zuckerkandl and Pauling [3] and widely used since.

  2. 2.

    Poisson distribution - The number of transitions along the human and human’s closest relatives mtDNA lineages follows a Poisson distribution with site-dependent rate parameter \(\lambda _i\) per time unit (implying that the rate of accumulation of transitions is independent of the time since the last transition).

  3. 3.

    No transversions - We only consider sites with no transversions and assume a constant transition rate per site (\(\lambda _{i,A \rightarrow G}=\lambda _{i,G \rightarrow A}\), or \(\lambda _{i,T \rightarrow C}=\lambda _{i,C \rightarrow T}\)).

  4. 4.

    Independence of sites - The number of transitions at each site is independent of those at other sites.

  5. 5.

    Phylogenetic tree - The phylogenetic tree presented in the Phylotree database includes all transitions and transversions that occurred along the described lineages.

As the Phylotree database is based on tens of thousands of sequences, the branches in the tree correspond to relatively short time intervals, making multiple mutations per site unlikely in each branch [22]. However, when considering the mtDNA sequence of other species, the branches in the tree correspond to much longer time intervals, meaning that many underlying transitions are unobserved. For instance, when comparing two human sequences that differ in a specific site, Phylotree can determine whether the trajectory between the sequences was \(A \rightarrow G\), \(A \rightarrow G \rightarrow A \rightarrow G\), or \(A \rightarrow T \rightarrow G\). However, when comparing sequences of ancient species, an elaborate phylogenetic tree like Phylotree is not available, making it impossible to discriminate between these different trajectories.

We use the following notation:

  1. 1.

    Let \(\vec {X}_{\text {mtDNA}}\) denote the number of transitions observed at each site along the human mtDNA phylogenetic tree as described by Phylotree. Each coordinate corresponds to a different site out of the 16,569 sites. The number of transitions at site i, \(X_{\text {mtDNA,i}}\), follows a Poisson distribution with parameter \(\lambda _i\).

  2. 2.

    Let \(\vec {Y}\) denote the number of transitions between two examined sequences (e.g. a modern human and a Neanderthal). We normalize the length of the tree edges so that the sum of all Phylotree’s edges is one. The estimated parameter p relates to the edge distance between the two examined sequences. Subsequently, \(Y_i\) follows a Poisson distribution with parameter \(\lambda _i \cdot p\).

  3. 3.

    Let \(\vec {Z}\) denote the parity of \(\vec {Y}\).

Using \(\vec {X}\) and \(\vec {Z}\), we can estimate p using the methods in Estimation methods section. The TMRCA is given by: \(\frac{1}{2}(T_{\text {sequence 1}} + T_{\text {sequence 2}} + p)\) when \(T_{\text {sequence 1,2}}\) are the estimated times of the examined sequences measured in (uncalibrated) units of phylotree’s total tree length. For clarity, we summarize the process described in this section in Fig. 1.

Fig. 1
figure 1

We use a large, comprehensive phylogeny, such as Phylotree, and assume its tree topology and branch lengths are known. We also assume that the phylogeny is detailed enough so that it describes all the substitutions that occurred between its sequences. From this detailed phylogeny, we extract a list of the number of transitions and transversions that occurred in each site along the phylogeny to get \(\vec {X}_{mtDNA}\) (composed of the number of transitions that occurred at each site) and a list of phylogeny transversion sites - sites in which at least one transversion occurred along the phylogeny. We aim to estimate the distance between two sequences that are not necessarily part of the tree and may be much more distant than branch tree lengths. To do so, we extract a binary vector \(\vec {Z}\) that states for each site whether the sequences are identical (\(z_i=0\), marked black) or different (\(z_i=1\), marked orange). We check in which sites a transversion must have occurred between the two sequences (marked blue) and remove these sites from \(\vec {X}_{mtDNA}\) and \(\vec {Z}\), thus shortening these vectors. We do the same for the phylogeny transversion sites

Calibration

Our methods output p, which is the ratio of two values:

  1. 1.

    The sum of the edges between the two examined sequences and their most recent common ancestor (MRCA).

  2. 2.

    The total sum of Phylotree’s edges.

Similarly to BEAST2, to calibrate p to years, we use the per-site per-year substitution rate for the coding region given in [23] \(\mu =\) 1.57 x 10E-8. We then calculate the total sum of Phylotree’s edges in years by dividing the average number of substitutions in the coding region per site (1.4) by \(\mu\).

Results

Comparative study on raw simulations

To compare the performance of the three estimation methods described in Estimation methods section, we conducted experiments using simulated data. The Poisson rates \(\lambda\) were generated to reflect the substitution rates observed in mtDNA data using either a Categorical or a Gamma distribution. The parameters for the Gamma distribution (\(\alpha =0.23, \beta =0.164\)) were estimated directly from the data, while the parameters for the Categorical distribution were chosen such that both distributions have the same mean and variance. One of the Categorical values (\(\epsilon =0.1\)) corresponds to the rate of low activity sites in the mtDNA data. The other value (\(a=11.87\)) and the probabilities (0.11, 0.89) were chosen accordingly. To test the robustness of our methods, we have also conducted simulations using the celebrated K2P [24] and TN93 [25] substitution models, with rate matrix parameters and site scaling extracted from Phylotree’s data. Further simulation details appear in the Supplementary material, Section 2.1. The comparison results are shown in Fig. 2 with the Cramer-Rao bound for reference. To provide a qualitative comparison, we performed a one-sided paired Wilcoxon signed rank test on every pair of models, correcting for multiple comparisons using the Bonferroni correction. Our results show that Method 2 has the lowest squared error while Method 1 has the highest squared error, for all distributions and substitution models. It is noteworthy that although Method 3 assumes a Gamma distribution, it still performs well even when a model mismatch exists.

Fig. 2
figure 2

Box-plot of the log squared estimation errors of the three proposed methods for selected values of p, expressed as a percentage of the total length of Phylotree’s edges (outliers are marked with \(*\)). The simulations were run 10, 000 times for each value of p. The CRB is shown in black for reference and the circles represent the log of the mean values which are comparable to the CRB. The experiments were conducted for two different distributions of \(\lambda\) and two different substitution models: (Top left) Categorical distribution with two values: \(\epsilon =0.1\) with probability \(\eta =0.11\) and \(a=11.87\) with probability \(1-\eta\).(Top right) Gamma distribution with parameters \(\alpha\) and \(\beta\). (Bottom left) Site-scaled K2P substitution model. (Bottom right) Site-scaled TN93 substitution model

Phylogenetic tree simulations

We validated our methods by testing their performance in a more realistic scenario of simulating a phylogenetic tree. Our methods take as input the observed transitions along Phylotree (\(\vec {X}_{\text {mtDNA}}\)) using all of its 24,275 sequences and a binary vector \(\vec {Z}\) denoting the differences between two sequences, which we aim to estimate the distance between. We compared our methods to the well-known BEAST2 software [12], which, similarly to other well-established methods (such as MCMCtree [10], MrBayes [13], etc.) considers sequences along with their phylogenetic tree to produce time estimations. The software BEAST2 performs Bayesian analysis using MCMC to average over the space of possible trees. However, it is limited in its computational capacity, so it cannot handle a large number of sequences like those in Phylotree. For this reason, we used a limited set of diverse sequences, including mtDNA genomes of 53 humans [26], the revised Cambridge Reference Sequence (rCRS) [27], the root of the human phylogenetic mtDNA tree, termed Reconstructed Sapiens Reference Sequence (RSRS) [28], and 10 ancient modern humans [23]. More details about the parameters used by BEAST2 are available in the Supplementary material, Section 2.3. To evaluate our methods, we added a simulated sequence with a predefined distance from the RSRS.

Our aim is to generate a vector \(\vec {\lambda }\) that produces a vector \(\vec {X}\) that has a similar distribution to \(\vec {X}{\text {mtDNA}}\). The human mtDNA tree has 16,569 sites, of which 15,629 have no transversions. The MLE of \(\lambda _i\) at each site is the observed number of transitions, \(X_{\text {mtDNA,i}}\). However, simulating \(\vec {\lambda }\) as \(\vec {X}_{\text {mtDNA}}\) leads to an undercount of transitions because 10,411 sites (\(67\%\) of the total number of sites considered) had no transitions along the tree and their Poisson rate is taken to be zero. To mitigate this issue, the rates for these sites were chosen to be \(\epsilon\), the value that minimizes the Kolmogorov-Smirnov statistic [29, 30] (details are provided in the Supplementary material, Section 2.2).

The results are presented in Fig. 3. Similarly to Fig. 2, Method 1 has a larger error than Methods 2 and 3 for values of p within the simulated region, and the gap widens with increasing p. Methods 2 and 3 provide the best results for the entire range of p. Compared to Methods 2 and 3, BEAST2 is less accurate and has a larger variance for higher values of p. Additionally, BEAST2 has a much longer running time (roughly 1 hour) compared to our methods (less than a second). BEAST2 simulation presented here was conducted using a fixed tree topology. The results for a simulation without a fixed tree topology (running time \(\sim 1.5\) hours) are presented in Supplementary Fig. 2. Furthermore, to test the effect of additional sequences, we conducted simulations with an additional 50 human sequences for selected values of p and a fixed tree topology (running time \(\sim 3\) hours). Supplementary Fig. 3 presents a comparison of estimation errors for various BEAST2 estimators compared to Method 2. Finally, to examine the effect of removing transversion sites, we conducted simulations with different ti/tv ratios showing that decreasing the ti/tv ratio does not result in bias (Supplementary Fig. 4).

Fig. 3
figure 3

Comparison of our methods with BEAST2 estimator using simulated data. The right plot shows a zoom-in view of the left plot, focusing on values of p between 0 and \(20\%\). Each point in the plot represents the average of 5 runs, while the shaded regions indicate the range of estimations obtained. We note that in order to compare to our methods we only present here point estimates for BEAST2 (posterior means) and not the HPD of the full posterior distribution

Real data results

As the final step of our experiments, we apply our methods to real-world data to determine the TMRCA of the modern human and Neanderthal, Denisovan, and chimpanzee mtDNA genomes. A schema of our estimation process is provided in Fig. 4. Table 1 displays the uncalibrated distances between modern humans and each sequence, compared to the estimates from BEAST2. The presented TMRCA represents an average of the TMRCA obtained from 55 modern human mtDNA sequences of diverse origins [26]. Table 2 presents the TMRCA in kya (kilo-years ago) of the modern human and each sequence. We note that if some of the modern human sequences were very close to one another, a weighted average would be more appropriate, considering the proximity of sequences and giving sequences closer to one another a smaller weight. As the sequences in our case are from diverse origins, a uniform average is a good approximation. An alternative strategy to calculating the average TMRCA with all human mtDNA sequences is to use the RSRS instead. Moreover, we can use the MRCA of Neanderthals instead of using all Neanderthal sequences and the same for any other population with a confidently known MRCA, resulting in one estimate instead of multiple pairwise estimates. This strategy involves combining different possibly noisy TMRCAs (one for each population and one for the distance between the populations).

Fig. 4
figure 4

We used the following schema to obtain estimates for the time distance between real-world sequences: We first extracted \(\vec {X}_{mtDNA}\) from Phylotree. Then, we determined the binary vector \(\vec {Z}\) that denotes the differences between the RSRS and the sequence in consideration. We removed from both \(\vec {X}_{mtDNA}\) and \(\vec {Z}\) the phylogeny transversion sites and sequences transversion sites as explained in Fig. 1. We applied our estimation methods using \(\vec {X}_{mtDNA}\) and \(\vec {Z}\) as input to get uncalibrated p and calibrated p as explained in Calibration section. Finally, the TMRCA of the examined sequence and humans is given by \(TMRCA = \frac{1}{2}(T_{RSRS}+T_{Sequence}+p_{calibrated})\)

Table 1 Uncalibrated distances between modern humans and selected hominins
Table 2 Estimated TMRCAs of modern human and selected hominins

The estimates from real-world sequences presented in Table 1 are consistent with those obtained for the simulated dataset in Phylogenetic tree simulations section. For low values of p, our three methods all produce similar estimates while BEAST2’s has a slightly higher estimate. For the human-Chimpanzee uncalibrated distance, which is relatively high, Method 1 provides a higher estimate than that obtained by Methods 2 and 3, and BEAST2 provides a substantially higher estimate. The results in Table 2 show the TMRCA estimates, which are significantly smaller for our methods than those obtained from BEAST2 for human-Neanderthals and human-Denisovans. For example, BEAST2 estimated the human – Sima de los Huesos – Denisovans TMRCA as \(\sim 1,018\) kya, while our best-performing method (2) estimated it as \(\sim 841\) kya. This TMRCA is estimated as (540-1,410 kya) in [31]. Similarly, BEAST2 estimated the human – Neanderthal TMRCA as \(\sim 507\) kya, while our methods estimated it as \(\sim 408\) kya. Preceding literature estimates this time closer to ours (\(\sim\)400 kya [32,33,34]) while recent literature provides a much earlier estimate (\(\sim\)800 kya [35]). Finally, BEAST2 estimates the human-Chimpanzee TMRCA as \(\sim 7,293\) kya whereas our estimate is \(\sim\)5,010 kya, both are close to the literature value of 5 – 8 million years ago [36,37,38,39].

Conclusions

We investigated an estimation problem arising in statistical genetics when estimating the TMRCA of species. The problem’s formulation, estimating Poisson rates from parity samples, leads to multiple estimators with varying assumptions. We calculated the CRB for this estimation problem and compared our methods against commonly used BEAST2 in different empirical settings, including a simple sampling scheme (Comparative study on raw simulations section), a more elaborate generative scheme based on real-world mtDNA data (Phylogenetic tree simulations section), and the calculation of the TMRCA of modern humans and other hominins using their mtDNA genomes (Real data results section).

Our results indicate that our proposed methods are significantly faster and more accurate than BEAST2, especially for earlier TMRCAs such as the human-Chimpanzee. Our methods utilize the transition statistics from the entire known human mtDNA phylogenetic tree (Phylotree) without the need for reconstructing a tree containing the sequences of interest. Our results show that the human – Neanderthal TMRCA is \(\sim 408,000\) years ago, considerably later than the values obtained by BEAST2 (\(\sim 507,000\) years ago) and other values cited in the literature.