Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages

Casanellas, Marta; Fernández-Sánchez, Jesús; Garrote-López, Marina; Sabaté-Vidales, Marc

doi:10.1007/s11538-023-01167-y

Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages

Methods
Open access
Published: 13 June 2023

Volume 85, article number 68, (2023)
Cite this article

Download PDF

You have full access to this open access article

Bulletin of Mathematical Biology Aims and scope Submit manuscript

Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages

Download PDF

Marta Casanellas ORCID: orcid.org/0000-0002-1724-8358¹,
Jesús Fernández-Sánchez¹,
Marina Garrote-López² &
…
Marc Sabaté-Vidales³

977 Accesses
1 Citation
3 Altmetric
Explore all metrics

Abstract

Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, thus especially indicated to deal with data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages and does not assume stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson’s method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees.

Utility of characters evolving at diverse rates of evolution to resolve quartet trees with unequal branch lengths: analytical predictions of long-branch effects

Article Open access 14 May 2015

Developing a statistically powerful measure for quartet tree inference using phylogenetic identities and Markov invariants

Article 22 April 2017

The Performance of Two Supertree Schemes Compared Using Synthetic and Real Data Quartet Input

Article 19 February 2018

Molecular phylogenetic reconstruction faces several problems, still nowadays. Even if one restricts to gene tree reconstruction, one has to take into account the amount of available data (which might be low with respect to the number of taxa), and depending on the method applied, the selection of a suitable evolutionary model, the inherent difficulty of estimating parameters for the most complex models, and the incorporation of heterogeneity across sites and/or across lineages, among others (cf. Jermiin et al. 2020; Zou et al. 2019, and the references therein).

Most of the available methods strongly depend on the evolutionary model assumed (this is the case of maximum likelihood, Bayesian or distance-based approaches, to a more or less extent), and some have to estimate the substitution parameters for each possible tree topology. Recent work minimizes the relevance of the selection of the substitution model if the tree topology is correctly inferred (see Abadi et al. 2019) and in this case, a general Markov model (GM for short) could be used (see also Kaehler et al. 2015). Regarding topology reconstruction, some methods that avoid parameter estimation and allow complex substitution models such as GM are those based on the paralinear distance or on algebraic tools (phylogenetic invariants and related tools, see Allman and Rhodes 2007).

The paralinear distance (Lake 1994) is a measure that attempts to estimate the evolutionary distance (in terms of expected number of substitutions per site) when sequences evolve under a GM model. It may overestimate the expected number of substitutions when the process is far from stationary, or branch lengths are long (see Kaehler et al. 2015; Zou et al. 2012), so more elaborate methods are provided in the quoted papers. However, it is a widely used distance due to its simple formula and its generality, and it has recently been proven to be consistent for the multispecies coalescent model in the ultrametric case as well, see Allman et al. (2021, 2019).

Several algebraic methods for phylogenetic reconstruction have been proposed in the last years, see, for example, SVDQuartets (Chifman and Kubatko 2014), Erik+2 (Fernández-Sánchez and Casanellas 2016)—both already implemented in PAUP* (Swofford 2003), Splitscores^{Footnote 1} (Allman et al. 2016), or SAQ (Casanellas et al. 2021b). The methods Erik+2, Splitscores, and SAQ are based on the GM model and, in particular, they are not subject to stationarity or time-reversibility and account for different rates of substitution at different lineages (the so-called heterogeneity across lineages, see Appendix A). The three of them consider algebraic conditions in the form of rank constraints of a flattening matrix obtained from the observed distribution of characters on a sequence alignment. Only SAQ considers also the stochastic description of the evolutionary model (see Allman et al. 2014), which translates into semi-algebraic conditions. On the other hand, Erik+2 also allows across-sites heterogeneity as it is able to deal with data from a mixture of distributions on the same tree. Despite the potential of algebraic-based methods for topology reconstruction (already pointed out in the book by Felsenstein (2004)), some of these methods may not work well for short alignments especially in presence of the long branch attraction phenomenon. For instance, Erik+2 is highly successful on different types of quartet data (also on the Felsenstein zone) but requires at least one thousand sites to outperform maximum likelihood or neighbor-joining (briefly NJ) (Fernández-Sánchez and Casanellas 2016). By taking into account the stochasticity of the substitution parameters, SAQ overcomes this problem at the expense of performing slightly worse than Erik+2 for very large amounts of data (10,000 sites or more) (Casanellas et al. 2021b).

Algebraic methods are mainly aimed at recovering quartet topologies (or splits in some cases) and some account for the possibility of being implemented into quartet-based methods. Quartet-based methods (Q-methods for short) have been questioned in the literature. For instance, Ranwez and Gascuel (2001) evaluated two quartet-based methods, their weight optimization method (briefly WO) and the quartet puzzling method (QP) by Strimmer and von Haeseler (1996), and they weighted quartets using a maximum likelihood approach for the Kimura two-parameter model. Their main conclusion was that both QP and WO give worse results than neighbor-joining or than a maximum likelihood approach applied directly to the whole set of taxa. More recent methods such as QMC from Snir and Rao (2010), QFM Reaz et al. (2014), or wQFM from Mahbub et al. (2021) seem to be competitive with maximum likelihood and are scalable to large data sets. As pointed out by Ranwez and Gascuel (2001), the weaknesses of Q-methods are very likely due to the method of weighting the quartets rather than to the method of combining them.

As far as we are aware, the only works that evaluate the use of algebraic methods as input of quartet-based methods are Rusinko and Hipp (2012) (which is restricted to QP with the Jukes-Cantor model), Holland et al. (2012) (where squangles are applied to infer the quartets) and Chifman and Kubatko (2014) (for the coalescent model). On the other hand, the correct management of long-branch attraction is crucial for obtaining a successful quartet-based method (John et al. 2003). These claims and remarks motivate part of the present work. We expect that, as SAQ, Erik+2 and the paralinear method handle successfully the long-branch attraction problem, Q-methods with these weighting systems improve their performance. Moreover, Zou et al. (2019) proved that if a machine learning approach is applied to weighting the quartets, QP can have a similar performance than NJ especially under substitution processes that are heterogeneous across lineages. Precisely, handling heterogeneity of substitution rates across lineages is one of the key features of algebraic methods based on the GM model of nucleotide substitution.

The goal of this work is to test the performance of several quartet-based methods for phylogenetic tree topology reconstruction when applied with input weights from different methods consistent with the general Markov model. We test QFM, wQFM, QP, WO and the method WIL (proposed in Willson 1999) with different systems of input weights: two systems, PL and 4P (see Appendix A), derived from paralinear distances and the four-point condition (Lake 1994; Gascuel 1994; Mihaescu et al. 2009), and three systems based on algebraic and semi-algebraic methods, namely SAQ, Erik+2 and the new proposed method ASAQ (see below) that combines the paralinear distance and algebraic methods. We also provide a new implementation of QP, WO, WIL and the code for ASAQ. We test exhaustively all these combinations on simulated data evolving either under the GM model or the homogeneous general (continuous) time-reversible model (GTR) on twelve taxon trees, and also provide a comparison when input weights from a maximum-likelihood approach are used. We also compare the performance to a traditional NJ (from now on, global NJ) and a ML estimation of the tree (global ML) and test some of the methods on real data: the eight species of yeast studied in Rokas et al. (2003) and the Ratite mitochondrial DNA data studied in Phillips et al. (2009).

The new method ASAQ (standing for algebraic and semi-algebraic quartet reconstruction) is a topology reconstruction method for four taxa which combines Erik+2 and SAQ by means of the paralinear method. This is a quartet reconstruction method based on a statistic (see Eq. (1) in the Materials and Methods section) that assesses the positivity of the estimated length (using the paralinear distance) of the interior branch of the quartet. When data are unmistakably generated by a quartet, the topology output by any reconstruction method should be consistent with the positivity of this statistic. ASAQ uses the paralinear method to either ratify the results of Erik+2, or to rely on SAQ when there is an inconsistency between the outputs of Erik+2 and the paralinear method. By proceeding like this, ASAQ ensures an overall better performance on quartets than Erik+2, SAQ (and the paralinear method itself in most occasions).

As ASAQ is statistically consistent with the GM model, it can consider data evolving on a tree where base composition and instantaneous rate substitution patterns can differ among lineages (heterogeneity across lineages), and does not assume stationarity nor time-reversibility, see Appendix A. We test the performance of ASAQ on a wide scenario of simulated data: on the tree space proposed by Huelsenbeck (1995), on quartets with random branch lengths and on mixture data on the same topology with two categories. All simulations used data either generated from the GM model or from a GTR model, with different sequence lengths.

One can use ASAQ with mixtures of distributions on the same tree topology with two or three categories (or partitions) as this was already implemented by Erik+2. Although ASAQ is based on SAQ and the paralinear method, which are not guaranteed to be consistent on mixtures, it is also highly successful on this kind of data, see the Results section.

1 Materials and Methods

In this section, we describe the new method ASAQ for quartet reconstruction, the quartet-based methods applied and the simulation studies performed.

1.1 Description of the New Method ASAQ

ASAQ is a quartet reconstruction method based on a pair of previous methods by the authors (Fernández-Sánchez and Casanellas 2016; Casanellas et al. 2021b). For a phylogenetic tree T with an interior node r as root, we consider a Markov process on the alphabet $\{\texttt {A},\texttt {C},\texttt {G},\texttt {T}\}$ on T by assigning transition matrices $M_e$ at the edges of T and a distribution $\pi $ at r. As no restrictions are imposed on the transition matrices or $\pi $, this is usually called a general Markov model (GM) on T. One can compute the theoretical joint distribution p of patterns at the leaves of T in terms of the entries of $M_e$ and $\pi $ and we say that p has arisen on T with certain substitution parameters.

We consider fully resolved (unrooted) trees on a set of four taxa $[4]=\{1,2,3,4\}$: the three quartet trees shall be denoted as 12|34, 13|24, 14|23, according to the bipartition induced by the interior edge. Fernández-Sánchez and Casanellas (2016) introduced Erik+2, a reconstruction method essentially based on the rank of flattening matrices obtained from a distribution of nucleotides at the leaves of a tree. The method allows the possibility of dealing with data from mixtures of distributions with up to 3 categories (this upper bound is not computational, but rather a consequence of the theoretical basis of the method: distributions from mixtures with c categories satisfy that the rank of a certain $16 \times 16$-matrix is upper bounded by 4c, so in order to discriminate between such distributions and generic distributions we need $4c < 16$). In the more recent method SAQ, Casanellas et al. (2021b) create a more sophisticated method combining the rank of the flattening matrices with the stochastic information available in the data via a result of Allman et al. (2014). Both methods Erik+2 and SAQ, as well as their associated weighting system, are briefly described in Appendix A.1. They have been widely studied and the reader is referred to the cited publications for their performance on different scenarios.

As already explained in the introduction, while SAQ usually outperforms Erik+2 for short DNA sequence alignments (length $\le $ 1000), Erik+2 obtains better results as the length of the DNA alignment increases. This consideration leads us to introduce ASAQ, a new combined method of Erik+2 and SAQ that tries to apply one or the other according to whether the input pattern distribution is consistent with the positivity of the estimated length of the interior branch. To this end we use the paralinear distance and the paralinear method (see Appendix A.1, Lemma A.1 and Lake (1994)).

For a distribution of patterns at the set of leaves, $p\in {\mathbb {R}}^{256}$, we compute all paralinear distances $d_{x,y}$ between pairs $x,y \in [4]$. Then, given a bipartition A|B of the set [4], $A=\{i,j\}$, $B=\{k,l\}$, we define the following quantity

$$\begin{aligned} {p\ell }_{A|B}(p)=\min \{d_{i,k}+d_{j,l} \,, \, d_{i,l}+d_{j,k}\}-d_{i,j}-d_{k,l}. \end{aligned}$$

(1)

The quantity above is the “neighborliness” measure used in Gascuel (1994), the “paralinear method” used in Lake (1994), and was presented by Buneman (1971) as a measure of twice the length at the interior edge (when d is a tree metric). It is worth pointing out that at most one of the three values ${p\ell }_{12|34}(p)$, ${p\ell }_{13|24}(p)$ and ${p\ell }_{14|23}(p)$ is strictly positive if all paralinear distances above are non-negative, see Lemma A.2. We denote by $\texttt {PL} (p)$ the collection of these values:

$$\begin{aligned} \texttt {PL} (p)=({p\ell }_{12|34}(p), {p\ell }_{13|24}(p), {p\ell }_{14|23}(p)). \end{aligned}$$

If p has arisen on a quartet A|B and the entries of the Markov matrix at the interior edge were strictly positive, both quantities inside the minimum in Eq. (1) are equal (as can be deduced from the 4-point condition). Moreover, ${p\ell }_{A|B}(p)$ is the unique positive quantity in the triplet $\texttt {PL} (p)$ and its value coincides with twice the paralinear distance of the interior edge (see Theorem A.3). The paralinear method is the quartet reconstruction method that outputs the tree T with highest ${p\ell }_{T}$ value. It is a statistically consistent quartet-inference method (Theorem A.3).

Therefore, we propose the following quartet reconstruction method ASAQ: it checks whether both methods Erik+2 and the paralinear method PL output the same quartet, and

(1)
if Erik+2 and PL agree, then ASAQ outputs the topology and weights of Erik+2;
(2)
if they do not agree or some paralinear distances $d_{x,y}$ are negative, then ASAQ outputs the topology and weights of SAQ.

An inconsistency between Erik+2 and the paralinear method implies that data are far from having arisen on a tree with stochastic parameters (note the role of the positivity of the entries of the Markov matrix in Lemmas A.1 and A.2). The positivity of transition matrices implies semi-algebraic conditions on the joint distributions at the leaves (Allman et al. 2014) and in this case we rely on SAQ, as it is the unique method that takes into account both these semi-algebraic conditions and the algebraic constraints considered by Erik+2. In Theorem A.3 we prove that ASAQ is as well a statistically consistent quartet reconstruction method for the general Markov model and in Table 5 we show the percentage of discordance between PL and Erik+2, and the success of SAQ in these cases.

We cannot claim that ASAQ is statistically consistent for mixtures because consistency is not known to hold for the paralinear method or SAQ in this scenario. However, the simulation studies in Casanellas et al. (2021a) show a good performance of SAQ in mixture data from the same tree and this will lead to a good performance of ASAQ as well (see the Results section). We denote by ASAQ ($m=k$) the use of ASAQ with Erik+2 estimating mixtures on the same tree with k categories. The limit on the number of categories $m=3$ for quartets comes from the theoretical foundations of Erik+2, as a larger amount of categories would make unfeasible the identifiability of the tree topology by this method.

As a topology reconstruction method, the paralinear method is highly successful (see Results section, Fig. 2). Nevertheless, we found that using the paralinear method in order to ratify or not the results of Erik+2 gives a better performance on topology reconstruction.

1.2 Quartet-Based Methods (Q-Methods)

We have implemented different quartet-based methods (Q-methods) with different input weights. All of them seek for a tree that maximizes weighted quartet consistency. Quartet puzzling (QP), weight optimization (WO) and Willson’s (WIL) methods have been programmed in C!++! and wQFM has been used with the implementation provided in Mahbub et al. (2021). QP amalgamates quartets in a randomized order and seeks to maximize the total sum of weights; we provide a new implementation of QP different from the one in Schmidt et al. (2002) since we wish to apply the method with systems of weights not based on likelihoods. Weight optimization uses quartet weights to dynamically define the taxon addition order, seeking to maximize the total weight at each step. WO is known to reconstruct the correct tree if the input quartets are correctly weighted (i.e. if all quartets of the original tree have the highest weight among the tree possible weights of the corresponding 4-tuple), see Ranwez and Gascuel (2001). Instead of constructing a tree that maximizes the total weight at each step, the essential idea of Willson’s method is attaching new taxa in such a way that the new tree at each step is highly consistent with the input quartets. QP, WO and WIL are initialized at a random 4-tuple. Since the output of these Q-methods strongly depends on the choice of the initial quartet, each one of them has been applied 100 times to each alignment and then the majority rule consensus tree (briefly MRCT) of these 100 replicates has been computed.

On the contrary, wQFM amalgamates quartets following a divide and conquer approach and implicitly gives the same importance to all quartets (and does not depend on an initial quartet choice). Although wQFM was specially designed to build species trees from gene trees, it can also consider other input weights. We use wQFM with quartets weighted by different methods, but also with unweighted quartets. In this last case we refer to the method as QFM Reaz et al. (2014) and the weights are transformed to 1 for the quartet output by the method and to 0 for the other two quartets.

In order to evaluate the difference between two trees, we use the Robinson–Foulds distance (RF for short), (Robinson and Foulds 1981). For the computation of the majority rule consensus tree and the RF distance, the available functions in the Python Library DendroPy have been used, see Sukumaran and Holder (2010).

Input Weights

We require the weights of the quartet reconstruction methods to be positive and normalized. Details about the input weights obtained from ASAQ are provided in the previous section. Further details about the weighting system for all the considered methods are moved to Appendix A.1. For the paralinear method the weights are denoted as PL and are obtained after normalizing the exponentials of the scores given by Eq. (1); 4P is a slight modification of this method, see Appendix A.1. We consider already published weighting systems for SAQ, Erik+2 and maximum likelihood (ML): see Fernández-Sánchez and Casanellas (2016) for Erik+2, Casanellas et al. (2021b) for SAQ, and the posterior probabilities used in Strimmer et al. (1997) for ML.

Maximum likelihood weights (which will be denoted as ML) have been computed assuming the most general continuous-time homogeneous model (same instantaneous rate matrix throughout the tree but no constraints on the entries of the rate matrix or assumption of stationarity); this is the Lie Markov model listed as 12.12 in Fernández-Sánchez et al. (2015) and was denoted as ML(homGMc) in Fernández-Sánchez and Casanellas (2016). To this aim we used the baseml program from the PAML library (Yang 1997) with the UNREST model and let it infer the instantaneous rate matrix (common to all lineages) and the distribution at the root. Note that in order to use weights obtained from ML, we need to obtain the estimates of the maximum likelihood for the three possible quartets. This is unfeasible when the method does not converge (which often happens for the quartets which did not generate the data).

1.3 Description of the Simulated Data

We consider two different scenarios of simulated data: one for testing ASAQ as a quartet reconstruction method (described in Sect. 1.3.1) and another for testing different Q-methods with several weighting systems (see Sect. 1.3.2). For the first we use the simulated data introduced by Fernández-Sánchez and Casanellas (2016), whereas for the second we follow the approach of Ranwez and Gascuel (2001) and consider 12-leaf trees.

Evolutionary models For the trees described below, we generate data evolving either on a general Markov model (GM, see Sect. 2.1) or on a homogeneous general time-reversible model (GTR). By a homogeneous GTR model we mean a continuous-time GTR model that shares the same instantaneous mutation rate matrix Q across all branches of the tree. GTR data have been generated using Seq-gen (Rambaut and Grass 1997), while GM data have been generated using the software GenNon-h (Kedzierska and Casanellas 2012) available at https://github.com/Algebraicphylogenetics/GenNon-H. For GTR data we specified in each setting a stationary distribution and a rate matrix common to all generated trees (see below for details on the chosen rates in each case), whereas for GM data the transition matrices are randomly generated for each alignment (according to the specified branch lengths).

1.3.1 Simulated Data for Quartet Reconstruction

Tree space The first data set we use to test ASAQ corresponds to the tree space suggested by Huelsenbeck (1995). We consider quartets as in Fig. 9a in Appendix B, with branch lengths given by a pair of parameters a and b which vary between 0 and 1.5 in steps of 0.02. Branch lengths are always measured as the expected number of elapsed substitutions per site. The resulting tree space is shown in Fig. 9b in Appendix B. The upper left region of this tree space corresponds to the “Felsenstein zone”, which contains trees subject to the long branch attraction phenomenon. For each of the two nucleotide substitution models considered in the paper, GM and GTR, and for each pair (a, b) of branch lengths, we have simulated one hundred alignments. The rates for GTR data were chosen as in Fernández-Sánchez and Casanellas (2016), subsection Description of the Data, p.284. The considered alignment lengths are of 500, 1 000 and 10 000 sites.

Random branch lengths Following Casanellas et al. (2021b), we test ASAQ on 10 000 alignments generated from quartets whose branch lengths are randomly generated according to a uniform distribution in the intervals (0, 1) or (0, 3). These alignments are obtained according to both substitution models, GM and GTR, and are either 1 000 or 10 000 sites long. We represent the weights output by ASAQ in a ternary plot (also called a simplex plot) as in Strimmer and von Haeseler (1997).

Mixture data The performance of ASAQ is also tested in the scenario of data sampled from a mixture of distributions. According to the approach of Kolaczkowski and Thornton (2004), we consider the mixture of distributions as follows. We partition the alignment into two categories of the same sample size both evolving under the GM model on the same quartet topology as Fig. 9a but the first category corresponds to branch lengths $a = 0.05$, $b = 0.75$, while the second corresponds to $a = 0.75$ and $b = 0.05$ (see Fig. 10 in the Appendix). The internal branch length takes the same value in both categories and varies from 0.01 to 0.4 in steps of 0.05. The total length of the alignments considered is 1 000 or 10 000 sites.

1.3.2 Simulated Data for Larger Trees

We followed Ranwez and Gascuel (2001) to test the performance of different quartet-based methods (QP, WO and WIL) with different weighting systems. To this end, we considered the three 12-taxon topologies depicted in Fig. 1 denoted as CC, CD and DD, and fixed the ratio among branch lengths as in Ranwez and Gascuel (2001), depending on a parameter b denoting the internal branch length, which is varied in the set $\{0.005,0.015,0.05,0.1,0.25,0.5\}$.

For each tree topology and for each b, we have considered 100 alignments with lengths 600 (in order to match the alignment length considered in Ranwez and Gascuel (2001)), 5 000, and 10 000 generated under the GM model. For DD topology, we have also generated data under the GTR model with $b=0.005,\ldots , 0.1$, and instantaneous rates 2 (A$\rightarrow $C), 5 (A$\rightarrow $G), 3 (A$\rightarrow $T), 4 (C$\rightarrow $G), 1 (C$\rightarrow $T), 2 (G$\rightarrow $T) and equal base frequency (we chose DD because it is arguably the hardest to reconstruct, see the Results section).

Mixture data We have considered a 2-category mixture model, that is, we generated alignments evolving on a the topology CD but whose sites evolve following two systems of substitution parameters: the first system corresponds to the branch lengths described in the tree CD, while the second system corresponds to CD after exchanging the branch lengths of seq3 and seq4, and the lengths of seq7 and seq8 (see Fig. 11 for details). The parameters that were varied in this framework were the proportion p of sites in the first category (which was varied in 0.25, 0.50, and 0.75) and the internal edge length b which was varied as above in 0.005, 0.015, 0.05 and 0.1. The lengths of the alignments considered were 600, 1 000 and 10 000 bp.

1.4 Real Data

Yeast data We analyze the performance of ASAQ on real data on the eight species of yeast studied in Rokas et al. (2003) with the alignment consisted of the concatenation of 42 337 s codon positions of 106 genes as provided by Jayaswal et al. (2014). We investigate whether the quartets output by ASAQ support the tree T of Rokas et al. (2003), the alternative tree $T'$ of Phillips et al. (2004) (see Fig. 12), or the mixture model proposed by Jayaswal et al. (2014). Although the tree T is widely accepted by the community of biologists, its correct inference is known to depend on the correct management of heterogeneity across lineages, as an inaccurate underlying model usually reconstructs $T'$ (Rokas et al. 2003; Phillips et al. 2004; Jayaswal et al. 2014). According to Jayaswal et al. (2014), these data are best modeled by considering, apart from heterogeneity across lineages, two different rate categories (discrete $\Gamma $ distribution) plus invariable sites. In our setting, this is translated into a mixture distribution with 3 categories ($m=3$ in ASAQ).

Ratites and tinamous mitochondrial data The phylogeny of ratites and tinamous has been debated for some time (see Phillips et al. 2009, and the references therein) and is still controversial (Benito et al. 2022). There is evidence of a higher rate of evolution among the tinamous relative to the ratites (see, e.g., Paton et al. 2002), so these data are appropriate for analysis by the methods proposed here. Moreover, the recoding used in Phillips et al. (2009) to sort out this problem is questioned in Vera-Ruiz et al. (2021). We do our analyses using a 3506 sites alignment consisting of the third codon position of the mitochondrial DNA coding alignment for 24 DNA sequences provided in this last paper. We run WO and QP with weights obtained by ASAQ, SAQ, Erik+2, 4P, PL. For each combination of methods we used all quartets but also a random selection of a subset of input quartets of size either one hundred, 2125 (approximately 20% of all possible quartets), or 5313 (equal to 50%), and then we performed the MRCTs. This was aimed at testing the sensitivity of the method to the amount of input data (comparing the performance when only a portion of the quartets is used) in terms of scalability of the methods.

We analyze the results obtained in comparison to the following trees: (A) the tree that groups Tinamous and Moas as proposed by Phillips et al. (2009) and displayed in Vera-Ruiz et al. (2021), Fig. 4B the consensus tree of Fig. 1a of Phillips et al. (2009) and (C) the Ratite paraphyly tree of Fig. 1b of Phillips et al. (2009). If we call CEK to the largely established clade CEK=((cassowary, emu), kiwis), then these trees can be summarized as:

A:
(outgroup, neognathus, (ostrich, (rheas, ((moas, tinamous), CEK))));
B:
(outgroup, neognathus, (tinamous, (moas, rheas, (ostrich, CEK))));
C:
(outgroup, neognathus, ((tinamous, moas), (rheas, (ostrich, CEK)))) ,

where the outgroup comprises sequences of alligator and caiman (see Fig. 7 for a description of the species involved). As unrooted trees, A and C have 21 interior edges and B has 20.

2 Results

In this section, we describe the results obtained and benchmark them with published results for the sake of completeness. The interested reader is referred to the corresponding papers for details of the methods therein.

2.1 Results on Quartets

2.1.1 Tree Space

The performance of ASAQ and PL on data generated on the tree space of the previous section (Materials and Methods) is represented in Fig. 2 (for GM data) and in Fig. 13 in the Appendix (for GTR data). In black we represent 100% success, in white 0% success, and gray tones correspond to regions of intermediate success accordingly. The 95 % and 33 % isoclines are represented with a white and black line, respectively. These simulation studies show a consistent performance according to the results by Huelsenbeck (1995) and Fernández-Sánchez and Casanellas (2016), with the usual decreasing performance at the Felsenstein zone and an improvement of both methods with sample size. Figures 2 and 13 show that the performance of ASAQ is better than that of PL, for both GM and GTR data.

For completeness, the average performance of ASAQ and PL on this tree space for different alignment lengths and underlying models is compared to other methods in Table 1: we include the average results of SAQ (Casanellas et al. 2021b, as shown in) and Erik+2 and ML as published in Fernández-Sánchez and Casanellas (2016). This comparison shows that for GM data the best results are achieved by ASAQ, while for GTR data ML obtains the best performance for alignments of length 500 and 1 000 bp and Erik+2 does so for long alignments (10 000 bp).

Table 1 Average success of several methods applied to data simulated on the tree space of Fig. 9b). ASAQ and PL are compared to the results for SAQ obtained in (Casanellas et al. 2021b), and for Erik+2 and maximum likelihood ML in (Fernández-Sánchez and Casanellas 2016). MLestimates the most general continuous-time homogeneous process 12.12 when data are generated under a GM model, while ML estimates a homogeneous GTR model when data are generated under GTR, see (Fernández-Sánchez and Casanellas 2016, Table 1). In each row of the table, the highest success is indicated in bold font

Full size table

2.1.2 Random Branch Lengths

To visualize the overall distribution of the weights of ASAQ applied to trees with random branch lengths, in Fig. 3 we show ternary plots corresponding to the GM alignments described in Sect. 1.3.1. The ternary plots of the performance of ASAQ when applied to the same setting with data generated under the GTR model are shown in Fig. 14. In Table 2 we display the summary of average success of ASAQ on these data.

Note that Table 2 shows a high performance of the method and the ternary plots show a clear distribution of points towards the bottom left corner (which represents the correct quartet) and weights symmetrically distributed on the other corners. In particular, the method is not biased towards any of the incorrect topologies. We note that the level of success exhibited is quite sensitive to the branch length, being much higher for branch lengths in (0,1) than in (0,3). We do not appreciate a remarkable difference between the performance of ASAQ when applied to GTR data.

Table 2 Average success of ASAQ on alignments of lengths 1 000 and 10 000 bp generated on the tree 12|34 under the GM and GTR models with random branch lengths uniformly distributed in (0,1) (first row) and (0,3) (second row). The plots corresponding to these data are shown in Figs. 3 and 14

Full size table

2.1.3 Mixture Data

In Fig. 4 we show the performance of the method ASAQ with $m=2$ categories when applied to data from mixtures as described in Sect. 1.3.1 and in Fig. 10. Based on the results of Fig. 5 in Fernández-Sánchez and Casanellas (2016), we also provide a comparison to the success of Erik+2 $(m=2)$ and two versions of maximum likelihood on the same data. We did not include the performance of SAQ, as it is very similar to that of ASAQ.

Figure 4 shows an increasing accuracy of all methods when the value of the parameter r (the branch length of the interior edge of the two trees involved) is increased. This was expected as larger values of r represent larger divergence between sequences at the left and the right of the trees. We note that ASAQ ($m=2$) outperforms the other methods, with a high level of success in all cases, even when the length of the alignments is 1 000 bp. The average success of ASAQ $(m=2)$ applied to the simulated data is $96.64\%$ for 1 000 bp, $99\%$ for 10 000 bp and $99.92\%$ for 100 000 bp (results not shown).

2.2 Results of Q-Methods

We proceed to describe the performance of wQFM, QFM, WO, WIL and QP (see the subsection on Quartet-based methods ) applied to the input weights from ASAQ, SAQ, Erik+2, PL, 4P and ML on the trees CC, CD and DD presented in Sect. 1.3.2. The weights from PL and 4P are both defined in terms of the paralinear distance (see Appendix A.1), and they produce similar results. Because of this, we omit the performance of 4P in some cases. We also add the results obtained from a global NJ applied to the same data (using the NJ algorithm implemented in the R package APE of Paradis et al. (2004) with paralinear distance) and also global ML using IQ-TREE 2 (Minh et al. 2020) with the most continuous-time homogeneous general model 12.12.

2.2.1 Unmixed Data

The results on simulated GM data obtained by the combination of methods and weights presented in this paper are summarized in Fig. 5 for CD, in Fig. 15 for CC and in Fig. 15 for DD. The results for GTR data (for the tree DD) are shown in Fig. 17. The height of the bars shows the average of the RF distance from the original tree to the consensus tree of 100 replicates for each of the 100 generated alignments. The values of these results are detailed in Tables 6 (for CC), C.3 (for CD) and C.4 (for DD) for GM data and in C.7 for GTR data on DD. For comparison, the results of a global NJ and global ML (which has a similar performance as a global NJ) applied to these data is displayed in Table 10 (see also the results of the global NJ in Figs. 5, 15 and 16). In Table 9, we show the results obtained when applying ML weights. Since ML did not converge for some quartets (especially in the presence of long branches and when trying to maximize the likelihood for GM data generated on another quartet), we write between parentheses the number of alignments considered for ML (in the computation of the average RF distance we neglected the alignments where ML did not converge).

By comparing the performance of the four Q-methods, we observe a slightly better performance of wQFM, QFM WIL and WO compared to QP. The accuracy drops for long branches in all methods, but wQFM and WO seem to perform slightly better for short and medium branches ($b\le 0.1$) while WIL does better for longer branches $b\ge 0.25$. Among all Q-methods, QP is the most sensitive to the choice of the system of weights.

Another general remark is that all Q-methods with their best system of weights outperform a global NJ and a global ML for $b>0.1$ on GM data, while none of them (with any system of weights) beats a global NJ or ML for b smaller than or equal to 0.1. Figure 17 shows that a global NJ is the best option for GTR data on DD trees.

All reconstruction algorithms have had considerably more success when reconstructing the tree CC in contrast to CD or DD for GM data, in concordance with the results obtained by Ranwez and Gascuel (2001). In the tree topologies CD and DD, the distance between seq9 and seq10 is 4b, while the distance between seq8 and seq11 is 20b. The same happens between species seq1 and seq2, and seq0 and seq3 of the DD topology. Thus, for an alignment generated from these trees, there is a high probability that two separate lineages evolve in a convergent manner to the same nucleotide at the same site, creating a long branch attraction situation and making the reconstruction methods to infer the wrong topology. wQFM, QFM, WO and WIL are especially successful when dealing with CC and CD trees (Fig. 15 and Fig. 5) and short branches, probably because these Q-methods succeed in reconstructing the C subtree.

Weights from ASAQ, SAQ, Erik+2, PL and 4P produce overall comparable results, although ASAQ and SAQ do better when dealing with long branches. wQFM with seems to work slightly better with quartets weighted by SAQ, PL and 4P over Erik+2 or ASAQ (although it does not perform as well as WO when dealing with long branches). On the contrary, in the unweighted version, ASAQ seems to provide the best input quartets for QFM (note that PL and 4P provide the same input for QFM, so they are displayed together in the tables). We also note an improvement in the results when the length of the alignment increases, especially for the CC case or for GTR data (Fig. 17) and hardly noticeable for the other two trees on GM data.

The simulation study shows a big difference between the general performance of the Q-methods with input weights from ASAQ, Erik+2 and PL in contrast to weights obtained by ML, especially when reconstructing the CC and CD trees (compare Tables 6, 7, 8 and 9). This could probably be due to the inconsistency between the ML weights computed and the general Markov model used to generate the data (see subsections Quartet-based methods and Description of the simulated data). Even when the model used for ML estimation matches the model that generated the data (as for GTR, see Table 11), ML seems to be the worse weighting system. When data are generated under the GTR model on DD trees (Fig. 17 and Table 11), all Q-methods (except for ML weights) have a remarkable high performance when alignments are long enough ($\ge $ 5 000 bp), especially WO. For short alignments, PL seems a good choice for the input weights with these data. All in all, we observe that both ASAQ and PL weights (combined with one or another Q-method) give rise to good results (much better than ML weights) comparable with the results obtained by a global NJ or ML, and even better when dealing with long branches.

2.2.2 Mixture Data

The results on mixture data obtained by a global NJ and by QP, WO, and WIL with input weights from PL, ASAQ and Erik+2 adapted to 2-category data (as described in Sect. 1.3.2) are summarized in Fig. 6 for WO and in Tables 12, 13 and 14 for all methods. Tables 12, 13 and 14 have a similar structure as Tables 6, 7 and 8 and correspond to the results obtained for different proportions between the two categories; $p=0.25$, $p=0.5$ and $p=0.75$, respectively. (We recall that the proportion p of sites of the first category of the alignment were generated assuming the branch lengths of the CD tree in Fig. 1.) It is worth pointing out that the reconstruction of mixture data from CD trees is more accurate on average than when applied to unmixed data (Fig. 5). This is probably due to the fact that when the branch lengths of species seq3, seq4 and seq7, seq8 are exchanged (Fig. 1), reconstruction methods have an easier job to make the appropriate splits, since the long branch attraction situation that was provoked by the quartet of species $\{seq8,seq9,seq10,seq11\}$ does no longer exist with the new branch lengths. This is consistent with the observation that the reconstruction results are more accurate for low values of b. For short alignments (600 bp), we note that ASAQ and PL weights provide better results than Erik+2. For 5 000 bp ASAQ and Erik+2 outperform PL. Moreover, if $b=0.1$, the combination WO+ASAQ or WO+Erik+2 beats a global NJ. For 10 000 bp, the difference between performance is even larger (see Tables Tables 12, 13 and 14). As expected, the length of the alignment improves the performance of these methods, reducing the impact of the long branch attraction effect for high values of b.

2.3 Results on Real Data

2.3.1 Yeast Data

As shown in Table 3, the claim by Jayaswal et al. (2014) is corroborated by our analysis: by performing a MRCT on 100 random initial quartets, WO+ASAQ reconstructs T when 3 categories are considered, but it reconstructs $T'$ otherwise. Similarly, the RF distance of the MRCT obtained by WIL+ASAQ to the tree T is smaller than the RF distance to $T'$ only when 3 categories are considered. When applied to these data, NJ with paralinear distances reconstructs the tree T.

It should also be mentioned that several studies have analyzed this data set seeking to build a species network from individual gene trees. These works mostly support a hybridization event involving S. kudriavzevii and S. bayanus, resulting in a species network with at least one 4-cycle that could give rise to the inconsistencies between trees T and the $T'$; see for instance Allman et al. (2019); Holland et al. (2004) and Yu et al. (2011).

Table 3 Robinson–Foulds distance of the consensus tree obtained by WIL and WO with ASAQ weights (with different number of categories) applied to the trees T and $T'$ suggested in Rokas et al. (2003) and Phillips et al. (2004), respectively

Full size table

2.3.2 Ratites and Tinamous

First, we analyze the results obtained by 50 majority rule consensus trees obtained from WO and QP on 100 input quartets weighted with different methods. In Table 4 we show the average results in comparison to the trees A, B and C detailed in Sect. 1.4. The number of interior edges obtained by each method shows that WO produces more resolved trees than QP (independently of the weighting method) and that ASAQ, SAQ and Erik+2 (m=1) give rise to more resolved trees than PL and 4P with both WO and QP.

Table 4 For each weighting system (indicated in the left column) combined with either QP or WO, the first column depicts the average number of interior edges of 50 MRCTs on Ratites real data. The average RF distance and normalized RF distance (that is, dividing by the number of interior edges) from the MRCTs to the trees A, B and C (see Sect. 1.4) is shown in the remaining columns. The last row shows the RF distance and the normalized RF distance from the global NJ tree (obtained using paralinear distances) to trees A, B and C

Full size table

Then we study the MRCTs obtained for these real data with WO (trees displayed in Fig. 7). When we use ASAQ and we compute the MRCT with 2125, 5313, or all initial quartets (with both $m=1$ and $m=2$), we obtain the phylogeny

D:
(outgroup, neognathus, tinamous, (rheas, (ostrich, (moas, CEK)))).

This phylogeny is also supported by a global NJ, although NJ splits tinamous and outgroup against the others. The same tree as NJ is obtained with SAQ (both for 2125 or 5313 initial quartets). Although the deepest interior branch lengths within ratites obtained by NJ are very small: (rheas, (ostrich, (moas, CEK):0.0025):0.003), SAQ and ASAQ give full support to these splits. Indeed, even when we use SAQ or ASAQ (with both $m=1$ or 2) with only 100 initial quartets, all 50 MRCTs share these splits. Unexpected trees are obtained by Erik+2, PL or 4P (either not solving correctly the neognathus clade, or not giving a CEK clade nor putting all tinamous in a clade).

2.4 Execution Time and Implementation

An implementation of ASAQ in C++ can be found in

https://github.com/marinagarrote/ASAQ-method

The Q-methods WO, WIL and QP applied along the paper are implemented in

https://github.com/msabvid/weights_quartet_methods/

The computations on this paper have been performed on a computer with 6 Dual Core Intel(R) Xeon(R) E5-2430 Processor (2.20 GHz) equipped with 25 GB RAM running Debian GNU/Linux 8. We have used the g++ (Debian 4.9.2-10+deb8u2) 4.9.2 compiler and the C++ library for linear algebra & scientific computing Armadillo version 3.2.3 (Creamfields).

The average time required to compute ASAQ and PL weights for 100 alignments of 4 taxa and length 10 000 bp is 8.7 and 7.8 seconds, respectively. For each alignment, ML was stopped if it did not converge for a 4-tuple after 10 s; the frequency of convergence can be seen on Tables 9 and 11, 12, 13 and 14. The average time to reconstruct a CD tree given the weights for its 495 quartet subtrees is 4 seconds for WO, 89.5 seconds for WIL and 1.8 seconds for QP.

3 Discussion

Via experiments on the simulation framework proposed by Ranwez and Gascuel (2001) but considering more general models of nucleotide substitution (including the GM model and mixtures of distributions), we observe a huge improvement on the performance of Q-based methods when weights from ASAQ, SAQ, Erik+2 and PL methods are considered. In general, the highest success is obtained by WO with the weighting system of ASAQ or PL, or QFM with ASAQ input. The accuracy of these methods is compatible with a global NJ or ML and improves upon them in the presence of mixtures or long branches when using ASAQ as weighting system. Moreover, these weights also outperform weights obtained by ML, even when data are generated under a GTR model and ML is estimating the same model.

It is worth noting that in the previous studies of Q-methods by Ranwez and Gascuel (2001); John et al. (2003), only weights from ML and NJ were considered. The results in this paper validate QFM, wQFM, WO and WIL as successful phylogenetic reconstruction methods if their input is a system of reliable weights. Moreover, as ASAQ assumes the most general Markov model and can deal with mixtures, this opens the door to use Q-methods to data generated by complex models.

We need to mention that the comparison performed against the ML weights (ML as input of Q-methods) or a global ML may not be totally fair because the model used in parameter estimation (continuous-time unrestricted model) does not fit the GM model used to simulate the data. It would certainly be interesting to develop maximum likelihood estimation based on the GM model and perform new tests. In any case, the results of ML weights are much worse than those of ASAQ, Erik+2 or PL also for GTR data (see Fig. 17 and Table 8 in the Appendix) and the results of a global ML are similar to a global NJ.

We have observed a good performance of input weights from ASAQ, Erik+2 and PL on simulated mixture data on 12-taxon trees, especially for 5 000 bp or more. ASAQ and PL are not likely to be statistically consistent for general mixtures on the same tree. Nevertheless, as Erik+2 is consistent on mixture data and PL is known to be consistent on some type of mixtures (see Allman et al. 2019), this suggests that ASAQ (being based on the accordance of Erik+2 and SAQ) might also be consistent on some types of mixture data, which would explain the good results obtained. We would like to point out that the mixture model allowed in Erik+2 (and hence in ASAQ) is actually more flexible than we mentioned. Indeed, the rank conditions considered in Erik+2 for quartets still hold if we let one cherry or one leaf evolve under a mixture with any number of categories (while the other cherry evolves under a single system of parameters), see Casanellas and Fernandez-Sanchez (2020). This makes the mixture model underlying Erik+2 and ASAQ more general, with implications in the mixture model that can be considered for Q-methods with input weights from ASAQ.

Our results on real data validate WO + ASAQ as a reliable method (both for yeast or ratites/tinamous data). Moreover, it is relevant to note that for the ratites/tinamou data, the topology obtained by WO + ASAQ or SAQ, or by NJ with paralinear distances does not agree with any of the topologies proposed in Phillips et al. (2009).

Note that the Q-methods considered here have higher order of computational complexity than NJ, but we have not yet explored the possibility of considering a subset of the possible 4-tuples as suggested in Snir and Rao (2010) or Davidson et al. (2018). It would be interesting to further explore these other versions of Q-methods with weights from ASAQ and PL, or even to restrict to quartets with highest weights as starting point for Q-methods.

On another direction, our results show that ASAQ is a powerful reconstruction method for quartet topology reconstruction. It assumes the most general model of nucleotide substitution (a general Markov model) of independently and identically distributed sites but it can also account for mixtures of distributions (with up to three categories). As it is based on the algebraic and semi-algebraic description of the model, it does not need to estimate the substitution parameters. In this sense, ASAQ could be easily adapted as a suitable method for dealing with amino acid substitutions as well. The incorporation of invariable sites seems also plausible via the results in Jayaswal et al. (2007); Steel et al. (2000). We plan to incorporate these features in a forthcoming version of the software.

As mentioned in the introduction, ASAQ is part of the set of phylogenetic reconstruction tools that are based on algebra. Most of these methods only reconstruct quartets because no statistically consistent (and computationally affordable) algebraic method for larger trees has been designed yet.

A study of ASAQ from a statistical point of view would certainly be relevant, as in this study the efficiency of the method has been solely based on the results obtained on large sets of simulated data.

Notes

under the name “SplitSup” in http://github.com/eallman/SplitSup/.

References

Abadi S, Azouri D, Pupko T, Mayrose I (2019) Model selection may not be a mandatory step for phylogeny reconstruction. Nat Commun 10:934
Allman ES, Banos H, Rhodes JA (2022) Identifiability of species network topologies from genomic sequences using the logdet distance. J Math Bio 84:35
Article MathSciNet MATH Google Scholar
Allman ES, Rhodes JA (2007) Phylogenetic invariants. In: Gascuel O, Steel MA (eds) Reconstructing evolution. Oxford University Press, Oxford
Google Scholar
Allman ES, Rhodes JA, Taylor A (2014) A semialgebraic description of the general Markov model on phylogenetic trees. SIAM J Discret Math 28(2):736–755
Article MathSciNet MATH Google Scholar
Allman ES, Kubatko LS, Rhodes JA (2016) Split scores: a tool to quantify phylogenetic signal in genome-scale data. Syst Biol 66(4):syw103
Article Google Scholar
Allman ES, Baños H, Rhodes JA (2019) NANUQ: a method for inferring species networks from gene trees under the coalescent model. Algorithms Mol Biol 14(1):24
Article Google Scholar
Allman ES, Long C, Rhodes JA (2019) Species tree inference from genomic sequences using the logdet distance. SIAM J Appl Algebr Geom 3(1):107–127
Article MATH Google Scholar
Benito J, Kuo P-C, Widrig KE, Jagt JWM, Field DJ (2022) Cretaceous ornithurine supports a neognathous crown bird ancestor. Nature 612:100–105
Article Google Scholar
Buneman P (1971) The recovery of trees from measures of dissimilarity. In: Mathematics in the archaeological and historical sciences, pp 387–395
Casanellas M, Fernández-Sánchez J, Garrote-López M (2021) Distance to the stochastic part of phylogenetic varieties. J Symb Comput 104:653–682
Article MathSciNet MATH Google Scholar
Casanellas M, Fernández-Sánchez J, Garrote-López M (2021) SAQ: semi-algebraic quartet reconstruction method. IEEE/ACM Trans Comput Biol Bioinf 18(6):2855–2861
Article Google Scholar
Casanellas M, Fernández-Sánchez J, Roca-Lacostena J (2023) The embedding problem for Markov matrices. Publicacions Matemàtiques 67(1):411–445
Article MathSciNet MATH Google Scholar
Casanellas M, Fernández-Sánchez J (2021) Rank conditions on phylogenetic networks. In: Extended abstracts GEOMVAP 2019. Trends in mathematics, vol. 15. Springer-Birkhäuser, pp 65–69
Chifman J, Kubatko LS (2014) Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324
Article Google Scholar
Davidson R, Lawhorn M, Rusinko J, Weber N (2018) Efficient quartet representations of trees and applications to supertree and summary methods. IEEE/ACM Trans Comput Biol Bioinf 15(3):1010–1015
Article Google Scholar
Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland
Google Scholar
Fernández-Sánchez J, Casanellas M (2016) Invariant versus classical approach when evolution is heterogeneous across sites and lineages. Sys Bio 65:280–291
Article Google Scholar
Fernández-Sánchez J, Sumner JG, Jarvis PD, Woodhams MD (2015) Lie Markov models with purine/pyrimidine symmetry. J Math Biol 70(4):855–891
Article MathSciNet MATH Google Scholar
Garrote-López M (2021) Algebraic and semi-algebraic phylogenetic reconstruction. Phd. thesis, Universitat Politècnica de Catalunya. https://upcommons.upc.edu/handle/2117/351096
Gascuel O (1994) A note on Sattath and Tversky’s, Saitou and Nei’s, and Studier and Keppler’s algorithms for inferring phylogenies from evolutionary distances. Mol Biol Evolut 11:961–963
Google Scholar
Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Mol Biol Evolut 21(7):1459–1461
Article Google Scholar
Holland BR, Jarvis PD, Sumner JG (2012) Low-parameter phylogenetic inference under the general Markov model. Syst Biol 62(1):78–92
Article Google Scholar
Huelsenbeck JP (1995) Performance of phylogenetic methods in simulation. Syst Biol 44(1):17–48
Article Google Scholar
Jayaswal V, Robinson J, Jermiin LS (2007) Estimation of phylogeny and invariant sites under the general Markov model of nucleotide sequence evolution. Syst Biol 56(2):155–162
Article Google Scholar
Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS (2014) Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Syst Biol 63(5):726–742
Article Google Scholar
Jermiin LS, Catullo RA, Holland BR (2020) A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2(2):lqaa041
Article Google Scholar
John SK, Warnow T, Moret BM, Vawter L (2003) Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. J Algorithms 48(1):173–193
Article MathSciNet MATH Google Scholar
Kaehler BD, Yap VB, Zhang R, Huttley GA (2015) Genetic distance for a general non-stationary Markov substitution process. Syst Biol 64(2):281–293
Article Google Scholar
Kedzierska AM, Casanellas M (2012) GenNon-h: generating multiple sequence alignments on nonhomogeneous phylogenetic trees. BMC Bioinform 13(1):216
Article Google Scholar
Kolaczkowski B, Thornton JW (2004) Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431(7011):980–984
Article Google Scholar
Lake JA (1994) Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci 91(4):1455–1459
Article Google Scholar
Mahbub M, Wahab Z, Reaz R, Rahman MS, Bayzid MS (2021) wQFM: highly accurate genome-scale species tree estimation from weighted quartets. Bioinformatics 37(21):3734–3743
Article Google Scholar
Mihaescu R, Levy D, Pachter L (2009) Why neighbor-joining works. Algorithmica 54(1):1–24
Article MathSciNet MATH Google Scholar
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R (2020) IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evolut 37(5):1530–1534
Article Google Scholar
Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20(2):289–290
Article Google Scholar
Paton T, Haddrath O, Baker AJ (2002) Complete mitochondrial DNA genome sequences show that modern birds are not descended from transitional shorebirds. Proc Biol Sci 269(1493):839–846
Article Google Scholar
Phillips MJ, Delsuc F, Penny D (2004) Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evolut 21(7):1455–1458
Article Google Scholar
Phillips MJ, Gibb GC, Crimp EA, Penny D (2009) Tinamous and Moa Flock together: mitochondrial genome sequence analysis reveals independent losses of flight among ratites. Syst Biol 59(1):90–107
Article Google Scholar
Rambaut A, Grass NC (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics 13(3):235–238
Article Google Scholar
Ranwez V, Gascuel O (2001) Quartet-based phylogenetic inference: improvements and limits. Mol Biol Evolut 18(6):1103–1116
Article MATH Google Scholar
Reaz R, Bayzid MS, Rahman MS (2014) Accurate phylogenetic tree reconstruction from quartets: a heuristic approach. PLoS ONE 9(8):1–13
Article Google Scholar
Robinson D, Foulds L (1981) Comparison of phylogenetic trees. Math Biosci 53(1–2):131–147
Article MathSciNet MATH Google Scholar
Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425(6960):798–804
Article Google Scholar
Rusinko J, Hipp B (2012) Invariant based quartet puzzling. Algorithms Mol Biol 7(1):35
Article Google Scholar
Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) Tree-puzzle: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(3):502–4
Article Google Scholar
Snir S, Rao S (2010) Quartets MaxCut: a divide and conquer quartets algorithm. IEEE/ACM Trans Comput Biol Bioinf 7(4):704–718
Article Google Scholar
Steel MA, Huson D, Lockhart PJ (2000) Invariable sites models and their use in phylogeny reconstruction. Syst Biol 49(2):225–232
Article Google Scholar
Strimmer K, von Haeseler A (1996) Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evolut 13(7):964–969
Article Google Scholar
Strimmer K, von Haeseler A (1997) Likelihood-mapping: A simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci 94(13):6815–6819
Article MATH Google Scholar
Strimmer K, Goldman N, von Haeseler A (1997) Bayesian probabilities and quartet puzzling. Mol Biol Evolut 14(2):210–210
Article Google Scholar
Sukumaran J, Holder MT (2010) DendroPy: a Python library for phylogenetic computing. Bioinformatics 26(12):1569–1571
Article Google Scholar
Sumner JG, Charleston MA, Jermiin LS, Jarvis PD (2008) Markov invariants, plethysms, and phylogenetics. J Theor Biol 253(3):601–615
Article MathSciNet MATH Google Scholar
Sumner JG, Taylor A, Holland BR, Jarvis PD (2017) Developing a statistically powerful measure for quartet tree inference using phylogenetic identities and Markov invariants. J Math Biol 75(6–7):1619–1654
Article MathSciNet MATH Google Scholar
Swofford DL (2003) PAUP${}^\ast $: Phylogenetic analysis using parsimony (${}^\ast $and Other Methods), Version 4.0b10. Sinauer Associates, Sunderland, Massachusetts
Vera-Ruiz VA, Robinson J, Jermiin LS (2021) A likelihood-ratio test for lumpability of phylogenetic data: is the Markovian property of an evolutionary process retained in recoded DNA? Syst Biol 71:660–675
Article Google Scholar
Willson SJ (1999) Building phylogenetic trees from quartets by using local inconsistency measures. Mol Biol Evol 16(5):685–693
Article Google Scholar
Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Bioinformatics 13:555–556
Article Google Scholar
Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst Biol 60(2):138–149
Article Google Scholar
Zou L, Susko E, Field C, Roger AJ (2012) Fitting nonstationary general-time-reversible models to obtain edge-lengths and frequencies for the Barry-Hartigan model. Syst Biol 61(6):927–940
Article Google Scholar
Zou Z, Zhang H, Guan Y, Zhang J (2019) Deep residual neural networks resolve quartet molecular phylogenies. Mol Biol Evol 37(5):1495–1507
Article Google Scholar

Download references

Acknowledgements

We would like to thank the reviewers of the paper for important contributions that improved the final version of the manuscript. MC, JFS and MGL were partially supported by Spanish State Research Agency grant PID2019-103849GB-I00. MC and JFS were also supported by AEI through the Severo Ochoa and María de Maeztu Program for Centers and Units of Excellence in R &D (project CEX2020-001084-M) and by the AGAUR project 2021 SGR 00603 Geometry of Manifolds and Applications, GEOMVAP.

Funding

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Author information

Authors and Affiliations

Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain
Marta Casanellas & Jesús Fernández-Sánchez
University of British Columbia, Vancouver, Canada
Marina Garrote-López
The University of Edinburgh, Edinburgh, UK
Marc Sabaté-Vidales

Authors

Marta Casanellas
View author publications
You can also search for this author in PubMed Google Scholar
Jesús Fernández-Sánchez
View author publications
You can also search for this author in PubMed Google Scholar
Marina Garrote-López
View author publications
You can also search for this author in PubMed Google Scholar
Marc Sabaté-Vidales
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

MC, JFS and MGL conceived the idea, developed the theoretical framework, designed the method and planned the simulations; MGL developed the idea, implemented ASAQ and carried out the simulations; MSV implemented all Q-methods and carried out part of the simulations; all authors discussed the results and wrote the paper. Order of the authors corresponds solely to alphabetical criteria.

Corresponding author

Correspondence to Marta Casanellas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Weighting Systems and Technical Results About ASAQ

The underlying nucleotide substitution model we consider is a general Markov (GM) model (without treatment of indels or ambiguous sites). That is, we assume that substitution of nucleotides along a phylogenetic tree T follows a Markov process on T specified by a distribution $\pi $ of nucleotides at an interior node (which pays the role of the root) and $4\times 4$ Markov matrices $M^e$ for each edge e of T. We do not assume any extra condition on the transition matrices (neither stationarity nor time-reversibility).

As Markov matrices $M^e$ are not assumed of type $exp(t_eQ_e)$ (which would imply assuming a time-homogeneous continuous-time Markov process), the Markov process along each edge is not necessarily homogeneous through time (that is, the patterns of instantaneous mutation rates are allowed to change through time). Models that assume $M^e=exp(t_eQ_e)$ are locally time-homogeneous; this condition is very restrictive: for example less than 4% of diagonal largest-in-column Markov matrices are of this type (Casanellas et al. 2023). Moreover, if a Markov process on T assumes $M^e=exp(t_eQ)$ with the same instantaneous rate matrix Q for the whole tree (as is often assumed in phylogenetic reconstruction), the process is globally time-homogeneous in addition (i.e. instantaneous mutation rate patterns are constant throughout the tree). In this sense, the GM model underlying the methods Erik+2, SAQ, ASAQ, PL and 4P is heterogeneous across lineages.

1.1 A.1 Weighting Systems

The input weights we consider in this paper are obtained from different quartet reconstruction methods: from SAQ, Erik+2 and ASAQ, from methods that use the paralinear distance (PL and 4P) and from maximum likelihood (ML). For any of these methods we describe here the weighting score used. To this end, let f be the vector of relative frequencies obtained from a DNA alignment of four taxa.

Weights for methods based on the paralinear distance, PL and 4P . Consider two (ordered) nucleotide sequences $S_x$ and $S_y$ of the same length corresponding to two taxa x and y, respectively. Let J be the underlying joint probability matrix of $S_x$ and $S_y$, this is, the entry (i, j) of J is the probability (either theoretical or estimated by relative frequencies) of observing nucleotides i and j at the same position of $S_x$ and $S_y$ (so that the sum of entries in J is one). If $\det J\ne 0$ and all nucleotides are observed in $S_x$ and $S_y$, then the paralinear distance (Lake 1994) between x and y is

$$\begin{aligned} d_{x,y}=-\log \frac{|\det J|}{\sqrt{\det D_x} \sqrt{\det D_y}}, \end{aligned}$$

(2)

where $D_x$ and $D_y$ are the diagonal matrices whose diagonal entries are given by $J {\textbf{1}}$ and ${\textbf{1}}^t J$, respectively; if $\det J=0$, we take $d_{x,y}$ as infinity. Based on the work by Lake (1994) and Ranwez and Gascuel (2001), given the vector f of relative frequencies from a quartet alignment, in Section Materials and Methods of the main document, we defined

$$\begin{aligned} {p\ell }_{12|34}(f)=\min \{d_{1,3}+d_{2,4},d_{1,4}+d_{2,3}\}-d_{1,2}-d_{3,4} \end{aligned}$$

which represents twice the length of the interior edge of the quartet (see Theorem A.3(b) below). We define ${p\ell }_{13|24}(f)$ and ${p\ell }_{14|23}(f)$ similarly.

In Mihaescu et al. (2009), the authors propose a slightly different way of weighting the quartets using the paralinear distance:

$$\begin{aligned} \ell _{12|34}(f)=\frac{1}{4}(d_{1,3}+d_{1,4}+d_{2,3}+d_{2,4})-\frac{1}{2}(d_{1,2}+d_{3,4}), \end{aligned}$$

($\ell _{13|24}(f)$ and $\ell _{14|23}(f)$ are defined analogously). These are the measures that we use for what we call the 4P method (standing for “four-point condition”). In the same paper quoted above, the authors establish the (theoretical) equivalence between NJ and a consistent quartet-based method that uses these weights. Observe also that these weights are equivalent to ${p\ell }_{12|34}(f)$ for theoretical data.

In order to assign a non-negative score to each quartet, we define the normalized weights for PL and 4P as

$$\begin{aligned} w_{p\ell }(f)= & {} \frac{1}{\sum _i \exp ({p\ell }_{T_i}(f))} \left( \exp ({p\ell }_{T_1}(f)), \exp ({p\ell }_{T_2}(f)), \exp ({p\ell }_{T_3}(f)) \right) . \\ w_{4p}(f)= & {} \frac{1}{\sum _i \exp (\ell _{T_i}(f))} \left( \exp (\ell _{T_1}(f)), \exp (\ell _{T_2}(f)), \exp (\ell _{T_3}(f)) \right) . \end{aligned}$$

Weights for maximum likelihood.

We weighted quartet trees for the ML method using the likelihoods associated to the three trees $T_1, T_2, T_3$ as done by Ranwez and Gascuel (2001): if $l_{T_i}(f)$ is the likelihood for the tree $T_i$ (with the MLE parameters), then the normalized weights for the three quartets are defined as

$$\begin{aligned} w_{ml} (f)= \frac{1}{\sum _i l_{T_i}(p)} \left( l_{T_1}(f), l_{T_1}(f), l_{T_1}(f) \right) . \end{aligned}$$

Weights for Erik+2 .

Given a bipartition ij|kl and a distribution $f\in {\mathbb {R}}^{256}$, denote by $Flat_{ij|kl}(f)$ the flattening of f according to that bipartition, that is, $Flat_{ij|kl}(f)$ is the $16\times 16$ matrix whose $(x_ix_j,x_kx_l)$-entry is the coordinate of f that matches $x_ix_jx_kx_l$ in the convenient order. Then, $Flat_{ij|kl}(f)$ has rank $\le 4$, while the flattenings corresponding to the other two bipartitions have rank 16 if the parameters are generic (Allman and Rhodes 2007)

If $\delta _4(\cdot )$ stands for the distance to the space of matrices with rank $\le 4$, the method Erik+2 computes for each quartet tree T the value

$$\begin{aligned} e_T(f)=\frac{\delta _4(Flat_T^r(f))+\delta _4(Flat_T^c(f))}{2}, \end{aligned}$$

(3)

where $Flat_T^r$ and $Flat_T^c$ stand for the (normalized) matrices obtained by dividing the rows and columns of the flattening matrix $Flat_T$ by the sum of its entries, respectively (as a variation we can consider the distance $\delta _{4m}(\cdot )$ to rank 4m matrices when we deal with mixed data with $m\in \{1,2,3\}$ categories). The right quartet tree should be the one that gives the minimal value; this is the output of the method.

In order to have a normalized scoring system that allows us to compare and represent the output of $\texttt {Erik+2} $ for each quartet tree T, we consider the inverse of the value $e_T(f)$ and then divide but the sum of these scores for all three quartet trees:

$$\begin{aligned} {\texttt {E}rik+2} (f)= \frac{1}{e}\left( e_{T_1}(f)^{-1},e_{T_2}(f)^{-1},e_{T_3}(f)^{-1}\right) \end{aligned}$$

where $e:= e_{T_1}(f)^{-1} + e_{T_2}(f)^{-1} + e_{T_3}(f)^{-1}$.

Weights for SAQ .

The method SAQ suggested by Casanellas et al. (2021b) associates the following score to the tree $T=12|34$:

$$\begin{aligned} s_T(f):=\sum _i \frac{\min \left\{ \delta _4(psd(Flat_{13|24}({\tilde{f}}_i))), \delta _4(psd(Flat_{1{4}|23}({\tilde{f}}_i)))\right\} }{\delta _4(psd(Flat_{12|34}({\tilde{f}}_i)))} \end{aligned}$$

where the sum runs over the sixteen 12|34 leaf-transformed distributions ${\tilde{f}}_i$ and psd(M) is the closest symmetric and positive semi-definite matrix to the matrix M. If T is any of the other two possible quartet trees, $s_T(f)$ is computed analogously by permuting the roles of the leaves accordingly. SAQ outputs the normalized three scores, that is, if $s:= s_{T_1}(f) + s_{T_2}(f) + s_{T_3}(f)$, then

$$\begin{aligned} {\texttt {S}AQ} (f):=\frac{1}{s} \left( s_{T_1}(f),s_{T_2}(f),s_{T_3}(f)\right) . \end{aligned}$$

1.2 A.2 The Paralinear Distance and Theoretical Foundations of ASAQ

The following lemma generalizes the nonnegativity and the additivity properties of this dissimilarity map to general Markov matrices, extending the results of Lake (1994) to general Markov matrices as far as they are non-singular.

Lemma A.1

Let $\pi $ be the nucleotide distribution at x and consider a substitution process leading from x to y and ruled by a Markov matrix M. Then, the paralinear distance $d_{x,y}$ between x and y defined in (2) coincides with

$$\begin{aligned} d_{x,y}=-\log \frac{|\det M| \, \sqrt{\det D(\pi )}}{\sqrt{\det D(M^t \pi )}}, \end{aligned}$$

(4)

(where D(u) refers to the diagonal matrix whose diagonal entries are the coordinates of the vector u). Moreover, we have that

a)
$d_{x,y}\ge 0$, and the equality holds if and only if $M=Id$ or is a permutation matrix;
b)
the dissimilarity measure $d_{x,y}$ is additive.

Proof

We have that $J=D(\pi ) M$ is the underlying joint probability matrix between x and y so that the sum of its entries is one. Note that $\pi =J{\textbf{1}}$ and $\pi ^t M={\textbf{1}}^t J$, so $D_x=D(\pi )$ and $D_y=D(M^t \pi )$. Therefore,

$$\begin{aligned} d_{x,y}= & {} -\log \frac{|\det J|}{\sqrt{\det D_x}\sqrt{\det D_y}} = -\log \frac{|\det M| \, \det D_x}{\sqrt{\det D_x}\sqrt{\det D_y}}\\ {}= & {} -\log \frac{|\det M| \, \sqrt{\det D(\pi )}}{\sqrt{\det D(M^t \pi )}}. \end{aligned}$$

Now, we proceed to prove (a): $d_{x,y}\ge 0$ or equivalently, that

$$\begin{aligned} (\det M)^2 \; \frac{\det D(\pi )}{\det D(M^t\pi )} \le 1. \end{aligned}$$

(5)

First of all, since the matrix M is positive, we deduce that

$$\begin{aligned} \big |\det M \big |=\left| \sum _{\{i_1,i_2,i_3,i_4\}=[4]}\prod _j \; sgn(i_1,i_2,i_3,i_4) \, m_{i_j,j} \right| \le \sum _{\{i_1,i_2,i_3,i_4\}=[4]}\prod _j \;m_{i_j,j}. \end{aligned}$$

We have that

$$\begin{aligned} |\det M |\det D(\pi )= & {} |\det M | \,\prod _k \pi _k \\\le & {} \prod _k \pi _k \; \left( \sum _{\{i_1,i_2,i_3,i_4\}=[4]}\prod _j \;m_{i_j,j}\right) \\= & {} \sum _{\{i_1,i_2,i_3,i_4\} = [4]}\prod _j \; m_{i_j,j}\; \pi _j \\\le & {} \sum _{i_1,i_2,i_3,i_4 \in [4]}\prod _j \; m_{i_j,j}\, \pi _{i_j} = \prod _{j} \sum _i m_{ji} \pi _j = \prod _i (M^t\, \pi )_i \\= & {} \det D(M^t\pi ). \end{aligned}$$

By multiplying this inequality with $|\det M|\le 1$ (M is a Markov matrix), we obtain (5). Note that a similar argument is used in the proof of the main theorem in Steel (1994).

Note that if $d_{x,y}=0$, then necessarily $\det M=1$. Hadamard’s inequality implies that a Markov matrix M with determinant 1 is necessarily the identity matrix or a permutation matrix (Casanellas et al., 2022). Finally, the statement (b) follows easily from the expression (4). $\square $

Write $\Delta $ for the probability simplex in the space ${\mathbb {R}}^{256}$, that is, $\Delta $ is the set of all possible distribution vectors of patterns of nucleotides $p=(p_{x_1 x_2 x_3 x_4})_{x_1,x_2,x_3,x_4\in \{A,C,G,T\}}$. By Lemma 8 of Buneman (1971) applied to the paralinear distance, we get the following result (which can be also deduced almost directly from the definition (1) of the paper):

Lemma A.2

Assume that the determinant of every double marginalization of $p\in \Delta $ is non-zero (so that all values $d_{i,j}$, $i,j\in [4]$ can be computed). If $d_{i,j}\ge 0$ for all $i,j\in [4]$, then ${p\ell }_{A|B}(p)$ is strictly positive for at most one bipartition A|B.

Before proving the main theorem, we recall some definitions that will be needed. Following Sumner et al. (2008), given a vector $f=(f_{x_1,x_2,x_3,x_4})\in \mathbb {R}^{256}$ and $4\times 4$ Markov matrices $X_i, i=1,\ldots ,4$, the Markov action $(X_1,\ldots ,X_4)*f$ is the new vector $g=(g_{y_1,\ldots ,y_4})$ defined by

$$\begin{aligned} g_{y_1,\ldots ,y_4} = \sum _{x_1,\ldots ,x_4} f_{x_1,\ldots ,x_4} X_1(y_1,x_1) X_2(y_2,x_2)X_3(y_3,x_3)X_4(y_4,x_4). \end{aligned}$$

In case $f=p$ is the distribution vector arising from some tree T with certain transition matrices $\{M_i\}_i$, the new vector g corresponds to the distribution obtained by multiplying the original matrices at the pendant edges with the new matrices: $M_i \rightarrow M_i X_i$, $i=1,\ldots ,4$.

According to Sumner et al. (2017), a quartet inference method that assigns a triplet of weights w(F) to each array $F\in \mathbb {R}^{256}$ satisfies the strong property II if $w(X*F)=w(F)+(\lambda (X),\lambda (X),\lambda (X))$ for some additive function $\lambda $ on Markov matrices $X=(X_1,\ldots ,X_4)$. Essentially, the strong property II guarantees that if the probability parameters of the pattern distribution are affected by a linear operator, the effect on the output is that of adding the same quantity to each topology weight. Moreover, this quantity is consistent with the composition of operators.

Given an array F, for each pair $a,b\in [4], a< b$, we define matrices $N_{ab}$ by taking the (i, j)-entry as the double marginalization of F on the components different from a and b, where rows (respectively columns) are labeled by the states at a (resp. b). If $a>b$, then define $N_{ab} = N_{ba}^t$. For example, $(N_{13})_{i,j} = F_{i+j+}$ and $(N_{43})_{i,j} = F_{++ji}$.

Theorem A.3

Let f be the vector of relative frequencies of site patterns observed in an alignment of length N which has been generated according to a multinomial distribution with measure p (that is, f is the vector of relative frequencies that arises as N independent samples from p). Then we have

a)
$$\begin{aligned} \texttt{PL} (f)=({p\ell }_{12|34}(f),{p\ell }_{13|24}(f),{p\ell }_{14|23}(f)) \end{aligned}$$
is a quartet inference method that satisfies the strong property II with $\lambda =0$. Moreover, if p arises from a Markov process on the tree $T=12|34$ of Fig. 8 with positive distribution $\pi $ at the root and invertible transition matrices, then the limit of the expectation of $\texttt{PL} (f)$ as N goes to infinity is
$$\begin{aligned} \lim _{N\rightarrow \infty } {\mathbb {E}}(\texttt{PL} (f))= \left( {p\ell }_{12|34}(p),{p\ell }_{13|24}(p),{p\ell }_{14|23}(p) \right) ; \end{aligned}$$
b)
if p arises from a Markov process as in a), then ${p\ell }_{12|34}(p)=2 d_{u,v}\ge 0$ and ${p\ell }_{C|D}(p)=-{p\ell }_{12|34}(p)$ for any other bipartition $C|D \ne 12|34$;
c)
the paralinear method (which associates to a frequency vector f the tree $T_{A|B}$ with largest ${p\ell }_{A|B}(f)$) is statistically consistent for the general Markov model;
d)
ASAQ is statistically consistent for the general Markov model.

Proof

(a)
The same proof of (Sumner et al. 2017, Theorem 3.1) applied to the paralinear distance gives that $\texttt {PL} (f)$ satisfies the strong property II with $\lambda =0$ in our case. Indeed, it is straightforward to see that ${p\ell }_{A|B}(f)$ is invariant by the Markov action: ${p\ell }_{A|B}((X_1,\dots ,X_4)*f)={p\ell }_{A|B}(f)$ for any $4\times 4$ Markov matrices $X_i$ and any bipartition A|B. This can be immediately seen by observing that for any array F
$$\begin{aligned} {p\ell }_{A|B}(F)=-\log \, \max \left\{ \left| \frac{\det (N_{ac}) \det (N_{bd})}{\det (N_{ab}) \det (N_{cd})} \right| , \left| \frac{\det (N_{ad}) \det (N_{bc})}{\det (N_{ab}) \det (N_{cd})} \right| \right\} . \end{aligned}$$
The claim about expectation follows by Taylor expansion of ${p\ell }_{A|B}(f)$ around the expectation of f (as ${\mathbb {E}}(f)=p$ componentwise).
(b)
The first claim is a consequence of the additivity of the paralinear distance, see Lemma A.1. Indeed, under the assumption that p arises from the tree $T=12|34$, every quantity $d_{i,j}$ ($i,j\in [4]$) can be written as the sum of the paralinear distances attached to the edges of T between i and j. For example, $d_{1,3}=d_{1,u}+d_{u,v}+d_{v,3}$. Then the value ${p\ell }_{A|B}(p)$ results in $2 d_{u,v}$, which is nonnegative in virtue of (a) of Lemma A.1. Now, if C|D is a bipartition other than 12|34, then from (1) of the main document we immediately get ${p\ell }_{C|D}(p)=-{p\ell }_{12|34}(p)$.
(c)
Denote $\delta = 2 d_{u,v}$. Then, following (a) we have that
$$\begin{aligned} \!\lim _{N \rightarrow \infty } \!\texttt {PL} (f)\!=\!\!\lim _{N \rightarrow \infty }\!\left( {p\ell }_{12|34}(f),{p\ell }_{13|24}(f),{p\ell }_{14|23}(f)\!\right) \!=\! \texttt {PL} (p)\!=\!(\delta , -\delta , -\delta ), \end{aligned}$$
where the last equality follows from (b). Therefore, for any $\epsilon >0$ there exist $N_1$, $N_2$, $N_3$ such that for any $N>N_0=\max \left\{ N_1,N_2,N_3\right\} $ we have
$$\begin{aligned} \left| {p\ell }_{12|34}(f)-\delta \right|<\epsilon , \quad \left| {p\ell }_{13|24}(f)+\delta \right|<\epsilon , \quad \left| {p\ell }_{14|23}(f)+\delta \right| <\epsilon . \end{aligned}$$
Then, for any $\epsilon <\delta $ we have
$$\begin{aligned} {p\ell }_{12|34}(f)> \delta -\epsilon> 0> -\delta +\epsilon >\max \left\{ {p\ell }_{13|24}(f), {p\ell }_{14|23}(f)\right\} . \end{aligned}$$
Then we can conclude that the method that chooses the tree T with largest ${p\ell }_{T}(f)$ chooses $T_{12|34}$ with probability tending to 1 when N tends to infinite.
(d)
The statistical consistency of ASAQ follows from the statistical consistency of Erik+2 (Fernández-Sánchez and Casanellas 2016), SAQ (Casanellas et al. 2021b) and the paralinear method obtained in c). Indeed, if f is obtained from N samples of a multinomial distribution with measure p that has arisen from $T=T_{12|34}$ with non-singular parameters (positive distribution at the root and invertible transition matrices), then the probability that both methods (when applied to f) select $T_{12|34}$ tends to 1 when N tends to infinite.

$\square $

Note that under the assumption that $M_5$ is diagonally largest in column (DLC for short, see Chang (1996)), we deduce that if ${p\ell }_{A|B}(p)=0$, then $M_5$ is necessarily the identity matrix.

Remark A.4

In order to avoid numerical problems, we only compute ${p\ell }_{A|B}(f)$ if the condition number of the double marginal matrices (joint distributions at two leaves) involved in its computation are less than a certain tolerance. This is implemented using a parameter “threshold” set to 5000, but it can be modified by the user and can be adapted to the alignment length. If ${p\ell }_{A|B}(f)$ cannot be computed, then this is a sign of short sample size and ASAQ outputs the topology and the weighting system given by SAQ.

B Tree figures

See Figs. 9, 10, 11 and 12.

C Performance of ASAQ, PL and Q-Methods

1.1 C.1 Performance of ASAQ and PL on the Treespace Under the GTR Model

Figure 13 represents the results of ASAQ and PL in recovering the correct quartet on data generated under the GTR model on the tree space described in Simulated data for quartet reconstruction in the Section on Materials and Methods of the main document.

1.2 C.2 Performance of ASAQ on Quartets with Random Branch Lengths Under the GTR Model

Figure 14 shows the ternary plots corresponding to the ASAQ method applied to GTR data on trees with branches of random length.

1.3 C.3 Performance of Q-Methods with Different Systems of Weights

Figures 15 and 16 represent the average Robinson–Foulds distance for GM data on CC and DD trees of Q-methods with ASAQ, Erik+2, PL and 4P weights. Similarly, Fig. 17 shows the average RF distance of Q-methods applied to GTR data on DD trees.

The results obtained by quartet puzzling, weight optimization and Willson methods applied to the input weights of the different methods on GTR data from the DD trees are shown in Fig. 17 (lengths 600 bp and 5 000 bp) and summarized in Table 6 (also including the results for 10 000 bp).

It is remarkable that in these simulations, it is enough to consider alignments of length 5 000 bp to obtain an almost perfect reconstruction of the original tree using the WO or Willson methods. It is also remarkable that for this particular topology the results described in Unmixed data in the Section Results, where the general Markov model was assumed, were not this good, obtaining distance values around 8 (see Fig. 16 ). In this case, assuming the GTR model these reconstruction methods manage to correctly infer the splits of the tree topology. Compared with these results, the performance of QP is poor, although it also improves the results obtained when applied to general Markov data. For short alignments (600 bp) PL and 4P weights obtained slightly better results overall than the other weights.

Note also the bad performance of WO when combined with ASAQ weights when applied to alignments of 600 bp and parameter $b=0.1$. For short alignments, ASAQ tends to choose the weights of SAQ, so this poor performance is probably caused by the bad results of SAQ when applied to GTR data (see Casanellas et al. (2021b) ).

The performance of every Q-method applied to ML weights for these data was quite poor and the resulting RF distance was never smaller than 8.95 (see Table 11).

1.4 C.4 Tables

Table 5 reports the average number of cases in which $\texttt {Erik+2} $ and $\texttt {PL} $ results differ, and of those the proportion of times that $\texttt {SAQ} $ estimates the right topology. The figures show that among the three methods tested here, SAQ is the best option in all cases, except for alignments of length 10 000 bp generated under the GTR model.

Tables 6, 7 and 8 show the figures of the performance of the Q-methods QP, WIL, WO, wQFM and Qfm when applied to different weighting systems, namely ASAQ, SAQ, Erik+2, PL and 4P. Data have been simulated according to a general Markov model on the trees CC, CD and DD. The values in each table show the average of the Robinson Foulds distance from the original tree to the MRCT of 100 replicates for each of the 100 generated alignments. Similarly, Table 9 shows the results when the weighting system comes from ML. In case the reconstruction method failed to provide a weight configuration for some of the $\left( {\begin{array}{c}12\\ 4\end{array}}\right) $ quartets, a resulting tree cannot be constructed and the corresponding alignment has been neglected. The number in parentheses represents the number of consensus trees that we have been able to reconstruct. Table 10 shows the RF distances from the original tree to the tree constructed by applying global NJ with paralinear distances and global ML consistent with the 12.12 model.

In a similar way, Table 11 shows the RF distances for data generated under the GTR model. Finally, Tables 12, 13 and 14 show the RF distances on mixture data for different proportions between the two categories $p=0.25$, $p=0.5$ and $p=0.75$, respectively (see Sect. 2.2.2).

1.5 C.5 Average Discordance Between $\texttt {Erik+2} $ and $\texttt {PL} $

Table 5 Average discordance between PL and Erik+2 for data generated on the tree 12|34 under the GM and GTR models, for different length alignments: 500, 1 000 and 10 000 bp (3rd column). The next three columns show for the cases when PL and Erik+2 do not agree the average success of each method (PL, Erik+2 or SAQ) to obtain the right topology

Full size table

1.5.1 C.5.1 Q-Methods on GM Data

Table 6 Average Robinson–Foulds distances for tree CC. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths

Full size table

Table 7 Average Robinson–Foulds distances for tree CD. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths

Full size table

Table 8 Average Robinson–Foulds distances for tree DD. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths

Full size table

Table 9 Average Robinson–Foulds distance for WO, QP and WIL on trees CC,CD,DD when ML weights are used. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct all the trees

Full size table

Table 10 Average Robinson–Foulds distances for trees CC,CD,DD when reconstructed by a global NJ with paralinear distance and a global ML. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths

Full size table

1.5.2 C.5.2 Q-Methods on GTR Data

Table 11 Average Robinson–Foulds distance for tree DD on GTR data. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees

Full size table

1.5.3 C.5.3 Q-Methods on Mixture Data

Table 12 Average Robinson–Foulds distance on mixture data for $p=0.25$. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees. Figure 6 of the main document represents the bar plots of these values

Full size table

Table 13 Average Robinson–Foulds distance on mixture data for $p=0.50$. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees. Figure 6 of the main document represents the bar plots of these values

Full size table

Table 14 Average Robinson–Foulds distance on mixture data for $p=0.75$. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees. Figure 6 of the main document represents the bar plots of these values

Full size table

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Casanellas, M., Fernández-Sánchez, J., Garrote-López, M. et al. Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages. Bull Math Biol 85, 68 (2023). https://doi.org/10.1007/s11538-023-01167-y

Download citation

Received: 02 August 2022
Accepted: 15 May 2023
Published: 13 June 2023
DOI: https://doi.org/10.1007/s11538-023-01167-y

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages

Abstract

Similar content being viewed by others

Utility of characters evolving at diverse rates of evolution to resolve quartet trees with unequal branch lengths: analytical predictions of long-branch effects

Developing a statistically powerful measure for quartet tree inference using phylogenetic identities and Markov invariants

The Performance of Two Supertree Schemes Compared Using Synthetic and Real Data Quartet Input

1 Materials and Methods

1.1 Description of the New Method ASAQ

1.2 Quartet-Based Methods (Q-Methods)

1.3 Description of the Simulated Data

1.3.1 Simulated Data for Quartet Reconstruction

1.3.2 Simulated Data for Larger Trees

1.4 Real Data

2 Results

2.1 Results on Quartets

2.1.1 Tree Space

2.1.2 Random Branch Lengths

2.1.3 Mixture Data

2.2 Results of Q-Methods

2.2.1 Unmixed Data

2.2.2 Mixture Data

2.3 Results on Real Data

2.3.1 Yeast Data

2.3.2 Ratites and Tinamous

2.4 Execution Time and Implementation

3 Discussion

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Additional information

Publisher's Note

Appendices

A Weighting Systems and Technical Results About ASAQ

1.1 A.1 Weighting Systems

1.2 A.2 The Paralinear Distance and Theoretical Foundations of ASAQ

Lemma A.1

Proof

Lemma A.2

Theorem A.3

Proof

Remark A.4

B Tree figures

C Performance of ASAQ, PL and Q-Methods

1.1 C.1 Performance of ASAQ and PL on the Treespace Under the GTR Model

1.2 C.2 Performance of ASAQ on Quartets with Random Branch Lengths Under the GTR Model

1.3 C.3 Performance of Q-Methods with Different Systems of Weights

1.4 C.4 Tables

1.5 C.5 Average Discordance Between \(\texttt {Erik+2} \) and \(\texttt {PL} \)

1.5.1 C.5.1 Q-Methods on GM Data

1.5.2 C.5.2 Q-Methods on GTR Data

1.5.3 C.5.3 Q-Methods on Mixture Data

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation