Molecular phylogenetic reconstruction faces several problems, still nowadays. Even if one restricts to gene tree reconstruction, one has to take into account the amount of available data (which might be low with respect to the number of taxa), and depending on the method applied, the selection of a suitable evolutionary model, the inherent difficulty of estimating parameters for the most complex models, and the incorporation of heterogeneity across sites and/or across lineages, among others (cf. Jermiin et al. 2020; Zou et al. 2019, and the references therein).

Most of the available methods strongly depend on the evolutionary model assumed (this is the case of maximum likelihood, Bayesian or distance-based approaches, to a more or less extent), and some have to estimate the substitution parameters for each possible tree topology. Recent work minimizes the relevance of the selection of the substitution model if the tree topology is correctly inferred (see Abadi et al. 2019) and in this case, a general Markov model (GM for short) could be used (see also Kaehler et al. 2015). Regarding topology reconstruction, some methods that avoid parameter estimation and allow complex substitution models such as GM are those based on the paralinear distance or on algebraic tools (phylogenetic invariants and related tools, see Allman and Rhodes 2007).

The paralinear distance (Lake 1994) is a measure that attempts to estimate the evolutionary distance (in terms of expected number of substitutions per site) when sequences evolve under a GM model. It may overestimate the expected number of substitutions when the process is far from stationary, or branch lengths are long (see Kaehler et al. 2015; Zou et al. 2012), so more elaborate methods are provided in the quoted papers. However, it is a widely used distance due to its simple formula and its generality, and it has recently been proven to be consistent for the multispecies coalescent model in the ultrametric case as well, see Allman et al. (2021, 2019).

Several algebraic methods for phylogenetic reconstruction have been proposed in the last years, see, for example, SVDQuartets (Chifman and Kubatko 2014), Erik+2 (Fernández-Sánchez and Casanellas 2016)—both already implemented in PAUP* (Swofford 2003), SplitscoresFootnote 1 (Allman et al. 2016), or SAQ (Casanellas et al. 2021b). The methods Erik+2, Splitscores, and SAQ are based on the GM model and, in particular, they are not subject to stationarity or time-reversibility and account for different rates of substitution at different lineages (the so-called heterogeneity across lineages, see Appendix A). The three of them consider algebraic conditions in the form of rank constraints of a flattening matrix obtained from the observed distribution of characters on a sequence alignment. Only SAQ considers also the stochastic description of the evolutionary model (see Allman et al. 2014), which translates into semi-algebraic conditions. On the other hand, Erik+2 also allows across-sites heterogeneity as it is able to deal with data from a mixture of distributions on the same tree. Despite the potential of algebraic-based methods for topology reconstruction (already pointed out in the book by Felsenstein (2004)), some of these methods may not work well for short alignments especially in presence of the long branch attraction phenomenon. For instance, Erik+2 is highly successful on different types of quartet data (also on the Felsenstein zone) but requires at least one thousand sites to outperform maximum likelihood or neighbor-joining (briefly NJ) (Fernández-Sánchez and Casanellas 2016). By taking into account the stochasticity of the substitution parameters, SAQ overcomes this problem at the expense of performing slightly worse than Erik+2 for very large amounts of data (10,000 sites or more) (Casanellas et al. 2021b).

Algebraic methods are mainly aimed at recovering quartet topologies (or splits in some cases) and some account for the possibility of being implemented into quartet-based methods. Quartet-based methods (Q-methods for short) have been questioned in the literature. For instance, Ranwez and Gascuel (2001) evaluated two quartet-based methods, their weight optimization method (briefly WO) and the quartet puzzling method (QP) by Strimmer and von Haeseler (1996), and they weighted quartets using a maximum likelihood approach for the Kimura two-parameter model. Their main conclusion was that both QP and WO give worse results than neighbor-joining or than a maximum likelihood approach applied directly to the whole set of taxa. More recent methods such as QMC from Snir and Rao (2010), QFM Reaz et al. (2014), or wQFM from Mahbub et al. (2021) seem to be competitive with maximum likelihood and are scalable to large data sets. As pointed out by Ranwez and Gascuel (2001), the weaknesses of Q-methods are very likely due to the method of weighting the quartets rather than to the method of combining them.

As far as we are aware, the only works that evaluate the use of algebraic methods as input of quartet-based methods are Rusinko and Hipp (2012) (which is restricted to QP with the Jukes-Cantor model), Holland et al. (2012) (where squangles are applied to infer the quartets) and Chifman and Kubatko (2014) (for the coalescent model). On the other hand, the correct management of long-branch attraction is crucial for obtaining a successful quartet-based method (John et al. 2003). These claims and remarks motivate part of the present work. We expect that, as SAQ, Erik+2 and the paralinear method handle successfully the long-branch attraction problem, Q-methods with these weighting systems improve their performance. Moreover, Zou et al. (2019) proved that if a machine learning approach is applied to weighting the quartets, QP can have a similar performance than NJ especially under substitution processes that are heterogeneous across lineages. Precisely, handling heterogeneity of substitution rates across lineages is one of the key features of algebraic methods based on the GM model of nucleotide substitution.

The goal of this work is to test the performance of several quartet-based methods for phylogenetic tree topology reconstruction when applied with input weights from different methods consistent with the general Markov model. We test QFM, wQFM, QP, WO and the method WIL (proposed in Willson 1999) with different systems of input weights: two systems, PL and 4P (see Appendix A), derived from paralinear distances and the four-point condition (Lake 1994; Gascuel 1994; Mihaescu et al. 2009), and three systems based on algebraic and semi-algebraic methods, namely SAQ, Erik+2 and the new proposed method ASAQ (see below) that combines the paralinear distance and algebraic methods. We also provide a new implementation of QP, WO, WIL and the code for ASAQ. We test exhaustively all these combinations on simulated data evolving either under the GM model or the homogeneous general (continuous) time-reversible model (GTR) on twelve taxon trees, and also provide a comparison when input weights from a maximum-likelihood approach are used. We also compare the performance to a traditional NJ (from now on, global NJ) and a ML estimation of the tree (global ML) and test some of the methods on real data: the eight species of yeast studied in Rokas et al. (2003) and the Ratite mitochondrial DNA data studied in Phillips et al. (2009).

The new method ASAQ (standing for algebraic and semi-algebraic quartet reconstruction) is a topology reconstruction method for four taxa which combines Erik+2 and SAQ by means of the paralinear method. This is a quartet reconstruction method based on a statistic (see Eq. (1) in the Materials and Methods section) that assesses the positivity of the estimated length (using the paralinear distance) of the interior branch of the quartet. When data are unmistakably generated by a quartet, the topology output by any reconstruction method should be consistent with the positivity of this statistic. ASAQ uses the paralinear method to either ratify the results of Erik+2, or to rely on SAQ when there is an inconsistency between the outputs of Erik+2 and the paralinear method. By proceeding like this, ASAQ ensures an overall better performance on quartets than Erik+2, SAQ (and the paralinear method itself in most occasions).

As ASAQ is statistically consistent with the GM model, it can consider data evolving on a tree where base composition and instantaneous rate substitution patterns can differ among lineages (heterogeneity across lineages), and does not assume stationarity nor time-reversibility, see Appendix A. We test the performance of ASAQ on a wide scenario of simulated data: on the tree space proposed by Huelsenbeck (1995), on quartets with random branch lengths and on mixture data on the same topology with two categories. All simulations used data either generated from the GM model or from a GTR model, with different sequence lengths.

One can use ASAQ with mixtures of distributions on the same tree topology with two or three categories (or partitions) as this was already implemented by Erik+2. Although ASAQ is based on SAQ and the paralinear method, which are not guaranteed to be consistent on mixtures, it is also highly successful on this kind of data, see the Results section.

1 Materials and Methods

In this section, we describe the new method ASAQ for quartet reconstruction, the quartet-based methods applied and the simulation studies performed.

1.1 Description of the New Method ASAQ

ASAQ is a quartet reconstruction method based on a pair of previous methods by the authors (Fernández-Sánchez and Casanellas 2016; Casanellas et al. 2021b). For a phylogenetic tree T with an interior node r as root, we consider a Markov process on the alphabet \(\{\texttt {A},\texttt {C},\texttt {G},\texttt {T}\}\) on T by assigning transition matrices \(M_e\) at the edges of T and a distribution \(\pi \) at r. As no restrictions are imposed on the transition matrices or \(\pi \), this is usually called a general Markov model (GM) on T. One can compute the theoretical joint distribution p of patterns at the leaves of T in terms of the entries of \(M_e\) and \(\pi \) and we say that p has arisen on T with certain substitution parameters.

We consider fully resolved (unrooted) trees on a set of four taxa \([4]=\{1,2,3,4\}\): the three quartet trees shall be denoted as 12|34, 13|24, 14|23, according to the bipartition induced by the interior edge. Fernández-Sánchez and Casanellas (2016) introduced Erik+2, a reconstruction method essentially based on the rank of flattening matrices obtained from a distribution of nucleotides at the leaves of a tree. The method allows the possibility of dealing with data from mixtures of distributions with up to 3 categories (this upper bound is not computational, but rather a consequence of the theoretical basis of the method: distributions from mixtures with c categories satisfy that the rank of a certain \(16 \times 16\)-matrix is upper bounded by 4c, so in order to discriminate between such distributions and generic distributions we need \(4c < 16\)). In the more recent method SAQ, Casanellas et al. (2021b) create a more sophisticated method combining the rank of the flattening matrices with the stochastic information available in the data via a result of Allman et al. (2014). Both methods Erik+2 and SAQ, as well as their associated weighting system, are briefly described in Appendix A.1. They have been widely studied and the reader is referred to the cited publications for their performance on different scenarios.

As already explained in the introduction, while SAQ usually outperforms Erik+2 for short DNA sequence alignments (length \(\le \) 1000), Erik+2 obtains better results as the length of the DNA alignment increases. This consideration leads us to introduce ASAQ, a new combined method of Erik+2 and SAQ that tries to apply one or the other according to whether the input pattern distribution is consistent with the positivity of the estimated length of the interior branch. To this end we use the paralinear distance and the paralinear method (see Appendix A.1, Lemma A.1 and Lake (1994)).

For a distribution of patterns at the set of leaves, \(p\in {\mathbb {R}}^{256}\), we compute all paralinear distances \(d_{x,y}\) between pairs \(x,y \in [4]\). Then, given a bipartition A|B of the set [4], \(A=\{i,j\}\), \(B=\{k,l\}\), we define the following quantity

$$\begin{aligned} {p\ell }_{A|B}(p)=\min \{d_{i,k}+d_{j,l} \,, \, d_{i,l}+d_{j,k}\}-d_{i,j}-d_{k,l}. \end{aligned}$$
(1)

The quantity above is the “neighborliness” measure used in Gascuel (1994), the “paralinear method” used in Lake (1994), and was presented by Buneman (1971) as a measure of twice the length at the interior edge (when d is a tree metric). It is worth pointing out that at most one of the three values \({p\ell }_{12|34}(p)\), \({p\ell }_{13|24}(p)\) and \({p\ell }_{14|23}(p)\) is strictly positive if all paralinear distances above are non-negative, see Lemma A.2. We denote by \(\texttt {PL} (p)\) the collection of these values:

$$\begin{aligned} \texttt {PL} (p)=({p\ell }_{12|34}(p), {p\ell }_{13|24}(p), {p\ell }_{14|23}(p)). \end{aligned}$$

If p has arisen on a quartet A|B and the entries of the Markov matrix at the interior edge were strictly positive, both quantities inside the minimum in Eq. (1) are equal (as can be deduced from the 4-point condition). Moreover, \({p\ell }_{A|B}(p)\) is the unique positive quantity in the triplet \(\texttt {PL} (p)\) and its value coincides with twice the paralinear distance of the interior edge (see Theorem A.3). The paralinear method is the quartet reconstruction method that outputs the tree T with highest \({p\ell }_{T}\) value. It is a statistically consistent quartet-inference method (Theorem A.3).

Therefore, we propose the following quartet reconstruction method ASAQ: it checks whether both methods Erik+2 and the paralinear method PL output the same quartet, and

  1. (1)

    if Erik+2 and PL agree, then ASAQ outputs the topology and weights of Erik+2;

  2. (2)

    if they do not agree or some paralinear distances \(d_{x,y}\) are negative, then ASAQ outputs the topology and weights of SAQ.

An inconsistency between Erik+2 and the paralinear method implies that data are far from having arisen on a tree with stochastic parameters (note the role of the positivity of the entries of the Markov matrix in Lemmas A.1 and A.2). The positivity of transition matrices implies semi-algebraic conditions on the joint distributions at the leaves (Allman et al. 2014) and in this case we rely on SAQ, as it is the unique method that takes into account both these semi-algebraic conditions and the algebraic constraints considered by Erik+2. In Theorem A.3 we prove that ASAQ is as well a statistically consistent quartet reconstruction method for the general Markov model and in Table 5 we show the percentage of discordance between PL and Erik+2, and the success of SAQ in these cases.

We cannot claim that ASAQ is statistically consistent for mixtures because consistency is not known to hold for the paralinear method or SAQ in this scenario. However, the simulation studies in Casanellas et al. (2021a) show a good performance of SAQ in mixture data from the same tree and this will lead to a good performance of ASAQ as well (see the Results section). We denote by ASAQ (\(m=k\)) the use of ASAQ with Erik+2 estimating mixtures on the same tree with k categories. The limit on the number of categories \(m=3\) for quartets comes from the theoretical foundations of Erik+2, as a larger amount of categories would make unfeasible the identifiability of the tree topology by this method.

As a topology reconstruction method, the paralinear method is highly successful (see Results section, Fig. 2). Nevertheless, we found that using the paralinear method in order to ratify or not the results of Erik+2 gives a better performance on topology reconstruction.

1.2 Quartet-Based Methods (Q-Methods)

We have implemented different quartet-based methods (Q-methods) with different input weights. All of them seek for a tree that maximizes weighted quartet consistency. Quartet puzzling (QP), weight optimization (WO) and Willson’s (WIL) methods have been programmed in C!++! and wQFM has been used with the implementation provided in Mahbub et al. (2021). QP amalgamates quartets in a randomized order and seeks to maximize the total sum of weights; we provide a new implementation of QP different from the one in Schmidt et al. (2002) since we wish to apply the method with systems of weights not based on likelihoods. Weight optimization uses quartet weights to dynamically define the taxon addition order, seeking to maximize the total weight at each step. WO is known to reconstruct the correct tree if the input quartets are correctly weighted (i.e. if all quartets of the original tree have the highest weight among the tree possible weights of the corresponding 4-tuple), see Ranwez and Gascuel (2001). Instead of constructing a tree that maximizes the total weight at each step, the essential idea of Willson’s method is attaching new taxa in such a way that the new tree at each step is highly consistent with the input quartets. QP, WO and WIL are initialized at a random 4-tuple. Since the output of these Q-methods strongly depends on the choice of the initial quartet, each one of them has been applied 100 times to each alignment and then the majority rule consensus tree (briefly MRCT) of these 100 replicates has been computed.

On the contrary, wQFM amalgamates quartets following a divide and conquer approach and implicitly gives the same importance to all quartets (and does not depend on an initial quartet choice). Although wQFM was specially designed to build species trees from gene trees, it can also consider other input weights. We use wQFM with quartets weighted by different methods, but also with unweighted quartets. In this last case we refer to the method as QFM Reaz et al. (2014) and the weights are transformed to 1 for the quartet output by the method and to 0 for the other two quartets.

In order to evaluate the difference between two trees, we use the Robinson–Foulds distance (RF for short), (Robinson and Foulds 1981). For the computation of the majority rule consensus tree and the RF distance, the available functions in the Python Library DendroPy have been used, see Sukumaran and Holder (2010).

Input Weights

We require the weights of the quartet reconstruction methods to be positive and normalized. Details about the input weights obtained from ASAQ are provided in the previous section. Further details about the weighting system for all the considered methods are moved to Appendix A.1. For the paralinear method the weights are denoted as PL and are obtained after normalizing the exponentials of the scores given by Eq. (1); 4P is a slight modification of this method, see Appendix A.1. We consider already published weighting systems for SAQ, Erik+2 and maximum likelihood (ML): see Fernández-Sánchez and Casanellas (2016) for Erik+2, Casanellas et al. (2021b) for SAQ, and the posterior probabilities used in Strimmer et al. (1997) for ML.

Maximum likelihood weights (which will be denoted as ML) have been computed assuming the most general continuous-time homogeneous model (same instantaneous rate matrix throughout the tree but no constraints on the entries of the rate matrix or assumption of stationarity); this is the Lie Markov model listed as 12.12 in Fernández-Sánchez et al. (2015) and was denoted as ML(homGMc) in Fernández-Sánchez and Casanellas (2016). To this aim we used the baseml program from the PAML library (Yang 1997) with the UNREST model and let it infer the instantaneous rate matrix (common to all lineages) and the distribution at the root. Note that in order to use weights obtained from ML, we need to obtain the estimates of the maximum likelihood for the three possible quartets. This is unfeasible when the method does not converge (which often happens for the quartets which did not generate the data).

1.3 Description of the Simulated Data

We consider two different scenarios of simulated data: one for testing ASAQ as a quartet reconstruction method (described in Sect. 1.3.1) and another for testing different Q-methods with several weighting systems (see Sect. 1.3.2). For the first we use the simulated data introduced by Fernández-Sánchez and Casanellas (2016), whereas for the second we follow the approach of Ranwez and Gascuel (2001) and consider 12-leaf trees.

Evolutionary models For the trees described below, we generate data evolving either on a general Markov model (GM, see Sect. 2.1) or on a homogeneous general time-reversible model (GTR). By a homogeneous GTR model we mean a continuous-time GTR model that shares the same instantaneous mutation rate matrix Q across all branches of the tree. GTR data have been generated using Seq-gen (Rambaut and Grass 1997), while GM data have been generated using the software GenNon-h (Kedzierska and Casanellas 2012) available at https://github.com/Algebraicphylogenetics/GenNon-H. For GTR data we specified in each setting a stationary distribution and a rate matrix common to all generated trees (see below for details on the chosen rates in each case), whereas for GM data the transition matrices are randomly generated for each alignment (according to the specified branch lengths).

1.3.1 Simulated Data for Quartet Reconstruction

Tree space The first data set we use to test ASAQ corresponds to the tree space suggested by Huelsenbeck (1995). We consider quartets as in Fig. 9a in Appendix B, with branch lengths given by a pair of parameters a and b which vary between 0 and 1.5 in steps of 0.02. Branch lengths are always measured as the expected number of elapsed substitutions per site. The resulting tree space is shown in Fig. 9b in Appendix B. The upper left region of this tree space corresponds to the “Felsenstein zone”, which contains trees subject to the long branch attraction phenomenon. For each of the two nucleotide substitution models considered in the paper, GM and GTR, and for each pair (ab) of branch lengths, we have simulated one hundred alignments. The rates for GTR data were chosen as in Fernández-Sánchez and Casanellas (2016), subsection Description of the Data, p.284. The considered alignment lengths are of 500, 1 000 and 10 000 sites.

Random branch lengths Following Casanellas et al. (2021b), we test ASAQ on 10 000 alignments generated from quartets whose branch lengths are randomly generated according to a uniform distribution in the intervals (0, 1) or (0, 3). These alignments are obtained according to both substitution models, GM and GTR, and are either 1 000 or 10 000 sites long. We represent the weights output by ASAQ in a ternary plot (also called a simplex plot) as in Strimmer and von Haeseler (1997).

Mixture data The performance of ASAQ is also tested in the scenario of data sampled from a mixture of distributions. According to the approach of Kolaczkowski and Thornton (2004), we consider the mixture of distributions as follows. We partition the alignment into two categories of the same sample size both evolving under the GM model on the same quartet topology as Fig. 9a but the first category corresponds to branch lengths \(a = 0.05\), \(b = 0.75\), while the second corresponds to \(a = 0.75\) and \(b = 0.05\) (see Fig. 10 in the Appendix). The internal branch length takes the same value in both categories and varies from 0.01 to 0.4 in steps of 0.05. The total length of the alignments considered is 1 000 or 10 000 sites.

Fig. 1
figure 1

The three different tree topologies CC, CD and DD on 12 taxa considered to test the Q-methods with different weighting systems. They are obtained by glueing a combination of two trees (C and D) by the root. Here, the parameter b represents the length of internal branches. These tree topologies have been taken from (Ranwez and Gascuel 2001)

1.3.2 Simulated Data for Larger Trees

We followed Ranwez and Gascuel (2001) to test the performance of different quartet-based methods (QP, WO and WIL) with different weighting systems. To this end, we considered the three 12-taxon topologies depicted in Fig. 1 denoted as CC, CD and DD, and fixed the ratio among branch lengths as in Ranwez and Gascuel (2001), depending on a parameter b denoting the internal branch length, which is varied in the set \(\{0.005,0.015,0.05,0.1,0.25,0.5\}\).

For each tree topology and for each b, we have considered 100 alignments with lengths 600 (in order to match the alignment length considered in Ranwez and Gascuel (2001)), 5 000, and 10 000 generated under the GM model. For DD topology, we have also generated data under the GTR model with \(b=0.005,\ldots , 0.1\), and instantaneous rates 2 (A\(\rightarrow \)C), 5 (A\(\rightarrow \)G), 3 (A\(\rightarrow \)T), 4 (C\(\rightarrow \)G), 1 (C\(\rightarrow \)T), 2 (G\(\rightarrow \)T) and equal base frequency (we chose DD because it is arguably the hardest to reconstruct, see the Results section).

Mixture data We have considered a 2-category mixture model, that is, we generated alignments evolving on a the topology CD but whose sites evolve following two systems of substitution parameters: the first system corresponds to the branch lengths described in the tree CD, while the second system corresponds to CD after exchanging the branch lengths of seq3 and seq4, and the lengths of seq7 and seq8 (see Fig. 11 for details). The parameters that were varied in this framework were the proportion p of sites in the first category (which was varied in 0.25, 0.50,  and 0.75) and the internal edge length b which was varied as above in 0.005, 0.015, 0.05 and 0.1. The lengths of the alignments considered were 600, 1 000 and 10 000 bp.

1.4 Real Data

Yeast data We analyze the performance of ASAQ on real data on the eight species of yeast studied in Rokas et al. (2003) with the alignment consisted of the concatenation of 42 337 s codon positions of 106 genes as provided by Jayaswal et al. (2014). We investigate whether the quartets output by ASAQ support the tree T of Rokas et al. (2003), the alternative tree \(T'\) of Phillips et al. (2004) (see Fig. 12), or the mixture model proposed by Jayaswal et al. (2014). Although the tree T is widely accepted by the community of biologists, its correct inference is known to depend on the correct management of heterogeneity across lineages, as an inaccurate underlying model usually reconstructs \(T'\) (Rokas et al. 2003; Phillips et al. 2004; Jayaswal et al. 2014). According to Jayaswal et al. (2014), these data are best modeled by considering, apart from heterogeneity across lineages, two different rate categories (discrete \(\Gamma \) distribution) plus invariable sites. In our setting, this is translated into a mixture distribution with 3 categories (\(m=3\) in ASAQ).

Ratites and tinamous mitochondrial data The phylogeny of ratites and tinamous has been debated for some time (see Phillips et al. 2009, and the references therein) and is still controversial (Benito et al. 2022). There is evidence of a higher rate of evolution among the tinamous relative to the ratites (see, e.g., Paton et al. 2002), so these data are appropriate for analysis by the methods proposed here. Moreover, the recoding used in Phillips et al. (2009) to sort out this problem is questioned in Vera-Ruiz et al. (2021). We do our analyses using a 3506 sites alignment consisting of the third codon position of the mitochondrial DNA coding alignment for 24 DNA sequences provided in this last paper. We run WO and QP with weights obtained by ASAQ, SAQ, Erik+2, 4P, PL. For each combination of methods we used all quartets but also a random selection of a subset of input quartets of size either one hundred, 2125 (approximately 20% of all possible quartets), or 5313 (equal to 50%), and then we performed the MRCTs. This was aimed at testing the sensitivity of the method to the amount of input data (comparing the performance when only a portion of the quartets is used) in terms of scalability of the methods.

We analyze the results obtained in comparison to the following trees: (A) the tree that groups Tinamous and Moas as proposed by Phillips et al. (2009) and displayed in Vera-Ruiz et al. (2021), Fig. 4B the consensus tree of Fig. 1a of Phillips et al. (2009) and (C) the Ratite paraphyly tree of Fig. 1b of Phillips et al. (2009). If we call CEK to the largely established clade CEK=((cassowary, emu), kiwis), then these trees can be summarized as:

  1. A:

    (outgroup, neognathus, (ostrich, (rheas, ((moas, tinamous), CEK))));

  2. B:

    (outgroup, neognathus, (tinamous, (moas, rheas, (ostrich, CEK))));

  3. C:

    (outgroup, neognathus, ((tinamous, moas), (rheas, (ostrich, CEK))))  ,

where the outgroup comprises sequences of alligator and caiman (see Fig. 7 for a description of the species involved). As unrooted trees, A and C have 21 interior edges and B has 20.

2 Results

In this section, we describe the results obtained and benchmark them with published results for the sake of completeness. The interested reader is referred to the corresponding papers for details of the methods therein.

2.1 Results on Quartets

2.1.1 Tree Space

The performance of ASAQ and PL on data generated on the tree space of the previous section (Materials and Methods) is represented in Fig. 2 (for GM data) and in Fig. 13 in the Appendix (for GTR data). In black we represent 100% success, in white 0% success, and gray tones correspond to regions of intermediate success accordingly. The 95 % and 33 % isoclines are represented with a white and black line, respectively. These simulation studies show a consistent performance according to the results by Huelsenbeck (1995) and Fernández-Sánchez and Casanellas (2016), with the usual decreasing performance at the Felsenstein zone and an improvement of both methods with sample size. Figures 2 and 13 show that the performance of ASAQ is better than that of PL, for both GM and GTR data.

For completeness, the average performance of ASAQ and PL on this tree space for different alignment lengths and underlying models is compared to other methods in Table 1: we include the average results of SAQ (Casanellas et al. 2021b, as shown in) and Erik+2 and ML as published in Fernández-Sánchez and Casanellas (2016). This comparison shows that for GM data the best results are achieved by ASAQ, while for GTR data ML obtains the best performance for alignments of length 500 and 1 000 bp and Erik+2 does so for long alignments (10 000 bp).

Table 1 Average success of several methods applied to data simulated on the tree space of Fig. 9b). ASAQ and PL are compared to the results for SAQ obtained in (Casanellas et al. 2021b), and for Erik+2 and maximum likelihood ML in (Fernández-Sánchez and Casanellas 2016). MLestimates the most general continuous-time homogeneous process 12.12 when data are generated under a GM model, while ML estimates a homogeneous GTR model when data are generated under GTR, see (Fernández-Sánchez and Casanellas 2016, Table 1). In each row of the table, the highest success is indicated in bold font
Fig. 2
figure 2

Performance of ASAQ (left) and PL (right) on the tree space of Fig. 9b on alignments of length 500 bp (top), 1 000 bp (middle) and 10 000 bp (bottom) generated under the GM model. Black is used to represent 100% of successful quartet reconstruction, white to represent 0%, and different tones of gray the intermediate frequencies. The 95% contour line is drawn in white, whereas the 33% contour line is drawn in black

2.1.2 Random Branch Lengths

Fig. 3
figure 3

Ternary plots corresponding to the weights of ASAQ applied to 10000 alignments generated under the GM model on the 12|34 tree. On each triangle the bottom-left vertex represents the underlying tree 12|34, the bottom-right vertex is the tree 13|24 and the top vertex is 14|23. The small gray triangle depicted represents the average point of all the dots in the figure. Top: correspond to trees with random branch lengths uniformly distributed between 0 and 1; bottom: random branch lengths uniformly distributed between 0 and 3. Left: 1 000 bp; Right: 10 000 bp. The analogous ternary plots for SAQ can be found in Garrote-López (2021)

To visualize the overall distribution of the weights of ASAQ applied to trees with random branch lengths, in Fig. 3 we show ternary plots corresponding to the GM alignments described in Sect. 1.3.1. The ternary plots of the performance of ASAQ when applied to the same setting with data generated under the GTR model are shown in Fig. 14. In Table 2 we display the summary of average success of ASAQ on these data.

Note that Table 2 shows a high performance of the method and the ternary plots show a clear distribution of points towards the bottom left corner (which represents the correct quartet) and weights symmetrically distributed on the other corners. In particular, the method is not biased towards any of the incorrect topologies. We note that the level of success exhibited is quite sensitive to the branch length, being much higher for branch lengths in (0,1) than in (0,3). We do not appreciate a remarkable difference between the performance of ASAQ when applied to GTR data.

Table 2 Average success of ASAQ on alignments of lengths 1 000 and 10 000 bp generated on the tree 12|34 under the GM and GTR models with random branch lengths uniformly distributed in (0,1) (first row) and (0,3) (second row). The plots corresponding to these data are shown in Figs. 3 and 14

2.1.3 Mixture Data

In Fig. 4 we show the performance of the method ASAQ with \(m=2\) categories when applied to data from mixtures as described in Sect. 1.3.1 and in Fig. 10. Based on the results of Fig. 5 in Fernández-Sánchez and Casanellas (2016), we also provide a comparison to the success of Erik+2 \((m=2)\) and two versions of maximum likelihood on the same data. We did not include the performance of SAQ, as it is very similar to that of ASAQ.

Figure 4 shows an increasing accuracy of all methods when the value of the parameter r (the branch length of the interior edge of the two trees involved) is increased. This was expected as larger values of r represent larger divergence between sequences at the left and the right of the trees. We note that ASAQ (\(m=2\)) outperforms the other methods, with a high level of success in all cases, even when the length of the alignments is 1 000 bp. The average success of ASAQ \((m=2)\) applied to the simulated data is \(96.64\%\) for 1 000 bp, \(99\%\) for 10 000 bp and \(99.92\%\) for 100 000 bp (results not shown).

Fig. 4
figure 4

These plots represent the percentage of correctly reconstructed trees by several methods applied to the mixture data described in the Method section; the value r refers to the branch length of the interior edge of the trees (Fig. 10). We compare the results of ASAQ (\(m=2\)) and the results presented in (Fernández-Sánchez and Casanellas 2016, Figure 5) for Erik+2 (\(m=2\)), ML  (as usual on the 12.12 model), and a ML estimating a heterogeneous across lineages GTR with two categories of discrete \(\gamma \) rates across sites denoted as ML(GTR+2\(\Gamma \)) (only for length 1000 bp)

2.2 Results of Q-Methods

We proceed to describe the performance of wQFM, QFM, WO, WIL and QP (see the subsection on Quartet-based methods ) applied to the input weights from ASAQ, SAQ, Erik+2, PL, 4P and ML on the trees CC, CD and DD presented in Sect. 1.3.2. The weights from PL and 4P are both defined in terms of the paralinear distance (see Appendix A.1), and they produce similar results. Because of this, we omit the performance of 4P in some cases. We also add the results obtained from a global NJ applied to the same data (using the NJ algorithm implemented in the R package APE of Paradis et al. (2004) with paralinear distance) and also global ML using IQ-TREE 2 (Minh et al. 2020) with the most continuous-time homogeneous general model 12.12.

2.2.1 Unmixed Data

The results on simulated GM data obtained by the combination of methods and weights presented in this paper are summarized in Fig. 5 for CD, in Fig. 15 for CC and in Fig. 15 for DD. The results for GTR data (for the tree DD) are shown in Fig. 17. The height of the bars shows the average of the RF distance from the original tree to the consensus tree of 100 replicates for each of the 100 generated alignments. The values of these results are detailed in Tables 6 (for CC), C.3 (for CD) and C.4 (for DD) for GM data and in C.7 for GTR data on DD. For comparison, the results of a global NJ and global ML (which has a similar performance as a global NJ) applied to these data is displayed in Table 10 (see also the results of the global NJ in Figs. 5, 15 and 16). In Table 9, we show the results obtained when applying ML weights. Since ML did not converge for some quartets (especially in the presence of long branches and when trying to maximize the likelihood for GM data generated on another quartet), we write between parentheses the number of alignments considered for ML (in the computation of the average RF distance we neglected the alignments where ML did not converge).

By comparing the performance of the four Q-methods, we observe a slightly better performance of wQFM, QFM WIL and WO compared to QP. The accuracy drops for long branches in all methods, but wQFM and WO seem to perform slightly better for short and medium branches (\(b\le 0.1\)) while WIL does better for longer branches \(b\ge 0.25\). Among all Q-methods, QP is the most sensitive to the choice of the system of weights.

Another general remark is that all Q-methods with their best system of weights outperform a global NJ and a global ML for \(b>0.1\) on GM data, while none of them (with any system of weights) beats a global NJ or ML for b smaller than or equal to 0.1. Figure 17 shows that a global NJ is the best option for GTR data on DD trees.

All reconstruction algorithms have had considerably more success when reconstructing the tree CC in contrast to CD or DD for GM data, in concordance with the results obtained by Ranwez and Gascuel (2001). In the tree topologies CD and DD, the distance between seq9 and seq10 is 4b, while the distance between seq8 and seq11 is 20b. The same happens between species seq1 and seq2, and seq0 and seq3 of the DD topology. Thus, for an alignment generated from these trees, there is a high probability that two separate lineages evolve in a convergent manner to the same nucleotide at the same site, creating a long branch attraction situation and making the reconstruction methods to infer the wrong topology. wQFM, QFM, WO and WIL are especially successful when dealing with CC and CD trees (Fig. 15 and Fig. 5) and short branches, probably because these Q-methods succeed in reconstructing the C subtree.

Weights from ASAQ, SAQ, Erik+2, PL and 4P produce overall comparable results, although ASAQ and SAQ do better when dealing with long branches. wQFM with seems to work slightly better with quartets weighted by SAQ, PL and 4P over Erik+2 or ASAQ (although it does not perform as well as WO when dealing with long branches). On the contrary, in the unweighted version, ASAQ seems to provide the best input quartets for QFM (note that PL and 4P provide the same input for QFM, so they are displayed together in the tables). We also note an improvement in the results when the length of the alignment increases, especially for the CC case or for GTR data (Fig. 17) and hardly noticeable for the other two trees on GM data.

The simulation study shows a big difference between the general performance of the Q-methods with input weights from ASAQ, Erik+2 and PL in contrast to weights obtained by ML, especially when reconstructing the CC and CD trees (compare Tables 6, 7, 8 and 9). This could probably be due to the inconsistency between the ML weights computed and the general Markov model used to generate the data (see subsections Quartet-based methods and Description of the simulated data). Even when the model used for ML estimation matches the model that generated the data (as for GTR, see Table 11), ML seems to be the worse weighting system. When data are generated under the GTR model on DD trees (Fig. 17 and Table 11), all Q-methods (except for ML weights) have a remarkable high performance when alignments are long enough (\(\ge \) 5 000 bp), especially WO. For short alignments, PL seems a good choice for the input weights with these data. All in all, we observe that both ASAQ and PL weights (combined with one or another Q-method) give rise to good results (much better than ML weights) comparable with the results obtained by a global NJ or ML, and even better when dealing with long branches.

Fig. 5
figure 5

Average Robinson–Foulds distance for GM data simulated on the tree CD with alignment length 600 bp (left), 5 000 bp (center) and 10 000 bp (right). The Q-methods WO (first row), WIL (second row), QP (third row), and QFM (last row) are applied with different systems of weights, namely ASAQ, SAQ, Erik+2, PL and 4P. The white diamonds represent the average RF distance of the tree reconstructed using a global NJ with paralinear distances. Specific values of these results are detailed in Tables 6 and 10, and results with ML weights are included in Table 9

2.2.2 Mixture Data

The results on mixture data obtained by a global NJ and by QP, WO, and WIL with input weights from PL, ASAQ and Erik+2 adapted to 2-category data (as described in Sect. 1.3.2) are summarized in Fig. 6 for WO and in Tables 12, 13 and 14 for all methods. Tables 12, 13 and 14 have a similar structure as Tables 6, 7 and 8 and correspond to the results obtained for different proportions between the two categories; \(p=0.25\), \(p=0.5\) and \(p=0.75\), respectively. (We recall that the proportion p of sites of the first category of the alignment were generated assuming the branch lengths of the CD tree in Fig. 1.) It is worth pointing out that the reconstruction of mixture data from CD trees is more accurate on average than when applied to unmixed data (Fig. 5). This is probably due to the fact that when the branch lengths of species seq3, seq4 and seq7, seq8 are exchanged (Fig. 1), reconstruction methods have an easier job to make the appropriate splits, since the long branch attraction situation that was provoked by the quartet of species \(\{seq8,seq9,seq10,seq11\}\) does no longer exist with the new branch lengths. This is consistent with the observation that the reconstruction results are more accurate for low values of b. For short alignments (600 bp), we note that ASAQ and PL weights provide better results than Erik+2. For 5 000 bp ASAQ and Erik+2 outperform PL. Moreover, if \(b=0.1\), the combination WO+ASAQ or WO+Erik+2 beats a global NJ. For 10 000 bp, the difference between performance is even larger (see Tables Tables 12, 13 and 14). As expected, the length of the alignment improves the performance of these methods, reducing the impact of the long branch attraction effect for high values of b.

Fig. 6
figure 6

Average Robinson–Foulds distance on mixture data simulated on the tree CD for \(p=0.25\) (left), \(p=0.5\) (middle) and \(p=0.75\) (right) with alignment length 600 bp (above) and 5 000 bp (below). We omit the results for 10 000 bp as they are very similar to those obtained for 5 000 bp. The Q-method WO is applied with different systems of weights, namely ASAQ with 2 categories, Erik+2 with 2 categories, and PL. The white diamond represents the result of a global NJ with the paralinear distance. Results obtained by WO, QP, WIL for the different proportions between the two categories, \(p=0.25\), \(p=0.5\) and \(p=0.75\), can be found in Tables 12, 13 and 14, respectively. These tables also include results with ML weights

2.3 Results on Real Data

2.3.1 Yeast Data

As shown in Table 3, the claim by Jayaswal et al. (2014) is corroborated by our analysis: by performing a MRCT on 100 random initial quartets, WO+ASAQ reconstructs T when 3 categories are considered, but it reconstructs \(T'\) otherwise. Similarly, the RF distance of the MRCT obtained by WIL+ASAQ to the tree T is smaller than the RF distance to \(T'\) only when 3 categories are considered. When applied to these data, NJ with paralinear distances reconstructs the tree T.

It should also be mentioned that several studies have analyzed this data set seeking to build a species network from individual gene trees. These works mostly support a hybridization event involving S. kudriavzevii and S. bayanus, resulting in a species network with at least one 4-cycle that could give rise to the inconsistencies between trees T and the \(T'\); see for instance Allman et al. (2019); Holland et al. (2004) and Yu et al. (2011).

Table 3 Robinson–Foulds distance of the consensus tree obtained by WIL and WO with ASAQ weights (with different number of categories) applied to the trees T and \(T'\) suggested in Rokas et al. (2003) and Phillips et al. (2004), respectively

2.3.2 Ratites and Tinamous

First, we analyze the results obtained by 50 majority rule consensus trees obtained from WO and QP on 100 input quartets weighted with different methods. In Table 4 we show the average results in comparison to the trees A, B and C detailed in Sect. 1.4. The number of interior edges obtained by each method shows that WO produces more resolved trees than QP (independently of the weighting method) and that ASAQ, SAQ and Erik+2 (m=1) give rise to more resolved trees than PL and 4P with both WO and QP.

Table 4 For each weighting system (indicated in the left column) combined with either QP or WO, the first column depicts the average number of interior edges of 50 MRCTs on Ratites real data. The average RF distance and normalized RF distance (that is, dividing by the number of interior edges) from the MRCTs to the trees A, B and C (see Sect. 1.4) is shown in the remaining columns. The last row shows the RF distance and the normalized RF distance from the global NJ tree (obtained using paralinear distances) to trees A, B and C
Fig. 7
figure 7

MRCTs obtained for ratites and tinamous data (see Materials and Methods in the main document ) by WO applied to 2125 initial quartets with weights from a Erik+2 (\(m=1\)), b SAQ, c ASAQ (\(m=1\)), d PL and e 4P. The trees obtained with these weights from 5313 initial quartets or with both \(m=1\) and \(m=2\) are the same as shown. All edges are shown with the same length, and they do not represent evolutionary distance. Tree c) (ASAQ) corresponds to the phylogeny D: (outgroup, neognathus, tinamous, (rheas, (ostrich, (moas, CEK)))). Species considered are: Little spotted (LS), Great spotted (GS), and Brown (B) kiwi; Cassowary; Emu; Eastern (E), Little Bush (LB), and Giant (G) moa; Ostrich, Lesser (L) and Greater (G) rhea; Talaupa (T), Giant (G),and Elegant Crested (EC) tinamou; Caiman; American Alligator; Brush (B) turkey; Chicken; Magpie (M) goose; Redhead duck; Little blue (L) penguin; Red-throated (RT) loon; Ruddy (R) Turnstone; Blackish (B) oystercatcher

Then we study the MRCTs obtained for these real data with WO (trees displayed in Fig. 7). When we use ASAQ and we compute the MRCT with 2125, 5313, or all initial quartets (with both \(m=1\) and \(m=2\)), we obtain the phylogeny

  1. D:

    (outgroup, neognathus, tinamous, (rheas, (ostrich, (moas, CEK)))).

This phylogeny is also supported by a global NJ, although NJ splits tinamous and outgroup against the others. The same tree as NJ is obtained with SAQ (both for 2125 or 5313 initial quartets). Although the deepest interior branch lengths within ratites obtained by NJ are very small: (rheas, (ostrich, (moas, CEK):0.0025):0.003), SAQ and ASAQ give full support to these splits. Indeed, even when we use SAQ or ASAQ (with both \(m=1\) or 2) with only 100 initial quartets, all 50 MRCTs share these splits. Unexpected trees are obtained by Erik+2, PL or 4P (either not solving correctly the neognathus clade, or not giving a CEK clade nor putting all tinamous in a clade).

2.4 Execution Time and Implementation

An implementation of ASAQ in C++ can be found in

https://github.com/marinagarrote/ASAQ-method

The Q-methods WO, WIL and QP applied along the paper are implemented in

https://github.com/msabvid/weights_quartet_methods/

The computations on this paper have been performed on a computer with 6 Dual Core Intel(R) Xeon(R) E5-2430 Processor (2.20 GHz) equipped with 25 GB RAM running Debian GNU/Linux 8. We have used the g++ (Debian 4.9.2-10+deb8u2) 4.9.2 compiler and the C++ library for linear algebra & scientific computing Armadillo version 3.2.3 (Creamfields).

The average time required to compute ASAQ and PL weights for 100 alignments of 4 taxa and length 10 000 bp is 8.7 and 7.8 seconds, respectively. For each alignment, ML was stopped if it did not converge for a 4-tuple after 10 s; the frequency of convergence can be seen on Tables 9 and 11, 12, 13 and 14. The average time to reconstruct a CD tree given the weights for its 495 quartet subtrees is 4 seconds for WO, 89.5 seconds for WIL and 1.8 seconds for QP.

3 Discussion

Via experiments on the simulation framework proposed by Ranwez and Gascuel (2001) but considering more general models of nucleotide substitution (including the GM model and mixtures of distributions), we observe a huge improvement on the performance of Q-based methods when weights from ASAQ, SAQ, Erik+2 and PL methods are considered. In general, the highest success is obtained by WO with the weighting system of ASAQ or PL, or QFM with ASAQ input. The accuracy of these methods is compatible with a global NJ or ML and improves upon them in the presence of mixtures or long branches when using ASAQ as weighting system. Moreover, these weights also outperform weights obtained by ML, even when data are generated under a GTR model and ML is estimating the same model.

It is worth noting that in the previous studies of Q-methods by Ranwez and Gascuel (2001); John et al. (2003), only weights from ML and NJ were considered. The results in this paper validate QFM, wQFM, WO and WIL as successful phylogenetic reconstruction methods if their input is a system of reliable weights. Moreover, as ASAQ assumes the most general Markov model and can deal with mixtures, this opens the door to use Q-methods to data generated by complex models.

We need to mention that the comparison performed against the ML weights (ML as input of Q-methods) or a global ML may not be totally fair because the model used in parameter estimation (continuous-time unrestricted model) does not fit the GM model used to simulate the data. It would certainly be interesting to develop maximum likelihood estimation based on the GM model and perform new tests. In any case, the results of ML weights are much worse than those of ASAQ, Erik+2 or PL also for GTR data (see Fig. 17 and Table 8 in the Appendix) and the results of a global ML are similar to a global NJ.

We have observed a good performance of input weights from ASAQ, Erik+2 and PL on simulated mixture data on 12-taxon trees, especially for 5 000 bp or more. ASAQ and PL are not likely to be statistically consistent for general mixtures on the same tree. Nevertheless, as Erik+2 is consistent on mixture data and PL is known to be consistent on some type of mixtures (see Allman et al. 2019), this suggests that ASAQ (being based on the accordance of Erik+2 and SAQ) might also be consistent on some types of mixture data, which would explain the good results obtained. We would like to point out that the mixture model allowed in Erik+2 (and hence in ASAQ) is actually more flexible than we mentioned. Indeed, the rank conditions considered in Erik+2 for quartets still hold if we let one cherry or one leaf evolve under a mixture with any number of categories (while the other cherry evolves under a single system of parameters), see Casanellas and Fernandez-Sanchez (2020). This makes the mixture model underlying Erik+2 and ASAQ more general, with implications in the mixture model that can be considered for Q-methods with input weights from ASAQ.

Our results on real data validate WO + ASAQ as a reliable method (both for yeast or ratites/tinamous data). Moreover, it is relevant to note that for the ratites/tinamou data, the topology obtained by WO + ASAQ or SAQ, or by NJ with paralinear distances does not agree with any of the topologies proposed in Phillips et al. (2009).

Note that the Q-methods considered here have higher order of computational complexity than NJ, but we have not yet explored the possibility of considering a subset of the possible 4-tuples as suggested in Snir and Rao (2010) or Davidson et al. (2018). It would be interesting to further explore these other versions of Q-methods with weights from ASAQ and PL, or even to restrict to quartets with highest weights as starting point for Q-methods.

On another direction, our results show that ASAQ is a powerful reconstruction method for quartet topology reconstruction. It assumes the most general model of nucleotide substitution (a general Markov model) of independently and identically distributed sites but it can also account for mixtures of distributions (with up to three categories). As it is based on the algebraic and semi-algebraic description of the model, it does not need to estimate the substitution parameters. In this sense, ASAQ could be easily adapted as a suitable method for dealing with amino acid substitutions as well. The incorporation of invariable sites seems also plausible via the results in Jayaswal et al. (2007); Steel et al. (2000). We plan to incorporate these features in a forthcoming version of the software.

As mentioned in the introduction, ASAQ is part of the set of phylogenetic reconstruction tools that are based on algebra. Most of these methods only reconstruct quartets because no statistically consistent (and computationally affordable) algebraic method for larger trees has been designed yet.

A study of ASAQ from a statistical point of view would certainly be relevant, as in this study the efficiency of the method has been solely based on the results obtained on large sets of simulated data.