Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages

Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, thus especially indicated to deal with data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages and does not assume stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson’s method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees.

Molecular phylogenetic reconstruction faces several problems, still nowadays. Even if one restricts to gene tree reconstruction, one has to take into account the amount of available data (which might be low with respect to the number of taxa), and depending on the method applied, the selection of a suitable evolutionary model, the inherent difficulty of estimating parameters for the most complex models, and the incorporation of heterogeneity across sites and/or across lineages, among others (cf. Jermiin et al. 2020;Zou et al. 2019, and the references therein).
Most of the available methods strongly depend on the evolutionary model assumed (this is the case of maximum likelihood, Bayesian or distance-based approaches, to a more or less extent), and some have to estimate the substitution parameters for each possible tree topology. Recent work minimizes the relevance of the selection of the substitution model if the tree topology is correctly inferred (see Abadi et al. 2019) and in this case, a general Markov model (GM for short) could be used (see also Kaehler et al. 2015). Regarding topology reconstruction, some methods that avoid parameter estimation and allow complex substitution models such as GM are those based on the paralinear distance or on algebraic tools (phylogenetic invariants and related tools, see Allman and Rhodes 2007).
The paralinear distance (Lake 1994) is a measure that attempts to estimate the evolutionary distance (in terms of expected number of substitutions per site) when sequences evolve under a GM model. It may overestimate the expected number of substitutions when the process is far from stationary, or branch lengths are long (see Kaehler et al. 2015;Zou et al. 2012), so more elaborate methods are provided in the quoted papers. However, it is a widely used distance due to its simple formula and its generality, and it has recently been proven to be consistent for the multispecies coalescent model in the ultrametric case as well, see Allman et al. (2021. Several algebraic methods for phylogenetic reconstruction have been proposed in the last years, see, for example, SVDQuartets (Chifman and Kubatko 2014), Erik+2 (Fernández-Sánchez and Casanellas 2016)-both already implemented in PAUP* (Swofford 2003), Splitscores 1 (Allman et al. 2016), or SAQ (Casanellas et al. 2021b). The methods Erik+2, Splitscores, and SAQ are based on the GM model and, in particular, they are not subject to stationarity or time-reversibility and account for different rates of substitution at different lineages (the so-called heterogeneity across lineages, see Appendix A). The three of them consider algebraic conditions in the form of rank constraints of a flattening matrix obtained from the observed distribution of characters on a sequence alignment. Only SAQ considers also the stochastic description of the evolutionary model (see Allman et al. 2014), which translates into semi-algebraic conditions. On the other hand, Erik+2 also allows across-sites heterogeneity as it is able to deal with data from a mixture of distributions on the same tree. Despite the potential of algebraic-based methods for topology reconstruction (already pointed out in the book by Felsenstein (2004)), some of these methods may not work well for short alignments especially in presence of the long branch attraction phenomenon. For instance, Erik+2 is highly successful on different types of quartet data (also on the Felsenstein zone) but requires at least one thousand sites to outperform maximum likelihood or neighbor-joining (briefly NJ) (Fernández-Sánchez and Casanellas 2016). By taking into account the stochasticity of the substitution parameters, SAQ overcomes this problem at the expense of performing slightly worse than Erik+2 for very large amounts of data (10,000 sites or more) (Casanellas et al. 2021b).
Algebraic methods are mainly aimed at recovering quartet topologies (or splits in some cases) and some account for the possibility of being implemented into quartetbased methods. Quartet-based methods (Q-methods for short) have been questioned in the literature. For instance, Ranwez and Gascuel (2001) evaluated two quartet-based methods, their weight optimization method (briefly WO) and the quartet puzzling method (QP) by Strimmer and von Haeseler (1996), and they weighted quartets using a maximum likelihood approach for the Kimura two-parameter model. Their main conclusion was that both QP and WO give worse results than neighbor-joining or than a maximum likelihood approach applied directly to the whole set of taxa. More recent methods such as QMC from Snir and Rao (2010), QFM Reaz et al. (2014), or wQFM from Mahbub et al. (2021) seem to be competitive with maximum likelihood and are scalable to large data sets. As pointed out by Ranwez and Gascuel (2001), the weaknesses of Q-methods are very likely due to the method of weighting the quartets rather than to the method of combining them.
As far as we are aware, the only works that evaluate the use of algebraic methods as input of quartet-based methods are Rusinko and Hipp (2012) (which is restricted to QP with the Jukes-Cantor model), Holland et al. (2012) (where squangles are applied to infer the quartets) and Chifman and Kubatko (2014) (for the coalescent model). On the other hand, the correct management of long-branch attraction is crucial for obtaining a successful quartet-based method (John et al. 2003). These claims and remarks motivate part of the present work. We expect that, as SAQ, Erik+2 and the paralinear method handle successfully the long-branch attraction problem, Q-methods with these weighting systems improve their performance. Moreover, Zou et al. (2019) proved that if a machine learning approach is applied to weighting the quartets, QP can have a similar performance than NJ especially under substitution processes that are heterogeneous across lineages. Precisely, handling heterogeneity of substitution rates across lineages is one of the key features of algebraic methods based on the GM model of nucleotide substitution.
The goal of this work is to test the performance of several quartet-based methods for phylogenetic tree topology reconstruction when applied with input weights from different methods consistent with the general Markov model. We test QFM, wQFM, QP, WO and the method WIL (proposed in Willson 1999) with different systems of input weights: two systems, PL and 4P(see Appendix A), derived from paralinear distances and the four-point condition (Lake 1994;Gascuel 1994;Mihaescu et al. 2009), and three systems based on algebraic and semi-algebraic methods, namely SAQ, Erik+2 and the new proposed method ASAQ (see below) that combines the paralinear distance and algebraic methods. We also provide a new implementation of QP, WO, WIL and the code for ASAQ. We test exhaustively all these combinations on simulated data evolving either under the GM model or the homogeneous general (continuous) time-reversible model (GTR) on twelve taxon trees, and also provide a comparison when input weights from a maximum-likelihood approach are used. We also compare the performance to a traditional NJ(from now on, global NJ) and a ML estimation of the tree (global ML) and test some of the methods on real data: the eight species of yeast studied in Rokas et al. (2003) and the Ratite mitochondrial DNA data studied in Phillips et al. (2009).
The new method ASAQ (standing for algebraic and semi-algebraic quartet reconstruction) is a topology reconstruction method for four taxa which combines Erik+2 and SAQ by means of the paralinear method. This is a quartet reconstruction method based on a statistic (see Eq. (1) in the Materials and Methods section) that assesses the positivity of the estimated length (using the paralinear distance) of the interior branch of the quartet. When data are unmistakably generated by a quartet, the topology output by any reconstruction method should be consistent with the positivity of this statistic. ASAQ uses the paralinear method to either ratify the results of Erik+2, or to rely on SAQ when there is an inconsistency between the outputs of Erik+2 and the paralinear method. By proceeding like this, ASAQ ensures an overall better performance on quartets than Erik+2, SAQ (and the paralinear method itself in most occasions).
As ASAQ is statistically consistent with the GM model, it can consider data evolving on a tree where base composition and instantaneous rate substitution patterns can differ among lineages (heterogeneity across lineages), and does not assume stationarity nor time-reversibility, see Appendix A. We test the performance of ASAQ on a wide scenario of simulated data: on the tree space proposed by Huelsenbeck (1995), on quartets with random branch lengths and on mixture data on the same topology with two categories. All simulations used data either generated from the GM model or from a GTR model, with different sequence lengths.
One can use ASAQ with mixtures of distributions on the same tree topology with two or three categories (or partitions) as this was already implemented by Erik+2. Although ASAQ is based on SAQ and the paralinear method, which are not guaranteed to be consistent on mixtures, it is also highly successful on this kind of data, see the Results section.
π , this is usually called a general Markov model (GM) on T . One can compute the theoretical joint distribution p of patterns at the leaves of T in terms of the entries of M e and π and we say that p has arisen on T with certain substitution parameters.
We consider fully resolved (unrooted) trees on a set of four taxa [4] = {1, 2, 3, 4}: the three quartet trees shall be denoted as 12|34, 13|24, 14|23, according to the bipartition induced by the interior edge. Fernández-Sánchez and Casanellas (2016) introduced Erik+2, a reconstruction method essentially based on the rank of flattening matrices obtained from a distribution of nucleotides at the leaves of a tree. The method allows the possibility of dealing with data from mixtures of distributions with up to 3 categories (this upper bound is not computational, but rather a consequence of the theoretical basis of the method: distributions from mixtures with c categories satisfy that the rank of a certain 16 × 16-matrix is upper bounded by 4c, so in order to discriminate between such distributions and generic distributions we need 4c < 16). In the more recent method SAQ, Casanellas et al. (2021b) create a more sophisticated method combining the rank of the flattening matrices with the stochastic information available in the data via a result of Allman et al. (2014). Both methods Erik+2 and SAQ, as well as their associated weighting system, are briefly described in Appendix A.1. They have been widely studied and the reader is referred to the cited publications for their performance on different scenarios.
As already explained in the introduction, while SAQ usually outperforms Erik+2 for short DNA sequence alignments (length ≤ 1000), Erik+2 obtains better results as the length of the DNA alignment increases. This consideration leads us to introduce ASAQ, a new combined method of Erik+2 and SAQ that tries to apply one or the other according to whether the input pattern distribution is consistent with the positivity of the estimated length of the interior branch. To this end we use the paralinear distance and the paralinear method (see Appendix A.1, Lemma A.1 and Lake (1994)).
For a distribution of patterns at the set of leaves, p ∈ R 256 , we compute all paralinear distances d x,y between pairs x, y ∈ [4]. Then, given a bipartition A|B of the set [4], A = {i, j}, B = {k, l}, we define the following quantity (1) The quantity above is the "neighborliness" measure used in Gascuel (1994), the "paralinear method" used in Lake (1994), and was presented by Buneman (1971) as a measure of twice the length at the interior edge (when d is a tree metric). It is worth pointing out that at most one of the three values p 12|34 ( p), p 13|24 ( p) and p 14|23 ( p) is strictly positive if all paralinear distances above are non-negative, see Lemma A.2. We denote by PL( p) the collection of these values: If p has arisen on a quartet A|B and the entries of the Markov matrix at the interior edge were strictly positive, both quantities inside the minimum in Eq. (1) are equal (as can be deduced from the 4-point condition). Moreover, p A|B ( p) is the unique positive quantity in the triplet PL( p) and its value coincides with twice the paralinear distance of the interior edge (see Theorem A.3). The paralinear method is the quartet reconstruction method that outputs the tree T with highest p T value. It is a statistically consistent quartet-inference method (Theorem A.3). Therefore, we propose the following quartet reconstruction method ASAQ: it checks whether both methods Erik+2 and the paralinear method PL output the same quartet, and (1) if Erik+2 and PL agree, then ASAQ outputs the topology and weights of Erik+2; (2) if they do not agree or some paralinear distances d x,y are negative, then ASAQ outputs the topology and weights of SAQ.
An inconsistency between Erik+2 and the paralinear method implies that data are far from having arisen on a tree with stochastic parameters (note the role of the positivity of the entries of the Markov matrix in Lemmas A.1 and A.2). The positivity of transition matrices implies semi-algebraic conditions on the joint distributions at the leaves (Allman et al. 2014) and in this case we rely on SAQ, as it is the unique method that takes into account both these semi-algebraic conditions and the algebraic constraints considered by Erik+2. In Theorem A.3 we prove that ASAQ is as well a statistically consistent quartet reconstruction method for the general Markov model and in Table 5 we show the percentage of discordance between PL and Erik+2, and the success of SAQ in these cases. We cannot claim that ASAQ is statistically consistent for mixtures because consistency is not known to hold for the paralinear method or SAQ in this scenario. However, the simulation studies in Casanellas et al. (2021a) show a good performance of SAQ in mixture data from the same tree and this will lead to a good performance of ASAQ as well (see the Results section). We denote by ASAQ (m = k) the use of ASAQ with Erik+2 estimating mixtures on the same tree with k categories. The limit on the number of categories m = 3 for quartets comes from the theoretical foundations of Erik+2, as a larger amount of categories would make unfeasible the identifiability of the tree topology by this method.
As a topology reconstruction method, the paralinear method is highly successful (see Results section, Fig. 2). Nevertheless, we found that using the paralinear method in order to ratify or not the results of Erik+2 gives a better performance on topology reconstruction.

Quartet-Based Methods (Q-Methods)
We have implemented different quartet-based methods (Q-methods) with different input weights. All of them seek for a tree that maximizes weighted quartet consistency. Quartet puzzling (QP), weight optimization (WO) and Willson's (WIL) methods have been programmed in C++ and wQFM has been used with the implementation provided in Mahbub et al. (2021). QP amalgamates quartets in a randomized order and seeks to maximize the total sum of weights; we provide a new implementation of QP different from the one in Schmidt et al. (2002) since we wish to apply the method with systems of weights not based on likelihoods. Weight optimization uses quartet weights to dynamically define the taxon addition order, seeking to maximize the total weight at each step. WO is known to reconstruct the correct tree if the input quartets are correctly weighted (i.e. if all quartets of the original tree have the highest weight among the tree possible weights of the corresponding 4-tuple), see Ranwez and Gascuel (2001). Instead of constructing a tree that maximizes the total weight at each step, the essential idea of Willson's method is attaching new taxa in such a way that the new tree at each step is highly consistent with the input quartets. QP, WO and WIL are initialized at a random 4-tuple. Since the output of these Q-methods strongly depends on the choice of the initial quartet, each one of them has been applied 100 times to each alignment and then the majority rule consensus tree (briefly MRCT) of these 100 replicates has been computed.
On the contrary, wQFM amalgamates quartets following a divide and conquer approach and implicitly gives the same importance to all quartets (and does not depend on an initial quartet choice). Although wQFM was specially designed to build species trees from gene trees, it can also consider other input weights. We use wQFM with quartets weighted by different methods, but also with unweighted quartets. In this last case we refer to the method as QFM Reaz et al. (2014) and the weights are transformed to 1 for the quartet output by the method and to 0 for the other two quartets.
In order to evaluate the difference between two trees, we use the Robinson-Foulds distance (RF for short), (Robinson and Foulds 1981). For the computation of the majority rule consensus tree and the RF distance, the available functions in the Python Library DendroPy have been used, see Sukumaran and Holder (2010).

Input Weights
We require the weights of the quartet reconstruction methods to be positive and normalized. Details about the input weights obtained from ASAQ are provided in the previous section. Further details about the weighting system for all the considered methods are moved to Appendix A.1. For the paralinear method the weights are denoted as PL and are obtained after normalizing the exponentials of the scores given by Eq. (1); 4P is a slight modification of this method, see Appendix A.1. We consider already published weighting systems for SAQ, Erik+2 and maximum likelihood (ML): see Fernández-Sánchez and Casanellas (2016) for Erik+2, Casanellas et al. (2021b) for SAQ, and the posterior probabilities used in  for ML.
Maximum likelihood weights (which will be denoted as ML) have been computed assuming the most general continuous-time homogeneous model (same instantaneous rate matrix throughout the tree but no constraints on the entries of the rate matrix or assumption of stationarity); this is the Lie Markov model listed as 12.12 in Fernández-Sánchez et al. (2015) and was denoted as ML(homGMc) in Fernández-Sánchez and Casanellas (2016). To this aim we used the baseml program from the PAML library (Yang 1997) with the UNREST model and let it infer the instantaneous rate matrix (common to all lineages) and the distribution at the root. Note that in order to use weights obtained from ML, we need to obtain the estimates of the maximum likelihood for the three possible quartets. This is unfeasible when the method does not converge (which often happens for the quartets which did not generate the data).

Description of the Simulated Data
We consider two different scenarios of simulated data: one for testing ASAQ as a quartet reconstruction method (described in Sect. 1.3.1) and another for testing different Qmethods with several weighting systems (see Sect. 1.3.2). For the first we use the simulated data introduced by Fernández-Sánchez and Casanellas (2016), whereas for the second we follow the approach of Ranwez and Gascuel (2001) and consider 12-leaf trees.
Evolutionary models For the trees described below, we generate data evolving either on a general Markov model (GM, see Sect. 2.1) or on a homogeneous general time-reversible model (GTR). By a homogeneous GTR model we mean a continuous-time GTR model that shares the same instantaneous mutation rate matrix Q across all branches of the tree. GTR data have been generated using Seq-gen (Rambaut and Grass 1997), while GM data have been generated using the software GenNon-h (Kedzierska and Casanellas 2012) available at https://github.com/ Algebraicphylogenetics/GenNon-H. For GTR data we specified in each setting a stationary distribution and a rate matrix common to all generated trees (see below for details on the chosen rates in each case), whereas for GM data the transition matrices are randomly generated for each alignment (according to the specified branch lengths).

Simulated Data for Quartet Reconstruction
Tree space The first data set we use to test ASAQ corresponds to the tree space suggested by Huelsenbeck (1995). We consider quartets as in Fig. 9a in Appendix B, with branch lengths given by a pair of parameters a and b which vary between 0 and 1.5 in steps of 0.02. Branch lengths are always measured as the expected number of elapsed substitutions per site. The resulting tree space is shown in Fig. 9b in Appendix B. The upper left region of this tree space corresponds to the "Felsenstein zone", which contains trees subject to the long branch attraction phenomenon. For each of the two nucleotide substitution models considered in the paper, GM and GTR, and for each pair (a, b) of branch lengths, we have simulated one hundred alignments. The rates for GTR data were chosen as in Fernández-Sánchez and Casanellas (2016), subsection Description of the Data, p.284. The considered alignment lengths are of 500, 1 000 and 10 000 sites. Following Casanellas et al. (2021b), we test ASAQ on 10 000 alignments generated from quartets whose branch lengths are randomly generated according to a uniform distribution in the intervals (0, 1) or (0, 3). These alignments are obtained according to both substitution models, GM and GTR, and are either 1 000 or 10 000 sites long. We represent the weights output by ASAQ in a ternary plot (also called a simplex plot) as in .

Mixture data
The performance of ASAQ is also tested in the scenario of data sampled from a mixture of distributions. According to the approach of Kolaczkowski and Thornton (2004), we consider the mixture of distributions as follows. We partition the alignment into two categories of the same sample size both evolving under the GM Fig. 1 The three different tree topologies CC, C D and D D on 12 taxa considered to test the Q-methods with different weighting systems. They are obtained by glueing a combination of two trees (C and D) by the root. Here, the parameter b represents the length of internal branches. These tree topologies have been taken from (Ranwez and Gascuel 2001) model on the same quartet topology as Fig. 9a but the first category corresponds to branch lengths a = 0.05, b = 0.75, while the second corresponds to a = 0.75 and b = 0.05 (see Fig. 10 in the Appendix). The internal branch length takes the same value in both categories and varies from 0.01 to 0.4 in steps of 0.05. The total length of the alignments considered is 1 000 or 10 000 sites.

Simulated Data for Larger Trees
We followed Ranwez and Gascuel (2001) to test the performance of different quartetbased methods (QP, WO and WIL) with different weighting systems. To this end, we considered the three 12-taxon topologies depicted in Fig. 1 denoted as CC, CD and DD, and fixed the ratio among branch lengths as in Ranwez and Gascuel (2001), depending on a parameter b denoting the internal branch length, which is varied in the set {0.005, 0.015, 0.05, 0.1, 0.25, 0.5}.
For each tree topology and for each b, we have considered 100 alignments with lengths 600 (in order to match the alignment length considered in Ranwez and Gascuel (2001)), 5 000, and 10 000 generated under the GM model. For DD topology, we have also generated data under the GTR model with b = 0.005, . . . , 0.1, and instantaneous rates 2 (A→C), 5 (A→G), 3 (A→T), 4 (C→G), 1 (C→T), 2 (G→T) and equal base frequency (we chose DD because it is arguably the hardest to reconstruct, see the Results section).

Mixture data
We have considered a 2-category mixture model, that is, we generated alignments evolving on a the topology CD but whose sites evolve following two systems of substitution parameters: the first system corresponds to the branch lengths described in the tree CD, while the second system corresponds to CD after exchanging the branch lengths of seq3 and seq4, and the lengths of seq7 and seq8 (see Fig. 11 for details). The parameters that were varied in this framework were the proportion p of sites in the first category (which was varied in 0.25, 0.50, and 0.75) and the internal edge length b which was varied as above in 0.005, 0.015, 0.05 and 0.1. The lengths of the alignments considered were 600, 1 000 and 10 000 bp.

Real Data
Yeast data We analyze the performance of ASAQ on real data on the eight species of yeast studied in Rokas et al. (2003) with the alignment consisted of the concatenation of 42 337 s codon positions of 106 genes as provided by Jayaswal et al. (2014). We investigate whether the quartets output by ASAQ support the tree T of Rokas et al. (2003), the alternative tree T of Phillips et al. (2004) (see Fig. 12), or the mixture model proposed by Jayaswal et al. (2014). Although the tree T is widely accepted by the community of biologists, its correct inference is known to depend on the correct management of heterogeneity across lineages, as an inaccurate underlying model usually reconstructs T (Rokas et al. 2003;Phillips et al. 2004;Jayaswal et al. 2014). According to Jayaswal et al. (2014), these data are best modeled by considering, apart from heterogeneity across lineages, two different rate categories (discrete distribution) plus invariable sites. In our setting, this is translated into a mixture distribution with 3 categories (m = 3 in ASAQ).

Ratites and tinamous mitochondrial data
The phylogeny of ratites and tinamous has been debated for some time (see Phillips et al. 2009, and the references therein) and is still controversial (Benito et al. 2022). There is evidence of a higher rate of evolution among the tinamous relative to the ratites (see, e.g., Paton et al. 2002), so these data are appropriate for analysis by the methods proposed here. Moreover, the recoding used in Phillips et al. (2009) to sort out this problem is questioned in Vera-Ruiz et al. (2021). We do our analyses using a 3506 sites alignment consisting of the third codon position of the mitochondrial DNA coding alignment for 24 DNA sequences provided in this last paper. We run WO and QP with weights obtained by ASAQ, SAQ, Erik+2, 4P, PL. For each combination of methods we used all quartets but also a random selection of a subset of input quartets of size either one hundred, 2125 (approximately 20% of all possible quartets), or 5313 (equal to 50%), and then we performed the MRCTs. This was aimed at testing the sensitivity of the method to the amount of input data (comparing the performance when only a portion of the quartets is used) in terms of scalability of the methods.

Results
In this section, we describe the results obtained and benchmark them with published results for the sake of completeness. The interested reader is referred to the corresponding papers for details of the methods therein.

Tree Space
The performance of ASAQ and PL on data generated on the tree space of the previous section (Materials and Methods) is represented in Fig. 2 (for GM data) and in Fig. 13 in the Appendix (for GTR data). In black we represent 100% success, in white 0% success, and gray tones correspond to regions of intermediate success accordingly. The 95 % and 33 % isoclines are represented with a white and black line, respectively. These simulation studies show a consistent performance according to the results by Huelsenbeck (1995) and Fernández-Sánchez and Casanellas (2016), with the usual decreasing performance at the Felsenstein zone and an improvement of both methods with sample size. Figures 2 and 13 show that the performance of ASAQ is better than that of PL, for both GM and GTR data.
For completeness, the average performance of ASAQ and PL on this tree space for different alignment lengths and underlying models is compared to other methods in Table 1: we include the average results of SAQ (Casanellas et al. 2021b, as shown in) and Erik+2 and ML as published in Fernández-Sánchez and Casanellas (2016). This comparison shows that for GM data the best results are achieved by ASAQ, while for GTR data ML obtains the best performance for alignments of length 500 and 1 000 bp and Erik+2 does so for long alignments (10 000 bp).

Random Branch Lengths
To visualize the overall distribution of the weights of ASAQ applied to trees with random branch lengths, in Fig. 3 we show ternary plots corresponding to the GM alignments described in Sect. 1.3.1. The ternary plots of the performance of ASAQ when applied to the same setting with data generated under the GTR model are shown in Fig. 14. In Table 2 we display the summary of average success of ASAQ on these data.
Note that Table 2 shows a high performance of the method and the ternary plots show a clear distribution of points towards the bottom left corner (which represents the correct quartet) and weights symmetrically distributed on the other corners. In particular, the method is not biased towards any of the incorrect topologies. We note  Fig. 9b on alignments of length 500 bp (top), 1 000 bp (middle) and 10 000 bp (bottom) generated under the GM model. Black is used to represent 100% of successful quartet reconstruction, white to represent 0%, and different tones of gray the intermediate frequencies.
The 95% contour line is drawn in white, whereas the 33% contour line is drawn in black Table 1 Average success of several methods applied to data simulated on the tree space of Fig. 9b). ASAQ and PL are compared to the results for SAQ obtained in (Casanellas et al. 2021b), and for Erik+2 and maximum likelihood ML in (Fernández-Sánchez and Casanellas 2016). MLestimates the most general continuous-time homogeneous process 12.12 when data are generated under a GM model, while ML estimates a homogeneous GTR model when data are generated under GTR, see (Fernández-Sánchez and Casanellas 2016,  that the level of success exhibited is quite sensitive to the branch length, being much higher for branch lengths in (0,1) than in (0,3). We do not appreciate a remarkable difference between the performance of ASAQ when applied to GTR data.

Mixture Data
In Fig. 4 we show the performance of the method ASAQ with m = 2 categories when applied to data from mixtures as described in Sect. 1.3.1 and in Fig. 10. Based on the results of Fig. 5 in Fernández-Sánchez and Casanellas (2016), we also provide a comparison to the success of Erik+2 (m = 2) and two versions of maximum likelihood on the same data. We did not include the performance of SAQ, as it is very similar to that of ASAQ. Figure 4 shows an increasing accuracy of all methods when the value of the parameter r (the branch length of the interior edge of the two trees involved) is increased. This was expected as larger values of r represent larger divergence between sequences at the left and the right of the trees. We note that ASAQ (m = 2) outperforms the other methods, with a high level of success in all cases, even when the length of the alignments is 1 000 bp. The average success of ASAQ (m = 2) applied to the simulated Ternary plots corresponding to the weights of ASAQ applied to 10000 alignments generated under the GM model on the 12|34 tree. On each triangle the bottom-left vertex represents the underlying tree 12|34, the bottom-right vertex is the tree 13|24 and the top vertex is 14|23. The small gray triangle depicted represents the average point of all the dots in the figure. Top: correspond to trees with random branch lengths uniformly distributed between 0 and 1; bottom: random branch lengths uniformly distributed between 0 and 3. Left: 1 000 bp; Right: 10 000 bp. The analogous ternary plots for SAQ can be found in Garrote-López (2021) data is 96.64% for 1 000 bp, 99% for 10 000 bp and 99.92% for 100 000 bp (results not shown).

Results of Q-Methods
We proceed to describe the performance of wQFM, QFM, WO, WIL and QP (see the subsection on Quartet-based methods ) applied to the input weights from ASAQ, SAQ, Erik+2, PL, 4P and ML on the trees CC, C D and D D presented in Sect. 1.3.2. The weights from PL and 4P are both defined in terms of the paralinear distance (see Appendix A.1), and they produce similar results. Because of this, we omit the Fig. 4 These plots represent the percentage of correctly reconstructed trees by several methods applied to the mixture data described in the Method section; the value r refers to the branch length of the interior edge of the trees (Fig. 10). We compare the results of ASAQ (m = 2) and the results presented in (Fernández-Sánchez and Casanellas 2016, Figure 5) for Erik+2 (m = 2), ML (as usual on the 12.12 model), and a ML estimating a heterogeneous across lineages GTR with two categories of discrete γ rates across sites denoted as ML(GTR+2 ) (only for length 1000 bp) performance of 4P in some cases. We also add the results obtained from a global NJ applied to the same data (using the NJ algorithm implemented in the R package APE of Paradis et al. (2004) with paralinear distance) and also global ML using IQ-TREE 2 (Minh et al. 2020) with the most continuous-time homogeneous general model 12.12.

Unmixed Data
The results on simulated GM data obtained by the combination of methods and weights presented in this paper are summarized in Fig. 5 for CD, in Fig. 15 for CC and in Fig. 15 for DD. The results for GTR data (for the tree DD) are shown in Fig. 17. The height of the bars shows the average of the RF distance from the original tree to the consensus tree of 100 replicates for each of the 100 generated alignments. The values of these results are detailed in Tables 6 (for CC), C.3 (for CD) and C.4 (for DD) for GM data and in C.7 for GTR data on DD. For comparison, the results of a global NJ and global ML (which has a similar performance as a global NJ) applied to these data is displayed in Table 10 (see also the results of the global NJ in Figs. 5, 15 and 16). In Table 9, we show the results obtained when applying ML weights. Since ML did not converge for some quartets (especially in the presence of long branches and when trying to maximize the likelihood for GM data generated on another quartet), we write between parentheses the number of alignments considered for ML (in the computation of the average RF distance we neglected the alignments where ML did not converge).
By comparing the performance of the four Q-methods, we observe a slightly better performance of wQFM, QFM WIL and WO compared to QP. The accuracy drops for long branches in all methods, but wQFM and WO seem to perform slightly better for short and medium branches (b ≤ 0.1) while WIL does better for longer branches b ≥ 0.25. Among all Q-methods, QP is the most sensitive to the choice of the system of weights.
Another general remark is that all Q-methods with their best system of weights outperform a global NJand a global ML for b > 0.1 on GM data, while none of them (with any system of weights) beats a global NJ or ML for b smaller than or equal to 0.1. Figure 17 shows that a global NJ is the best option for GTR data on DD trees.
All reconstruction algorithms have had considerably more success when reconstructing the tree CC in contrast to CD or DD for GM data, in concordance with the results obtained by Ranwez and Gascuel (2001). In the tree topologies CD and DD, the distance between seq9 and seq10 is 4b, while the distance between seq8 and seq11 is 20b. The same happens between species seq1 and seq2, and seq0 and seq3 of the DD topology. Thus, for an alignment generated from these trees, there is a high probability that two separate lineages evolve in a convergent manner to the same nucleotide at the same site, creating a long branch attraction situation and making the reconstruction methods to infer the wrong topology. wQFM, QFM, WO and WIL are especially successful when dealing with CC and CD trees ( Fig. 15 and Fig. 5) and short branches, probably because these Q-methods succeed in reconstructing the C subtree.
Weights from ASAQ, SAQ, Erik+2, PL and 4P produce overall comparable results, although ASAQ and SAQ do better when dealing with long branches. wQFM with seems to work slightly better with quartets weighted by SAQ, PL and 4P over Erik+2 or ASAQ (although it does not perform as well as WO when dealing with long branches). On the contrary, in the unweighted version, ASAQ seems to provide the best input quartets for QFM (note that PL and 4P provide the same input for QFM, so they are displayed together in the tables). We also note an improvement in the results when the length of the alignment increases, especially for the CC case or for GTR data (Fig. 17) and hardly noticeable for the other two trees on GM data.
The simulation study shows a big difference between the general performance of the Q-methods with input weights from ASAQ, Erik+2 and PL in contrast to weights obtained by ML, especially when reconstructing the CC and CD trees (compare Tables  6, 7, 8 and 9). This could probably be due to the inconsistency between the ML weights computed and the general Markov model used to generate the data (see subsections Quartet-based methods and Description of the simulated data). Even when the model used for ML estimation matches the model that generated the data (as for GTR, see Table 11), ML seems to be the worse weighting system. When data are generated under the GTR model on DD trees ( Fig. 17 and Table 11), all Q-methods (except for ML weights) have a remarkable high performance when alignments are long enough (≥ 5 000 bp), especially WO. For short alignments, PL seems a good choice for the input weights with these data. All in all, we observe that both ASAQ and PL weights (combined with one or another Q-method) give rise to good results (much better than ML weights) comparable with the results obtained by a global NJor ML, and even better when dealing with long branches.

Mixture Data
The results on mixture data obtained by a global NJ and by QP, WO, and WIL with input weights from PL, ASAQ and Erik+2 adapted to 2-category data (as described in Sect. 1.3.2) are summarized in Fig. 6 for WO and in Tables 12, 13 and 14 for  Tables 6 and 10, and results with ML weights are included in Table 9 all methods. Tables 12, 13 and 14 have a similar structure as Tables 6, 7 and 8 and correspond to the results obtained for different proportions between the two categories; p = 0.25, p = 0.5 and p = 0.75, respectively. (We recall that the proportion p of sites of the first category of the alignment were generated assuming the branch lengths of the CD tree in Fig. 1.) It is worth pointing out that the reconstruction of mixture data from CD trees is more accurate on average than when applied to unmixed data Fig. 6 Average Robinson-Foulds distance on mixture data simulated on the tree CD for p = 0.25 (left), p = 0.5 (middle) and p = 0.75 (right) with alignment length 600 bp (above) and 5 000 bp (below). We omit the results for 10 000 bp as they are very similar to those obtained for 5 000 bp. The Q-method WO is applied with different systems of weights, namely ASAQ with 2 categories, Erik+2 with 2 categories, and PL. The white diamond represents the result of a global NJ with the paralinear distance. Results obtained by WO, QP, WIL for the different proportions between the two categories, p = 0.25, p = 0.5 and p = 0.75, can be found in Tables 12, 13 and 14, respectively. These tables also include results with ML weights (Fig. 5). This is probably due to the fact that when the branch lengths of species seq3, seq4 and seq7, seq8 are exchanged (Fig. 1), reconstruction methods have an easier job to make the appropriate splits, since the long branch attraction situation that was provoked by the quartet of species {seq8, seq9, seq10, seq11} does no longer exist with the new branch lengths. This is consistent with the observation that the reconstruction results are more accurate for low values of b. For short alignments (600 bp), we note that ASAQ and PL weights provide better results than Erik+2. For 5 000 bp ASAQ and Erik+2 outperform PL. Moreover, if b = 0.1, the combination WO+ASAQ or WO+Erik+2 beats a global NJ. For 10 000 bp, the difference between performance is even larger (see Tables Tables 12, 13 and 14). As expected, the length of the alignment improves the performance of these methods, reducing the impact of the long branch attraction effect for high values of b.

Yeast Data
As shown in Table 3, the claim by Jayaswal et al. (2014) is corroborated by our analysis: by performing a MRCT on 100 random initial quartets, WO+ASAQ reconstructs T when 3 categories are considered, but it reconstructs T otherwise. Similarly, the RF distance of the MRCT obtained by WIL+ASAQ to the tree T is smaller than the RF distance to T only when 3 categories are considered. When applied to these data, NJ with paralinear distances reconstructs the tree T . It should also be mentioned that several studies have analyzed this data set seeking to build a species network from individual gene trees. These works mostly support a hybridization event involving S. kudriavzevii and S. bayanus, resulting in a species network with at least one 4-cycle that could give rise to the inconsistencies between trees T and the T ; see for instance ; Holland et al. (2004) and Yu et al. (2011).

Ratites and Tinamous
First, we analyze the results obtained by 50 majority rule consensus trees obtained from WO and QP on 100 input quartets weighted with different methods. In Table 4 we show the average results in comparison to the trees A, B and C detailed in Sect. 1.4. The number of interior edges obtained by each method shows that WO produces more resolved trees than QP (independently of the weighting method) and that ASAQ, SAQ and Erik+2 (m=1) give rise to more resolved trees than PL and 4P with both WO and QP.
This phylogeny is also supported by a global NJ, although NJ splits tinamous and outgroup against the others. The same tree as NJ is obtained with SAQ (both for 2125 or 5313 initial quartets). Although the deepest interior branch lengths within ratites obtained by NJ are very small: (rheas, (ostrich, (moas, CEK):0.0025):0.003), SAQ and ASAQ give full support to these splits. Indeed, even when we use SAQ or ASAQ (with both m = 1 or 2) with only 100 initial quartets, all 50 MRCTs share these splits. Unexpected trees are obtained by Erik+2, PL or 4P (either not solving correctly the neognathus clade, or not giving a CEK clade nor putting all tinamous in a clade).

Execution Time and Implementation
An implementation of ASAQ in C++ can be found in Table 4 For each weighting system (indicated in the left column) combined with either QP or WO, the first column depicts the average number of interior edges of 50 MRCTs on Ratites real data. The average RF distance and normalized RF distance (that is, dividing by the number of interior edges) from the MRCTs to the trees A, B and C (see Sect. 1.4) is shown in the remaining columns. The last row shows the RF distance and the normalized RF distance from the global NJ tree (obtained using paralinear distances) to trees A, B and C  The average time required to compute ASAQ and PL weights for 100 alignments of 4 taxa and length 10 000 bp is 8.7 and 7.8 seconds, respectively. For each alignment, ML was stopped if it did not converge for a 4-tuple after 10 s; the frequency of convergence can be seen on Tables 9 and 11, 12, 13 and 14. The average time to reconstruct a CD tree given the weights for its 495 quartet subtrees is 4 seconds for WO, 89.5 seconds for WIL and 1.8 seconds for QP.

Discussion
Via experiments on the simulation framework proposed by Ranwez and Gascuel (2001) but considering more general models of nucleotide substitution (including the GM model and mixtures of distributions), we observe a huge improvement on the performance of Q-based methods when weights from ASAQ, SAQ, Erik+2 and PL methods are considered. In general, the highest success is obtained by WO with the weighting system of ASAQ or PL, or QFM with ASAQ input. The accuracy of these methods is compatible with a global NJor ML and improves upon them in the presence of mixtures or long branches when using ASAQ as weighting system. Moreover, these weights also outperform weights obtained by ML, even when data are generated under a GTR model and ML is estimating the same model.
It is worth noting that in the previous studies of Q-methods by Ranwez and Gascuel (2001); John et al. (2003), only weights from ML and NJ were considered. The results in this paper validate QFM, wQFM, WO and WIL as successful phylogenetic reconstruction methods if their input is a system of reliable weights. Moreover, as ASAQ assumes the most general Markov model and can deal with mixtures, this opens the door to use Q-methods to data generated by complex models.
We need to mention that the comparison performed against the ML weights (ML as input of Q-methods) or a global ML may not be totally fair because the model used in parameter estimation (continuous-time unrestricted model) does not fit the GM model used to simulate the data. It would certainly be interesting to develop maximum likelihood estimation based on the GM model and perform new tests. In any case, the results of ML weights are much worse than those of ASAQ, Erik+2 or PL also for GTR data (see Fig. 17 and Table 8 in the Appendix) and the results of a global ML are similar to a global NJ.
We have observed a good performance of input weights from ASAQ, Erik+2 and PL on simulated mixture data on 12-taxon trees, especially for 5 000 bp or more. ASAQ and PL are not likely to be statistically consistent for general mixtures on the same tree. Nevertheless, as Erik+2 is consistent on mixture data and PL is known to be consistent on some type of mixtures (see , this suggests that ASAQ (being based on the accordance of Erik+2 and SAQ) might also be consistent on some types of mixture data, which would explain the good results obtained. We would like to point out that the mixture model allowed in Erik+2 (and hence in ASAQ) is actually more flexible than we mentioned. Indeed, the rank conditions considered in Erik+2 for quartets still hold if we let one cherry or one leaf evolve under a mixture with any number of categories (while the other cherry evolves under a single system of parameters), see Casanellas and Fernandez-Sanchez (2020). This makes the mixture model underlying Erik+2 and ASAQ more general, with implications in the mixture model that can be considered for Q-methods with input weights from ASAQ.
Our results on real data validate WO + ASAQ as a reliable method (both for yeast or ratites/tinamous data). Moreover, it is relevant to note that for the ratites/tinamou data, the topology obtained by WO + ASAQ or SAQ, or by NJ with paralinear distances does not agree with any of the topologies proposed in Phillips et al. (2009).
Note that the Q-methods considered here have higher order of computational complexity than NJ, but we have not yet explored the possibility of considering a subset of the possible 4-tuples as suggested in Snir and Rao (2010) or Davidson et al. (2018). It would be interesting to further explore these other versions of Q-methods with weights from ASAQ and PL, or even to restrict to quartets with highest weights as starting point for Q-methods.
On another direction, our results show that ASAQ is a powerful reconstruction method for quartet topology reconstruction. It assumes the most general model of nucleotide substitution (a general Markov model) of independently and identically distributed sites but it can also account for mixtures of distributions (with up to three categories). As it is based on the algebraic and semi-algebraic description of the model, it does not need to estimate the substitution parameters. In this sense, ASAQ could be easily adapted as a suitable method for dealing with amino acid substitutions as well. The incorporation of invariable sites seems also plausible via the results in Jayaswal et al. (2007); Steel et al. (2000). We plan to incorporate these features in a forthcoming version of the software.
As mentioned in the introduction, ASAQ is part of the set of phylogenetic reconstruction tools that are based on algebra. Most of these methods only reconstruct quartets because no statistically consistent (and computationally affordable) algebraic method for larger trees has been designed yet.
A study of ASAQ from a statistical point of view would certainly be relevant, as in this study the efficiency of the method has been solely based on the results obtained on large sets of simulated data.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Weighting Systems and Technical Results About ASAQ
The underlying nucleotide substitution model we consider is a general Markov (GM) model (without treatment of indels or ambiguous sites). That is, we assume that substitution of nucleotides along a phylogenetic tree T follows a Markov process on T specified by a distribution π of nucleotides at an interior node (which pays the role of the root) and 4 × 4 Markov matrices M e for each edge e of T . We do not assume any extra condition on the transition matrices (neither stationarity nor time-reversibility).
As Markov matrices M e are not assumed of type ex p(t e Q e ) (which would imply assuming a time-homogeneous continuous-time Markov process), the Markov process along each edge is not necessarily homogeneous through time (that is, the patterns of instantaneous mutation rates are allowed to change through time). Models that assume M e = ex p(t e Q e ) are locally time-homogeneous; this condition is very restrictive: for example less than 4% of diagonal largest-in-column Markov matrices are of this type (Casanellas et al. 2023). Moreover, if a Markov process on T assumes M e = ex p(t e Q) with the same instantaneous rate matrix Q for the whole tree (as is often assumed in phylogenetic reconstruction), the process is globally time-homogeneous in addition (i.e. instantaneous mutation rate patterns are constant throughout the tree). In this sense, the GM model underlying the methods Erik+2, SAQ, ASAQ, PL and 4P is heterogeneous across lineages.

A.1 Weighting Systems
The input weights we consider in this paper are obtained from different quartet reconstruction methods: from SAQ, Erik+2 and ASAQ, from methods that use the paralinear distance (PL and 4P) and from maximum likelihood (ML). For any of these methods we describe here the weighting score used. To this end, let f be the vector of relative frequencies obtained from a DNA alignment of four taxa.
Weights for methods based on the paralinear distance, PL and 4P. Consider two (ordered) nucleotide sequences S x and S y of the same length corresponding to two taxa x and y, respectively. Let J be the underlying joint probability matrix of S x and S y , this is, the entry (i, j) of J is the probability (either theoretical or estimated by relative frequencies) of observing nucleotides i and j at the same position of S x and S y (so that the sum of entries in J is one). If det J = 0 and all nucleotides are observed in S x and S y , then the paralinear distance (Lake 1994) between x and y is where D x and D y are the diagonal matrices whose diagonal entries are given by J 1 and 1 t J , respectively; if det J = 0, we take d x,y as infinity. Based on the work by Lake (1994) and Ranwez and Gascuel (2001), given the vector f of relative frequencies from a quartet alignment, in Section Materials and Methods of the main document, we defined which represents twice the length of the interior edge of the quartet (see Theorem A.3(b) below). We define p 13|24 ( f ) and p 14|23 ( f ) similarly.
In Mihaescu et al. (2009), the authors propose a slightly different way of weighting the quartets using the paralinear distance: ( 13|24 ( f ) and 14|23 ( f ) are defined analogously). These are the measures that we use for what we call the 4P method (standing for "four-point condition"). In the same paper quoted above, the authors establish the (theoretical) equivalence between NJ and a consistent quartet-based method that uses these weights. Observe also that these weights are equivalent to p 12|34 ( f ) for theoretical data.
In order to assign a non-negative score to each quartet, we define the normalized weights for PL and 4P as

Weights for maximum likelihood.
We weighted quartet trees for the ML method using the likelihoods associated to the three trees T 1 , T 2 , T 3 as done by Ranwez and Gascuel (2001): if l T i ( f ) is the likelihood for the tree T i (with the MLE parameters), then the normalized weights for the three quartets are defined as

Weights for Erik+2.
Given a bipartition i j|kl and a distribution f ∈ R 256 , denote by Flat i j|kl ( f ) the flattening of f according to that bipartition, that is, Flat i j|kl ( f ) is the 16 × 16 matrix whose (x i x j , x k x l )-entry is the coordinate of f that matches x i x j x k x l in the convenient order. Then, Flat i j|kl ( f ) has rank ≤ 4, while the flattenings corresponding to the other two bipartitions have rank 16 if the parameters are generic (Allman and Rhodes 2007) If δ 4 (·) stands for the distance to the space of matrices with rank ≤ 4, the method Erik+2 computes for each quartet tree T the value where Flat r T and Flat c T stand for the (normalized) matrices obtained by dividing the rows and columns of the flattening matrix Flat T by the sum of its entries, respectively (as a variation we can consider the distance δ 4m (·) to rank 4m matrices when we deal with mixed data with m ∈ {1, 2, 3} categories). The right quartet tree should be the one that gives the minimal value; this is the output of the method.
In order to have a normalized scoring system that allows us to compare and represent the output of Erik+2 for each quartet tree T , we consider the inverse of the value e T ( f ) and then divide but the sum of these scores for all three quartet trees:

Weights for SAQ.
The method SAQ suggested by Casanellas et al. (2021b) associates the following score to the tree T = 12|34: where the sum runs over the sixteen 12|34 leaf-transformed distributionsf i and psd(M) is the closest symmetric and positive semi-definite matrix to the matrix M. If T is any of the other two possible quartet trees, s T ( f ) is computed analogously by permuting the roles of the leaves accordingly. SAQ outputs the normalized three scores, that is, if s :

A.2 The Paralinear Distance and Theoretical Foundations of ASAQ
The following lemma generalizes the nonnegativity and the additivity properties of this dissimilarity map to general Markov matrices, extending the results of Lake (1994) to general Markov matrices as far as they are non-singular. Proof We have that J = D(π )M is the underlying joint probability matrix between x and y so that the sum of its entries is one. Note that π = J 1 and π t M = 1 t J , so D x = D(π ) and D y = D(M t π). Therefore,

Lemma A.1 Let π be the nucleotide distribution at x and consider a substitution process leading from x to y and ruled by a Markov matrix M. Then, the paralinear distance d x,y between x and y defined in (2) coincides with
Now, we proceed to prove (a): d x,y ≥ 0 or equivalently, that First of all, since the matrix M is positive, we deduce that We have that By multiplying this inequality with | det M| ≤ 1 (M is a Markov matrix), we obtain (5). Note that a similar argument is used in the proof of the main theorem in Steel (1994). Note that if d x,y = 0, then necessarily det M = 1. Hadamard's inequality implies that a Markov matrix M with determinant 1 is necessarily the identity matrix or a permutation matrix (Casanellas et al., 2022). Finally, the statement (b) follows easily from the expression (4).
Write for the probability simplex in the space R 256 , that is, is the set of all possible distribution vectors of patterns of nucleotides p = ( p x 1 x 2 x 3 x 4 ) x 1 ,x 2 ,x 3 ,x 4 ∈{A,C,G,T } . By Lemma 8 of Buneman (1971) applied to the paralinear distance, we get the following result (which can be also deduced almost directly from the definition (1) of the paper):

Lemma A.2 Assume that the determinant of every double marginalization of p ∈ is non-zero (so that all values d i, j , i, j ∈ [4] can be computed). If d i, j ≥ 0 for all i, j ∈ [4], then p A|B ( p) is strictly positive for at most one bipartition A|B.
Before proving the main theorem, we recall some definitions that will be needed. Following Sumner et al. (2008), given a vector f = ( f x 1 ,x 2 ,x 3 ,x 4 ) ∈ R 256 and 4 × 4 Markov matrices X i , i = 1, . . . , 4, the Markov action (X 1 , . . . , X 4 ) * f is the new vector g = (g y 1 ,...,y 4 ) defined by y 4 , x 4 ).
In case f = p is the distribution vector arising from some tree T with certain transition matrices {M i } i , the new vector g corresponds to the distribution obtained by multiplying the original matrices at the pendant edges with the new matrices: According to Sumner et al. (2017), a quartet inference method that assigns a triplet of weights w(F) to each array F ∈ R 256 satisfies the strong property II if w(X * F) = w(F) + (λ(X ), λ(X ), λ(X )) for some additive function λ on Markov matrices X = (X 1 , . . . , X 4 ). Essentially, the strong property II guarantees that if the probability parameters of the pattern distribution are affected by a linear operator, the effect on the output is that of adding the same quantity to each topology weight. Moreover, this quantity is consistent with the composition of operators.
Given an array F, for each pair a, b ∈ [4], a < b, we define matrices N ab by taking the (i, j)-entry as the double marginalization of F on the components different from a and b, where rows (respectively columns) are labeled by the states at a (resp. b). If a > b, then define N ab = N t ba . For example, (N 13 ) i, j = F i+ j+ and (N 43 ) i, j = F ++ ji .
Theorem A.3 Let f be the vector of relative frequencies of site patterns observed in an alignment of length N which has been generated according to a multinomial distribution with measure p (that is, f is the vector of relative frequencies that arises as N independent samples from p). Then we have a) is a quartet inference method that satisfies the strong property II with λ = 0. Moreover, if p arises from a Markov process on the tree T = 12|34 of Fig. 8 with positive distribution π at the root and invertible transition matrices, then the limit of the expectation of PL( f ) as N goes to infinity is b) if p arises from a Markov process as in a), then p 12|34 ( p) = 2d u,v ≥ 0 and p C|D ( p) = −p 12|34 ( p) for any other bipartition C|D = 12|34; c) the paralinear method (which associates to a frequency vector f the tree T A|B with largest p A|B ( f )) is statistically consistent for the general Markov model; d) ASAQ is statistically consistent for the general Markov model.
Proof (a) The same proof of (Sumner et al. 2017,Theorem 3.1) applied to the paralinear distance gives that PL( f ) satisfies the strong property II with λ = 0 in our case. Indeed, it is straightforward to see that p A|B ( f ) is invariant by the Markov action: p A|B ((X 1 , . . . , X 4 ) * f ) = p A|B ( f ) for any 4 × 4 Markov matrices X i and any bipartition A|B. This can be immediately seen by observing that for any array F The claim about expectation follows by Taylor expansion of p A|B ( f ) around the expectation of f (as E( f ) = p componentwise). (b) The first claim is a consequence of the additivity of the paralinear distance, see Lemma A.1. Indeed, under the assumption that p arises from the tree T = 12|34, every quantity d i, j (i, j ∈ [4]) can be written as the sum of the paralinear distances attached to the edges of T between i and j. For example, d 1,3 = d 1,u + d u,v + d v,3 . Then the value p A|B ( p) results in 2d u,v , which is nonnegative in virtue of (a) of Lemma A.1. Now, if C|D is a bipartition other than 12|34, then from (1) of the main document we immediately get p C|D ( p) = −p 12|34 ( p). (c) Denote δ = 2d u,v . Then, following (a) we have that where the last equality follows from (b). Therefore, for any > 0 there exist N 1 , N 2 , N 3 such that for any Then, for any < δ we have Then we can conclude that the method that chooses the tree T with largest p T ( f ) chooses T 12|34 with probability tending to 1 when N tends to infinite. (d) The statistical consistency of ASAQ follows from the statistical consistency of Erik+2 (Fernández-Sánchez and Casanellas 2016), SAQ (Casanellas et al. 2021b) and the paralinear method obtained in c). Indeed, if f is obtained from N samples of a multinomial distribution with measure p that has arisen from T = T 12|34 with non-singular parameters (positive distribution at the root and invertible transition matrices), then the probability that both methods (when applied to f ) select T 12|34 tends to 1 when N tends to infinite.
Note that under the assumption that M 5 is diagonally largest in column (DLC for short, see Chang (1996)), we deduce that if p A|B ( p) = 0, then M 5 is necessarily the identity matrix.

Remark A.4
In order to avoid numerical problems, we only compute p A|B ( f ) if the condition number of the double marginal matrices (joint distributions at two leaves) involved in its computation are less than a certain tolerance. This is implemented using a parameter "threshold" set to 5000, but it can be modified by the user and can be adapted to the alignment length. If p A|B ( f ) cannot be computed, then this is a sign of short sample size and ASAQ outputs the topology and the weighting system given by SAQ.

B Tree figures
See Figs. 9, 10, 11 and 12.  Figure 13 represents the results of ASAQ and PL in recovering the correct quartet on data generated under the GTR model on the tree space described in Simulated data for quartet reconstruction in the Section on Materials and Methods of the main document. Figure 14 shows the ternary plots corresponding to the ASAQ method applied to GTR data on trees with branches of random length.

C.3 Performance of Q-Methods with Different Systems of Weights
Figures 15 and 16 represent the average Robinson-Foulds distance for GM data on CC and D D trees of Q-methods with ASAQ, Erik+2, PL and 4P weights. Similarly, Fig. 17 shows the average RF distance of Q-methods applied to GTR data on D D trees.
The results obtained by quartet puzzling, weight optimization and Willson methods applied to the input weights of the different methods on GTR data from the DD trees are shown in Fig. 17 (lengths 600 bp and 5 000 bp) and summarized in Table 6 (also including the results for 10 000 bp).

Fig. 13
Performance of ASAQ (left) and PL (right) in the tree space of Fig. 9b on alignments of length 500 bp (top), 1 000 bp (middle) and 10 000 bp (bottom) generated under the GTR model. Black is used to represent 100% of successful quartet reconstruction, white to represent 0%, and different tones of gray the intermediate frequencies.
The 95% contour line is drawn in white, whereas the 33% contour line is drawn in black It is remarkable that in these simulations, it is enough to consider alignments of length 5 000 bp to obtain an almost perfect reconstruction of the original tree using the WO or Willson methods. It is also remarkable that for this particular topology the results described in Unmixed data in the Section Results, where the general Markov model was assumed, were not this good, obtaining distance values around 8 (see Fig. 16 ). In this case, assuming the GTR model these reconstruction methods manage to correctly infer the splits of the tree topology. Compared with these results, the performance of QP is poor, although it also improves the results obtained when applied to general Markov data. For short alignments (600 bp) PL and 4P weights obtained slightly better results overall than the other weights.
Note also the bad performance of WO when combined with ASAQ weights when applied to alignments of 600 bp and parameter b = 0.1. For short alignments, ASAQ Fig. 15 Average Robinson-Foulds distance for GM data simulated on the tree CC with alignment length 600 bp (left), 5 000 bp (center) and 10 000 bp (right). The Q-methods WO (first row), WIL (second row), QP (third row) and QFM (last row) are applied with different systems of weights, namely ASAQ, Erik+2, PL and 4P. The white diamond represent the average RF distance of the tree reconstructed using a global NJ with paralinear distance. Concrete values of these results are detailed in Table 6 tends to choose the weights of SAQ, so this poor performance is probably caused by the bad results of SAQ when applied to GTR data (see Casanellas et al. (2021b) ).
The performance of every Q-method applied to ML weights for these data was quite poor and the resulting RF distance was never smaller than 8.95 (see Table 11).

Fig. 16
Average Robinson-Foulds distance for GM data simulated on the tree DD with alignment length 600 bp (left), 5 000 bp (center) and 10 000 bp (right). The Q-methods WO (first row), WIL (second row), QP (third row) and QFM (last row) are applied with different systems of weights, namely ASAQ, SAQ, Erik+2, PL and 4P. The white diamonds represent the average RF distance of the tree reconstructed using a global NJ with paralinear distance. Specific values of these results are detailed in Table 8 C.4 Tables   Table 5 reports the average number of cases in which Erik+2 and PL results differ, and of those the proportion of times that SAQ estimates the right topology. The figures show that among the three methods tested here, SAQ is the best option in all cases, except for alignments of length 10 000 bp generated under the GTR model. Fig. 17 Average Robinson-Foulds distance for GTR data simulated on the tree DD with alignment length 600 bp (above) and 5 000 bp (below). The Q-methods WO (left), WIL (center) and QP (right) are applied with different systems of weights, namely ASAQ, SAQ, Erik+2, PL and 4P. The white diamonds represent the average RF distance of the tree reconstructed using a global NJ with paralinear distance. Specific values of these results are detailed in Table 11 Tables 6 Table 9 shows the results when the weighting system comes from ML. In case the reconstruction method failed to provide a weight configuration for some of the 12 4 quartets, a resulting tree cannot be constructed and the corresponding alignment has been neglected. The number in parentheses represents the number of consensus trees that we have been able to reconstruct. Table 10 shows the RF distances from the original tree to the tree constructed by applying global NJ with paralinear distances and global ML consistent with the 12.12 model.

Table 11
Average Robinson-Foulds distance for tree DD on GTR data. Different combinations of Q-methods and weights (specified in the first two columns) have been applied to alignments with length 600 bp, 5 000 bp and 10 0000 bp evolving with different branch lengths. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees 12.45 (99) 11.98 (98) - (0) 11.88 (64) 11.81 (86) 12.04 (97) 11.88 (33) 11.08 (71) 11.51 (85) 11.69 (96) 11.96 (47) WIL ASAQ   (97) 9.82 (33) 9.1 (71) 9.11 (85) 9.32 (96) 9.38 (47) Global NJ  Table 12 Average Robinson-Foulds distance on mixture data for p = 0.25. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees. Figure

Table 13
Average Robinson-Foulds distance on mixture data for p = 0.50. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees. Figure

Table 14
Average Robinson-Foulds distance on mixture data for p = 0.75. The numbers in parentheses represent the number of consensus trees that we have been able to reconstruct (as ML may not converge for some quartets and in this case we cannot reconstruct the tree); if it is missing, then we have been able to reconstruct the 100 simulated trees. Figure