Statistical Inconsistency of Maximum Parsimony for k-Tuple-Site Data

Galla, Michelle; Wicke, Kristina; Fischer, Mareike

doi:10.1007/s11538-018-00552-2

Published: 03 January 2019

Volume 81, pages 1173–1200, (2019)

Bulletin of Mathematical Biology


Abstract

One of the main aims of phylogenetics is to reconstruct the “Tree of Life.” In this respect, different methods and criteria are used to analyze DNA sequences of different species and to compare them in order to derive the evolutionary relationships of these species. Maximum parsimony is one such criterion for tree reconstruction, and it is the one which we will use in this paper. However, it is well known that tree reconstruction methods can lead to wrong relationship estimates. One typical problem of maximum parsimony is long branch attraction, which can lead to statistical inconsistency. In this work, we will consider a blockwise approach to alignment analysis, namely the so-called k-tuple analyses. For four taxa, it has already been shown that k-tuple-based analyses are statistically inconsistent if and only if the standard character-based (site-based) analyses are statistically inconsistent. So, in the four-taxon case, going from individual sites to k-tuples does not lead to any improvement. However, real biological analyses often consider more than only four taxa. Therefore, we analyze the case of five taxa for 2- and 3-tuple-site data and consider alphabets with two and four elements. We show that the equivalence of single-site data and k-tuple-site data then no longer holds. Even so, we can show that maximum parsimony is statistically inconsistent for k-tuple-site data and five taxa.

References

Anderson FE, Swofford DL (2004) Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol Phylogenet Evol 33(2):440–451
Bandelt HJ, Fischer M (2008) Perfectly misleading distances from ternary characters. Syst Biol 57(4):540–543. https://doi.org/10.1080/10635150802203880
Crick FH, Barnett L, Brenner S, Watts-Tobin RJ (1961) General nature of the genetic code for proteins. Nature 192(4809):1227–1232
Delport W, Scheffler K, Seoighe C (2008) Models of coding sequence evolution. Brief Bioinform 10(1):97–109. https://doi.org/10.1093/bib/bbn049
Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Biol 27(4):401. https://doi.org/10.1093/sysbio/27.4.401
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17(6):368–376. https://doi.org/10.1007/bf01734359
Fischer M, Kelk S (2016) On the maximum parsimony distance between phylogenetic trees. Ann Comb 20(1):87–113. https://doi.org/10.1007/s00026-015-0298-1
Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Biol 20(4):406. https://doi.org/10.1093/sysbio/20.4.406
Hartigan J (1973) Minimum mutation fits to a given tree. Biometrics 29(1):53–65. http://www.jstor.org/stable/2529676
He XL, Wu B, Li Q, Peng WH, Huang ZQ, Gan BC (2016) Phylogenetic relationship of two popular edible Pleurotus in China, Bailinggu (P. eryngii var. tuoliensis) and Xingbaogu (P. eryngii), determined by ITS, RPB2 and EF1α sequences. Mol Biol Rep 43(6):573–582
Jukes TH, Cantor CR (1969) Evolution of protein molecules, chapter 24. In: Munro HN (ed) Mammalian protein metabolism. Academic Press, New York, pp 21–132. https://doi.org/10.1016/B978-1-4832-3211-9.50009-7
Knoop V, Müller K (2009) Gene und Stammbäume, 2nd edn. Springer Spektrum, Heidelberg
Neyman J (1971) Molecular studies of evolution: a source of novel statistical problems. In: Gupta SS, Yackel J (eds) Statistical decision theory and related topics. Academic Press, New York, pp 1–27. https://doi.org/10.1016/B978-0-12-307550-5.50005-8
Qu XJ, Jin JJ, Chaw SM, Li DZ, Yi TS (2017) Multiple measures could alleviate long-branch attraction in phylogenomic reconstruction of Cupressoideae (Cupressaceae). Sci Rep 7:41005
Raskoti BB, Jin WT, Xiang XG, Schuiteman A, Li DZ, Li JW, Huang WC, Jin XH, Huang LQ (2016) A phylogenetic analysis of molecular and morphological characters of Herminium (Orchidaceae, Orchideae): evolutionary relationships, taxonomy, and patterns of character evolution. Cladistics 32(2):198–210. https://doi.org/10.1111/cla.12125
Sanderson M, Wojciechowski M, Hu JM, Khan TS, Brady S (2000) Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Mol Biol Evol 17(5):782–797
Sankoff D (1975) Minimal mutation trees of sequences. SIAM J Appl Math 28(1):35–42
Semple C, Steel M (2003) Phylogenetics. Oxford lecture series in mathematics and its applications. Oxford University Press, Oxford. https://books.google.de/books?id=uR8i2qetjSAC
Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol 17(6):839. https://doi.org/10.1093/oxfordjournals.molbev.a026364
Varga J, Frisvad JC, Samson R (2011) Two new aflatoxin producing species, and an overview of Aspergillus section Flavi. Stud Mycol 69:57–80
Wolfram Research, Inc (2017) Mathematica, version 10.3. Wolfram Research, Inc, Champaign


Acknowledgements

The first and second authors thank the University of Greifswald for the Bogislaw studentship and the Landesgraduiertenförderung studentship, respectively, under which this work was conducted. Moreover, we wish to thank two anonymous reviewers for very helpful suggestions on an earlier version of this manuscript.

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, Germany
Michelle Galla, Kristina Wicke & Mareike Fischer

Corresponding author

Correspondence to Mareike Fischer.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

All calculations in this manuscript were carried out with Mathematica (Wolfram Research 2017). By way of example, we demonstrate the calculations for 2-tuple-site data and two character states (corresponding to the results presented in Sect. 2.1). We first implemented the well-known Fitch algorithm (Fitch 1971), which calculates the parsimony score of a character or tuple, and the well-known Felsenstein algorithm (Felsenstein 1981), which computes the probabilities of characters and tuples on a given phylogenetic tree. We assumed tree $(T_1,\theta_{T_1})$ (cf. Fig. 5) to be the generating tree on which all characters evolved according to the i.i.d. $N_2$-model.

Based on these two algorithms, we calculated the expected parsimony score for 2-tuple-site data and two character states according to Formula (2) for all trees $T' \in \mathcal{T}$, where $\mathcal{T}$ is the set of all phylogenetic X-trees on five leaves. We collected the results in a vector $\mathtt{eps2Tuples}$ whose entries are the expected parsimony scores of the individual trees. These entries were sorted according to Table 1, i.e., the first entry of $\mathtt{eps2Tuples}$ is the expected parsimony score of tree $T_1$, and so on.

Recall that in our case the expected parsimony scores depend on two parameters, p and q (representing the edge lengths of the generating tree), where $0 \le p,q \le \frac{1}{2}$ (as we are considering two character states). To show that MP is statistically inconsistent on 2-tuple-site data, we had to find values for p and q such that the expected parsimony score of $T_1$ (i.e., the first entry of the vector $\mathtt{eps2Tuples}$) was not the minimum of all values in $\mathtt{eps2Tuples}$. Thus, we had to find values of p and q fulfilling the following constraints:

$$\begin{aligned} \mathtt{eps2Tuples}[1]&> \min [\mathtt{eps2Tuples}]&&\text{(5)} \\ 0&\le p \le \frac{1}{2}&&\text{(6)} \\ 0&\le q \le \frac{1}{2}.&&\text{(7)} \end{aligned}$$
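How a vector such as $\mathtt{eps2Tuples}$ might be assembled can be sketched in Python (not the Mathematica code used for the manuscript). The sketch rests on several assumptions that are not taken from the text: the expected parsimony score is read as the probability-weighted sum of Fitch scores over all 2-tuples, a 2-tuple is scored as a single character whose states are pairs, the $N_r$-model is taken to be the symmetric r-state model with a uniform root distribution, and the tree built by the hypothetical helper `generating_tree`, including which edges carry p and which carry q, is only a stand-in for the tree of Fig. 5. All function names are illustrative.

```python
from itertools import product


def fitch_score(tree, character):
    """Parsimony score via Fitch; `tree` is a leaf name or a pair of subtrees."""
    def visit(node):
        if isinstance(node, str):                 # leaf: singleton state set
            return {character[node]}, 0
        (a, ca), (b, cb) = visit(node[0]), visit(node[1])
        return (a & b, ca + cb) if a & b else (a | b, ca + cb + 1)
    return visit(tree)[1]


def partial_likelihoods(node, character, r):
    """Felsenstein pruning: entry s is P(leaf states below node | node has state s).

    `node` is a leaf name or a tuple of (child, change_probability) pairs.
    """
    if isinstance(node, str):
        return [1.0 if s == character[node] else 0.0 for s in range(r)]
    result = [1.0] * r
    for child, p in node:
        below = partial_likelihoods(child, character, r)
        for s in range(r):
            # Assumed N_r model: no change with prob. 1 - p,
            # each of the other r - 1 states with prob. p / (r - 1).
            result[s] *= sum(((1 - p) if s == t else p / (r - 1)) * below[t]
                             for t in range(r))
    return result


def character_probability(tree, character, r=2):
    """Probability of a single-site character under a uniform root distribution."""
    return sum(partial_likelihoods(tree, character, r)) / r


def expected_score_2tuples(generating, candidate, leaves, r=2):
    """Sum over all 2-tuples of P(tuple on generating tree) * Fitch score on candidate."""
    characters = [dict(zip(leaves, states))
                  for states in product(range(r), repeat=len(leaves))]
    probs = [character_probability(generating, c, r) for c in characters]
    total = 0.0
    for (c1, p1), (c2, p2) in product(zip(characters, probs), repeat=2):
        tuple_character = {x: (c1[x], c2[x]) for x in leaves}
        total += p1 * p2 * fitch_score(candidate, tuple_character)  # sites are i.i.d.
    return total


def generating_tree(p, q):
    """Hypothetical five-leaf tree; the placement of p and q on edges is an assumption."""
    left = (("A", p), ("B", q))
    right = (("D", p), ("E", q))
    return ((left, q), ("C", q), (right, q))


leaves = ["A", "B", "C", "D", "E"]
candidates = [(("A", "B"), ("C", ("D", "E"))),   # topology of the generating tree
              (("A", "D"), ("C", ("B", "E")))]   # one competing topology
eps = [expected_score_2tuples(generating_tree(0.4, 0.1), t, leaves) for t in candidates]
print(eps, eps[0] > min(eps))   # the last value is the truth value of constraint (5)
```

The candidate trees are rooted on an arbitrary edge, which does not affect the Fitch score. A full analysis would list all 15 binary tree shapes on five leaves rather than the two shown here.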

To find an explicit example for such values of p and q (as used, for example, in the proof of Theorem 2), we used the predefined Mathematica function $\mathtt{FindInstance[expr,vars]}$, which finds values of the variables $\mathtt{vars}$ for which the expression $\mathtt{expr}$ is true, provided such values exist. In our example, the expressions are the three Inequalities (5), (6) and (7), and the variables are p and q. So we used this function in the following way:

$$\begin{aligned}&\mathtt{FindInstance}\left[ \left\{ \mathtt{eps2Tuples}[[1]] > \mathtt{Min}[\mathtt{eps2Tuples}],\ 0 \le p \le \frac{1}{2}, \right. \right. \\&\quad \left. \left. 0 \le q\le \frac{1}{2}\right\} ,\{p,q\}\right] . \end{aligned}$$

The result is an explicit pair of values for p and q for which MP is statistically inconsistent (in our example, i.e., for $k=2$ and $r=2$, this yielded the values $p=\frac{91}{256} \approx 0.35547$ and $q=0.1$, as already shown in the proof of Theorem 2).
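Outside Mathematica, a rough numerical stand-in for this search is a scan over a grid of exact rational points. The Python sketch below is only an illustration: `find_instance` and `toy_eps` are hypothetical names, and `toy_eps` is a placeholder score vector that is not derived from the manuscript. Unlike $\mathtt{FindInstance}$, which works symbolically, a grid scan can miss a thin feasible region.

```python
from fractions import Fraction


def find_instance(eps, steps=64):
    """Return the first grid point (p, q) in [0, 1/2]^2 at which the first entry
    of eps(p, q) is not minimal (constraint (5)), or None if no grid point qualifies."""
    for i in range(steps + 1):
        for j in range(steps + 1):
            p, q = Fraction(i, 2 * steps), Fraction(j, 2 * steps)
            scores = eps(p, q)
            if scores[0] > min(scores):
                return p, q
    return None


def toy_eps(p, q):
    """Placeholder score vector, NOT from the paper: entry 0 loses once p > 1/4 + q."""
    return [1 + p, Fraction(5, 4) + q]


print(find_instance(toy_eps))
```

In practice `eps` would be a function that returns the expected parsimony scores of all candidate trees for the given p and q.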

However, we wanted to find not only one explicit example of p and q, but the set of all values for p and q such that MP is statistically inconsistent on 2-tuple-site data. To plot all such combinations of p and q, we used the Mathematica function $\mathtt{RegionPlot[pred,\{x, x_{min}, x_{max}\},\{y, y_{min}, y_{max}\}]}$, which shows the region where the predicate $\mathtt{pred}$ is true. In our example, the predicate was Inequality (5) and the two plot variables were our parameters p and q with $p_{min}=q_{min}=0$ and $p_{max}=q_{max}=\frac{1}{2}$ as in Inequalities (6) and (7). Thus, we used this function as follows:

$$\begin{aligned} \mathtt{RegionPlot}\left[ \mathtt{eps2Tuples}[[1]] > \mathtt{Min}[\mathtt{eps2Tuples}],\left\{ q, 0, \frac{1}{2}\right\} , \left\{ p, 0, \frac{1}{2}\right\} \right] . \end{aligned}$$

The results are shown in Fig. 8. In this figure, the regions where MP is statistically inconsistent and where it is consistent on 2-tuple-site data are separated by a curve. With the function $\mathtt{Reduce}$ and the same input as for the function $\mathtt{FindInstance}$, we obtained the set of all values which fulfill Inequalities (5), (6) and (7). The result of this function is a very complicated expression, which is why we skip the technical details here. Basically, the problem is that the corresponding curve is not as smooth as it appears at first glance in Fig. 8, because inconsistency is not caused by the same tree everywhere. For instance, when $p=\frac{91}{256} \approx 0.35547$ and $q=0.1$, tree $T_3$ has a lower expected parsimony score than $T_1$, but when $p=\frac{8187}{16384} \approx 0.49969$ and $q=\frac{1967}{4096} \approx 0.48022$, this is not the case. Instead, $T_5$ has a lower expected parsimony score than $T_1$ there.
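One way to see which tree is responsible for inconsistency at a given point is to record, for every grid point, the index of a tree with minimal expected score. The Python sketch below illustrates this under stated assumptions: `classify_region` and `toy_eps3` are hypothetical names, and `toy_eps3` is a placeholder score vector that is not derived from the manuscript. Index 0 plays the role of $T_1$, and ties are resolved in favour of the lowest index, which matches the strict inequality in (5).

```python
def classify_region(eps, steps=8):
    """Map each grid point (p, q) in [0, 1/2]^2 to the index of a minimal entry of eps(p, q)."""
    labels = {}
    for i in range(steps + 1):
        for j in range(steps + 1):
            p, q = i / (2 * steps), j / (2 * steps)
            scores = eps(p, q)
            # min() returns the first minimal index, so index 0 wins ties.
            labels[(p, q)] = min(range(len(scores)), key=scores.__getitem__)
    return labels


def toy_eps3(p, q):
    """Placeholder score vector, NOT from the paper, with three competing entries."""
    return [1.0 + p, 1.25 + q, 1.4]


labels = classify_region(toy_eps3)
print(sorted(set(labels.values())))   # indices of entries that are minimal somewhere on the grid
```

Points labelled with different non-zero indices lie on different pieces of the boundary, which is one reason why a closed-form description of the whole region can become unwieldy.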

To summarize, by implementing algorithms for the calculation of parsimony scores and probabilities of characters and tuples on a phylogenetic tree, as well as by using the three predefined Mathematica functions $\mathtt{FindInstance}$, $\mathtt{RegionPlot}$ and $\mathtt{Reduce}$, we computed all our results for 2-tuple-site data and two character states. All other results presented in this manuscript were obtained analogously.

About this article

Cite this article

Galla, M., Wicke, K. & Fischer, M. Statistical Inconsistency of Maximum Parsimony for k-Tuple-Site Data. Bull Math Biol 81, 1173–1200 (2019). https://doi.org/10.1007/s11538-018-00552-2

Received: 14 December 2017
Accepted: 05 December 2018
Published: 03 January 2019
Issue Date: 15 April 2019
DOI: https://doi.org/10.1007/s11538-018-00552-2
