Heuristic Algorithms for the Protein Model Assignment Problem
Assigning an optimal combination of empirical amino acid substitution models (e.g., WAG, LG, MTART) to partitioned multi-gene datasets is suspected to be an NP-hard problem when branch lengths are linked across partitions. Given p partitions and the approximately 20 empirical protein models that are available, one needs to compute the log likelihood scores of all 20^p possible model-to-partition assignments to obtain the optimal assignment.
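To give a sense of how quickly this search space grows, the following minimal Python sketch counts and enumerates model-to-partition assignments. The function names and the three-model list are illustrative assumptions, not part of RAxML or of the heuristics described here.

```python
from itertools import product

# Illustrative subset of the roughly 20 empirical protein models.
MODELS = ["WAG", "LG", "MTART"]


def count_assignments(num_models: int, num_partitions: int) -> int:
    """Number of distinct model-to-partition assignments."""
    return num_models ** num_partitions


def enumerate_assignments(models, num_partitions):
    """Lazily yield every assignment as a tuple with one model per partition."""
    return product(models, repeat=num_partitions)


# With 20 models and only 10 partitions, exhaustive scoring is already impractical.
print(count_assignments(20, 10))  # 10240000000000

# With 3 models and 2 partitions there are 9 assignments, small enough to list.
for assignment in enumerate_assignments(MODELS, 2):
    print(assignment)
```

Because each assignment would require a full likelihood evaluation with branch-length optimization, enumerating all of them is only feasible for very small p.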
We first show that protein model assignment (PMA) matters for empirical datasets, in the sense that different (optimal versus suboptimal) PMAs can yield distinct final tree topologies when tree searches are conducted using RAxML.
In addition, we introduce and test several heuristics for finding near-optimal PMAs and present generally applicable techniques for reducing the execution times of these heuristics. We show that our heuristics can find PMAs with better log likelihood scores on a fixed, reasonable tree topology than the naïve approach to the PMA, which ignores the fact that branch lengths are linked across partitions. By re-analyzing a large empirical dataset, we show that phylogenies inferred under a PMA calculated by our heuristics have a different topology than trees inferred under a naïvely calculated PMA; these differences also induce distinct biological conclusions. The heuristics have been implemented and are available in a proof-of-concept version of RAxML.
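The difference between the naïve approach and an assignment that accounts for linked branch lengths can be illustrated with a toy example. The sketch below is not a real likelihood computation and not one of the heuristics from this work. It assumes a hypothetical quadratic surrogate score per partition and model, with a single shared branch-length scale t standing in for the linked branch lengths. All numbers and function names are invented for illustration.

```python
from itertools import product

# Hypothetical toy data: for each partition, each model maps to
# (peak, curvature, t_opt). The surrogate score of a partition at a shared
# branch-length scale t is: peak - curvature * (t - t_opt) ** 2.
TOY = [
    {"WAG": (-100.0, 50.0, 0.1), "LG": (-101.0, 50.0, 0.5)},
    {"WAG": (-200.0, 50.0, 0.5), "LG": (-203.0, 50.0, 0.9)},
]


def linked_score(assignment):
    """Total surrogate score when all partitions share one optimized t."""
    params = [TOY[i][model] for i, model in enumerate(assignment)]
    # For a sum of quadratics, the best shared t is the curvature-weighted
    # mean of the per-partition optima.
    t = sum(c * t0 for _, c, t0 in params) / sum(c for _, c, _ in params)
    return sum(peak - c * (t - t0) ** 2 for peak, c, t0 in params)


def naive_assignment():
    """Pick the best model per partition independently (unlinked view)."""
    return tuple(max(models, key=lambda m: models[m][0]) for models in TOY)


def best_linked_assignment():
    """Exhaustively pick the assignment with the best linked score."""
    candidates = product(*[sorted(models) for models in TOY])
    return max(candidates, key=linked_score)


naive = naive_assignment()         # ('WAG', 'WAG')
best = best_linked_assignment()    # ('LG', 'WAG')
print(naive, linked_score(naive))
print(best, linked_score(best))
```

In this toy setting, the per-partition winners prefer different values of t, so scoring them jointly under one shared t is worse than an assignment whose models agree on t. The exhaustive search used here only works because the example has four candidate assignments.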
Keywords: phylogenetic inference, maximum likelihood, model assignment, protein data