Multilocus phylogenetic analysis with gene tree clustering
Both theoretical and empirical evidence indicate that phylogenetic trees of different genes (loci) do not share exactly the same topology. Nonetheless, most genes have related phylogenies, which implies that they form cohesive subsets (clusters). In this work, we discuss gene tree clustering, focusing on the normalized cut (Ncut) framework as a suitable method for phylogenetics. We show that this framework is both efficient and statistically accurate when clustering gene trees using the geodesic distance between them in the Billera–Holmes–Vogtmann tree space. We also conduct a computational study of the performance of different clustering methods, with and without preprocessing, under different distance metrics, and using several dimensionality reduction techniques. Our results on simulated data show that Ncut accurately clusters the set of gene trees generated from a species tree under the coalescent process. Other observations from our computational study include the similar performance of Ncut and k-means under most dimensionality reduction schemes, the worse performance of hierarchical clustering, and the significantly better performance of the neighbor-joining method with the p-distance compared to the maximum-likelihood estimation method. Supplementary material, all code, and the data used in this work are freely available online at http://polytopes.net/research/cluster/.
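To make the clustering step concrete, the following is a minimal sketch of a two-way normalized cut computed from a pairwise distance matrix, using the standard spectral relaxation. It is an illustration under stated assumptions and not the implementation used in this work: the function name `ncut_bipartition`, the Gaussian affinity, and the median-based bandwidth `sigma` are choices made here for the example, and the toy distances are absolute differences between points on a line, standing in for geodesic distances between gene trees.

```python
import numpy as np


def ncut_bipartition(dist, sigma=None):
    """Split items into two clusters with a relaxed normalized cut.

    dist  : symmetric (n, n) array of pairwise distances.
    sigma : bandwidth of the Gaussian affinity (assumption of this sketch);
            defaults to the median of the positive distances.
    """
    dist = np.asarray(dist, dtype=float)
    if sigma is None:
        sigma = np.median(dist[dist > 0])

    # Turn distances into affinities; no self-loops.
    w = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    l_sym = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue gives the relaxed Ncut indicator.
    _, eigvecs = np.linalg.eigh(l_sym)
    indicator = d_inv_sqrt * eigvecs[:, 1]

    # Threshold the relaxed indicator at zero to obtain two clusters.
    return (indicator > 0).astype(int)


# Toy stand-in data: two well-separated groups of points on a line.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(points[:, None] - points[None, :])
labels = ncut_bipartition(dist)
```

With real data, `dist` would be replaced by a matrix of tree-to-tree distances computed by a dedicated tool. Which cluster receives label 0 or 1 is arbitrary, because the sign of an eigenvector is not determined.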
Keywords: Phylogenetics, Normalized cut, Clustering
The authors would like to thank the editor and the anonymous referees for their useful comments for improving the manuscript.
Funding: K. F. and R. Y. were supported by JSPS KAKENHI 26540016. C. V. would also like to acknowledge support from ND EPSCoR NSF #1355466.