Efficiently Solvable Perfect Phylogeny Problems on Binary and k-State Data with Missing Values
The perfect phylogeny problem is of central importance to both evolutionary biology and population genetics. Missing values are a common occurrence in both sequence and genotype data. In their presence, the problem of finding a perfect phylogeny is NP-hard, even for binary characters. We extend the utility of the perfect phylogeny by introducing new efficient algorithms for broad classes of binary and multi-state data with missing values.
Specifically, we address the rich data hypothesis introduced by Halperin and Karp for the binary perfect phylogeny problem with missing data. We give an efficient algorithm for enumerating phylogenies compatible with characters satisfying the rich data hypothesis. This algorithm is useful for computing the probability of data with missing values under the coalescent model.
In addition, we use the partition intersection (PI) graph and chordal graph theory to generalize the rich data hypothesis to multi-state characters with missing values. For a bounded number of states, k, we provide a fixed-parameter tractable algorithm for the k-state perfect phylogeny problem with missing data. Our approach reduces missing data problems to problems on complete data. Finally, we characterize a commonly observed condition, an m-clique in the PI graph, under which a perfect phylogeny can be found efficiently for binary characters with missing values. We evaluate our results with extensive empirical analysis using two biologically motivated generative models of character data.
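To make the PI graph concrete, the following is a minimal sketch of how it can be built from a taxa-by-characters matrix. It is not the paper's implementation. It assumes the usual definition: each vertex is a (character, state) pair, and two vertices are adjacent when at least one taxon exhibits both states. It also assumes that a missing entry contributes no vertex and no edge. The function name `partition_intersection_graph` and the use of `None` as the missing-value marker are illustrative choices, not taken from the source.

```python
from itertools import combinations

MISSING = None  # assumed marker for a missing value (illustrative choice)


def partition_intersection_graph(matrix):
    """Build the PI graph of a taxa-by-characters matrix.

    Each row is one taxon and each column is one character.
    Vertices are (character index, state) pairs. Two vertices are
    adjacent when at least one taxon exhibits both states.
    Missing entries are skipped, so they add no vertex and no edge.
    """
    vertices = set()
    edges = set()
    for row in matrix:
        # States actually observed for this taxon.
        observed = [(c, s) for c, s in enumerate(row) if s is not MISSING]
        vertices.update(observed)
        # Every pair of states seen together in one taxon is an edge.
        for u, v in combinations(observed, 2):
            edges.add(frozenset((u, v)))
    return vertices, edges


# Small example: three taxa, three characters, two missing entries.
example = [
    [0, 1, MISSING],
    [1, MISSING, 2],
    [0, 1, 2],
]
pi_vertices, pi_edges = partition_intersection_graph(example)
```

Because each taxon has at most one state per character, no edge ever joins two states of the same character. Chordality tests and clique searches, such as the m-clique condition mentioned above, would then operate on this vertex and edge set.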
- 11. Halperin, E., Karp, R.M.: Perfect phylogeny and haplotype assignment. In: RECOMB 2004: Proc.s of the 8th ann. Internat’l. Conf. on Comp. Mol. Bio., pp. 10–19. ACM Press, New York (2004)
- 13. Kannan, S., Warnow, T.: A fast algorithm for the computation and enumeration of perfect phylogenies when the number of character states is fixed. In: Proc. of the 6th Ann. ACM-SIAM Symp. on Disc. Alg., pp. 595–603. Society for Industrial and Applied Mathematics, Philadelphia (1995)
- 16. Li, N., Stephens, M.: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165(4), 2213–2233 (2003)
- 17. McKee, T.A., McMorris, F.R.: Topics in intersection graph theory. SIAM Monographs on Discrete Mathematics (1999)
- 24. Steel, M.: The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 91–116 (1992)
- 26. Sze, S.H., Lu, S., Chen, J.: Integrating sample-driven and pattern-driven approaches in motif finding. Algorithms in Bioinformatics, 438–449 (2004)
- 28. Warnow, T.J.: Tree compatibility and inferring evolutionary history. In: Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete algorithms, SODA 1993, pp. 382–391. Society for Industrial and Applied Mathematics, Philadelphia (1993)