Haplotype Inference by Pure Parsimony

Gusfield, Dan

doi:10.1007/3-540-44888-8_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2676)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

The next high-priority phase of human genomics will involve the development and use of a full Haplotype Map of the human genome [7]. A critical, perhaps dominating, problem in all such efforts is the inference of large-scale SNP-haplotypes from raw genotype SNP data. This is called the Haplotype Inference (HI) problem. Abstractly, input to the HI problem is a set of n strings over a ternary alphabet. A solution is a set of at most 2n strings over the binary alphabet, so that each input string can be “generated” by some pair of the binary strings in the solution. For greatest biological fidelity, a solution should be consistent with, or evaluated by, properties derived from an appropriate genetic model.
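The notion of a pair of binary strings "generating" a ternary input string can be illustrated with a short sketch. The abstract does not fix an encoding, so the one used here is an assumption: `0` and `1` mark sites where both haplotypes carry that value, and `2` marks a site where the two haplotypes differ. The function names `generates` and `explaining_pairs` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product


def generates(h1: str, h2: str, g: str) -> bool:
    """Return True if binary strings h1 and h2 generate the ternary string g.

    Assumed encoding (not stated in the abstract): '0'/'1' in g means both
    haplotypes carry that value at the site; '2' means they differ there.
    """
    if not (len(h1) == len(h2) == len(g)):
        return False
    for a, b, s in zip(h1, h2, g):
        if s == "2":
            if a == b:          # a '2' site needs two different values
                return False
        elif not (a == b == s):  # a '0'/'1' site needs both values equal to s
            return False
    return True


def explaining_pairs(g: str):
    """List every unordered pair of binary strings that generates g."""
    het = [i for i, s in enumerate(g) if s == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        # Sort inside the pair so (h1, h2) and (h2, h1) are counted once.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    print(generates("0010", "0111", "0212"))
    print(explaining_pairs("0212"))
```

Under this encoding, an input string with k sites marked `2` has 2^(k-1) generating pairs when k is at least 1, and exactly one pair otherwise, which is why the choice among candidate pairs needs a guiding model.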

A natural model that has been suggested repeatedly is called here the Pure Parsimony model, where the goal is to find a smallest set of binary strings that can generate the n input strings. The problem of finding such a smallest set is called the Pure Parsimony problem. Unfortunately, the Pure Parsimony problem is NP-hard, and no paper has previously shown how an optimal Pure Parsimony solution can be computed efficiently for problem instances of the size of current biological interest. In this paper, we show how to formulate the Pure Parsimony problem as an integer linear program; we explain how to improve the practicality of the integer programming formulation; and we present the results of extensive experimentation we have done to show the time and memory practicality of the method, and to compare its accuracy against solutions found by the widely used general haplotyping program PHASE. We also formulate and experiment with variations of the Pure Parsimony criterion that allow greater practicality. The results are that the Pure Parsimony problem can be solved efficiently in practice for a wide range of problem instances of current interest in biology. Both the time needed for a solution and the accuracy of the solution depend on the level of recombination in the input strings: the speed of the solution improves with increasing recombination, but the accuracy decreases.
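To make the Pure Parsimony objective concrete, the sketch below solves a tiny instance by exhaustive search. It is not the integer linear program described in the paper, and it is only feasible for very small inputs; it merely shows what "a smallest set of binary strings that can generate the n input strings" means. The encoding (`2` for a site where the two haplotypes differ) and the function names are assumptions made for this illustration.

```python
from itertools import combinations, product


def explaining_pairs(g: str):
    """List every unordered pair of binary strings that generates g.

    Assumed encoding: '0'/'1' = both haplotypes agree on that value,
    '2' = the two haplotypes differ at that site.
    """
    het = [i for i, s in enumerate(g) if s == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def pure_parsimony_bruteforce(genotypes):
    """Return a smallest set of haplotypes that generates every genotype.

    Exhaustive search over subsets of candidate haplotypes, in order of
    increasing size; exponential time, for illustration only.
    """
    pairs = [explaining_pairs(g) for g in genotypes]
    candidates = sorted({h for ps in pairs for p in ps for h in p})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            # Every genotype needs at least one pair fully inside `chosen`.
            if all(any(a in chosen and b in chosen for a, b in ps)
                   for ps in pairs):
                return chosen
    return set()


if __name__ == "__main__":
    print(sorted(pure_parsimony_bruteforce(["22", "20", "02"])))
```

In this example the inputs `20` and `02` each have a single generating pair, which together force the haplotypes `00`, `10`, and `01`; the input `22` is then generated by the pair (`01`, `10`) at no extra cost, so three haplotypes suffice.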

Research supported by NSF grants DBI-9723346 and EIA-0220154.

References

1. A. Clark, K. Weiss, D. Nickerson, et al. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Human Genetics, 63:595–612, 1998.
2. A. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol., 7:111–122, 1990.
3. M. Daly, J. Rioux, S. Schaffner, T. Hudson, and E. Lander. High-resolution haplotype structure in the human genome. Nature Genetics, 29:229–232, 2001.
4. P. Donnelly. Comments made in a lecture given at the DIMACS conference on Computational Methods for SNPs and Haplotype Inference, November 2002.
5. M. Fullerton, A. Clark, C. Sing, et al. Apolipoprotein E variation at the sequence haplotype level: implications for the origin and maintenance of a major human polymorphism. Am. J. Human Genetics, pages 881–900, 2000.
6. D. Gusfield. Inference of haplotypes from samples of diploid populations: complexity and algorithms. Journal of Computational Biology, 8(3), 2001.
7. L. Helmuth. Genome research: Map of the human genome 3.0. Science, 293(5530):583–585, 2001.
8. E. Hubbel. Personal communication, August 2000.
9. R. Hudson. Gene genealogies and the coalescent process. Oxford Survey of Evolutionary Biology, 7:1–44, 1990.
10. R. Hudson. Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.
11. G. Lancia, C. Pinotti, and R. Rizzi. Haplotyping populations: Complexity and approximations. Technical Report DIT-02-082, University of Trento, 2002.
12. D. Nickerson, S. Taylor, K. Weiss, A. Clark, et al. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genetics, 19:233–240, 1998.
13. NIH. Report on variation and the haplotype map: http://www.nhgri.nih.gov/About_NHGRI/Der/variat.htm.
14. T. Niu, Z. Qin, X. Xu, and J.S. Liu. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Human Genetics, 70:157–169, 2002.
15. S. Orzack, D. Gusfield, and V. Stanton. The absolute and relative accuracy of haplotype inferral methods and a consensus approach to haplotype inferral. Abstract Nr 115 in Am. Society of Human Genetics, Supplement 2001.
16. N. Patil, D. R. Cox, et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294:1719–1723, 2001.
17. M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Human Genetics, 68:978–989, 2001.
18. L. Wang et al. http://www.cs.cityu.edu.hk/lwang/hapar/.

Author information

Authors and Affiliations

Computer Science Department, University of California, Davis, Davis, CA, 95616, USA
Dan Gusfield

Editor information

Editors and Affiliations

Depto. de Ciencias de la Computación, Universidad de Chile, Blanco Encalada 2120, Santiago, 6511224, Chile
Ricardo Baeza-Yates
Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana, Edificio “B”, ciudad universitaria, Morelia Michoacán, Mexico
Edgar Chávez
Université de Marne-la-Vallée, 77454, Marne-la-Vallée Cedex 2, France
Maxime Crochemore

Cite this paper

Gusfield, D. (2003). Haplotype Inference by Pure Parsimony. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds) Combinatorial Pattern Matching. CPM 2003. Lecture Notes in Computer Science, vol 2676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44888-8_11

DOI: https://doi.org/10.1007/3-540-44888-8_11
Published: 27 May 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40311-1
Online ISBN: 978-3-540-44888-4
eBook Packages: Springer Book Archive
