Abstract
Haplotype inference is an important problem in computational biology because of its applications in diagnosing and treating genetic diseases such as diabetes, Alzheimer's disease, and heart defects. Several criteria exist for choosing one solution among the alternatives. Parsimony is among the most important of them, and the problem under this criterion is known as the Pure Parsimony Haplotyping (PPH) problem. Approaches to solving PPH fall into two groups: exact and non-exact. Exact approaches usually model the problem as a Mixed Integer Linear Programming (MILP) problem. These models find the optimal solution of small instances in reasonable time, but because PPH is NP-hard, they are ineffective on very large instances. Non-exact algorithms compensate for this deficiency. In this paper, we present a non-exact algorithm for large instances of the PPH problem based on the divide-and-conquer technique. The algorithm first divides the problem into small sub-problems, each of which is solved by one of the existing exact approaches, and then combines the solutions of the sub-problems by solving an MILP. The MILPs that arise in solving the sub-problems and in combining their solutions are small enough to be solved rapidly. The performance of the algorithm is evaluated on real and simulated instances and compared with two well-known methods, PHASE and WinHAP2.
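To make the parsimony criterion concrete, the following is a minimal sketch of an exhaustive PPH solver for tiny instances. It is not the algorithm proposed in the paper, and it does not use any of the MILP models mentioned above. The genotype encoding (`0` and `1` for homozygous sites, `2` for heterozygous sites) and the function names `expand` and `pure_parsimony` are assumptions made for illustration only.

```python
from itertools import combinations, product


def expand(genotype):
    """Yield every ordered haplotype pair (h1, h2) that resolves a genotype.

    Assumed encoding for this sketch: '0' and '1' mark homozygous sites,
    '2' marks a heterozygous site where the two haplotypes must differ.
    """
    het = [i for i, site in enumerate(genotype) if site == "2"]
    for bits in product("01", repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


def pure_parsimony(genotypes):
    """Return a smallest haplotype set that resolves all genotypes.

    Brute force: try candidate sets in order of increasing size, so the
    first feasible set found has minimum cardinality. The running time is
    exponential, which is why this is only usable on very small inputs.
    """
    pairs = [set(expand(g)) for g in genotypes]
    candidates = sorted({h for ps in pairs for pair in ps for h in pair})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in ps) for ps in pairs):
                return chosen
    return set()


if __name__ == "__main__":
    print(sorted(pure_parsimony(["22", "01", "10"])))
```

In this example the genotype `22` can be resolved either by the pair (`00`, `11`) or by the pair (`01`, `10`); the second choice reuses the haplotypes already required by `01` and `10`, so two haplotypes suffice. The exponential search over candidate sets illustrates why exact methods do not scale and why a divide-and-conquer heuristic is attractive for large instances.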
Availability of data and materials
The data are available upon request.
Acknowledgements
The authors are grateful to the anonymous reviewers for their helpful comments, which led to this improved version of the paper.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
All authors were involved in proposing and writing the paper. R. Feizabadi implemented the code.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable.
About this article
Cite this article
Feizabadi, R., Bagherian, M., Vaziri, H. et al. PLEACH: a new heuristic algorithm for pure parsimony haplotyping problem. J Supercomput 80, 8236–8258 (2024). https://doi.org/10.1007/s11227-023-05746-7
DOI: https://doi.org/10.1007/s11227-023-05746-7