
Population and Evolutionary Genetic Inferences in the Whole-Genome Era: Software Challenges

  • Alexandros Stamatakis
Chapter
Part of the Population Genomics book series (POGE)

Abstract

Continuous advances in DNA sequencing technologies are driving an ever faster accumulation of nucleotide sequence data at the whole-genome scale. As a consequence, evolutionary biology researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in the field have grown considerably in their number of features and lines of code, and consequently also in software complexity. Exploiting parallelism on multi-core and hardware accelerator architectures increases this complexity further. Moreover, typical analysis pipelines now include substantially more components than they did 5–10 years ago. A topic that has received little attention in this context is the code quality and verification of widely used data analysis software. Unfortunately, the majority of users still tend to blindly trust the software and the results it produces. To examine this issue, we assessed the software quality of three highly cited population genetics tools (Genepop, Migrate, Structure) that are routinely used in current data analysis pipelines and studies. We also review little-known problems associated with floating-point arithmetic in conjunction with parallel processing. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools and list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we discuss some general policy issues that need to be addressed to improve software quality and to ensure support for developing new software and maintaining existing software.
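
The floating-point issue mentioned above can be illustrated with a minimal sketch. It is not taken from the chapter: the function names `naive_sum` and `chunked_sum` and the input values are hypothetical and chosen only for illustration. Floating-point addition is not associative, so a sum that is split into per-worker partial sums (as in a parallel reduction) may differ from the same sum computed serially, and both may differ from the correctly rounded result.

```python
import math


def naive_sum(values):
    """Add values left to right in double precision, without compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic a parallel reduction: one partial sum per chunk, then combine.

    No threads are started; only the order of the additions changes,
    which is the part that matters for the numerical result.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)


# Hypothetical input: two large values that cancel and two small ones.
values = [1e16, 1.0, -1e16, 1.0]

serial = chunked_sum(values, 1)       # one "worker": plain left-to-right sum
two_workers = chunked_sum(values, 2)  # two "workers": two partial sums
reference = math.fsum(values)         # correctly rounded sum, for comparison

print(serial, two_workers, reference)
```

With these inputs the three results differ, because each small value is lost whenever it is added to a much larger one. In a real parallel program the chunking often depends on the number of threads or processes, so the same analysis may yield slightly different numbers on different machines.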

Keywords

Numerical stability; Parallel computing; Reproducibility; Software quality; Software verification

Notes

Acknowledgements

This work was financially supported by the Klaus Tschira Foundation.


Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  2. Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
