Skip to main content

De Novo DNA Assembly with a Genetic Algorithm Finds Accurate Genomes Even with Suboptimal Fitness

  • Conference paper
  • First Online:
Applications of Evolutionary Computation (EvoApplications 2017)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 10199))

Included in the following conference series:

Abstract

We design an evolutionary heuristic for the combinatorial problem of de-novo DNA assembly with short, overlapping, accurately sequenced single DNA reads of uniform length, from both strands of a genome without long repeated sequences. The representation of a candidate solution is a novel segmented permutation: an ordering of DNA reads into contigs, and of contigs into a DNA scaffold. Mutation and crossover operators work at the contig level. The fitness function minimizes the total length of scaffold (i.e., the sum of the length of the overlapped contigs) and the number of contigs on the scaffold. We evaluate the algorithm with read libraries uniformly sampled from genomes 3835 to 48502 base pairs long, with genome coverage between 5 and 7, and verify the biological accuracy of the scaffolds obtained by comparing them against reference genomes. We find the correct genome as a contig string on the DNA scaffold in over 95% of all assembly runs. For the smaller read sets, the scaffold obtained consists of only the correct contig; for the larger read libraries, the fitness of the solution is suboptimal, with chaff contigs present; however, a simple post-processing step can realign the chaff onto the correct genome. The results support the idea that this heuristic can be used for consensus building in de-novo assembly.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Institutional subscriptions

References

  1. Illumina: Illumina sequencing technology, October 2016. http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

  2. Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E.E., Turner, S.W., Korlach, J.: Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10(6), 563–569 (2013)

    Article  Google Scholar 

  3. Ip, C., Loose, M., Tyson, J., de Cesare, M., Brown, B., Jain, M., Leggett, R., Eccles, D., Zalunin, V., Urban, J., Piazza, P., Bowden, R., Paten, B., Mwaigwisya, S., Batty, E., Simpson, J., Snutch, T., Birney, E., Buck, D., Goodwin, S., Jansen, H., O’Grady, J., Olsen, H.: MinION analysis and reference consortium: phase 1 data release and analysis. F1000Research 4, 1075 (2015)

    Google Scholar 

  4. Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., Marçais, G., Pop, M., Yorke, J.A.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557–567 (2012)

    Article  Google Scholar 

  5. Bradnam, K., Fass, J., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J., Chapuis, G., Chikhi, R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2, 10 (2013)

    Article  Google Scholar 

  6. Cherukuri, Y., Janga, S.C.: Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches. BMC Genom. 17(Suppl. 7), 507 (2016)

    Article  Google Scholar 

  7. Räihä, K.J., Ukkonen, E.: The shortest common supersequence problem over binary alphabet is NP-complete. Theoret. Comput. Sci. 16(2), 187–198 (1981)

    Article  MathSciNet  MATH  Google Scholar 

  8. Parsons, R., Burks, C., Forrest, S.: Genetic algorithms for DNA sequence assembly. In: International Conference in Intelligent Systems for Molecular Biology (1993)

    Google Scholar 

  9. Poptsova, M.S., Il’icheva, I.A., Nechipurenko, D.Y., Panchenko, L.A., Khodikov, M.V., Oparina, N.Y., Polozov, R.V., Nechipurenko, Y.D., Grokhovsky, S.L.: Non-random DNA fragmentation in next-generation sequencing. Sci. Rep. 4(4532), 1697–1712 (2014)

    Google Scholar 

  10. Parsons, R.J., Forrest, S., Burks, C.: Genetic algorithms, operators, and DNA fragment assembly. Mach. Learn. 21(1–2), 11–33 (1995)

    Google Scholar 

  11. Parsons, R., Johnson, M.E.: DNA sequence assembly and genetic algorithms - new results and puzzling insights. In: Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, United Kingdom, 16–19 July 1995, pp. 277–284 (1995)

    Google Scholar 

  12. Parsons, R., Johnson, M.E.: A case study in experimental design applied to genetic algorithms with applications to DNA sequence assembly. Am. J. Math. Manag. Sci. 17(3–4), 369–396 (1997)

    Google Scholar 

  13. Alba, E., Luque, G.: A new local search algorithm for the DNA fragment assembly problem. In: Cotta, C., Hemert, J. (eds.) EvoCOP 2007. LNCS, vol. 4446, pp. 1–12. Springer, Heidelberg (2007). doi:10.1007/978-3-540-71615-0_1

    Chapter  Google Scholar 

  14. Nebro, A., Luque, G., Luna, F., Alba, E.: DNA fragment assembly using a grid-based genetic algorithm. Comput. Oper. Res. 35(9), 2776–2790 (2008). Part Special Issue: Bio-inspired Methods in Combinatorial Optimization

    Article  MATH  Google Scholar 

  15. Dorronsoro, B., Alba, E., Luque, G., Bouvry, P.: A self-adaptive cellular memetic algorithm for the DNA fragment assembly problem. In: 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 2651–2658, June 2008

    Google Scholar 

  16. Firoz, J.S., Rahman, M.S., Saha, T.K.: Bee algorithms for solving DNA fragment assembly problem with noisy and noiseless data. In: Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, GECCO 2012, pp. 201–208. ACM, New York (2012)

    Google Scholar 

  17. Mallén-Fullerton, G.M., Fernández-Anaya, G.: DNA fragment assembly using optimization. In: 2013 IEEE Congress on Evolutionary Computation, pp. 1570–1577, June 2013

    Google Scholar 

  18. NCBI: Human MHC class III region DNA with fibronectin type-III repeats (2016). https://www.ncbi.nlm.nih.gov/nuccore/X60189

  19. NCBI: Human apolipoprotein B-100 mRNA, complete cds (2016). https://www.ncbi.nlm.nih.gov/nuccore/M15421

  20. NCBI: Enterobacteria phage lambda, complete genome (2016). https://www.ncbi.nlm.nih.gov/nuccore/J02459

  21. Garret, A.L.: Inspyred: a framework for creating bio-inspired computational intelligence algorithms in Python (2017). https://pypi.python.org/pypi/inspyred

  22. Warren, R.L., Sutton, G.G., Jones, S.J.M., Holt, R.A.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500–501 (2007)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Doina Bucur .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Bucur, D. (2017). De Novo DNA Assembly with a Genetic Algorithm Finds Accurate Genomes Even with Suboptimal Fitness. In: Squillero, G., Sim, K. (eds) Applications of Evolutionary Computation. EvoApplications 2017. Lecture Notes in Computer Science(), vol 10199. Springer, Cham. https://doi.org/10.1007/978-3-319-55849-3_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-55849-3_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55848-6

  • Online ISBN: 978-3-319-55849-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics