Limits to Sequencing and de novo Assembly: Classic Benchmark Sequences for Optimizing Fungal NGS Designs

  • José Fernando Muñoz
  • Elizabeth Misas
  • Juan Esteban Gallo
  • Juan Guillermo McEwen
  • Oliver Keatinge Clay
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 232)

Abstract

Planning of pipelines for next-generation sequencing (NGS) projects could be facilitated by using simple DNA sequence benchmarks, i.e., standard test sequences that could monitor or help to predict ease or difficulty of (a) short-read sequencing and (b) de novo assembly of the sequenced reads. We propose that familiar, gene-sized sequences, including but not limited to nuclear protein-coding genes, would provide feasible consensus benchmarks allowing simple visualization. We illustrate our proposal for fungi with candidates from ribosomal DNA (rDNA, used in phylogeny and identification/diagnostics), mitochondrial DNA (mtDNA), and combinatorially constructed conceptual (synthetic) DNA sequences. The exploratory analysis of such familiar candidate loci could be a step toward finding, testing and establishing familiar, biologically interpretable consensus benchmark sequences for fungal and other eukaryotic genomes.
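As an illustration of what a combinatorially constructed synthetic test sequence might look like, the sketch below builds a de Bruijn sequence over the DNA alphabet, in which every k-mer occurs exactly once. This is only one possible construction, chosen here as an assumption; the abstract does not state which constructions the chapter actually uses, and the function names `de_bruijn` and `linear_benchmark` are hypothetical.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Return a cyclic de Bruijn sequence of order k over `alphabet`.

    Uses the standard recursive construction that concatenates Lyndon
    words whose length divides k. Illustrative only; not taken from the
    chapter.
    """
    n = len(alphabet)
    a = [0] * (n * k)   # working buffer of symbol indices
    seq = []            # collected symbol indices of the output

    def db(t: int, p: int) -> None:
        if t > k:
            # Emit the current prefix only if its period divides k.
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


def linear_benchmark(k: int, alphabet: str = "ACGT") -> str:
    """Linearize the cyclic sequence by appending its first k-1 symbols,
    so that every k-mer appears exactly once in the linear string."""
    cyclic = de_bruijn(alphabet, k)
    return cyclic + cyclic[:k - 1]


if __name__ == "__main__":
    s = linear_benchmark(3)
    kmers = [s[i:i + 3] for i in range(len(s) - 2)]
    print(len(s), len(kmers), len(set(kmers)))
```

For a DNA alphabet the linear string has length 4^k + k - 1. Because no k-mer repeats, such a sequence could serve as a repeat-free reference point against which natural loci such as rDNA or mtDNA are compared, under the assumption that k is matched to the k-mer size used by the assembler.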

Keywords

  • Next generation sequencing
  • Eukaryotic genomes
  • De novo assembly
  • Benchmarking

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • José Fernando Muñoz (1, 2)
  • Elizabeth Misas (1, 2)
  • Juan Esteban Gallo (1, 3)
  • Juan Guillermo McEwen (1, 4)
  • Oliver Keatinge Clay (1, 5)

  1. Cellular and Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia
  2. Institute of Biology, Universidad de Antioquia, Medellín, Colombia
  3. Doctoral Program in Biomedical Sciences, Universidad del Rosario, Bogotá, Colombia
  4. School of Medicine, Universidad de Antioquia, Medellín, Colombia
  5. School of Medicine and Health Sciences, Universidad del Rosario, Bogotá, Colombia
