Analysis of High-Throughput Ancient DNA Sequencing Data

Ancient DNA

Part of the book series: Methods in Molecular Biology ((MIMB,volume 840))

Abstract

Advances in sequencing technologies have dramatically changed the field of ancient DNA (aDNA). It is now possible to generate an enormous quantity of aDNA sequence data both rapidly and inexpensively. As aDNA sequences are generally short, damaged, and present at low copy number relative to coextracted environmental DNA, high-throughput approaches offer a tremendous advantage over traditional sequencing approaches in that they enable a complete characterization of an aDNA extract. However, the particular qualities of aDNA also present specific limitations that require careful consideration in data analysis. For example, results of high-throughput analyses of aDNA libraries may include chimeric sequences, sequencing errors and artifacts, damage, and alignment ambiguities due to the short read lengths. Here, I describe typical primary data analysis workflows for high-throughput aDNA sequencing experiments, including (1) separation of individual samples in multiplex experiments; (2) removal of protocol-specific library artifacts; (3) trimming adapter sequences and merging paired-end sequencing data; (4) base quality score filtering or quality score propagation during data analysis; (5) identification of endogenous molecules from an environmental background; (6) quantification of contamination from other DNA sources; and (7) removal of clonal amplification products or the compilation of a consensus from clonal amplification products, and their exploitation for estimation of library complexity.
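As a rough illustration of steps (4) and (7) only, the sketch below filters reads by their expected number of base-calling errors and compiles a majority consensus from reads that share alignment coordinates. It is not the chapter's implementation, which this preview does not show. The function names, the threshold of one expected error, and the choice to group duplicates by chromosome, start, strand, and length are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_quality(quals, max_expected_errors=1.0):
    """Keep a read if its expected number of miscalled bases is small.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return sum(phred_to_error(q) for q in quals) <= max_expected_errors


def collapse_duplicates(reads):
    """Group putative clonal reads and call a per-position majority consensus.

    reads: iterable of (chrom, start, strand, sequence) tuples.
    Reads are treated as duplicates when chrom, start, strand, and
    length all match (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for chrom, start, strand, seq in reads:
        groups[(chrom, start, strand, len(seq))].append(seq)

    consensus = {}
    for key, seqs in groups.items():
        # zip(*seqs) yields one column of bases per read position.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus
```

For instance, three reads starting at the same position with sequences ACGT, ACGT, and ACTT would collapse to the single consensus ACGT, since the lone T in the third position is outvoted.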

Acknowledgments

I thank all current and previous members of the Department of Evolutionary Genetics at the Max Planck Institute for Evolutionary Anthropology, and particularly members of the aDNA and sequencing group, for interesting discussions and useful insights as well as for providing their sequencing data for analysis (especially Knut Finstermeier for providing the example data set). I also thank Knut Finstermeier and Beth Shapiro for critical reading and revisions. This work was supported by a grant from the Max Planck Society.

Author information

Corresponding author

Correspondence to Martin Kircher.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Kircher, M. (2012). Analysis of High-Throughput Ancient DNA Sequencing Data. In: Shapiro, B., Hofreiter, M. (eds) Ancient DNA. Methods in Molecular Biology, vol 840. Humana Press. https://doi.org/10.1007/978-1-61779-516-9_23

  • DOI: https://doi.org/10.1007/978-1-61779-516-9_23

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-61779-515-2

  • Online ISBN: 978-1-61779-516-9

  • eBook Packages: Springer Protocols
