On the Accuracy of Short Read Mapping

  • Peter Menzel
  • Jes Frellsen
  • Mireya Plass
  • Simon H. Rasmussen
  • Anders Krogh
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1038)

Abstract

The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. A single experiment produces millions of reads. To gain knowledge from these experiments, the first step is to find the genomic origin of the reads, i.e., to map the reads to a reference genome. Conventional alignment tools are no longer adequate for this task, as they cannot process such large amounts of data in a reasonable time. New mapping algorithms have therefore been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and explain why mapping quality should be evaluated in a probabilistic manner. We conclude by discussing the potential for improving the performance of current methods by incorporating these quality scores into a probabilistic mapping program.
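To make the statistical quantities named in the abstract concrete, the following is a minimal sketch, not the chapter's own derivation. It assumes a deliberately simple null model (independent, uniformly distributed bases, exact matches only, both strands searched) and the standard Phred definition of a quality score. The function names are hypothetical and chosen for illustration only.

```python
import math


def expected_random_matches(read_length, genome_size, both_strands=True):
    """Expected number of exact random matches (an E-value).

    Assumption: every base is drawn independently and uniformly from
    {A, C, G, T}, so one position matches with probability 0.25 ** read_length.
    """
    positions = genome_size - read_length + 1  # possible start positions
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** read_length


def prob_random_match(read_length, genome_size, both_strands=True):
    """Probability of at least one random match.

    Assumption: the number of random matches is approximately Poisson
    with mean E, so P(at least one) = 1 - exp(-E).
    """
    e_value = expected_random_matches(read_length, genome_size, both_strands)
    return -math.expm1(-e_value)  # numerically stable 1 - exp(-E)


def phred_to_error_prob(quality):
    """Convert a Phred quality score Q to an error probability 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


if __name__ == "__main__":
    genome_size = 3_000_000_000  # rough size of a human-scale genome
    for length in (12, 16, 20, 25, 30):
        e_value = expected_random_matches(length, genome_size)
        p_match = prob_random_match(length, genome_size)
        print(f"L={length:2d}  E={e_value:.3e}  P(random match)={p_match:.3e}")
```

Under these assumptions, a 16-base read in a genome of 3 billion bases has roughly 1.4 expected random exact matches, and the expectation drops by a factor of four with every additional base. Real genomes are repetitive and real mappers allow mismatches, so actual random-match rates are expected to be higher than this idealized model suggests.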

Key words

Mapping, Short reads, High-throughput sequencing

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Peter Menzel (1)
  • Jes Frellsen (1)
  • Mireya Plass (1)
  • Simon H. Rasmussen (1)
  • Anders Krogh (1)

  1. Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
