Sequence Alignment, Analysis, and Bioinformatic Pipelines

Next Generation Sequencing

Abstract

As DNA sequencing becomes more affordable and the data ‘tsunami’ is practically here, it is clear that informatics and analysis are currently the rate-limiting step for scientific discoveries as well as medical actions in the genomics enterprise. In this chapter, after a brief historical overview of the genomics field, we describe alignment and variation analysis algorithms and software packages in detail. Remaining challenges and potential directions are discussed at the end of the chapter.

Acknowledgments

We thank R. Alan Harris for critical comments on the earlier version of this chapter.

Corresponding author

Correspondence to Fuli Yu.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Yu, F., Coarfa, C. (2013). Sequence Alignment, Analysis, and Bioinformatic Pipelines. In: Wong, LJ. (eds) Next Generation Sequencing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7001-4_4
