Abstract
As DNA sequencing becomes more affordable and the data ‘tsunami’ arrives, it is clear that informatics and analysis are currently the rate-limiting step for both scientific discovery and medical action in the genomics enterprise. In this chapter, after a brief historical overview of the genomics field, we describe alignment and variation analysis algorithms and software packages in detail. Remaining challenges and potential directions are discussed at the end of the chapter.
Acknowledgments
We thank R. Alan Harris for critical comments on the earlier version of this chapter.
Copyright information
© 2013 Springer Science+Business Media New York
About this chapter
Cite this chapter
Yu, F., Coarfa, C. (2013). Sequence Alignment, Analysis, and Bioinformatic Pipelines. In: Wong, LJ. (eds) Next Generation Sequencing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7001-4_4
DOI: https://doi.org/10.1007/978-1-4614-7001-4_4
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-7000-7
Online ISBN: 978-1-4614-7001-4
eBook Packages: Biomedical and Life Sciences (R0)