Sequence Alignment, Analysis, and Bioinformatic Pipelines

Yu, Fuli; Coarfa, Cristian

doi:10.1007/978-1-4614-7001-4_4

Abstract

As DNA sequencing becomes more affordable and the data ‘tsunami’ is practically here, it is clear that informatics and analysis are currently the rate-limiting step for scientific discoveries as well as medical actions in the genomics enterprise. In this chapter, after a brief historical overview of the genomics field, we describe alignment and variation analysis algorithms and software packages in detail. Remaining challenges and potential directions are discussed at the end of the chapter.

Acknowledgments

We thank R. Alan Harris for critical comments on an earlier version of this chapter.

Author information

Authors and Affiliations

Department of Molecular and Human Genetics, Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, 77030, TX, USA
Fuli Yu
Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, 77030, TX, USA
Cristian Coarfa

Corresponding author

Correspondence to Fuli Yu.

Editor information

Editors and Affiliations

Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, 77030, Texas, USA
Lee-Jun C. Wong

About this chapter

Cite this chapter

Yu, F., Coarfa, C. (2013). Sequence Alignment, Analysis, and Bioinformatic Pipelines. In: Wong, LJ. (eds) Next Generation Sequencing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7001-4_4

DOI: https://doi.org/10.1007/978-1-4614-7001-4_4
Published: 07 May 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-7000-7
Online ISBN: 978-1-4614-7001-4
eBook Packages: Biomedical and Life Sciences (R0)
