GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality

  • Thomas D. Wu
  • Jens Reeder
  • Michael Lawrence
  • Gabe Becker
  • Matthew J. Brauer
Part of the Methods in Molecular Biology book series (MIMB, volume 1418)

Abstract

The programs GMAP and GSNAP, for aligning RNA-Seq and DNA-Seq datasets to genomes, have evolved along with advances in biological methodology to handle longer reads, larger volumes of data, and new types of biological assays. The genomic representation has been improved to include linear genomes that can compare sequences using single-instruction multiple-data (SIMD) instructions, compressed genomic hash tables with fast access using SIMD instructions, handling of large genomes with more than four billion bp, and enhanced suffix arrays (ESAs) with novel data structures for fast access. Improvements to the algorithms have included a greedy match-and-extend algorithm using suffix arrays, segment chaining using genomic hash tables, diagonalization using segmental hash tables, and nucleotide-level dynamic programming procedures that use SIMD instructions and eliminate the need for F-loop calculations. Enhancements to the functionality of the programs include standardization of indel positions, handling of ambiguous splicing, clipping and merging of overlapping paired-end reads, and alignments to circular chromosomes and alternate scaffolds. The programs have been adapted for use in pipelines by integrating their usage into R/Bioconductor packages such as gmapR and HTSeqGenie, and these pipelines have facilitated the discovery of numerous biological phenomena.

Key words

Genomic alignment, Genomic mapping, Bioinformatics algorithms, Next-generation sequencing, RNA-seq, DNA-seq, Sequence analysis, Transcriptome analysis
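
One of the enhancements named in the abstract, standardization of indel positions, can be illustrated with a small sketch. This is not the GMAP/GSNAP implementation: it assumes a leftmost-position convention for deletions in repetitive sequence, and the function names `left_shift_deletion` and `apply_deletion` are hypothetical, introduced only for this example.

```python
def apply_deletion(ref: str, pos: int, length: int) -> str:
    """Return ref with the `length` bases starting at 0-based `pos` removed."""
    return ref[:pos] + ref[pos + length:]


def left_shift_deletion(ref: str, pos: int, length: int) -> int:
    """Return the leftmost start position that yields the same sequence
    as deleting ref[pos:pos + length].

    Assumption (not taken from the chapter): the standard position is
    the leftmost equivalent one.
    """
    if length <= 0 or pos < 0 or pos + length > len(ref):
        raise ValueError("deletion must lie within the reference")
    # Sliding the deleted window one base to the left leaves the result
    # unchanged exactly when the base entering the window on the left
    # equals the base leaving it on the right.
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


if __name__ == "__main__":
    ref = "GCACACAT"
    reported = 4                      # deletion of "AC" as first reported
    standard = left_shift_deletion(ref, reported, 2)
    # Both placements describe the same edited sequence.
    assert apply_deletion(ref, reported, 2) == apply_deletion(ref, standard, 2)
    print(reported, "->", standard)
```

Within a dinucleotide repeat such as the one above, several deletion placements produce an identical read, so reporting one agreed position lets downstream tools treat those alignments as the same variant.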

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Thomas D. Wu
    • 1
  • Jens Reeder
    • 1
  • Michael Lawrence
    • 1
  • Gabe Becker
    • 1
  • Matthew J. Brauer
    • 1
  1. Genentech, South San Francisco, USA