CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment

Liu, Yongchao; Schmidt, Bertil

doi:10.1007/978-3-319-59826-0_10

Abstract

Next-generation sequencing (NGS) technologies have enabled cheap, large-scale, high-throughput production of short DNA sequence reads, driving explosive growth in data volume. However, the reads are short and prone to errors introduced during sequencing cycles. Large data volumes and sequencing errors together complicate the mapping of NGS reads onto a reference genome and have motivated the development of various aligners for very short reads, typically less than 100 base pairs (bps) in length. As read lengths continue to increase with advances in NGS technologies, longer reads tend to have higher sequencing error rates and to contain more true mutations (substitutions, insertions, or deletions) relative to the reference genome. These characteristics render inefficient those aligners that are optimized for very short reads and support only ungapped alignments or gapped alignments with a very limited number of gaps (typically one), and they call for new aligners that support fully gapped alignment. In this chapter, we present the CUSHAW software suite for NGS read alignment, which is open-source and consists of three individual aligners: CUSHAW, CUSHAW2, and CUSHAW3. The suite offers parallel and efficient NGS read alignment to large genomes, such as the human genome, by harnessing multi-core CPUs or compute unified device architecture (CUDA)-enabled graphics processing units (GPUs). Moreover, it can align both base-space and color-space reads and is consistently shown to be one of the best alignment tools in our performance evaluations.

Acknowledgements

We thank the Novocraft Technologies Company for granting a trial license of Novoalign.

Author information

Authors and Affiliations

School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Yongchao Liu
Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany
Bertil Schmidt

Corresponding author

Correspondence to Yongchao Liu.

Editor information

Editors and Affiliations

LaTICE, Tunis, Tunisia
Mourad Elloumi

About this chapter

Cite this chapter

Liu, Y., Schmidt, B. (2017). CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment. In: Elloumi, M. (ed.) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_10

DOI: https://doi.org/10.1007/978-3-319-59826-0_10
Published: 19 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer Science, Computer Science (R0)
