CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment

Abstract

Next-generation sequencing (NGS) technologies have enabled cheap, large-scale, high-throughput production of short DNA sequence reads and have thereby driven explosive growth in data volume. However, the reads produced are short and prone to errors introduced during sequencing cycles. Both the large data volume and the sequencing errors complicate the mapping of NGS reads onto a reference genome and have motivated the development of various aligners for very short reads, typically less than 100 base pairs (bps) in length. As read length continues to increase, propelled by advances in NGS technologies, the longer reads tend to have higher sequencing error rates and more true mutations (substitutions, insertions, or deletions) relative to the reference genome. These new characteristics make aligners that are optimized for very short reads, and that support only ungapped alignments or gapped alignments with a very limited number of gaps (typically one), inefficient, and thereby call for new aligners that support fully gapped alignment. In this chapter, we present the CUSHAW software suite for NGS read alignment, which is open-source and consists of three individual aligners: CUSHAW, CUSHAW2, and CUSHAW3. The suite offers parallel and efficient alignment of NGS reads to large genomes, such as the human genome, by harnessing multi-core CPUs or compute unified device architecture (CUDA)-enabled graphics processing units (GPUs). Moreover, it can align both base-space and color-space reads and has consistently ranked among the best alignment tools in our performance evaluations.
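To illustrate why reads carrying insertions or deletions call for gapped alignment, the following minimal Python sketch compares an ungapped placement (mismatches only) with a gapped one (substitutions, insertions, and deletions) for a read that lacks a single reference base. It is a toy dynamic-programming example under stated assumptions, not the algorithm used by any CUSHAW aligner; the function names and the reference string are hypothetical and chosen only for illustration.

```python
def best_ungapped_mismatches(read: str, ref: str) -> int:
    """Fewest mismatches over all ungapped placements of read on ref."""
    n = len(read)
    return min(
        sum(a != b for a, b in zip(read, ref[i:i + n]))
        for i in range(len(ref) - n + 1)
    )


def best_gapped_edits(read: str, ref: str) -> int:
    """Fewest edits needed to align the whole read to any substring of ref
    (semi-global edit distance with unit costs)."""
    # An empty read prefix can start anywhere on the reference at no cost.
    prev = [0] * (len(ref) + 1)
    for i, a in enumerate(read, start=1):
        curr = [i] + [0] * len(ref)
        for j, b in enumerate(ref, start=1):
            curr[j] = min(
                prev[j - 1] + (a != b),  # match or substitution
                prev[j] + 1,             # extra base in the read (insertion)
                curr[j - 1] + 1,         # reference base missing from the read (deletion)
            )
        prev = curr
    # The read may end anywhere on the reference.
    return min(prev)


# Hypothetical reference; the read is a window of it with one base deleted.
ref = "GATTACAGGCTTACCGTAGC"
read = ref[2:10] + ref[11:18]

ungapped = best_ungapped_mismatches(read, ref)
gapped = best_gapped_edits(read, ref)
print(f"ungapped mismatches: {ungapped}, gapped edits: {gapped}")
```

In this example the single deletion shifts every downstream base, so the ungapped placement accumulates several mismatches while the gapped alignment accounts for the same read with one edit.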

Acknowledgements

We thank the Novocraft Technologies Company for granting a trial license of Novoalign.

Author information

Corresponding author

Correspondence to Yongchao Liu.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Liu, Y., Schmidt, B. (2017). CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_10

  • DOI: https://doi.org/10.1007/978-3-319-59826-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59824-6

  • Online ISBN: 978-3-319-59826-0

  • eBook Packages: Computer Science, Computer Science (R0)