GPU-accelerated sequence alignment with traceback for GATK HaplotypeCaller
Abstract
Background
Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU-accelerated implementations mainly focus on calculating the optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semi-global pairwise sequence alignment with traceback has so far been difficult to accelerate effectively on GPUs.
Results
We first analyze the characteristics of the semi-global alignment with traceback in GATK HC and then propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs. For the first stage, we choose the intra-task parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in the first stage, our GPU implementation also records the length of consecutive matches/mismatches, in addition to the lengths of consecutive insertions and deletions as in the CPU-based implementation. This helps to efficiently retrieve the backtracking matrix and obtain the optimal alignment in the second stage.
Conclusions
Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart on synthetic and real datasets, respectively. When integrated into GATK HC (alongside a GPU-accelerated pair-HMMs forward kernel), the overall acceleration is 2.3x over the baseline GATK HC implementation, and 1.34x over the GATK HC implementation with only the integrated GPU-based pair-HMMs forward algorithm. Although the methods proposed in this paper aim to improve the performance of GATK HC, they can also be used in other pairwise alignments and applications.
Keywords
Semi-global alignment with traceback; Optimal alignment; GATK HaplotypeCaller (HC); GPUs
Abbreviations
HC: HaplotypeCaller
NGS: Next Generation Sequencing
Background
NGS (Next Generation Sequencing) platforms offer the capacity to generate large amounts of DNA sequencing data in a short time and at a low cost. However, analyzing these enormous amounts of DNA sequencing data remains a computational challenge. Researchers have proposed many methods to improve the performance of DNA sequencing data analysis tools and applications. One method is to execute these tools and applications on high-performance computing architectures, such as supercomputers, clusters and even cloud environments. Another method is to use accelerators, such as GPUs and FPGAs, to speed up the time-consuming kernels of these tools and applications.
GATK HaplotypeCaller (HC) is a popular variant caller, which identifies the differences (variants) between a sample DNA sequence and the reference sequence. Although GATK HC identifies variants more accurately than many other variant callers, its feasibility is limited by the long execution time needed for the analysis, which has proven difficult to optimize. This has driven researchers to improve its performance. Intel and IBM researchers both employ vector instructions to optimize the pair-HMMs forward algorithm [1, 2], the most time-consuming part of GATK HC, to reduce the total execution time. Ren et al. [3, 4] use GPUs to accelerate the pair-HMMs forward algorithm in GATK HC, achieving a 1.71x speedup in single thread mode. After accelerating the pair-HMMs forward algorithm on GPUs, profiling of GATK HC shows that the semi-global pairwise sequence alignment accounts for around 34.5% of the overall execution time, making it the most time-consuming kernel in the application. This provides an opportunity to further improve the performance of GATK HC using GPU acceleration.
Pairwise sequence alignment, which includes global alignment, semi-global alignment and local alignment, is a commonly used technique in many biological tools and applications. The global alignment and the semi-global alignment are calculated by the Needleman-Wunsch algorithm and the modified Needleman-Wunsch algorithm, respectively, while the local alignment is calculated by the Smith-Waterman algorithm. Although the three algorithms differ in some details, they share a similar framework consisting of two stages: (1) a dynamic programming kernel to calculate the score matrices and find the optimal alignment score; (2) a traceback (or backtracking) kernel to find the optimal alignment.
Since the three kinds of pairwise sequence alignment (global, semi-global and local) have the same framework and differ only in details, techniques for speeding up one can be applied to the other two with minor modifications. Different kinds of high-performance platforms, especially accelerators such as FPGAs [5, 6] and GPUs [7, 8, 9, 10, 11, 12, 13, 14, 15, 16], are used to reduce their execution time.
There has been much research on reducing the execution time of the three kinds of pairwise alignment on GPUs. There are two approaches to implementing the first stage (calculating the optimal alignment score) on GPUs: the inter-task parallelization model and the intra-task parallelization model. In the former, each thread performs one alignment independently, as in [7] and [8]. In the latter, the threads of a block cooperate to perform one alignment, as in [9]. If the pairwise sequence alignment is applied to sequence database scanning, i.e., aligning a query sequence with all database sequences to assess sequence similarity, a query profile and related data storage and access techniques are employed to reduce memory accesses on GPUs, as in [10] and [11]. In [11], alignments are performed in interleaved mode in order to amortize the cost of initiating each execution pass.
However, very few researchers implement the second stage on GPUs. The existing implementations can be classified into two groups. The first group is based on backtracking matrices. Liu et al. [11] proposed to store the score matrices and backtrack through them to obtain the optimal alignment; however, the method is not described clearly. gpu-pairAlign [12] stores the alignment moves in four Boolean backtracking matrices during the first stage and retrieves these matrices instead of the score matrices. This group of implementations obtains the optimal alignment in linear time, but at the cost of quadratic space complexity. The second group is based on the Myers-Miller algorithm. MSA-CUDA [13] developed a stack-based iterative implementation of the Myers-Miller algorithm [17] to retrieve the optimal alignment in linear space. SW# [14] proposed a modified Myers-Miller algorithm. CUDAlign 2.0 [15] combined the Myers-Miller and Smith-Waterman algorithms. Moreover, after several versions of incremental optimizations, CUDAlign 4.0 [16] is able to obtain the optimal alignment of chromosome-wide sequences using multiple GPUs. However, these approaches have quadratic time complexity, making them suitable only for the pairwise alignment of very long DNA and protein sequences.

We first analyze the characteristics of the semi-global alignment in GATK HC and then propose a GPU-based implementation of the semi-global alignment with traceback based on this analysis.

During the first stage, we propose to record the length of consecutive match(es)/mismatch(es) and store the alignment moves in a special backtracking matrix.

We also propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs.

We benchmark the results and show an overall speedup of GATK HC of about 2.3x over the non-accelerated version.
Although this paper aims to improve the performance of GATK HC, the GPU-based implementation of the semi-global alignment with traceback can also be used in other applications and tools. Moreover, since there are only small differences among the global, semi-global and local alignments, the methods proposed in this paper can also be applied to the global alignment and the local alignment.
Methods
A brief overview of semi-global alignment
Pairwise sequence alignment finds the optimal alignment between two sequences, i.e., the alignment with the optimal alignment score. The modified Needleman-Wunsch algorithm with affine gap penalties, used to calculate the optimal alignment score of the semi-global alignment in GATK HC, is defined as follows.
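The recurrence equations themselves did not survive conversion; a standard affine-gap formulation consistent with the surrounding description (a reconstruction, not the paper's verbatim Eqs. 1–3) is:

```latex
% Reconstruction from the surrounding description; not the verbatim equations.
\begin{aligned}
M_{i,j} &= sbt(R1[i], R2[j]) + \max\bigl(M_{i-1,j-1},\, I_{i-1,j-1},\, D_{i-1,j-1}\bigr) \\
I_{i,j} &= \max\bigl(M_{i,j-1} - \alpha,\; I_{i,j-1} - \beta\bigr) \\
D_{i,j} &= \max\bigl(M_{i-1,j} - \alpha,\; D_{i-1,j} - \beta\bigr) \\
M_{i,0} &= M_{0,j} = 0 \qquad \text{(leading gaps are free)} \\
\text{score} &= \max\Bigl(\max_{0 \le j \le n} M_{m,j},\; \max_{0 \le i \le m} M_{i,n}\Bigr)
\end{aligned}
```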
where m and n are the lengths of R1 and R2, respectively. In these equations, M_{i,j} represents the optimal alignment score of the two subsequences R1[1]...R1[i] and R2[1]...R2[j], while I_{i,j} and D_{i,j} represent the optimal alignment score of the two subsequences with R2[j] aligned to a gap and with R1[i] aligned to a gap, respectively. The semi-global alignment uses an affine gap penalty model, in which α and β are the gap open penalty and the gap extension penalty, respectively, and sbt is the score of a match or mismatch. As shown by Eq. 1, the penalties of gaps at the start and end of the two sequences are neglected. As shown by Eq. 3, the optimal alignment score of the semi-global alignment in GATK HC is the greatest value among the elements in the last row and the last column of the matrix M.
These equations indicate that the computational complexity of the modified Needleman-Wunsch algorithm is O(mn), which makes the execution time grow quadratically with the sequence length. Usually, the algorithm is implemented using dynamic programming, filling three two-dimensional matrices. According to the equations, M_{i,j}, I_{i,j} and D_{i,j} only depend on their upper-left, upper and left neighbor elements, which implies that the elements on the same anti-diagonal can be computed in parallel. Thus, a method employed by many researchers to reduce the execution time is to exploit this inherent parallelism of the algorithm.
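To make the recurrence concrete, the sketch below fills the three matrices on the CPU in Python. The score parameters (match, mismatch, alpha, beta) are illustrative placeholders, not GATK's actual penalties.

```python
NEG = float("-inf")

def semiglobal_score(r1, r2, match=1, mismatch=-1, alpha=3, beta=1):
    """CPU reference sketch of the modified Needleman-Wunsch recurrence
    with affine gap penalties (semi-global variant)."""
    m, n = len(r1), len(r2)
    # Leading gaps are free: first row/column of M start at 0.
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sbt = match if r1[i - 1] == r2[j - 1] else mismatch
            M[i][j] = sbt + max(M[i-1][j-1], I[i-1][j-1], D[i-1][j-1])
            I[i][j] = max(M[i][j-1] - alpha, I[i][j-1] - beta)  # gap in r1
            D[i][j] = max(M[i-1][j] - alpha, D[i-1][j] - beta)  # gap in r2
    # Optimal score: best element in the last row or last column of M.
    return max(max(M[m][j] for j in range(n + 1)),
               max(M[i][n] for i in range(m + 1)))
```

Because the first row and column of M are initialized to zero and the maximum is taken over the last row and column, leading and trailing gaps are free, which is exactly what distinguishes the semi-global variant from global alignment.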
CIGAR format
CIGAR operations used in GATK HC

Operation  Description
M          Match/mismatch
I          Insertion (gap in R1)
D          Deletion (gap in R2)
S          Soft clipping (base at the beginning or the end of R2 but not in R1)
Take the alignment in Fig. 1 for example. The CIGAR representation of the alignment is “3M2D1M2I2M1S” and POS of the alignment is 1.
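The translation from per-base alignment operations to a CIGAR string is a plain run-length encoding; a minimal Python sketch (the move list is a hypothetical input, in practice produced by the traceback):

```python
from itertools import groupby

def to_cigar(moves):
    """Run-length encode per-base alignment operations ('M', 'I', 'D', 'S')
    into a CIGAR string, e.g. ['M','M','M','D','D',...] -> '3M2D...'."""
    return "".join(f"{sum(1 for _ in group)}{op}" for op, group in groupby(moves))
```

For the alignment of Fig. 1, `to_cigar(list("MMMDDMIIMMS"))` yields `"3M2D1M2I2M1S"`.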
GPU architecture
Modern GPUs are widely used to accelerate computationally intensive algorithms. A GPU consists of thousands of small cores, each capable of executing one thread at a time. On NVIDIA GPUs, threads are grouped into blocks and blocks are grouped into grids. Furthermore, consecutive threads in the same block are grouped into warps; the size of a warp is usually 32. The memory hierarchy includes registers, shared memory, global memory and caches. Each thread is assigned a set of registers. The shared memory is accessible by all threads in a block; using it, the threads of a block can exchange data at a very fast rate. The global memory is accessible by all threads on the GPU, but the latency of global memory accesses is high since it resides in the device DRAM. If the data accessed by the threads of a warp are stored at consecutive addresses, their global memory accesses can be coalesced. Usually, the width of one global memory access is 128 bytes. If the accesses of a warp are coalesced, there will be only one global memory transaction when the data accessed by each thread is not more than 4 bytes; otherwise, there can be up to 32 sequential global memory transactions in the worst case.
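The coalescing rule above can be illustrated with a toy model that counts how many 128-byte segments the 32 addresses of a warp touch (a simplification of the real hardware rules, assumed here only for illustration):

```python
def warp_transactions(addresses, segment=128):
    """Count the distinct 128-byte segments touched by one warp's byte
    addresses; under this simplified model, that is the number of
    global memory transactions the warp's access generates."""
    return len({addr // segment for addr in addresses})

# 32 threads reading consecutive 4-byte words: fully coalesced.
coalesced = warp_transactions([tid * 4 for tid in range(32)])
# 32 threads striding by 128 bytes: one transaction per thread.
strided = warp_transactions([tid * 128 for tid in range(32)])
```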
Semi-global alignment in GATK HC
Implementation of alignment in GATK HC
In GATK HC, the semi-global pairwise alignment is performed in two stages.

>0  indicates a deletion; the length of the run of consecutive deletion(s) is the value of the element
=0  indicates a match or mismatch; the length of the run of consecutive match(es)/mismatch(es) is increased by 1
<0  indicates an insertion; the length of the run of consecutive insertion(s) is the absolute value of the element
The absolute values of the elements in the backtracking matrix are calculated by recording the lengths of the runs of consecutive deletions and insertions while calculating the score matrix.
The second stage calculates the optimal alignment in CIGAR format together with POS. The score matrix sw is first used to find the optimal alignment score, and the backtracking matrix btrack is then used to obtain the optimal alignment and POS. The optimal alignment is calculated in linear time. The backtracking matrix used in GATK HC makes it much easier to identify the next move compared with other methods, since there is no need to jump among several backtracking matrices (shown in Fig. 1) or to derive the next move from the current one [12]. Moreover, the lengths of runs of consecutive deletions and insertions are given directly by the elements of the btrack matrix. However, the length of a run of consecutive matches/mismatches is not given; instead, a counter is incremented by one for each match/mismatch step.
Data analysis
In GATK HC, the semi-global alignment is performed in three situations:
1. Align the reference path with the dangling path to recover dangling branches for the local assembly.
2. Align the read with the assembled haplotype.
3. Align the assembled haplotype with the reference to decide whether the assembled haplotype satisfies the defined requirements.
Execution time of the semi-global alignment in three situations of GATK HC

Situation  Number of alignments  Execution time
1          3529                  0.03%
2          850376                14.58%
3          54802                 19.89%
Total      908707                34.5%
Hence, although the computational complexity of retrieving the optimal alignment is linear, most of its execution time is spent calculating the length of the consecutive match(es)/mismatch(es).
Implementation on GPUs
The implementation of the semi-global pairwise alignment for GATK HC on GPUs is performed in two stages. In the first stage, it performs the modified Needleman-Wunsch algorithm to obtain the backtracking matrix and the position of the optimal alignment score. In the second stage, it retrieves the backtracking matrix to obtain the optimal alignment in CIGAR format and POS.
First stage implementation
Intra-task parallelization
Recording the length of consecutive match(es)/mismatch(es)

x>0  indicates a deletion; the length of the run of consecutive deletion(s) is the value of the element
x=0  indicates a match or mismatch; the length of the run of consecutive match(es)/mismatch(es) is y
x<0  indicates an insertion; the length of the run of consecutive insertion(s) is the absolute value of the element
The data type of x and y is short, whose minimum and maximum values are −32768 and 32767, respectively. Their absolute values are larger than the theoretical maximum length of a run of consecutive operations, which is bounded by the length of R1 or R2. In order to calculate the backtracking matrix, a short2 vector in the shared memory and two registers per thread are used.
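As a sanity check on the choice of data type, the (x, y) pair fits exactly into the 4 bytes of a CUDA short2; a Python sketch of the packing (illustrative only, function names are not from the paper):

```python
import struct

def pack_short2(x, y):
    """Pack the (x, y) backtracking pair into 4 bytes, like a CUDA short2
    (two signed 16-bit integers, little-endian)."""
    assert -32768 <= x <= 32767 and -32768 <= y <= 32767
    return struct.pack("<hh", x, y)

def unpack_short2(raw):
    """Inverse of pack_short2."""
    return struct.unpack("<hh", raw)
```

With 4 bytes per element, the accesses of a warp to one row of the backtracking matrix can be served by a single coalesced 128-byte transaction.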
Second stage implementation
In the second stage, we use the backtracking matrix btrack to obtain the optimal alignment and POS. Algorithm 1 presents the pseudocode of the optimal alignment retrieval on GPUs. P1 and P2 describe the position of the optimal alignment score. Algorithm 1 first checks whether there are soft clippings at the end of R2, then computes the optimal alignment in a while loop, and finally checks whether there are soft clippings at the beginning of R2. The backtracking starts from (P1, P2) and finishes when i≤0 or j≤0, so the optimal alignment is obtained in linear time. POS is the value of (i−1) at the end of the while loop. In addition, the position of each element in the backtracking matrix is calculated from i, j and max_col, as shown in the 9th line of Algorithm 1. max_col is the column size of the largest backtracking matrix over all sequence pairs.
The lengths of the runs of deletions, insertions and matches/mismatches are given directly by the value of an element of the backtracking matrix, as shown in the 13th, 18th and 23rd lines, respectively. This reduces the global memory accesses needed to determine the lengths of the operations.
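Algorithm 1 itself is not reproduced here; its retrieval loop can be sketched in Python over a per-cell (x, y) backtracking matrix, where the encoding follows the description above (matrix layout and variable names are illustrative, not the paper's):

```python
def retrieve_alignment(btrack, p1, p2):
    """Walk the backtracking matrix from the position (p1, p2) of the optimal
    score.  Each cell holds (x, y): x > 0 encodes a run of x deletions,
    x < 0 a run of -x insertions, x == 0 a run of y matches/mismatches.
    Returns the (length, op) runs in alignment order and the final row
    index, from which POS is derived."""
    runs = []
    i, j = p1, p2
    while i > 0 and j > 0:
        x, y = btrack[i][j]
        if x > 0:        # run of deletions: consume rows of R1 only
            runs.append((x, "D"))
            i -= x
        elif x < 0:      # run of insertions: consume columns of R2 only
            runs.append((-x, "I"))
            j += x       # x is negative
        else:            # run of matches/mismatches: consume both
            runs.append((y, "M"))
            i -= y
            j -= y
    runs.reverse()
    return runs, i      # POS is derived from i at loop exit
```

Because every iteration consumes a whole run in one step and one memory read, the loop executes once per CIGAR operation rather than once per base, which is where the reduction in global memory accesses comes from.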
Results
All the experiments are performed on an IBM Power System S823L (8247-42L), which has 2 IBM POWER8 processors (10 cores each) running at 3.6 GHz, 256 GB of DDR3 memory, and an NVIDIA Tesla K40 card. The NVIDIA Tesla K40 card has 2880 cores that run at up to 745 MHz and has a CUDA compute capability of 3.5.
We first compare the performance of the GPU-based semi-global alignment implementation with different techniques using synthetic datasets. The synthetic datasets are created based on the output of Wgsim [21] with default parameters. We then compare the performance of the GPU-based semi-global alignment implementation with the gpu-pairAlign implementation using synthetic datasets. Next, we compare the performance of the GPU-based semi-global alignment implementation with the CPU-based implementation using synthetic and real datasets. Finally, we integrate the GPU-based semi-global alignment implementation into GATK HC 3.7 and compare the overall performance.
Throughput is used as the performance metric of the first stage of the GPU-based implementation, measured in giga cell updates per second (GCUPS). Note that it is not fair to compare the throughput of the first stage of the semi-global alignment with traceback against that of score-only alignments, since the former also needs to store backtracking matrices in the global memory.
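For reference, GCUPS for a batch of equal-length pairs is simply the total number of DP cell updates divided by the runtime (function and parameter names are illustrative):

```python
def gcups(len_r1, len_r2, num_pairs, seconds):
    """Giga cell updates per second: each pair fills an m x n score matrix,
    i.e. m * n cell updates, so throughput = m * n * pairs / time / 1e9."""
    return len_r1 * len_r2 * num_pairs / seconds / 1e9
```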
Performance comparison of multi-pass
There are two solutions for implementing the first stage of the semi-global alignment on GPUs when the length of R2 is bigger than the number of threads in a block. We implemented both solutions and used different synthetic datasets to compare their performance.
For the first solution, which increases the block size, there are in total 9 implementations for the 9 datasets. These implementations differ in the block size and in the sizes of the vectors in the shared memory that store the intermediate results. For the second solution, which employs multi-pass execution, there is a single implementation with a block size of 128.
As shown by Fig. 8, the throughput of the first solution is higher than that of the second solution when the number of sequence pairs in the dataset is 25 or 100. However, when the number of sequence pairs is 1000, the throughput of the second solution is higher in most cases. This is because the efficiency of the implementations of the first solution is lower than that of the second solution, and the advantage of the second solution outweighs its disadvantage when the number of sequence pairs is large. Thus, we can choose between the two solutions based on the number of sequence pairs in the dataset.
Performance comparison of recording match/mismatch lengths
In this section, we analyze the impact of recording the length of consecutive matches/mismatches on the performance of the second stage of the alignment on GPUs. We implemented two variants. The first is our approach shown in Algorithm 1, in which the length of consecutive matches/mismatches is recorded in the backtracking matrix. The second is similar to Algorithm 1, except that the length of M is increased by one at a time and the coordinates (i, j) are decreased by 1 accordingly. The backtracking matrices are produced by the 9 implementations of the first solution using 9 synthetic datasets. Here, the synthetic datasets are not based on the output of Wgsim, since we consider the best case, in which the optimal alignment consists almost entirely of M operations. The lengths of R1/R2 in the 9 synthetic datasets are 64, 128, 192, 256, 320, 384, 448, 512 and 576. The number of sequence pairs in each dataset is 100.
Performance comparison with gpu-pairAlign
As mentioned in the “Background” section, there are two methods to implement the second stage on GPUs: one based on the Myers-Miller algorithm and one based on backtracking matrices. The method based on the Myers-Miller algorithm is only suitable for the pairwise alignment of very long DNA and protein sequences. Thus, we compared our implementation with gpu-pairAlign [12], which uses backtracking matrices to obtain the optimal alignments. gpu-pairAlign is designed to align every given sequence pair on GPUs, especially protein sequence pairs. It includes algorithms for global, semi-global and local alignment; we compare against its semi-global alignment algorithm, which is also performed in two stages: the optimal alignment score and the backtracking matrices are computed in the first stage, and the backtracking is performed in the second stage.
There are two main differences between the gpu-pairAlign implementation and ours: (1) in the first stage, our implementation employs the intra-task parallelization model, while gpu-pairAlign employs the inter-task parallelization model; (2) the backtracking matrix of our implementation is a short2 matrix, whose elements hold the lengths of runs of consecutive deletions, insertions and matches/mismatches, while the backtracking matrices of gpu-pairAlign are four Boolean matrices, whose elements indicate the direction of the backtracking moves.
We modified the gpu-pairAlign implementation to make it handle the data produced by GATK HC: (1) since the input of our implementation is a set of sequence pairs instead of a set of sequences, the way gpu-pairAlign handles input data was modified; (2) integer arrays are used to store the intermediate results instead of short arrays, since the intermediate results exceed the maximum value of the short data type; (3) the alignments are represented using the CIGAR format and POS.
Performance comparison with the CPU-based implementation
In this section, we compare the performance of our GPU-based semi-global alignment with traceback against the CPU-based implementation using synthetic and real datasets. We used the first solution, which increases the block size when the length of R2 is bigger than the block size, and which records the length of consecutive matches/mismatches. The CPU-based implementation is written in C++, compiled with gcc -O3 optimization, and runs on one POWER8 core. The real datasets are produced using a typical workload (Chromosome 10 of the whole human genome dataset G15512.HCC1954.1).
Performance of GPU-based implementations on real datasets

               Throughput (GCUPS)  GPU (sec)  CPU (sec)  Speedup
Stage 1 of S2  1.86                2.32       43.93      18.94x
Stage 2 of S2  -                   0.14       0.62       4.43x
Overall of S2  -                   3.15       44.55      14.14x
Stage 1 of S3  0.64                10.20      53.29      5.22x
Stage 2 of S3  -                   0.09       0.17       1.89x
Overall of S3  -                   10.93      53.46      4.89x
Integration into GATK HC
The two GPU-based implementations, with block sizes of 128 and 576, are integrated into GATK 3.7 to accelerate the semi-global alignment with traceback of situation 2 and situation 3, respectively. The GATK HC implementation with both the GPU-based pair-HMMs forward algorithm and the GPU-based semi-global alignment with traceback is compared with two other GATK HC implementations: GATK HC as downloaded from the GATK website (referred to as the baseline), and GATK HC with only the GPU-based pair-HMMs forward algorithm. The dataset is Chromosome 10 of the whole human genome dataset (G15512.HCC1954.1). All GATK HC implementations are run in single thread mode.
Execution time of GATK HC implementations

                                                        Total time (s)  Speedup
Baseline                                                8034.05         -
GPU (only pair-HMMs)                                    4687.08         1.71x
GPU (pair-HMMs + semi-global alignment with traceback)  3490.70         2.30x
Note that the number of sequence pairs in each batch produced by GATK HC is small, leading to underutilization of the GPU resources. It is better to launch multiple GATK HC processes at the same time to fully utilize the GPU.
Conclusion
This paper presents an implementation of the semi-global alignment with traceback on GPUs to improve the performance of GATK HC. The semi-global alignment with traceback has two stages: in the first stage, a backtracking matrix is computed; in the second stage, the optimal alignment is calculated using the backtracking matrix. Based on the characteristics of the semi-global alignment with traceback in GATK HC, the intra-task parallelization model is chosen. The first stage of our GPU implementation is up to 18.94x faster than the CPU implementation. Moreover, our GPU implementation also records the length of consecutive matches/mismatches, in addition to the lengths of consecutive insertions and deletions as in the CPU implementation. This helps to reduce global memory accesses and provides a speedup of up to 4.43x in the second stage. Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart on synthetic and real datasets, respectively. The GATK HC implementation with both the GPU-based pair-HMMs forward algorithm and the GPU-based semi-global alignment with traceback is 2.30x faster than the baseline GATK HC, and 1.34x faster than the GATK HC implementation with only the GPU-based pair-HMMs forward algorithm.
Notes
Acknowledgements
The authors wish to thank the Texas Advanced Computing Center (TACC) at the University of Texas at Austin and IBM for giving access to the IBM POWER8 machines used in this paper.
Funding
This work was supported by CSC (Chinese Scholarship Council) grant and Delft University of Technology. Publication of this article was sponsored by Delft University of Technology.
Availability of data and materials
The implementation described in this manuscript, as well as all input datasets, is publicly available at: https://github.com/ShanshanRen/semiglobalalignmentwithtraceback.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume20supplement2.
Authors’ contributions
SR designed and performed the experiments, analyzed the data, and wrote the manuscript. All the authors jointly developed the structure and arguments for the paper, made critical revisions and approved final version.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
References
1. Takeshi O, Yinhe C, Kathy T. Performance optimization of Broad Institute GATK Best Practices on IBM reference architecture for healthcare and life sciences. IBM Systems Technical White Paper. 2017. https://www.ibm.com/downloads/cas/LY1OY9XJ.
2. Proffitt A. Broad, Intel Announce Speed Improvements to GATK Powered by Intel Optimizations. Bio-IT World. 2014. http://www.bioitworld.com/2014/3/20/broadintelannouncespeedimprovementsgatkpoweredbyinteloptimizations.html.
3. Ren S, Bertels K, Al-Ars Z. GPU-Accelerated GATK HaplotypeCaller with Load-Balanced Multi-Process Optimization. In: IEEE International Conference on Bioinformatics and Bioengineering; 2017. p. 497–502.
4. Ren S, Bertels K, Al-Ars Z. Efficient Acceleration of the Pair-HMMs Forward Algorithm for GATK HaplotypeCaller on Graphics Processing Units. Evol Bioinform Online. 2018;14:1176934318760543.
5. Li IT, Shum W, Truong AK. 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinforma. 2007;8(1):1–7.
6. Benkrid K, Liu Y, Benkrid AS. A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment. IEEE Trans Very Large Scale Integr Syst. 2009;17(4):561–70.
7. Hasan L, Kentie M, Al-Ars Z. DOPA: GPU-based protein alignment using database and memory access optimizations. BMC Res Notes. 2011;4(1):261.
8. Ahmed N, Mushtaq H, Bertels K, Al-Ars Z. GPU accelerated API for alignment of genomics sequencing data. In: IEEE International Conference on Bioinformatics and Biomedicine; 2017. p. 510–5.
9. Maskell DL, Liu Y, Bertil S. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res Notes. 2009;2(1):73.
10. Liu Y, Schmidt B, Maskell DL. CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010;3(1):93.
11. Liu Y, Huang W, Johnson J, Vaidya S. GPU Accelerated Smith-Waterman. In: International Conference on Computational Science; 2006. p. 188–95.
12. Blazewicz J, Frohmberg W, Kierzynka M, Pesch E, Wojciechowski P. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinforma. 2011;12(1):181.
13. Liu Y, Schmidt B, Maskell DL. MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA. In: IEEE International Conference on Application-Specific Systems, Architectures and Processors; 2009. p. 121–8.
14. Korpar M, Sikic M. SW#: GPU-enabled exact alignments on genome scale. Bioinformatics. 2013;29(19):2494–5.
15. de O Sandes EF, de Melo ACMA. Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space. In: 2011 IEEE International Parallel & Distributed Processing Symposium; 2011. p. 1199–211.
16. Sandes EFO, Miranda G, Martorell X, Ayguade E, Teodoro G, Melo ACMA. CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-Wide Alignment in GPU Clusters. IEEE Trans Parallel Distrib Syst. 2016;27(10):2838–50.
17. Myers EW, Miller W. Optimal alignments in linear space. Comput Appl Biosci. 1988;4(1):11–7.
18. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
19. TCGA Mutation Calling Benchmark 4 Files. https://gdc.cancer.gov/resourcestcgausers/tcgamutationcallingbenchmark4files. G15512.HCC1954.1.
20. Xiao S, Aji AM, Feng W. On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit. In: International Conference on Parallel and Distributed Systems; 2009. p. 26–33.
21. Wgsim. https://github.com/lh3/wgsim.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.