A Customized Many-Core Hardware Acceleration Platform for Short Read Mapping Problems Using Distributed Memory Interface with 3D–Stacked Architecture
- First Online:
- Cite this article as:
- Liu, P., Hemani, A., Paul, K. et al. J Sign Process Syst (2017) 87: 327. doi:10.1007/s11265-016-1204-8
- 1.1k Downloads
Rapidly developing Next Generation Sequencing technologies produce huge amounts of short reads that consisting randomly fragmented DNA base pair strings. Assembling of those short reads poses a challenge on the mapping of reads to a reference genome in terms of both sensitivity and execution time. In this paper, we propose a customized many-core hardware acceleration platform for short read mapping problems based on hash-index method. The processing core is highly customized to suite both 2-hit string matching and banded Smith-Waterman sequence alignment operations, while distributed memory interface with 3D–stacked architecture provides high bandwidth and low access latency for highly customized dataset partitioning and memory access scheduling. Conformal with original BFAST program, our design provides an amazingly 45,012 times speedup over software approach for single-end short reads and 21,102 times for paired-end short reads, while also beats similar single FPGA solution for 1466 times in case of single end reads. Optimized seed generation gives much better sensitivity while the performance boost is still impressive.
KeywordsAccelerator architectures Application specific integrated circuits Bioinformatics Computational biology Coprocessors Three-dimensional integrated circuits
DNA sequencing is a technology on detection of the sequence base pairs of nucleotides in a DNA sample. It’s the basement of modern life science fields such as medicine, agriculture and forensics. Next Generation Sequencing (NGS) technologies have greatly reduced the cost of DNA sequencing and increased data throughput. It randomly breaks down many genome copies of an organism into small pieces, then optically “read” the fragmented string ends in parallel and output as huge amount of randomly fragmented DNA strings with equal length.
Those strings called “read” are typically representing 50 ~ 400 base-pairs (bp) in length for DNA. In order to reconstruct the original sequence structure of target organism sample, an assembling stage is introduced with two different approaches: de-novo assembly and reference-guided assembly. The de-novo method assembles target short reads by comparing each other to create full-length sequences, while reference-guided method assembles target short reads against an existed genome model to build a similar but unnecessarily identical sequence. De-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies, meanwhile the latter is much more commonly used in healthcare research of human beings. In this paper, we focus on the reference-guided assembly method, or so called short read mapping.
The mapping process, either indexing the input reads and then scanning through the reference genome or aligning each read independently against indexed reference sequence, all the approaches within this category follow the same principle: index, search, and local sequence alignment. Rapidly developing NGS technologies produces more and more read data than ever before, and the read-indexing approaches turned to be deprecated nowadays, thus we focus on database-indexing methods. There are mainly two types of database indexing methods, hash-index and FM-index .
The hash-index method leverages the fact that individual genomes differ slightly, so that it is likely that some tiny sub-sequences of a short read will match part of the reference genome exactly. These tiny sub-sequences are called seeds, hashed value of which shall be evaluated throughout the hashed reference genome to determine the candidate matching locations. For each candidate matching location, corresponding short read shall be scored against the reference sequence using fully gapped Smith-Waterman sequence alignment  for verification. BFAST  is an example program based upon this approach.
The FM-index approaches are based on a block sorting data compression algorithm called Burrows-Wheeler transform (BWT) . FM-index  is a compressed index that combines the properties of the suffix array  with the BWT. It allows searching for a pattern using binary search instead of a linear time scan. Although it needs more table accesses than classical hash-index based approaches, it’s shown to be one to two orders of magnitude faster in case of limited number of differences between a read and the reference genome. Programs within this group performance well on desktop computers, but the running time would be exponential growth with respect to the allowed number of differences. Therefore these tools tend to be less sensitive. Bowtie  and BWA  are examples based upon this approach.
Modern sequencing machines can generate billions of bp data in very short time. For example, an Illumina HiSeq 2500 sequencing system  would generate a maximum dataset of 1000 Gb per single run, which consists of 4 billion reads with each length up to 2 × 125 bp. Thus the data processing of these reads, traditionally performed by software and general purpose computers, is becoming the bottle-neck of the sequencing industry. A number of acceleration platforms have been proposed based on FPGAs [9, 10] and GPGPU . Both FM-index and hash-index programs have been significantly accelerated, but a series of restrictions are applied among those implementations. Performance of FM-index programs is greatly reduced in case of high difference rate even with hardware acceleration. Hash-index short read mapping can achieve high accuracy with high difference rates, but hardware acceleration is restricted by memory sub-system in terms of access latency and memory size .
In this paper, we turn to architectural innovation to address the performance bottleneck caused by memory sub-system in traditional FPGA designs. We present a novel architecture for short read mapping problem of hash-index method, using customized many-core and 3D–stacked technologies.
We present an analysis of hash-index based short read mapping problem. We summarize the application with computation stages and data flow, and identify the kernel functions which consume a dominant portion of the overall execution time.
We describe a processing core with dedicated DRAM channels, which is designed to facilitate both 2-hit string matching and Smith-Waterman local sequence alignment operations concurrently. Dataset partitioning, customized interconnections and 3D–stacked technologies enable the use of customized many-core architecture for massively parallelization.
We introduce a memory access policy to maximize the throughput of local DRAM access while devoid the bus congestion in a circuit-switching many-core system. This policy has been adopted to provide both high bandwidth and low access latency.
We present results from prototyping stripped down implementation of our design on a commodity FPGA, representing the performance of a single processing core of our architecture. Then we scale the results to give a plausible estimation of our design with customized many-core architecture and 3D–stacked DRAM. Based on conservative area estimation and simulation results, we show that our design achieves the highest performance and greatest energy efficiency comparing with all other known accelerators.
The remainder of this paper is organized as follows. Section 2 introduces hash-index based short read mapping, Smith-Waterman local sequence alignment, and 3D–stacked DRAM technologies. Related works that accelerating hash-index based short read mapping with FPGAs are also briefly covered. Section 3 describes multiple optimizations that have been applied in our platform to address both performance and sensitivity improvements, while Section 4 introduces detailed implementation of our customized many-core architecture with 3D–stacked DRAM. Experimental results are presented in Section 5. Our architecture (with its concomitant software support) is the first reported solution on accelerating short read mapping problem of hash-index method with 3D–stacked VLSI architecture. Finally Section 7 presents conclusions.
2 Background and Related Work
2.1 Hash-Index Based Short Read Mapping
Hash-index based short read mapping is a deterministic method. During data preparation stage, a large index of the reference genome is compiled, which maps every seed resides in the reference genome to the candidate alignment location (CAL) where it resides. This index is formed as an array of seed-CAL pairs and sorted by seed, creating a structured table called CAL table.
The complete CAL table is extremely large in real practice, thus a pointer table is built to help accessing the specific range of the CAL table without traverse it completely every time. The pointer table is indexed using prefix portion of each seed. Each entry of the pointer table contains an address to the CAL table along with the size of the data bucket. Each CAL bucket contains a series of locations where the actual seed resides in the reference genome, as well as the hashed rest nucleotides of those seeds. All seeds in the same CAL bucket share the same address, but they may differ by their least significant nucleotides, called key.
For current short read curr_read, all the possible seeds are extracted with a defined length between 18 to 22 character symbols accordingly and then hashed to get an address-key pair. The prefix part of a hashed seed, addr_seed, with 14 to 16 nucleotides in length is used as an address to query the pointer table, and the significant suffix part is reserved as key_seed. A valid hit to pointer table using addr_seed retrieves the beginning address bucket_cal and the size of relevant data bucket num_cal in the CAL table, which consists of a number of CALs.
For each record of the data bucket retrieved from CAL table, the hashed candidate alignment location from reference genome, key_ref, is compare with the suffix of hashed seed key_seed, and a hit is reported along with the location to reference genome, cal_curr.
If any CAL matches the current seed indicated by 2-hit string matching, Smith-Waterman local sequence alignment shall be applied between the current short read and reference genome sequence at each of associated CALs pointed by corresponding cal_curr. The location with the highest matching score is chosen as the alignment location of the current short read.
For each read, k-mers are obtained called seed string and hashed.
Prefix and suffix of hashed seed string are used for 2-hit string matching against hashed reference sequences.
Smith-Waterman local sequence alignment is executed between the short read and reference sequence at each substring matched sites for verification.
Similar to most scientific programs, a dominant portion of overall execution time is consumed by a few stages of all the applications of interest in hash-index based short read mapping. The profiling of BFAST presents that 48% of overall run time is consumed by 2-hit string matching and another 40% by Smith-Waterman local sequence alignment on our software benchmark platform.
This flow is sequentially executed in software but also well-suited for massively parallelism acceleration, as operations regarding each seed as well as each read are independent from others, thus the problem can be partitioned into small sub-instances. A more detailed description of hash-index based short read mapping can be found in BFAST .
2.2 Smith-Waterman Local Sequence Alignment
The pairwise alignment is the most fundamental class of method in the field of biological sequence analysis; a large number of complex applications have incorporated the pairwise alignment algorithms within their processing flow. Smith-Waterman sequence alignment  is the standard model for local sequence alignment, that only the sub-sequences which present the highest similarity are relatively aligned between two sequences.
2.3 Related Works
BFAST  by Homer et al. is a short read alignment tool for large scale genome resequencing using hash-index method. In general case, the first 18 to 22 nucleotides of each read are used as a seed and hashed for 2-hit string matching, and the matching results are stored into a compressed file. Then the program use Smith-Waterman local sequence alignment to verify each candidate matching site sequentially, and report the final mapping results.
Olson et al.  first proposed a FPGA based solution to accelerate modified BFAST algorithm. They hashed the seeds using a 1–1 hash function on the all 44-bits, and then use the most significant 30 bits of the result to address the pointer table. The remaining 14-bits are stored as key part to identify the seed amongst the CALs in the CAL table. Their platform consists of multiple FPGA modules, thus they have to partition the CAL table and pointer table into small blocks, so that the system has the ability to compute large genome dataset in real. They reported a 250× speedup versus the original BFAST software with 91% sensitivity, and 31 times speedup versus Bowtie  with 80% of short reads aligned.
Sogabe et al.  also proposed a FPGA based platform to accelerate hash-index based BFAST, while they focused on the hash matching functions but not the complete BFAST application. They modified the BFAST source code for shorter length of address pointer, which resulted to a longer key part for more precise matching. Allowing substitutions in the key filtering process, the search becomes more tolerant against substitutions. As a result, their system reported to be more sensitive than BFAST, Bowtie and implementation by Olson et al. With a very high matching rate of 99.48%, they retain a speedup of 2.5 times versus Bowtie.
There has been no reported GPGPU implementation of BFAST, since the hardware architecture of which is weak on frequent random memory access with tiny payloads.
2.4 Random Memory Access
The principles of random read and write in Randmem benchmark is similar with the case of 2-hit string matching algorithm in BFAST. During the string matching, various tables stored in main buffer are randomly retrieved with tiny payload. It could hardly benefit from memory bursts, as the payload is very small on each access and the access address is unpredictable. This principle leads to extremely low memory bandwidth utilization in real, which is the major cause of the sharp drop over random access curves in Fig. 2.
FPGA and GPGPU also suffer from this bottleneck, since each (G)DDR2/3/4/5 SDRAM channel could only response to limited request at the same time. Therefore theoretical bandwidth does not yield high random access bandwidth in such extreme application, which results relatively poor performance on FPGA implementations [9, 10].
2.5 MPSoC and 3D–Stacked Technologies
Interconnect scaling has become one of the most crucial challenges in chip design, and is expected to get worse in the future. Multiprocessor System-on-Chip (MPSoC), Network-on-a-Chip (NoC) and 3D stacking technologies are expected to be the promising solutions to overcome many of these challenges.
For hardware acceleration, the demand of memory bandwidth is exponentially growing along with increasing compute ability, which becomes a bottle-neck of massively paralleled systems. The idea of distributing memory controllers to manage multiple DRAM channels has been proposed as a solution. A novel scheme for memory controller placement in many-core processor was presented by Abts et al. , while they retain a 2D mesh of chip layout.
This MPSoC 2D array is capable for solving BFAST problem in theory, but its performance is restricted by external memory bandwidth. For 2-hit string matching algorithm, most of the computations are finished in one clock cycle and no further propagation on data-path. Smith-Waterman local sequence alignments are rarely executed due to small table partition. Therefore there is no efficient way to make use of most of the processing resources in an MPSoC 2D array to accelerate BFAST application, since the dominant external DRAM accesses are random read, and the available number of on-chip memory controllers is constrained by limited PCB layout resources.
Loi et al.  first proposed distributed memory controller architecture for vertically stacked memory over a many-core logic die, which ensures ultra-low latency access to the memory modules on top of each processing element. On the horizontal logic plane, they connected logic cores through NoC, therefore the vertically stacked memory system can be modeled with the abstraction of memory neighborhood, as processing elements within an array have fast access and high bandwidth to a vertical stack of memory banks on top of it. Communication to these local modules do not travel through the NoC, thus it takes full advantage of the lower latency on vertical TSV interconnect, resulted to a significantly speedup in case of local accesses. Experimental results demonstrate significant bandwidth improvement that ranges from 1.44× to 7.40× compared to the JEDEC standard, with peaks of 4.53GB/s for direct memory access.
Recently Weis et al.  explored the 3D–DRAM architecture design space. They proposed a highly energy-efficient DRAM subsystem for next-generation 3D–integrated System-on-a-Chip (SoC), consisting of a 3D–DRAM controller and an attached 3D–DRAM cube with fine-grained access and a flexible (WIDE-IO) interface. They also investigate different DRAM families and densities from 256 Mb to 4Gb per channel, as well as the energy efficiency issues. The implementation results of the proposed 3D–DRAM subsystem show that energy optimized accesses to the 3D–DRAM enable up to 50% energy savings compared to standard accesses.
Micron has shipped the example devices of Hybrid Memory Cube (HMC) products , which could provide a raw link bandwidth of 320Gb/s. As a commercial product in early evaluation stage, most key technical specifications are being kept as secret. The sequential link transactions consists of data in assembled packets, but the most recent HMC specification v1.1  doesn’t provide the detail about the packet structure, as well as the clock latency of TX/RX, thus the utilization ratio of HMC data link is a mysterious. We can hardly make rough performance estimations for HMC in real application at this moment.
3 System Optimizations
In order to improve the performance of hash-index based short read mapping problem, a series of optimizations has been implemented and introduced in this section.
The original software implementation has been optimized into a HW-SW co-design architecture, which makes use of hardware massively parallelization to accelerate multiple execution stages. In order to fulfill the bandwidth requirement of this massively parallelized architecture, a 3D–stacked memory sub-system has been proposed with optimized access policy.
3.1 Data and Task Partition of Hash-Index Based Short Read Mapping
According to Fig. 1, the seed matching flow of hash-index based short read mapping is sequentially executed in software but also well-suited for massively parallelism acceleration, as operations regarding each seed as well as each read are independent from others, thus the specific task can be partitioned into small sub-instances. In order to parallel such a matching task, we should partition the CAL table and pointer table into small sub-tables which include different data regions respectively, thus different reads can be processed on multiple sub-instances.
To partition the tables with n ways, both the CAL table and pointer table are partitioned by the first log2(n) bits of the seed hash, thus seeds can be searched against different data regions with n instances in parallel. Additionally, as log2(n) bits have been used to address the sub-instances, the address bits in pointer table can be reduced by log2(n) bits.
On the other hand, all the matched CALs shall be evaluated with Smith-Waterman local sequence alignment. Each matched CAL is used to retrieve corresponding reference genome clip stored in memory, which is to be aligned against the short read that contains the matched seed. This procedure can be also partitioned for parallelization, since each run of Smith-Waterman alignment is executed independently between a specific short read and a clip of reference genome. The final alignment score will be temporary stored and then reported back to host processor for further processing.
3.2 Banded Smith-Waterman Local Sequence Alignment
Assume Beff_in = Barray_out, which is roughly the real case of sequential operations over external memory sub-system, then m = 2/21 from (2) to (3). This mismatch makes classical approach of distance matrix filling benefit less than anticipated on massively paralleled accelerators, since all the distance scores shall be saved for backtracking stage.
For each CAL, distance scores are generated and accumulated with banded gap. Alignment stops whenever reaching the band gap or all sequence elements have been aligned. No distance scores will be stored in this stage.
For each CAL, accumulated alignment scores of each CAL are compared to find out the maximum one.
The CAL with maximum alignment score is reported as the corresponding read mapping with quality information. Formal Smith-Waterman alignment can be executed on that site if interested at exact alignment result.
This approach boost the Smith-Waterman local sequence alignment stage by an order of magnitude as the wave-front computation without matrix filling rely on input bandwidth of systolic array only.
3.3 3D–Stacked DRAM with Scatter/Gather Read Policy
According to JEDEC compliant DRAM model, a write request is executed in 3 phases: the target row and bank is activated during the activate phase, and then the column is selected while the data is written in the write buffer during the write phase. The precharge phase is needed only if the row must be deactivated; as the precharge command writes back the data in this row, then this bank is ready for a new request in a different row. The read request is slightly different: there is a latency of several clock cycles before the data reaches the I/O during the second phase.
For 3D compliant DRAM model proposed by Loi et al. , Activate and Precharge phases show the same latency as JEDEC since they rely on the physical interconnections, while the Read/Write phase is shorter due to improvement on the I/O interface and lean column decoder architecture. The protocol translation between processor and DRAM is still needed but the access time is much less than classical approaches. Our design follows this model with further optimizations over read access.
We assume that all banks in a DRAM channel share a common 64-bit wide TSV data bus, a single channel is achieved by 8 banks in 8 layers, each bank with a size of 64 Mb. While 64 bits seems to be outdated, 128 bits or more does not give better performance in this specific case. For 2-hit string matching, over 80% of DRAM accesses are random read of 64 bits, and the rest are also random read but with length 8. For Smith-Waterman local sequence alignment, the 64 bit read with random address would last for length 8 at the most. Therefore wider DRAM channel does not help a lot but consumes a lot of extra energy and silicon spaces, which is not acceptable.
Wide I/O DRAM has been standardized by JEDEC (JESD229)  as a 3D DRAM solution for embedded SoC systems with lots of low power consideration. In order to conform to the WIDE I/O standard, we assume using LVCMOS signaling for interconnection so that we could avoid using complex SSTL interface, which results in both silicon and power efficiency. We adopt the wafer-to-wafer bumpless stacking process which provides aggressively small TSVs at an extremely fine pitch (5 μm). This approach provides much better silicon efficiency than micro-bump based die-to-die stacking with large pitch (20 ~ 40 μm).
Figure 4 demonstrates the differences of read access between classical 2D DRAM and our 3D DRAM, where the small blocks of active rows indicate data read. For 2D DRAM rank, the data is distributed over all banks, and each access targets a small fraction of the same active address to all banks. Scatter/Gather read policy in our 3D architecture allows accessing different banks at the same time, while the target address can be different between each bank. Although data from different banks can’t be transferred altogether, the I/O width to each bank is exact the same as the channel width, therefore random read from different banks can be sequentially accomplished in continuous clock cycle, so that achieving much higher bandwidth efficiency.
Data-path latency is another factor that affects DRAM random access, as cache line won’t be of much help in this case. Signal from classical 2D DRAM has to be transferred through PCB, SSTL pads, memory controller, and system bus before reaching PE, while in our 3D architecture it becomes much simpler and more efficient. We don’t even have to consider the system bus latency in general 3D designs, as remote access to local DRAM channels is avoided by careful dataset partitioning and task scheduling.
Our design does not rely on NoC since the partitioned seed matching as well as Smith-Waterman algorithm takes use of local data only through dataset partitioning. Although our design is many-core architecture on the horizontal logic plane, there is no remote access between local DRAMs and neighborhood processing cores. Each processing core works independently with two integrated memory controllers exclusively, which interfaces with a 3D–stacked DRAM of 64 MB in size respectively. Partitioned CAL table is stored in one of the DRAM stack, as well as partitioned pointer table and reference genomes in another. This approach provides an extremely high bandwidth for fast local access without consideration of remote access during computation. In case of write or read from host processor, each local processing core will be accessed in a serialized queue, thus no congestion shall be considered on shared bus. Therefore the scalability of our architecture is not restricted by data-path or problem size, but relies overall power dissipation and manufacture cost.
4.1 Processing Core
A hashed seed is input as a combination of addr_seed and key_seed part. Each seed consists of 22 nucleotides, thus a hashed seed is represented by 44-bits. The addr_seed could be varies between 24 to 32 bit, and key_seed could be 20 to 12 bit respectively.
A hashed seed is input along with a read id, indicates which read that this seed belongs to, although the process of read id is omitted in Fig. 7. The addr_seed is used to search the partitioned pointer table stored in DRAM1, which forms the first hit of 2-hit string matching. If current addr_seed is a valid address to the partitioned pointer table, corresponding CAL bucket will be read out from DRAM2. Then the keys stored in that CAL bucket will be compared with keey_seed respectively. A 2-hit string matching event will be broadcasted to all the processing cores through shared bus along with matched CAL and the read id. This stage corresponds to line 7 to line 26 of algorithm 3.
On the other hand, all the broadcasted matches shall be evaluated with Smith-Waterman alignment. Each matched CAL is used to retrieve corresponding reference genome clip stored in DRAM1, which is to be aligned with the short read that contains the matched seed. The final alignment score will not be stored but only report whether the banded aliment is completed for the whole read or not. Line 27 to 32 of our paralleled algorithm represents this stage, which is detailed with algorithm 4.
4.2 Customized Many-Core Architecture
For local access, the transfer takes the advantage of the low latency and high bandwidth of direct access; therefore an entire cache line (or simply a single word) can be written or read in a few clock cycles. A remote request is only considered from external host processor but not neighborhood processors, while the partitioned CAL table, pointer table and reference genome sequence are written from host processor through burst access, or the short reads with aligned scores are read back.
Alignment buffer will collect all the alignment results from Smith-Water local alignment and transfer back to host on the fly. For paired end reads, each pair will be evaluated for seed matching, and then Smith-Waterman local alignment will be applied if any single sequence has matched CAL.
There are 10 different seeds from each short read, corresponding to 10 different pre-generated CAL index tables. The hash index mapping computation is executed sequentially for each CAL index tables due to the requirement of data storage space. In order to avoid redundant computation, mapped reads which passed banded Smith-Waterman alignment will be recorded and skipped for the rest computations. This approach would avoid more than 80% of computation in the rest iterations.
4.3 Smith-Waterman Processing Element Array
The short read shall be aligned to the reference genome according to each CAL using alternative banded Smith-Waterman local alignment, as it’s globally aligned against a local section of the reference. Thus we will receive the alignment of the entire short read to a subsequence of the reference genome.
With the help of affine gap, long insertion or deletion chains scores are verified to be more biologically accurate than short insertion/deletion by linear gap. The cost to begin an insertion or deletion chain is higher than the cost to continue a chain in this case. Banded Smith-Waterman alignment would just stop work in case of small gap, which would reduce overall sensitivity. Therefor those short reads without positive mapping found by banded Smith-Waterman alignment shall be aligned with regular Smith-Waterman again. This does not leads to high performance impact since such reads is very few in count. A well optimized band gap will be helpful as well.
4.4 3D–Stacked Architecture
In Fig. 11, each DRAM layer consists of two DRAM tiles. Each tile is a DRAM unit of 64 Mb; each processing core has two DRAM stacks; there are 4 × 8 = 32 processing cores; the DRAM stacks are formed with 8 layers; thus we have an on-chip storage space of (64 × 2 × 32 × 8) / 8 = 4096 MB in total.
According to , a typical 3D DRAM core tile of 64 Mb occupies 1.92 mm2 in 45 nm tech node, thus the chip area we planned is approximately 130 mm2.
5 Experimental Results
In this section, we evaluate and compare the overall performance and silicon efficiency of our platform with software and scaled FPGA platforms.
5.1 Evaluation Datasets and Platforms
To assess the performance on real data, we downloaded a dataset which consists of about 63.9 million pairs of 101 bp reads from DNANEXUS (ncbi run SRR867061). These reads were produced by Illumina HiSeq2500 for NA12878, a female human included in the 1000 Genomes Project . Both single and paired end reads from this dataset were used for this performance evaluation. Reads were mapped to the human genome issue hg38 from UCSC .
The baseline for performance evaluation is a workstation equipped with Intel Core i7–960 3.20GHz CPU, 48 GB RAM and Redhat Enterprise Linux v5.8 × 64 running on it. Software performance corresponding to the baseline is reported as “BFAST” for BFAST v0.7.0a with all the eight CPU threads function in parallel using “-n 8” switch, other settings remain default. Source codes are compiled with gcc and “-O3” optimization flag.
Our VLSI design is based on TSMC 45 nm silicon process, synthesized and verified with post-synthesize simulation using EDA tools from Synopsys. A total of 32 processing cores are planned in our design, and the overall silicon area is estimated about 130 mm2 in 2D mesh. The simulated working frequency in our design is 333Mhz for 3D DRAM layers and DRAM controller, and 800Mhz for other function blocks in the logic layer. Simulation of short read mapping against whole human genome is impossible on RTL level of such a complicate design, therefore we make use of a FPGA board, terasIC DE5-NET equipped with 5SGXEA7N2F45C2N, which could verify one processing core with two DDR3 controllers function in real, and then we could scale the test results in order to compare with other platforms. The gate counts of each processing core is approximately 6 M, and the Smith-Waterman systolic array within each processing core consists of 1024 processing elements, which consumes about 68% of silicon area of a processing core.
The different length of short reads would not affect much over the seed matching part, but run time of Smith-Waterman local alignment is correlated to sequence length. However Olson et al. didn’t report the Smith-Waterman performance of their platform individually, thus the estimation given by Eq. 1 shall be better than it could perform in real.
GPU acceleration has been introduced for short read mapping problems, but the most recent skeleton work still lacks the support over whole human genome sequencing. Running on our workstation equipped with nVIDIA Geforce GTX 680, CUSHAW2-GPU v2.4.3 , which is the most recent implementation of short read mapping on GPU, reported error during loading the index dataset into onboard memory. Therefore we have to skip the comparison between our design and GPGPU, since an equivalent simulation is not possible without enough onboard memory.
5.2 Mapping Time
3D VLSI 1c
3D VLSI 2d
Results labelled with “3D VLSI 1” are generated by conformal with Original BFAST program, while “3D VLSI 2” is results with adjusted seed generation policy for higher sensitivity. The evaluation results show that the acceleration with “3D VLSI 1” has an amazingly 45,012 times speedup over original BFAST for single end reads, which also beat single FPGA for 1466 times. The acceleration over paired end reads is slightly lower but still as good as 21,102 times.
The DDR3 DRAM channels on DE5 are working on 667 MT/s. Since the DRAM access pattern is completely random, double date rate does little help for real bandwidth. The stacked DRAM layers gives single sub-processor the peak transfer rate of 64 × 2 × 333/8 = 5328 MB/s. This value is theoretically worse than that of DE5 board with two DDR3–667 channels giving 10.67GB/s, or the new WIDE I/O 2  standard of 34GB/s. But the idea of extremely low latency and high bandwidth over random access shines: 3D–stacked SDR DRAM provides the best effort for low latency random memory access, while the sequential transfer rate is also sufficient for maximum read length of 8.
According to the recorded DRAM access traces, the effective read bandwidth on DRAM channel 0 of DE5 is approximately 450 MB/s, which consists of accesses to pointer table and original reference sequence. Accessing to DRAM channel 1 is less frequent; therefore DRAM channel 0 is the most critical system bottleneck.
According to simulated DRAM access traces of 3D SDRAM with 333Mhz, the random access is highly benefit from Scatter/Gather read policy, and the effective read bandwidth of DRAM channel 0 is about 1210 MB/s. Therefore the results “3D VLSI 1” are given by scaling of the FPGA results with a factor of (1210/450) × 32 linearly, as we don’t have to consider data congestion with individual broadcast bus. The working frequency of processor layer is ignored here, since both 3D VLSI and stripped FPGA design is sufficient to process those data. There is no DRAM write operation during mapping, thus DRAM write is also ignored for this estimation.
Original BFAST program only use small suffix part of short read for seed generation, e.g. first 22 symbols of total 101 in length of each read. Definitely this would lead to miss mapping. The results of “3D VLSI 2” is scaled with the same method as “3D VLSI 1”, but the seed generation is optimized for higher mapping sensitivity. The original BFAST application extracts the first 22 nucleotides of each read as seed, while in “3D VLSI 1” we try to pick a seed for every 22 symbols of each short read if possible, so that the seed coverage is much higher than original BFAST. This approach results very high sensitivity as described in section C.
3D VLSI 1
3D VLSI 2
5.4 Power Estimation
The logic layer of one sub-processor running at 800Mhz on 45-nm process consumes around 0.9 W given by post-synthesis simulation. This is achieved by using power gating on Smith-Waterman processing array, as it’s idle by 62% of execution time. Therefore our example design would consume around 29 W energy on the logic layer with 130 mm2 silicon areas.
The simulated memory traces depicted that the classical memory controller optimization methods does not apply to fully random accesses. With classical CACTI model , the estimated maximum power consumption for all the 8 DRAM layers is a bit less than 60 W.
In order to estimate accurate power consumption of 3D–stacked DRAM layers, the simulated DRAM access traces has been recorded and replayed in the DRAMSys tool . Comparing with the theoretical maximum power consumption given by CACTI, estimated power consumption of 3D DRAM sub-system given by TLM models has been greatly reduced to 46 W in total, results from real transaction rate, Scatter/Gather read policy and scheduled power down.
This power density is significantly below the maximum value of 200 W/cm2 as defined by International Technology Roadmap for Semiconductors .
We have described an approach to improve the performance of hash-index based short read mapping algorithm using a hardware acceleration platform.
We summarized the BFAST application with computation stages and data flow, and identified that 2-hit string matching and Smith-Waterman local sequence alignment consume a dominant portion of the overall execution time.
Then we introduced a customized processing core with dedicated DRAM channels, which is designed to facilitate 2-hit string matching and banded Smith-Waterman local sequence alignment. Parallelization of the two compute intensive stages has been well described, while both stages could benefit from exclusive low latency access to local DRAM channels. With customized dataset partitioning and task scheduling, the overlapped run time of these two major computation stages gives another factor of performance improvements over software implementation.
In our approach, a distributed memory interface with 3D–stacked architecture has been proposed to provide exclusive local access to each partitioned dataset. We introduced a Scatter/Gather read policy on DRAM controller to maximize the throughput of local random DRAM read access.
We present results from prototyping stripped down implementation of our design on a commodity FPGA, representing the performance of a single processing core of our architecture. Then we scale the results to give a plausible estimation of our design with customized many-core and 3D stacked architecture. Based on conservative area estimation and simulation results, we show that our design achieves the highest performance and greatest energy efficiency comparing with all other known accelerators.
For sensitivity, the different indices and seeds combination could avoid most read errors, but the matching result can be further improved by extending the seed generation policy for higher coverage of short reads. This method would greatly impact the performance of software implementation, but suite well on our hardware acceleration platform.
Our platform could accelerate hash index short read mapping only at this moment, support of FM-index mapping is planned in the future.
We are very grateful to Professor Lars Arvestad for providing many valuable suggestions.
|Funder Name||Grant Number||Funding Note|
|Royal Institute of Technology|
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.