Enabling fast and energy-efficient FM-index exact matching using processing-near-memory

The Journal of Supercomputing

Abstract

Memory bandwidth and latency constitute a major performance bottleneck for many data-intensive applications. While high-locality access patterns take advantage of the deep cache hierarchies available in modern processors, unpredictable low-locality patterns cause a significant part of the execution time to be wasted waiting for data. One example of such memory-bound applications is the exact matching algorithm based on the FM-index, used in some well-known sequence alignment applications. Processing-Near-Memory (PNM) has been proposed as a strategy to overcome the memory wall problem: by placing computation close to data, it speeds up memory-bound workloads by reducing data movement. This paper presents a performance and energy evaluation of two classes of processor architectures when executing the FM-index exact matching algorithm, taken as a reference algorithm for exact sequence alignment. One architecture class is processor-centric, based on complex cores and DDR3/4 SDRAM memory technology. The other architecture class is memory-centric, based on simple cores and ultra-high-bandwidth hybrid memory cube (HMC) 3D-stacked memory technology. The results show that the PNM solution improves performance by between 1.26\(\times\) and 3.7\(\times\), and that the energy consumption per operation is reduced by between 21\(\times\) and 40\(\times\). In addition, a synthetic benchmark, RANDOM, was developed that mimics the memory access pattern of the FM-index exact matching algorithm but with a user-configurable operational intensity. This benchmark allows us to extend the evaluation to the class of algorithms with similar memory behaviour, running over a range of operational intensity values.
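To make the access pattern described above concrete, the following is a minimal sketch of FM-index exact matching by backward search. It is a textbook toy version, not the implementation evaluated in the paper: the names `build_fm_index` and `find_exact` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed. Each pattern character triggers two lookups into the occurrence table at positions that depend on the previous step, which is the kind of data-dependent, low-locality access the abstract refers to.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (assumption: '$' does not occur in it)."""
    text += "$"  # unique terminator, lexicographically smallest symbol here
    # Naive suffix array; acceptable only for small illustrative inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch] = number of symbols in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i] (uncompressed).
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find_exact(pattern, sa, c_table, occ):
    """Return the sorted text positions where `pattern` occurs exactly."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in c_table:
            return []
        # Two data-dependent lookups per character: the next positions
        # are unknown until the current ones have been read.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c_table, occ = build_fm_index("GATTACAGATTACA")
print(find_exact("TTA", sa, c_table, occ))
```

On a genome-sized index the occurrence table is far larger than any cache, so consecutive lookups in the loop tend to land on unrelated cache lines; the sketch keeps the table tiny only for readability.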


Author information

Corresponding author

Correspondence to Ivan Fernandez.

About this article

Cite this article

Herruzo, J.M., Fernandez, I., González-Navarro, S. et al. Enabling fast and energy-efficient FM-index exact matching using processing-near-memory. J Supercomput 77, 10226–10251 (2021). https://doi.org/10.1007/s11227-021-03661-3

  • DOI: https://doi.org/10.1007/s11227-021-03661-3
