Abstract
Memory bandwidth and latency constitute a major performance bottleneck for many data-intensive applications. While high-locality access patterns take advantage of the deep cache hierarchies available in modern processors, unpredictable low-locality patterns cause a significant part of the execution time to be wasted waiting for data. One such memory-bound application is the exact matching algorithm based on the FM-index, used in several well-known sequence alignment tools. Processing-Near-Memory (PNM) has been proposed as a strategy to overcome the memory wall: by placing computation close to data, it reduces data movement and thereby speeds up memory-bound workloads. This paper presents a performance and energy evaluation of two classes of processor architectures executing the FM-index exact matching algorithm, taken as a reference algorithm for exact sequence alignment. One class is processor-centric, based on complex cores and DDR3/4 SDRAM memory technology. The other class is memory-centric, based on simple cores and ultra-high-bandwidth Hybrid Memory Cube (HMC) 3D-stacked memory technology. The results show that the PNM solution improves performance by between 1.26\(\times\) and 3.7\(\times\), and reduces the energy consumed per operation by between 21\(\times\) and 40\(\times\). In addition, we developed a synthetic benchmark, RANDOM, that mimics the memory access pattern of the FM-index exact matching algorithm but with a user-configurable operational intensity. This benchmark allows us to extend the evaluation to the class of algorithms with similar memory behaviour running over a range of operational intensity values.
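To illustrate the access pattern the abstract refers to, the following is a minimal, naive sketch of FM-index backward search (not the paper's optimized implementation; all names are illustrative). Each pattern symbol triggers a pair of rank (occ) queries at data-dependent positions, which is the unpredictable, low-locality behaviour that makes the algorithm memory-bound:

```python
# Minimal FM-index backward-search sketch (illustrative only).
# Counts exact occurrences of a pattern using the BWT, the C table,
# and rank (occ) queries.

def build_fm_index(text):
    text += "$"                       # unique terminator, smallest symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c]: number of symbols in text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C

def occ(bwt, c, i):
    # Rank query: occurrences of c in bwt[0:i]. Naive here; real indexes
    # sample checkpoints so each query is O(1) but cache-unfriendly.
    return bwt[:i].count(c)

def count_matches(pattern, bwt, C):
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):       # one rank-query pair per symbol
        if c not in C:
            return 0
        lo = C[c] + occ(bwt, c, lo)   # data-dependent, low-locality access
        hi = C[c] + occ(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo                    # size of the suffix-array interval

bwt, C = build_fm_index("banana")
print(count_matches("ana", bwt, C))   # -> 2
```

Because the positions `lo` and `hi` depend on the pattern and the text, successive occ lookups jump unpredictably across the index, which is precisely the pattern the RANDOM benchmark reproduces at configurable operational intensity.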
Cite this article
Herruzo, J.M., Fernandez, I., González-Navarro , S. et al. Enabling fast and energy-efficient FM-index exact matching using processing-near-memory. J Supercomput 77, 10226–10251 (2021). https://doi.org/10.1007/s11227-021-03661-3