A Modern Primer on Processing in Memory

Mutlu, Onur; Ghose, Saugata; Gómez-Luna, Juan; Ausavarungnirun, Rachata

doi:10.1007/978-981-16-7487-7_7

Part of the book series: Computer Architecture and Design Methodologies (CADM)

Abstract

Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability, and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well; (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems; and (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy, and latency, much more so than computation. These trends are felt especially severely in today's data-intensive server systems and energy-constrained mobile systems. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error-correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.
While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing using memory, which exploits analog operational properties of DRAM chips to perform massively parallel operations in memory with low-cost changes; and (2) processing near memory, which exploits 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM.

References

6th Generation Intel Core Processor Family Datasheet (2021), http://www.intel.com/content/www/us/en/processors/core/desktop-6th-gen-core-family-datasheet-vol-1.html
B. Abali, H. Franke, D.E. Poff, R.A. Saccone, C.O. Schulz, L.M. Herger, T.B. Smith, Memory expansion technology (MXT): software support and performance. IBM J. Res. Dev. (2001)
Google Scholar
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., Tensorflow: a system for large-scale machine learning, in OSDI (2016)
Google Scholar
A. Acharya, M. Uysal, J. Saltz, Active disks: programming model, algorithms and evaluation, in ASPLOS (1998)
Google Scholar
M.T. Aga, Z.B. Aweke, T. Austin, When good protections go bad: exploiting anti-DoS measures to accelerate RowHammer attacks, in HOST (2017a)
Google Scholar
S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das, Compute caches, in HPCA (2017b)
Google Scholar
J. Ahn, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing (2015a), https://people.inf.ethz.ch/omutlu/pub/tesseract-pim-architecture-for-graph-processing_isca15-talk.pdf, conference talk at ISCA 2015
J. Ahn, PIM-Enabled Instructions: A Low-Overhead, Locality-Aware PIM Architecture (2015b), https://people.inf.ethz.ch/omutlu/pub/pim-enabled-instructons-for-low-overhead-pim_isca15-talk.pdf, conference talk at ISCA 2015
J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, A scalable processing-in-memory accelerator for parallel graph processing, in ISCA (2015a)
Google Scholar
J. Ahn, S. Yoo, O. Mutlu, K. Choi, PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture, in ISCA (2015b)
Google Scholar
A. Ailamaki, D.J. DeWitt, M.D. Hill, D.A. Wood, DBMSs on a modern processor: where does time go? in VLDB (1999)
Google Scholar
B. Akin, F. Franchetti, J.C. Hoe, Data reorganization in memory using 3D-stacked DRAM, in ISCA (2015)
Google Scholar
C. Alkan et al., Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. (2009)
Google Scholar
M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, C. Alkan, GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping. Bioinformatics (2017)
Google Scholar
M. Alser, H. Hassan, A. Kumar, O. Mutlu, C. Alkan, Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics (2019)
Google Scholar
M. Alser, Z. Bingöl, D. Senol Cali, J. Kim, S. Ghose, C. Alkan, O. Mutlu, accelerating genome analysis: a primer on an ongoing journey. IEEE Micro (2020a)
Google Scholar
M. Alser, T. Shahroodi, J. Gomez-Luna, C. Alkan, O. Mutlu, SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs, and FPGAs (2020b)
Google Scholar
S. Angizi, D. Fan, Graphide: a graph processing accelerator leveraging in-dram-computing, in GLSVLSI (2019)
Google Scholar
S. Angizi, Z. He, D. Fan, PIMA-logic: a novel processing-in-memory architecture for highly flexible and energy-efficient logic computation in DAC (2018a)
Google Scholar
S. Angizi, A.S. Rakin, D. Fan, CMP-PIM: an energy-efficient comparator-based processing-in-memory neural network accelerator, in DAC (2018b)
Google Scholar
S. Angizi, J. Sun, W. Zhang, D. Fan, AlignS: a processing-in-memory accelerator for DNA short read alignment leveraging SOT-MRAM in DAC (2019)
Google Scholar
A. Ankit, I.E. Hajj, S.R. Chalamalasetti, G. Ndu, M. Foltin, R.S. Williams, P. Faraboschi, W.-M.W. Hwu, J.P. Strachan, K. Roy, D.S. Milojicic, PUMA: a programmable ultra-efficient memristor-based accelerator for machine learning inference, in ASPLOS (2019)
Google Scholar
Apple Inc., About the Security Content of Mac EFI Security Update 2015-001 (2015), https://support.apple.com/en-us/HT204934
H. Asghari-Moghaddam, Y.H. Son, J.H. Ahn, N.S. Kim, Chameleon: versatile and practical near-DRAM acceleration architecture for large memory systems, in MICRO (2016)
Google Scholar
R. Ausavarungnirun, Techniques for shared resource management in systems with throughput processors. Ph.D. Thesis (Carnegie Mellon University, 2017)
Google Scholar
R. Ausavarungnirun, S. Ghose, O. Kayıran, G.H. Loh, C.R. Das, M.T. Kandemir, O. Mutlu, Exploiting inter-warp heterogeneity to improve GPGPU performance, in PACT (2015)
Google Scholar
R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C.J. Rossbach, O. Mutlu, Mosaic: a GPU memory manager with application-transparent support for multiple page sizes, in MICRO (2017)
Google Scholar
R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. Rossbach, O. Mutlu, MASK: redesigning the GPU memory hierarchy to support multi-application concurrency, in ASPLOS (2018a)
Google Scholar
R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C.J. Rossbach, O. Mutlu, Mosaic: enabling application-transparent support for multiple page sizes in throughput processors. SIGOPS Oper. Syst. Rev. (2018b)
Google Scholar
A.J. Awan, M. Brorsson, V. Vlassov, E. Ayguade, Performance characterization of in-memory data analytics on a modern cloud server, in CCBD (2015)
Google Scholar
A.J. Awan, M. Brorsson, V. Vlassov, E. Ayguade, Micro-architectural characterization of apache spark on batch and stream processing workloads, in BDCloud-SocialCom-SustainCom (2016)
Google Scholar
O.O. Babarinsa, S. Idreos, JAFAR: near-data processing for databases, in SIGMOD (2015)
Google Scholar
R. Baeza-Yates, G.H. Gonnet, A new approach to text searching. Commun. ACM (1992)
Google Scholar
A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, T.M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator, in ISPASS (2009)
Google Scholar
A. Barenghi, L. Breveglieri, N. Izzo, G. Pelosi, Software-only reverse engineering of physical DRAM mappings for RowHammer attacks, in IVSW (2018)
Google Scholar
G. Benson, Y. Hernandez, J. Loving, A bit-parallel, general integer-scoring sequence alignment algorithm, in CPM (2013)
Google Scholar
D. Bhattacharjee, R. Devadoss, A. Chattopadhyay, ReVAMP: ReRAM based VLIW architecture for in-memory computing, in DATE (2017)
Google Scholar
S. Bhattacharya, D. Mukhopadhyay, Curious case of RowHammer: flipping secret exponent bits using timing analysis, in CHES (2016)
Google Scholar
S. Bhattacharya, D. Mukhopadhyay, Advanced fault attacks in software: exploiting the RowHammer bug, in Fault Tolerant Architectures for Cryptography and Hardware Security (2018)
Google Scholar
N. Binkert, B. Beckman, A. Saidi, G. Black, A. Basu, The gem5 simulator. CAN (2011)
Google Scholar
P.A. Boncz, S. Manegold, M.L. Kersten, Database architecture optimized for the new bottleneck: memory access, in VLDB (1999)
Google Scholar
L. Bongiovanni, Maintaining sorted files in a magnetic bubble memory. IEEE Trans. Comput. (1980)
Google Scholar
A. Boroumand, Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks (2018), https://people.inf.ethz.ch/omutlu/pub/Google-consumer-workloads-data-movement-and-PIM_asplos18-talk.pdf, conference talk at ASPLOS 2018
A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh, K.T. Malladi, H. Zheng, O. Mutlu, LazyPIM: an efficient cache coherence mechanism for processing-in-memory. CAL (2016)
Google Scholar
A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, N. Hajinazar, K. Hsieh, K.T. Malladi, H. Zheng, O. Mutlu, LazyPIM: efficient support for cache coherence in processing-in-memory architectures (2017), arXiv:1706.03162 [cs:AR]
A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, O. Mutlu, Google workloads for consumer devices: mitigating data movement bottlenecks, in ASPLOS (2018)
Google Scholar
A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarungnirun, K. Hsieh, N. Hajinazar, K.T. Malladi, H. Zheng, O. Mutlu, CoNDA: efficient cache coherence support for near-data accelerators, in ISCA (2019)
Google Scholar
E. Bosman, K. Razavi, H. Bos, C. Giuffrida, Dedup EST machina: memory deduplication as an advanced exploitation vector, in S&P (2016)
Google Scholar
A.W. Burks, H.H. Goldstine, J. von Neumann, Preliminary discussion of the logical design of an electronic computing instrument (1946)
Google Scholar
Y. Cai, NAND flash memory: characterization, analysis, modeling, and mechanisms. Ph.D. Thesis (Carnegie Mellon University, 2013)
Google Scholar
Y. Cai, E.F. Haratsch, O. Mutlu, K. Mai, Error patterns in MLC NAND flash memory: measurement, characterization, and analysis, in DATE (2012a)
Google Scholar
Y. Cai, G. Yalcin, O. Mutlu, E.F. Haratsch, A. Cristal, O.S. Unsal, K. Mai, Flash correct-and-refresh: retention-aware error management for increased flash memory lifetime, in ICCD (2012b)
Google Scholar
Y. Cai, O. Mutlu, E.F. Haratsch, K. Mai, Program interference in MLC NAND flash memory: characterization, modeling, and mitigation, in ICCD (2013a)
Google Scholar
Y. Cai, E.F. Haratsch, O. Mutlu, K. Mai, Threshold voltage distribution in MLC NAND flash memory: characterization, analysis, and modeling, in DATE (2013b)
Google Scholar
Y. Cai, G. Yalcin, O. Mutlu, E.F. Haratsch, A. Crista, O.S. Unsal, K. Mai, Error analysis and retention-aware error management for NAND flash memory. Intel Technol. J. (2013c)
Google Scholar
Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, K. Mai, Neighbor-cell Assisted Error Correction for MLC NAND Flash Memories, in: SIGMETRICS, 2014
Google Scholar
Y. Cai, Y. Luo, E.F. Haratsch, K. Mai, O. Mutlu, Data retention in MLC NAND flash memory: characterization, optimization, and recovery, in HPCA (2015a)
Google Scholar
Y. Cai, Y. Luo, S. Ghose, O. Mutlu, Read disturb errors in MLC NAND flash memory: characterization, mitigation, and recovery, in DSN (2015b)
Google Scholar
Y. Cai, S. Ghose, E.F. Haratsch, Y. Luo, O. Mutlu, Error characterization, mitigation, and recovery in flash-memory-based solid-state drives. Proc. IEEE (2017a)
Google Scholar
Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, E.F. Haratsch, Vulnerabilities in MLC NAND flash memory programming: experimental analysis, exploits, and mitigation techniques, in HPCA (2017b)
Google Scholar
Y. Cai, S. Ghose, E.F. Haratsch, Y. Luo, O. Mutlu, Reliability issues in flash-memory-based solid-state drives: experimental analysis, mitigation, recovery, in Inside Solid State Drives (SSDs) (2018a)
Google Scholar
Y. Cai, S. Ghose, E.F. Haratsch, Y. Luo, O. Mutlu, Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery (2018b), arXiv:1711.11427 [cs:AR]
D.S. Cali, G.S. Kalsi, Z. Bingöl, C. Firtina, L. Subramanian, J.S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna, A. Boroumand et al., GenASM: a high-performance, low-power approximate string matching acceleration framework for genome sequence analysis, in MICRO (2020)
Google Scholar
S. Carre, M. Desjardins, A. Facon, S. Guilley, OpenSSL Bellcore’s protection helps fault attack, in DSD (2018)
Google Scholar
C.-Y. Chan, Y. E. Ioannidis, Bitmap index design and evaluation, in SIGMOD (1998)
Google Scholar
K.K. Chang, Understanding and improving the latency of DRAM-based memory systems (2016), https://www.archive.ece.cmu.edu/~safari/thesis/kchang_dissertation.pdf, slides available at https://safari.ethz.ch/safari_public_wp/wp-content/uploads/2018/12/kchang_defense_slides.pptx
K.K. Chang, Understanding and improving the latency of DRAM-based memory systems. Ph.D. Thesis (Carnegie Mellon University, 2017)
Google Scholar
K.K. Chang, D. Lee, Z. Chishti, A.R. Alameldeen, C. Wilkerson, Y. Kim, O. Mutlu, Improving DRAM performance by parallelizing refreshes with accesses, in HPCA (2014)
Google Scholar
K.K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhimenko, S. Khan, O. Mutlu, Understanding latency variation in modern DRAM chips: experimental characterization, analysis, and optimization, in SIGMETRICS (2016a), https://people.inf.ethz.ch/omutlu/pub/understanding-latency-variation-in-DRAM-chips_kevinchang_sigmetrics16-talk.pdf
K.K. Chang, P.J. Nair, D. Lee, S. Ghose, M.K. Qureshi, O. Mutlu, Low-cost inter-linked subarrays (LISA): enabling fast inter-subarray data movement in DRAM, in HPCA (2016b)
Google Scholar
K.K. Chang, A. G. Yağlıkçı, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap, D. Lee, M. O’Connor, H. Hassan, O. Mutlu, Understanding reduced-voltage operation in modern DRAM devices: experimental characterization, analysis, and mechanisms, in SIGMETRICS (2017)
Google Scholar
P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory, in ISCA (2016)
Google Scholar
C. Chou, P. Nair, M.K. Qureshi, Reducing refresh power in mobile devices with morphable ECC, in DSN (2015)
Google Scholar
L. Chua, Memristor—the missing circuit element. IEEE TCT (1971)
Google Scholar
I. Churin, A. Georgiev, A CAMAC crate controller for the IBM PC/XT family computers with built-in selftest features. Microprocess. Microprogram. (1988)
Google Scholar
R. Clapp, M. Dimitrov, K. Kumar, V. Viswanathan, T. Willhalm, Quantifying the performance impact of memory latency and bandwidth for big data workloads, in IISWC (2015)
Google Scholar
L. Cojocar, J. Kim, M. Patel, L. Tsai, S. Saroiu, A. Wolman, O. Mutlu, Are we susceptible to RowHammer? An end-to-end methodology for cloud providers, in S&P (2020)
Google Scholar
L. Cojocar, K. Razavi, C. Giuffrida, H. Bos, Exploiting correcting codes: on the effectiveness of ECC memory against RowHammer attacks, in S&P (2019)
Google Scholar
G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, H. Yang, GraphH: a processing-in-memory architecture for large-scale graph processing. IEEE TCAD (2018)
Google Scholar
W.J. Dally, Challenges for future computing systems. HiPEAC Keynote (2015)
Google Scholar
A. Das, H. Hassan, O. Mutlu, VRL-DRAM: improving DRAM performance via variable refresh latency, in DAC (2018)
Google Scholar
H. David, C. Fallin, E. Gorbatov, U.R. Hanebutte, O. Mutlu, Memory power management via dynamic voltage/frequency scaling, in 8th ACM International Conference on Autonomic Computing (2011)
Google Scholar
J. Dean, L.A. Barroso, The tail at scale. ACM Commun. (2013)
Google Scholar
Q. Deng, L. Jiang, Y. Zhang, M. Zhang, J. Yang, DrAcc: a DRAM based accelerator for accurate CNN inference, in DAC (2018)
Google Scholar
Q. Deng, D. Meisner, L. Ramos, T.F. Wenisch, R. Bianchini, Memscale: active low-power modes for main memory, in ASPLOS (2011)
Google Scholar
R.H. Dennard, Field-effect transistor memory. US Patent 3,387,286 (1968)
Google Scholar
R.H. Dennard, F.H. Gaensslen, H.-N. Yu, V.L. Rideout, E. Bassous, A.R. LeBlanc, Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE J. Solid-State Circuits (1974)
Google Scholar
P.J. Denning, T.G. Lewis, Exponential laws of computing growth. ACM Commun. (2017)
Google Scholar
F. Devaux, The true processing in memory accelerator, in Hot Chips (2019)
Google Scholar
Doty, Greenblatt, S.Y.W. Su, Magnetic bubble memory architectures for supporting associative searching of relational databases. IEEE Trans. Comput. (1980)
Google Scholar
J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C.W. Kang, I. Kim, G. Daglikoca, The architecture of the DIVA processing-in-memory chip, in SC (2002)
Google Scholar
M.P. Drumond Lages De Oliveira, A. Daglis, N. Mirzadeh, D. Ustiugov, J. Picorel Obando, B. Falsafi, B. Grot, D. Pnevmatikatos, The Mondrian data engine, in ISCA (2017)
Google Scholar
C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaaauw, R. Das, Neural cache: bit-serial in-cache acceleration of deep neural networks. in ISCA (2018)
Google Scholar
D.G. Elliott, W.M. Snelgrove, M. Stumm, Computational RAM: a memory-SIMD hybrid and its application to DSP, in CICC (1992)
Google Scholar
D. Elliott, M. Stumm, W.M. Snelgrove, C. Cojocaru, R. McKenzie, Computational RAM: implementing processors in memory. IEEE Des. Test (1999)
Google Scholar
A. Farmahini-Farahani, J.H. Ahn, K. Morrow, N.S. Kim, NDA: near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules, in HPCA (2015)
Google Scholar
FastBit: An Efficient Compressed Bitmap Index Technology (2021), https://sdm.lbl.gov/fastbit/
M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A.D. Popescu, A. Ailamaki, B. Falsafi, Clearing the clouds: a study of emerging scale-out workloads on modern hardware, in ASPLOS (2012)
Google Scholar
I. Fernandez, R. Quislant, C. Giannoula, M. Alser, J. Gomez-Luna, E. Gutierrez, O. Plata, O. Mutlu, NATSA: a near-data processing accelerator for time series analysis, in ICCD (2020)
Google Scholar
A.P. Fournaris, L. Pocero Fraile, O. Koufopavlou, Exploiting hardware vulnerabilities to attack embedded system devices: a survey of potent microarchitectural attacks. Electronics (2017)
Google Scholar
J. Friedrich, H. Le, W. Starke, J. Stuechli, B. Sinharoy, E.J. Fluhr, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, D. Hogenmiller, F. Malgioglio, R. Nett, R. Puri, P. Restle, D. Shan, Z.T. Deniz, D. Wendel, M. Ziegler, D. Victor, The POWER8TM processor: designed for big data, analytics, and cloud environments, in IEEE International Conference on IC Design Technology (2014)
Google Scholar
P. Frigo et al., Grand pwning unit: accelerating microarchitectural attacks with the GPU, in S&P (2018)
Google Scholar
P. Frigo, E. Vannacci, H. Hassan, V. van der Veen, O. Mutlu, C. Giuffrida, H. Bos, K. Razavi, TRRespass: exploiting the many sides of target row refresh, in S&P (2020)
Google Scholar
D. Fujiki, A. Subramaniyan, T. Zhang, Y. Zeng, R. Das, D. Blaauw, S. Narayanasamy, Genax: a genome sequencing accelerator, in ISCA (2018)
Google Scholar
D. Fujiki, S. Mahlke, R. Das, Duality cache for data parallel acceleration, in ISCA (2019)
Google Scholar
P.-E. Gaillardon, L. Amaru, A. Siemon et al., The programmable logic-in-memory (PLiM) computer, in DATE (2016)
Google Scholar
M. Gao, G. Ayers, C. Kozyrakis, Practical near-data processing for in-memory analytics frameworks, in PACT (2015)
Google Scholar
M. Gao, C. Kozyrakis, HRL: efficient and flexible reconfigurable logic for near-data processing, in HPCA (2016)
Google Scholar
M. Gao, J. Pu, X. Yang, M. Horowitz, C. Kozyrakis, Tetris: scalable and efficient neural network acceleration with 3D memory, in ASPLOS (2017)
Google Scholar
F. Gao, G. Tziantzioulis, D. Wentzlaff, ComputeDRAM: in-memory compute using off-the-shelf DRAMs, in MICRO (2019)
Google Scholar
GeForce GTX 745 (2021), http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-745-oem/specifications
S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, O. Mutlu, Enabling the adoption of processing-in-memory: challenges, mechanisms, future research directions (2018a) , arXiv:1802.00320 [cs:AR]
S. Ghose, A.G. Yaglikçi, R. Gupta, D. Lee, K. Kudrolli, W.X. Liu, H. Hassan, K.K. Chang, N. Chatterjee, A. Agrawal, M. O’Connor, O. Mutlu, What your DRAM power models are not telling you: lessons from a detailed experimental study, in SIGMETRICS (2018b)
Google Scholar
S. Ghose, A. Boroumand, J.S. Kim, J.Gómez-Luna, O. Mutlu, A workload and programming ease driven perspective of processing-in-memory (2019a), arXiv:1907.12947 [cs:AR]
S. Ghose, A. Boroumand, J.S. Kim, J. Gómez-Luna, O. Mutlu, Processing-in-memory: a workload-driven perspective. IBM JRD (2019b)
Google Scholar
S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, O. Mutlu, The processing-in-memory paradigm: mechanisms to enable adoption, in Beyond-CMOS Technologies for Next Generation Computer Design (2019c)
Google Scholar
S. Ghose, T. Li, N. Hajinazar, D.S. Cali, O. Mutlu, Demystifying complex workload-DRAM interactions: an experimental study, in SIGMETRICS (2019d)
Google Scholar
K. Gillespie, H.R. Fair, C. Henrion, R. Jotwani, S. Kosonocky, R.S. Orefice, D.A. Priore, J. White, K. Wilcox, 5.5 Steamroller: an x86-64 core implemented in 28 nm bulk CMOS, in ISSCC (2014)
Google Scholar
M. Gokhale, B. Holmes, K. Iobst, Processing in memory: the terasys massively parallel PIM array. IEEE Comput. (1995)
Google Scholar
A. Gondimalla, N. Chesnut, M. Thottethodi, T. Vijaykumar, Sparten: a sparse tensor accelerator for convolutional neural networks, in MICRO (2019)
Google Scholar
J.E. Gonzalez et al., PowerGraph: distributed graph-parallel computation on natural graph, in OSDI (2012)
Google Scholar
B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, Y. He, BitFunnel: revisiting signatures for search, in SIGIR (2017)
Google Scholar
Google LLC, Chrome Browser (2021), https://www.google.com/chrome/browser/
Google LLC, TensorFlow: Mobile (2021), https://www.tensorflow.org/mobile/
B. Gopireddy, J. Torrellas, Designing vertical processors in monolithic 3D, in ISCA (2019)
Google Scholar
A. Grange, P. de Rivaz, J. Hunt, VP9 Bitstream and decoding process specification (2021), http://storage.googleapis.com/downloads.webmproject.org/docs/vp9/vp9-bitstream-specification-v0.6-20160331-draft.pdf
D. Gruss, C. Maurice, S. Mangard, Rowhammer.js: a remote software-induced fault attack in JavaScript. CoRR (2015), arXiv:1507.06955
D. Gruss et al., Another flip in the wall of rowhammer defenses, in S&P (2018)
Google Scholar
B. Gu, A.S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, D. Chang, Biscuit: a framework for near-data processing of big data workloads, in ISCA (2016)
Google Scholar
Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T.M. Low, L. Pileggi, J.C. Hoe, F. Franchetti, 3D-stacked memory-side acceleration: accelerator and system design, in WoNDP (2014)
Google Scholar
N. Hajinazar, P. Patel, M. Patel, K. Kanellopoulos, S. Ghose, R. Ausavarungnirun, G.F.D. Oliveira Jr, J. Appavoo, V. Seshadri, O. Mutlu, The virtual block interface: a flexible alternative to the conventional virtual memory framework, in ISCA (2020)
Google Scholar
J. Haj-Yahya, M. Alser, J. Kim, A. G. Yaglıkçı, N. Vijaykumar, E. Rotem, O. Mutlu, SysScale: exploiting multi-domain dynamic voltage and frequency scaling for energy efficient mobile processors, in ISCA (2020a)
Google Scholar
J. Haj-Yahya, Y. Sazeides, M. Alser, E. Rotem, O. Mutlu, Techniques for reducing the connected-standby energy consumption of mobile devices, in HPCA (2020b)
Google Scholar
S. Hamdioui, L. Xie, H.A.D. Nguyen et al., Memristor based computation-in-memory architecture for data-intensive applications, in DATE (2015)
Google Scholar
S. Hamdioui, S. Kvatinsky, G. Cauwenberghs, Memristor for computing: Myth or Reality?, in DATE (2017)
Google Scholar
J.-W. Han, C.-S. Park, D.-H. Ryu, E.-S. Kim, Optical image encryption based on XOR operations. SPIE OE (1999)
Google Scholar
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: efficient inference engine on compressed deep neural network, in ISCA (2016)
Google Scholar
Harshvardhan et al., KLA: a new algorithmic paradigm for parallel graph computation, in PACT (2014)
Google Scholar
M. Hashemi, Khubaib, E. Ebrahimi, O. Mutlu, Y.N. Patt, Accelerating dependent cache misses with an enhanced memory controller, in ISCA (2016a)
Google Scholar
M. Hashemi, O. Mutlu, Y.N. Patt, Continuous runahead: transparent hardware acceleration for memory intensive workloads, in MICRO (2016b)
Google Scholar
H. Hassan, M. Patel, J.S. Kim, A.G. Yaglikci, N. Vijaykumar, N.M. Ghiasi, S. Ghose, O. Mutlu, CROW: a low-cost substrate for improving DRAM performance, energy efficiency, and reliability, in ISCA (2019)
Google Scholar
S.M. Hassan, S. Yalamanchili, S. Mukhopadhyay, Near data processing: impact and optimization of 3D memory system architecture on the uncore, in MEMSYS (2015)
Google Scholar
H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, O. Mutlu, ChargeCache: reducing DRAM latency by exploiting row access locality, in HPCA (2016)
Google Scholar
H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee, O. Ergin, O. Mutlu, SoftMC: a flexible and practical open-source infrastructure for enabling experimental DRAM studies, in HPCA (2017)
Google Scholar
K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, C.W. Fletcher, Extensor: an accelerator for sparse tensor algebra, in MICRO (2019)
Google Scholar
S. Hong, H. Chafi, E. Sedlar, K. Olukotun, Green-Marl: a DSL for easy and efficient graph analysis, in ASPLOS (2012)
Google Scholar
J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R.V.D. Wijngaart, T. Mattson, A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS, in ISSCC (2010)
Google Scholar
K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Conner, N. Vijaykumar, O. Mutlu, S. Keckler, Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems, in ISCA (2016)
Google Scholar
K. Hsieh, S. Khan, N. Vijaykumar, K.K. Chang, A. Boroumand, S. Ghose, O. Mutlu, Accelerating pointer chasing in 3D-stacked memory: challenges, mechanisms, evaluation, in ICCD (2016)
Google Scholar
Y. Huang, L. Zheng, P. Yao, J. Zhao, X. Liao, H. Jin, J. Xue, A heterogeneous PIM hardware-software co-design for energy-efficient graph processing, in IPDPS (2020)
Google Scholar
W. Hwang, W. Wan, S. Mitra, H.P. Wong, Coming up N3XT, after 2D scaling of Si CMOS, in ISCAS (2018)
Google Scholar
Hybrid Memory Cube Consortium, HMC Specification 2.0 (2014)
Google Scholar
Hybrid Memory Cube Consortium, HMC Specification 1, 1 (2013)
Google Scholar
International Technology Roadmap for Semiconductors (ITRS) (2009)
Google Scholar
Y. Jang, J. Lee, S. Lee, T. Kim, SGX-Bomb: locking down the processor via RowHammer attack, in SysTEX (2017)
Google Scholar
JEDEC, Wide I/O Single Data Rate (Wide I/O SDR), Standard No. JESD229 (2011)
Google Scholar
JEDEC, High Bandwidth Memory (HBM) DRAM, Standard No. JESD235 (2013)
Google Scholar
JEDEC, Wide I/O 2 (WideIO2), Standard No. JESD229-2 (2014)
Google Scholar
JEDEC, JESD79-5 DDR5 SDRAM standard (2020)
Google Scholar
M. Jino, J.W.S. Liu, Intelligent magnetic bubble memories, in ISCA (1978)
Google Scholar
A. Jog, O. Kayiran, N.C. Nachiappan, A.K. Mishra, M.T. Kandemir, O. Mutlu, R. Iyer, C.R. Das, OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance, in ASPLOS (2013a)
Google Scholar
A. Jog, O. Kayiran, A.K. Mishra, M.T. Kandemir, O. Mutlu, R. Iyer, C.R. Das, Orchestrated scheduling and prefetching for GPGPUs, in ISCA (2013b)
Google Scholar
A. Jog, O. Kayiran, A. Pattnaik, M.T. Kandemir, O. Mutlu, R. Iyer, C.R. Das, Exploiting core criticality for enhanced GPU performance, in SIGMETRICS (2016)
Google Scholar
R. Jotwani, S. Sundaram, S. Kosonocky, A. Schaefer, V. Andrade, G. Constant, A. Novak, S. Naffziger, An x86-64 core implemented in 32 nm SOI CMOS, in ISSCC (2010)
Google Scholar
K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi, S. Koppula, N. Mansouri Ghiasi, T. Shahroodi, J. Gomez-Luna, O. Mutlu, SMASH: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations, in MICRO (2019)
Google Scholar
S. Kanev, J.P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, D. Brooks, Profiling a warehouse-scale computer, in ISCA (2015)
Google Scholar
H. Kang, S. Hong, One-transistor type DRAM. US Patent 7701751 (2009)
Google Scholar
Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, J. Torrellas, FlexRAM: toward an advanced intelligent memory system, in ICCD (1999)
Google Scholar
M. Kang, M.-S. Keel, N.R. Shanbhag, S. Eilert, K. Curewitz, An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM, in ICASSP (2014a)
Google Scholar
U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, J. Choi, Co-architecting controllers and DRAM to enhance DRAM process scaling, in The Memory Forum (2014b)
Google Scholar
S. Kaxiras, R. Sugumar, Distributed vector architecture: beyond a single vector-IRAM, in First Workshop on Mixing Logic and DRAM: Chips that Compute and Remember (1997)
S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the future of parallel computing. IEEE Micro (2011)
K. Keeton, D.A. Patterson, J.M. Hellerstein, A case for intelligent disks (IDISKs). SIGMOD Rec. (1998)
G. Kestor, R. Gioiosa, D.J. Kerbyson, A. Hoisie, Quantifying the energy cost of data movement in scientific applications, in IISWC (2013)
S. Khan, A.R. Alameldeen, C. Wilkerson, O. Mutlu, D.A. Jimenez, Improving cache performance using read-write partitioning, in HPCA (2014a)
S. Khan, D. Lee, Y. Kim, A.R. Alameldeen, C. Wilkerson, O. Mutlu, The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study, in SIGMETRICS (2014b)
S. Khan, D. Lee, O. Mutlu, PARBOR: an efficient system-level technique to detect data dependent failures in DRAM, in DSN (2016a)
S. Khan, C. Wilkerson, D. Lee, A.R. Alameldeen, O. Mutlu, A case for memory content-based detection and mitigation of data-dependent failures in DRAM. CAL (2016b)
S. Khan, C. Wilkerson, Z. Wang, A. Alameldeen, D. Lee, O. Mutlu, Detecting and mitigating data-dependent DRAM failures by exploiting current memory content, in MICRO (2017)
Y. Kim, Flipping bits in memory without accessing them. DRAM disturbance errors (2014), https://people.inf.ethz.ch/omutlu/pub/dram-row-hammer_kim_talk_isca14.pdf, conference talk at ISCA 2014
Y. Kim, Architectural techniques to enhance DRAM scaling. Ph.D. Thesis (Carnegie Mellon University, 2015)
K. Kim, J. Lee, A new investigation of data retention time in truly nanoscaled DRAMs. IEEE Electron Device Lett. (2009)
Y. Kim, D. Han, O. Mutlu, M. Harchol-Balter, ATLAS: a scalable and high-performance scheduling algorithm for multiple memory controllers, in HPCA (2010a)
Y. Kim, M. Papamichael, O. Mutlu, M. Harchol-Balter, Thread cluster memory scheduling: exploiting differences in memory access behavior, in MICRO (2010b)
Y. Kim, V. Seshadri, D. Lee, J. Liu, O. Mutlu, A case for exploiting subarray-level parallelism (SALP) in DRAM, in ISCA (2012)
Y. Kim, R. Daly, J. Kim, C. Fallin, J.H. Lee, D. Lee, C. Wilkerson, K. Lai, O. Mutlu, Flipping bits in memory without accessing them: an experimental study of DRAM disturbance errors, in ISCA (2014a)
H. Kim, D. De Niz, B. Andersson, M. Klein, O. Mutlu, R. Rajkumar, Bounding memory interference delay in COTS-based multi-core systems, in RTAS (2014b)
Y. Kim, W. Yang, O. Mutlu, Ramulator: a fast and extensible DRAM simulator. CAL (2015)
H. Kim, D. De Niz, B. Andersson, M. Klein, O. Mutlu, R. Rajkumar, Bounding and reducing memory interference in COTS-based multi-core systems. Real-Time Systems (2016a)
D. Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay, Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory, in ISCA (2016b)
J.S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, O. Mutlu, GRIM-Filter: fast seed filtering in read mapping using emerging memory technologies (2017a), arXiv:1708.04329 [q-bio.GN]
G. Kim, N. Chatterjee, M. O’Connor, K. Hsieh, Toward standardized near-data processing with unrestricted data placement for GPUs, in SC (2017b)
J.S. Kim, The DRAM latency PUF: quickly evaluating physical unclonable functions by exploiting the latency–reliability tradeoff in modern commodity DRAM devices (2018a), https://people.inf.ethz.ch/omutlu/pub/dram-latency-puf_hpca18_talk.pdf, conference talk at HPCA 2018
J. Kim, M. Patel, H. Hassan, O. Mutlu, Solar-DRAM: reducing DRAM access latency by exploiting the variation in local bitlines, in ICCD (2018b)
J.S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, O. Mutlu, GRIM-Filter: fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics (2018c)
J. Kim, M. Patel, H. Hassan, L. Orosa, O. Mutlu, D-RaNGe: using commodity DRAM devices to generate true random numbers with low latency and high throughput, in HPCA (2019), https://people.inf.ethz.ch/omutlu/pub/drange-dram-latency-based-true-random-number-generator_hpca19-talk.pdf, conference talk at HPCA 2019
M. Kim, J. Park, G. Cho, Y. Kim, L. Orosa, O. Mutlu, J. Kim, Evanesco: architectural support for efficient data sanitization in modern flash-based storage systems, in ASPLOS (2020a)
J.S. Kim, M. Patel, A.G. Yağlıkçı, H. Hassan, R. Azizi, L. Orosa, O. Mutlu, Revisiting RowHammer: an experimental analysis of modern DRAM devices and mitigation techniques, in ISCA (2020b)
D.E. Knuth, The Art of Computer Programming, vol. 4 Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams (2009)
P.M. Kogge, EXECUBE–a new architecture for scaleable MPPs, in ICPP (1994)
S. Koppula, L. Orosa, A.G. Yağlıkçı, R. Azizi, T. Shahroodi, K. Kanellopoulos, O. Mutlu, EDEN: enabling energy-efficient, high-performance deep neural network inference using approximate DRAM, in MICRO (2019)
K. Korgaonkar, R. Ronen, A. Chattopadhyay, S. Kvatinsky, The bitlet model: defining a litmus test for the bitwise processing-in-memory paradigm (2019), arXiv:1910.10234
T.S. Kuhn, The Structure of Scientific Revolutions (2012)
E. Kültürsay, M. Kandemir, A. Sivasubramaniam, O. Mutlu, Evaluating STT-RAM as an energy-efficient main memory alternative, in ISPASS (2013)
R. Kumar, G. Hinton, A family of 45 nm IA processors, in ISSCC (2009)
S. Kvatinsky, A. Kolodny, U.C. Weiser, E.G. Friedman, Memristor-based IMPLY logic design procedure, in ICCD (2011)
S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, MAGIC-Memristor-Aided Logic, Express Briefs (IEEE TCAS II, 2014a)
S. Kvatinsky, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, Memristor-based material implication (IMPLY) logic: design principles and methodologies, in TVLSI (2014b)
N. Kwak, S.-H. Kim, K.H. Lee, C.-K. Baek, M.S. Jang, Y. Joo, S.-H. Lee, W.Y. Lee, E. Lee, D. Han et al., 23.3 A 4.8 Gb/s/pin 2Gb LPDDR4 SDRAM with sub-100 μA self-refresh current for IoT applications, in ISSCC (2017)
H.-J. Kwon, E. Seo, C.-Y. Lee, Y.-H. Seo, G.-H. Han, H.-R. Kim, J.-H. Lee, M.-S. Jang, S.-G. Do, S.-H. Cho et al., 23.4 An extremely low-standby-power 3.733 Gb/s/pin 2Gb LPDDR4 SDRAM for wearable devices, in ISSCC (2017)
D. Lee, Reducing DRAM latency at low cost by exploiting heterogeneity. Ph.D. Thesis (Carnegie Mellon University, 2016)
B.C. Lee, E. Ipek, O. Mutlu, D. Burger, Architecting phase change memory as a scalable DRAM alternative, in ISCA (2009)
B.C. Lee, E. Ipek, O. Mutlu, D. Burger, Phase change memory architecture and the quest for scalability. CACM (2010a)
B.C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, D. Burger, Phase-change technology and the future of main memory. IEEE Micro (2010b)
D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, O. Mutlu, Tiered-latency DRAM: a low latency and low cost DRAM architecture, in HPCA (2013)
D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, O. Mutlu, Adaptive-latency DRAM: optimizing DRAM timing for the common-case, in HPCA (2015a)
J.H. Lee, J. Sim, H. Kim, BSSync: processing near memory for machine learning workloads with bounded staleness consistency models, in PACT (2015b)
D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, O. Mutlu, Decoupled direct memory access: isolating CPU and IO traffic by leveraging a dual-data-port DRAM, in PACT (2015c)
D. Lee, S. Ghose, G. Pekhimenko, S. Khan, O. Mutlu, Simultaneous multi-layer access: improving 3D-stacked memory bandwidth at low cost. TACO (2016)
D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Seshadri, O. Mutlu, Design-induced latency variation in modern DRAM chips: characterization, analysis, and latency reduction mechanisms, in SIGMETRICS (2017)
J.-B. Lee, Green Memory Solution (Investor’s Forum, Samsung Electronics, 2021)
C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, T.W. Keller, Energy management for commercial servers. Computer (2003)
Y. Levy, J. Bruck, Y. Cassuto, E.G. Friedman, A. Kolodny, E. Yaakobi, S. Kvatinsky, Logic operations in memory using a memristive Akers array. Microelectron. J. (2014)
H. Li, Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics (2018)
H. Li, R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics (2009)
S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: a processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories, in DAC (2016)
Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, O. Mutlu, Utility-based hybrid memory management, in CLUSTER (2017a)
S. Li, D. Niu, K.T. Malladi, H. Zheng, B. Brennan, Y. Xie, DRISA: A DRAM-based reconfigurable in-situ accelerator, in MICRO (2017b)
C. Li, R. Ausavarungnirun, C.J. Rossbach, Y. Zhang, O. Mutlu, Y. Guo, J. Yang, A framework for memory oversubscription management in graphics processing units, in ASPLOS (2019)
Y. Li, J.M. Patel, BitWeaving: fast scans for main memory data processing, in SIGMOD (2013)
K. Lim, J. Chang, T. Mudge, P. Ranganathan, S.K. Reinhardt, T.F. Wenisch, Disaggregated memory for expansion and sharing in blade servers, in ISCA (2009)
M. Lipp et al., Nethammer: inducing Rowhammer faults through network requests (2018), arxiv.org
J. Liu, RAIDR: retention-aware intelligent DRAM refresh (2012), https://people.inf.ethz.ch/omutlu/pub/liu_isca12_talk.pdf, conference talk at ISCA 2012
Z. Liu, I. Calciu, M. Herlihy, O. Mutlu, Concurrent data structures for near-memory computing, in SPAA (2017)
J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, O. Mutlu, An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms, in ISCA (2013)
J. Liu, B. Jaiyen, R. Veras, O. Mutlu, RAIDR: retention-aware intelligent DRAM refresh, in ISCA (2012)
X. Liu, D. Roberts, R. Ausavarungnirun, O. Mutlu, J. Zhao, Binary star: coordinated reliability in heterogeneous memory systems for high performance and scalability, in MICRO (2019)
G.H. Loh, 3D-stacked memory architectures for multi-core processors, in ISCA (2008)
G.H. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts, M. Meswani, D.P. Zhang, M. Ignatowski, A processing in memory taxonomy and a case for studying fixed-function PIM, in WoNDP (2013)
Y. Long, T. Na, S. Mukhopadhyay, ReRAM-based processing-in-memory architecture for recurrent neural network acceleration, in TVLSI (2018)
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, J.M. Hellerstein, Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endowment (2012)
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J.M. Hellerstein, GraphLab: a new framework for parallel machine learning (2010), arXiv:1006.4990 [cs:LG]
S.-L. Lu, Y.-C. Lin, C.-L. Yang, Improving DRAM latency with dynamic asymmetric subarray, in MICRO (2015)
Y. Luo, Architectural techniques for improving NAND flash memory reliability. Ph.D. Thesis (Carnegie Mellon University, 2018)
Y. Luo, Y. Cai, S. Ghose, J. Choi, O. Mutlu, WARM: improving NAND flash memory lifetime with write-hotness aware retention management, in MSST (2015)
Y. Luo, S. Ghose, Y. Cai, E.F. Haratsch, O. Mutlu, Enabling accurate and practical online flash channel modeling for modern MLC NAND flash memory. JSAC (2016)
Y. Luo, S. Ghose, Y. Cai, E.F. Haratsch, O. Mutlu, HeatWatch: improving 3D NAND flash memory device reliability by exploiting self-recovery and temperature awareness, in HPCA (2018a)
Y. Luo, S. Ghose, Y. Cai, E.F. Haratsch, O. Mutlu, Improving 3D NAND flash memory lifetime by tolerating early retention loss and process variation, in SIGMETRICS (2018b)
Y. Luo, S. Ghose, T. Li, S. Govindan, B. Sharma, B. Kelly, A. Boroumand, O. Mutlu, Using ECC DRAM to adaptively increase memory capacity (2017), arXiv:1706.08870 [cs:AR]
Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu, B. Khessib, K. Vaid, O. Mutlu, Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory, in DSN (2014)
H. Luo, T. Shahroodi, H. Hassan, M. Patel, A.G. Yağlıkçı, L. Orosa, J. Park, O. Mutlu, CLR-DRAM: a low-cost DRAM architecture enabling dynamic capacity-latency trade-off, in ISCA (2020)
K. Mai, T. Paaske, N. Jayasena, R. Ho, W.J. Dally, M. Horowitz, Smart memories: a modular reconfigurable architecture, in ISCA (2000)
G. Malewicz, M.H. Austern, A.J. Bik, J.C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: a system for large-scale graph processing, in SIGMOD (2010)
S.A. Manavski, CUDA compatible GPU as an efficient hardware accelerator for AES cryptography, in ICSPC (2007)
J.A. Mandelman, R.H. Dennard, G.B. Bronner, J.K. DeBrosse, R. Divakaruni, Y. Li, C.J. Radens, Challenges and Future Directions for the Scaling of Dynamic Random-Access Memory (DRAM) (IBM JRD, 2002)
S.A. McKee, Reflections on the memory wall, in CF (2004)
Memcached: A High Performance, Distributed Memory Object Caching System (2021), http://memcached.org
J. Meza, J. Chang, H. Yoon, O. Mutlu, P. Ranganathan, Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management. CAL (2012)
J. Meza, Q. Wu, S. Kumar, O. Mutlu, Revisiting memory errors in large-scale production data centers: analysis and modeling of new trends from the field, in DSN (2015)
Micron Technology Inc., ECC brings reliability and power efficiency to mobile devices. Technical Report (2017)
Micron, DDR4 SDRAM Datasheet (2021), p. 380
S. Mitra, Abundant-data computing: The N3XT 1,000X, in VLSI-TSA (2018)
S. Mitra, From nanodevices to nanosystems: the N3XT information technology, in E3S (2015)
A. Morad, L. Yavits, R. Ginosar, GP-SIMD processing-in-memory. ACM TACO (2015)
T. Moscibroda, O. Mutlu, Memory performance attacks: denial of memory service in multi-core systems, in USENIX Security (2007)
S.P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, T. Moscibroda, Reducing memory interference in multicore systems via application-aware memory channel partitioning, in MICRO (2011)
O. Mutlu, An experimental study of data retention behavior in modern DRAM devices. Implications for retention time profiling mechanisms (2013a), https://people.inf.ethz.ch/omutlu/pub/mutlu_isca13_talk.pdf, conference talk at ISCA 2013
O. Mutlu, Memory scaling: a systems architecture perspective, in IMW (2013b)
O. Mutlu, Processing Data Where It Makes Sense: Enabling In-Memory Computation (2017), https://people.inf.ethz.ch/omutlu/pub/onur-MST-Keynote-EnablingInMemoryComputation-October-27-2017-unrolled-FINAL.pptx, keynote talk at MST
O. Mutlu, RowHammer, in Top Picks in Hardware and Embedded Security (2018)
O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-GWU-EnablingInMemoryComputation-February-15-2019-unrolled-FINAL.pptx, video available at https://www.youtube.com/watch?v=oHqsNbxgdzM, distinguished lecture at George Washington University (2019)
O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation (2019b), https://people.inf.ethz.ch/omutlu/pub/onur-ICCD-Keynote-EnablingInMemoryComputation-November-19-2019-unrolled.pptx, video available at https://www.youtube.com/watch?v=njX_14584Jw, keynote talk at 37th IEEE International Conference on Computer Design (ICCD), Abu Dhabi, UAE, 19 November 2019
O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation (2019c), https://people.inf.ethz.ch/omutlu/pub/onur-GLSVLSI-KeynoteTalk-EnablingInMemoryComputation-May-10-2019-unrolled.pptx, keynote talk at 29th ACM Great Lakes Symposium on VLSI (GLSVLSI), Washington, DC, USA, May 2019
O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation (2019d), https://people.inf.ethz.ch/omutlu/pub/onur-APPT-Keynote-EnablingInMemoryComputation-August-16-2019-unrolled.pptx, video available at https://www.youtube.com/watch?v=K0OcjxVVhEw, keynote talk at International Symposium on Advanced Parallel Processing Technology (APPT), Tianjin, China, 16 August 2019
O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation (2019e), https://www.people.inf.ethz.ch/omutlu/pub/onur-ISSCC2019-talk.pptx, invited talk at ISSCC Special Forum on Intelligence at the Edge: How Can We Make Machine Learning More Energy Efficient?, International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, February 2019
O. Mutlu, Accelerating Genome Analysis: A Primer on an Ongoing Journey (2019f), https://people.inf.ethz.ch/omutlu/pub/onur-AcceleratingGenomeAnalysis-AACBB-Keynote-Feb-16-2019-FINAL.pptx, video available at https://www.youtube.com/watch?v=hPnSmfwu2-A, keynote talk at 2nd Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB), Washington, DC, USA, February 2019
O. Mutlu, Intelligent Architectures for Intelligent Machines (2020a), https://people.inf.ethz.ch/omutlu/pub/intelligent-architectures-for-intelligent-machines_keynote-paper_VLSI20.pdf
O. Mutlu, Intelligent Architectures for Intelligent Machines (2020b), https://people.inf.ethz.ch/omutlu/pub/onur-NSF-PIM-KeynoteTalk-IntelligentArchitecturesForIntelligentMachines-October-26-2020-final.pptx, video available at https://www.youtube.com/watch?v=2N-Knx6DHW8, keynote talk at National Science Foundation Workshop on Processing-In-Memory Technology (NSF-PIM), Virtual, 26 October 2020
O. Mutlu, The RowHammer problem and other issues we may face as memory becomes denser, in DATE (2017)
O. Mutlu, S. Ghose, R. Ausavarungnirun, Recent advances in DRAM and flash memory architectures. Invited J. Issue IPSI Trans. Internet Res. (2018)
O. Mutlu et al., Processing data where it makes sense: enabling in-memory computation. MicPro (2019a)
O. Mutlu, S. Ghose, J. Gómez-Luna, R. Ausavarungnirun, Enabling practical processing in and near memory for data-intensive computing, in DAC (2019b)
O. Mutlu, H. Kim, Y.N. Patt, Address-value delta (AVD) prediction: a hardware technique for efficiently parallelizing dependent cache misses. IEEE Trans. Comput. (2006)
O. Mutlu, T. Moscibroda, Stall-time fair memory access scheduling for chip multiprocessors, in MICRO (2007)
O. Mutlu, T. Moscibroda, Parallelism-aware batch scheduling: enhancing both performance and fairness of shared DRAM systems, in ISCA (2008)
O. Mutlu, L. Subramanian, Research problems and opportunities in memory systems. SUPERFRI (2014)
O. Mutlu, J.S. Kim, RowHammer: a retrospective. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2020)
MySQL: An Open Source Database (2021), http://www.mysql.com
H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, J. Tschanz, STT-RAM scaling and retention failure. Intel Technol. J. (2013)
L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, H. Kim, GraphPIM: enabling instruction-level PIM offloading in graph computing frameworks, in HPCA (2017)
V. Narasiman, C.J. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, Y.N. Patt, Improving GPU performance via large warps and two-level warp scheduling, in MICRO (2011)
T.-Y. Oh, H. Chung, J.-Y. Park, K.-W. Lee, S. Oh, S.-Y. Doo, H.-J. Kim, C. Lee, H.-R. Kim, J.-H. Lee et al., A 3.2 Gbps/pin 8 Gbit 1.0 V LPDDR4 SDRAM with integrated ECC engine for sub-1 V DRAM core operation. IEEE J. Solid-State Circuits (2014)
G.F. Oliveira, J. Gomez-Luna, L. Orosa, S. Ghose, N. Vijaykumar, I. Fernandez, M. Sadrosadati, O. Mutlu, A new methodology and open-source benchmark suite for evaluating data movement bottlenecks: a near-data processing case study. IEEE Access (2021)
E. O’Neil, P. O’Neil, K. Wu, Bitmap index design choices and their performance implications, in IDEAS (2007)
M. Oskin, F.T. Chong, T. Sherwood, Active pages: a computation model for intelligent memory, in ISCA (1998)
J.K. Ousterhout, Why aren’t operating systems getting faster as fast as hardware?, in USENIX STC (1990)
L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the web. Technical report (Stanford InfoLab, 1999)
D. Pandiyan, C.-J. Wu, Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms, in IISWC (2014)
M.S. Papamarcos, J.H. Patel, A low-overhead coherence solution for multiprocessors with private cache memories, in ISCA (1984)
M. Patel, J.S. Kim, O. Mutlu, The reach profiler (REAPER): enabling the mitigation of DRAM retention failures via profiling at aggressive conditions, in ISCA (2017)
M. Patel, J.S. Kim, H. Hassan, O. Mutlu, Understanding and modeling on-die error correction in modern DRAM: an experimental study using real devices, in DSN (2019)
M. Patel, J.S. Kim, T. Shahroodi, H. Hassan, O. Mutlu, Bit-exact ECC recovery (BEER): determining DRAM on-die ECC functions by exploiting DRAM data retention characteristics, in MICRO (2020)
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, A case for intelligent RAM. IEEE Micro (1997)
A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A.K. Mishra, M.T. Kandemir, O. Mutlu, C.R. Das, Scheduling techniques for GPU architectures with processing-in-memory capabilities, in PACT (2016)
I. Paul, W. Huang, M. Arora, S. Yalamanchili, Harmonia: balancing compute and memory power in high-performance GPUs, in ISCA (2015)
G. Pekhimenko, T.C. Mowry, O. Mutlu, Linearly compressed pages: a main memory compression framework with low complexity and low latency, in PACT (2012)
G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P.B. Gibbons, M.A. Kozuch, T.C. Mowry, Linearly compressed pages: a low-complexity, low-latency main memory compression framework, in MICRO (2013)
P. Pessl, D. Gruss, C. Maurice, M. Schwarz, S. Mangard, DRAMA: exploiting DRAM addressing for cross-CPU attacks, in USENIX Security (2016)
D. Poddebniak, J. Somorovsky, S. Schinzel, M. Lochter, P. Rösler, Attacking deterministic signature schemes using fault attacks, in EuroS&P (2018)
J. Power, J. Hestness, M.S. Orr, M.D. Hill, D.A. Wood, gem5-gpu: a heterogeneous CPU-GPU simulator. CAL (2015)
S.H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, F. Li, NDC: analyzing the impact of 3D-stacked memory+logic devices on mapreduce workloads, in ISPASS (2014)
R. Qiao, M. Seaborn, A new approach for rowhammer attacks, in HOST (2016)
M.K. Qureshi, A. Jaleel, Y.N. Patt, S.C. Steely Jr., J. Emer, Adaptive insertion policies for high-performance caching, in ISCA (2007a)
M.K. Qureshi, M.A. Suleman, Y.N. Patt, Line distillation: increasing cache capacity by filtering unused words in cache lines, in HPCA (2007b)
M.K. Qureshi, D.H. Kim, S. Khan, P.J. Nair, O. Mutlu, AVATAR: a variable-retention-time (VRT) aware refresh for DRAM systems, in DSN (2015)
M.K. Qureshi, D.N. Lynch, O. Mutlu, Y.N. Patt, A case for MLP-aware cache replacement, in ISCA (2006)
M.K. Qureshi, V. Srinivasan, J.A. Rivers, Scalable high performance main memory system using phase-change memory technology, in ISCA (2009)
L.E. Ramos, E. Gorbatov, R. Bianchini, Page placement in hybrid memory systems, in ICS (2011)
K. Razavi, B. Gras, E. Bosman, B. Preneel, C. Giuffrida, H. Bos, Flip Feng Shui: hammering a needle in the software stack, in USENIX Security (2016)
S.H.S. Rezaei, M. Modarressi, R. Ausavarungnirun, M. Sadrosadati, O. Mutlu, M. Daneshtalab, NoM: network-on-memory for inter-bank data transfer in highly-banked memories. CAL (2020)
D. Rich, A. Bartolo, C. Gilardo, B. Le, H. Li, R. Park, R.M. Radway, M.M. Sabry Aly, H.-S.P. Wong, S. Mitra, Heterogeneous 3D nano-systems: the N3XT approach? (2020)
E. Riedel, G. Gibson, C. Faloutsos, Active storage for large-scale data mining and multimedia applications, in VLDB (1998)
M. Rosenblum et al., The impact of architectural trends on operating system performance, in SOSP (1995)
C.D. Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. Aberger, K. Olukotun, C. Re, High-accuracy low-precision training (2018)
M.M. Sabry Aly, M. Gao, G. Hills, C. Lee, G. Pitner, M.M. Shulaker, T.F. Wu, M. Asheghi, J. Bokor, F. Franchetti, K.E. Goodson, C. Kozyrakis, I. Markov, K. Olukotun, L. Pileggi, E. Pop, J. Rabaey, C. Ré, H.P. Wong, S. Mitra, Energy-efficient abundant-data computing: the N3XT 1,000x. Computer (2015)
M.M. Sabry Aly, T.F. Wu, A. Bartolo, Y.H. Malviya, W. Hwang, G. Hills, I. Markov, M. Wootters, M.M. Shulaker, H.P. Wong, S. Mitra, The N3XT approach to energy-efficient abundant-data computing. Proc. IEEE (2019)
F. Sadi, J. Sweeney, T.M. Low, J.C. Hoe, L. Pileggi, F. Franchetti, Efficient SPMV operation for large and highly sparse matrices using scalable multi-way merge parallelization, in MICRO (2019)
SAFARI Research Group, Ramulator: a DRAM simulator–GitHub repository (2021a), https://github.com/CMU-SAFARI/ramulator/
SAFARI Research Group, Ramulator-PIM: a processing-in-memory simulation framework–GitHub repository (2021b), https://github.com/CMU-SAFARI/ramulator-pim
SAFARI Research Group, RowHammer–GitHub repository (2021c), https://github.com/CMU-SAFARI/rowhammer/
SAFARI Research Group, SoftMC v1.0–GitHub repository (2021d), https://github.com/CMU-SAFARI/SoftMC/
S. Salihoglu, J. Widom, GPS: a graph processing system, in SSDBM (2013)
D. Sanchez, C. Kozyrakis, ZSim: fast and accurate microarchitectural simulation of thousand-core systems, in ISCA (2013)
F. Schuiki, M. Schaffner, F.K. Gürkaynak, L. Benini, A scalable near-memory architecture for training deep neural networks on large in-memory datasets (2018)
M. Seaborn, T. Dullien, Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges (2015), http://googleprojectzero.blogspot.com.tr/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
M. Seaborn, T. Dullien, Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges, BlackHat (2016)
D. Senol, J. Kim, S. Ghose, C. Alkan, O. Mutlu, Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions, in Briefings in Bioinformatics (BIB) (2018)
V. Seshadri, Simple DRAM and virtual memory abstractions to enable highly efficient memory systems. Ph.D. Thesis (Carnegie Mellon University, 2016)
V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Fast bulk bitwise AND and OR in DRAM. CAL (2015a)
V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P.B. Gibbons, M.A. Kozuch, T.C. Mowry, Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses, in MICRO (2015b)
V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, M.A. Kozuch, P.B. Gibbons, T.C. Mowry, RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization, in MICRO (2013)
V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology, in MICRO (2017)
V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Buddy-RAM: improving the performance and efficiency of bulk bitwise operations using DRAM (2016), arXiv:1611.09988 [cs:AR]
V. Seshadri, O. Mutlu, The processing using memory paradigm: in-DRAM bulk copy, initialization, bitwise AND and OR (2016), arXiv:1610.09603 [cs:AR]
V. Seshadri, O. Mutlu, Simple operations in memory to reduce data movement, in Advances in Computers, vol. 106 (2017)
V. Seshadri, O. Mutlu, In-DRAM bulk bitwise execution engine (2020)
V. Seshadri, O. Mutlu, M.A. Kozuch, T.C. Mowry, The evicted-address filter: a unified mechanism to address both cache pollution and thrashing, in PACT (2012)
A. Shafiee, A. Nag, N. Muralimanohar et al., ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars, in ISCA (2016)
D.E. Shaw, S.J. Stolfo, H. Ibrahim, B. Hillyer, G. Wiederhold, J. Andrews, The NON-VON database machine: a brief overview. IEEE Database Eng. Bull. (1981)
J. Shun, G.E. Blelloch, Ligra: a lightweight graph processing framework for shared memory, in PPoPP (2013)
G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna, S. Stuijk, O. Mutlu, H. Corporaal, NERO: a near high-bandwidth memory stencil accelerator for weather prediction modeling, in FPL (2020)
G. Singh, J. Gomez-Luna, G. Mariani, G.F. Oliveira, S. Corda, S. Stuijk, O. Mutlu, H. Corporaal, NAPEL: near-memory computing application performance prediction via ensemble learning, in DAC (2019)
T. Singh, S. Rangarajan, D. John, C. Henrion, S. Southard, H. McIntyre, A. Novak, S. Kosonocky, R. Jotwani, A. Schaefer, E. Chang, J. Bell, M. Co, 3.2 Zen: a next-generation high-performance x86 core, in ISSCC (2017)
S. Song, A. Das, O. Mutlu, N. Kandasamy, Improving phase change memory performance with data content aware access, in ISMM (2020)
H.S. Stone, A logic-in-memory computer. IEEE Trans. Comput. (1970)
D.B. Strukov, G.S. Snider, D.R. Stewart, R.S. Williams, The missing memristor found. Nature (2008)
L. Subramanian, Providing high and controllable performance in multicore systems through shared resource management. Ph.D. Thesis (Carnegie Mellon University, 2015)
L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, O. Mutlu, The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory, in MICRO (2015)
L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, O. Mutlu, MISE: providing performance predictability and improving fairness in shared main memory systems, in HPCA (2013)
Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave, C. Bertolli, S. Antao, J. Brunheroto, Y. Park, K. O’Brien, R. Nair, Data access optimization in a processing-in-memory system, in CF (2015)
A. Tatar et al., Throwhammer: Rowhammer attacks over the network and defenses, in USENIX ATC (2018a)
A. Tatar, C. Giuffrida, H. Bos, K. Razavi, Defeating software mitigations against Rowhammer: a surgical precision hammer, in RAID (2018b)
A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, O. Mutlu, MQSim: a framework for enabling realistic studies of modern multi-queue SSD devices, in FAST (2018a)
A. Tavakkol, M. Sadrosadati, S. Ghose, J. Kim, Y. Luo, Y. Wang, N.M. Ghiasi, L. Orosa, J. Gómez-Luna, O. Mutlu, FLIN: enabling fairness and enhancing performance in modern NVMe solid state drives, in ISCA (2018b)
Y. Tian, A. Balmin, S.A. Corsten, S. Tatikonda, J. McPherson, From “Think Like a Vertex” to “Think Like a Graph”. VLDB Endowment (2013)
Y. Turakhia, G. Bejerano, W.J. Dally, Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly, in ASPLOS (2018)
P. Tuyls, H.D.L. Hollmann, J.H.V. Lint, L. Tolhuizen, XOR-based visual cryptography schemes. Designs, Codes and Cryptography (2021)
Y. Umuroglu, D. Morrison, M. Jahre, Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform, in FPL (2015)
UPMEM, Introduction to UPMEM PIM. Processing-in-memory (PIM) on DRAM accelerator (2018)
H. Usui, L. Subramanian, K. Chang, O. Mutlu, DASH: Deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators, in TACO (2016)
V. van der Veen, Y. Fratantonio, M. Lindorfer, D. Gruss, C. Maurice, G. Vigna, H. Bos, K. Razavi, C. Giuffrida, Drammer: deterministic Rowhammer attacks on mobile platforms, in CCS (2016)
N. Vijaykumar, A. Jain, D. Majumdar, K. Hsieh, G. Pekhimenko, E. Ebrahimi, N. Hajinazar, P.B. Gibbons, O. Mutlu, A case for richer cross-layer abstractions: bridging the semantic gap with expressive memory, in ISCA (2018a)
N. Vijaykumar, E. Ebrahimi, K. Hsieh, P.B. Gibbons, O. Mutlu, The locality descriptor: a holistic cross-layer abstraction to express data locality in GPUs, in ISCA (2018b)
N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog, P.B. Gibbons, O. Mutlu, Zorua: a holistic approach to resource virtualization in GPUs, in MICRO (2016)
N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T.C. Mowry, O. Mutlu, A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps, in ISCA (2015)
Y. Wang, L. Orosa, X. Peng, Y. Guo, S. Ghose, M. Patel, J.S. Kim, J.G. Luna, M. Sadrosadati, N.M. Ghiasi et al., FIGARO: improving system performance via fine-grained in-DRAM data relocation and caching, in MICRO (2020)
Y. Wang, A. Tavakkol, L. Orosa, S. Ghose, N. Mansouri Ghiasi, M. Patel, J.S. Kim, H. Hassan, M. Sadrosadati, O. Mutlu, Reducing DRAM latency via charge-level-aware look-ahead partial restoration, in MICRO (2018)
L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, B. Qiu, BigDataBench: a big data benchmark suite from internet services, in HPCA (2014)
M. Ware, K. Rajamani, M. Floyd, B. Brock, J. C. Rubio, F. Rawson, J. B. Carter, Architecting for power management: the IBM® POWER7™ approach, in HPCA (2010)
H.S. Warren, Hacker’s Delight, 2nd ed. (Addison-Wesley Professional, 2012)
M.V. Wilkes, The memory gap and the future of high performance memories. CAN (2001)
H.-S.P. Wong, S. Raoux, S. Kim, J. Liang, J.P. Reifenberg, B. Rajendran, M. Asheghi, K.E. Goodson, Phase change memory. Proc. IEEE (2010)
H.-S.P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F.T. Chen, M.-J. Tsai, Metal-oxide RRAM. Proc. IEEE (2012)
S. Wu, U. Manber, Fast text searching: allowing errors. Commun. ACM (1992)
K. Wu, E.J. Otoo, A. Shoshani, Compressing bitmap indexes for faster search operations, in SSDBM (2002)
W.A. Wulf, S.A. McKee, Hitting the memory wall: implications of the obvious. CAN (1995)
S.L. Xi, O. Babarinsa, M. Athanassoulis, S. Idreos, Beyond the wall: near-data processing for databases, in DaMoN (2015)
Y. Xiao et al., One bit flips, one cloud flops: cross-VM Row Hammer attacks and privilege escalation, in USENIX Sec. (2016)
L. Xie, H.A.D. Nguyen, M. Taouil et al., Fast Boolean logic mapped on memristor crossbar, in ICCD (2015)
H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, O. Mutlu, Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics (2015)
H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, C. Alkan, Accelerating read mapping with FastHASH. BMC Genom. (2013)
X. Xin, Y. Zhang, J. Yang, ELP2IM: Efficient and low power bitwise operation processing in DRAM, in HPCA (2020)
Q. Xu, H. Jeon, M. Annavaram, Graph processing on GPUs: where are the bottlenecks?, in IISWC (2014)
J. Xue, Z. Yang, Z. Qu, S. Hou, Y. Dai, Seraph: an efficient, low-cost system for concurrent graph processing, in HPDC (2014)
A. Yasin, Y. Ben-Asher, A. Mendelson, Deep-dive analysis of the data analytics workload in cloudsuite, in IISWC (2014)
C.-C.M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H.A. Dau, D.F. Silva, A. Mueen, E. Keogh, Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets, in ICDM (2016)
H. Yoon, J. Meza, R. Ausavarungnirun, R.A. Harding, O. Mutlu, Row buffer locality aware caching policies for hybrid memories, in ICCD (2012)
H. Yoon, J. Meza, N. Muralimanohar, N.P. Jouppi, O. Mutlu, Efficient data mapping and buffering techniques for multilevel cell phase-change memories. ACM TACO (2014)
X. Yu, C.J. Hughes, N. Satish, O. Mutlu, S. Devadas, Banshee: bandwidth-efficient DRAM caching via software/hardware cooperation, in MICRO (2017)
J. Yu, H.A.D. Nguyen, L. Xie et al., Memristive devices for computation-in-memory, in DATE (2018)
D.P. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, M. Ignatowski, TOP-PIM: throughput-oriented programmable processing in memory, in HPDC (2014)
W. Zhang, T. Li, Exploring phase change memory and 3D die-stacking for power/thermal friendly, fast and durable memory architectures, in PACT (2009)
Z. Zhang, Z. Zhan, D. Balasubramanian, X. Koutsoukos, G. Karsai, Triggering Rowhammer hardware faults on ARM: a revisit, in ASHES (2018a)
M. Zhang, Y. Zhuo, C. Wang, M. Gao, Y. Wu, K. Chen, C. Kozyrakis, X. Qian, GraphP: reducing communication for PIM-based graph processing with efficient data partition, in HPCA (2018b)
P. Zhou, B. Zhao, J. Yang, Y. Zhang, A durable and energy efficient main memory using phase change memory technology, in ISCA (2009)
Q. Zhu, T. Graf, H.E. Sumbul, L. Pileggi, F. Franchetti, Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware, in HPEC (2013)
M. Zhu, T. Zhang, Z. Gu, Y. Xie, Sparse tensor core: algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs, in MICRO (2019)
Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, X. Qian, GraphQ: scalable PIM-based graph processing, in MICRO (2019)

Acknowledgements

This chapter is a drastically revised and extended version of an earlier article published in 2019 (Mutlu et al. 2019a). This chapter also incorporates revised material from another earlier article published in 2019 (Ghose et al. 2019b). The shorter, initial version of this work (Mutlu et al. 2019a) is based on a keynote talk delivered by Onur Mutlu at the 3rd Mobile System Technologies (MST) Workshop in Milan, Italy on 27 October 2017 (Mutlu 2017). That keynote talk is similar to a series of talks given by Onur Mutlu at a wide variety of venues from 2015 to the present. The talk has evolved significantly over time with the accumulation of new works and feedback received from many audiences. Recent versions of the talk were delivered as a distinguished lecture at George Washington University in February 2019 (Mutlu 2019a), as an invited talk at the ISSCC Special Forum on “Intelligence at the Edge: How Can We Make Machine Learning More Energy Efficient?”, as part of the 2019 International Solid State Circuits Conference in February 2019 (Mutlu 2019e), as a keynote talk at the 29th ACM Great Lakes Symposium on VLSI (Mutlu 2019c), as a keynote talk at the International Symposium on Advanced Parallel Processing Technology in August 2019 (Mutlu 2019d), and as a keynote talk at the 37th IEEE International Conference on Computer Design in November 2019 (Mutlu 2019b). This chapter and the associated talks are based on research done over the course of the past nine years in the SAFARI Research Group on the topic of processing-in-memory (PIM). We thank all of the members of the SAFARI Research Group, and our collaborators at Carnegie Mellon, ETH Zürich, and other universities, who have contributed to the various works we describe in this chapter. Thanks also go to our research group’s industrial sponsors over the past ten years, especially Alibaba, ASML, Google, Huawei, Intel, Microsoft, NVIDIA, Samsung, Seagate, and VMware.
This work was also partially supported by the Intel Science and Technology Center for Cloud Computing, the Semiconductor Research Corporation, the Data Storage Systems Center at Carnegie Mellon University, various NSF and NIH grants, and various awards, including the NSF CAREER Award, the Intel Faculty Honor Program Award, and a number of Google and IBM Faculty Research Awards to Onur Mutlu.

Author information

Authors and Affiliations

SAFARI Research Group, ETH Zürich, Zürich, Switzerland
Onur Mutlu & Juan Gómez-Luna
Carnegie Mellon University, Pittsburgh, USA
Onur Mutlu
University of Illinois Urbana-Champaign, Urbana, USA
Saugata Ghose
King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
Rachata Ausavarungnirun

Authors

Onur Mutlu
Saugata Ghose
Juan Gómez-Luna
Rachata Ausavarungnirun

Editor information

Editors and Affiliations

School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Mohamed M. Sabry Aly
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Anupam Chattopadhyay

About this chapter

Cite this chapter

Mutlu, O., Ghose, S., Gómez-Luna, J., Ausavarungnirun, R. (2023). A Modern Primer on Processing in Memory. In: Aly, M.M.S., Chattopadhyay, A. (eds) Emerging Computing: From Devices to Systems. Computer Architecture and Design Methodologies. Springer, Singapore. https://doi.org/10.1007/978-981-16-7487-7_7

DOI: https://doi.org/10.1007/978-981-16-7487-7_7
Published: 09 July 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-7486-0
Online ISBN: 978-981-16-7487-7
eBook Packages: Engineering, Engineering (R0)
