Efficient Deep Learning Using Non-volatile Memory Technology in GPU Architectures

Chapter in: Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing

Abstract

Embedded machine learning (ML) systems have now become the dominant platform for deploying ML serving tasks and are projected to become equally important for training ML models. With this comes the challenge of overall efficient deployment, in particular low-power and high-throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) offer significant advantages over conventional SRAM due to their non-volatility, higher cell density, and scalability. While prior work has investigated several architectural implications of NVM for generic applications, in this chapter we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models with the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8× and 4.7× energy-delay product (EDP) reduction and 2.4× and 2.8× area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2× and 2.4× EDP reduction and accommodate 2.3× and 3.3× cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders-of-magnitude EDP reduction compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies but can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
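To make the headline metric concrete, the sketch below computes an energy-delay product (EDP) for two cache technologies under an iso-capacity comparison. This is a minimal illustration of ours, not DeepNVM++ itself: every per-access energy, latency, and leakage number is a hypothetical placeholder, and the `CacheTech` structure is invented for the example. The actual framework derives such parameters from technology-specific circuit-level models and measured cache behavior of DL workloads.

```python
# Minimal sketch of an iso-capacity EDP comparison in the spirit of DeepNVM++.
# Every number below is a hypothetical placeholder, NOT a value from the chapter;
# the real framework obtains its parameters from circuit-level models and
# cache access traces of deep learning workloads.

from dataclasses import dataclass

@dataclass
class CacheTech:
    name: str
    read_energy_nj: float     # dynamic energy per read access (nJ), hypothetical
    write_energy_nj: float    # dynamic energy per write access (nJ), hypothetical
    leakage_mw: float         # static leakage power (mW), hypothetical
    access_latency_ns: float  # average access latency (ns), hypothetical

def edp(tech: CacheTech, reads: int, writes: int) -> float:
    """Energy-delay product: total energy (J) times total delay (s).

    Delay is crudely modeled as serialized accesses; a real model would
    account for access overlap, bandwidth, and the rest of the hierarchy.
    """
    delay_s = (reads + writes) * tech.access_latency_ns * 1e-9
    dynamic_j = (reads * tech.read_energy_nj + writes * tech.write_energy_nj) * 1e-9
    static_j = tech.leakage_mw * 1e-3 * delay_s  # leakage integrated over runtime
    return (dynamic_j + static_j) * delay_s

# Illustrative technologies at equal (iso) capacity. MRAM typically trades
# higher write energy/latency for much lower leakage than SRAM.
sram = CacheTech("SRAM",     read_energy_nj=0.20, write_energy_nj=0.20,
                 leakage_mw=500.0, access_latency_ns=1.0)
stt  = CacheTech("STT-MRAM", read_energy_nj=0.10, write_energy_nj=0.60,
                 leakage_mw=50.0,  access_latency_ns=2.0)

reads, writes = 10**9, 10**8  # read-heavy mix, common for DL inference caches
ratio = edp(sram, reads, writes) / edp(stt, reads, writes)
print(f"SRAM-to-STT-MRAM EDP ratio (hypothetical inputs): {ratio:.2f}x")
```

Under such read-heavy access mixes, MRAM's leakage savings can outweigh its slower, costlier writes, which is the intuition behind the EDP reductions reported in the abstract; the chapter's actual figures come from the full DeepNVM++ models, not from this toy calculation.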



Acknowledgements

This research was supported in part by NSF CCF Grant No. 1815899 and NSF CSR Grant No. 1815780.

Author information

Correspondence to Diana Marculescu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Inci, A., Isgenc, M.M., Marculescu, D. (2024). Efficient Deep Learning Using Non-volatile Memory Technology in GPU Architectures. In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-19568-6_8

  • DOI: https://doi.org/10.1007/978-3-031-19568-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19567-9

  • Online ISBN: 978-3-031-19568-6

  • eBook Packages: Engineering, Engineering (R0)
