Abstract
Embedded machine learning (ML) systems have become the dominant platform for deploying ML serving tasks and are projected to become equally important for training ML models. With this comes the challenge of overall efficient deployment, in particular low-power and high-throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While prior work has investigated several architectural implications of NVM for generic applications, in this chapter, we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8× and 4.7× energy-delay product (EDP) reduction and 2.4× and 2.8× area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2× and 2.4× EDP reduction and accommodate 2.3× and 3.3× cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
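The energy-delay product metric used throughout the abstract can be sketched as a simple ratio computation. The per-access energy and latency figures below are illustrative placeholders, not measured values from the chapter's models:

```python
def edp(energy_nj: float, delay_ns: float) -> float:
    """Energy-delay product of one cache access; lower is better."""
    return energy_nj * delay_ns

def edp_reduction(edp_baseline: float, edp_candidate: float) -> float:
    """How many times smaller the candidate's EDP is versus the baseline."""
    return edp_baseline / edp_candidate

# Hypothetical per-access numbers for a last-level cache (placeholders only,
# chosen to show the shape of the comparison, not to reproduce the paper).
sram = edp(energy_nj=1.0, delay_ns=2.0)      # assumed SRAM baseline
sot_mram = edp(energy_nj=0.3, delay_ns=1.4)  # assumed SOT-MRAM figures

print(f"EDP reduction vs. SRAM: {edp_reduction(sram, sot_mram):.1f}x")
```

The same ratio applies unchanged under iso-capacity or iso-area assumptions; only the energy and delay inputs fed to `edp` differ between the two analyses.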
Acknowledgements
This research was supported in part by NSF CCF Grant No. 1815899 and NSF CSR Grant No. 1815780.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Inci, A., Isgenc, M.M., Marculescu, D. (2024). Efficient Deep Learning Using Non-volatile Memory Technology in GPU Architectures. In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-19568-6_8
Print ISBN: 978-3-031-19567-9
Online ISBN: 978-3-031-19568-6