Efficient Deep Learning Using Non-volatile Memory Technology in GPU Architectures

Chapter in: Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing

Abstract

Embedded machine learning (ML) systems have now become the dominant platform for deploying ML serving tasks and are projected to become equally important for training ML models. With this comes the challenge of overall efficient deployment, in particular low-power and high-throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) offer significant advantages over conventional SRAM due to their non-volatility, higher cell density, and scalability. While prior work has investigated several architectural implications of NVM for generic applications, in this chapter we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models with the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8× and 4.7× energy-delay product (EDP) reduction and 2.4× and 2.8× area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2× and 2.4× EDP reduction and accommodate 2.3× and 3.3× cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders-of-magnitude EDP reduction compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies but can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
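To make the headline metric concrete, the sketch below computes an energy-delay product (EDP) for two cache technologies under an iso-capacity comparison. This is a minimal illustration of ours, not DeepNVM++ itself: every per-access energy, latency, and leakage number is a hypothetical placeholder, and the `CacheTech` structure is invented for the example. The actual framework derives such parameters from technology-specific circuit-level models and measured cache behavior of DL workloads.

```python
# Minimal sketch of an iso-capacity EDP comparison in the spirit of DeepNVM++.
# Every number below is a hypothetical placeholder, NOT a value from the chapter;
# the real framework obtains its parameters from circuit-level models and
# cache access traces of deep learning workloads.

from dataclasses import dataclass

@dataclass
class CacheTech:
    name: str
    read_energy_nj: float     # dynamic energy per read access (nJ), hypothetical
    write_energy_nj: float    # dynamic energy per write access (nJ), hypothetical
    leakage_mw: float         # static leakage power (mW), hypothetical
    access_latency_ns: float  # average access latency (ns), hypothetical

def edp(tech: CacheTech, reads: int, writes: int) -> float:
    """Energy-delay product: total energy (J) times total delay (s).

    Delay is crudely modeled as serialized accesses; a real model would
    account for access overlap, bandwidth, and the rest of the hierarchy.
    """
    delay_s = (reads + writes) * tech.access_latency_ns * 1e-9
    dynamic_j = (reads * tech.read_energy_nj + writes * tech.write_energy_nj) * 1e-9
    static_j = tech.leakage_mw * 1e-3 * delay_s  # leakage integrated over runtime
    return (dynamic_j + static_j) * delay_s

# Illustrative technologies at equal (iso) capacity. MRAM typically trades
# higher write energy/latency for much lower leakage than SRAM.
sram = CacheTech("SRAM",     read_energy_nj=0.20, write_energy_nj=0.20,
                 leakage_mw=500.0, access_latency_ns=1.0)
stt  = CacheTech("STT-MRAM", read_energy_nj=0.10, write_energy_nj=0.60,
                 leakage_mw=50.0,  access_latency_ns=2.0)

reads, writes = 10**9, 10**8  # read-heavy mix, common for DL inference caches
ratio = edp(sram, reads, writes) / edp(stt, reads, writes)
print(f"SRAM-to-STT-MRAM EDP ratio (hypothetical inputs): {ratio:.2f}x")
```

Under such read-heavy access mixes, MRAM's leakage savings can outweigh its slower, costlier writes, which is the intuition behind the EDP reductions reported in the abstract; the chapter's actual figures come from the full DeepNVM++ models, not from this toy calculation.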



Acknowledgements

This research was supported in part by NSF CCF Grant No. 1815899 and NSF CSR Grant No. 1815780.

Author information

Correspondence to Diana Marculescu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Inci, A., Isgenc, M.M., Marculescu, D. (2024). Efficient Deep Learning Using Non-volatile Memory Technology in GPU Architectures. In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-19568-6_8

  • DOI: https://doi.org/10.1007/978-3-031-19568-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19567-9

  • Online ISBN: 978-3-031-19568-6

  • eBook Packages: Engineering, Engineering (R0)
