
A Compute Cache System for Signal Processing Applications

Journal of Signal Processing Systems

Abstract

Modern processing systems are constrained by the low efficiency of their memory subsystems. Although memories have evolved into faster and more efficient devices over the years, they remain unable to keep up with the computational power offered by processors, i.e., to feed the processors with data at the rate at which it is consumed. Consequently, with the advent of Big Data, the need to fetch large amounts of data from memory has become the most prominent performance bottleneck. Naturally, several approaches seeking to mitigate this problem have arisen over the years, such as application-specific accelerators and Near Data Processing (NDP) solutions. However, none has offered a satisfactory general-purpose solution without imposing rather limiting constraints. For instance, NDP solutions often require the programmer to have low-level knowledge of how data is physically stored in memory. In this paper, we propose an alternative mechanism that operates at the cache level, leveraging both proximity to the data and the parallelism enabled by accessing an entire cache line per cycle. We detail the internal architecture of the Cache Compute System (CCS) and demonstrate its integration with a conventional high-performance ARM Cortex-A53 Central Processing Unit (CPU). Furthermore, we assess the performance benefits of the CCS using an extensive set of microbenchmarks, as well as six kernels widely used in Convolutional Neural Networks (CNNs) and clustering algorithms. The results show that the CCS provides speedups ranging from 3.9× to 40.6× across the six tested kernels.
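To make the line-level parallelism argument concrete, the sketch below contrasts a conventional word-at-a-time vector addition with a cache-line-granular formulation of the same memory-bound kernel. The 64-byte line size and the ccs_add_lines() helper are illustrative assumptions rather than the paper's actual interface; the helper only models the idea that an in-cache functional unit can consume two full cache lines and produce a third in a single operation, instead of streaming data word by word through the core.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64                       /* assumed cache-line size       */
#define ELEMS_PER_LINE (LINE_BYTES / 4)     /* 16 x 32-bit elements per line */

/* Baseline: the core pulls each operand word through the cache
 * hierarchy and adds one element per iteration. */
static void vec_add_scalar(const int32_t *a, const int32_t *b,
                           int32_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hypothetical stand-in for an in-cache line-wide adder: it consumes
 * two full cache lines and produces a third.  In a CCS-style design
 * this whole loop would map onto a single line-wide operation. */
static void ccs_add_lines(const int32_t *a, const int32_t *b, int32_t *c)
{
    for (int i = 0; i < ELEMS_PER_LINE; i++)
        c[i] = a[i] + b[i];
}

/* Cache-line-granular formulation: the bulk of the vector is processed
 * one full line at a time; a short scalar loop handles the tail. */
static void vec_add_linewise(const int32_t *a, const int32_t *b,
                             int32_t *c, size_t n)
{
    size_t i = 0;
    for (; i + ELEMS_PER_LINE <= n; i += ELEMS_PER_LINE)
        ccs_add_lines(&a[i], &b[i], &c[i]);
    for (; i < n; i++)          /* tail elements, word at a time */
        c[i] = a[i] + b[i];
}

int main(void)
{
    enum { N = 40 };            /* deliberately not a multiple of 16 */
    int32_t a[N], b[N], ref[N], out[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    vec_add_scalar(a, b, ref, N);
    vec_add_linewise(a, b, out, N);

    for (int i = 0; i < N; i++)
        if (ref[i] != out[i]) { puts("mismatch"); return 1; }
    puts("line-wise result matches scalar baseline");
    return 0;
}
```

In software the per-line loop still executes sequentially; in a CCS-style design it would collapse into a single line-wide operation per iteration, which is the effect the reported 3.9× to 40.6× speedups quantify on memory-bound kernels.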




Author information

Corresponding author

Correspondence to João Vieira.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under projects UIDB/50021/2020 and PTDC/EEI-HAC/30485/2017–HAnDLE (INESC-ID), UIDB/EEA/50008/2020 (Instituto de Telecomunicações), and research grant SFRH/BD/144047/2019.


About this article


Cite this article

Vieira, J., Roma, N., Falcao, G. et al. A Compute Cache System for Signal Processing Applications. J Sign Process Syst 93, 1173–1186 (2021). https://doi.org/10.1007/s11265-020-01626-y
