Optimal Tiling Strategy for Memory Bandwidth Reduction for CNNs

  • Conference paper
Advanced Concepts for Intelligent Vision Systems (ACIVS 2017)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10617)

Abstract

Convolutional Neural Networks (CNNs) are now present in many embedded solutions. One of the biggest problems in their execution is the memory bottleneck. In this work we propose an optimal double-buffering tiling strategy to reduce the memory bandwidth required to execute deep CNN architectures, and we test our model on one of the two cores of a Zynq®-7020 embedded platform. An optimal tiling strategy is found for each layer of the network by optimizing for the lowest external memory \(\rightleftharpoons \) On-Chip Memory (OCM) bandwidth. Performance tests show a 50% improvement in total execution time with the cache disabled (34% with the cache enabled) compared to a non-double-buffered implementation. Moreover, the external memory \(\rightleftharpoons \) OCM double-buffering bandwidth is 5x lower than with naive tiling settings. Furthermore, we show that the tiling settings with the highest OCM usage do not generally lead to the lowest bandwidth.
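
The per-layer search described in the abstract can be illustrated with a minimal sketch. It is not the authors' cost model: it assumes a stride-1, unpadded convolution, 2-byte elements, and a simplified accounting in which every tile reloads its input patch and weights and writes its output once. The names `tile_cost` and `best_tiling` and all parameters are hypothetical. Double buffering is modelled only as a 2x factor on the on-chip buffer footprint.

```python
from itertools import product
from math import ceil


def tile_cost(H, W, C_in, C_out, K, th, tw, tc, elem_bytes=2):
    """Return (buffer_bytes, traffic_bytes) for one tiling choice.

    Assumed model (not taken from the paper): stride-1, unpadded KxK
    convolution; th x tw output pixels and tc output channels per tile.
    """
    out_h, out_w = H - K + 1, W - K + 1
    in_tile = (th + K - 1) * (tw + K - 1) * C_in   # input patch incl. halo
    w_tile = K * K * C_in * tc                     # weights for tc filters
    out_tile = th * tw * tc                        # output produced per tile
    # ceil() treats edge tiles as full tiles, a slight overestimate.
    n_tiles = ceil(out_h / th) * ceil(out_w / tw) * ceil(C_out / tc)
    per_tile = (in_tile + w_tile + out_tile) * elem_bytes
    buffer_bytes = 2 * per_tile        # double buffering: two copies on chip
    traffic_bytes = n_tiles * per_tile
    return buffer_bytes, traffic_bytes


def best_tiling(H, W, C_in, C_out, K, ocm_bytes, elem_bytes=2):
    """Exhaustively search tile sizes; return (traffic, buffer, (th, tw, tc))
    with the lowest external-memory traffic that fits in ocm_bytes, or None."""
    out_h, out_w = H - K + 1, W - K + 1
    best = None
    for th, tw, tc in product(range(1, out_h + 1),
                              range(1, out_w + 1),
                              range(1, C_out + 1)):
        buf, traffic = tile_cost(H, W, C_in, C_out, K, th, tw, tc, elem_bytes)
        if buf > ocm_bytes:
            continue  # double-buffered tiles do not fit on chip
        if best is None or traffic < best[0]:
            best = (traffic, buf, (th, tw, tc))
    return best


if __name__ == "__main__":
    # Hypothetical small layer: 16x16x4 input, eight 3x3 filters, 4 KiB budget.
    print(best_tiling(16, 16, 4, 8, 3, ocm_bytes=4096))
```

An exhaustive search is only practical for small layers; it is used here because it makes the objective (traffic) and the constraint (double-buffered footprint) explicit.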

Acknowledgment

The work of S. Smets was supported by a Doctoral Fellowship of the Research Foundation Flanders (FWO).

Author information

Corresponding author

Correspondence to Sander Smets.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Cecconi, L., Smets, S., Benini, L., Verhelst, M. (2017). Optimal Tiling Strategy for Memory Bandwidth Reduction for CNNs. In: Blanc-Talon, J., Penne, R., Philips, W., Popescu, D., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2017. Lecture Notes in Computer Science, vol 10617. Springer, Cham. https://doi.org/10.1007/978-3-319-70353-4_8

  • DOI: https://doi.org/10.1007/978-3-319-70353-4_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70352-7

  • Online ISBN: 978-3-319-70353-4

  • eBook Packages: Computer Science, Computer Science (R0)
