Building High-Performance, Easy-to-Use Polymorphic Parallel Memories with HLS

Conference paper
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 561)

Abstract

With the increased interest in energy efficiency, many application domains are experimenting with Field Programmable Gate Arrays (FPGAs), which promise customized hardware accelerators with high performance and low power consumption. These experiments are possible due to the development of High-Level Languages (HLLs) for FPGAs, which permit non-experts in hardware design languages (HDLs) to program reconfigurable hardware for general-purpose computing.

However, some of the expert knowledge remains difficult to integrate in HLLs, eventually leading to performance loss for HLL-based applications. One example of such a missing feature is the efficient exploitation of the local memories on FPGAs. A solution to address this challenge is PolyMem, an easy-to-use polymorphic parallel memory that uses BRAMs. In this work, we present HLS-PolyMem, the first complete implementation and in-depth evaluation of PolyMem optimized for the Xilinx Design Suite. Our evaluation demonstrates that HLS-PolyMem is a viable alternative to HLS memory partitioning, the current approach for memory parallelism in Vivado HLS. Specifically, we show that PolyMem offers the same performance as HLS partitioning for simple access patterns, and outperforms partitioning by as much as 13x when combining multiple access patterns for the same data structure. We further demonstrate the use of PolyMem in two different case studies, highlighting the superior capabilities of HLS-PolyMem in terms of performance, resource utilization, flexibility, and usability.
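The difference between a single partitioning and a multi-pattern parallel memory can be illustrated with a small bank-mapping model. The sketch below is not the HLS-PolyMem implementation or its mapping functions. It is a plain C++ model under stated assumptions: a hypothetical memory with `kBanks` single-port banks, a cyclic partitioning along the column index, and a classic linear skewing scheme chosen only for illustration. All names (`kBanks`, `cyclic_bank`, `skewed_bank`, `worst_bank_load`) are invented for this example.

```cpp
#include <array>
#include <cstddef>

// Hypothetical number of parallel banks (assumption, not from the paper).
constexpr std::size_t kBanks = 8;

// Cyclic partitioning along the column dimension: the bank depends on the
// column index only, so the row index is ignored.
constexpr std::size_t cyclic_bank(std::size_t /*row*/, std::size_t col) {
    return col % kBanks;
}

// Linear skewing (illustrative stand-in for a multi-pattern mapping):
// the bank depends on both the row and the column index.
constexpr std::size_t skewed_bank(std::size_t row, std::size_t col) {
    return (row + col) % kBanks;
}

using BankFn = std::size_t (*)(std::size_t, std::size_t);

// Counts how many of the kBanks elements of one parallel access fall into
// the most heavily used bank. The access starts at (row, col) and advances
// by (drow, dcol) per element. If each bank serves one element per cycle,
// the returned value is the number of cycles the access needs.
std::size_t worst_bank_load(BankFn bank, std::size_t row, std::size_t col,
                            std::size_t drow, std::size_t dcol) {
    std::array<std::size_t, kBanks> hits{};  // zero-initialized counters
    std::size_t worst = 0;
    for (std::size_t k = 0; k < kBanks; ++k) {
        const std::size_t b = bank(row + k * drow, col + k * dcol);
        ++hits[b];
        if (hits[b] > worst) {
            worst = hits[b];
        }
    }
    return worst;
}
```

In this model, a row access (`drow = 0, dcol = 1`) is conflict-free under both mappings. A column access (`drow = 1, dcol = 0`) places all `kBanks` elements in one bank under the cyclic partitioning and is therefore serialized, while the skewed mapping still spreads them over distinct banks. This is the kind of situation the abstract refers to when several access patterns are combined on the same data structure.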

Based on all the evidence provided in this work, we conclude that HLS-PolyMem enables the efficient use of BRAMs as parallel memories, without compromising the HLS level or the achievable performance.

Keywords

Polymorphic Parallel Memory, High-Level Synthesis, FPGA

Copyright information

© IFIP International Federation for Information Processing 2019

Authors and Affiliations

  1. Politecnico di Milano, Milan, Italy
  2. Technische Universiteit Delft, Delft, The Netherlands
  3. Universiteit van Amsterdam, Amsterdam, The Netherlands