A Configurable Shared Scratchpad Memory for GPU-like Processors

  • Alessandro Cilardo
  • Mirko Gagliardi (email author)
  • Ciro Donnarumma
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 1)

Abstract

In recent years, Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) have become increasingly important for high-performance computing. In particular, a number of industrial solutions and academic projects propose design frameworks based on FPGA-implemented GPU-like compute units. Existing GPU-like core projects provide limited hardware support for shared scratchpad memory, and in particular for bank conflicts, a major source of performance loss in many parallel kernels. In this paper, we present a configurable scratchpad memory for GPU-like processors with built-in support for bank remapping. The core is fully synthesizable on FPGA at a contained hardware cost. We validated the presented architecture with a cycle-accurate event-driven emulator written in C++ as well as with an RTL simulator tool. Finally, we demonstrated the impact of bank remapping and of the other parameters exposed by the proposed configurable shared scratchpad memory by evaluating the performance of two real-world parallelized kernels.
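To make the notion of a bank conflict concrete, the following is a minimal sketch and not the architecture described in the paper. The abstract does not specify the remapping function, so the XOR-based mapping below is an assumption chosen only because it is a commonly used swizzle; the names `conflict_cycles`, `modulo_bank`, and `xor_bank` are hypothetical. The sketch counts how many serialized cycles one parallel access needs, assuming each bank serves one distinct word per cycle.

```python
from collections import Counter


def conflict_cycles(addresses, num_banks, bank_of):
    """Cycles needed to serve one parallel access, assuming each bank
    delivers one distinct word per cycle (identical addresses are merged)."""
    per_bank = Counter(bank_of(addr, num_banks) for addr in set(addresses))
    return max(per_bank.values())


def modulo_bank(addr, num_banks):
    # Plain interleaving: consecutive words map to consecutive banks.
    return addr % num_banks


def xor_bank(addr, num_banks):
    # Hypothetical remapping (an assumption, not taken from the paper):
    # XOR the row index into the bank index.
    return (addr ^ (addr // num_banks)) % num_banks


NUM_BANKS = 16
# 16 lanes read one column of a 16x16 row-major tile (stride of 16 words).
column = [lane * 16 for lane in range(16)]

print(conflict_cycles(column, NUM_BANKS, modulo_bank))  # 16: every lane hits bank 0
print(conflict_cycles(column, NUM_BANKS, xor_bank))     # 1: lanes spread over all banks
```

Under these assumptions, the strided column access is fully serialized with plain interleaving and conflict-free with the remapped layout, while a unit-stride row access stays conflict-free under both mappings.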

Keywords

Memory Access, Clock Cycle, Field Programmable Gate Array, Very Large Scale Integration, Memory Bank
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Alessandro Cilardo (1)
  • Mirko Gagliardi (1), email author
  • Ciro Donnarumma (2)

  1. University of Naples Federico II and Centro Regionale ICT (CeRICT), Naples, Italy
  2. University of Naples Federico II, Naples, Italy
