Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs

  • Chien-Ting Chen
  • Yoshi Shih-Chieh Huang
  • Yuan-Ying Chang
  • Chiao-Yun Tu
  • Chung-Ta King
  • Tai-Yuan Wang
  • Janche Sang
  • Ming-Hua Li
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8707)


Abstract

The massive multithreading architecture of General-Purpose Graphics Processing Units (GPGPUs) makes them ideal for data-parallel computing. However, designing efficient GPGPU chips poses many challenges. One major hurdle is the interface to the external DRAM, particularly the buffers in the memory controllers (MCs), which are stressed heavily by the many concurrent memory accesses from the GPGPU. Previous approaches schedule the memory requests within the memory buffers to reduce switching of DRAM rows. The problem is that the window of requests that can be considered for scheduling is too narrow, and the resulting memory controller is complex, affecting the critical path. Since the massive multithreading of GPGPUs can hide memory access latencies, we exploit in this paper the novel idea of rearranging memory requests while they are still in the network-on-chip (NoC), called packet coalescing. To study the feasibility of this idea, we have designed an extended NoC router that supports packet coalescing and evaluated its performance extensively. Evaluation results show that this NoC-assisted design strategy improves the row buffer hit rate in the memory controllers. A comprehensive investigation of the factors affecting coalescing performance is also conducted and reported.


Keywords: network-on-chip, general-purpose graphics processing unit, memory controller, latency hiding, router design
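The core idea — reordering in-flight memory requests so that accesses to the same DRAM row reach the memory controller back-to-back — can be illustrated with a small model. The sketch below is not the paper's router implementation; it is a minimal simulation, with an assumed row size and a simplified address-to-row mapping, showing why grouping same-row packets raises the row buffer hit rate.

```python
from collections import OrderedDict

ROW_SIZE = 2048  # bytes per DRAM row (assumed for illustration)

def row_of(addr):
    """DRAM row that a byte address maps to (simplified linear mapping)."""
    return addr // ROW_SIZE

def coalesce_queue(packets):
    """Reorder a router output queue so that requests to the same DRAM
    row depart back-to-back, mimicking packet coalescing in the NoC.
    Arrival order is preserved within each row group and across groups."""
    groups = OrderedDict()          # row -> packets, in first-arrival order
    for p in packets:
        groups.setdefault(row_of(p), []).append(p)
    return [p for grp in groups.values() for p in grp]

def row_hits(packets):
    """Count requests that hit the currently open row buffer, assuming
    an open-row policy with a single row buffer per bank."""
    hits, open_row = 0, None
    for p in packets:
        r = row_of(p)
        if r == open_row:
            hits += 1
        open_row = r
    return hits

# Requests from interleaved warps, alternating between DRAM rows 0, 2, 4.
reqs = [0, 4096, 64, 4160, 128, 8192, 192]
print(row_hits(reqs))                  # in arrival order: every access switches rows
print(row_hits(coalesce_queue(reqs)))  # coalesced: same-row requests hit the open row
```

In this toy trace the uncoalesced stream gets zero row buffer hits, while the coalesced stream turns most accesses into hits — the effect the NoC-assisted design aims for without widening the scheduling window inside the memory controller itself.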



Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Chien-Ting Chen (1)
  • Yoshi Shih-Chieh Huang (1)
  • Yuan-Ying Chang (1)
  • Chiao-Yun Tu (1)
  • Chung-Ta King (1)
  • Tai-Yuan Wang (1)
  • Janche Sang (2)
  • Ming-Hua Li (3)

  1. Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
  2. Department of Computer and Information Science, Cleveland State University, Cleveland, USA
  3. Information and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan