Skip to main content

Advertisement

Log in

Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Manferdelli J L, Govindaraju N K, Crall C. Challenges and opportunities in many-core computing. Proceedings of the IEEE, 2008, 96(5): 808-815.

    Article  Google Scholar 

  2. Shalf J, Dosanjh S, Morrison J. Exascale computing technology challenges. In Proc. the 9th Int. High Performance Computing for Computational Science{VECPAR, June 2011, pp.1-25.

  3. Daga M, Aji A M, Feng W. On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing. In Proc. Symposium on Application Accelerators in High-Performance Computing, July 2011, pp.141-149.

  4. Chung E S, Milder P A, Hoe J C, Mai K. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proc. the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010, pp.225-236.

  5. Lee V W, Grochowski E, Geva R. Performance benefits of heterogeneous computing in HPC workloads. In Proc. the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), May 2012, pp.16-26.

  6. Kumar R, Farkas K I, Jouppi N P et al. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proc. the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2003, pp.81-92.

  7. Lee V W, Kim C, Chhugani J et al. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In Proc. the 37th Annual International Symposium on Computer Architecture (ISCA), June 2010, pp. 451–460.

  8. Wittenbrink C M, Kilgariff E, Prabhu A. Fermi GF100 GPU architecture. IEEE Micro, 2011, 31(2): 50-59.

    Article  Google Scholar 

  9. Kapasi U J, Dally W J, Rixner S et al. The imagine stream processor. In Proc. IEEE International Conference on Computer Design: VLSI in Computers and Processors(ICCD), September 2002, pp. 282–288.

  10. Duran A, Klemm M. The Intelr many integrated core architecture. In Proc. International Conference on High Performance Computing and Simulation (HPCS), July 2012, pp. 365-366.

  11. Alves M A Z, Freitas H C, Navaux P O A. Investigation of shared L2 cache on many-core processors. In Proc. the 22nd International Conference on Architecture of Computing Systems (ARCS), March 2009, pp. 1-10.

  12. Wentzlaff D, Griffin P, Hoffmann H et al. On-chip interconnection architecture of the Tile Processor. IEEE Micro, 2007, 27(5): 15-31.

    Article  Google Scholar 

  13. Howard J, Dighe S, Hoskote Y et al. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Proc. IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb. 2010, pp.108-109.

  14. Hoskote Y, Vangal S, Singh A et al. A 5-GHz mesh interconnect for a Teraflops processor. IEEE Micro, 2007, 27(5): 51-61.

    Article  Google Scholar 

  15. Gries M, Hoffmann U, Konow M et al. SCC: A flexible architecture for many-core platform research. Computing in Science and Engineering, 2011, 13(6): 79-83.

    Article  Google Scholar 

  16. Balakrishnan A, Naeemi A. Interconnect network analysis of many-core chips. IEEE Transactions on Electron Devices, 2011, 58(9): 2831-2837.

    Article  Google Scholar 

  17. Taylor M B, Lee W, Amarasinghe S et al. Scalar operand networks: On-chip interconnect for ILP in partitioned architectures. In Proc. the 9th IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2003, pp.341-353.

  18. Kim J. Low-cost router microarchitecture for on-chip networks. In Proc. the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2009, pp.255-266.

  19. Jung H, Ju M, Che H. A theoretical framework for design space exploration of manycore processors. In Proc. the 19th Annual IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, July 2011, pp.117-125.

  20. Seiler L, Carmean D, Sprangle E et al. Larrabee: A manycore x86 architecture for visual computing. IEEE Micro, 2009, 29(1): 10-21.

    Article  Google Scholar 

  21. Chen P, Zhao H L, Tao C, Sang H S. Block-run-based connected component labelling algorithm for GPGPU using shared memory. Electronics Letters, 2011, 47(24): 1309-1311.

    Article  Google Scholar 

  22. Sawant N, Kulkarni D. Performance evaluation of feature extraction algorithm on GPGPU. In Proc. International Conference on Communication Systems and Network Technologies (CSNT), June 2011, pp. 536-540.

  23. Heinecke A, Klemm M, Bungartz H J. From GPGPU to many-core: Nvidia Fermi and Intel many integrated core architecture. Computing in Science & Engineering, 2012, 14(2): 78-83.

    Article  Google Scholar 

  24. Bell S, Edwards B, Amann J et al. TILE64TM-processor: A 64-core SoC with mesh interconnect. In Proc. IEEE Int. Solid-State Circuits Conference (ISSCC), February 2008, pp.88-89, 598.

  25. Sewell K, Dreslinski R G, Manville T et al. Swizzle-switch networks for many-core systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2012, 2(2): 278-294.

    Article  Google Scholar 

  26. Kim J, Balfour J, Dally W. Flattened butterfly topology for on-chip networks. In Proc. the 40th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2007, pp.172-182.

  27. Bakhoda A, Kim J, Aamodt T M. Throughput-effective onchip networks for manycore accelerators. In Proc. the 43rd Annual IEEE/ACM MICRO, Dec. 2010, pp. 421-432.

  28. Fan D, Zhang H, Wang D et al. Godson-T: An efficient many-core processor exploring thread-level parallelism. IEEE Micro, 2012, 32(2): 38-47.

    Article  MathSciNet  Google Scholar 

  29. Wang X, Gan G, Manzano J et al. A quantitative study of the on-chip network and memory hierarchy design for manycore processor. In Proc. the 14th IEEE International Conference on Parallel and Distributed Systems, Dec. 2008, pp. 689-696.

  30. Taylor M B, Psota J, Saraf A et al. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In Proc. the 31st Annual International Symposium on Computer Architecture (ISCA), June 2004, pp. 2-13.

  31. Taylor M B, Kim J, Miller J et al. The Raw microprocessor: A computational fabric for software circuits and generalpurpose programs. IEEE Micro, 2002, 22(2): 25-35.

    Article  Google Scholar 

  32. Bakhoda A, Kim J, Aamodt T M. Throughput-effective onchip networks for manycore accelerators. In Proc. the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2010, pp.421-432.

  33. Asanovic K, Bodik R, Catanzaro B C et al. The landscape of parallel computing research: A view from Berkeley. Technical Report, UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

  34. Bell N, Garland M. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proc. Conference on High Performance Computing Networking, Storage and Analysis, November 2009, Article No.18.

  35. Choi J W, Singh A, Vuduc R. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proc. the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010, pp.115–126.

  36. Luo L, Wong M, Hwu W. An effective GPU implementation of breadth-first search. In Proc. the 47th Design Automation Conference (DAC), June 2010, pp.52-55.

  37. Bo Z, Zheng-hui X, Wu R et al. Accelerating FDTD algorithm using GPU computing. In Proc. IEEE International Conference on Microwave Technology & Computational Electromagnetics, May 2011, pp.410-413.

  38. Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach (5th edition): Morgan Kaufmann, 2011.

  39. Hill M, Marty M. Amdahl’s law in the multicore era. IEEE Computer, 2008, 41(7): 33-38.

    Article  Google Scholar 

  40. Riley M W, Warnock J D, Wendel D F. Cell broadband engine processor: Design and implementation. IBM Journal of Research and Development, 2007, 51(5): 545-557.

    Article  Google Scholar 

  41. Kahle J A, Day M N, Hofstee H P et al. Introduction to the Cell multiprocessor. IBM Journal Research and Development, 2005, 49(4): 589-604.

    Article  Google Scholar 

  42. Woo D H, Lee H H S. Extending Amdahl’s law for energyefficient computing in the many-core era. IEEE Computer, 2008, 41(12): 24-31.

    Article  Google Scholar 

  43. Kumar R, Tullsen D M, Ranganathan P et al. Single-ISA heterogeneous multicore architectures for multithreaded workload performance. In Proc. the 31st Annual International Symposium on Computer Architecture, June 2004, pp. 64-75.

  44. Yang Y, Xiang P, Mantor M et al. CPU-assisted GPGPU on fused CPU-GPU architectures. In Proc. the 18th IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2012.

  45. Branover A, Foley D, Steinman M. AMD fusion APU: Llano. IEEE Micro, 2012, 32(2): 28-37.

    Article  Google Scholar 

  46. Keckler S W, Dally W J, Khailany B et al. GPUs and the future of parallel computing. IEEE Micro, 2011, 31(5): 7-17.

    Article  Google Scholar 

  47. Khunjush F, Dimopoulos N J. Extended characterization of DMA transfers on the Cell BE processor. In Proc. IEEE International Symposium on Parallel and Distributed Processing, April 2008.

  48. Gebhart M, Keckler S W, Khailany B et al. Unifying primary cache, scratch, and register file memories in a throughput processor. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.96-106.

  49. Keckler S W, Dally W J, Maskit D et al. Exploiting finegrain thread level parallelism on the MIT multi-ALU processor. ACM SIGARCH Computer Architecture News, 1998, 26(3): 306-317.

    Article  Google Scholar 

  50. Korch M, Rauber T, Scholtes C. Memory-intensive applications on a many-core processor. In Proc. the 13th IEEE International Conference on High Performance Computing and Communications (HPCC), September 2011, pp.126-134.

  51. Abellán J L, Fernández J, Acacio M E. Efficient hardware barrier synchronization in many-core CMPs. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(8): 1453-1466.

  52. WatkinsMA, Albonesi D H. ReMAP: A reconfigurable heterogeneous multicore architecture. In Proc. the 43rd IEEE International Symposium on Microarchitecture, Dec. 2010, pp. 497-508.

  53. Yu L, Liu Z, Fan D et al. Study on fine-grained synchronization in many-core architecture. In Proc. the 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, May 2009, pp.524-529.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Fang Zheng.

Additional information

The work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2014AA01A300 and the National Science and Technology Major Project of HeGaoJi under Grant No. 2013ZX0102-8001-001-001.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zheng, F., Li, HL., Lv, H. et al. Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture. J. Comput. Sci. Technol. 30, 145–162 (2015). https://doi.org/10.1007/s11390-015-1510-9

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-015-1510-9

Keywords

Navigation