Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture
Abstract
Owing to advances in semiconductor technology, many-core processors are now widely used in high-performance computing. However, many applications still cannot run efficiently because of the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture, named deeply fused many-core (DFMC), for high-performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores targeting different application features. The two core types share a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor alleviates the memory wall problem by combining a series of cooperative computing techniques among CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques improve on-chip data reuse and optimize memory access performance. We describe an FPGA-based implementation of a full-system prototype with four MPEs and 256 CPEs. Our experimental results show that the cooperative computing techniques of CPEs have a significant effect: DGEMM (double-precision matrix multiplication) achieves an efficiency of 94%, FFT (fast Fourier transform) reaches 207 GFLOPS, and FDTD (finite-difference time-domain) reaches 27 GFLOPS.
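To make the notion of on-chip data reuse concrete, the following is a minimal Python sketch of a generic tiled matrix multiplication. It is not the DFMC mechanism described in the paper. The function name `tiled_matmul`, the `tile` parameter, and the cost model (one "load" per matrix element copied into a local buffer) are assumptions introduced purely for illustration.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b using tile x tile blocks.

    Returns (c, loads), where loads counts elements copied from the
    full matrices into local buffers. The local buffers stand in for a
    software-managed on-chip memory; this is an illustrative cost
    model, not a description of any specific hardware.
    """
    n = len(a)
    assert n % tile == 0, "matrix size must be a multiple of the tile size"
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B into local buffers.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                # Each staged element is reused `tile` times below.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads
```

Under this model, an n x n product with tile size t performs 2n³/t loads, so `tile=1` corresponds to fetching both operands for every multiply-add, while larger tiles cut memory traffic in proportion to the tile size. Techniques that keep more data on chip, or let cores exchange it without going through memory, aim at the same effect.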
Keywords
heterogeneous many-core processor; data stream transfer; register-level communication mechanism; hardware synchronization technique; processor prototype