A Low-Energy Wide SIMD Architecture with Explicit Datapath
Energy efficiency has become one of the most important topics in computing. To meet the ever-increasing demands of the mobile market, the next generation of processors will have to deliver high compute performance within an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that uses explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, we implement multiple instantiations of the proposed wide SIMD architecture and of its automatic bypassing counterpart, as well as a baseline reduced instruction set computing (RISC) processor. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. Compared to the RISC processor, a 128-PE SIMD processor based on the proposed architecture achieves an average speed-up of 206 times and reduces total energy dissipation by 48.3 % on average and by up to 94 %. Compared to the corresponding SIMD architecture with automatic bypassing, the 128-PE, explicitly bypassed SIMD avoids an average of 64 % of all register file accesses and reduces total energy dissipation by 27.5 % on average and by up to 43.0 %.
Keywords: Wide SIMD, Explicit Datapath, Configurable, Low Energy
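The abstract's comparison between automatic and explicit bypassing can be illustrated with a toy access counter. The sketch below is not the architecture evaluated in the paper. It assumes a simplified model in which, with automatic bypassing, every source operand still triggers a register file read and every result is written back, whereas with explicit bypassing a value consumed within `bypass_depth` operations of its producer is forwarded directly and is written back only if a later consumer falls outside that window or if it is never consumed (treated here as live-out). The names `Op`, `count_rf_accesses`, and `bypass_depth` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation of a toy straight-line program (hypothetical model)."""
    dst: str      # destination register name
    srcs: tuple   # source register names


def count_rf_accesses(program, bypass_depth=2, explicit=False):
    """Return (reads, writes) to the register file under the toy model.

    Assumption: automatic bypassing reads every source from the register
    file and writes every result back. Explicit bypassing skips the read
    when the producer is at most `bypass_depth` operations earlier, and
    skips the write when all consumers were served by forwarding.
    """
    reads = 0
    last_def = {}                                # register -> producing op index
    needs_write = [not explicit] * len(program)  # automatic: always write back
    used = [False] * len(program)

    for i, op in enumerate(program):
        for src in op.srcs:
            producer = last_def.get(src)
            if explicit and producer is not None and i - producer <= bypass_depth:
                used[producer] = True            # forwarded, no register file read
                continue
            reads += 1                           # operand comes from the register file
            if producer is not None:
                used[producer] = True
                needs_write[producer] = True     # value must exist in the register file
        last_def[op.dst] = i

    # Results that are never consumed are treated as live-out and written back.
    writes = sum(1 for i in range(len(program)) if needs_write[i] or not used[i])
    return reads, writes


# A four-operation example: t0 = a + b; t1 = t0 * c; t2 = t1 + t0; out = t2 + d
PROGRAM = [
    Op("t0", ("a", "b")),
    Op("t1", ("t0", "c")),
    Op("t2", ("t1", "t0")),
    Op("out", ("t2", "d")),
]

if __name__ == "__main__":
    print(count_rf_accesses(PROGRAM, explicit=False))                 # (8, 4)
    print(count_rf_accesses(PROGRAM, bypass_depth=2, explicit=True))  # (4, 1)
    print(count_rf_accesses(PROGRAM, bypass_depth=1, explicit=True))  # (5, 2)
```

In this toy program the explicit model needs 5 register file accesses instead of 12 when the forwarding window covers two operations. These figures come from the simplified model only and are unrelated to the measured results reported in the abstract.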
This work is supported by the Ministry of Economic Affairs of the Netherlands, project EVA PID07121, and the Dutch Technology Foundation STW, project NEST 10346.