Journal of Signal Processing Systems, Volume 80, Issue 1, pp 65–86

A Low-Energy Wide SIMD Architecture with Explicit Datapath

  • Luc Waeijen
  • Dongrui She
  • Henk Corporaal
  • Yifan He


Energy efficiency has become one of the most important topics in computing. To meet the ever-increasing demands of the mobile market, the next generation of processors will have to deliver high compute performance within an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures are a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that uses explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate its efficiency, we implement multiple instantiations of the proposed wide SIMD architecture and of its automatic bypassing counterpart, as well as a baseline reduced instruction set computing (RISC) processor. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. Compared to the RISC processor, a 128-PE SIMD processor based on the proposed architecture achieves an average speed-up of 206 times and reduces total energy dissipation by 48.3% on average and by up to 94%. Compared to the corresponding SIMD architecture with automatic bypassing, the 128-PE explicitly bypassed SIMD avoids an average of 64% of all register file accesses. Total energy dissipation is reduced by 27.5% on average and by up to 43.0%.
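The register file savings mentioned in the abstract can be illustrated with a toy model. The sketch below is not the authors' architecture or measurement method; the function `rf_accesses`, the `window` parameter, and the four-instruction program are hypothetical and exist only for illustration. It assumes straight-line code in which each name is written once, that the automatic bypassing baseline reads every source from and writes every result to the register file, and that the explicit variant reads a source from the bypass network when it was produced within the last `window` instructions and writes a result back only if something still needs it from the register file.

```python
from typing import Dict, List, Set, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination, source operands)


def rf_accesses(program: List[Instr], live_out: Set[str],
                explicit: bool, window: int = 1) -> Tuple[int, int]:
    """Count (reads, writes) on the register file under a toy bypass model.

    Assumption: each destination is defined exactly once (straight-line code).
    """
    # Index of the instruction that produces each value.
    producer: Dict[str, int] = {dest: i for i, (dest, _) in enumerate(program)}
    reads = 0
    # Baseline writes every result; explicit starts with live-out values only.
    must_write: Set[str] = set(live_out) if explicit else set(producer)
    for i, (_, srcs) in enumerate(program):
        for src in srcs:
            j = producer.get(src)  # None for program inputs
            forwarded = explicit and j is not None and 0 < i - j <= window
            if not forwarded:
                reads += 1
                if j is not None:
                    # A later register file read forces the write-back.
                    must_write.add(src)
    writes = sum(1 for dest in producer if dest in must_write)
    return reads, writes


# Hypothetical example: out = ((a*b) + c) + (a*b) + d
PROGRAM: List[Instr] = [
    ("t0", ("a", "b")),
    ("t1", ("t0", "c")),
    ("t2", ("t1", "t0")),
    ("out", ("t2", "d")),
]
LIVE_OUT = {"out"}

base = rf_accesses(PROGRAM, LIVE_OUT, explicit=False)
expl = rf_accesses(PROGRAM, LIVE_OUT, explicit=True)
avoided = 1 - sum(expl) / sum(base)
print(f"baseline={base} explicit={expl} avoided={avoided:.1%}")
```

The fraction printed here depends entirely on the toy program and bypass window; the 64% figure in the abstract comes from the authors' experiments on their benchmarks.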


Keywords: Wide SIMD, Explicit Datapath, Configurable, Low Energy



This work is supported by the Ministry of Economic Affairs of the Netherlands, project EVA PID07121, and the Dutch Technology Foundation STW, project NEST 10346.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Luc Waeijen (1)
  • Dongrui She (1)
  • Henk Corporaal (1)
  • Yifan He (1, 2)

  1. Eindhoven University of Technology, Eindhoven, The Netherlands
  2. Recore Systems B.V., Enschede, The Netherlands
