Journal of Signal Processing Systems, Volume 80, Issue 1, pp 87–101

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

  • Dongrui She
  • Yifan He
  • Luc Waeijen
  • Henk Corporaal


Energy efficiency is one of the most important metrics in embedded processor design, and wide SIMD architectures are a promising approach to building energy-efficient, high-performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that uses an explicit datapath to achieve high energy efficiency. The framework generates processor instances from architecture specification files. It includes a compiler that allows the proposed architecture to be programmed efficiently in standard programming languages, including OpenCL. The compiler analyzes the static memory access patterns in OpenCL kernels, generates efficient mappings, and schedules the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and to reduce total energy consumption by 50% compared to a basic RISC processor.
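To give a rough idea of what analyzing a static memory access pattern can mean, the sketch below classifies an index expression that is affine in the work-item id, i.e. of the form stride * global_id + offset. It is only an illustration under assumptions and not the compiler described in the paper: the `AffineAccess` model, the category names, the `classify` and `lane_of` helpers, and the interleaved layout (element k stored in PE k mod 128) are all hypothetical. Only the PE count of 128 comes from the abstract.

```python
from dataclasses import dataclass

NUM_PES = 128  # PE count of the configuration mentioned in the abstract


@dataclass(frozen=True)
class AffineAccess:
    """Hypothetical model of an index expression: stride * global_id + offset."""
    stride: int
    offset: int


def classify(access: AffineAccess) -> str:
    """Classify a statically known affine access (category names are assumptions)."""
    if access.stride == 0:
        return "broadcast"   # every work-item touches the same element
    if access.stride == 1 and access.offset == 0:
        return "local"       # work-item i touches element i
    if access.stride == 1:
        return "neighbour"   # work-item i touches element i + offset
    return "strided"         # anything else is left to a general mapping


def lane_of(access: AffineAccess, global_id: int, num_pes: int = NUM_PES) -> int:
    """PE holding the accessed element, assuming element k lives in PE k % num_pes."""
    return (access.stride * global_id + access.offset) % num_pes


if __name__ == "__main__":
    for acc in (AffineAccess(1, 0), AffineAccess(1, -1), AffineAccess(0, 5), AffineAccess(2, 0)):
        print(acc, classify(acc), lane_of(acc, global_id=0))
```

Under these assumptions, an access such as `in[gid - 1]` is classified as a neighbour access, which a mapping stage could serve through inter-PE communication rather than a general memory access. How the actual compiler represents and maps such patterns is not stated in the abstract.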


SIMD, OpenCL, Code Generation, Compiler, Low Power



This work is supported by the Ministry of Economic Affairs of the Netherlands, project EVA PID07121, and the Dutch Technology Foundation STW, project NEST 10346.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Dongrui She (1)
  • Yifan He (1, 2)
  • Luc Waeijen (1)
  • Henk Corporaal (1)

  1. Eindhoven University of Technology, Eindhoven, The Netherlands
  2. Recore Systems B.V., Enschede, The Netherlands
