An energy efficient multi-target binary translator for instruction and data level parallelism exploitation

Abstract

Embedded devices, from smartphones to home appliances, are omnipresent in daily life and run both data-oriented and control-oriented applications. To improve the energy-performance tradeoff, such devices exploit data-level parallelism (DLP) and instruction-level parallelism (ILP) through superscalar processors and specialized accelerators. However, because these devices face tight time-to-market constraints, binary compatibility should be maintained to avoid recurring engineering effort, which current embedded processors do not take into account. This work examines a set of embedded applications and shows the need for concurrent ILP and DLP exploitation. To that end, we propose a Hybrid Multi-Target Binary Translator (HMTBT) that transparently exploits ILP and DLP by using a CGRA and an ARM NEON engine as target accelerators. Results show that HMTBT transparently achieves a 24% performance improvement and 54% energy savings over an OoO superscalar processor coupled to an ARM NEON engine. Compared with decoupled binary translators that use the same accelerators with the same ILP and DLP capabilities, the proposed approach improves performance by 10% and energy by 24%.
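To make the ILP/DLP distinction in the abstract concrete, the following Python sketch contrasts a loop whose iterations are independent (a candidate for a SIMD engine such as NEON) with a loop that carries a dependency between iterations but still contains independent operations inside each iteration (a candidate for ILP exploitation). It is only an illustration and not the HMTBT mechanism: the function names, the `LANES` width, and the example kernels are all assumptions introduced here.

```python
# Illustrative sketch only; not the translator described in the paper.
# LANES is a hypothetical SIMD width (e.g. four 32-bit lanes in a 128-bit register).
LANES = 4


def saxpy_scalar(a, x, y):
    """Data-parallel loop: every iteration is independent of the others (DLP)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_simd_model(a, x, y):
    """Same loop, modeled as processing up to LANES elements per vector step."""
    out = []
    steps = 0
    for i in range(0, len(x), LANES):
        # One modeled vector operation covers up to LANES elements.
        out.extend(a * xi + yi for xi, yi in zip(x[i:i + LANES], y[i:i + LANES]))
        steps += 1
    return out, steps


def checksum(data):
    """Loop-carried dependency: each iteration needs the previous acc value,
    so the iterations cannot simply be spread across SIMD lanes as written.
    t1 and t2 do not depend on acc or on each other, so they could be issued
    in parallel (ILP) within an iteration."""
    acc = 0
    for d in data:
        t1 = (d << 1) & 0xFF            # independent of acc
        t2 = d ^ 0x5A                   # independent of acc and t1
        acc = (acc + t1 + t2) & 0xFFFF  # depends on the previous iteration
    return acc
```

In this toy model, ten elements take three vector steps instead of ten scalar iterations, while `checksum` gains nothing from wider lanes. Applications that mix both kinds of loops are the case in which, according to the abstract, concurrent ILP and DLP exploitation is needed.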

Author information

Corresponding author

Correspondence to Mateus B. Rutzig.

Additional information

This study was financed in part by: CNPq; FAPERGS/CNPq 11/2014 - PRONEM; and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

About this article

Cite this article

Knorst, T., Vicenzi, J., Jordan, M.G. et al. An energy efficient multi-target binary translator for instruction and data level parallelism exploitation. Des Autom Embed Syst 26, 55–82 (2022). https://doi.org/10.1007/s10617-021-09258-6

  • DOI: https://doi.org/10.1007/s10617-021-09258-6

Keywords

  • CGRA
  • ARM NEON
  • ILP
  • DLP
  • Binary Translator