WCET-Aware Assembly Level Optimizations

  • Paul Lokuciejewski
  • Peter Marwedel
Part of the Embedded Systems book series (EMSY)


The major shortcoming of source code optimizations is their lack of knowledge about the underlying architecture. Hence, developing transformations that exploit processor-specific features is limited or even infeasible, and the full optimization potential cannot be explored. In contrast, assembly level optimizations operate on a code representation that reflects the code that is finally executed. Thus, the compiler is fully aware of numerous critical details about the resources used during execution. In this chapter, novel WCET-aware assembly level optimizations are discussed, namely WCET-aware procedure positioning and WCET-aware trace scheduling.


Keywords: Basic Block, Call Graph, Instruction Schedule, Local Schedule, Translation Lookaside Buffer



Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Düsseldorf, Germany
  2. Embedded Systems Group, TU Dortmund University, Dortmund, Germany
