Design Automation for Embedded Systems, Volume 14, Issue 3, pp 309–326

Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions

  • Garo Bournoutian
  • Alex Orailoglu
Open Access


Today, embedded processors are expected to run algorithmically complex, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, the impact of memory latencies on execution time is becoming increasingly evident. At the same time, embedded processors are expected to be power-conscious and to occupy minimal area, as they are often used in mobile devices such as wireless smartphones and portable MP3 players. Traditional methods for addressing performance and memory latencies, such as multiple issue, out-of-order execution, and large, associative caches, are therefore poorly suited to the mobile embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latencies. Such mitigation would otherwise not be possible in a regular in-order, single-issue embedded processor without large, power-hungry constructs like a Reorder Buffer (ROB). The concept relies on efficiently leveraging both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing in the event of a memory stall. Simulation results show an improvement in overall execution throughput of approximately 11%, with minimal impact on area overhead and power.
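To make the notion of "non-data-dependent instructions" concrete, the following is a minimal sketch and not the authors' mechanism, which the abstract does not detail. It assumes straight-line, register-only code and ignores memory aliasing and control flow. The `Instr` type, the register names, and the `independent_of_load` function are all hypothetical, introduced only for illustration. Given the destination register of a load that missed in the cache, it conservatively selects the following instructions that could run without waiting for the load, deferring anything with a read-after-write, write-after-write, or write-after-read conflict against the load or an already deferred instruction.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical three-address instruction (illustration only)."""
    text: str                      # human-readable form
    dest: Optional[str]            # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read


def independent_of_load(window: List[Instr], load_dest: str) -> List[Instr]:
    """Return the instructions in `window` (program order, following a
    stalled load) that are safe to execute before the load completes.

    Assumption: register dependences only; no memory or branch handling.
    """
    pending_writes: Set[str] = {load_dest}  # written by the load or deferred instrs
    deferred_reads: Set[str] = set()        # read by deferred instrs
    safe: List[Instr] = []

    for ins in window:
        raw = any(s in pending_writes for s in ins.srcs)            # needs a pending value
        waw = ins.dest is not None and ins.dest in pending_writes   # would be overwritten later
        war = ins.dest is not None and ins.dest in deferred_reads   # would clobber a deferred input
        if raw or waw or war:
            # Defer: everything this instruction touches now constrains later ones.
            if ins.dest is not None:
                pending_writes.add(ins.dest)
            deferred_reads.update(ins.srcs)
        else:
            safe.append(ins)
    return safe


if __name__ == "__main__":
    # Instructions following a stalled "load r1, [r2]".
    window = [
        Instr("add r3, r1, r4", "r3", ("r1", "r4")),
        Instr("sub r5, r6, r7", "r5", ("r6", "r7")),
        Instr("mul r4, r5, r5", "r4", ("r5", "r5")),
        Instr("or r8, r5, r6", "r8", ("r5", "r6")),
        Instr("xor r9, r3, r8", "r9", ("r3", "r8")),
    ]
    for ins in independent_of_load(window, "r1"):
        print(ins.text)
```

In this example, `add` is deferred because it reads the pending load result, `mul` is deferred because it would overwrite `r4` before the deferred `add` has read it, and `xor` is deferred because it reads the deferred `add` result. A real design would also have to account for stores, branches, and exceptions.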


Embedded processors, Data cache, Pipeline stalls, Compiler assisted hardware



Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. CSE Department, University of California, San Diego, La Jolla, USA
