Abstract
Today, embedded processors are expected to run algorithmically complex, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, the impact of memory latencies on execution time is becoming increasingly evident. At the same time, embedded processors are expected to be power-conscious and to occupy minimal area, as they are often used in mobile devices such as smartphones and portable MP3 players. Consequently, traditional methods for addressing performance and memory latencies, such as multiple issue, out-of-order execution, and large associative caches, are poorly suited to the mobile embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latencies, one that would otherwise not be possible in a regular in-order, single-issue embedded processor without large, power-hungry constructs such as a Reorder Buffer (ROB). The concept relies on efficiently leveraging both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing in the event of a memory stall. Simulation results show an improvement in overall execution throughput of approximately 11%, with minimal impact on area and power.
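To make the notion of non-data-dependent instructions concrete, the following is a minimal illustrative sketch and not the mechanism evaluated in the paper. It assumes a simplified register-only instruction model and a hypothetical helper, `split_on_stalled_load`, that partitions the instructions following a stalled load into those that could safely proceed and those that must wait. The names `Instr`, `WINDOW`, and the register labels are all assumptions made for illustration; memory dependences and control flow are ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical register-only instruction: one destination, some sources."""
    text: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def split_on_stalled_load(
    window: List[Instr], load_dest: str
) -> Tuple[List[Instr], List[Instr]]:
    """Partition the instructions that follow a stalled load.

    An instruction is deferred if it reads a register whose value is not yet
    available (RAW), writes such a register (WAW), or writes a register that
    an already-deferred instruction still has to read (WAR). Everything else
    is treated as independent of the load.
    """
    unavailable = {load_dest}   # registers waiting on the load, directly or not
    pending_reads = set()       # sources of instructions that were deferred
    independent: List[Instr] = []
    deferred: List[Instr] = []

    for ins in window:
        raw = any(src in unavailable for src in ins.srcs)
        waw = ins.dest in unavailable
        war = ins.dest in pending_reads
        if raw or waw or war:
            deferred.append(ins)
            if ins.dest is not None:
                unavailable.add(ins.dest)
            pending_reads.update(ins.srcs)
        else:
            independent.append(ins)
    return independent, deferred


# Instructions following a load into r1 that missed in the cache (assumed).
WINDOW = [
    Instr("add r3, r1, r4", "r3", ("r1", "r4")),  # reads the load result
    Instr("sub r5, r6, r7", "r5", ("r6", "r7")),  # unrelated to the load
    Instr("mul r8, r3, r5", "r8", ("r3", "r5")),  # depends on r3, hence on r1
    Instr("add r4, r6, r6", "r4", ("r6", "r6")),  # would clobber r4 before the first add reads it
    Instr("or r9, r6, r7", "r9", ("r6", "r7")),   # unrelated to the load
]

if __name__ == "__main__":
    can_run, must_wait = split_on_stalled_load(WINDOW, "r1")
    print("independent:", [i.text for i in can_run])
    print("deferred:   ", [i.text for i in must_wait])
```

The sketch is deliberately conservative: once a register is marked unavailable it stays that way for the rest of the window, which trades some lost opportunities for a simple safety argument.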
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Bournoutian, G., Orailoglu, A. Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions. Des Autom Embed Syst 14, 309–326 (2010). https://doi.org/10.1007/s10617-010-9058-y
DOI: https://doi.org/10.1007/s10617-010-9058-y