Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions

Bournoutian, Garo; Orailoglu, Alex

doi:10.1007/s10617-010-9058-y

Open access
Published: 21 July 2010

Volume 14, pages 309–326 (2010)
Design Automation for Embedded Systems

Abstract

Today, embedded processors are expected to run algorithmically complex, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, the impact of memory latencies on execution time becomes increasingly evident. At the same time, embedded processors are expected to be power-conscious and to occupy minimal area, as they are often used in mobile devices such as wireless smartphones and portable MP3 players. Traditional methods for addressing performance and memory latencies, such as multiple issue, out-of-order execution, and large, associative caches, are therefore poorly suited to the mobile embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latencies, one that would otherwise not be possible in a regular in-order, single-issue embedded processor without large, power-hungry constructs such as a Reorder Buffer (ROB). The concept relies on efficiently leveraging both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing in the event of a memory stall. Simulation results show an improvement in overall execution throughput of approximately 11%, with minimal impact on area and power.
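The abstract does not spell out the extraction mechanism, so the following is only a minimal sketch of the general idea of "guaranteed independent" instructions, not the authors' algorithm. It assumes a toy straight-line instruction list with register operands only; the names `Instr` and `independent_after_load` are hypothetical, and memory dependencies (stores) and control flow are deliberately ignored. An instruction after a missing load is treated as safe to continue only if it reads no pending value and overwrites no register that a deferred instruction still needs.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Toy instruction: an opcode, one optional destination, and source registers."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def independent_after_load(instrs: List[Instr], load_index: int) -> List[int]:
    """Return indices of instructions after the load that can safely run
    while the load is still outstanding (conservative, registers only)."""
    pending = {instrs[load_index].dest}  # values not yet available
    reserved = set()                     # sources that deferred instructions still need
    independent = []
    for i in range(load_index + 1, len(instrs)):
        ins = instrs[i]
        reads_pending = any(s in pending for s in ins.srcs)            # RAW hazard
        clobbers = ins.dest in pending or ins.dest in reserved         # WAW / WAR hazard
        if reads_pending or clobbers:
            # The instruction must wait; its result becomes pending and
            # its sources must not be overwritten by later instructions.
            if ins.dest is not None:
                pending.add(ins.dest)
            reserved.update(ins.srcs)
        else:
            independent.append(i)
    return independent


# Hypothetical five-instruction sequence in which the load at index 0 misses.
program = [
    Instr("load", "r1", ("r2",)),      # 0: cache miss
    Instr("add", "r3", ("r1", "r4")),  # 1: reads r1, must wait
    Instr("mul", "r5", ("r6", "r7")),  # 2: independent
    Instr("sub", "r4", ("r5", "r6")),  # 3: overwrites r4, which instruction 1 still reads
    Instr("or", "r8", ("r5", "r6")),   # 4: independent
]
print(independent_after_load(program, 0))
```

In this sketch the analysis is deliberately conservative: any instruction that might conflict with a deferred one is deferred as well, which trades some lost overlap for not needing a reorder buffer to undo mistakes.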

Author information

Authors and Affiliations

CSE Department, University of California, San Diego, 9500 Gilman Dr. #0404, La Jolla, CA, 92093-0404, USA
Garo Bournoutian & Alex Orailoglu

Corresponding author

Correspondence to Garo Bournoutian.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Bournoutian, G., Orailoglu, A. Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions. Des Autom Embed Syst 14, 309–326 (2010). https://doi.org/10.1007/s10617-010-9058-y

Received: 11 May 2010
Accepted: 24 June 2010
Published: 21 July 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s10617-010-9058-y
