Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors

  • Conference paper
High Performance Embedded Architectures and Compilers (HiPEAC 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5952)

Abstract

As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms including negative-bias temperature instability and time-dependent dielectric breakdown can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro’s wearout-centric scheduling outperforms both performance-counter-based and temperature-sensor-based schedulers, achieving an order of magnitude more improvement in lifetime throughput, defined as the amount of useful work done by a system prior to failure.
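The wearout-centric job assignment described in the abstract can be illustrated with a minimal sketch. This is not the heuristic from the paper, which the abstract does not specify; it is a simple greedy wear-leveling rule under stated assumptions. The names `assign_jobs`, `run_interval`, `core_wear`, and `job_stress` are hypothetical: `core_wear` stands in for a per-core wear estimate derived from sensors, and `job_stress` for an estimate of how much wear each job induces per scheduling interval.

```python
from typing import Dict, List


def assign_jobs(core_wear: List[float], job_stress: List[float]) -> Dict[int, int]:
    """Greedy wear-leveling: pair the most stressful job with the least-worn core.

    core_wear[i]  -- assumed sensor-derived wear estimate for core i (higher = more aged)
    job_stress[j] -- assumed wear that job j adds to its core per interval
    Returns a mapping from job index to core index.
    """
    if len(job_stress) > len(core_wear):
        raise ValueError("this sketch assumes at most one job per core")

    # Healthiest (least-worn) cores first.
    cores_by_health = sorted(range(len(core_wear)), key=lambda i: core_wear[i])
    # Most stressful jobs first.
    jobs_by_stress = sorted(
        range(len(job_stress)), key=lambda j: job_stress[j], reverse=True
    )
    # zip stops at the shorter list, so surplus (most-worn) cores stay idle.
    return dict(zip(jobs_by_stress, cores_by_health))


def run_interval(core_wear: List[float], job_stress: List[float]) -> List[float]:
    """Apply one scheduling interval and return the updated wear estimates."""
    new_wear = list(core_wear)
    for job, core in assign_jobs(core_wear, job_stress).items():
        new_wear[core] += job_stress[job]
    return new_wear


if __name__ == "__main__":
    wear = [0.30, 0.10, 0.20, 0.05]   # four cores, core 0 is the most aged
    stress = [0.01, 0.04, 0.02]       # three jobs, job 1 is the most stressful
    print(assign_jobs(wear, stress))
    print(run_interval(wear, stress))
```

In this sketch the most-aged core is left idle whenever there are fewer jobs than cores, which is one simple way to slow its further degradation. A round-robin scheduler, by contrast, would rotate jobs across cores without regard to their wear state.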

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Feng, S., Gupta, S., Ansari, A., Mahlke, S. (2010). Maestro: Orchestrating Lifetime Reliability in Chip Multiprocessors. In: Patt, Y.N., Foglia, P., Duesterwald, E., Faraboschi, P., Martorell, X. (eds) High Performance Embedded Architectures and Compilers. HiPEAC 2010. Lecture Notes in Computer Science, vol 5952. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11515-8_15

  • DOI: https://doi.org/10.1007/978-3-642-11515-8_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11514-1

  • Online ISBN: 978-3-642-11515-8

  • eBook Packages: Computer Science, Computer Science (R0)
