Abstract
Reliability, Availability, and Serviceability (RAS) are key considerations in hardware design, whether for mobile devices or high-end servers. However, provisioning RAS is often at odds with meeting performance and energy targets, and it increases the overall design cost of the chip. As a result of this tension, chip design companies have to make difficult decisions about how much RAS they can incorporate into each product in their portfolio, and even which customers and market segments they can realistically target. On the other hand, highly scaled silicon technology nodes are susceptible to a variety of reliability problems, and emerging technologies such as die-stacking and non-volatile memory, while critical for meeting future computing demands, have significant reliability challenges of their own. RAS features can actually serve to reduce the deployment costs of these technologies (e.g., by increasing effective yield). Determining the tradeoff between design cost, deployment cost, and the RAS needs of a market is the critical issue to address when evaluating RAS features. In this article, we shed light, from an industry perspective, on this struggle between driving greater efficiency, lowering costs, and meeting the RAS demands of various market segments. We argue that ending this struggle requires sufficient flexibility in the design to adapt to the needs of a wide range of applications and hardware configurations. We call such an approach “resilience proportionality” and believe that it should guide future architectural reliability research. Finally, we discuss how resilience proportionality can be achieved and the challenges that need to be addressed.
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Sridharan, V., Gurumurthi, S. (2018). Resilience Proportionality—A Paradigm for Efficient and Reliable System Design. In: Ottavi, M., Gizopoulos, D., Pontarelli, S. (eds) Dependable Multicore Architectures at Nanoscale. Springer, Cham. https://doi.org/10.1007/978-3-319-54422-9_9
DOI: https://doi.org/10.1007/978-3-319-54422-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54421-2
Online ISBN: 978-3-319-54422-9
eBook Packages: Engineering, Engineering (R0)