Soft-error mitigation by means of decoupled transactional memory threads

Sánchez, Daniel; Cebrián, Juan M.; García, José M.; Aragón, Juan L.

doi:10.1007/s00446-014-0215-6

Published: 29 April 2014

Volume 28, pages 75–90 (2015)
Distributed Computing

Abstract

CMOS scaling exacerbates hardware errors, making reliability a major concern for current and future microarchitecture designs. Fault-tolerance mechanisms must meet several objectives at once: low performance degradation, low power consumption, and low area overhead. Several studies have already proposed fault tolerance for parallel codes. However, these proposals usually rely on unrealistic environments, such as shared buses among processors, or modify highly optimized hardware structures such as caches. Our response to these challenges is an architectural design called LBRA (Log-Based Redundant Architecture). Built on a Hardware Transactional Memory architecture, LBRA executes redundant threads that communicate through a pair-shared virtual memory log allocated in cache. Our initial version of LBRA executes these redundant threads on SMT cores. To avoid the performance penalty inherent to this arrangement, we propose decoupling their execution onto different cores and handling inter-core communication with a log buffer supported by a simple prefetch strategy. Simulation results for a variety of scientific and multimedia applications show that the execution time overhead of our best design is less than 7% over a base case without fault tolerance. Additionally, we show that LBRA outperforms previous proposals that we have implemented and evaluated in the same framework.

Notes

We could increase fault detection granularity to memory operations as well, but this requires a bigger log. Refer to Sect. 4.1.1 for more details.
A conflict occurs when an address appears in the write-set of two transactions or the write-set of one and the read-set of another [38].
Provided that the cache coherence protocol is MESI.
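
The conflict definition in the second note can be illustrated with a small software sketch. This is only an assumption-laden illustration, not the hardware mechanism used by LBRA or by the HTM system in [38]: the Transaction class, its read_set and write_set fields, and the conflicting_addresses helper are hypothetical names introduced here, and real hardware would track these sets with dedicated structures rather than Python sets.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Transaction:
    """Hypothetical record of the addresses a transaction has read and written."""
    name: str
    read_set: Set[int] = field(default_factory=set)
    write_set: Set[int] = field(default_factory=set)


def conflicting_addresses(a: Transaction, b: Transaction) -> Set[int]:
    """Return the addresses on which transactions a and b conflict.

    An address conflicts if it is in both write-sets, or in the
    write-set of one transaction and the read-set of the other.
    Addresses that are only read by both do not conflict.
    """
    write_write = a.write_set & b.write_set
    write_read = (a.write_set & b.read_set) | (b.write_set & a.read_set)
    return write_write | write_read


def conflicts(a: Transaction, b: Transaction) -> bool:
    """True if the two transactions conflict on at least one address."""
    return bool(conflicting_addresses(a, b))


# Illustrative transactions (addresses are arbitrary example values).
t1 = Transaction("t1", read_set={0x10, 0x20}, write_set={0x30})
t2 = Transaction("t2", read_set={0x30}, write_set={0x40})
t3 = Transaction("t3", read_set={0x10}, write_set={0x50})

print(conflicts(t1, t2))  # t1 writes 0x30, which t2 reads
print(conflicts(t1, t3))  # both only read 0x10, so no conflict
```

In this sketch, t1 and t2 conflict because 0x30 is in the write-set of t1 and the read-set of t2, while t1 and t3 share only a read of 0x10 and therefore do not conflict.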

References

Agarwal, R., Garg, P., Torrellas, J.: Rebound: scalable checkpointing for coherent shared memory. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pp. 153–164. ACM, New York (2011)
Bartlett, J., Gray, J., Horst, B.: Fault tolerance in tandem computer systems. In: The Evolution of Fault-Tolerant Systems (1987)
Bernick, D., Bruckert, B., Vigna, P. D., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstop advanced architecture. In: Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN’05), pp. 12–21. Yokohama, Japan (2005)
Bienia, C., Kumar, S., Singh, J.P., Li, K.: The parsec benchmark suite: characterization and architectural implications. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81 (2008)
Borkar, S.: Design challenges of technology scaling. IEEE Micro 19(4), 23–29 (1999)
Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V.: Parameter variations and impact on circuits and microarchitecture. In: DAC ’03: Proceedings of the 40th Annual Design Automation Conference, pp. 338–342, ACM, New York (2003)
Bowman, K., Tschanz, J., Wilkerson, C., Lu, S.-L., Karnik, T., De, V., Borkar, S.: Circuit techniques for dynamic variation tolerance. In: DAC ’09: Proceedings of the 46th Annual Design Automation Conference, pp. 4–7. ACM, NY (2009)
Censier, L.M., Feautrier, P.: Readings in computer architecture. chapter a new solution to coherence problems in multicache systems, pp. 576–582. Morgan Kaufmann Publishers Inc., San Francisco (2000)
Ceze, L., Tuck, J., Montesinos, P., Torrellas, J.: Bulksc: bulk enforcement of sequential consistency. In: Proceedings of the 34th International Symposium on Computer Architecture, pp. 278–289 (2007)
Culler, D., Singh, J.P., Gupta, A.: Parallel Computer Architecture: A Hardware/Software Approach (1998)
Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 365–376 (2011)
Frank, D.J.: Power-constrained CMOS scaling limits. IBM J. Res. Dev. 46(2/3), 235–244 (2002)
Gomaa, M., Scarbrough, C., Vijaykumar, T.N., Pomeranz, I.: Transient-fault recovery for chip multiprocessors. In: Proceedings of the 30th International Symposium on Computer Architecture, pp. 98–109. San Diego, California (2003)
Kim, T.-H., Liu, J., Keane, J., Kim, C.: A 0.2 v, 480 kb subthreshold sram with 1 k cells per bitline for ultra-low-voltage computing. IEEE J. Solid-State Circuits 43(2), 518–529 (2008)
Kumar, R., Zyuban, V., Tullsen, D.M.: Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling. In: Proceedings of the 32nd International Symposium on Computer Architecture (ISCA’05), pp. 408–419. Madison, Wisconsin (2005)
LaFrieda, C., Ipek, E., Martinez, J.F., Manohar, R.: Utilizing dynamically coupled cores to form a resilient chip multiprocessor. In: Proceedings of the 37th International Conference on Dependable Systems and Networks, pp. 317–326. Edinburgh, UK (2007)
Li, M.-L., Ramachandran, P., Sahoo, S., Adve, S., Adve, V., Zhou, Y.: Understanding the propagation of hard errors to software and implications for resilient system design. In: Proceedings of the 13th International Conference on Architectural Support for Programming Language and Operating Systems, pp. 265–276. Seattle, WA, USA (2008)
Li, M.-L., Sasanka, R., Adve, S.V., Kuang Chen, Y., Debes, E.: The alpbench benchmark suite for complex multimedia applications. In: Proceedings of the IEEE International Symposium on Workload Characterization, pp. 34–45 (2005)
Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: a full system simulation platform. Computer 35(2), 50–58 (2002)
Martin, M.M.K., Sorin, D.J., Beckmann, B.M., Marty, M.R., Xu, M., Alameldeen, A.R., Moore, K.E., Hill, M.D., Wood, D.A.: Multifacet’s general execution-driven multiprocessor simulator (gems) toolset. SIGARCH Comput. Archit. News 33(4), 92–99 (2005)
Mukherjee, S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives. In: Proceedings of the 29th International Symposium on Computer Architecture, pp. 99–110. Anchorage, Alaska, USA (2002)
Olukotun, K., Nayfeh, B.A., Hammond, L., Wilson, K., Chang, K.: The case for a single-chip multiprocessor. SIGPLAN Not. 31, 2–11 (1996)
Rashid, M., Huang, M.: Supporting highly-decoupled thread-level redundancy for parallel programs. In: Proceedings of the 14th International Symposium on High Performance Computer Architecture, pp. 393–404. Salt Lake City, USA (2008)
Reinhardt, S.K., Mukherjee, S.: Transient fault detection via simultaneous multithreading. In: Proceedings of the 27th International Symposium on Computer Architecture, pp. 25–36. Vancouver, British Columbia, Canada (2000)
Sánchez, D., Aragón, J.L., García, J.M.: Repas: reliable execution for parallel applications in tiled-cmps. In: 15th International European Conference on Parallel and Distributed Computing (Euro-Par 2009), pp. 321–333 (2009)
Sánchez, D., Aragón, J.L., García, J.M.: A log-based redundant architecture for reliable parallel computation. In: 17th International Conference on High Performance Computing (HiPC), pp. 1–10. Goa (India) (2010)
Sasanka, R., Adve, S.V., Chen, Y.-K., Debes, E.: The energy efficiency of cmp vs. smt for multimedia workloads. In: Proceedings of the 18th Annual International Conference on Supercomputing, ICS ’04, pp. 196–206. ACM, NY (2004)
Smolens, J.C., Gold, B.T., Falsafi, B., Hoe, J.C.: Reunion: complexity-effective multicore redundancy. In: Proceedings of the 39th International Symposium on Microarchitecture, pp. 223–234. Orlando, Florida, USA (2006)
Smolens, J.C., Gold, B.T., Kim, J., Falsafi, B., Hoe, J.C., Nowatzyk, A.G.: Fingerprinting: bounding soft-error-detection latency and bandwidth. IEEE Micro. 24(6), 22–29 (2004)
Spainhower, L., Gregg, T.A.: Ibm s/390 parallel enterprise server g5 fault tolerance: a historical perspective. IBM J. Res. Dev. 43, 863–873 (1999)
Taur, Y.: CMOS design near to the limit of scaling. IBM J. Res. Dev. 46(2/3), 213–222 (2002)
Taylor, M.B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffman, H., Johnson, P., Lee, J.-W., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Strumpen, V., Frank, M., Amarasinghe, S., Agarwal, A.: The raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro 22(2), 25–35 (2002)
Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Iyer, P., Singh, A., Jacob, T., Jain, S., Venkataraman, S., Hoskote, Y., Borkar, N.: An 80-tile 1.28tflops network-on-chip in 65nm cmos. In: Solid-State Circuits Conference, 2007, ISSCC 2007. Digest of Technical Papers. IEEE, International, pp. 98–589 (2007)
Vijaykumar, T., Pomeranz, I., Cheng, K.: Transient fault recovery using simultaneous multithreading. In: Proceedings of the 29th International Symposium on Computer Architecture, pp. 98–109. Anchorage, Alaska (2002)
Wang, N.J., Patel, S.J.: Restore: symptom-based soft error detection in microprocessors. IEEE Trans. Dependable Secur. Comput. 3(3), 188–201 (2006)
Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the 22nd International Symposium on Computer Architecture (ISCA’95), pp. 24–36. Santa Margherita Ligure, Italy (1995)
Yalcin, G., Unsal, O., Hur, I., Cristal, A., Valero, M.: Faultm: fault-tolerance using hardware transactional memory. In: The 3rd Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA 2010), pp. 34–47 (2010)
Yen, L., Bobba, J., Marty, M.R., Moore, K.E., Volos, H., Hill, M.D., Swift, M.M., Wood, D.A.: Logtm-se: decoupling hardware transactional memory from caches. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture, pp. 261–272 (2007)

Acknowledgments

Thanks to the anonymous reviewers for their comments and suggestions which definitely improved this work. This work was jointly supported by the Spanish MINECO and Spanish MEC, as well as European Commission FEDER funds under grant numbers TIN2012-38341-C04-03 and TIN2012-31345.

Author information

Authors and Affiliations

Computer Engineering Department, Facultad de Informática, University of Murcia, 30100 Murcia, Spain
Daniel Sánchez, Juan M. Cebrián, José M. García & Juan L. Aragón

Corresponding author

Correspondence to Daniel Sánchez.

About this article

Cite this article

Sánchez, D., Cebrián, J.M., García, J.M. et al. Soft-error mitigation by means of decoupled transactional memory threads. Distrib. Comput. 28, 75–90 (2015). https://doi.org/10.1007/s00446-014-0215-6

Received: 18 April 2012
Accepted: 08 April 2014
Published: 29 April 2014
Issue Date: April 2015
DOI: https://doi.org/10.1007/s00446-014-0215-6
