Soft-error mitigation by means of decoupled transactional memory threads

Distributed Computing

Abstract

CMOS scaling exacerbates hardware errors, making reliability a major concern for recent and future microarchitecture designs. Mechanisms that provide fault tolerance in architectures must meet several objectives at once: low performance degradation, low power consumption, and low area overhead. Several studies have already proposed fault tolerance for parallel codes. However, these proposals usually rely on unrealistic assumptions, such as shared buses among processors, or on modifying highly optimized hardware designs such as caches. To address these combined challenges, we propose an architectural design called LBRA (Log-Based Redundant Architecture). Built on a Hardware Transactional Memory architecture, LBRA executes redundant threads that communicate through a pair-shared virtual memory log allocated in cache. Our initial version of LBRA executes these redundant threads in SMT cores. To avoid the performance penalty inherent to this architecture, we propose decoupling their execution across different cores, handling inter-core communication by means of a log buffer supported by a simple prefetch strategy. Simulation results using a variety of scientific and multimedia applications show that the execution time overhead of our best design is less than 7% over a base case without fault tolerance. Additionally, we show that LBRA outperforms previous proposals that we have implemented and evaluated in the same framework.
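To make the idea of redundant threads communicating through a pair-shared log more concrete, here is a minimal software sketch. It is not the LBRA hardware design: the names `PairLog`, `run_leading` and `check_trailing` are hypothetical, and the assumption that the second thread compares its own stores against the logged ones is a common redundant-execution pattern rather than something the abstract specifies.

```python
from dataclasses import dataclass, field


@dataclass
class PairLog:
    """Hypothetical log shared by one pair of redundant threads."""
    entries: list = field(default_factory=list)

    def append(self, address, value):
        self.entries.append((address, value))


def run_leading(stores, log):
    """Leading thread: record every (address, value) store in the log."""
    for address, value in stores:
        log.append(address, value)


def check_trailing(stores, log):
    """Trailing thread: compare its own stores against the logged ones.

    Returns the index of the first mismatch, or None if both agree.
    """
    for i, (logged, own) in enumerate(zip(log.entries, stores)):
        if logged != own:
            return i
    if len(stores) != len(log.entries):
        # One thread produced more stores than the other.
        return min(len(stores), len(log.entries))
    return None
```

In this toy model a mismatch stands in for a detected soft error. What happens after detection (for example, rolling back a transaction) is outside the sketch.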


Notes

  1. We could increase fault detection granularity to memory operations as well, but this requires a bigger log. Refer to Sect. 4.1.1 for more details.

  2. A conflict occurs when an address appears in the write-set of two transactions or the write-set of one and the read-set of another [38].

  3. Provided that the cache coherence protocol is MESI.
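The conflict definition in note 2 can be expressed as a set-intersection check. The following is a minimal sketch, assuming each transaction is represented as a pair of Python sets of addresses; the function name `conflicts` and this representation are illustrative assumptions, not part of any hardware transactional memory interface.

```python
def conflicts(tx_a, tx_b):
    """Return True if two transactions conflict.

    Each transaction is a (read_set, write_set) pair of address sets.
    A conflict exists when an address is in both write-sets, or in the
    write-set of one transaction and the read-set of the other.
    """
    reads_a, writes_a = tx_a
    reads_b, writes_b = tx_b
    return bool(
        writes_a & writes_b
        or writes_a & reads_b
        or writes_b & reads_a
    )
```

Two transactions that only read the same address do not conflict under this definition, since neither write-set is involved.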

References

  1. Agarwal, R., Garg, P., Torrellas, J.: Rebound: scalable checkpointing for coherent shared memory. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pp. 153–164. ACM, New York (2011)

  2. Bartlett, J., Gray, J., Horst, B.: Fault tolerance in tandem computer systems. In: The Evolution of Fault-Tolerant Systems (1987)

  3. Bernick, D., Bruckert, B., Vigna, P. D., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstop advanced architecture. In: Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN’05), pp. 12–21. Yokohama, Japan (2005)

  4. Bienia, C., Kumar, S., Singh, J.P., Li, K.: The parsec benchmark suite: characterization and architectural implications. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81 (2008)

  5. Borkar, S.: Design challenges of technology scaling. IEEE Micro 19(4), 23–29 (1999)

  6. Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V.: Parameter variations and impact on circuits and microarchitecture. In: DAC ’03: Proceedings of the 40th Annual Design Automation Conference, pp. 338–342, ACM, New York (2003)

  7. Bowman, K., Tschanz, J., Wilkerson, C., Lu, S.-L., Karnik, T., De, V., Borkar, S.: Circuit techniques for dynamic variation tolerance. In: DAC ’09: Proceedings of the 46th Annual Design Automation Conference, pp. 4–7. ACM, NY (2009)

  8. Censier, L.M., Feautrier, P.: Readings in computer architecture. chapter a new solution to coherence problems in multicache systems, pp. 576–582. Morgan Kaufmann Publishers Inc., San Francisco (2000)

  9. Ceze, L., Tuck, J., Montesinos, P., Torrellas, J.: Bulksc: bulk enforcement of sequential consistency. In: Proceedings of the 34th International Symposium on Computer Architecture, pp. 278–289 (2007)

  10. Culler, D., Singh, J.P., Gupta, A.: Parallel Computer Architecture: A Hardware/Software Approach (1998)

  11. Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 365–376 (2011)

  12. Frank, D.J.: Power-constrained CMOS scaling limits. IBM J. Res. Dev. 46(2/3), 235–244 (2002)

  13. Gomaa, M., Scarbrough, C., Vijaykumar, T.N., Pomeranz, I.: Transient-fault recovery for chip multiprocessors. In: Proceedings of the 30th International Symposium on Computer Architecture, pp. 98–109. San Diego, California (2003)

  14. Kim, T.-H., Liu, J., Keane, J., Kim, C.: A 0.2 v, 480 kb subthreshold sram with 1 k cells per bitline for ultra-low-voltage computing. IEEE J. Solid-State Circuits 43(2), 518–529 (2008)

  15. Kumar, R., Zyuban, V., Tullsen, D.M.: Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling. In: Proceedings of the 32nd International Symposium on Computer Architecture (ISCA’05), pp. 408–419. Madison, Wisconsin (2005)

  16. LaFrieda, C., Ipek, E., Martinez, J.F., Manohar, R.: Utilizing dynamically coupled cores to form a resilient chip multiprocessor. In: Proceedings of the 37th International Conference on Dependable Systems and Networks, pp. 317–326. Edinburgh, UK (2007)

  17. Li, M.-L., Ramachandran, P., Sahoo, S., Adve, S., Adve, V., Zhou, Y.: Understanding the propagation of hard errors to software and implications for resilient system design. In: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 265–276. Seattle, WA, USA (2008)

  18. Li, M.-L., Sasanka, R., Adve, S.V., Kuang Chen, Y., Debes, E.: The alpbench benchmark suite for complex multimedia applications. In: Proceedings of the IEEE International Symposium on Workload Characterization, pp. 34–45 (2005)

  19. Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: a full system simulation platform. Computer 35(2), 50–58 (2002)

  20. Martin, M.M.K., Sorin, D.J., Beckmann, B.M., Marty, M.R., Xu, M., Alameldeen, A.R., Moore, K.E., Hill, M.D., Wood, D.A.: Multifacet’s general execution-driven multiprocessor simulator (gems) toolset. SIGARCH Comput. Archit. News 33(4), 92–99 (2005)

  21. Mukherjee, S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives. In: Proceedings of the 29th International Symposium on Computer Architecture, pp. 99–110. Anchorage, Alaska, USA (2002)

  22. Olukotun, K., Nayfeh, B.A., Hammond, L., Wilson, K., Chang, K.: The case for a single-chip multiprocessor. SIGPLAN Not. 31, 2–11 (1996)

  23. Rashid, M., Huang, M.: Supporting highly-decoupled thread-level redundancy for parallel programs. In: Proceedings of the 14th International Symposium on High Performance Computer Architecture, pp. 393–404. Salt Lake City, USA (2008)

  24. Reinhardt, S.K., Mukherjee, S.: Transient fault detection via simultaneous multithreading. In: Proceedings of the 27th International Symposium on Computer Architecture, pp. 25–36. Vancouver, British Columbia, Canada (2000)

  25. Sánchez, D., Aragón, J.L., García, J.M.: Repas: reliable execution for parallel applications in tiled-cmps. In: 15th International European Conference on Parallel and Distributed Computing (Euro-Par 2009), pp. 321–333 (2009)

  26. Sánchez, D., Aragón, J.L., García, J.M.: A log-based redundant architecture for reliable parallel computation. In: 17th International Conference on High Performance Computing (HiPC), pp. 1–10. Goa (India) (2010)

  27. Sasanka, R., Adve, S.V., Chen, Y.-K., Debes, E.: The energy efficiency of cmp vs. smt for multimedia workloads. In: Proceedings of the 18th Annual International Conference on Supercomputing, ICS ’04, pp. 196–206. ACM, NY (2004)

  28. Smolens, J.C., Gold, B.T., Falsafi, B., Hoe, J.C.: Reunion: complexity-effective multicore redundancy. In: Proceedings of the 39th International Symposium on Microarchitecture, pp. 223–234. Orlando, Florida, USA (2006)

  29. Smolens, J.C., Gold, B.T., Kim, J., Falsafi, B., Hoe, J.C., Nowatzyk, A.G.: Fingerprinting: bounding soft-error-detection latency and bandwidth. IEEE Micro 24(6), 22–29 (2004)

  30. Spainhower, L., Gregg, T.A.: Ibm s/390 parallel enterprise server g5 fault tolerance: a historical perspective. IBM J. Res. Dev. 43, 863–873 (1999)

  31. Taur, Y.: CMOS design near to the limit of scaling. IBM J. Res. Dev. 46(2/3), 213–222 (2002)

  32. Taylor, M.B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffman, H., Johnson, P., Lee, J.-W., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Strumpen, V., Frank, M., Amarasinghe, S., Agarwal, A.: The raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro 22(2), 25–35 (2002)

  33. Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Iyer, P., Singh, A., Jacob, T., Jain, S., Venkataraman, S., Hoskote, Y., Borkar, N.: An 80-tile 1.28tflops network-on-chip in 65nm cmos. In: Solid-State Circuits Conference, 2007, ISSCC 2007. Digest of Technical Papers. IEEE, International, pp. 98–589 (2007)

  34. Vijaykumar, T., Pomeranz, I., Cheng, K.: Transient fault recovery using simultaneous multithreading. In: Proceedings of the 29th International Symposium on Computer Architecture, pp. 98–109. Anchorage, Alaska (2002)

  35. Wang, N.J., Patel, S.J.: Restore: symptom-based soft error detection in microprocessors. IEEE Trans. Dependable Secur. Comput. 3(3), 188–201 (2006)

  36. Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the 22nd International Symposium on Computer Architecture (ISCA’95), pp. 24–36. Santa Margherita Ligure, Italy (1995)

  37. Yalcin, G., Unsal, O., Hur, I., Cristal, A., Valero, M.: Faultm: fault-tolerance using hardware transactional memory. In: The 3rd Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA 2010), pp. 34–47 (2010)

  38. Yen, L., Bobba, J., Marty, M.R., Moore, K.E., Volos, H., Hill, M.D., Swift, M.M., Wood, D.A.: Logtm-se: decoupling hardware transactional memory from caches. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture, pp. 261–272 (2007)

Acknowledgments

Thanks to the anonymous reviewers for their comments and suggestions which definitely improved this work. This work was jointly supported by the Spanish MINECO and Spanish MEC, as well as European Commission FEDER funds under grant numbers TIN2012-38341-C04-03 and TIN2012-31345.

Author information

Corresponding author

Correspondence to Daniel Sánchez.

About this article

Cite this article

Sánchez, D., Cebrián, J.M., García, J.M. et al. Soft-error mitigation by means of decoupled transactional memory threads. Distrib. Comput. 28, 75–90 (2015). https://doi.org/10.1007/s00446-014-0215-6

  • DOI: https://doi.org/10.1007/s00446-014-0215-6
