Abstract
CMOS scaling exacerbates hardware errors, making reliability a major concern for current and future microarchitecture designs. Mechanisms that provide fault tolerance in architectures must meet several objectives, such as low performance degradation, low power consumption and small area overhead. Several studies have already proposed fault tolerance for parallel codes. However, these proposals usually rely on unrealistic environments, such as shared buses among processors, or modify highly optimized hardware designs such as caches. To address these challenges, we propose an architectural design called LBRA (Log-Based Redundant Architecture). Based on a Hardware Transactional Memory architecture, LBRA executes redundant threads which communicate through a pair-shared virtual memory log allocated in cache. Our initial version of LBRA executes these redundant threads in SMT cores. To avoid the performance penalty inherent to this architecture, we propose decoupling their execution across different cores, handling inter-core communication through a log buffer supported by a simple prefetch strategy. Simulation results using a variety of scientific and multimedia applications show that the execution time overhead of our best design is less than 7% over a base case without fault tolerance. Additionally, we show that LBRA outperforms previous proposals that we have implemented and evaluated in the same framework.
Notes
We could increase fault detection granularity to memory operations as well, but this requires a bigger log. Refer to Sect. 4.1.1 for more details.
A conflict occurs when an address appears in the write-set of two transactions or the write-set of one and the read-set of another [38].
Provided that the cache coherence protocol is MESI.
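The conflict condition stated in the notes above (an address in the write-set of two transactions, or in the write-set of one and the read-set of another) can be illustrated with a minimal software sketch. This is not the hardware mechanism used by LBRA or by the cited transactional memory design; the `Transaction` class, the `conflicts` function and the example addresses are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical model of a transaction's memory footprint."""
    read_set: set = field(default_factory=set)    # addresses read
    write_set: set = field(default_factory=set)   # addresses written


def conflicts(a: Transaction, b: Transaction) -> bool:
    """Return True if the two transactions conflict.

    A conflict exists when an address is in both write-sets, or in the
    write-set of one transaction and the read-set of the other.
    """
    write_write = a.write_set & b.write_set
    write_read = (a.write_set & b.read_set) | (b.write_set & a.read_set)
    return bool(write_write or write_read)


# Illustrative addresses (assumed values, not taken from the paper).
t1 = Transaction(read_set={0x10}, write_set={0x20})
t2 = Transaction(read_set={0x20}, write_set={0x30})  # reads what t1 writes
t3 = Transaction(read_set={0x10}, write_set={0x40})  # only shares a read with t1

print(conflicts(t1, t2))  # t1 writes 0x20, t2 reads 0x20
print(conflicts(t1, t3))  # shared reads alone do not conflict
```

Note that two transactions that only read the same address do not conflict under this definition, which is why `t1` and `t3` can proceed concurrently in the sketch.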
References
Agarwal, R., Garg, P., Torrellas, J.: Rebound: scalable checkpointing for coherent shared memory. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pp. 153–164. ACM, New York (2011)
Bartlett, J., Gray, J., Horst, B.: Fault tolerance in tandem computer systems. In: The Evolution of Fault-Tolerant Systems (1987)
Bernick, D., Bruckert, B., Vigna, P. D., Garcia, D., Jardine, R., Klecka, J., Smullen, J.: Nonstop advanced architecture. In: Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN’05), pp. 12–21. Yokohama, Japan (2005)
Bienia, C., Kumar, S., Singh, J.P., Li, K.: The parsec benchmark suite: characterization and architectural implications. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81 (2008)
Borkar, S.: Design challenges of technology scaling. IEEE Micro 19(4), 23–29 (1999)
Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V.: Parameter variations and impact on circuits and microarchitecture. In: DAC ’03: Proceedings of the 40th Annual Design Automation Conference, pp. 338–342, ACM, New York (2003)
Bowman, K., Tschanz, J., Wilkerson, C., Lu, S.-L., Karnik, T., De, V., Borkar, S.: Circuit techniques for dynamic variation tolerance. In: DAC ’09: Proceedings of the 46th Annual Design Automation Conference, pp. 4–7. ACM, NY (2009)
Censier, L.M., Feautrier, P.: Readings in computer architecture. chapter a new solution to coherence problems in multicache systems, pp. 576–582. Morgan Kaufmann Publishers Inc., San Francisco (2000)
Ceze, L., Tuck, J., Montesinos, P., Torrellas, J.: Bulksc: bulk enforcement of sequential consistency. In: Proceedings of the 34th International Symposium on Computer Architecture, pp. 278–289 (2007)
Culler, D., Singh, J.P., Gupta, A.: Parallel Computer Architecture: A Hardware/Software Approach (1998)
Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: Proceedings of the 38th Annual International Symposium on Computer Architecture, pp. 365–376 (2011)
Frank, D.J.: Power-constrained CMOS scaling limits. IBM J. Res. Dev. 46(2/3), 235–244 (2002)
Gomaa, M., Scarbrough, C., Vijaykumar, T.N., Pomeranz, I.: Transient-fault recovery for chip multiprocessors. In: Proceedings of the 30th International Symposium on Computer Architecture, pp. 98–109. San Diego, California (2003)
Kim, T.-H., Liu, J., Keane, J., Kim, C.: A 0.2 v, 480 kb subthreshold sram with 1 k cells per bitline for ultra-low-voltage computing. IEEE J. Solid-State Circuits 43(2), 518–529 (2008)
Kumar, R., Zyuban, V., Tullsen, D.M.: Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling. In: Proceedings of the 32nd International Symposium on Computer Architecture (ISCA’05), pp. 408–419. Madison, Wisconsin (2005)
LaFrieda, C., Ipek, E., Martinez, J.F., Manohar, R.: Utilizing dynamically coupled cores to form a resilient chip multiprocessor. In: Proceedings of the 37th International Conference on Dependable Systems and Networks, pp. 317–326. Edinburgh, UK (2007)
Li, M.-L., Ramachandran, P., Sahoo, S., Adve, S., Adve, V., Zhou, Y.: Understanding the propagation of hard errors to software and implications for resilient system design. In: Proceedings of the 13th International Conference on Architectural Support for Programming Language and Operating Systems, pp. 265–276. Seattle, WA, USA (2008)
Li, M.-L., Sasanka, R., Adve, S.V., Kuang Chen, Y., Debes, E.: The alpbench benchmark suite for complex multimedia applications. In: Proceedings of the IEEE International Symposium on Workload Characterization, pp. 34–45 (2005)
Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: a full system simulation platform. Computer 35(2), 50–58 (2002)
Martin, M.M.K., Sorin, D.J., Beckmann, B.M., Marty, M.R., Xu, M., Alameldeen, A.R., Moore, K.E., Hill, M.D., Wood, D.A.: Multifacet’s general execution-driven multiprocessor simulator (gems) toolset. SIGARCH Comput. Archit. News 33(4), 92–99 (2005)
Mukherjee, S., Kontz, M., Reinhardt, S.K.: Detailed design and evaluation of redundant multithreading alternatives. In: Proceedings of the 29th International Symposium on Computer Architecture, pp. 99–110. Anchorage, Alaska, USA (2002)
Olukotun, K., Nayfeh, B.A., Hammond, L., Wilson, K., Chang, K.: The case for a single-chip multiprocessor. SIGPLAN Not. 31, 2–11 (1996)
Rashid, M., Huang, M.: Supporting highly-decoupled thread-level redundancy for parallel programs. In: Proceedings of the 14th International Symposium on High Performance Computer Architecture, pp. 393–404. Salt Lake City, USA (2008)
Reinhardt, S.K., Mukherjee, S.: Transient fault detection via simultaneous multithreading. In: Proceedings of the 27th International Symposium on Computer Architecture, pp. 25–36. Vancouver, British Columbia, Canada (2000)
Sánchez, D., Aragón, J.L., García, J.M.: Repas: reliable execution for parallel applications in tiled-cmps. In: 15th International European Conference on Parallel and Distributed Computing (Euro-Par 2009), pp. 321–333 (2009)
Sánchez, D., Aragón, J.L., García, J.M.: A log-based redundant architecture for reliable parallel computation. In: 17th International Conference on High Performance Computing (HiPC), pp. 1–10. Goa (India) (2010)
Sasanka, R., Adve, S.V., Chen, Y.-K., Debes, E.: The energy efficiency of cmp vs. smt for multimedia workloads. In: Proceedings of the 18th Annual International Conference on Supercomputing, ICS ’04, pp. 196–206. ACM, NY (2004)
Smolens, J.C., Gold, B.T., Falsafi, B., Hoe, J.C.: Reunion: complexity-effective multicore redundancy. In: Proceedings of the 39th International Symposium on Microarchitecture, pp. 223–234. Orlando, Florida, USA (2006)
Smolens, J.C., Gold, B.T., Kim, J., Falsafi, B., Hoe, J.C., Nowatzyk, A.G.: Fingerprinting: bounding soft-error-detection latency and bandwidth. IEEE Micro. 24(6), 22–29 (2004)
Spainhower, L., Gregg, T.A.: Ibm s/390 parallel enterprise server g5 fault tolerance: a historical perspective. IBM J. Res. Dev. 43, 863–873 (1999)
Taur, Y.: CMOS design near to the limit of scaling. IBM J. Res. Dev. 46(2/3), 213–222 (2002)
Taylor, M.B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffman, H., Johnson, P., Lee, J.-W., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Strumpen, V., Frank, M., Amarasinghe, S., Agarwal, A.: The raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro 22(2), 25–35 (2002)
Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Iyer, P., Singh, A., Jacob, T., Jain, S., Venkataraman, S., Hoskote, Y., Borkar, N.: An 80-tile 1.28tflops network-on-chip in 65nm cmos. In: Solid-State Circuits Conference, 2007, ISSCC 2007. Digest of Technical Papers. IEEE, International, pp. 98–589 (2007)
Vijaykumar, T., Pomeranz, I., Cheng, K.: Transient fault recovery using simultaneous multithreading. In: Proceedings of the 29th International Symposium on Computer Architecture, pp. 98–109. Anchorage, Alaska (2002)
Wang, N.J., Patel, S.J.: Restore: symptom-based soft error detection in microprocessors. IEEE Trans. Dependable Secur. Comput. 3(3), 188–201 (2006)
Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the 22nd International Symposium on Computer Architecture (ISCA’95), pp. 24–36. Santa Margherita Ligure, Italy (1995)
Yalcin, G., Unsal, O., Hur, I., Cristal, A., Valero, M.: Faultm: fault-tolerance using hardware transactional memory. In: The 3rd Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA 2010), pp. 34–47 (2010)
Yen, L., Bobba, J., Marty, M.R., Moore, K.E., Volos, H., Hill, M.D., Swift, M.M., Wood, D.A.: Logtm-se: decoupling hardware transactional memory from caches. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture, pp. 261–272 (2007)
Acknowledgments
We thank the anonymous reviewers for their comments and suggestions, which improved this work. This work was jointly supported by the Spanish MINECO and Spanish MEC, as well as European Commission FEDER funds, under grant numbers TIN2012-38341-C04-03 and TIN2012-31345.
Cite this article
Sánchez, D., Cebrián, J.M., García, J.M. et al. Soft-error mitigation by means of decoupled transactional memory threads. Distrib. Comput. 28, 75–90 (2015). https://doi.org/10.1007/s00446-014-0215-6
DOI: https://doi.org/10.1007/s00446-014-0215-6