International Journal of Parallel Programming, Volume 44, Issue 2, pp 208–232

Architectural Support for Fault Tolerance in a Teradevice Dataflow System

  • Sebastian Weis
  • Arne Garbade
  • Bernhard Fechner
  • Avi Mendelson
  • Roberto Giorgi
  • Theo Ungerer


The high parallelism of future Teradevices, which are expected to contain more than 1,000 complex cores on a single die, calls for new execution paradigms. Coarse-grained dataflow execution models can exploit such parallelism because they combine side-effect-free execution with reduced synchronization overhead. However, the terascale transistor integration of such future chips makes them orders of magnitude more vulnerable to voltage fluctuations, radiation, and process variations. Dynamic fault-tolerance mechanisms therefore have to be an essential part of such future systems. In this paper, we present a fault-tolerant architecture for a coarse-grained dataflow system that leverages the inherent features of the dataflow execution model. Specifically, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults at runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results show that redundant execution of dataflow threads can make efficient use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery incurs only moderate overhead, even at high fault rates.
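To make the idea of redundant execution with thread-level recovery more concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the architecture described in the paper: the names `run_redundant`, `make_executor`, and `square_sum` are hypothetical, faults are modeled as a single flipped result bit, and detection is a plain comparison of the two results. The sketch relies only on the property named above: a side-effect-free thread leaves its inputs untouched, so it can simply be restarted when the two executions disagree.

```python
from typing import Callable, Set, Tuple

Inputs = Tuple[int, ...]
Thread = Callable[[Inputs], int]
Executor = Callable[[Thread, Inputs], int]


def make_executor(faulty_runs: Set[int]) -> Executor:
    """Hypothetical executor: flips bit 0 of the result on the given run indices."""
    count = 0

    def execute(thread: Thread, inputs: Inputs) -> int:
        nonlocal count
        result = thread(inputs)
        if count in faulty_runs:
            result ^= 1  # assumed fault model: one corrupted result bit
        count += 1
        return result

    return execute


def run_redundant(thread: Thread, inputs: Inputs, execute: Executor,
                  max_retries: int = 3) -> Tuple[int, int]:
    """Execute a side-effect-free thread twice; commit only if both runs agree.

    Returns (committed_result, number_of_restarts).
    """
    for restarts in range(max_retries + 1):
        leading = execute(thread, inputs)
        trailing = execute(thread, inputs)
        if leading == trailing:
            return leading, restarts
        # Mismatch: discard both results. The inputs were never modified,
        # so restarting the thread from its inputs is safe.
    raise RuntimeError("results still disagree after all retries")


def square_sum(inputs: Inputs) -> int:
    """Example side-effect-free thread body: reads inputs, returns one value."""
    return sum(x * x for x in inputs)


if __name__ == "__main__":
    # One corrupted run (index 1) forces exactly one restart.
    print(run_redundant(square_sum, (1, 2, 3), make_executor({1})))
```

Note that a plain comparison cannot catch a fault that corrupts both executions identically, and a mismatch that persists across all retries is only reported here, not diagnosed; both points are simplifications of this sketch.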


Coarse-grained dataflow, Fault tolerance, Fault detection, Recovery, Reliability



This work was partly funded by the European FP7 Projects TERAFLUX (id. 249013) and HiPEAC (IST-217068). The authors wish to thank N. Puzovic and Z. Popovic for their initial studies on the DTA-C architecture and P. Faraboschi of HP for his valuable suggestions and support on the COTSon simulator.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Sebastian Weis (1)
  • Arne Garbade (1)
  • Bernhard Fechner (1)
  • Avi Mendelson (2)
  • Roberto Giorgi (3)
  • Theo Ungerer (1)

  1. University of Augsburg, Augsburg, Germany
  2. Technion, Haifa, Israel
  3. University of Siena, Siena, Italy
