Distributed Computing, Volume 24, Issue 6, pp 323–355

Reconciling fault-tolerant distributed computing and systems-on-chip

Open Access

Abstract

Classic distributed computing abstractions are a poor match for the reality of digital logic gates, which are the elementary building blocks of Systems-on-Chip (SoCs) and other Very Large Scale Integrated (VLSI) circuits: massively concurrent, continuous computations undermine the concept of sequential processes executing sequences of atomic zero-time computing steps, and very limited computational resources at gate level make even simple operations prohibitively costly. In this paper, we introduce a modeling and analysis framework based on continuous computations and zero-bit message channels, and employ this framework for the correctness and performance analysis of a distributed fault-tolerant clocking approach for SoCs. Starting out from a “classic” distributed Byzantine fault-tolerant tick generation algorithm, we show how to adapt it for direct implementation in clockless digital logic, rigorously prove its correctness, and derive analytic expressions for worst-case performance metrics such as synchronization precision and clock frequency. Both the algorithm’s correctness and the achievable synchronization precision depend solely on the ratios of certain path delays rather than on absolute delay values. Since these ratios can be mapped directly to placement and routing constraints, there is typically no need to change the algorithm when migrating to a faster implementation technology and/or when using a slightly different layout in an SoC.

Keywords

Clock synchronization, Fault-tolerant distributed systems, Modeling approaches, VLSI

Copyright information

© The Author(s) 2011

Authors and Affiliations

  1. Technische Universität Wien, Embedded Computing Systems Group (E182/2), Vienna, Austria
