Journal of Signal Processing Systems, Volume 77, Issue 3, pp 257–280

Self-Adapting Resource Escalation for Resilient Signal Processing Architectures

  • Naveed Imran
  • Ronald F. DeMara
  • Jooheung Lee
  • Jian Huang

Abstract

To cope with susceptibility to aging and process variation in the deep submicron era, signal processing systems must maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. The Priority Using Resource Escalation (PURE) online resiliency approach is developed herein to maintain throughput quality based on the output Peak Signal-to-Noise Ratio (PSNR) or another health metric. PURE is evaluated using an H.263 video encoder and shown to maintain signal processing throughput despite hardware faults. Its performance is compared to two alternative reconfiguration algorithms, which prioritize optimizing the number of reconfiguration occurrences and the fault detection latency, respectively. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32 dB. Compared to the alternatives, PURE maintains a PSNR within a difference of 4.02 dB to 6.67 dB from the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. The results indicate the benefits of priority-aware resiliency over conventional redundancy in terms of fault recovery, power consumption, and resource-area requirements.
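Since the abstract uses output PSNR as the health metric, a minimal sketch of how such a metric can be computed may help. It uses the standard definition PSNR = 10 log10(peak² / MSE) for 8-bit samples. The function names `psnr` and `meets_target`, and the idea of comparing against a fixed target value, are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(reference, degraded, peak=255.0):
    """Return the PSNR in dB between two equal-length sequences of pixel values.

    `peak` is the maximum possible sample value (255 for 8-bit video).
    """
    if len(reference) != len(degraded) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the degraded samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


def meets_target(psnr_db, target_db):
    """Hypothetical health check: True when the measured PSNR reaches the target."""
    return psnr_db >= target_db


if __name__ == "__main__":
    original = [100, 100, 100, 100]
    decoded = [110, 110, 110, 110]  # every sample is off by 10, so MSE = 100
    value = psnr(original, decoded)
    print(f"PSNR = {value:.2f} dB, meets 32 dB target: {meets_target(value, 32.0)}")
```

In a frame-based encoder the same computation would typically run over all luminance samples of a frame, with the target chosen by the application.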

Keywords

Survivability; Reconfigurable hardware for video compression; Semiconductor aging; Process variations; Autonomous recovery; Mission-critical systems; FPGAs

NOTATIONS

  • G(V,E): An undirected graph, where V is the set of all nodes and E is the set of edges
  • C: Connectivity matrix
  • C(t): Connectivity C at time instant t
  • \(\mathbf{\Psi}\): Syndrome matrix
  • \(\Phi\): Fitness state vector
  • \(\hat{\Phi}\): Estimated fitness state vector
  • P: Priority vector
  • t(G): Diagnosability of G
  • d(G): Average degree of a node in G
  • \(V_{a}\): Set of active nodes
  • \(V_{s}\): Set of Reconfigurable Slacks (RSs) used to diagnose the active nodes by comparison-based diagnosis
  • \(V_{h}\): Set of healthy nodes
  • \(V_{NMR}\): Set for N-Modular Redundancy checking
  • M: Number of PRRs
  • N: Total number of nodes
  • \(N_{a}\): Number of nodes in the datapath (i.e., \(|V_a|\))
  • \(N_{s}\): Number of Reconfigurable Slacks (i.e., \(|V_s|\))
  • \(N_{d}\): Number of defectives
  • r: Testing arrangement instance (may involve multiple reconfigurations)
  • s: Slack update instance (a slack is reconfigured with some function)
  • t: Time instant
  • \(T_{recon}\): Reconfiguration time
  • \(T_{eval}\): Evaluation window period
  • F: Function assignment vector
  • \(F^{*}\): Solution vector F after recovery
  • \(T_{d}\): Latency of fault detection
  • \(T_{diag}\): Latency of fault diagnosis
  • \(T_{rec}\): Recovery time
  • \(N_{r}\): Number of testing arrangement instances before the diagnosis completes
  • \(N_{sup}\): Number of slack updates
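To make the roles of the priority vector P, the estimated fitness state vector and the function assignment vector F more concrete, the following is a loose illustration of what escalating healthy resources to higher-priority functions could look like. It is not the PURE algorithm from the paper, whose details are not given in this excerpt. The greedy rule, the function name `escalate`, and the example function names and priority values are all hypothetical.

```python
def escalate(priority, healthy, assignment):
    """Greedy sketch: give healthy nodes to the highest-priority functions first.

    priority:   dict mapping function name -> priority (larger = more important),
                a stand-in for the priority vector P
    healthy:    list of bools, one per node (True = believed healthy),
                a stand-in for the estimated fitness state vector
    assignment: dict mapping function name -> node index, a stand-in for F
    Returns a new assignment dict, a stand-in for F*.
    """
    n_nodes = len(healthy)
    # Priority of the function each node currently hosts; spare nodes are absent.
    host_priority = {node: priority[func] for func, node in assignment.items()}
    taken = set()
    result = {}
    # Visit functions from highest to lowest priority.
    for func in sorted(assignment, key=lambda f: priority[f], reverse=True):
        node = assignment[func]
        free_healthy = [n for n in range(n_nodes) if healthy[n] and n not in taken]
        if node in free_healthy:
            pass  # already on a healthy node: keep it and avoid a reconfiguration
        elif free_healthy:
            # Prefer a spare node, else the host of the lowest-priority function.
            node = min(free_healthy, key=lambda n: host_priority.get(n, float("-inf")))
        elif node in taken:
            # The node was claimed by a more important function: fall back to
            # any remaining (faulty) node, or None if nothing is left.
            free = [n for n in range(n_nodes) if n not in taken]
            node = free[0] if free else None
        result[func] = node
        if node is not None:
            taken.add(node)
    return result


if __name__ == "__main__":
    # Hypothetical priorities for three encoder functions on three nodes.
    P = {"dct": 3, "motion_est": 2, "quant": 1}
    F = {"dct": 0, "motion_est": 1, "quant": 2}
    fitness = [False, True, True]  # node 0 is diagnosed as faulty
    print(escalate(P, fitness, F))
```

In this toy run the fault on node 0 ends up under the lowest-priority function, while the highest-priority function moves to a healthy node. When a healthy spare node exists, the sketch uses it instead and leaves the other functions in place.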


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Naveed Imran (1)
  • Ronald F. DeMara (1)
  • Jooheung Lee (2)
  • Jian Huang (3)
  1. Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, USA
  2. Department of Electronic and Electrical Engineering, Hongik University, Sejong, Korea
  3. Advanced Micro Devices (AMD), Sunnyvale, US
