Journal of Signal Processing Systems, Volume 53, Issue 3, pp 243–259

Architecture and Evaluation of an Asynchronous Array of Simple Processors

  • Zhiyi Yu
  • Michael J. Meeuwsen
  • Ryan W. Apperson
  • Omar Sattari
  • Michael A. Lai
  • Jeremy W. Webb
  • Eric W. Work
  • Tinoosh Mohsenin
  • Bevan M. Baas

Abstract

This paper presents the architecture of an asynchronous array of simple processors (AsAP), and evaluates its key architectural features as well as its performance and energy efficiency. The AsAP processor computes DSP applications with high energy efficiency, is capable of high performance, is easily scalable, and is well suited to future fabrication technologies. It is composed of a two-dimensional array of simple single-issue programmable processors interconnected by a reconfigurable mesh network. Processors are designed to capture the kernels of many DSP algorithms with very little additional overhead. Each processor contains its own tunable and haltable clock oscillator, and processors operate completely asynchronously with respect to each other in a globally asynchronous locally synchronous (GALS) fashion. A 6×6 AsAP array has been designed and fabricated in a 0.18 μm CMOS technology. Each processor occupies 0.66 mm², is fully functional at a clock rate of 520–540 MHz at 1.8 V, and dissipates an average of 35 mW at 520 MHz under typical conditions while executing applications such as a JPEG encoder core and a complete IEEE 802.11a/g wireless LAN baseband transmitter. Most processors operate at over 600 MHz at 2.0 V. Processors dissipate 2.4 mW at 116 MHz and 0.9 V. A single AsAP processor occupies at most 4% of the area of a single processing element in other multi-processor chips. Compared to several RISC processors (single-issue MIPS and ARM), AsAP achieves performance 27–275 times greater and energy efficiency 96–215 times greater, while using far less area. Compared to the TI C62x high-end DSP processor, AsAP achieves performance 0.8–9.6 times greater and energy efficiency 10–75 times greater, with an area 7–19 times smaller. Compared to ASIC implementations, AsAP achieves performance within a factor of 2–5 and energy efficiency within a factor of 3–50, with area within a factor of 2.5–3. These data are for varying numbers of AsAP processors per benchmark.
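The per-processor figures quoted above can be turned into a rough energy-per-clock-cycle estimate by dividing power by clock frequency. The sketch below does only that arithmetic. It assumes that the 2.4 mW figure at 0.9 V is an average per-processor power comparable to the 35 mW figure at 1.8 V, which the abstract does not state explicitly, and the function and constant names are illustrative rather than taken from the paper. Energy per cycle is not the same as energy per operation or per benchmark, so these values should not be read as the paper's energy-efficiency results.

```python
# Per-processor operating points quoted in the abstract:
# (supply voltage in V, clock frequency in Hz, power in W).
# Assumption: both power figures are averages for one processor.
OPERATING_POINTS = [
    (1.8, 520e6, 35e-3),
    (0.9, 116e6, 2.4e-3),
]

ARRAY_ROWS = 6                  # 6x6 array, from the abstract
ARRAY_COLS = 6
AREA_PER_PROCESSOR_MM2 = 0.66   # from the abstract


def energy_per_cycle_pj(power_w, clock_hz):
    """Energy per clock cycle in picojoules: E = P / f."""
    return power_w / clock_hz * 1e12


def total_processor_area_mm2():
    """Area of the processors alone (no pads or other chip overhead)."""
    return ARRAY_ROWS * ARRAY_COLS * AREA_PER_PROCESSOR_MM2


if __name__ == "__main__":
    for volts, clock_hz, power_w in OPERATING_POINTS:
        pj = energy_per_cycle_pj(power_w, clock_hz)
        print(f"{volts:.1f} V, {clock_hz / 1e6:.0f} MHz: {pj:.1f} pJ/cycle")
    print(f"36 processors: {total_processor_area_mm2():.2f} mm^2")
```

Under these assumptions the estimate is about 67.3 pJ per cycle at 1.8 V and about 20.7 pJ per cycle at 0.9 V, roughly a 3.3× reduction, and the 36 processors together occupy about 23.76 mm².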

Keywords

array processor, chip multi-processor, digital signal processing, DSP, globally asynchronous locally synchronous, GALS, many-core, multi-core, programmable DSP

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Zhiyi Yu
    • 1
  • Michael J. Meeuwsen
    • 1
  • Ryan W. Apperson
    • 1
  • Omar Sattari
    • 1
  • Michael A. Lai
    • 1
  • Jeremy W. Webb
    • 1
  • Eric W. Work
    • 1
  • Tinoosh Mohsenin
    • 1
  • Bevan M. Baas
    • 1
  1. ECE Department, UC Davis, Davis, USA
