Architecture and Evaluation of an Asynchronous Array of Simple Processors

Journal of Signal Processing Systems

Abstract

This paper presents the architecture of an asynchronous array of simple processors (AsAP) and evaluates its key architectural features, performance, and energy efficiency. AsAP computes DSP applications with high energy efficiency, is capable of high performance, is easily scalable, and is well suited to future fabrication technologies. It is composed of a two-dimensional array of simple single-issue programmable processors interconnected by a reconfigurable mesh network. Processors are designed to capture the kernels of many DSP algorithms with very little additional overhead. Each processor contains its own tunable and haltable clock oscillator, and processors operate completely asynchronously with respect to each other in a globally asynchronous locally synchronous (GALS) fashion. A 6×6 AsAP array has been designed and fabricated in a 0.18 μm CMOS technology. Each processor occupies 0.66 mm², is fully functional at a clock rate of 520–540 MHz at 1.8 V, and dissipates an average of 35 mW at 520 MHz under typical conditions while executing applications such as a JPEG encoder core and a complete IEEE 802.11a/g wireless LAN baseband transmitter. Most processors operate at over 600 MHz at 2.0 V. Processors dissipate 2.4 mW at 116 MHz and 0.9 V. A single AsAP processor occupies 4% or less of the area of a single processing element in other multi-processor chips. Compared to several RISC processors (single-issue MIPS and ARM), AsAP achieves performance 27–275 times greater and energy efficiency 96–215 times greater while using far less area. Compared to the TI C62x high-end DSP processor, AsAP achieves performance 0.8–9.6 times greater and energy efficiency 10–75 times greater with an area 7–19 times smaller. Compared to ASIC implementations, AsAP achieves performance within a factor of 2–5 and energy efficiency within a factor of 3–50, with area within a factor of 2.5–3. These data are for varying numbers of AsAP processors per benchmark.
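The power and clock figures quoted in the abstract can be turned into a rough energy-per-cycle comparison. The sketch below is only back-of-the-envelope arithmetic on those quoted numbers, not a result from the paper: the helper name energy_per_cycle_pj and the constants are illustrative, the array totals assume all 36 processors match the quoted per-processor averages, and the processor-only area ignores any other circuitry on the die.

```python
# Per-processor figures quoted in the abstract (0.18 um CMOS).
AREA_MM2 = 0.66            # area of one processor
ARRAY_PROCESSORS = 6 * 6   # the fabricated 6x6 array

# Operating points as (power in mW, clock in MHz, supply in V).
NOMINAL = (35.0, 520.0, 1.8)
LOW_VOLTAGE = (2.4, 116.0, 0.9)


def energy_per_cycle_pj(power_mw, clock_mhz):
    """Average energy per clock cycle in picojoules (P / f)."""
    # mW / MHz gives nJ per cycle; multiply by 1000 for pJ.
    return power_mw / clock_mhz * 1000.0


nominal_pj = energy_per_cycle_pj(NOMINAL[0], NOMINAL[1])
low_pj = energy_per_cycle_pj(LOW_VOLTAGE[0], LOW_VOLTAGE[1])

# Measured reduction in energy per cycle between the two operating points.
measured_ratio = nominal_pj / low_pj

# Assumption: an ideal dynamic-power model where energy per cycle scales
# with V^2. This is a textbook approximation, not a claim from the paper.
ideal_v2_ratio = (NOMINAL[2] / LOW_VOLTAGE[2]) ** 2

# Assumption: simple multiplication across all 36 processors.
array_processor_area_mm2 = ARRAY_PROCESSORS * AREA_MM2
array_power_mw = ARRAY_PROCESSORS * NOMINAL[0]

if __name__ == "__main__":
    print(f"energy/cycle at 1.8 V: {nominal_pj:.1f} pJ")
    print(f"energy/cycle at 0.9 V: {low_pj:.1f} pJ")
    print(f"measured ratio: {measured_ratio:.2f}, ideal V^2 ratio: {ideal_v2_ratio:.2f}")
    print(f"processor-only array area: {array_processor_area_mm2:.2f} mm^2")
    print(f"array power at 520 MHz: {array_power_mw:.0f} mW")
```

Under these assumptions the quoted figures work out to roughly 67 pJ per cycle at 1.8 V and roughly 21 pJ per cycle at 0.9 V, a reduction of about 3.3 times, somewhat below the 4 times that pure V² scaling would suggest.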

Acknowledgements

The authors gratefully acknowledge support from Intel, UC Micro, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grant 1598, Intellasys, S Machines, MOSIS, Artisan, and a UCD Faculty Research Grant; and thank D. Truong, M. Singh, R. Krishnamurthy, M. Anders, S. Mathew, S. Muroor, W. Li, and C. Chen.

Author information

Corresponding author

Correspondence to Zhiyi Yu.

About this article

Cite this article

Yu, Z., Meeuwsen, M.J., Apperson, R.W. et al. Architecture and Evaluation of an Asynchronous Array of Simple Processors. J Sign Process Syst Sign Image Video Technol 53, 243–259 (2008). https://doi.org/10.1007/s11265-008-0162-1

  • DOI: https://doi.org/10.1007/s11265-008-0162-1
