Architecture and Evaluation of an Asynchronous Array of Simple Processors
Cite this article as: Yu, Z., Meeuwsen, M.J., Apperson, R.W. et al. J Sign Process Syst Sign Image Video Technol (2008) 53: 243. doi:10.1007/s11265-008-0162-1
This paper presents the architecture of an asynchronous array of simple processors (AsAP), and evaluates its key architectural features as well as its performance and energy efficiency. The AsAP processor computes DSP applications with high energy efficiency, is capable of high performance, is easily scalable, and is well-suited to future fabrication technologies. It is composed of a two-dimensional array of simple single-issue programmable processors interconnected by a reconfigurable mesh network. Processors are designed to capture the kernels of many DSP algorithms with very little additional overhead. Each processor contains its own tunable and haltable clock oscillator, and processors operate completely asynchronously with respect to each other in a globally asynchronous locally synchronous (GALS) fashion. A 6×6 AsAP array has been designed and fabricated in a 0.18 μm CMOS technology. Each processor occupies 0.66 mm², is fully functional at a clock rate of 520–540 MHz at 1.8 V, and dissipates an average of 35 mW per processor at 520 MHz under typical conditions while executing applications such as a JPEG encoder core and a complete IEEE 802.11a/g wireless LAN baseband transmitter. Most processors operate at over 600 MHz at 2.0 V. Processors dissipate 2.4 mW at 116 MHz and 0.9 V. A single AsAP processor occupies 4% or less of the area of a single processing element in other multi-processor chips. Compared to several RISC processors (single-issue MIPS and ARM), AsAP achieves 27–275 times higher performance and 96–215 times higher energy efficiency, while using far less area. Compared to the TI C62x high-end DSP processor, AsAP achieves 0.8–9.6 times higher performance and 10–75 times higher energy efficiency, with an area 7–19 times smaller. Compared to ASIC implementations, AsAP achieves performance within a factor of 2–5 and energy efficiency within a factor of 3–50, with area within a factor of 2.5–3.
These data are for varying numbers of AsAP processors per benchmark.
Keywords: array processor · chip multi-processor · digital signal processing · DSP · globally asynchronous locally synchronous · GALS · many-core · multi-core · programmable DSP
Applications that require the computation of complex DSP workloads are becoming increasingly commonplace. These applications often comprise multiple DSP tasks and arise in domains such as wired and wireless communications, multimedia, sensor signal processing, and medical/biological processing. Many are embedded and strongly energy-constrained. In addition, many of these workloads require very high throughput, often dissipate a significant portion of the system power budget, and are therefore of considerable interest.
Increasing clock frequencies and increasing numbers of circuits per chip have resulted in modern chip performance being limited by power dissipation rather than circuit constraints. This implies a new era of high-performance design that must now focus on energy-efficient implementations. Future fabrication technologies are expected to have large variations in devices and wires, and “long” wires are expected to significantly reduce maximum clock rates. Therefore, architectures that eliminate long high-speed wires will likely be easier to design and may operate at higher clock rates.
A chip multiprocessor architecture achieves high performance through parallel computation. Many DSP applications are composed of a collection of cascaded DSP tasks, so an architecture that allows the parallel computation of independent tasks will likely be more efficient.
Small memories and a simple single-issue architecture in each processor achieve high energy efficiency. Since large memories—which are normally used in modern processors [4, 5]—dissipate significant energy and require larger delays per memory transaction, architectures that minimize the need for memory and keep data near or within processing elements are likely to be more efficient. Along with reduced memory sizes, the datapath and control logic complexity of AsAP are also reduced.
The GALS clocking style suits future fabrication technologies and improves energy efficiency, because global clock circuits have become increasingly difficult to design and consume significant power.
Nearest-neighbor communication avoids global wires, making the architecture suitable for future fabrication technologies; global chip wires have roughly constant delay under scaling and will dramatically limit performance if not properly addressed.
A prototype 6×6 AsAP chip has been implemented in 0.18 μm CMOS and is fully functional. In this paper, we discuss AsAP’s architectural design and investigate how the key features affect system results. In addition, we present a thorough evaluation of its performance and energy efficiency for several DSP applications.
2 The AsAP Processor System
2.1 Architecture of the AsAP Processor
Each AsAP processor is a simple single-issue processor with a 64-word 32-bit instruction memory (IMEM), a 128-word 16-bit data memory (DMEM), a dynamic configuration memory (DCMEM), a 16×16-bit multiplier with a 40-bit accumulator, a 16-bit ALU, and four address generators. It utilizes a memory-to-memory architecture with no register file. No support is provided for branch prediction, out-of-order execution, or speculative operation. During the design phase, hardware was added only when it significantly increased performance and/or energy efficiency for our benchmarks. A nine-stage pipeline is implemented as shown in Fig. 1. All control signals are generated in the instruction decode stage, and pipelined appropriately. Interlocks are not implemented in hardware, so all code is scheduled prior to execution by the compiler.
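Because hazards are resolved at compile time rather than by hardware interlocks, the toolchain must pad dependent instructions with NOPs. The following sketch illustrates the idea only; the 3-cycle result latency and the operand naming are hypothetical, not AsAP’s actual pipeline timing.

```python
RESULT_LATENCY = 3  # cycles until a result is readable (hypothetical value)

def schedule(instrs):
    """instrs: list of (dest, srcs) tuples in program order.
    Returns the sequence with NOPs inserted so no instruction reads an
    operand before the producing instruction's result is available."""
    out, ready, cycle = [], {}, 0
    for dest, srcs in instrs:
        # earliest cycle at which all source operands are ready
        need = max([ready.get(s, 0) for s in srcs], default=0)
        while cycle < need:          # pad with NOPs instead of a hardware stall
            out.append(("NOP", ()))
            cycle += 1
        out.append((dest, srcs))
        cycle += 1
        ready[dest] = cycle - 1 + RESULT_LATENCY
    return out

prog = [("t0", ("a", "b")), ("t1", ("t0", "c"))]  # t1 depends on t0
padded = schedule(prog)  # two NOPs now separate the dependent pair
```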
2.1.1 Instruction Set
[Table: AsAP 32-bit instruction types and fields.]
[Table: Classes of the 54 supported instructions, with the number of instructions in each class.]
Other than a bit-reverse instruction and a bit-reverse mode in the address generators, no algorithm-specific instructions or hardware are implemented. While single-purpose hardware can greatly speed computation for specific algorithms, it can prove detrimental to the performance of a complex multi-algorithmic system and limits performance for future presently-unknown algorithms—which is one of the key domains for programmable processors.
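The bit-reverse support matters chiefly for the FFT, whose radix-2 form reads its inputs in bit-reversed index order. A plain software model of the operation (not the hardware implementation) is:

```python
def bit_reverse(x, bits):
    # Reverse the low `bits` bits of x (software model of the
    # bit-reverse instruction / address-generator mode).
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

# A 64-point radix-2 FFT reads its inputs in bit-reversed index order:
load_order = [bit_reverse(i, 6) for i in range(64)]  # 0, 32, 16, 48, ...
```

Without the dedicated instruction and address-generator mode, each such address would cost a loop of shifts and masks as above.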
2.1.2 Data Addressing
Data fetch addressing modes.

  Example instruction    Operation
  Move Obuf DMEM 0       Obuf ← DMEM[0]
  Move Obuf aptr0        Obuf ← DMEM[DCMEM]
  Move Obuf ag0          Obuf ← DMEM[generator]
  Add  Obuf #3 #3        Obuf ← 3 + 3
  Move Obuf #256         Obuf ← 256
  Move Obuf DCMEM 0      Obuf ← DCMEM[0]
  Move Obuf regbp1       Obuf ← first bypass
  Move Obuf Ibuf0        Obuf ← FIFO 0
  Move Obuf Acc          Obuf ← ACC[15:0]
2.1.3 Completely Independent Clocking and Circuits for Crossing Asynchronous Clock Domains
Each processor has its own digitally programmable clock oscillator which occupies only about 0.5% of the processor’s area. There are no phase-locked loops (PLLs), delay-locked loops (DLLs), or global frequency or phase-related signals, and the system is fully GALS. While impressive low-clock-skew designs have been achieved at multi-GHz clock rates, the effort expended in clock tree management and layout is considerable. Placing a clock oscillator inside each processor reduces the size of the clock tree circuit to a fraction of a square millimeter—the size of a processing element. Large systems can be made with arrays of processing elements with no change whatsoever to clock trees (which are wholly contained within processing elements), simplifying overall design complexity and improving scalability.
The reliable transfer of data across unrelated asynchronous clock domains is accomplished by dual-clock FIFOs. The FIFO’s write clock and data are supplied in a source-synchronous fashion by the upstream processor and its read clock is supplied by the downstream processor—which is the host for the dual-clock FIFO in AsAP.
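The mechanism can be illustrated with a behavioral model. This single-threaded sketch is only an illustration of the interface: real dual-clock FIFOs pass Gray-coded copies of the pointers through multi-flop synchronizers, which a sequential model can describe in comments but cannot exercise.

```python
def to_gray(n):
    # Gray encoding: successive values differ in exactly one bit, so a
    # pointer sampled mid-transition in the other clock domain is off by
    # at most one position, never wildly wrong.
    return n ^ (n >> 1)

class DualClockFifo:
    """Behavioral model of a dual-clock FIFO (single-threaded sketch)."""

    def __init__(self, depth):
        self.depth = depth
        self.mem = [None] * depth
        self.wptr = 0  # advanced by the upstream (write) clock domain
        self.rptr = 0  # advanced by the downstream (read) clock domain

    def full(self):
        # write side compares its wptr against a synchronized copy of rptr
        return self.wptr - self.rptr == self.depth

    def empty(self):
        # read side compares its rptr against a synchronized copy of wptr
        return self.wptr == self.rptr

    def push(self, word):   # called in the write clock domain
        assert not self.full()
        self.mem[self.wptr % self.depth] = word
        self.wptr += 1

    def pop(self):          # called in the read clock domain
        assert not self.empty()
        word = self.mem[self.rptr % self.depth]
        self.rptr += 1
        return word
```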
Special clock control circuits enable processing elements to power down completely—dissipating leakage power only—if no work is available for nine clock cycles. The local oscillator is restored to full speed in less than one cycle after work again becomes available.
2.1.4 Reconfigurable Two-Dimensional Mesh Network
Processors connect via a configurable two-dimensional mesh. To maintain link communication at full clock rates, inter-processor connections are made to nearest-neighbor processors only. A number of architectures, including wavefront, RAW, and TRIPS, have specifically addressed this concern and have demonstrated the advantages of a tile-based architecture. AsAP’s nearest-neighbor connections result in no high-speed wires with a length greater than the linear dimension of a processing element. The inter-processor delay decreases with advancing fabrication technologies and allows clock rates to scale upward. Longer-distance data transfers in AsAP are handled by routing through intermediary processors or by “folding” the application’s data flow graph so that communicating processing elements are placed adjacent or near each other—for example, the Pilot Insert processor and the first G.I. Wind. processor in Fig. 5b.
Each AsAP processor has two asynchronous input data ports and can connect each port to any of its four nearest neighboring processors. Because changing active clock signals can cause runt clock pulses, a processor may change its input connection only when both input clocks are guaranteed to be low—which is normally only during power-up. On the other hand, output port connections can be changed among any combination of the four neighboring processors at any time through software.
2.2 AsAP Implementation
The right part of Fig. 3 shows the test environment for the AsAP prototype including a printed circuit board hosting an AsAP processor and a supporting FPGA board to interface between AsAP and a host PC. AsAP’s SPI-style serial port receives configuration information and programs for each processor.
Table: Required instruction memory (IMem) and data memory (DMem) sizes, in words, for various DSP tasks on a simple single-issue processor. Tasks: 16-tap FIR filter; level-shifting for JPEG; 8×8 two-dimensional DCT; quantization for 64 elements; zig-zag re-ordering for JPEG; Huffman encoding for JPEG; scrambling for 802.11a/g; padding OFDM bitstream; convolutional coding (k = 7); interleaving 1 for 802.11a/g; interleaving 2 for 802.11a/g; modulation for BPSK, QPSK, 16-QAM, 64-QAM; pilot insertion for OFDM; training symbol generation for 802.11a/g; 64-point complex FFT; guard interval insertion for OFDM; 2× upsampling + 21-tap Nyquist FIR filter; N-element merge sort.
The job of programming processors also includes the mapping of processor graphs to the two-dimensional planar array. While this is normally done at compile time, an area of current work is tools for the automatic mapping of graphs to accommodate rapid programming and to recover from hardware faults and extreme variations in circuits, environment, and workload.
2.4 Task and Application Implementations
3 Analysis of the Key Features
Most chip multiprocessors target a broad range of applications, and each processor in such systems normally contains powerful computational resources—such as large memories, wide-issue processors, and powerful inter-processor communication—to support widely varying requirements. Extra computational resources can enable systems to provide high performance to a diverse set of applications, but they reduce energy efficiency for tasks that cannot make use of those specialized resources. Most DSP applications that AsAP targets comprise computationally intensive tasks with very small instruction and data kernels, which makes it possible to use extremely simple computational resources—small memories, a simple single-issue datapath, and nearest-neighbor communication—to achieve high energy efficiency while maintaining high performance.
In this section, we analyze these key features of the AsAP processor which justify its fine grain architecture. We also briefly analyze AsAP’s GALS clocking style.
3.1 Small Memory
A clear trend among all types of programmable processors is not only an increasing amount of on-chip memory, but also an increasing percentage of die area used for memory. For example, the TI C64x and third-generation Itanium processor use approximately 75% and 70% of their area for memory, respectively. Since large memories dissipate more energy and require larger delays per transaction, we seek architectures that minimize the need for memory and keep data near or within processing elements.
3.1.1 Inherent Small Memory Requirement for DSP Applications
A notable characteristic of the targeted DSP tasks is that many have very limited memory requirements compared to general-purpose tasks. The level of required memory must be differentiated from the amount of memory that can be used or is typically used to calculate these kernels. For example, an N-tap filter may be programmed using a vast amount of memory, though the base kernel requires only 2N data words. Table 4 lists the actual amounts of instruction and data memory required for 22 common DSP tasks and shows the very small required memory sizes compared to memories commonly available in modern processors. This analysis assumes a simple single-issue processor like AsAP. Although programs were hand-written in assembly code, little effort was spent optimizing them, such as scheduling instructions for the pipeline or using forwarding paths.
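The 2N-word bound for the filter example can be made concrete: an N-tap FIR needs only N coefficient words plus an N-deep circular delay line. The sketch below is a floating-point software model; AsAP itself computes in 16-bit fixed point.

```python
class Fir:
    """N-tap FIR whose state is exactly 2N data words:
    N coefficients plus an N-deep circular delay line."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)        # N words
        self.delay = [0.0] * len(coeffs)  # N words
        self.head = 0

    def step(self, x):
        # write the newest sample into the circular buffer
        self.delay[self.head] = x
        acc = 0.0
        for i, c in enumerate(self.coeffs):
            acc += c * self.delay[(self.head - i) % len(self.delay)]
        self.head = (self.head + 1) % len(self.delay)
        return acc
```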
3.1.2 Finding the Optimal Memory Size for DSP Applications
The analysis assumes the following:
- The non-memory processor size is 0.55 mm² in 0.18 μm CMOS and is not a function of memory size;
- Memory area scales linearly with capacity: 400 μm² per 16-bit word and 800 μm² per 32-bit word;
- A fixed partitioning overhead is added each time a task is split onto multiple processors; this overhead is estimated per task and varies from two to eight instructions and from 0–30% of the total space; and
- Additional processors used only for routing data may be needed for designs using a large number of processors, but are neglected.
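The assumptions above admit a rough area model. In this sketch the 15% partitioning overhead is an illustrative mid-range choice from the stated 0–30% range, and the function name and interface are ours, not the paper’s.

```python
import math

CORE_MM2 = 0.55            # non-memory processor area, 0.18 um CMOS
MM2_PER_32B_WORD = 800e-6  # 800 um^2 per 32-bit instruction word

def array_area_mm2(task_words, words_per_proc, overhead_frac=0.15):
    """Estimate (processor count, total area) for a task of `task_words`
    instructions split across processors with `words_per_proc` memories."""
    usable = words_per_proc * (1 - overhead_frac)  # words left after split overhead
    n_procs = math.ceil(task_words / usable)
    area = n_procs * (CORE_MM2 + words_per_proc * MM2_PER_32B_WORD)
    return n_procs, area
```

For example, a 256-instruction task on 64-word memories needs 5 processors under this model, illustrating how shrinking memories trades per-processor area against processor count.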
These analyses show that processors with memories of a few hundred words will likely produce highly energy efficient systems due to their low overall memory power and their very short intra-processor wires. On the negative side, processors with very small memories that require parallelization of tasks across processors may require greater communication energy and present significant programming challenges.
3.1.3 Several Architectural Features Help Reduce Memory Requirement
3.2 Datapath—Wide Issue vs. Single Issue
The datapath, or execution unit, plays a key role in processor computation, and also occupies a considerable amount of chip area. Uniprocessor systems are shifting from single issue architectures to wide issue architectures in which multiple execution units are available to enhance system performance. For chip multiprocessor systems, there remains a question about the trade-off between using many small single-issue processors, versus larger but fewer wide-issue processors.
A large wide-issue processor has a centralized controller, contains more complex wiring and control logic, and its area and power consumption increase faster than linearly with the number of execution units. One model of area and power for processors with different issue widths, derived by J. Oliver et al., shows that wide-issue processors consume significantly more area and power than multiple single-issue processors. Their work shows a single 32-issue processor occupies more than twice the area and dissipates approximately three times the power of 32 single-issue processors.
However, chip multiprocessor systems composed of single-issue processors will not always have higher area and energy efficiency—much depends on the specific application. Wide-issue processors work well when instructions fetched during the same cycle are highly independent and can take full advantage of functional unit parallelism, but this is not always the case. Multiple single-issue processors such as AsAP are less efficient if the application is not easy to partition, but such a system can perform particularly well on many DSP applications, since these are often made up of complex components exhibiting task-level parallelism, so that tasks are easily spread across multiple processors. Large numbers of simple processors also introduce extra inter-processor communication overhead, which we discuss further in Section 3.3.
When all processors have a balanced computational load with little communication overhead, system throughput increases nearly linearly with the number of processors, as for the task of finding the maximum value of a data set (Max in 100 data in Fig. 12). Applications that are difficult to parallelize scale far less well beyond some point. For example, the performance of the 8×8 DCT scales well up to 10 processors, where 4.4 times higher performance is achieved, but little improvement is seen beyond that: only 5.4 times higher performance using 24 processors. In contrast, the FIR filter and FFT show a significant jump in performance once a certain number of processors is reached. This is because increasing the number of processors in these applications avoids extra computation in some cases. For example, the FFT avoids the calculation of data and coefficient addresses when each processor is dedicated to one stage of the FFT computation. On average, 10-processor and 20-processor systems achieve 5.5 times and 12.3 times higher performance than a single-processor system, respectively.
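The near-linear regime and the bottleneck-limited regime can both be captured by a toy pipeline-throughput model. The stage cycle counts below are illustrative, not AsAP measurements: throughput of a cascaded task pipeline is set by its slowest stage, so speedup over one processor is the ratio of total work to the largest stage.

```python
def pipeline_speedup(stage_cycles):
    """Speedup of a task-level pipeline over a single processor running
    all stages; limited by the slowest (bottleneck) stage."""
    return sum(stage_cycles) / max(stage_cycles)

print(pipeline_speedup([10, 10, 10, 10]))  # balanced 4-stage split: 4.0
print(pipeline_speedup([25, 5, 5, 5]))     # one bottleneck stage: only 1.6
```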
3.3 Nearest Neighbor Communication
Currently, most chip multiprocessors target broad general-purpose applications and use complex inter-processor communication strategies [10, 19–22]. For example, RAW uses a separate complete processor to provide powerful static routing and dynamic routing functions, BlueGene/L uses a torus network and a collective network to handle inter-processor communication, and Niagara uses a crossbar to connect eight cores and memories. These methods provide flexible communication abilities, but consume a significant portion of the area and power in communication circuits.
The GALS clocking style simplifies clock tree design and provides the opportunity for a joint clock/voltage scaling method to achieve very high energy efficiency. At the same time, however, it introduces a performance penalty, since extra circuitry is required to handle asynchronous boundaries, which adds latency. It has been shown that the performance penalty of a GALS chip multiprocessor architecture like AsAP can be kept very small, due to its localized computation and infrequent communication loops. Simulation results show the performance penalty of the AsAP processor is less than 1% compared to a corresponding synchronous system.
4 Evaluation of the AsAP Processor
This section provides a detailed evaluation and discussion of the AsAP processor including performance, area, and power consumption.
Each processor occupies 0.66 mm² and the 6×6 array occupies 32.1 mm² including pads. Due to its small memories and simple architecture, each AsAP processor’s area is divided as follows: 8% for communication circuitry, 26% for memory circuitry, and a favorable 66% for the remaining core.
Processors operate at 520–540 MHz under typical conditions. The average power consumption for each processor is 35 mW when processors are executing applications such as a JPEG encoder or an 802.11a/g baseband transmitter, and they consume 94 mW when 100% active at 520 MHz. At a supply voltage of 2.0 V, most processors operate at clock frequencies over 600 MHz.
4.1 High Speed, Small Area, and High Peak Performance
Small memories and simple processing elements enable high clock frequencies and high system performance. The AsAP processor operates at clock frequencies among the highest possible for a digital system built with a given design approach and fabrication technology. The clock frequency information listed in Table 6 supports this assertion.
AsAP is also highly area efficient. AsAP has a processing element density about 23–100 times greater than that of other broadly-similar projects, and thus each AsAP processor occupies 4% or less of the area of other reported processing elements.
Table: Estimates for a 13 × 13 mm AsAP array implemented in various semiconductor technologies; for each CMOS technology node (nm), the table lists processor size (mm²), number of processors per chip, clock frequency (GHz), and peak system processing rate (Tera-Op/s).
4.2 High Performance and Low Power Consumption for DSP Applications
Table: Area, performance, and power comparison of various processors for several key DSP kernels and applications; all data are scaled to 0.18 μm technology assuming a 1/s² reduction in area, a factor of s increase in speed, and a 1/s² reduction in power consumption. Columns: scaled area (mm²), scaled clock frequency (MHz), scaled execution time (ns), scaled power (mW), and scaled energy (nJ). Benchmarks and implementations compared: a 40-tap FIR filter (AsAP array, 8 proc.); an 8×8 DCT (AsAP array, 8 proc.; DCT processor); a radix-2 complex 64-point FFT (AsAP array, 13 proc.; MIPS VR5000; FFT processor); a JPEG encoder for an 8×8 block (AsAP array, 9 proc.; ASIC + RISC); and an 802.11a/g transmitter for 1 symbol (AsAP array, 22 proc.).
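The constant-field scaling used to normalize these comparisons can be written out directly. In this sketch (the function name and interface are ours), s is the ratio of the original feature size to 180 nm, so a design in an older, larger technology gains a factor of s in speed and shrinks by s² in area and power when scaled to 0.18 μm.

```python
def scale_to_180nm(tech_nm, area_mm2, freq_mhz, power_mw):
    """Normalize a design from `tech_nm` to 180 nm assuming constant-field
    scaling: area / s^2, frequency * s, power / s^2."""
    s = tech_nm / 180.0  # > 1 when scaling down from an older technology
    return area_mm2 / s**2, freq_mhz * s, power_mw / s**2

# e.g. a 0.36 um design: area shrinks 4x, clock doubles, power drops 4x
print(scale_to_180nm(360, 4.0, 100.0, 400.0))
```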
In support of our assertion that the AsAP prototype has significant room for improvement, we note that measurements show approximately 2/3 of AsAP’s power is dissipated in its clocking system. This is largely due to the fact that we did not implement clock gating in this first prototype. All circuits within each processor are clocked continuously—except during idle periods when the oscillator is halted.
The area used by AsAP, shown in Table 6, is the combined area required for all processors including those used for communication. Data for the FIR, 8×8 DCT, and FFT are deduced from measured results of larger applications. We estimated the performance of the JPEG encoder on the TI C62x by using the relative performance of the C62x compared to MIPS processors, and a reported similar ARM processor.
AsAP achieves 27–275 times higher performance and 96–215 times higher energy efficiency than RISC processors (single issue MIPS and ARM);
Compared to a high-end programmable DSP (TI C62x), AsAP achieves 0.8–9.6 times higher performance and 10–75 times higher energy efficiency; and
Compared to ASIC implementations, AsAP achieves performance within a factor of 2–5 and energy efficiency within a factor of 3–50 with an area within a factor of 2.5–3.
Another source of AsAP’s high energy efficiency comes from its haltable clock, which is greatly aided by the GALS clocking style. Halting clocks while processors are even momentarily inactive results in power reductions of 53% for the JPEG core and 65% for the 802.11a/g baseband transmitter.
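These savings can be framed with a simple duty-cycle model. The 47% activity factor used in the example below is an assumed value, chosen only to be consistent with the 53% JPEG reduction quoted above; leakage while halted is neglected.

```python
def clock_halting(p_active_mw, active_frac):
    """Average power with a haltable clock, neglecting leakage while
    halted, and the fractional saving vs. an always-running clock."""
    avg_mw = p_active_mw * active_frac
    reduction = 1.0 - active_frac  # fraction of power saved by halting
    return avg_mw, reduction

# assumed 47% activity (hypothetical, matching the 53% JPEG reduction)
avg, saved = clock_halting(94.0, 0.47)
```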
Supply voltage scaling can be used to further improve power savings. Processors dissipate an average of 2.4 mW at a clock rate of 116 MHz using a supply voltage of 0.9 V while executing the described applications.
5 Related Work
There have been many other styles of parallel processors. The key features of the AsAP processor are small memories, a simple processor architecture, a GALS clocking style, and a reconfigurable nearest-neighbor mesh network. These features distinguish it from previous and current parallel processors.
The transputer is a popular parallel processor originally developed in the 1980s. It shares the philosophy of using multiple relatively simple processors to achieve high performance. The transputer is designed for a multiple-processor board, where each transputer processor is a complete standalone system. It uses a bit-serial channel for inter-processor communication, which can support communication of different word lengths to save hardware, but at dramatically reduced communication speeds.
Systolic processors and wavefront processors are two more classic parallel architectures. Systolic processors contain synchronously-operating processors which send and receive data in a highly regular manner. Wavefront array processors are similar to systolic processors but rely on dataflow properties for inter-processor data synchronization. Previous designs were optimized for simple and single-algorithm workloads such as matrix operations and image processing kernels.
Table: Comparison of key features of selected parallel processor architectures, by processing-element style and interconnect network. Architectures compared include heterogeneous designs (ASIC + processor, DSP + coprocessor, and RISC + DSP); designs based on multiple execution units and on an execution-unit stripe; the Intel 80-core (two-dimensional mesh with dynamic routing); a high-bandwidth ring bus design for high-performance applications; a SIMD processor with cache; Smart Memories (single-issue processors with 128 KB of memory and packet-based dynamic routing); a single-issue design with 128 KB of memory and static + dynamic routing; and AsAP (single-issue processors with 512 B of memory and a 2D reconfigurable mesh).
The AsAP platform is well-suited for the computation of complex DSP workloads comprised of many DSP tasks, as well as single highly-parallel, computationally demanding tasks. By virtue of its independent clock domains, very small processing elements, and short interconnects, it is highly energy-efficient and capable of high throughput.
Measured results show that on average, AsAP can achieve several times higher performance and ten times higher energy efficiency than a high performance DSP processor, while utilizing an area more than ten times smaller.
Areas of interesting future work include: mapping a broader range of applications to AsAP; developing algorithms and hardware for intelligent clock and voltage scaling; automatic software mapping tools to optimize utilization, throughput, and power; C compiler enhancements; connecting large memories when more memory is needed; and automatic fault detection and recovery.
The authors gratefully acknowledge support from Intel, UC Micro, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grant 1598, Intellasys, S Machines, MOSIS, Artisan, and a UCD Faculty Research Grant; and thank D. Truong, M. Singh, R. Krishnamurthy, M. Anders, S. Mathew, S. Muroor, W. Li, and C. Chen.