Mixed signal SIMD processor array vision chip for real-time image processing
A prototype vision chip has been designed that incorporates a 20 × 64 array of processing elements on a 31 μm pitch. Each processor element includes 14 bits of digital memory in addition to seven analogue registers. Digital operations include NOR and NOT, while diffusion, subtraction, inversion and squaring are available in the analogue domain. The cells of the array can be configured as an asynchronous propagation network, allowing operations such as flood filling to complete in ~1 μs across the array. Exploiting this feature allows the chip to recognise the difference between closed and open shapes at 30,000 frames per second. The chip is fabricated in 0.18 μm CMOS technology.
Keywords: Vision chip; Smart sensor; Asynchronous image processing; Cellular processor array
Recent vision chips, in common with the chip described here, operate as SIMD computational devices; such devices have been presented widely in digital [8, 9, 10], analogue [11, 12, 13, 14] and mixed-signal form [15, 16]. Designers of wholly digital vision chips have tended to adopt larger cell sizes (65 × 25 μm, 67 × 64 μm and 51 × 54 μm) than their analogue counterparts (49 × 49 μm, 35 × 35 μm, 34 × 29 μm, 26 × 26 μm). With smaller semiconductor process nodes (e.g. 45 nm) becoming available, it will be possible to shrink the same digital functionality onto a much smaller silicon area than the analogue equivalent. However, in lower-cost processes, specifically 0.18 μm, analogue processors can achieve bit-equivalent memories and ALUs similar to their digital counterparts in a more compact space. The relatively large pixel pitch of all recent vision chips [8, 9, 10, 11, 12, 13, 14, 15, 16] limits the pixel count to typically <20 kpixels, making them suitable for applications in which high speed or low power is required but very high spatial resolution is not essential, for example surveillance systems [7, 17], on-line analysis in industrial production systems, or vision systems for robots. The separation of processing and imaging areas (described in ) removes some of the constraints on occupied PE silicon space that result in low fill factor; the design compromise made for this approach is that algorithms must start by reading out data from the array into the close-connected PEs, and for fast image processing, frame rates might be slowed by this process.
This paper, which discusses a prototype vision chip designed with a 64 × 20 pixel array, extends the descriptions provided in . This test IC is intended to be the precursor of a larger array. In Sect. 2, we discuss the architecture of the chip, with Sect. 3 describing the circuits incorporated into the PE. Section 4 discusses the programming model and language used to define the operation of the IC, and Sect. 5 the implementation and measurement results. Section 6 relates an example application of the IC. We include a description of a program that distinguishes between open and closed shapes at 30,000 fps, demonstrating the utility of focal-plane processing. With the processor-per-pixel architecture, the program is expected to scale to larger processor arrays without significant reduction in execution speed. As earlier applications have also demonstrated, frames are not read out at this speed; only the results of processing operations are transferred to the program controller.
2 Architecture overview
The IC is controlled by an external instruction issuing device. All instructions provided to the IC are executed according to the SIMD paradigm, that is, all the processors of the array execute simultaneously the same instruction but upon their own local data. The local data available extends to that of their nearest orthogonal neighbours. The instruction set supports arithmetic operations of addition, subtraction, negation, squaring and comparison performed on analogue registers, and logic OR and NOR operations performed on binary registers. Hardware that facilitates SIMD programming is a good fit with low-level pixel processing operations, whereby operations are repeated on a per-pixel basis.
The IC described has an architecture that is intended to allow its deployment in many different applications, while the physical pitch of the PEs allows an IC with greater resolution to be fabricated at lower cost in comparison to its predecessor SCAMP-3, and additionally allows the use of smaller optics. The IC utilises switched current memories for analogue storage, with binary storage provided by 3-transistor digital DRAM memories as advanced in ASPA vision chips. The sub-systems that could be incorporated in the PEs of the array were limited by the connectivity that could be provided to the PEs. However, relative to SCAMP-3 with three metal layers available, the six metal layers of the 0.18 μm process have afforded considerable additional features and functionality. The additional metal layers, however, also have deleterious effects on sensing uniformity, since the photodiode is now at the base of the metal-layer stack, which partially shadows the light-sensitive area. Applications of vision chips such as motion, edge detection, tracking and shape detection require a combination of analogue and digital operations. In the design of this IC, we have shifted the ratio of analogue to digital memory, reducing the number of space-consuming analogue registers while increasing the digital resources available. Furthermore, the digital memories can be used to configure an asynchronous propagation network, allowing global operations on data in addition to logical local operations. Also added to this chip are a controllable diffusion network, allowing large-scale spatial low-pass filtering to occur in a single instruction, and a squarer offering one-quadrant operational capability. The IC provides for global analogue and digital data write and read; PE-specific digital write is also provided, hence allowing PE-specific analogue write via a flagged global analogue write operation.
PEs are controlled by means of a 61-bit instruction code word (ICW), with additional control supplied from 12 analogue inputs. The long ICW is registered on chip and input by means of a 16-bit bus, allowing a reduction in external pin-count. To deliver the ICW to the PE array requires large scale connectivity (relative to the compact PE size) with row and column drivers spread across all four sides of the IC.
The chip supports random access readout of analogue or digital data from the PEs, allowing for full array or region-of-interest readout. Additionally, flexible row and column decoders allow simultaneous addressing of pixel blocks: analogue data to be read out is summed to provide regional or global analogue outputs, and a logic OR operation is performed on the digital output data within the addressed block.
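The block readout behaviour can be sketched in Python as follows. This is a behavioural model only: the function names `read_block_analogue` and `read_block_digital`, and the list-of-lists representation of the array, are assumptions made for illustration and are not part of the chip interface.

```python
def read_block_analogue(array, rows, cols):
    """Sum the analogue values of all addressed PEs (regional or global output)."""
    return sum(array[r][c] for r in rows for c in cols)


def read_block_digital(array, rows, cols):
    """Logical OR of the digital bits of all addressed PEs."""
    return any(array[r][c] for r in rows for c in cols)


# Hypothetical 2 x 3 array contents, for illustration only.
analogue = [[0.1, 0.2, 0.0],
            [0.0, 0.3, 0.4]]
digital = [[0, 0, 1],
           [0, 0, 0]]

# Region of interest: both rows, first two columns.
roi_sum = read_block_analogue(analogue, range(2), range(2))
roi_any = read_block_digital(digital, range(2), range(2))

# Global readout: every PE addressed at once.
global_any = read_block_digital(digital, range(2), range(3))
```

Addressing the whole array in one access gives a single global answer (for example, "is any bit set?") without reading out individual pixels.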
3.1 Analogue processor fundamentals
The SI memory cell remembers the value of the input current by storing charge on the gate capacitance $C_{gs}$ of the MOS transistor. When writing to the cell, both switches, S and W, are closed (Fig. 3b). The transistor is diode-connected and the input current $i_{in}$ forces the gate-source voltage $V_{gs}$ of the transistor to the value corresponding to the drain current $I_{ds} = I_{REF} + i_{in}$, according to (1). At the end of the write phase the switch W is opened and thus the gate of the transistor is disconnected, i.e. put into a high impedance state. Due to charge conservation on the capacitor $C_{gs}$, the voltage at the gate $V_{gs}$ will remain constant.
Therefore, the SI memory cell is, in principle, capable of storing a continuous-valued (i.e. real) number, within some operational dynamic range.
The basic operation is the transfer operation, as shown in Fig. 4(a). Register A (configured for reading) provides the current $i_A$ to the analogue bus. This current is consumed by register B (configured for writing); register C is not selected, and hence it is disconnected from the bus. Therefore, the analogue (current) data value is transferred from A to B. This transfer is denoted as B ← A. According to the current memory operation, register B will produce the same current when it is read from (at some later instruction cycle). If we consider that the data value is always the one provided into the analogue bus, it can be easily seen that $i_B = -i_A$, i.e. the basic transfer operation includes negation of the stored data value.
The addition operation (and in general current summation of a number of register values) can be achieved by configuring one register for writing, and two (or more) registers for reading. The currents are summed, according to Kirchhoff’s current law, directly on the analogue bus. For example, the situation shown in Fig. 4(b) produces the operation A ← B + C. Subtraction is performed by negation followed by addition.
The division operation is achieved by configuring one register cell for reading and several (typically two) registers for writing. The current is split, and if the registers are identical then it is divided equally between the registers that are configured for writing, producing a division by a fixed factor. For example, in Fig. 4(c) both registers A and B will store a current equal to half of the current provided by register C. We denote this instruction as DIV (A + B) ← C.
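The transfer, addition and division operations can be illustrated with a small idealised model in Python. The class `AnalogueRegisters` and its `write` method are hypothetical names introduced here; the model ignores noise, mismatch and decay, and it assumes that the negation inherent in a transfer also applies when the bus current is split between several destination registers.

```python
class AnalogueRegisters:
    """Idealised current-mode register file (no noise, mismatch or decay)."""

    def __init__(self, names):
        self.value = {name: 0.0 for name in names}

    def write(self, dests, sources):
        """Read all `sources` onto the bus and write all `dests` from it.

        The bus current is the sum of the source currents (Kirchhoff's
        current law). It splits equally between identical destination
        registers, and each stored value is negated.
        """
        bus = sum(self.value[s] for s in sources)
        share = -bus / len(dests)
        for d in dests:
            self.value[d] = share


regs = AnalogueRegisters("ABC")

regs.value["A"] = 0.25
regs.write(["B"], ["A"])            # B <- A: transfer with negation
b_after_transfer = regs.value["B"]

regs.value["C"] = 0.125
regs.write(["A"], ["B", "C"])       # A <- B + C: summation on the bus
a_after_add = regs.value["A"]

regs.value["C"] = 0.5
regs.write(["A", "B"], ["C"])       # DIV (A + B) <- C: split in two
a_after_div, b_after_div = regs.value["A"], regs.value["B"]
```

Because every write negates, subtraction needs no dedicated hardware: a value is first negated by a transfer and then summed with the other operand.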
Multiplication and division by constant factors (e.g. coefficients in a convolution kernel) can be achieved simply by repeated application of add or divide-by-two operations. In the presented design, a compact current-mode squarer sub-system is implemented. It provides the squared analogue current value (useful for algorithms requiring, for example, energy or mean-squared error calculations), and allows multiplication to be achieved through multiple steps, by determining $(x + y)^2 - x^2 - y^2 = 2xy$. While multiplication is therefore not a one-step command, the squarer requires few transistors and can be fitted into a compact space within the PE. Other instructions could be implemented in hardware using current-mode circuits; however, in a vision chip application silicon area is at a premium, so only those computation elements that can be implemented compactly can be used. Consequently, more silicon area can be devoted to register circuits, increasing the amount of local memory (or improving the accuracy of computations, which largely depends on device matching, i.e. device area).
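As a minimal numerical sketch of multiplication built from the squarer, the identity can be evaluated in Python. The functions `square` and `multiply_via_squarer` are illustrative names, the squarer is treated as ideal, and the inputs are assumed non-negative so that every squaring stays within a single quadrant.

```python
def square(x):
    """Stand-in for the current-mode squarer (idealised)."""
    return x * x


def multiply_via_squarer(x, y):
    """Return x * y using three squarings, two subtractions and one halving."""
    two_xy = square(x + y) - square(x) - square(y)
    return two_xy / 2  # halving corresponds to one divide-by-two step


product = multiply_via_squarer(0.25, 0.5)
```

On the chip each squaring, summation and division would be a separate instruction, so the cost of a general multiplication is several instruction cycles rather than one.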
3.2 Analogue sub-system
The analogue sub-system of the PE implemented on the chip comprises six general-purpose analogue registers (A, B, C, D, E, F), one special-purpose neighbour communication register (“NEWS”), a variable-strength diffusion network, a voltage-controlled current source (the “IN” register) that facilitates the global input of data to the array, a current squarer, and a photodiode (two types, N-well and N+/P-subst, are explored, split equally across the array). Additionally, an analogue comparator detects the sign of a register’s stored current, storing the result in a local activity register (1-bit SRAM flag). The 1-bit SRAM flag is used to allow conditional code execution. The flag data can also be transferred to one of the 13 bits of DRAM memory and processed in the digital sub-system. The digital sub-system can also write to the flag register, providing a communication path from the digital to the analogue domain. The PE includes connectivity to allow either the analogue or the digital sub-system to access a common analogue/digital bus, allowing exchange of data within the cell or with its orthogonal neighbours.
3.2.1 Analogue memories
The analogue memories of this IC are based on the S²I cell. The standard cell was modified to include an error correction circuit to allow a reduction in cell size while maintaining low signal-dependent error (since signal-independent errors can be accommodated algorithmically). These errors have their origin in switch signal clock feedthrough, charge injection and output conductance errors; the addition of this circuit allows these errors to be compensated for.
In order to maximize the output dynamic range of $I_{sq}$, $I_{Msq}$ is minimised to half of the maximum input current, $I_{in\_max}/2$, giving a maximum squarer output (from Eq. 7) also equal to $I_{in\_max}/2$.
3.2.3 Diffusion network
3.2.4 Pixel circuit
3.3 Digital sub-system
The digital sub-system is used to perform logical data operations and storage. Source data will generally be determined from a precursory analogue operation using the in-PE comparator. The PE should contain local storage sufficient for in-PE binary image processing, analogue to digital conversion and also diverse flag-conditional analogue operations with binary masks or markers.
Most digital operations consist of two phases, a pre-charge phase and a read/write phase. Both these phases are incorporated within a single chip instruction cycle. The LRB is generally pre-charged “high” on the first phase. A second phase will normally consist of simultaneous read and load operations with the additional selection of inverting or non-inverting storage with the output of all operations directed to the LWB. By reading several registers to the LRB simultaneously, a “wired” OR or NOR function can be enabled. From these primitive operations, other logical operations can be synthesised from multiple steps. The LWB is the common node that sources the load data to all digital registers. Transfer of data from DRAM to the SRAM flag (providing local autonomy) is performed in a similar way to inter-DRAM memory operations, i.e. by selective discharge of the LRB.
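The two-phase bus operation and the synthesis of further logic functions can be sketched in Python. This is a behavioural illustration under stated assumptions: a register holding 1 is assumed to discharge the pre-charged bus, the mapping of inverting and non-inverting storage to OR and NOR follows from that assumption, and the function names are hypothetical.

```python
def read_bus(bits):
    """Pre-charge the bus high, then let any register holding 1 discharge it."""
    bus = 1                      # phase 1: pre-charge
    for bit in bits:             # phase 2: simultaneous reads
        if bit:
            bus = 0
    return bus                   # equals the NOR of the inputs


def nor(*bits):
    return read_bus(bits)        # non-inverting store of the bus value


def or_(*bits):
    return 1 - read_bus(bits)    # inverting store of the bus value


def not_(a):
    return nor(a)                # single-input NOR


def and_(a, b):
    # De Morgan: a AND b = NOR(NOT a, NOT b), three bus operations in this model
    return nor(not_(a), not_(b))


truth_table = [(a, b, nor(a, b), or_(a, b), and_(a, b))
               for a in (0, 1) for b in (0, 1)]
```

Since NOR is functionally complete, any other logical operation can be built from repeated bus operations at the cost of extra instruction cycles and temporary registers.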
The signal from the LWB can also be directed to the combined analogue/digital bus of the PE, enabling neighbourhood communications.
Additionally, a frequent binary image operation is the individual classification of objects that may have passed an initial threshold/segmentation (i.e. blob analysis). This might be on the basis of size or shape. As such, a basic operation of many image processing algorithms is geodesic reconstruction (from a marker) of the subset of pixels that belong to an object. In a conventional synchronously operating image processing system, this is a time-consuming task because of the need for multiple dilation steps. Within the IC described, these iterative operations can be reduced to a few simple steps that allow a propagation wave to be launched from one or several pixels within the array. The wave then propagates asynchronously throughout the whole array. This enables very efficient object segmentation, object reconstruction, hole filling, watershed transformation and other operations.
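The result of such a propagation can be reproduced in software with a breadth-first flood fill, sketched below in Python. This is a sequential stand-in for what the array computes asynchronously in hardware; the function name `reconstruct`, the 4-connected neighbourhood and the example data are assumptions for illustration.

```python
from collections import deque


def reconstruct(mask, markers):
    """Return the pixels of `mask` that are 4-connected to any marker pixel."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    queue = deque()
    for r, c in markers:
        if mask[r][c] and not out[r][c]:
            out[r][c] = 1
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        # The wave spreads to orthogonal neighbours that lie inside the mask.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and mask[nr][nc] and not out[nr][nc]):
                out[nr][nc] = 1
                queue.append((nr, nc))
    return out


# Two separate objects; the marker selects only the left-hand one.
mask = [[1, 1, 0, 0, 1],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 1]]
selected = reconstruct(mask, [(0, 0)])
```

A synchronous implementation would need one dilation step per pixel of path length; the asynchronous network lets the wave run to completion within a single operation.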
4 Programming model and language
From the programmer’s view, the IC is abstracted to a level that would appear familiar to anyone used to programming with digital architectures. However, the spatial and analogue nature of many commands needs to be taken into account.
To facilitate operations with a level of local autonomy, all write operations of analogue registers require that a flag register be “true” for execution to take place. The flag register can be set from logical digital operations, local analogue comparison or external input.
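A minimal Python sketch of flag-gated writing is given below. The helper names `where_positive` and `flagged_write` are hypothetical, registers are modelled as nested lists, and the negation that accompanies a real analogue transfer is omitted so that the gating itself is easier to follow.

```python
def where_positive(register):
    """Set the flag in every PE whose register value is positive."""
    return [[value > 0 for value in row] for row in register]


def flagged_write(dest, source, flag):
    """Write `source` into `dest` only in PEs whose flag is true."""
    return [[s if f else d for d, s, f in zip(drow, srow, frow)]
            for drow, srow, frow in zip(dest, source, flag)]


a = [[0.2, -0.1],
     [-0.3, 0.4]]
b = [[0.0, 0.0],
     [0.0, 0.0]]
global_input = [[1.0, 1.0],
                [1.0, 1.0]]   # the same value broadcast to every PE

flag = where_positive(a)              # local analogue comparison sets the flag
b = flagged_write(b, global_input, flag)
```

Every PE receives the same instruction, yet only flagged PEs change state, which is how a globally broadcast value can become a PE-specific write.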
With nearest orthogonal neighbour communication, functions such as edge detection can be enabled. Filters requiring larger neighbourhoods can be easily implemented using multiple neighbour transfers.
SCAMP4 vision chip code for adaptive thresholding

| Instruction | Comment |
| --- | --- |
| A ← PIX | Load analogue register A with inverted pixel value |
| B ← PIX + A | Load analogue register B with inverted sum of pixel value and A; B now holds image data over half of register full-scale range with dark (no light) represented by zero current. B is a +ve image |
| C ← DIFF(B,1,1) | Load analogue register C with image B spatially diffused in x and y directions (controlled by digital switches). C is a −ve image |
| D ← C + B + IN(2) | Add −ve diffused image in C to +ve image B; store inverted result in D. Also a (small) global constant is added here. D is a −ve image |
| | Start “Execute in locations where D > 0” block; sets flag register |
| R4 ← FLAG | Load R4 with the flag register value |
| | Resets flag register, i.e. all locations execute following code |
| | Outputs image from array B (original image) |
| | Outputs inverted diffused image from array C |
| | Outputs (image + inverted diffused image) from array D |
| | Outputs binary thresholded image from R4 (original image with thresholded pixels marked) |
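The arithmetic of the adaptive thresholding program can be emulated off-chip with the Python sketch below. Several points are assumptions rather than chip behaviour: the diffusion network is replaced by a simple 3 × 3 neighbourhood mean, the image B is taken as given (the pixel read steps are not modelled), the global constant is assumed to enter with a positive sign, and the names `diffuse` and `adaptive_threshold` are introduced for illustration.

```python
def diffuse(img):
    """Crude stand-in for DIFF: mean over each pixel's 3 x 3 neighbourhood."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def adaptive_threshold(b, offset):
    """Mark pixels that are darker than their local mean by more than `offset`."""
    # C <- DIFF(B): the write negates, so C is a negative image.
    c = [[-v for v in row] for row in diffuse(b)]
    # D <- C + B + IN: sum on the bus, stored negated.
    d = [[-(cv + bv + offset) for cv, bv in zip(crow, brow)]
         for crow, brow in zip(c, b)]
    # "Where D > 0" sets the flag; R4 <- FLAG stores it as a binary image.
    return [[1 if v > 0 else 0 for v in row] for row in d]


# Uniform background with a single dark pixel in the centre.
image_b = [[0.5, 0.5, 0.5],
           [0.5, 0.1, 0.5],
           [0.5, 0.5, 0.5]]
r4 = adaptive_threshold(image_b, offset=0.05)
```

Under these sign assumptions D equals the local mean minus the pixel value minus the offset, so the flag marks pixels that are darker than their surroundings; the threshold adapts to local illumination because the reference is the diffused image rather than a fixed level.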
5 Implementation and measurement results
The array is designed to operate at 10 MHz with an analogue supply voltage of 1.5 V and digital supply voltage of 1.8 V. The IC has a total quiescent power consumption (i.e. when not executing any instructions) of 31 μW, and a peak power consumption of ~25 mW.
5.1 Analogue performance
Registers decayed at a worst-case rate of 5 % of full-scale range per second (cf. 4 % over 20 ms in ), while the vision chip imaged an office scene at 220 lux; with high illumination levels of 70 klux, decay rate increases by a factor of ~4. Over a typical frame interval of 50 ms, this decay rate has a negligible effect.
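The claim that decay is negligible over a frame can be checked with a short calculation, shown here in Python. The decay rate, frame interval and factor of four come from the text; expressing the loss in 8-bit LSBs and treating the decay as linear over the interval are assumptions made for illustration.

```python
DECAY_PER_SECOND = 0.05    # worst case, fraction of full-scale range
FRAME_INTERVAL_S = 50e-3   # typical frame interval
BRIGHT_FACTOR = 4          # approximate increase in decay rate at 70 klux
LEVELS_8BIT = 256          # assumption: express the loss in 8-bit LSBs

# Fraction of full-scale range lost during one frame interval.
loss_office = DECAY_PER_SECOND * FRAME_INTERVAL_S
loss_bright = BRIGHT_FACTOR * loss_office

loss_office_lsb = loss_office * LEVELS_8BIT
```

At office illumination the loss is 0.25 % of full-scale range per frame, below one LSB at 8-bit resolution; under bright illumination it rises to about 1 %.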
The photo-response non-uniformity (PRNU) of the imaging system (as a percentage of full-scale range) over 80 % of register range was 1.6 % and 1.5 % rms for PEs containing N-well photodiodes and N+/P-subst photodiodes respectively. While the N-well photodiode array has slightly higher PRNU, this array provides 3× higher sensitivity than the N+/P-subst photodiodes, through improved responsivity and reduced photodiode capacitance. For the N-well sensors, with the sensor fitted with an F1.4 lens, light sensitivity can be varied by adjustment of Pbias (see Fig. 7) from 140 to 220 LSBs (lux·s)⁻¹ (with 8-bit digitisation). The fill factor is 6.5 % for the N+/P-subst diode and 5.7 % for the N-well diode.
5.2 Digital performance
Logical operations and asynchronous processing have been found to proceed as designed. Data propagation from cell to cell within the asynchronous network progresses in a time of 16 ns, or uni-directionally across the entire array in 1 μs.
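The two propagation figures are consistent with each other, as the following Python calculation shows. The cell delay and array dimensions come from the text; associating the quoted 1 μs with the 64-cell dimension is an assumption.

```python
CELL_DELAY_NS = 16
ARRAY_LONG, ARRAY_SHORT = 64, 20   # array dimensions in PEs

# Time for a wave to cross the array along each axis, in nanoseconds.
across_long_axis_ns = CELL_DELAY_NS * ARRAY_LONG
across_short_axis_ns = CELL_DELAY_NS * ARRAY_SHORT

# Frame period at 30,000 fps, for comparison.
frame_period_ns = 1e9 / 30000
```

A traversal of the 64-cell dimension takes about 1.02 μs, a small fraction of the roughly 33 μs frame period available at 30,000 fps.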
The digital system within the IC only allows the short-term storage of data bits in DRAM; due to leakage of the minimum-sized write transistor, a refresh is required every 50 μs. This is expected to be improved for a successor IC.
5.3 Code example
6 Example application
The completed system combines the presented vision chip, an instruction delivery and control system (synthesised upon a Spartan-3 FPGA) and a communications interface to upload algorithms and receive data that the vision chip is programmed to return. Algorithms are developed and compiled using the APRON compiler. Returned data for analysis and inspection is viewed using a custom viewer program running on a host computer. Unlike many high-speed imaging systems, we wanted to demonstrate that, as well as acquiring images at high frame rates, we are also capable of processing them in real time, on the focal plane, to determine a non-trivial metric. In this example application we wish to discriminate between open and closed binary shapes, i.e. ‘O’ versus ‘U’, outputting from the vision sensor only the processed result. If a closed shape lies within a pre-defined area of the focal plane, the vision chip system will return ‘true’ to the host, else ‘false’. This occurs at a frame rate of 30,000 fps, limited by the rate at which instructions can be issued by the system controller.
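One way to discriminate open from closed shapes with a flood fill is sketched below in Python. The text here does not list the on-chip program, so this is an assumed approach rather than the authors' code: background pixels are flooded from the border of the region, and any background pixel left unreached must be enclosed by a shape. The function name `has_closed_shape` and the test patterns are illustrative.

```python
from collections import deque


def has_closed_shape(image):
    """image: rows of 0 (background) or 1 (shape). True if background is enclosed."""
    rows, cols = len(image), len(image[0])
    reached = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Launch the wave from every background pixel on the border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and image[r][c] == 0:
                reached[r][c] = True
                queue.append((r, c))
    # Propagate through 4-connected background pixels.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and image[nr][nc] == 0 and not reached[nr][nc]):
                reached[nr][nc] = True
                queue.append((nr, nc))
    # Any unreached background pixel lies inside a closed contour.
    return any(image[r][c] == 0 and not reached[r][c]
               for r in range(rows) for c in range(cols))


shape_o = [[0, 0, 0, 0, 0],
           [0, 1, 1, 1, 0],
           [0, 1, 0, 1, 0],
           [0, 1, 1, 1, 0],
           [0, 0, 0, 0, 0]]
shape_u = [[0, 0, 0, 0, 0],
           [0, 1, 0, 1, 0],
           [0, 1, 0, 1, 0],
           [0, 1, 1, 1, 0],
           [0, 0, 0, 0, 0]]
```

The final `any` corresponds to a single global OR readout, so only one bit per frame needs to leave the sensor.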
The wheel on average spins at 3,000 rpm; with a radius of 23 mm, the shapes move at 7.2 m/s. After triggering, the vision chip samples images at 30,000 fps, returning one bit of resultant data per frame (indicating the presence of a closed shape in a frame). To enhance the visualisation, the user can select a single sampling point at which an entire single frame is returned and thus can visually inspect whether the frame contains an open or closed shape, and whether the waveform is in agreement. Since carrying out this work, we have demonstrated closed object detection at 100 kfps within a 256 × 256 array.
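The quoted shape speed follows directly from the wheel geometry, as this short Python calculation shows. The rotation rate, radius and frame rate come from the text; the travel per frame is a derived figure added here for illustration.

```python
import math

RPM = 3000
RADIUS_M = 0.023
FPS = 30000

# Tangential speed at the given radius: circumference times revolutions per second.
speed_m_per_s = 2 * math.pi * RADIUS_M * RPM / 60

# Distance a shape moves between consecutive frames, in millimetres.
travel_per_frame_mm = speed_m_per_s / FPS * 1000
```

The speed evaluates to about 7.2 m/s, and a shape moves roughly 0.24 mm between frames at 30,000 fps.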
A prototype 20 × 64 vision sensor/processor chip has been tested and successfully applied to an application in high-speed imaging operating at 30,000 fps. This has demonstrated the utility of asynchronous propagation and the data reduction offered by on-focal-plane processing, and provided the post-silicon validation of the new PE design. The follow-up vision chip implementation, scaled up to a larger array size, will enable high-speed and efficient implementation of image processing algorithms in practical machine vision applications.
- 4. Blug, A., Strohm, P., Carl, D., Hofler, H., Blug, B., & Kailer, A. (2012). On the potential of current CNN cameras for industrial surface inspection. 2012 13th international workshop on cellular nanoscale networks and their applications (CNNA), IEEE.
- 6. Carey, S. J., Barr, D. R. W., Wang, B., Lopich, A., & Dudek, P. (2012). Locating high speed multiple objects using a SCAMP-5 Vision-Chip. IEEE workshop on cellular nanoscale networks and applications, CNNA 2012, Turin.
- 7. Carey, S. J., Barr, D. R. W., & Dudek, P. Low power high-performance smart camera system based on SCAMP vision sensor. Journal of Systems Architecture (in press).
- 10. Lopich, A., & Dudek, P. (2010). An 80 × 80 general-purpose digital vision chip in 0.18 μm CMOS technology. IEEE International Symposium on Circuits and Systems, ISCAS 2010 (pp. 4257–4260).
- 12. Ginhac, D., Dubois, J., Paindavoine, M., & Heyrman, B. (2008). An SIMD programmable vision chip with high-speed focal plane image processing. EURASIP Journal of Embedded Systems, 2008, 1–13.
- 15. Rodriguez-Vazquez, A., Dominguez-Castro, R., Jimenez-Garrido, F., Morillas, S., Garcia, A., Utrera, C., et al. (2010). A CMOS vision system on-chip with multi-core, cellular sensory-processing front-end. In C. Baatar, W. Porod, & T. Roska (Eds.), Cellular nanoscale sensory wave computing (p. 129). New York: Springer.
- 16. Poikonen, J., Laiho, M., & Paasio, A. (2009). MIPA4k: A 64 × 64 cell mixed-mode image processor array. IEEE international symposium on circuits and systems, ISCAS 2009 (pp. 1927–1930).
- 17. Gasparini, L., Manduchi, R., & Gottardi, M. (2010). An ultra-low-power contrast-based integrated camera node and its application as a people counter. In Seventh IEEE international conference on advanced video and signal based surveillance (AVSS) (pp. 547–554).
- 18. Hoffmann, R., Weikersdorfer, D., & Conradt, J. (2013). Autonomous indoor exploration with an event-based visual SLAM system. European conference on mobile robots, Barcelona, Spain.
- 19. Carey, S. J., Barr, D. R. W., Wang, B., Lopich, A., & Dudek, P. (2012). Mixed signal SIMD cellular processor array vision chip operating at 30,000 fps. In 19th IEEE international conference on electronics, circuits and systems (ICECS) (pp. 324–327).
- 21. Carey, S. J., Lopich, A., & Dudek, P. (2011). A processor element for a mixed signal cellular processor array vision chip. IEEE international symposium on circuits and systems, ISCAS 2011, Rio de Janeiro.
- 24. Dudek, P. (2000). A programmable focal-plane analogue processor array. Ph.D. thesis, University of Manchester Institute of Science and Technology (UMIST).
- 26. Toumazou, C., Hughes, J. B., & Battersby, N. C. (Eds.). (1993). Switched-currents: An analogue technique for digital technology. London: Peter Peregrinus Ltd.
- 31. Carey, S. J., Barr, D. R. W., Lopich, A., & Dudek, P. (2013). A 100,000 fps Vision Sensor with Embedded 535 GOPS/W 256x256 SIMD Processor Array. VLSI circuits symposium, Kyoto, Japan, 12–14 June 2013 (accepted).
Open Access: This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.