1 Introduction

Vision chips integrate image sensing and pixel-level processing on a single silicon die (Fig. 1) and permit massively parallel processing of pixel-derived data. Under different configurations, a vision chip can be made to execute image processing algorithms with very low power consumption or, alternatively, to operate at very high speeds. Both attributes arise from executing the algorithm on an array of processing elements (PEs) on the focal plane. Information is processed within a PE and between near-neighbour PEs, with only data extracted from the images communicated between the chip and the system controller. Overall, the dataflow is reduced relative to conventional systems, which benefits both low-power and high-speed applications. Hence, vision chip based systems have significant promise to yield processed information with superior performance over traditional image processing methods that separate the subsystems of imaging and processing. Applications requiring high-speed operation have traditionally used ultra-high frame rate cameras [1, 2] coupled to FPGA or PC hardware; while some non-conventional approaches [3] have recently been proposed, all these systems produce prodigious data rates. For such applications, vision chips can offer an alternative solution that eliminates the sensory readout bottleneck [4–6]. For systems requiring very low power operation, the need to run an ADC merely to establish that there has been no change to the imaged scene is a distinct disadvantage. The vision chip approach can circumvent the high data-rate issue and obviate the need for continuously operating ADCs, achieving very low power system operation [7].

Fig. 1 Vision chip processing architecture: each pixel contains a photodetector (PIX) and a PE (ALU and memory)

Recent vision chips, in common with the chip described here, operate as SIMD computational devices; such devices have been presented widely in digital [8–10], analogue [11–14] and mixed-signal form [15, 16]. Designers of wholly digital vision chips have tended to adopt larger cell sizes (65 × 25 μm [8], 67 × 64 μm [9] and 51 × 54 μm [10]) than their analogue counterparts (49 × 49 μm [11], 35 × 35 μm [12], 34 × 29 μm [13], 26 × 26 μm [14]). With smaller semiconductor process nodes (e.g. 45 nm) becoming available, it will be possible to shrink the same digital functionality onto a much smaller silicon area than the analogue equivalent. However, in lower cost processes, specifically 0.18 μm, analogue processors can achieve bit-equivalent memories and ALUs similar to their digital counterparts in a more compact space. The relatively large pixel pitch of all recent vision chips [8–16] limits the pixel count to typically <20 kpixels, making them suitable for applications in which high speed or low power is required but very high spatial resolution is not essential, for example surveillance systems [7, 17], on-line analysis in industrial production systems [4], or vision systems for robots [18]. The separation of processing and imaging areas (described in [8]) removes some of the constraints on PE silicon area that otherwise result in a low fill factor; the design compromise made for this approach is that algorithms must start by reading out data from the array into the close-connected PEs, which might slow frame rates in fast image processing.

This paper, which discusses a prototype vision chip designed with a 64 × 20 pixel array, extends the descriptions provided in [19]. This test IC is intended to be the precursor of a larger array. In Sect. 2, we discuss the architecture of the chip, with Sect. 3 describing the circuits incorporated into the PE. Section 4 discusses the programming model and language used to define the operation of the IC, and Sect. 5 the implementation and measurement results. Section 6 relates an example application of the IC. We include a description of a program that distinguishes between open and closed shapes at 30,000 fps, demonstrating the utility of focal-plane processing. With the processor-per-pixel architecture, the program is expected to scale to larger processor arrays without significant reduction in execution speed. As earlier applications have also demonstrated [20], frames are not read out at this speed; only the results of processing operations are transferred to the program controller.

2 Architecture overview

An overview of the chip architecture is shown in Fig. 2. In this IC, each PE of the array incorporates (in addition to a photodiode) a mixed set of analogue and digital storage registers, with processing capability in both domains [21], allowing efficient processing and storage of data dependent on data type. While the chip utilises analogue processing, each PE is a software-programmable sampled-data system (hybrid analogue/digital) [22], in common with digital processors, rather than a continuous-time processor.

Fig. 2 Architecture of the IC

The IC is controlled by an external instruction issuing device. All instructions provided to the IC are executed according to the SIMD paradigm, that is, all the processors of the array execute simultaneously the same instruction but upon their own local data. The local data available extends to that of their nearest orthogonal neighbours. The instruction set supports arithmetic operations of addition, subtraction, negation, squaring and comparison performed on analogue registers, and logic OR and NOR operations performed on binary registers. Hardware that facilitates SIMD programming is a good fit with low-level pixel processing operations, whereby operations are repeated on a per-pixel basis.

The IC described has an architecture that is intended to allow its deployment in many different applications, while the physical pitch of the PEs allows an IC with greater resolution to be fabricated at lower cost than its predecessor SCAMP-3 [11], and additionally allows the use of smaller optics. The IC utilises switched-current memories for analogue storage, with binary storage provided by 3-transistor digital DRAM memories as advanced in ASPA vision chips [10]. The sub-systems that could be incorporated in the PEs of the array were limited by the connectivity that could be provided to the PEs. However, relative to SCAMP-3 with three metal layers available, the six metal layers of the 0.18 μm process have afforded considerable additional features and functionality. The additional metal layers, however, also have deleterious effects on sensing uniformity, since the photodiode is now at the base of the metal-layer stack, which partially shadows the light-sensitive area. Applications of vision chips such as motion, edge detection, tracking and shape detection require a combination of analogue and digital operations. In the design of this IC, we have shifted the ratio of analogue to digital memory, reducing the number of space-consuming analogue registers while increasing the digital resources available. Furthermore, the digital memories can be used to configure an asynchronous propagation network [23], allowing global operations on data in addition to local logical operations. Also added to this chip are a controllable diffusion network, allowing large-scale spatial low-pass filtering to occur in a single instruction, and a squarer offering one-quadrant operational capability. The IC provides for global analogue and digital data write and read; PE-specific digital write is also provided, hence allowing PE-specific analogue write via a flagged global analogue write operation.

PEs are controlled by means of a 61-bit instruction code word (ICW), with additional control supplied from 12 analogue inputs. The long ICW is registered on chip and input by means of a 16-bit bus, allowing a reduction in external pin-count. To deliver the ICW to the PE array requires large scale connectivity (relative to the compact PE size) with row and column drivers spread across all four sides of the IC.

The chip supports random access readout of analogue or digital data from the PEs, allowing for full array or region-of-interest readout. Additionally, flexible row and column decoders allow simultaneous addressing of pixel-blocks: analogue data to be readout is summed to provide regional or global analogue outputs; logic OR operation is performed on the digital output data within the addressed block.

3 Circuits

3.1 Analogue processor fundamentals

The analogue processor in this work is based on the ideas developed in [22, 24]; these have been recently explained in [25] but are included here for completeness. The processor used in this work exploits switched-current (SI) circuit techniques [26]. A basic SI memory cell is illustrated in Fig. 3. When a MOS transistor works in the saturation region, its drain current \(I_{ds}\) can be, in a first-order approximation, described by:

$$I_{ds} = K(V_{gs} - V_{t} )^{2}$$
(1)

where \(K\) is the transconductance factor, \(V_{t}\) is the threshold voltage and \(V_{gs}\) is the gate-source voltage.

Fig. 3 Basic SI memory cell. a During storage both switches are open, b both switches are closed when the cell is ‘written to’, c only switch ‘S’ is closed when the cell is ‘read from’

The SI memory cell remembers the value of the input current by storing charge on the gate capacitance \(C_{gs}\) of the MOS transistor. When writing to the cell, both switches, S and W, are closed (Fig. 3b). The transistor is diode-connected and the input current \(i_{in}\) forces the gate-source voltage \(V_{gs}\) of the transistor to the value corresponding to the drain current \(I_{ds} = I_{REF} + i_{in}\), according to (1). At the end of the write phase, switch W is opened and thus the gate of the transistor is disconnected, i.e. put into a high-impedance state. Due to charge conservation on the capacitor \(C_{gs}\), the voltage at the gate \(V_{gs}\) will remain constant.

When reading from the memory cell, switch S is closed while switch W remains open (Fig. 3c). Now the transistor acts as a current source. As the gate-source voltage \(V_{gs}\) is the same as the one that was set during the write phase, the drain current \(I_{ds}\) has to be the same (provided that the transistor remains in the saturation region), and hence the output current \(i_{out}\) is equal to

$$i_{out} = I_{ds} - I_{REF} = i_{in}$$
(2)

Therefore, the SI memory cell is, in principle, capable of storing a continuous-valued (i.e. real) number, within some operational dynamic range.
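The write and read phases described above can be illustrated with a minimal first-order sketch based only on Eqs. (1) and (2). The numerical values of `K`, `V_T` and `I_REF` are hypothetical and chosen purely for illustration; error sources such as charge injection and leakage are ignored.

```python
import math

K = 50e-6      # transconductance factor in A/V^2 (hypothetical value)
V_T = 0.5      # threshold voltage in V (hypothetical value)
I_REF = 5e-6   # reference (bias) current in A (hypothetical value)


class SIMemoryCell:
    """First-order model of the SI cell of Fig. 3 using Eq. (1)."""

    def __init__(self):
        self.v_gs = V_T  # gate voltage held on C_gs

    def write(self, i_in):
        # Switches S and W closed: the diode-connected transistor settles
        # to the V_gs for which I_ds = I_REF + i_in.
        i_ds = I_REF + i_in
        if i_ds < 0:
            raise ValueError("input current outside the operating range")
        self.v_gs = V_T + math.sqrt(i_ds / K)

    def read(self):
        # Only S closed: the held V_gs sets I_ds, and i_out = I_ds - I_REF.
        i_ds = K * (self.v_gs - V_T) ** 2
        return i_ds - I_REF


cell = SIMemoryCell()
cell.write(2e-6)
i_out = cell.read()   # reproduces the written current, Eq. (2)
```

In this idealised model the stored value is limited only by the requirement that the total drain current stays non-negative, which is one way to picture the operational dynamic range mentioned above.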

In addition to data storage, the PE needs to provide a set of basic arithmetic operations. These can be achieved with a very low area overhead in the current-mode system. Consider a system consisting of a number of registers implemented as SI memory cells, connected to a common analogue bus. Each memory cell can be configured (using corresponding switches S and W) to be read from, or written to (Fig. 4).

Fig. 4 Basic operations in a current-mode processor: a register data transfer, b addition, c division by two

The basic operation is the transfer operation, as shown in Fig. 4(a). Register A (configured for reading) provides the current \(i_{A}\) to the analogue bus. This current is consumed by register B (configured for writing); register C is not selected, and hence it is disconnected from the bus. Therefore, the analogue (current) data value is transferred from A to B. This transfer is denoted as B ← A. According to the current memory operation, register B will produce the same current when it is read from (at some later instruction cycle). If we consider that the data value is always the one provided into the analogue bus, it can be easily seen that \(i_{B} = -i_{A}\), i.e. the basic transfer operation includes negation of the stored data value.

The addition operation (and in general current summation of a number of register values) can be achieved by configuring one register for writing, and two (or more) registers for reading. The currents are summed, according to Kirchhoff’s current law, directly on the analogue bus. For example, the situation shown in Fig. 4(b) produces the operation A ← B + C. Subtraction is performed by negation followed by addition.

The division operation is achieved by configuring one register cell for reading and many (typically two) registers for writing. The current is split, and if the registers are identical then it is divided equally between the registers that are configured for writing, producing a division by a fixed factor. For example, in Fig. 4(c) both registers A and B will store current equal to half of the current provided by register C. We denote this instruction as DIV (A + B) ← C.

Multiplication and division by constant factors (e.g. coefficients in a convolution kernel) can be simply achieved by multiple application of add or divide-by-two operations. In the presented design, a compact current-mode squarer sub-system [27] is implemented. It provides the squared analogue current value (this is useful for algorithms requiring, for example, energy or mean-squared error calculations), and allows multiplication to be achieved through multiple steps, by determining \((x + y)^{2} - x^{2} - y^{2}\) \((= 2xy)\). While multiplication is therefore not a one-step command, the squarer requires few transistors and can be fitted into a compact space within the PE. Other instructions could be implemented in hardware using current-mode circuits; however, in a vision chip application silicon area is at a premium, so only those computation elements that can be implemented compactly can be used. Consequently, more silicon area can be devoted to register circuits, increasing the amount of local memory (or improving the accuracy of computations, which largely depends on device matching, i.e. device area).
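The bus operations of this section can be summarised in a small idealised sketch. The `AnalogueBus` class and the `square` helper are hypothetical names introduced here for illustration; the sketch assumes perfectly matched registers, applies the inversion to every write (including the divided ones), and uses an ideal, unscaled squarer rather than the one-quadrant circuit of Sect. 3.2.2.

```python
class AnalogueBus:
    """Idealised current-mode register file sharing one analogue bus."""

    def __init__(self, names):
        self.reg = {name: 0.0 for name in names}

    def transfer(self, writers, readers):
        # Currents from all read registers sum on the bus (Kirchhoff's
        # current law); the total splits equally between identical
        # write registers, and every write stores the negated value.
        bus_current = sum(self.reg[r] for r in readers)
        share = -bus_current / len(writers)
        for w in writers:
            self.reg[w] = share


def square(x):
    # Stand-in for the hardware squarer (ideal, unscaled).
    return x * x


pe = AnalogueBus("ABCD")
pe.reg["B"], pe.reg["C"] = 1.0, 0.5

pe.transfer(["A"], ["B", "C"])        # A <- B + C, stores -(B + C)
sum_result = pe.reg["A"]

pe.transfer(["A", "D"], ["B"])        # DIV (A + D) <- B, each stores -B/2
half_result = pe.reg["A"]

x, y = 0.3, 0.2
product = (square(x + y) - square(x) - square(y)) / 2   # equals x * y
```

Subtraction follows the same pattern: one extra transfer negates an operand before it is summed on the bus.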

3.2 Analogue sub-system

The analogue sub-system of the PE implemented on the chip comprises six general purpose analogue registers (A,B,C,D,E,F), 1 special purpose and neighbour communication register (“NEWS”), a variable strength diffusion network, a voltage controlled current source (the “IN” register) that facilitates the global input of data to the array, a current squarer, and a photodiode (two types, N-well diode and N+/P-subst, are explored—split equally across the array). Additionally, an analogue comparator detects the sign of a register’s stored current, storing the result in a local activity register (1-bit SRAM flag). The 1-bit SRAM flag is used to allow conditional code execution. The flag data can be also transferred to one of the 13 bits of DRAM memory and processed in the digital sub-system. The digital sub-system can also write to the flag register, providing a communication path from digital to analogue domain. The PE includes connectivity to allow either analogue or digital sub-systems to access a common analogue/digital bus allowing exchange of data within the cell or to its orthogonal neighbours.

3.2.1 Analogue memories

The analogue memories of this IC are based on the S2I cell [28]. The standard cell was modified to include an error correction circuit [24] to allow a reduction in cell size while maintaining low signal-dependent error (since signal-independent errors can be accommodated algorithmically [22]). These errors have their origin in switch signal clock feedthrough, charge injection and output conductance errors; the addition of this circuit allows these errors to be compensated for.

3.2.2 Squarer

The squarer (Fig. 5) is based on the design presented in [27]. It enables one-quadrant squaring operations within the PE; it reads a current \(I_{in}\) from the analogue bus, with one or more general purpose registers sourcing current to the bus. In the design presented below, we use two separate N-wells for the three identical P-MOS transistors, with MP1 and MP2 sharing one N-well and MP3 occupying the other (the N-wells are all biased to their respective P-MOS source nodes). The output current, \(I_{sq}\), is stored in the special-purpose NEWS register.

Fig. 5 Squarer schematic

From examination of Fig. 5, with \(I_{MP1} = I_{MP2}\) and \(I_{MN1} = I_{MN2}\), the following circuit equations can be written:

$$I_{MP1} - I_{MP3} = I_{in}$$
(3)
$$I_{sq} = I_{MP3} + I_{MP1} - I_{Msq}$$
(4)
$$I_{MP1} = \beta_{P} (V_{dd} - V_{g1} - V_{tp} )^{2}$$
(5)
$$I_{MP3} = \beta_{P} (V_{g1} - V_{b1} - V_{tp} )^{2}$$
(6)

After some algebra, from Eqs. (3), (5) and (6) and with the substitution \(a = V_{dd} - V_{b1} - 2V_{tp}\), we can derive an expression for the current \(I_{MP1}\):

$$I_{MP1} = \frac{{I_{in}^{2} }}{{4a^{2} \beta_{P} }} + \frac{{I_{in} }}{2} + \frac{{\beta_{P} a^{2} }}{4}$$
(7)

Summing Eqs. (3) and (4) and using the expression for \(I_{MP1}\) from Eq. (7):

$$I_{sq} = - I_{Msq} + \frac{{\beta_{P} a^{2} }}{2} + \frac{{I_{in}^{2} }}{{2\beta_{P} a^{2} }}$$
(8)

The squarers of the array are normally set up via the two global bias voltages, \(V_{b1}\) and \(V_{b2}\), such that there is no offset current and the first two terms of Eq. (8) sum to zero. When this condition is met, Eq. (8) reduces to:

$$I_{sq} = \frac{{I_{in}^{2} }}{{4I_{Msq} }}$$
(9)

The prerequisite of correct circuit operation requires that transistors remain in saturation, with transistor MP3 being the limiting device. This requires that \(V_{g1} - V_{b1} \ge V_{tp}\). With substitutions from Eqs. (5) and (7)

$$I_{in} \le \beta_{P} a^{2}$$
(10)

Coupled with the “no-offset” constraint applied to Eq. (8), i.e. \(I_{Msq} = \beta_{P} a^{2}/2\), we then have

$$I_{in} \le 2I_{Msq}$$
(11)

In order to maximize the output dynamic range of \(I_{sq}\), \(I_{Msq}\) is minimised to half of the maximum input current, \(I_{in\_max}/2\), giving a maximum squarer output (from Eq. 9) also equal to \(I_{in\_max}/2\).
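The derivation can be checked numerically with a short sketch that evaluates the circuit equations (3) to (6) directly and compares the result with the closed form of Eq. (9). The values of `BETA_P` and `A_BIAS` are hypothetical; the square-law model is the same first-order approximation used in the text, so mismatch and second-order effects are not represented.

```python
import math

BETA_P = 20e-6   # PMOS transconductance factor in A/V^2 (hypothetical)
A_BIAS = 0.5     # a = V_dd - V_b1 - 2*V_tp in V (hypothetical)


def squarer_output(i_in, i_msq):
    """Evaluate I_sq from the circuit equations (3)-(6)."""
    # Eq. (3) with Eqs. (5), (6): the overdrive voltages u (MP1) and
    # v (MP3) satisfy u + v = a and beta*(u^2 - v^2) = I_in.
    u = 0.5 * (A_BIAS + i_in / (BETA_P * A_BIAS))
    v = A_BIAS - u
    if v < -1e-12:
        raise ValueError("MP3 leaves saturation: I_in > beta_P * a^2")
    i_mp1 = BETA_P * u ** 2               # Eq. (5)
    i_mp3 = BETA_P * v ** 2               # Eq. (6)
    return i_mp3 + i_mp1 - i_msq          # Eq. (4)


i_msq = BETA_P * A_BIAS ** 2 / 2          # "no-offset" bias condition
i_in_max = 2 * i_msq                      # Eq. (11)

for i_in in (0.0, 1e-6, 2.5e-6, i_in_max):
    direct = squarer_output(i_in, i_msq)
    closed_form = i_in ** 2 / (4 * i_msq)  # Eq. (9)
    assert math.isclose(direct, closed_form, rel_tol=1e-9, abs_tol=1e-15)

max_output = squarer_output(i_in_max, i_msq)   # equals i_in_max / 2
```

With the no-offset bias, the input and output ranges coincide at their upper end: an input of \(I_{in\_max}\) produces an output of \(I_{in\_max}/2\), as stated above.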

3.2.3 Diffusion network

The diffusion network enables the localised spatial averaging of a register’s contents. For a register storing an image, this functions as an image low-pass spatial filtering operation. This is a process that requires significant computation in the digital domain; in the analogue domain, it is a simple, single command giving each PE access to a local regional average. This can be useful, for example, for producing a locally adaptive threshold for object extraction that automatically compensates for differences in object lighting. The additional hardware consists of two N-type transistors configured to act as resistors (with additional switch transistors) between adjacent elements’ analogue buses, with diffusion strength controlled by means of an analogue bias voltage. With the bus voltage held constant by the second phase of the S2I cell write cycle, variations in inter-PE resistance through changes in \(V_{gs}\) are minimal. Any combination of analogue registers can then act as sources or sinks to the diffusion network (shown in Fig. 6). The network can be locally broken, if required, in the horizontal and/or vertical directions, controlled by outputs from the local digital register bank. The point spread function of a 1-D linear array of PEs assembled as a diffusion network is an exponential function dependent on the S2I cell source resistance, the PE interconnect resistance and the programming resistance of the S2I cell [29]. A 2-D resistive grid simulation shows similar properties; however, its analysis is not simply tractable.

Fig. 6 Diffusion network of current sources and sinks

3.2.4 Pixel circuit

The pixel circuit is shown in Fig. 7. Together with two analogue registers, the imaging system operates utilising differential double sampling. The circuit consists of a photodiode and MOSFET (M_pd) operating in the linear region, converting the photodiode voltage to a current that can then be stored in an analogue register. Having been illuminated by light, current from the pixel circuit, i_pix, is sampled by register A via the analogue bus, followed by pixel charging to Vres via the Reset signal. This command is flag controllable to allow local autonomy of reset execution. By reading the pixel circuit current a second time, in parallel with the reading of register A, the result can be stored in a second register B. Register B then holds the sum of the pixel reset level added to register A (which stores the inverted sampled pixel current); register B then contains the image.

Fig. 7 Pixel circuit

3.3 Digital sub-system

The digital sub-system is used to perform logical data operations and storage. Source data will generally be determined from a precursory analogue operation using the in-PE comparator. The PE should contain local storage sufficient for in-PE binary image processing, analogue to digital conversion and also diverse flag-conditional analogue operations with binary masks or markers.

The PE contains 14 bits of local storage: 13 DRAM registers (numbered R0–R12) and 1 SRAM register (flag). Twelve of the registers (R0–R11) consist of standard 3T DRAM memories, while R12 is similar, but has a load function dependent upon R11. All these registers have outputs that connect to the local read bus (LRB) and inputs that connect to the local write bus (LWB). The basic schematic is shown in Fig. 8.

Fig. 8 Digital register read/write

Most digital operations consist of two phases, a pre-charge phase and a read/write phase. Both these phases are incorporated within a single chip instruction cycle. The LRB is generally pre-charged “high” on the first phase. A second phase will normally consist of simultaneous read and load operations with the additional selection of inverting or non-inverting storage with the output of all operations directed to the LWB. By reading several registers to the LRB simultaneously, a “wired” OR or NOR function can be enabled. From these primitive operations, other logical operations can be synthesised from multiple steps. The LWB is the common node that sources the load data to all digital registers. Transfer of data from DRAM to the SRAM flag (providing local autonomy) is performed in a similar way to inter-DRAM memory operations, i.e. by selective discharge of the LRB.
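The two-phase bus operation and the synthesis of further logic functions can be sketched as follows. This is a behavioural illustration only: it assumes that a register holding logic 1 discharges the pre-charged LRB, so that the raw bus level is the NOR of the registers read and inverting storage yields OR. The function names are hypothetical, and the step counts of the real instruction set may differ.

```python
def read_lrb(*stored_bits):
    """Pre-charge the LRB high, then let any register holding 1 discharge it."""
    lrb = 1                      # phase 1: pre-charge
    for bit in stored_bits:      # phase 2: simultaneous read
        if bit:
            lrb = 0              # selective discharge
    return lrb                   # wired NOR of the registers read


def nor(*bits):
    return read_lrb(*bits)       # non-inverting store of the LRB level


def or_(*bits):
    return 1 - read_lrb(*bits)   # inverting store of the LRB level


def not_(a):
    return nor(a)                # NOR of a single register


def and_(a, b):
    # De Morgan: a AND b = NOR(NOT a, NOT b)
    return nor(not_(a), not_(b))


def xor(a, b):
    # a XOR b = NOR(a AND b, NOR(a, b))
    return nor(and_(a, b), nor(a, b))


truth_table = [(a, b, or_(a, b), nor(a, b), and_(a, b), xor(a, b))
               for a in (0, 1) for b in (0, 1)]
```

Each helper that calls `nor` more than once corresponds to a multi-step sequence on the chip, with intermediate results held in spare DRAM registers.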

The signal from the LWB can also be directed to the combined analogue/digital bus of the PE, enabling neighbourhood communications.

Additionally, a frequent binary image operation is the individual classification of objects that have passed an initial threshold/segmentation (i.e. blob analysis). This might be on the basis of size or shape. As such, a basic operation of many image processing algorithms is geodesic reconstruction (from a marker) of the subset of pixels that belong to an object. In a conventional synchronously operating image processing system, this is a time-consuming task because multiple dilation steps are needed. Within the IC described, these iterative operations can be reduced to a few simple steps that allow a propagation wave to be launched from one or several pixels within the array [23]. The wave then propagates asynchronously throughout the whole array. This enables very efficient object segmentation, object reconstruction, hole filling, watershed transformation and other operations.

In order to enable conditional execution and global asynchronous processing, register R12 (with R11-controlled load) is exploited. Figure 9 shows the circuit configuration for running asynchronous propagations. Signal R12_out is the storage node of register R12, and R0_out is the storage node of register R0. The arrangement is essentially the concatenation of each PE’s R12 output, to the inputs of its neighbour’s R12, creating a 2-D logic chain (shown in 1-dimension in the figure). As register R12 of the n-th PE is read, it produces the output at the LRB; this is then fed through an inverter to the analogue/digital bus of the neighbour cell (n + 1), writing the data (gated by Global_prop_en, and R0_out) to the LWB, and then, gated through R11_out, onto the storage node of R12. In order to execute the propagation, an initial marker has to be stored in register R12 in either one or more locations. After that, by continuously reading and writing to R12 with appropriate neighbours selected, the marker propagates asynchronously from cell to cell within the propagation space. The mask that defines from which neighbours triggering is allowed, is stored in the local memory (R0–R3; just R0_out is shown in Fig. 9), thus enabling local constraint of the propagation network topology. R11 essentially controls the available space where the trigger wave can propagate. The result of the propagation is stored in R12. The operation executes the effective ‘flood-fill’ instruction: expand R12 to where R11 = “0”.
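The net effect of the propagation, expanding R12 into the region where R11 = “0”, can be sketched in software. The function below is a sequential stand-in for what the hardware does asynchronously; it assumes unrestricted 4-neighbour connectivity, so the directional masks held in R0 to R3 are not modelled, and the `propagate` name and the test pattern are illustrative only.

```python
def propagate(marker, blocked):
    """Expand the marker (R12) into every 4-connected cell where R11 == 0."""
    rows, cols = len(marker), len(marker[0])
    r12 = [row[:] for row in marker]
    frontier = [(i, j) for i in range(rows) for j in range(cols) if r12[i][j]]
    while frontier:
        i, j = frontier.pop()
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                if not r12[ni][nj] and not blocked[ni][nj]:
                    r12[ni][nj] = 1
                    frontier.append((ni, nj))
    return r12


# R11 pattern: a closed ring of blocked cells around one free cell.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
seed = [[0] * 5 for _ in range(5)]
seed[0][0] = 1                      # initial marker in one corner

filled = propagate(seed, ring)      # wave fills the outside, not the hole
```

Because the wave launched from the corner cannot cross the ring, the enclosed cell stays unmarked, which is the kind of property that hole filling and object reconstruction rely on.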

Fig. 9 Memories configured as an asynchronous propagation network

4 Programming model and language

From the programmer’s view, the IC is abstracted to a level that would appear familiar to anyone used to programming with digital architectures. However, the spatial and analogue nature of many commands needs to be taken into account.

An array of SIMD PEs executes the same instruction on all processors (PEs) within the array, while the data upon which the processors operate is local to the PEs. Data can be input to the array via the IO systems of the chip or, more usually, from incident light (which is naturally a wholly parallel operation that does not require extensive IO resource). Each PE in the array (in the case of this work) contains 6 general purpose analogue registers, which can be likened to signed 8-bit equivalent digital registers. The special purpose analogue register (known as NEWS), which connects a PE to its 4 nearest neighbours, gives it access to neighbour data. Furthermore, each PE has 14 bits of binary storage (R0–R12 and ‘flag’) available to it. For the programmer, a register (whether digital or analogue) is always utilized in its entirety across the array, as a planar array structure as depicted in Fig. 10.

Fig. 10 Array processor view of the SIMD architecture

Hence an operation of A ← (B + C) (operating in a single instruction cycle) will execute the operation

$${\text{A}}_{{{\text{i}},{\text{j}}}} = - \left( {{\text{B}}_{{{\text{i}},{\text{j}}}} + {\text{C}}_{{{\text{i}},{\text{j}}}} } \right)$$
(12)

for all cells PE(i,j) within the array. As noted above, all analogue operations are inverting in this design.

To facilitate operations with a level of local autonomy, all write operations of analogue registers require that a flag register be “true” for execution to take place. The flag register can be set from logical digital operations, local analogue comparison or external input.

With nearest orthogonal neighbour communication, functions such as edge detection can be enabled. Filters requiring larger neighbourhoods can be easily implemented using multiple neighbour transfers.
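The programming model described in this section, with array-wide inverting operations, flag-controlled writes and neighbour access, can be sketched as below. The helper names (`plane`, `write_sum`, `from_east`) are hypothetical and do not correspond to the chip’s instruction mnemonics; the neighbour helper returns the neighbour value without inversion and assumes a zero border, both simplifications made for illustration.

```python
ROWS, COLS = 3, 4


def plane(fill=0.0):
    # One register viewed as a plane across the whole array (Fig. 10).
    return [[fill] * COLS for _ in range(ROWS)]


def write_sum(dst, srcs, flag):
    # dst <- sum(srcs): every PE whose flag is set stores the negated sum
    # of its own source values; PEs with the flag clear keep their value.
    for i in range(ROWS):
        for j in range(COLS):
            if flag[i][j]:
                dst[i][j] = -sum(s[i][j] for s in srcs)


def from_east(src, border=0.0):
    # Value each PE sees from its east neighbour (stand-in for a NEWS
    # transfer; the border value is an assumption).
    return [[src[i][j + 1] if j + 1 < COLS else border
             for j in range(COLS)] for i in range(ROWS)]


B = [[0.0, 0.0, 1.0, 1.0] for _ in range(ROWS)]   # image with a vertical step
A, D = plane(), plane()
all_on = plane(1)
lower_rows = [[1 if i > 0 else 0] * COLS for i in range(ROWS)]

write_sum(A, [B], all_on)                    # A <- B, so A = -B
write_sum(D, [A, from_east(B)], lower_rows)  # D = B - B_east where flag set
```

The second instruction leaves a non-zero response in D at the step in B (and at the array border), while row 0 is untouched because its flag is clear.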

The functional example code (running a locally adaptive threshold algorithm) shown in Table 1 illustrates the range of constructs available to the programmer:

Table 1 SCAMP4 vision chip code for adaptive thresholding

5 Implementation and measurement results

The chip was implemented in a 6-metal 1-poly 0.18 μm CMOS technology. The IC (photograph and floorplan in Fig. 11) has 64 × 20 pixels on a chip of size 3.05 × 1.525 mm. The PE floorplan contains a pixel with an approximate fill factor of 6 %; analogue and digital registers occupy 41 and 15 % of PE space respectively.

Fig. 11 a Chip photograph, with floorplan of PE in (b)

The array is designed to operate at 10 MHz with an analogue supply voltage of 1.5 V and digital supply voltage of 1.8 V. The IC has a total quiescent power consumption (i.e. when not executing any instructions) of 31 μW, and a peak power consumption of ~25 mW.

5.1 Analogue performance

Transferring data from one analogue register (B) to another (A) has errors associated with the operation (unlike its digital counterpart), the most significant generally being the signal dependent error. This error can be simply modelled as:

$$A_{i,j} = - B_{i,j} - \kappa B_{i,j}$$
(13)

where κ is a global constant describing this error. For algorithms involving multiple register interchanges, this error term can limit the complexity of implementable algorithms, and is an important parameter in discrete-time analogue processors. Since the native register copy operation is inverting, we generally compare a register’s contents before and after a copy-out followed by a copy-back operation.
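A short sketch shows how the model of Eq. (13) behaves over a copy-out and copy-back pair. The value of `KAPPA` is hypothetical and is not fitted to the measurements reported below; the sketch only illustrates that the two inversions cancel while the gain errors compound.

```python
KAPPA = 0.01   # hypothetical signal-dependent error coefficient


def transfer(value, kappa=KAPPA):
    # Eq. (13): inverting copy with a gain error of (1 + kappa)
    return -value - kappa * value


def round_trip_error(value, kappa=KAPPA):
    # Copy out and back; the inversions cancel, the gain errors do not.
    return transfer(transfer(value, kappa), kappa) - value


full_scale = 3.5e-6                       # register range limit in A
err_pos = round_trip_error(full_scale)    # ((1 + kappa)^2 - 1) * value
err_neg = round_trip_error(-full_scale)
```

For small κ the round-trip error is approximately 2κ times the stored value, so it grows linearly with signal amplitude and changes sign with it.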

On measurement of this error after the two transfers with the error correction circuit disabled, it was found that the peak-to-peak signal dependent error across the full scale register range (−3.5 to 3.5 μA) was 2.5 % (see Fig. 12). Using the error correction circuit with a drive potential of 1.4 V, this error could be shifted by 1.8 % of full scale. Unfortunately, a design error deployed a signal of inverted polarity and produced a shift in the direction opposite to that required; for a follow-on chip this will clearly only require a minor design change to provide very low register errors.

Fig. 12 Register transfer error after two register transfers averaged across array

An important facet of analogue memories (unlike their digital counterparts) is that a value stored in a register will differ from the value eventually read out. This is due to the effects of transistor leakage (finite off-state current). It is of particular importance for algorithms, such as motion detection, that are required to store information between video frames, since degradation increases with time. A register can store a current between −4 and +4 μA, with the registers always gravitating towards a +4 μA stored current irrespective of initial value. The rate of degradation was found to be dependent on stored value, as represented in Fig. 13.

Fig. 13 Average decay of registers’ stored current over time from initial stored currents of 3.74 μA (at 21 °C and 220 lux lighting)

Registers decayed at a worst-case rate of 5 % of full-scale range per second (cf. 4 % over 20 ms in [12]), while the vision chip imaged an office scene at 220 lux; with high illumination levels of 70 klux, decay rate increases by a factor of ~4. Over a typical frame interval of 50 ms, this decay rate has a negligible effect.
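The claim of negligible decay over one frame can be put into numbers with a rough estimate. The sketch assumes a constant (linear) decay at the quoted worst-case rate, which is a simplification since the measured rate depends on the stored value, and it uses the 8-bit equivalent register resolution mentioned in Sect. 4 as a yardstick.

```python
DECAY_PER_S = 0.05        # worst-case decay, fraction of full scale per second
FRAME_INTERVAL_S = 0.050  # frame interval quoted in the text
LSB_8BIT = 1 / 256        # one LSB of an 8-bit equivalent register

drift_office = DECAY_PER_S * FRAME_INTERVAL_S   # at 220 lux
drift_bright = 4 * drift_office                 # at 70 klux, ~4x faster decay

drift_office_lsb = drift_office / LSB_8BIT
drift_bright_lsb = drift_bright / LSB_8BIT
```

Under these assumptions the drift over 50 ms is about 0.25 % of full scale (below one 8-bit LSB) in office lighting, rising to about 1 % (between two and three LSBs) at 70 klux.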

To test the diffusion network, an image was captured and stored in a register. The diffusion network was then defined in North–South and East–West directions without any inhibiting mask, with minimum resistance paths between PEs. A single register was then copied to another. The diffusion of a letter “U” is shown in Fig. 14; the diffused image is shown inverted for comparison (reversing the inverting effect of register transfer).

Fig. 14 Spatial diffusion of a letter “U” (single clock cycle operation)

The photo-response non-uniformity (PRNU) of the imaging system (as a percentage of full-scale range) over 80 % of register range was 1.6 and 1.5 % rms for PEs containing N-well photodiodes and N+/P-subst photodiodes respectively. While the N-well photodiode array has slightly higher PRNU, this array provides 3× higher sensitivity than the N+/P-subst photodiodes, through improved responsivity and reduced photodiode capacitance. For the N-well sensors, with the sensor fitted with an F1.4 lens, light sensitivity can be varied by adjustment of Pbias (see Fig. 7) from 140 to 220 LSBs \((\text{lux}\cdot\text{s})^{-1}\) (with 8-bit digitisation). The fill factor is 6.5 % for the N+/P-subst diode and 5.7 % for the N-well diode.

For evaluation of the squarer circuit, a value stored in the “IN” register was squared and both inputs and outputs recorded when swept over the 1-quadrant input range. The results from the first 10 pixels (for reasons of clarity), along with the ideal response of the system and RMS error are shown in Fig. 15; the error bars indicate the standard deviation of the mean error. Transistor mismatch is a significant contributor to these variations in squarer response.

Fig. 15 Operation of squarer for 10 pixels (with ideal response for comparison)

5.2 Digital performance

Logical operations and asynchronous processing have been found to proceed as designed. Data propagation from cell to cell within the asynchronous network progresses in a time of 16 ns, or uni-directionally across the entire array in 1 μs.
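As a rough consistency check of these two figures, assuming that the uni-directional figure refers to the 64-cell dimension of the array and that every cell contributes the same 16 ns delay:

$$t_{array} \approx 64 \times 16\,{\text{ns}} = 1.024\,\upmu{\text{s}}$$

which agrees with the quoted 1 μs and corresponds to roughly ten clock periods at the 10 MHz design frequency.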

The digital system within the IC only allows the short-term storage of data bits in DRAM; due to leakage of the minimum-sized write transistor, a refresh is required every 50 μs. This is expected to be improved for a successor IC.

5.3 Code example

The results of the algorithm in Table 1 are shown in Fig. 16. The source figure is an elongated “U” with a varying luminosity of the character, and its background, along its length. A simple binary threshold would result in only partial recovery of the information. With the diffusion network used to create a local average of nearby pixel values, an adaptive threshold is created that can then be applied; the local average (with a global offset) is used to set the threshold that is applied to each individual pixel.
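
The adaptive threshold described above can be sketched in software as follows. This is not the APRON code of Table 1: the on-chip resistive diffusion network, which settles in a single operation, is approximated here by repeated 4-neighbour averaging, and the function names, the iteration count and the sign convention of the offset are all assumptions made for illustration.

```python
def diffuse(img, iterations=10):
    """Approximate the diffusion network by repeated 4-neighbour averaging.

    `img` is a list of equal-length rows of numbers. Edge pixels reuse
    their own value in place of missing neighbours (an assumption).
    """
    h, w = len(img), len(img[0])
    cur = [row[:] for row in img]
    for _ in range(iterations):
        nxt = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                c = cur[y][x]
                north = cur[y - 1][x] if y > 0 else c
                south = cur[y + 1][x] if y < h - 1 else c
                west = cur[y][x - 1] if x > 0 else c
                east = cur[y][x + 1] if x < w - 1 else c
                nxt[y][x] = (c + north + south + west + east) / 5.0
        cur = nxt
    return cur

def adaptive_threshold(img, offset=0.0, iterations=10):
    """Return 1 where a pixel exceeds its local average (plus a global offset)."""
    local_avg = diffuse(img, iterations)
    return [[1 if p - a + offset > 0 else 0
             for p, a in zip(img_row, avg_row)]
            for img_row, avg_row in zip(img, local_avg)]

# A single bright pixel on a flat background is the only pixel above
# its own local average, so it is the only pixel set in the result.
spot = [[10] * 5 for _ in range(5)]
spot[2][2] = 200
for row in adaptive_threshold(spot):
    print(row)
```

Because each pixel is compared with an average of its own neighbourhood rather than with one global level, a slow change in luminosity along the character shifts the threshold with it; the global offset then sets how far a pixel must differ from its surroundings before it is marked.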

Fig. 16

a Input object, b acquired image stored in Register B, c register C—diffused image, d register D—the source register for the thresholding operation—difference between input image and diffused image (with offset), e Register R4—threshold result. Images (b) and (c) are shown inverted for clarity

6 Example application

The completed system combines the presented vision chip, an instruction delivery and control system (synthesised upon a Spartan-3 FPGA) and a communications interface used to upload algorithms and to receive the data that the vision chip is programmed to return. Algorithms are developed and compiled using the APRON compiler [30]. Returned data is viewed for analysis and inspection using a custom viewer program running on a host computer. We wanted to demonstrate that, unlike many high-speed imaging systems, this system not only acquires images at high frame rates but also processes them in real time, on the focal plane, to determine a non-trivial metric. In this example application we wish to discriminate between open and closed binary shapes, i.e. ‘O’ versus ‘U’, outputting from the vision sensor only the processed result. If a closed shape lies within a pre-defined area of the focal plane, the vision chip system returns ‘true’ to the host, otherwise ‘false’. This occurs at a frame rate of 30,000 fps, limited by the rate at which instructions can be issued by the system controller.

The test apparatus (Fig. 17a) consists of a wheel spinning at high speed. The circumference of the wheel is labelled with three open and three closed shapes, and an extended “filled in” area which is used as a trigger (Fig. 17b). The algorithm consists of two stages. First, as the vision chip starts processing a new series of frames it waits until its entire view is white, then it waits until this is not the case. This transition is used to trigger the second stage of the algorithm which is the discrimination of open and closed shapes. A trigger is necessary to stabilise the visualisation of the returned data (akin to an oscilloscope trace), shown diagrammatically in Fig. 17(c).
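
The two-stage trigger described above amounts to a small state machine. The following sketch assumes that each frame has already been reduced to a single flag stating whether the entire view is white; the function name and the flag representation are illustrative and do not reflect the on-chip implementation.

```python
def find_trigger(all_white_flags):
    """Return the index of the frame at which the trigger fires, or None.

    Stage 1: wait until a frame is entirely white.
    Stage 2: fire on the first later frame that is not entirely white.
    """
    seen_white = False
    for index, all_white in enumerate(all_white_flags):
        if all_white:
            seen_white = True
        elif seen_white:
            return index
    return None

# Frames 0-1: view not yet white; frames 2-3: entirely white;
# frame 4: first non-white frame after the white ones, so it triggers.
print(find_trigger([False, False, True, True, False, False]))  # 4
```

Waiting for the all-white state first ensures that processing always starts at the same point on the wheel, regardless of where the wheel was when the frame series began.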

Fig. 17

a Detection of open and closed shapes from a strip (b) with user output in (c)

The algorithm for detecting open and closed shapes exploits the asynchronous propagation functions of the vision chip and is illustrated in Fig. 18. It consists of the following steps: capture the image and define a region of interest around the object by repeatedly shifting in ‘1’ from all four boundaries, isolating the region containing the spinning wheel (Fig. 18a); threshold the image so that the background is ‘1’ and shapes are ‘0’, and load it into the propagation control register R11 (Fig. 18b); run the asynchronous propagation from the perimeter of the box (shown for an “O”), which is inhibited by the shape and so does not enter the inside of the “O”, effectively flood-filling up to where R11 = ‘0’, and store the result in R12 (Fig. 18c); perform the array-wide logic operation R4 = (!R12)·R11 (Fig. 18d). A global readout operation is then performed, in which the output is the OR of all elements of R4. If the object is closed, the flood-fill propagation does not enter the ‘hole’ and the returned result is true; if the shape is open, as shown in Figs. 18(e), (f), the global OR result is false.
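
The register-level steps above can be mirrored in software. The sketch below is a sequential emulation, not the chip's instruction sequence: the asynchronous propagation is replaced by a breadth-first flood fill, propagation is assumed to be 4-connected (North, South, East, West), and the region-of-interest step is omitted. The variable names r11, r12 and r4 follow the register names in the text; the function name is hypothetical.

```python
from collections import deque

def contains_closed_shape(r11):
    """Return True if the binary image holds a closed shape.

    `r11` is a list of rows of 0/1 values: background = 1, shape = 0.
    """
    h, w = len(r11), len(r11[0])
    r12 = [[0] * w for _ in range(h)]
    queue = deque()

    # Seed the propagation from every background pixel on the perimeter.
    for y in range(h):
        for x in range(w):
            on_edge = y in (0, h - 1) or x in (0, w - 1)
            if on_edge and r11[y][x] == 1:
                r12[y][x] = 1
                queue.append((y, x))

    # Flood fill through background pixels; shape pixels (0) block the wave.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                if r11[ny][nx] == 1 and r12[ny][nx] == 0:
                    r12[ny][nx] = 1
                    queue.append((ny, nx))

    # R4 = (!R12) AND R11: background pixels the wave never reached.
    r4 = [[(1 - r12[y][x]) & r11[y][x] for x in range(w)] for y in range(h)]

    # Global OR over all elements of R4.
    return any(any(row) for row in r4)

shape_o = [[1, 1, 1, 1, 1],
           [1, 0, 0, 0, 1],
           [1, 0, 1, 0, 1],
           [1, 0, 0, 0, 1],
           [1, 1, 1, 1, 1]]
shape_u = [[1, 1, 1, 1, 1],
           [1, 0, 1, 0, 1],
           [1, 0, 1, 0, 1],
           [1, 0, 0, 0, 1],
           [1, 1, 1, 1, 1]]

print(contains_closed_shape(shape_o))  # True: the enclosed pixel is never reached
print(contains_closed_shape(shape_u))  # False: the wave enters through the opening
```

On the chip the fill is a single asynchronous operation across the array, whereas this emulation visits pixels one at a time; only the logical result is intended to match.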

Fig. 18

Sketch of image processing sequence for distinguishing open and closed shapes. See text for details

The wheel spins at 3,000 rpm on average; with a radius of 23 mm, the shapes move at 7.2 m/s. After triggering, the vision chip samples images at 30,000 fps, returning one bit of result data per frame (indicating the presence of a closed shape in that frame). To enhance the visualisation, the user can select a single sampling point at which an entire frame is returned, and can thus visually inspect whether the frame contains an open or closed shape and whether the waveform is in agreement. Since carrying out this work, we have demonstrated closed object detection at 100 kfps within a 256 × 256 array [6].
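
The quoted shape velocity follows directly from the rotation rate and wheel radius. The short calculation below reproduces it and also derives two figures not stated in the text, the displacement per frame and the number of frames per revolution, assuming constant rotation at the average rate.

```python
import math

RPM = 3000          # average rotation rate quoted in the text
RADIUS_M = 0.023    # wheel radius, 23 mm
FPS = 30000         # frame rate quoted in the text

revs_per_second = RPM / 60.0
speed_m_s = 2.0 * math.pi * RADIUS_M * revs_per_second

# Derived quantities (not stated in the text).
travel_per_frame_mm = speed_m_s / FPS * 1000.0
frames_per_rev = FPS / revs_per_second

print(f"rim speed:        {speed_m_s:.2f} m/s")
print(f"travel per frame: {travel_per_frame_mm:.3f} mm")
print(f"frames per rev:   {frames_per_rev:.0f}")
```

The rim speed evaluates to approximately 7.2 m/s, in agreement with the text; at 30,000 fps a shape therefore moves by roughly a quarter of a millimetre between consecutive frames.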

7 Conclusions

A prototype 20 × 64 vision sensor/processor chip has been tested and successfully applied to a high-speed imaging application operating at 30,000 fps. This has demonstrated the utility of asynchronous propagation and of the data reduction afforded by on-focal-plane processing, and has provided post-silicon validation of the new PE design. The follow-up vision chip implementation, scaled up to a larger array size [31], will enable high-speed and efficient implementation of image processing algorithms in practical machine vision applications.