The genetic algorithm census transform: evaluation of census windows of different size and level of sparseness through hardware in-the-loop training

Stereo correspondence is a well-established research topic and has spawned categories of algorithms combining several processing steps and strategies. One core part of stereo correspondence is determining the matching cost between the two images, or patches from the two images. Over the years several different cost metrics have been proposed, one being the Census Transform (CT). The CT is well proven for its robust matching, especially along object boundaries, with respect to outliers and radiometric differences. The CT also comes at a low computational cost and is suitable for hardware implementation. Two key developments of the CT are non-centric and sparse comparison schemas, which increase matching performance and/or save computational resources. Recent CT algorithms share both traits but are handcrafted, bounded with respect to symmetry and edge lengths, and defined for a specific window size. To overcome this, a Genetic Algorithm (GA) was applied to the CT, proposing the Genetic Algorithm Census Transform (GACT), to automatically derive comparison schemas from example data. In this paper, FPGA-based hardware acceleration of GACT has enabled evaluation of census windows of different size and shape, by significantly reducing the processing time associated with training. The experiments show that lateral GACT windows produce better matching accuracy and require fewer resources than square windows.


Introduction
With an ever-growing interest in intelligent and autonomous systems follow demands on perception, i.e. how to gather and extract meaningful information from different sensor modalities within a specified time frame. For humans, vision is the most central sense, and for autonomous agents acting in the real world, image sensors are key, as they are low cost and versatile. Through computer vision, application-relevant information can be extracted from images, such as color, features, objects, size, and depth/distance. Extracting 3D information from images either depends on a priori information or requires multiple images from different viewpoints. This can be accomplished by moving a single camera, but in an unknown and constantly changing environment, a stereo camera, with two horizontally displaced and synchronized cameras, is much preferable, as the image displacement is known (static) and there is no interference between ego-motion and object motion. This is referred to as binocular, or two-frame, stereo. However, to find depth information the correspondence problem must be solved, that is, correspondence must be established between pixels, or regions, from one image to the other. Assuming rectified stereo images, the displacement between corresponding pixels, or disparity, will be limited to the horizontal axis. The larger the disparity, the closer the object. Knowing the disparity, d, and the stereo camera parameters, depth, z, is given by the following equation:
$$z = \frac{fB}{d} \tag{1}$$
where f is the focal length and B is the baseline between the two cameras.
For resource limited systems, even though the state-of-the-art matching performance of CNN based solutions is promising, they do not provide a viable option as of yet. Following these restrictions, it still makes sense to improve upon pre-deep-learning algorithms. The CT is based on relative intensities between a center pixel and surrounding pixels. Within the local neighborhood a pixel is represented by 1 if it is of lower intensity, otherwise 0.
The bits are concatenated in some canonical ordering, forming a bit-string, referred to as the census value for the center pixel. Similarity is given by the Hamming distance. Contrary to statistical cost metrics, where it is assumed local pixels belong to the same distribution, the CT tolerates factionalism, providing good matching performance along object boundaries and robustness towards outliers. The CT is invariant under changes in gain and bias [45]. The CT is low-cost and suitable for hardware implementation [3,4,6,7,10,19,29,31,34,35,36,42], often in combination with Semi-Global Matching (SGM) [15].
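As a brief worked illustration of the depth relation in Eq. 1, the following minimal sketch converts a disparity to a depth. The function name and the numeric camera parameters are hypothetical and chosen only to make the arithmetic easy to follow; they do not describe any particular camera.

```python
def depth_from_disparity(d, focal_px, baseline_m):
    """Depth z = f * B / d for a rectified stereo pair.

    d          : disparity in pixels (must be positive)
    focal_px   : focal length in pixels (hypothetical value in the example)
    baseline_m : camera baseline in metres (hypothetical value in the example)
    """
    if d <= 0:
        raise ValueError("disparity must be positive to yield a finite depth")
    return focal_px * baseline_m / d


# Larger disparity means a closer object.
near = depth_from_disparity(64, focal_px=720.0, baseline_m=0.5)
far = depth_from_disparity(8, focal_px=720.0, baseline_m=0.5)
```

Since depth is inversely proportional to disparity, a one-pixel disparity error matters far more for distant objects than for close ones.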
Over the years, research on the CT has led to two key developments: (1) sparse CTs, where only selected neighborhood pixels are used for comparison; this requires fewer comparisons, which can be distributed over larger areas, and results in shorter bit-strings, saving resources for the subsequent matching [4,6,19,31]; and (2) non-centric comparison schemas, where arbitrary neighborhood pixels are connected for comparison by edges, according to predefined patterns, depending less (or not at all) on the center pixel, resulting in better matching accuracy and robustness towards noise [10,23,40]. However, the handcrafted CT methods are bound with respect to symmetry, edge length or neighborhood.
In previous work [2], it was established that optimizing the census comparison schema using a Genetic Algorithm (GA), referred to as GACT, led to higher matching accuracy and/or lower resource requirements than established CT methods. It was also concluded that the outcome was highly dependent on the training data (KITTI vs Middlebury) and that the CT should benefit from a larger neighborhood (KITTI).
In this paper, thanks to a new hardware-in-the-loop implementation, the GA training process is accelerated by a factor of 30, enabling evaluation of GACT windows of different size and shape. The experiments show that: (a) the GACT window size has a big effect on matching performance; (b) as GACT is defined for a number of edges (sparse), a change of the window size does not come at a great cost (assuming a trained mask), as the subsequent stereo matching only depends on the produced bit-string; (c) lateral windows make for improved matching and save resources; and (d) GACT is suitable for FPGA implementation.
The remainder of this paper is arranged as follows. First related works with respect to the CT are presented followed by an introduction to GA. Then the experimental setup is described in terms of parameters, dataset and implementation. The experimental results are presented and discussed before the paper is concluded by some final remarks.

Related work
The Census Transform (CT) [45] is a non-parametric local transform, meaning that a pixel value is replaced by a value based on intensity ordering within a local neighborhood, in this case a bit-string where a bit is set to 1 if the corresponding neighborhood pixel is of lower intensity, otherwise 0. Two census strings are compared using the Hamming distance. The comparison schema of CT is shown in Fig. 1a. CT can be formalized by the following equations:
$$\xi(p, p') = \begin{cases} 1, & p' < p \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
where p, p′ represent pixel intensities, and
$$C(p) = \bigotimes_{p' \in N(p)} \xi(p, p') \tag{3}$$
where ⊗ represents the concatenation operation and N(p) defines the neighborhood around a pixel p.
The similarity of two pixels is given by the Hamming distance between the bit-strings:
$$d(p_1, p_2) = \mathrm{Hamming}\big(C(p_1), C(p_2)\big) \tag{4}$$
which is also referred to as the matching cost. The CT relies heavily on the neighborhood center pixel, which makes it sensitive to noise. The modified CT [11] is instead based around the neighborhood mean intensity, p̄, leading to an update of Eq. 3 to
$$C(p) = \bigotimes_{p' \in N(p)} \xi(\bar{p}, p') \tag{5}$$
The authors conclude that this increases robustness with respect to noise, as well as the ability to capture the image structure, which is important for classification and matching. However, this comes at the cost of a higher computational complexity. This method will be referred to as MeanCT.
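The original CT and its Hamming matching cost can be illustrated with a minimal NumPy sketch. It assumes a row-major bit ordering that skips the center pixel and leaves border pixels at zero; both choices are assumptions made for illustration and are not taken from the cited works.

```python
import numpy as np


def census_transform(img, rows=5, cols=5):
    """Census bit-string (as uint64) for every pixel with a full neighborhood.

    A bit is 1 when the neighbor is of lower intensity than the center.
    Bit order is row-major over the window, skipping the center (assumed).
    """
    img = np.asarray(img, dtype=np.int32)
    ry, rx = rows // 2, cols // 2
    h, w = img.shape
    census = np.zeros((h, w), dtype=np.uint64)
    center = img[ry:h - ry, rx:w - rx]
    acc = np.zeros(center.shape, dtype=np.uint64)
    for dy in range(-ry, ry + 1):
        for dx in range(-rx, rx + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = img[ry + dy:h - ry + dy, rx + dx:w - rx + dx]
            bit = (neighbor < center).astype(np.uint64)
            acc = (acc << np.uint64(1)) | bit  # concatenate one more bit
    census[ry:h - ry, rx:w - rx] = acc
    return census


def hamming(a, b):
    """Number of differing bits between two census values."""
    return bin(int(a) ^ int(b)).count("1")


patch = np.array([[1, 9, 1],
                  [9, 5, 9],
                  [1, 9, 1]])
code = census_transform(patch, rows=3, cols=3)[1, 1]  # 0b10100101
```

Because only the intensity ordering is used, applying a positive gain and a bias to the patch (for example `2 * patch + 10`) leaves the census value unchanged.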
For the Sparse CT [4,19], a subset of pixels within the neighborhood is selected for CT. The sparse factor, S = n², defines the sampling rate over the CT window, e.g. for sparse factor 16, 1 in 16 (4 × 4) pixels is selected for evaluation. The authors show that Sparse CT improves on the CT, given the same number of sample points, i.e., a larger receptive field is more beneficial than a higher resolution. However, an increased sparse factor does imply larger neighborhoods and consequently higher buffering costs, not to be underestimated for resource limited systems, for which the method was originally intended. Within this context, a stronger argument for the sparse CT is that, up to a sparse factor of 16, it shows only a marginal drop in accuracy compared to the CT of the same neighborhood size. Figure 1b shows sparse CT with a sparse factor of 2, hereafter referred to as Sparse8, as there are 8 sample points. This is extended to a full checkerboard pattern, by adding 4 diagonal sample points, referred to as Sparse12. Note that Sparse8 and Sparse12 are not consistent with the original definition of sparse CT, but are adapted to fit within a 5 × 5 neighborhood.
Mini-census (MCT) [6] is also a sparse CT targeting resource limited systems. MCT is defined by just 6 sample points, in a 5 × 5 window, Fig. 1c, to reduce calculation cost and memory resources. A similar example is the Retina CT (RCT) [31], Fig. 1d, with an 8-point circular pattern, inspired by the human retina.
An evolution of the MeanCT, hereafter referred to as Quaternion CT (QCT) [26], makes use of both the center pixel and the neighborhood mean intensity, thereby extending Eq. 2 so that each comparison results in a quaternion, and consequently in bit-strings of twice the length. This shows an accuracy improvement (for stereo matching) in the Middlebury evaluation, but according to the authors the benefit should be greater for real world images.
Fig. 1 a Shows the original CT, for a 5 × 5 neighborhood, and b-d show different sparsity schemas, discarding greyed out pixels, to save computational resources
The Generalized Census Transform (GCT) [10] is a family of defined masks, in a 5 × 5 neighborhood, with different levels of sparsity. The comparison schema is defined by a set of coordinate pairs, (c_i, c′_i), where each pair is connected through what is referred to as an edge. The key difference to previous work is that neither c_i nor c′_i refers to the center pixel. Figure 2 shows the 8 and 16 edge GCT. To define GCT, Eq. 3 is updated (Eq. 7) so that the concatenation runs over the edges, comparing the intensities at c_i and c′_i, instead of over the neighborhood pixels compared against the center. One advantage of GCT is that a 5 × 5 GCT is comparable to a 7 × 7 sparse CT and hence the number of line buffers can be reduced.
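The edge-based comparison used by GCT, and by the non-centric schemas that follow, can be sketched as below. This is a minimal illustration under stated assumptions: window coordinates are zero-based here, a bit is set when the end point is of lower intensity than the start point, and the output covers only positions where the full window fits. None of these conventions are taken from the cited works.

```python
import numpy as np


def edge_census(img, edges, rows, cols):
    """One bit per edge: 1 if intensity at the end point < intensity at the start.

    edges : list of ((r1, c1), (r2, c2)), zero-based coordinates inside a
            rows x cols window (assumed convention).
    Returns an array of shape (H - rows + 1, W - cols + 1) of uint64 codes,
    where element [y, x] belongs to the window whose top-left corner is (y, x).
    """
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    oh, ow = h - rows + 1, w - cols + 1
    out = np.zeros((oh, ow), dtype=np.uint64)
    for (r1, c1), (r2, c2) in edges:
        start = img[r1:r1 + oh, c1:c1 + ow]
        end = img[r2:r2 + oh, c2:c2 + ow]
        out = (out << np.uint64(1)) | (end < start).astype(np.uint64)
    return out


window = np.arange(25).reshape(5, 5)
mask = [((0, 0), (4, 4)),   # dark corner to bright corner -> bit 0
        ((4, 4), (0, 0)),   # reversed edge               -> bit 1
        ((2, 1), (2, 1))]   # degenerate edge             -> always 0
code = edge_census(window, mask, rows=5, cols=5)
```

A degenerate edge, with both end points on the same pixel, always yields 0 and therefore never contributes to the Hamming distance between two census values.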
Presented at the same time as GCT, the Center-Symmetric CT (CSCT) [40] similarly compares pairs of pixels within the census window, albeit center-symmetric ones. Following the earlier notation, this can be expressed by letting c′_i be the reflection of c_i through the window center, see Fig. 3a. Contrary to GCT, the definition of CSCT extends to different window sizes, but not to different levels of sparsity. However, it will always produce bit-strings of half the size compared to the original CT, while considering all (but the center) pixels, although at the cost of a slight increase in error rate. A second contribution is the introduction of edge weights along the central rows and columns, as shown in Fig. 3b. This will be considered as adding edges but can, according to the authors, be implemented using lookup tables. With weights along the central rows (hwCSCT) or both central rows and columns (wCSCT), results were boosted past CT, while saving resources.
The Star Census Transform (SCT) [23] extends GCT by defining masks of symmetrical sequences of connected edges, of equal length, forming star-shaped scan-patterns around the center. Let (c′_1, c′_2, ..., c′_n) = (c_2, ..., c_n, c_1) and Eq. 7 holds true for SCT. The authors state that by evaluating masks with different numbers of sample points and edge lengths, candidates are found that improve on CT, MCT and GCT, particularly on CT and MCT with respect to noise, whether Gaussian or impulse. Figure 4 shows the best SCT for 8, 16 and 24 edges.
The Adaptive Census Transform (ACT) [34] applies the principle of ADaptive Support Weights (ADSW) [21] in a census context. For ADSW, pixels within the aggregation window are weighted, based on center pixel similarity and proximity. For ACT the binary number is replaced by a weight, an integer defined by a hardware-friendly approximation function, exploiting intensity similarity (the proximity term is omitted as its contribution is limited for small aggregation windows). The census matching cost is then defined by SAD instead of the Hamming distance. This makes ACT expensive to implement, due to the weight function and the added complexity of SAD compared to Hamming. To describe ACT, Eq. 2 is updated to return a weight, w, in place of the binary value; w is defined in terms of c, the absolute intensity difference between p and p′, with parameters set as: c = 16, and p_0 to p_7 as 64, 48, 32, 32, 16, 16, 16, 16. The weight function was also adopted for cost aggregation (ADSW). The implementation was later extended [7], incorporating Support Local Binary Pattern (SLBP) [33] and sparse ACT windows. With SLBP a census vector is calculated with respect to each pixel in the census window, not only the center pixel, resulting in as many census strings as pixels, i.e., 9 vectors for a 3 × 3 window. To allow for larger census windows and to compensate for the additional complexity of SLBP, a sparse approach was adopted [7], where only pixels along the horizontal, vertical and diagonal lines intersecting the center pixel were included. Later a simplified implementation [35], eliminating the SLBP component, was presented for an embedded heterogeneous system based on the Xilinx Zynq SoC.
Adaptive window patterns for the CT [24] are based on the idea that uniform image regions require less complex census patterns than non-uniform ones. Hence, for a CPU implementation, the computational complexity can be reduced by applying different census transforms. A guidance mask, based on Canny edges, and the region intensity statistics (mean and variance) dictate the choice of an 8, 12 or 20 pixel pre-defined census mask in a 9 × 9 neighborhood. Similarly, a method based on adaptive census window size/shape has been proposed [20], where Sobel gradient images are used to select between square (3 × 3), portrait (11 × 3) or landscape (3 × 11) shaped census windows. Considering all pixels, the method performs slightly worse than the corresponding fixed window size, 11 × 11, but better along depth discontinuities. Another contribution [36] lets the Sobel image dictate whether to adopt a 5 × 5, 7 × 7 or 9 × 9 CT in an FPGA implementation. Here the aim is to increase accuracy, since on FPGAs all CT alternatives have to be processed in parallel, which invalidates the resource reduction argument.
As alternatives to the Hamming distance, the Tanimoto [37] and the Dixon-Koehler [8] distances achieve higher matching accuracy at the cost of increased complexity. The Tanimoto distance focuses on the ones of the matching bit-strings (1 − intersection/union). The Dixon-Koehler distance is the product of the normalized Hamming and Tanimoto distances. In a comparison study for FPGA implementation [42], a number of different CT window sizes, 5 × 5 to 23 × 23, and the Hamming, Tanimoto and Dixon-Koehler similarities were evaluated with respect to matching accuracy and resource requirements. It was concluded that a 13 × 13 Dixon-Koehler CT produced a better matching result than a 23 × 23 Hamming CT, at similar cost in terms of resources. However, while the real benefit of Dixon-Koehler is that it can reach a higher level of accuracy, its resources scale badly with window size.
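The three distances can be compared on small bit-strings with the following sketch, which follows the verbal definitions above. The handling of two all-zero strings (Tanimoto distance set to 0) and the function names are assumptions made for illustration.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def hamming_normalized(a, b, n_bits):
    """Hamming distance divided by the bit-string length."""
    return popcount(a ^ b) / n_bits


def tanimoto_distance(a, b):
    """1 - |intersection| / |union| over the set bits of a and b."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return 1.0 - popcount(a & b) / union


def dixon_koehler_distance(a, b, n_bits):
    """Product of the normalized Hamming and the Tanimoto distances."""
    return hamming_normalized(a, b, n_bits) * tanimoto_distance(a, b)


a, b = 0b1100, 0b1010
h = hamming_normalized(a, b, 4)      # 2 differing bits out of 4
t = tanimoto_distance(a, b)          # 1 shared one, 3 ones in the union
dk = dixon_koehler_distance(a, b, 4)
```

The Tanimoto and Dixon-Koehler distances both require a division per hypothesis, which is one reason they are more expensive than the Hamming distance in hardware.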
Finally, composite costs, where part of the cost consists of CT, are also adopted. These methods are out of scope here, as the focus is on stand-alone CT. AD-Census [29] combines a weighted sum over CT and color-based SAD, where the AD component provides good matching support for textured or slanted areas, while CT preserves edge information. The neighborhood for CT was set to 9 × 7, the largest possible of odd rows and columns, to produce strings that fit within 64-bit registers. The stand-alone contribution of this larger footprint CT, referred to as CT-7x9 because of the row-column notation, is interesting for comparison, especially since AD-Census has since been revised with GCT [23]. Another example of a composite cost is the combination of MeanCT over the images, MeanCT over gradient images and SAD [3].

Genetic algorithm
The Genetic Algorithm (GA) [17] is a population-based optimization method that belongs to the family of Evolutionary Algorithms [44]. It was proposed in the 1960s and was extended and popularized, in the context of optimization, in 1989 [12]. GA is based on the evolution of the species, in which new solutions are created from previous ones and only the stronger solutions survive. GA has been successfully applied in different research areas, such as stereo matching [13], real time systems [27] or neuroengineering [32].
GA maintains a population, X, in which each individual is a solution to the problem. The representation of an individual depends on the specifications of the problem. For the specific problem of GACT, each individual is represented by a CT mask, consisting of tuples of window coordinates, defining edges for pixel comparison. An individual i of population X can be described as
$$X_i = \{(x_1, x'_1), (x_2, x'_2), \ldots, (x_n, x'_n)\}$$
where x_e is the starting point of the eth edge, x′_e is the end point of the same edge and n is the number of edges. Additionally, each point is formed by 2 coordinates (r, c), where r ∈ {1, … , R} and c ∈ {1, … , C}, for a census window of size R × C.
The first step in GA is to initialize the population, which is formed by ps (population size) individuals, generated at random. Then follows the evolutionary process, generation by generation, where each iteration, or generation, is composed of four consecutive steps: parent selection, crossover, mutation and replacement. An example of one generation of GA, in a problem with 8 edges, is shown in Fig. 5. After one generation has terminated, a new one starts, until the maximum number of generations is reached.
Selection. The first step within one generation is to select parents to create the offspring. In this paper, two parents, P 1 and P 2 , are randomly selected, from the entire population, to create one offspring (O). This is repeated for a defined number of offspring each generation. Hence, a high diversity within the population can be maintained, minimizing the chance to suffer local optima stagnation.
Crossover. After selecting the two parents, the new offspring, O, is created. Many different options to perform crossover can be found in the literature [9,14]. In this paper, a uniform crossover method has been selected. In this method, the information of O is randomly selected from the two parents with equal probability, as described below:
$$O_e = \begin{cases} P_{1,e}, & rand < 0.5 \\ P_{2,e}, & \text{otherwise} \end{cases}$$
where e is the eth edge of either the offspring (O), the first parent (P_1) or the second parent (P_2). Additionally, rand is a random number ∈ [0, 1).
Mutation. Once the offspring is created, this is perturbed, to explore neighbor solutions. In order to apply this perturbation, a random position within an edge, either the start or the end point, is selected. Then, it will be replaced by a random position within the CT constraints.
Replacement. After mutation, the performance, also called fitness, of O is calculated (f(O)) and compared with the fitness of the worst parent (f(P_worst)). If f(O) is better than f(P_worst), then P_worst is replaced by O. If, on the contrary, f(O) is worse, then O is discarded. The fitness calculation is described in Sect. 4.1.
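The four steps can be combined into a compact sketch of the training loop. Several details are assumptions made for illustration rather than the authors' implementation: lower fitness is taken to be better, exactly one end point of one randomly chosen edge is mutated, replacement happens immediately after each offspring, the initial population counts towards the evaluation budget, and `toy_fitness` is a hypothetical stand-in for the stereo-matching error.

```python
import random


def random_point(rng, R, C):
    return (rng.randint(1, R), rng.randint(1, C))  # 1-based (r, c)


def random_mask(rng, n_edges, R, C):
    return [(random_point(rng, R, C), random_point(rng, R, C))
            for _ in range(n_edges)]


def uniform_crossover(rng, p1, p2):
    # Each edge is inherited from either parent with equal probability.
    return [e1 if rng.random() < 0.5 else e2 for e1, e2 in zip(p1, p2)]


def mutate(rng, mask, R, C):
    # Replace the start or the end point of one random edge (assumed scope).
    mask = list(mask)
    e = rng.randrange(len(mask))
    start, end = mask[e]
    if rng.random() < 0.5:
        start = random_point(rng, R, C)
    else:
        end = random_point(rng, R, C)
    mask[e] = (start, end)
    return mask


def run_ga(fitness, n_edges=8, R=5, C=5, pop_size=30, n_offspring=8,
           max_evals=6000, seed=0):
    rng = random.Random(seed)
    pop = [random_mask(rng, n_edges, R, C) for _ in range(pop_size)]
    fit = [fitness(m) for m in pop]
    evals = pop_size
    history = [min(fit)]
    while evals < max_evals:  # the budget is checked once per generation
        for _ in range(n_offspring):
            i, j = rng.sample(range(pop_size), 2)      # random parents
            child = mutate(rng, uniform_crossover(rng, pop[i], pop[j]), R, C)
            f_child = fitness(child)
            evals += 1
            worst = i if fit[i] >= fit[j] else j       # worse of the two parents
            if f_child < fit[worst]:
                pop[worst], fit[worst] = child, f_child
        history.append(min(fit))
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best], history


def toy_fitness(mask):
    # Hypothetical stand-in: rewards long edges (lower value is better).
    return -sum(abs(r1 - r2) + abs(c1 - c2) for (r1, c1), (r2, c2) in mask)


best_mask, best_fit, history = run_ga(toy_fitness, max_evals=600)
```

Because an offspring only ever replaces the worse of its two parents, and only when it is strictly better, the best fitness in the population never deteriorates from one generation to the next.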

Experimental setup
The experiment is set up as a two-phase process, separating training and evaluation between two processing platforms, as shown in Fig. 6. During training, GA is applied to find new GACT comparison schemas (or masks). This involves, for each candidate individual, transforming the input images in accordance with the mask and performing stereo matching, followed by an evaluation of the resulting disparity map with respect to the ground truth. The training phase is implemented on a Xilinx ZCU104, a processing platform combining a CPU and an FPGA, where the most arduous part of the process, the stereo matching, is hardware accelerated, significantly reducing the time for training. During the evaluation phase the GA derived mask is evaluated, on a larger dataset, using a MATLAB implementation, for consistency with previous work [2]. First, the parameters for the experiment will be presented, followed by a description of the implementation for the ZCU104.

Parameters, data
The GACT experiments are defined by the number of edges and window sizes:
• Number of edges: 8, 16 and 24.
• Square windows: 3 × 3, 5 × 5, 9 × 9, 15 × 15, 21 × 21.
• Lateral windows, following rows × columns notation.
In addition, lateral windows 3 × 7, 3 × 9, 3 × 29 and 5 × 29 are used to compare GACT to established CT methods. For GA the following parameters are used:
• population size = 30
• offspring size = 8
• max evaluation = 6000
GA is repeated 10 times for each CT setting and the median candidates are evaluated. The hyperparameters are carried over from preceding experiments [2]. For the current experiment, the search space has expanded, possibly warranting a larger number of evaluations. However, a consistent result when repeating the experiment suggests a satisfactory balance between exploration and exploitation, with a sufficient number of evaluations.
Through stereo matching, disparity maps for GACT masks are obtained. Here, the Hamming distance between census bit-strings for 9 × 9 aggregation windows gives the matching cost for a disparity hypothesis. The disparity range is set to [0, 255] for training (FPGA) and [0, 230] for evaluation (MATLAB). From the disparity hypotheses the best candidate is selected according to a winner takes all (WTA) strategy. No left-right consistency check (LRC), subpixel interpolation, propagation, refinement filters, etc., are applied.
Disparity maps, whether during the training or evaluation phase, are evaluated using the KITTI stereo evaluation. The KITTI 2015 benchmark [30] consists of training and evaluation datasets, with an associated ranking list. The KITTI dataset targets autonomous driving and the scenes represent real-world natural images with noise, reflections, challenging contrast, etc. Fig. 7 shows an example with ground truth. For the training set the stereo pairs are complemented by ground truth images, enabling supervised learning. As KITTI evaluation is to be performed once, and associated with a single publication, the experiments are carried out on the training set until a more complete stereo framework around GACT is finalized. This dataset is split into local training and evaluation subsets. To limit training time, and as GACT works well with a relatively small amount of training data, 5 training scenes were selected at random (seq. no. 39, 101, 3, 166, 40), leaving 195 for evaluation. A disparity is considered correct if the endpoint error is < 3 px or < 5 %. During training, only non-occluded (NOC) pixels are considered. For evaluation, occluded (OCC) results are also presented.
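The error criterion can be sketched as a small function returning the fraction of erroneous pixels, which is the kind of quantity a GA fitness can minimize. The sketch assumes that ground truth values of 0 mark pixels without ground truth, and the function name and default thresholds are chosen for illustration, following the criterion stated above.

```python
import numpy as np


def bad_pixel_rate(disp, gt, abs_thr=3.0, rel_thr=0.05):
    """Fraction of valid pixels whose disparity is counted as wrong.

    A pixel is correct if its endpoint error is < abs_thr pixels OR
    < rel_thr of the true disparity; otherwise it is counted as bad.
    Pixels with gt == 0 are treated as having no ground truth (assumption).
    """
    disp = np.asarray(disp, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    err = np.abs(disp - gt)
    bad = valid & (err >= abs_thr) & (err >= rel_thr * gt)
    n_valid = int(valid.sum())
    return float(bad.sum()) / n_valid if n_valid else 0.0


gt = np.array([[10.0, 100.0, 0.0, 50.0]])
disp = np.array([[14.0, 104.0, 7.0, 50.0]])
rate = bad_pixel_rate(disp, gt)
```

In the example, the first pixel is wrong (4 px off, which is also more than 5 % of 10), the second is accepted by the relative threshold (4 px is less than 5 % of 100), the third has no ground truth, and the fourth is exact, giving one bad pixel out of three valid ones.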

Implementation/Processing platform
To be able to extend the experiments performed in previous work [2] to multiple window sizes and different shapes, the time required for training had to be considerably reduced, even though parallel for-loops, utilizing 8 cores, one per offspring, had already been adopted. As the intended future processing platform for GACT is FPGAs, more specifically GIMME2 [1], this platform can also serve as an accelerator during training. However, the non-deterministic data pattern of GACT individuals, proposed by GA, in combination with large windows and disparity range, left GIMME2 short on resources. Instead the more powerful Xilinx ZCU104 was used for the experiments. Both share the Zynq SoC platform and the source/IP cores can be reused between them.
The most computationally expensive, and hence time consuming, part of training is the stereo matching, followed by the census transform. These are implemented on the FPGA in a pipelined design, benefiting from the parallel processing capabilities of the FPGA. GA itself is low-cost. The GA indices are randomly selected, implying a non-deterministic data-access pattern, which is not suitable for FPGA implementation. Evaluation involves division, which is costly on the FPGA. Since neither GA nor evaluation is part of the stereo algorithm, there is no motivation for implementing them on GIMME2.
In accordance with the Xilinx Vivado design flow, the implementation will be described starting from the FPGA side, also referred to as the Programmable Logic (PL), followed by the CPU side, or Processing System (PS). This is contrary to the processing flow of training, which is controlled by the PS application. Returning to Fig. 6, the remainder of the section will focus on the left box, the training phase, and first on the FPGA part, the gray block.

FPGA/PL
At a high abstraction level the FPGA side of the design can be divided into three components: block design, GACT and stereo matching, as shown in Fig. 8. Block design is part of the Zynq design flow and specifies system properties and the PL/PS interface, here set up for exchange of images (left, right and disparity) and GACT masks (as dictated by GA). Two GACT components (left/right) convert intensity images to images of census bit-strings, according to the GACT coordinates. Stereo matching calculates the disparity, using the Hamming distance. GACT and stereo matching are the core components, both relying on a fourth component, the sliding window. A sliding window approach is necessary as, on the FPGA, there are only enough resources to process a small part of the image at a time. Following this approach, the image representation is changed to a data stream, where a new pixel is presented every clock cycle. Below follows a more detailed description of the components.
The block design comprises three IP cores: processing system, VDMA and census register. In the processing system the I/O, clocks, memory and PS-PL interface settings for the Zynq system are configured. The ZCU104 evaluation board can be selected as target device, with a preset processing-system configuration with respect to the hardware. More interesting are design dependent configurations, such as the PS-PL interface, here with two high performance slaves, for image data transfer, and one master, for IP register control.
The VDMA IPs from Xilinx provide high performance memory mapped channels to enable data streaming between the FPGA fabric and the system memory. In this design the read and write channels are separated into two VDMA components. The read channel, VDMA1, handles a 24-bit stream (non-optimal), divided into 8 bits each of left and right pixel intensities and 8 bits of (natural) ground truth (not currently used). The write channel, VDMA0, is set up as an 8-bit stream to encode 256 levels of disparity. The VDMA cores are controlled and monitored by writing and reading register values. The register space is accessed (from the PS) over the AXI4-Lite master interface.
The census register IP is an AXI4 peripheral, with the straightforward objective of setting up a shared memory area between the PS and PL for GACT control. The census register memory area holds 25 32-bit registers: one control register and 24 edge registers. An edge is represented by two points, a start and an end point, each having two coordinates. For this application directionality is of marginal importance and is hence neglected. Each coordinate is encoded by 8 bits to fit an edge into a 32-bit register. Census registers are mapped as signals to the GACT component.
Sliding Window
The function of the sliding window is to buffer data, to provide a small image patch from an image data stream. Both the GACT and stereo matching blocks are window based and incorporate this component. The sliding window has generic parameters to be able to cope with different window sizes, data widths, and image sizes. For this experiment different window sizes and data widths are used by the GACT and stereo matching components. The sliding window requires win_height − 1 row buffers. As there is a large amount of data to be stored, the buffers are placed in the FPGA block RAM, which is on chip but consists of dedicated memory circuits rather than logic resources. The buffers are controlled by two indices, addr_i for the horizontal position and line_i for the vertical. For every valid input data, the row buffers are read at the current address, addr_i. The resulting data column is synchronized with the input and put in a shift register with the width of the window. The input data is written at addr_i to the oldest row buffer, line_i, and the address index is incremented. When reaching an end-of-line or start-of-frame signal, the address index is reset and the row index is cycled. Hence, every row buffer address is written to once but read win_height times until the window overlaps. The sliding window block is shown to the upper left, embedded in the GACT block, in Fig. 8.
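The row-buffer scheme can be mimicked in software to make the data flow concrete. The following is a behavioral sketch only, not a model of the actual hardware: one pixel is consumed per loop iteration, `addr_i` and `line_i` follow the description above, and the choice to clear the shift register at each end-of-line (so that windows never wrap across image rows) is an assumption.

```python
from collections import deque


def sliding_windows(pixels, width, win_h, win_w):
    """Yield (row, col, window) for each fully valid window in a pixel stream.

    (row, col) is the position of the newest pixel, i.e. the bottom-right
    corner of the window. Uses win_h - 1 row buffers, as in the text.
    """
    if win_h < 2:
        raise ValueError("this sketch needs at least one row buffer")
    n_buf = win_h - 1
    row_buf = [[0] * width for _ in range(n_buf)]  # block RAM in hardware
    shift = deque(maxlen=win_w)                    # shift register of columns
    addr_i, line_i, row = 0, 0, 0
    for px in pixels:
        # Read all row buffers at addr_i, oldest first, then append the input.
        column = [row_buf[(line_i + k) % n_buf][addr_i] for k in range(n_buf)]
        column.append(px)
        row_buf[line_i][addr_i] = px               # overwrite the oldest row
        shift.append(column)
        if row >= n_buf and len(shift) == win_w:
            yield row, addr_i, [[col[r] for col in shift] for r in range(win_h)]
        addr_i += 1
        if addr_i == width:                        # end of line
            addr_i = 0
            line_i = (line_i + 1) % n_buf          # cycle the row index
            row += 1
            shift.clear()


# A 4 x 5 image streamed row by row, with a 3 x 3 window.
windows = list(sliding_windows(range(20), width=5, win_h=3, win_w=3))
```

The cyclic `line_i` index means that pixel data is never moved between buffers; only the read order changes from row to row.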
GACT
Two parallel GACT components are instantiated, to handle the left and right images, respectively. Provided the sliding window component and the GACT coordinates, the implementation is straightforward, as can be seen in Fig. 8. Each position in the census bit-string is set by comparing two window coordinates, connected by an edge. The current implementation supports 24 edges. Any lower number of edges can be used, as unused edges by default point to the same coordinate at both ends and are hence self-cancelling. Similarly, the same circuit can be used for larger and smaller windows by restricting the coordinate indices. This is, however, controlled by the GA on the PS side. The input data width is 8 bits, and the output is 24 bits.

Block Match
The stereo correspondence is calculated using block matching. For each pixel in the reference image, a small image patch around the pixel is extracted and compared for similarity against patches from the target image, over a range of horizontal offsets, the disparity range. The image patches are referred to as aggregation windows, here implemented by two 9 × 9 sliding window components, one for each 24-bit census stream. The output of the right window is put into a shift register with the width of the disparity range; in this design 256 disparity hypotheses are evaluated. Similarity, for census transformed images, is defined by the Hamming distance and is realized by a separate component. 256 parallel Hamming components calculate the similarity for the current position in the reference image and an offset (delay) of 0 to 255 pixels in the target image. As the Hamming distance is calculated over the aggregation window, a two clock-cycle approach is adopted, starting with the column sums (vertical), followed by horizontal aggregation. From the 256 hypotheses the best match is to be found, along with its offset. This is implemented as a tree-like tournament, of different branching factors, over 3 clock cycles. The winner, the patch with the smallest distance, has a disparity of the corresponding offset, an 8-bit value, which is mapped to the block design and VDMA0.
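A software analogue of the block matching step can be sketched as follows. It is a minimal illustration, not the FPGA design: the left image is taken as reference, borders are zero-padded during aggregation, hypotheses that fall outside the image receive a fixed penalty, and the aggregation mirrors the two-step order (column sums, then horizontal sums). All of these choices, and the function names, are assumptions.

```python
import numpy as np


def popcount(x):
    """Per-element number of set bits in a uint64 array."""
    x = np.ascontiguousarray(x, dtype=np.uint64)
    as_bytes = x.view(np.uint8).reshape(x.shape + (8,))
    return np.unpackbits(as_bytes, axis=-1).sum(axis=-1)


def box_sum(a, k):
    """k x k aggregation: vertical column sums first, then horizontal sums."""
    half = k // 2
    h, w = a.shape
    p = np.pad(a, half, mode="constant")
    col = sum(p[i:i + h, :] for i in range(k))
    return sum(col[:, j:j + w] for j in range(k))


def block_match(census_left, census_right, max_disp, agg=9):
    """Winner-takes-all disparity from aggregated Hamming distances."""
    h, w = census_left.shape
    penalty = 64.0  # at least as large as any 64-bit Hamming distance
    costs = []
    for d in range(max_disp + 1):
        pix = np.full((h, w), penalty)
        if d < w:
            # Left pixel x is compared with right pixel x - d.
            pix[:, d:] = popcount(census_left[:, d:] ^ census_right[:, :w - d])
        costs.append(box_sum(pix, agg))
    return np.argmin(np.stack(costs), axis=0)


# Synthetic check: the left census image is the right one shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.integers(0, 2**24, size=(20, 40), dtype=np.uint64)
left = rng.integers(0, 2**24, size=(20, 40), dtype=np.uint64)
left[:, 3:] = right[:, :-3]
disparity = block_match(left, right, max_disp=8, agg=5)
```

In the FPGA design the hypotheses are evaluated in parallel and the minimum is found by a tournament tree; here the same result is obtained sequentially with `np.argmin`.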
To carry out the experiments, two different FPGA implementations of the same design were derived, one for square census windows and one for lateral. The implemented circuits handle worst case scenarios, i.e. maximal window size, 21 × 21 and 29 × 15, respectively, with 24 census edges.
Experiments for smaller windows and fewer edges could be run using the same implementations, however with shifted output. The design is fully pipelined, clocked at 50 MHz and handles 256 levels of disparity.

CPU/PS
From the hardware design, a configuration file is created containing the information for generating low-level system startup files and the device tree, describing the hardware (with addresses) for the operating system, in this case PetaLinux, a Xilinx specific Linux distribution for the Zynq systems. One specific configuration was to reserve, in the device tree, part of the PS RAM for image frame buffers, i.e., to set an upper RAM limit for Linux, so that the operating system would not interfere with the frame buffers. Greatly simplified, the address space can be divided into three parts: the normal RAM, the reserved RAM area for frame buffers, and the hardware address space. Thanks to Linux, GA is a straightforward application on the CPU side, which can be hierarchically divided into three parts: initialization, GA and evaluation, as shown in Fig. 9.
To minimize processing and data transfer, KITTI training samples were combined into one file per sample, containing both left and right intensity images, together with an integer ground truth value (for future use). This completes a 24-bit, 3-channel image.

Initialization
The application begins with an initialization phase. First, the training images are loaded. As these are to be forwarded to the FPGA, and not to be processed by the PS, they are loaded to static addresses in the frame buffer memory area, outside of the memory range of the operating system. Next the ground truth images (of float precision) are loaded into allocated heap memory (RAM), as these are only to be accessed by the PS application. The ground truth needs to be shifted, as the output from the FPGA is not padded, and depends on census (varying) and aggregation (fixed) window sizes. The smaller the census window (compared to the supported size) the greater the shift required to align the images. KITTI images are of slightly different size, adding another requirement on the application (on both PS and PL sides).
With the image data loaded, the next part of the initialization phase is to set up the census register driver and the parameters for GA and evaluation. The census register driver provides an interface for manipulating census registers, in the hardware address space, from the PS application. Later, candidate GACT masks will be shared with the PL through these registers. The driver requires the hardware address and the census size, and clears the associated memory area at initialization. The GA parameters are then set up: population size, number of offspring, number of evaluations and mutation rate, as mentioned earlier, along with more experiment-specific parameters such as census window size and the number of edges. Finally, the evaluation parameters are set up. These are the thresholds associated with KITTI evaluation, pointers to the input images and the output (disparity) image (frame buffer addresses), pointers to the ground truth images (heap), addresses to the VDMA IP cores (hardware address space) and information about image size. There is a distinction between the GA and the evaluation when it comes to data. The GA derives candidate masks and requires the fitness, independent of how the evaluation is performed and on what data. The evaluation, on the other hand, is independent of GA data.

Fig. 9 Experimental setup CPU

GA
The implementation of GA is straightforward, following the algorithm described in Sect. 3 and shown in Fig. 9. The algorithm is neither particularly space nor time consuming (and hence not implementation critical). An individual in this application is defined by a GACT comparison schema of a specified number of edges. This set of edges can be compared to a genome, with each edge a gene. As described, an edge can be fitted into a 32-bit register, and an equivalent 4-byte edge datatype is defined. A GACT mask is simply defined as an array of edges, and a population as an array of individual masks.
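The 4-byte edge datatype can be pictured with the following minimal sketch. The field layout used here (four 8-bit coordinates, start point followed by end point, packed little-endian) is purely an assumption for illustration; the actual register layout of the design is not restated in this section, and the names `Edge`, `pack_edge` and `unpack_edge` are hypothetical.

```python
import struct
from typing import NamedTuple


class Edge(NamedTuple):
    """One GACT comparison: start point (x0, y0) and end point (x1, y1)."""
    x0: int
    y0: int
    x1: int
    y1: int


def pack_edge(edge: Edge) -> int:
    """Pack an edge into one 32-bit word (assumed layout: 4 x 8-bit fields)."""
    return int.from_bytes(struct.pack("<4B", *edge), "little")


def unpack_edge(word: int) -> Edge:
    """Inverse of pack_edge."""
    return Edge(*struct.unpack("<4B", word.to_bytes(4, "little")))


# A mask is an array of edges; a population is an array of masks.
mask = [Edge(3, 10, 6, 10), Edge(0, 7, 28, 7)]
words = [pack_edge(e) for e in mask]
print([hex(w) for w in words])
```

With 8 bits per coordinate, any position in the largest windows considered here (21 × 21 and 15 × 29) fits comfortably in one field.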
First the population is randomly generated in accordance with the GA parameters. Over generations offspring is generated from the population through selection (two random individuals from the population are selected as parents), crossover (combination of edges from the parents) and mutation (change a random edge). Finally, stronger offspring replaces the weaker of its parents and the population is set for the next generation.
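The loop described above can be sketched as follows. This is a simplified software model under stated assumptions: exactly one edge is mutated per offspring (the mutation rate parameter is omitted), replacement happens immediately rather than at the end of a generation, and `toy_fitness` is a hypothetical stand-in for the hardware evaluation, where lower values are better, as with an error rate.

```python
import random


def random_edge(rng, rows, cols):
    """A random edge (x0, y0, x1, y1) inside a rows x cols census window."""
    return (rng.randrange(cols), rng.randrange(rows),
            rng.randrange(cols), rng.randrange(rows))


def run_ga(fitness, rows, cols, n_edges, pop_size, n_offspring, generations, seed=0):
    rng = random.Random(seed)
    population = [[random_edge(rng, rows, cols) for _ in range(n_edges)]
                  for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]  # evaluate initial population
    for _ in range(generations):
        for _ in range(n_offspring):
            # selection: two random individuals become parents
            i, j = rng.sample(range(pop_size), 2)
            # crossover: each edge is taken from one of the parents
            child = [rng.choice(pair) for pair in zip(population[i], population[j])]
            # mutation: change one random edge
            child[rng.randrange(n_edges)] = random_edge(rng, rows, cols)
            score = fitness(child)
            # a stronger offspring replaces the weaker of its parents
            weaker = i if scores[i] >= scores[j] else j
            if score < scores[weaker]:
                population[weaker], scores[weaker] = child, score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


def toy_fitness(mask):
    """Stand-in for the FPGA evaluation: penalizes vertical edge extent."""
    return sum(abs(y0 - y1) for _, y0, _, y1 in mask)


best_mask, best_score = run_ga(toy_fitness, rows=15, cols=29, n_edges=8,
                               pop_size=10, n_offspring=8, generations=50)
print(best_score)
```

Because an offspring only ever replaces a parent with a worse score, the best score in the population never gets worse from one generation to the next.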
Evaluation needs to be performed, first for the initial population, and continuously throughout for every offspring. Before running evaluation, the current individual needs to be presented to the FPGA over the census_register.

EVAL
The EVAL part handles the image data, i.e., stereo image(s), disparity map and ground truth. The transfer of image data between the PS and the PL is done using VDMA IP cores, implemented on the FPGA side. First VDMA0, for receiving the disparity map, is set up. This includes specifying image size, data width and frame buffer address, and resetting and starting the core. For VDMA0 the width of the stream is 8 bits, to support the disparity range. The frame buffer address for the disparity map is always the same. Next the VDMA1 core is set up similarly, but this time the 24-bit image stream of the stereo images is sent to the FPGA. During the initialization the training images were loaded from files directly into different frame buffers. Hence, it suffices to change the frame buffer address instead of reloading images. The FPGA performs the census transform, according to the mask, and stereo matching before the resulting disparity map can be read from the VDMA0 frame buffer. The frame buffer is mapped into user space and the disparity image is compared pixel by pixel to the ground truth (also loaded during the initialization), returning the error rate for non-occluded pixels, given the evaluation parameters. The frame buffer is then released. This process has to be repeated for each training image. The fitness of the individual is the average error rate over the set of training images.
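The fitness computation at the end of the EVAL step can be sketched as below. The specific conventions are assumptions for the example: a ground truth value of 0 marks a pixel without valid (non-occluded) ground truth, and a pixel counts as an error when its disparity deviates by more than 3 pixels. The function names are hypothetical.

```python
def error_rate(disparity, ground_truth, threshold=3.0):
    """Fraction of valid ground-truth pixels with |d - gt| > threshold.

    Assumption: ground truth <= 0 marks a pixel that is not evaluated.
    """
    bad = valid = 0
    for d_row, gt_row in zip(disparity, ground_truth):
        for d, gt in zip(d_row, gt_row):
            if gt <= 0:
                continue  # no valid ground truth for this pixel
            valid += 1
            if abs(d - gt) > threshold:
                bad += 1
    return bad / valid if valid else 0.0


def fitness(disparity_maps, ground_truths, threshold=3.0):
    """Average error rate over the set of training images."""
    rates = [error_rate(d, g, threshold)
             for d, g in zip(disparity_maps, ground_truths)]
    return sum(rates) / len(rates)


# Two tiny "images": one of two valid pixels is wrong in the first, none in the second.
maps = [[[10, 20, 30]], [[7, 8, 9]]]
truths = [[[10.5, 0.0, 40.0]], [[7.0, 8.0, 9.0]]]
print(fitness(maps, truths))  # 0.25
```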

Experimental result
In this section, the experimental results are presented. Firstly, the results from training are presented. Conservative GACT candidates have been evaluated to investigate how parameters affect the matching result, followed by an analysis of derived GACT patterns. This is performed for square followed by lateral census windows. Then results regarding the implementation are presented and discussed. Finally, GACT masks have been compared to established CT methods.

Training
In previous work it was established that for the KITTI dataset the correlation between training and evaluation result was strong, i.e. a training candidate with low training error rate will, with a high probability, have a low evaluation error rate [2]. It was also concluded that GACT did not converge to a single solution, but to many similarly good solutions sharing common traits, representing the information from the training data. To investigate the discrepancy between different solutions, training was repeated 10 times for each parameter set. The training results, for square windows of different size and number of edges, are shown in Table 1. Comparing the max and min for the different entries, the divergence is small compared to the total error. For the worst case, a 3 × 3 window and 8 edges, the difference is 0.36% (21.75% vs. 21.39%). The GA produces solutions of acceptably consistent values, and hence a single training run could suffice. The training results for rectangular census windows are shown in Table 2. The conclusions for square windows hold true for rectangular windows; however, the training error rates are lower.

Evaluation-square GACT
Even though the training results are acceptably consistent, the median training candidates were selected for evaluation, to achieve the highest level of consistency, at a potential loss of the highest possible accuracy. This reflects the result to be expected if running the GA once.
First, square GACT windows will be considered. The evaluation results for GACT of 8, 16 and 24 edges of different window sizes are presented in Fig. 10 and Table 3, where the error rates are plotted against the number of pixels in the census window. From the evaluation results it can be concluded that larger census windows reduce the error, at an exponentially decaying rate. The final step almost doubles the census window at a quite small accuracy improvement. It has previously been established that larger CT windows broaden object boundaries [3,19], just as aggregation windows, where large windows induce foreground fattening [19], however to a lower extent regarding CT [3], and that large CTs are unfeasible [42] or even detrimental [3,19]. When applying GACT to too large census windows, edge location is optimized to minimize the error rate, omitting edges in unfeasible areas. It can hence be argued that the effect on GACT, in terms of accuracy, will not be detrimental if increasing the window size. Instead a steady state can be expected, where optimal accuracy is achieved, and extending window size beyond this point is a waste of resources. Another point regarding census window size is that the negative effect of too large census windows is suppressed by noisy data [19].
Here, experiments are based on KITTI, which consists of natural, noisy images. Hence, in combination with GACT optimisation, too large windows should not be an issue.
Increasing the number of edges increases matching accuracy, as seen in Fig. 10 and Table 3. The improvement is larger when going from 8 to 16 edges than from 16 to 24 edges, and is relatively consistent across window sizes. Regarding the number of edges as compared to window size, only for relatively small windows, an increase in window size can compensate for a larger number of edges. However, increasing edges comes at a much higher cost as it affects the later stereo matching, see Sect. 5.6. It can also be expected that there is a break-point, where introducing more edges will not lead to any accuracy improvement.

Distribution-square GACT
To conclude the results on square census windows, a total of 30 training runs of the largest example (21 × 21, 24 edges) were performed. The activated coordinates for all masks were put in a histogram to show the GACT distribution for the training data. The histogram is shown in Fig. 11. Note that this shows that a coordinate within the census window is activated, but not to which other coordinate it is connected. It can be observed that GACT activates data forming a horizontal ridge along the middle row of the window. On the other hand, the top and bottom regions are more or less flat (non-activated), and hence a waste of resources. These resources should instead be dedicated to extending the window laterally, to better cover the activation distribution. Before going to the results of lateral windows, a note on edge lengths. Similarly to the activation histogram, a histogram was created over the length of all edges. This histogram is shown in Fig. 11. As can be seen, the most common edge length is 3, and there are few edges longer than 10. Assuming a large enough window, edge lengths do not really increase with larger windows. Edges do not span the entire window, but they do populate the entire width of the window.
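The two histograms can be built from a set of trained masks as in the following minimal sketch. Edges are represented as hypothetical `(x0, y0, x1, y1)` tuples, and the edge length is taken as the Chebyshev distance between the end points; the length metric is not stated in this section, so that choice is an assumption.

```python
from collections import Counter


def activation_histogram(masks):
    """Count how often each window coordinate is used as an edge end point."""
    hist = Counter()
    for mask in masks:
        for x0, y0, x1, y1 in mask:
            hist[(x0, y0)] += 1
            hist[(x1, y1)] += 1
    return hist


def edge_length_histogram(masks):
    """Count edge lengths (assumed metric: Chebyshev distance in pixels)."""
    return Counter(
        max(abs(x1 - x0), abs(y1 - y0))
        for mask in masks
        for x0, y0, x1, y1 in mask
    )


# Two tiny example masks (not trained results).
masks = [[(0, 10, 3, 10), (5, 10, 5, 12)], [(0, 10, 2, 10)]]
print(activation_histogram(masks)[(0, 10)])  # 2
print(edge_length_histogram(masks))          # lengths 3 (once) and 2 (twice)
```

As noted in the text, the activation histogram records only that a coordinate is used, not which other coordinate it is paired with.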

Evaluation-lateral GACT
For the second part of GACT evaluation, lateral census windows were considered, to better correlate with the distribution of selected coordinates found for square windows. The number of rows and columns were set as a fixed ratio of columns = 2 * rows − 1 . Similar to square windows, the training was repeated 10 times for each window size and parameter set and the median candidates were evaluated. The results are shown in Fig. 12 and Table 4. The conclusion from the experiment is that, for a census window of a certain number of pixels, the GACT of a lateral shape performs better than a square. This is best visualized by Fig. 12 where the results for square windows have been included for reference. The lateral series (red) are below the corresponding square series (blue).
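The fixed ratio makes it easy to enumerate the lateral window sizes and compare them with square windows by pixel count. The short sketch below does only that; the helper name is hypothetical.

```python
def lateral_window(rows):
    """Return (rows, columns, pixels) for the ratio columns = 2 * rows - 1."""
    cols = 2 * rows - 1
    return rows, cols, rows * cols


for rows in (3, 5, 9, 15):
    print(lateral_window(rows))

# The largest lateral window (15 x 29 = 435 pixels) covers roughly the same
# number of pixels as the largest square window (21 x 21 = 441 pixels).
print(lateral_window(15)[2], 21 * 21)  # 435 441
```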

Distribution-lateral GACT
Similar to square GACT, lateral GACT was trained 30 times for the largest training parameters, i.e., a 15 × 29 window with 24 edges, to investigate the distribution of coordinate activation and edge lengths. The results are shown in Fig. 13. The edge length distribution, Fig. 13b, resembles that of square GACT, with 3 being the most common length. Once again, edges spanning the entire window are deemed unfavorable. The coordinate selectivity histogram, Fig. 13a, on the other hand, shows a more interesting result. First, it should be noted that the distribution declines vertically from the center row. This indicates that not much information is lost by vertically limiting the window. Secondly, the horizontal stretch shows that coordinates are not activated along a ridge, but that there are rather two separate parts: (1) a central distribution and (2) the most lateral regions of the window. Knowing this, a similar pattern can be distinguished in the square coordinate distribution, Fig. 11a.
To further investigate the nature of the coordinate distribution, experiments were run for different window sizes. The resulting distributions are shown in Fig. 14a-d. For the 15 × 29 window a normal probability distribution was estimated from the central part of the data, Fig. 14i. This was subtracted from the other histograms and the remaining distributions are shown in Fig. 14e-h. It can be observed that the examples share the same central distribution, with the supporting lateral regions following the expansion of the window. From the example, approximately half of the edges adhere to each of these parts, respectively. A hypothesis is that the central distribution represents matching on similarity while the peripheral edges help to eliminate uncertainty.
To conclude the evaluation results for lateral GACT, similarly as to square GACT, lower error rates will be achieved by increasing the window size and/or the number of edges. The better the result the higher the cost for an improvement. It can be noted that there is little difference between 16 and 24 edges until the two final window sizes. It can also be noted that lateral GACT16 is better than square GACT24. This is of great importance when considering implementation trade-offs for resource limited systems.

Implementation
Utilization for the two FPGA implementations, i.e., for square and lateral GACT windows, is shown in Tables 5 and 6. GACT windows are set as large as possible, 21 × 21 and 15 × 29, for the Xilinx ZCU104 target board, considering a data width of 24 edges, a 9 × 9 aggregation window, and 256 levels of disparity. As can be seen from the utilization tables, the LUTs are the limiting resource. Clearly, stereo matching considering a large number of disparity hypotheses will require a considerable amount of resources. Not as apparent is the cost associated with the GACT, for the specific GA implementation. The GA works under the presumption that any census window coordinate can be selected, as an edge point, at any time. Implementation of this is straightforward for a CPU, where elements from an array-like structure, representing the image patch, can be accessed in constant time (at a very high level). However, on the circuit level of an FPGA, array indexing is a different proposition, as each index requires a signal tap, a physical connection for each bit of the data. An edge is defined by a start and an end point. The design supports up to 24 edges. Hence, there are 48 elements to be accessed each clock cycle. Adding to the problem is that the operation must be performed for both input images. It is unavoidable that the circuitry required for routing/multiplexing rapidly grows out of proportion as census window sizes increase.
However, when implementing a circuit for a trained, and hence deterministic, GACT mask the resource utilization can be considerably reduced, as routing can be limited to specific indices. The resource utilization for an arbitrarily defined GACT mask is represented by the first column of Table 7. This can be compared to the training setup, Table 6.
As opposed to the original CT, the window size of GACT can be altered without affecting the output data width, as the number of edges is defined. Hence, increasing the window size is a valid option to achieve higher matching accuracy. However, a larger window requires more buffering resources in the GACT component. On the FPGA, where the image is represented as a stream instead of a two-dimensional grid, the concept of neighboring pixels/pixel connectivity is redefined. For a stream, the distance to horizontal neighbors is one pixel, just as in the 'normal' case. Vertical neighbors, on the other hand, are one full width of the image away, and require buffering of a full row. This is handled by the sliding window component using block RAM. Elongating the GACT window in the horizontal direction comes at a very low additional cost, while extending the window vertically   individual clock cycles. It can be concluded that lateral windows are resource efficient and perform better. Limiting the number of edges for GACT will of course save resources, not only in the GACT component itself, where fewer pixels are accessed, but more importantly in the subsequent stereo matching component, which no longer has to support the full data width. This is apparent when comparing the FPGA utilization for 8 to 24 edges, as can be seen in Table 7. Finally, the supported disparity range of the circuit is a major contributor to high implementation cost. The disparity range is dictated by the application/problem and is not a variable parameter as such. However, for the KITTI training dataset 0.0022% of the pixels are of disparity larger than 127. Assuming that the evaluation dataset has the same disparity distribution, limiting the disparity range to 127 can be considered a fair trade-off (to save resources for more elaborate stereo matching).
From the subset of images randomly selected for GACT training there are no disparities greater than 127 so for the current training setup there would be no penalty associated with a disparity range reduction. The FPGA utilization for 128 disparities is listed in Table 7.
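The buffering argument above (vertical neighbors in a stream are one full image row away) can be made concrete with a small calculation. The sketch assumes the common sliding-window arrangement of one line buffer per additional window row plus one register per window pixel, and an image width of 1242 pixels, which is typical for KITTI; both the buffer model and the width are assumptions, and the helper name is hypothetical.

```python
def window_buffer_pixels(rows, cols, image_width):
    """Return (line_buffer_pixels, window_register_pixels) for a streamed image.

    Assumed model: (rows - 1) full-row line buffers plus rows * cols registers.
    """
    line_buffer = (rows - 1) * image_width
    registers = rows * cols
    return line_buffer, registers


WIDTH = 1242  # assumed image width in pixels
print(window_buffer_pixels(21, 21, WIDTH))  # (24840, 441)  square
print(window_buffer_pixels(15, 29, WIDTH))  # (17388, 435)  lateral
print(window_buffer_pixels(5, 5, WIDTH))    # (4968, 25)
print(window_buffer_pixels(5, 29, WIDTH))   # (4968, 145)
```

Under this model the line-buffer cost depends only on the number of rows, so widening a window changes the register count but not the row buffering.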
The FPGA pipeline is clocked at 50 MHz. As 256 disparity hypotheses are evaluated in parallel, this equates to 12,800 MDE/s. KITTI images are of 0.5 Mpixel, hence the frame rate is 100 fps. At this rate a training cycle, for the current set of parameters, would complete in 5 min. However, the current SoC setup requires approximately 40 min (2325 s), with a single-core CPU load of 18.5%. Consequently, the bottleneck of the system is believed to lie in the memory mapping of image data at the driver level. Regardless, the hardware acceleration is considerable compared to previous work [2], where a high-level CPU implementation required approximately 20 h to complete a training cycle, using an Intel Xeon X5650 2.67 GHz, even though the 8 offspring were calculated in parallel through multi-core processing. Hence, the SoC setup accelerates training by a factor of 30.
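The figures in this paragraph follow from simple arithmetic, reproduced below using only the numbers quoted in the text (one pixel per clock cycle is implied by the fully pipelined design).

```python
clock_hz = 50e6            # pipeline clock
disparities = 256          # hypotheses evaluated in parallel per pixel
pixels_per_frame = 0.5e6   # approximate KITTI image size

# Million disparity evaluations per second
mde_per_s = clock_hz * disparities / 1e6
# Frames per second at one pixel per clock cycle
fps = clock_hz / pixels_per_frame
# Measured speedup: 20 h on the CPU versus 2325 s on the SoC
speedup = 20 * 3600 / 2325

print(mde_per_s)          # 12800.0
print(fps)                # 100.0
print(round(speedup, 1))  # 31.0
```

The measured speedup of roughly 31 is what the text rounds to a factor of 30.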

Evaluation with respect to related work
GACT has been compared to related works of a 5 × 5 census window size, except for CT 7 × 9 . Several of the related works are sparse and should be compared to methods of similar number of edges. To include the aspect of rectangular windows, additional GACT masks were included for 3 × 9 (27 pixels) and 3 × 7 neighborhoods (21 pixels). GACT24 5 × 29 has also been appended as a reference, as it has been established that the implementation cost is similar to GACT24 5 × 5 . The results are shown in Table 8. The results show that the 5 × 5 GACT performs better than other CT methods of the same size and number of edges. In fact, only CT 7 × 9 , which comes associated with considerably higher resource costs, achieves a better score than the sparsest GACT. GACT 5 × 5 is comparable with previous results [2], which corroborates the convergence of GACT from a small training set for homogeneous datasets.
In line with the experimental results, the accuracy is improved by adopting rectangular census windows. Both GACT 3 × 7 and 3 × 9 perform better than their quadratic counterpart. GACT 3 × 7 is of a slightly smaller neighborhood, while GACT 3 × 9 is slightly larger. It is evident that GACT makes good use of the extra lateral columns, and for this setup it requires the same amount of resources, which, in terms of row buffering, is half that of GACT 5 × 5. If resources are the focus, and 5 × 5 is considered the default experiment, GACT 5 × 29 comes with no, or low, additional cost for the current setup. However, the result surpasses that of the smaller windows by a margin.
To perform well it is apparent that a 5 × 5 census window is not enough. Looking at the related works, only MCT, RCT, GCT and SCT are defined to size. CT is often adopted in a 7 × 9 configuration, for bit-strings to fit within 64-bit registers. However, bit-string length quickly increases with window size, and hence also the processing cost for matching. The sparse CT can produce bit-strings of a specific length, for different window sizes, by adopting different levels of sparseness. The main argument of sparse CT was that, given an equal number of comparisons, a larger sparse CT performs better than a smaller dense one, which is preferable if/when the processing resources are limited.
Center-based CTs are sensitive to noise, and methods using different comparison schemas, GCT, SCT and CSCT, have been proven successful and are the ones to improve upon. Both GCT and SCT are defined within a 5 × 5 neighborhood and with a certain number of edges. For SCT there are also different variations depending on edge length. To perform a proper extension for windows of different sizes and shapes would require quite an effort, as the methods are handcrafted. The simple solution would be to extend them into larger windows by introducing empty rows and columns, analogous to sparse CT. However, this raises concerns regarding edge distribution and length. SCT could capture local lateral comparisons but is limited to one single edge length. GCT has edges spanning the entire neighborhood, and these edges are not favorable according to GACT.
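How quickly the dense bit-string grows can be seen from a minimal sketch of the original, center-based CT, in which every pixel except the center contributes one bit (1 if it is of lower intensity than the center, otherwise 0). The scan order and function names are assumptions made for the example.

```python
def census_dense(patch):
    """Dense, center-based census transform of one patch (2-D list of intensities)."""
    rows, cols = len(patch), len(patch[0])
    cr, cc = rows // 2, cols // 2
    center = patch[cr][cc]
    bits = 0
    for r in range(rows):
        for c in range(cols):
            if (r, c) == (cr, cc):
                continue  # the center is not compared with itself
            bits = (bits << 1) | (1 if patch[r][c] < center else 0)
    return bits


def dense_bits(rows, cols):
    """Bit-string length of the dense CT for a rows x cols window."""
    return rows * cols - 1


patch = [[1, 9, 1],
         [9, 5, 9],
         [1, 9, 1]]
print(bin(census_dense(patch)))  # 0b10100101

for size in ((5, 5), (7, 9), (21, 21)):
    print(size, dense_bits(*size))  # 24, 62 and 440 bits
```

The 7 × 9 window yields 62 bits, which fits a 64-bit register, whereas a dense 21 × 21 window would need 440 bits per pixel.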
GACT on the other hand can produce bit-strings of a specified length, independent of window size. It will also find a good distribution between the edges, both regarding positioning and length. However, this comes at the cost of training. By adopting the proposed hardware accelerated approach, training is quick and does not require much training data.
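That the bit-string length depends only on the number of edges, and not on the window size, can be seen from the following minimal sketch of an edge-based transform. Edges are hypothetical `(x0, y0, x1, y1)` tuples, and the direction of the comparison (bit set when the start pixel is darker than the end pixel) is an assumption for the example.

```python
def gact(patch, edges):
    """Edge-based census of one patch: one bit per edge.

    Assumption: a bit is 1 when the start pixel is of lower intensity
    than the end pixel, otherwise 0.
    """
    bits = 0
    for x0, y0, x1, y1 in edges:
        bits = (bits << 1) | (1 if patch[y0][x0] < patch[y1][x1] else 0)
    return bits


patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
edges = [(0, 0, 2, 2), (2, 0, 0, 0), (1, 1, 1, 1)]
print(bin(gact(patch, edges)))  # 0b100

# The same three edges applied to a larger window still give three bits.
wide = [[c + 10 * r for c in range(29)] for r in range(5)]
print(gact(wide, edges).bit_length() <= len(edges))  # True
```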

Outlook
The experiments and evaluations have been performed using a basic block-matching framework. This gives a base for comparison between different CT methods, as cost metrics, but is not a full and final stereo algorithm where concepts such as matching confidence, left-right consistency, different strategies for cost aggregation, refinement filters, sub-pixel interpolation, etc., are considered. Both algorithm optimization steps and adaptations for FPGA implementation have an effect on matching accuracy [43]. The first question, of course, is how well GACT can perform in such an algorithm, and secondly, if and how extending the algorithm affects the GACT training result. Does the edge distribution depend more on the image information or on the algorithm? These questions are left for future work. However, as a small experiment, SGM [15] was adopted for some of the GACT masks from the experiments, i.e. masks trained using basic block matching. The results are shown in Table 9.
It is clear that SGM optimization improves the result. For larger GACT windows, though, the improvement is small. Two reasons for this are that (1) the lower the error rate, the more challenging and costly it is to make improvements (similarly to larger window sizes and more edges), and more to the point (2) larger windows result in a larger perceptive field, including 'semi-local' information otherwise provided by SGM. However, the question whether GACT and SGM share a symbiotic advantage, if employed during training, remains.

Conclusion
The CT is a well-established cost metric for stereo matching, suitable for implementation on resource limited systems. Over the years several different CT methods have been proposed, from which two key developments can be identified: (1) sparse CTs save resources by not evaluating all pixels within the census window, and a larger sparse CT performs better than a smaller dense one at a similar implementation cost when it comes to the actual matching, and (2) non-centric comparison schemas make CT produce a better result and be less sensitive to noise. The GACT takes advantage of both these developments, but instead of using a handcrafted comparison schema it relies on GA to position the edges, optimized for the image data. Previous work [2] shows that GACT performs better than other CT methods with the same number of edges. In this paper the training time for GACT has been significantly reduced through hardware acceleration, adopting FPGA-based GACT and stereo matching. This has enabled evaluation of GACT for multiple parameter sets, altering window size and shape, and the number of edges. The experiments suggest that GACT has a preference for selecting two different types of edges, central and lateral, of limited length compared to the maximum length allowed by the larger neighborhood. Hence, GACT benefits from adopting lateral windows, further improving the previously established GACT result, while at the same time, from an implementation perspective, requiring fewer buffer resources.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.