Image analyzer for stereoscopic camera rig alignment

The paper presents a versatile solution facilitating the calibration of stereoscopic camera rigs for 3D cinematography and machine vision. Manual calibration of the rig and the cameras can easily take several hours. The proposed device eases this process by providing the camera operator with several predefined analyses of the images from the cameras. The Image Analyzer is a compact stand-alone device designed for portable 19″ racks. Almost all video processing is performed on a modern Xilinx FPGA, supported by an ARM computer which provides control and video streaming over Ethernet. The article presents the hardware, firmware and software architectures. The main focus is put on the image processing system implemented in the FPGA.


Introduction
Stereoscopic image recording is a relatively new field of modern cinematography. The market for stereoscopic motion pictures is growing, driven by the recent popularization of stereoscopic displays and the development of related standards for video transmission and storage. Although displaying stereoscopic material has become quite easy, recording it is still a complex task.
Consumer-grade stereoscopic cameras are not suitable for professional productions, hence the video acquisition is usually done by a setup of two conventional cameras mounted on a dedicated rig. The aim of the rig is to support the cameras and ensure a constant mechanical relationship between them. The rig has to provide means of adjustment for at least two parameters which are crucial for proper depth perception: the stereo base and the convergence distance. These parameters, illustrated in Fig. 1, define the Comfortable Viewing Range (CVR), a region of space which should enclose the captured scene [1].
The stereo base is the apparent distance between the cameras' focal points. This value defines the volume of the CVR. To obtain depth perception analogous to the real-world experience, the stereo base is usually adjusted to around 1/30 of the distance between the rig and the closest subject [2].
The convergence distance is the separation between the rig and the point where the cameras' optical axes intersect. This parameter is a function of the stereo base and the angle between the optical axes. Subjects located closer than the convergence distance will appear to the viewer as being in front of the screen, whereas those located farther will appear as being behind it.
To avoid loss of the material, e.g. due to cabling issues, the cameras record the video stream to removable solid-state drives. For proper synchronization during the production process, both cameras are supplied with timing information (Time Code) and a shutter synchronization signal (GenLock). The preview of the recorded frames is usually available through the High-Definition Multimedia Interface (HDMI) or the Serial Digital Interface (SDI). The latter is more popular in medium- and high-grade cameras, mainly due to more reliable cabling [3].
The rig calibration is usually performed by aligning images obtained from these preview data streams while filming one or several boards with a dedicated pattern of lines and other alignment markers [4]. Complete calibration of the stereoscopic setup includes: camera roll, pitch and translation compensation, adjustment of lens settings, color space equalization (mainly for mirror rigs) and, finally, applying the desired stereo base and convergence settings.
The legacy method of calibration, which the authors have observed in practice, involves combining or switching the video streams with a Matrox MC-100 multiplexer and supplying them to a high-resolution display. The images on the screen are then compared by manually analyzing the relationships between the observed geometries. One of the goals of the calibration is to reduce the vertical disparity to no more than a single line. The complete set-up process can easily consume several hours.
The calibration time could be significantly reduced by supporting the operator with semi- or fully automated analysis of the video preview streams. One such solution is STAN, the stereoscopic analyzer developed by the Fraunhofer Institute [1]. It is a computer application suite capable of performing a wide range of image analyses and calculations on stereo-pair images. Our goal is to provide similar functionality with a more compact and energy-efficient device, which would connect to both cameras and provide the analyzed image as well as additional pass-through signals. The following sections focus on the FPGA firmware of the first prototype of such a solution.

Device requirements
The Image Analyzer has to accept video streams arriving from the cameras by means of the industry-standard SDI protocol. The SDI standard specification was first released in 1989 and has since undergone a number of refinements [5]. In most configurations the link requires just a single regular 75 Ω coaxial cable with BNC connectors. It carries the video signal and, optionally, a set of audio channels. Currently most SDI devices operate with a link speed of 3 Gb/s; however, the 6 Gb/s version is entering the market and the 12 Gb/s successor is already planned. The 3 Gb/s variant supports a 1080p60 video stream (1080 lines of 1920 pixels at 60 fps, 4:2:2 mode). The coaxial cabling allows reaching a connection distance of around 300 m; however, the range can be easily extended using an optical transmission medium.
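As a rough sanity check of why 1080p60 fits a 3 Gb/s link: in 4:2:2 mode with 10-bit components, each pixel carries 10 bits of luma plus 10 bits of chroma shared across a pixel pair, i.e. 20 bits per pixel on average. The function below is purely illustrative; the actual 3G SDI line rate of 2.97 Gb/s additionally covers blanking and ancillary data.

```c
/* Active-video payload of a 1080p60 stream in 4:2:2 mode:
 * 10-bit luma per pixel plus 10-bit chroma shared by pixel pairs,
 * i.e. 20 bits per pixel on average. */
double sdi_1080p60_payload_gbps(void)
{
    return 1920.0 * 1080.0 * 60.0 * 20.0 / 1e9;  /* just under 2.5 Gb/s */
}
```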
The Image Analyzer has to provide the composed output signal using HDMI and SDI interfaces. It is also required to provide the resulting video stream through a web service for preview, e.g. on a mobile device.
The Image Analyzer has to be a compact stand-alone device that can be easily integrated with the infrastructure already used on the set. It should interface with the other components: cameras, preview displays and the main display, as illustrated in Fig. 2. Since there is often a small 19″ rack present on the set, it was agreed to design the device as a regular 1 U rack module. The module has to accommodate mains or battery power supply.
The basic modes of device operation, illustrated in Fig. 3, are defined as follows: (a) "over and under": images aligned vertically; (b) "side by side": images aligned horizontally; (c) "50/50": average of corresponding pixels; (d) "line by line": one line from the first camera, the next line from the second camera, and so on; (e) "anaglyph": a special color conversion which allows for depth perception through the use of color-filtering glasses; (f) "difference": absolute value of the difference between corresponding pixels.
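The pixel-wise modes (c) and (f) reduce to simple per-component arithmetic. The sketch below (our illustration with hypothetical names, not firmware code) shows both operations on 8-bit components:

```c
#include <stdint.h>

/* "50/50" mode: average of corresponding 8-bit components.
 * Widened to unsigned before adding to avoid overflow. */
uint8_t mode_average(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b) / 2u);
}

/* "difference" mode: absolute value of the component difference. */
uint8_t mode_difference(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}
```

In the difference mode, a perfectly aligned rig produces a near-black image, which is why deviations stand out.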

Hardware design
Capturing two High-Definition SDI video streams, performing a series of arithmetic operations on them, and returning an SDI signal with only a general-purpose processor would be close to impossible, especially if the device is to be kept compact and energy-efficient. The large amount of required customization and "glue logic" suggests that the solution should be based on an FPGA circuit. Such a platform facilitates processing of huge streams of data with low latency. Moreover, it helps accommodate the evolving project requirements. Implementing the 1 Gb Ethernet connectivity with video streaming capability completely in the FPGA would also be infeasible: the soft-core processors do not offer enough performance, and implementing the video streaming service in hardware description languages would be an extremely time-consuming effort. The web service should hence preferably be offered by an external processor module. Most modern Single Board Computers (SBCs) offer HDMI video output, hence the module could also be used for the generation of a high-quality On-Screen Display (OSD).
The Image Analyzer is composed of an FPGA board, several I/O modules, an ARM-based SBC and a power supply. The device can be operated locally, by means of a hardware keyboard, LCD and video OSD, as well as remotely using a web service. The system structure is shown in Fig. 4.
The selected FPGA is a modern Xilinx Kintex-7 integrated circuit: the XC7K355T. It is responsible for almost all of the video processing. It is capable of receiving and sending a 1080p60 (1920 × 1080 pixels, progressive, 60 fps) stream over the HDMI and SDI interfaces. For the SDI interfaces, only proper signal equalization is needed; the serialization and deserialization take place in the Multi-Gigabit Transceivers (MGTs) of the FPGA. The HDMI interface was, on the contrary, implemented using external, highly configurable serializers and deserializers. These devices communicate with the FPGA by means of 16-bit buses operating at a frequency of about 150 MHz. A dedicated HDMI and SDI I/O module developed by the authors is further described in [6].
The FPGA processing module cooperates with a GateWorks Ventana GW5400 SBC. The module is based on a quad-core ARM Cortex-A9 processor running at a frequency of 1 GHz. The processor board contains both HDMI output and input interfaces, which enables it to simultaneously generate 1080p60 and capture 1080p30 video. The HDMI output is used for generation of the overlay, which is composed in the FPGA into an OSD. The HDMI input enables capturing the image resulting from the analysis and streaming it over the Ethernet. More details on the SBC firmware are presented in [7].
The Analyzer is equipped with a precise clock synthesizer generating a reference frequency of up to 148.5 MHz (for the highest resolution) with spread spectrum. This signal is used for clocking most of the video processing components in the FPGA. The design is hence independent of external clock signals. The selected frequency allows for generation of the 1080p output signal with up to 60 fps.
The device contains 256 MB of DDR3 memory on a SO-DIMM module. The memory is connected through a 64-bit data interface operating at a frequency of 400 MHz. The memory bandwidth is hence around 50 Gb/s. In comparison, the throughput required for a unidirectional 1080p30 video transmission in 4:2:2 mode is just around 1 Gb/s. The unit also contains a simple supervision board based on an ARM Cortex-M4 microcontroller. This module monitors the power supplies and switches between them. It also drives the front panel LEDs and controls the LCD. Finally, it provides voltage translation for the serial interface between the SBC and the FPGA.
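The quoted bandwidth follows directly from the interface parameters: DDR3 transfers data on both clock edges, so a 64-bit bus at 400 MHz yields 64 × 400 MHz × 2 = 51.2 Gb/s. A back-of-envelope check (illustrative code, not firmware):

```c
/* DDR3 peak bandwidth: 64-bit bus, 400 MHz clock, two transfers per
 * clock cycle (double data rate). */
double ddr3_bandwidth_gbps(void)
{
    return 64.0 * 400e6 * 2.0 / 1e9;  /* 51.2 Gb/s, i.e. "around 50 Gb/s" */
}
```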
A USB hub allows accessing the debug features of all the boards with a single USB connection. A small LCD screen and keypad connected to the FPGA allow implementing an intuitive multi-level menu system that remains operational even with the external display disconnected. More information on the hardware design can be found in [8].

Firmware design
The FPGA firmware of the Image Analyzer is relatively complex; it will hence be described as two coupled systems: the control system and the video processing path. Moreover, the latter will also be described as a set of smaller sub-systems.

Control system
The block diagram of the control system is shown in Fig. 5. It is governed by a Microblaze processor core coupled with 128 kB of BlockRAM memory. This memory stores the processor executable code as well as its run-time variables. It is a common practice in Xilinx FPGAs to store the application in RAM; the memory is preloaded with the machine code during the FPGA boot process. The processor runs at a frequency of 100 MHz. It does not take an active part in the image processing; it only schedules the transfers.
The AXI interface of the DDR3 memory controller is configured to operate at 200 MHz with a data bus width of only 128 bits. Therefore, its throughput is limited to about 25 Gb/s (half of the controller's maximum performance). Such an approach considerably relaxes the timing requirements related to the controller. The throughput of 25 Gb/s is much higher than the worst-case required throughput of 16 Gb/s, which occurs when all four Video DMA circuits capture and generate 1080p60 signals.
All the bulk data transfers are serviced by a 128-bit-wide AXI crossbar. It is the main communication bus, allowing parallel, simultaneous transfers between its masters and slaves. The width of the bus was matched to the width of the memory interface. All DMA controllers use the 128-bit data bus to perform the transfers. In contrast, the control interfaces of these DMA circuits operate in 32-bit mode. The processor interface and the bus bridges are also implemented as 32-bit-wide, as no high-performance transfers are performed there.
The system contains three more 32-bit AXI buses, which are implemented as AXI Lite shared interconnect buses. This means that at any given time only one transaction can be active on such a bus. An AXI Lite interconnect can have at most 16 slaves, which is why the design required three of them. One of these buses is dedicated to the I/O modules, the second serves the video path components and the third services the low-speed communication interfaces.

Video processing system
The part of the design described above contains only Commercial Off-The-Shelf (COTS) IP cores. The situation is quite different for the video processing path, where the essential video processing components were developed from scratch. The general structure of this system is shown in Fig. 6. For the sake of simplicity, the figure does not present the 15 AXI control interfaces connected to these components.
The signals from the cameras are supplied through two SDI links. These are equalized and converted to differential form in dedicated equalizer circuits, then provided to the FPGA as pairs of differential lines. These lines are connected to the Multi-Gigabit Transceiver (MGT) inputs. Each MGT cooperates with the SDI receiver IP core in the detection of the incoming baud rate. After recovering the clock, the data are captured and deserialized. Without further processing, the data streams are fed back to the MGT for serialization; this loop-back implements the pass-through functionality of the Image Analyzer. The data streams also enter the SDI receiver blocks, where the video data are extracted and passed on using the Xilinx Streaming Video Interface (XSVI). The XSVI signals are provided to the video cross-switch, which has two purposes: it allows tests with identical video signals in both channels and it converts the video from the XSVI to the AXIS protocol. Usage of the deprecated XSVI standard comes from the Xilinx SDI input reference design.
The video from the cross-switch enters the channel processor, which performs several monitoring and editing tasks. Its structure and operation are described in the following section.
Next, the video stream is received by the Xilinx Video DMA (VDMA). This component implements an image buffer of three frames with dynamic GenLock synchronization. It offers seamless adaptation between the input and output frame rates: frames are either repeated or skipped automatically when needed [9]. When one channel of this DMA operates on a frame buffer, the other is forbidden from accessing it, guaranteeing that only complete frames are passed through.
The HDMI input signal, from the SBC, has a much simpler path. First, the embedded synchronization data are extracted, and then the signal is re-clocked and converted to the AXIS standard. At the same time, the synchronization signals are provided to a Xilinx Video Timing Controller (VTC) for detection of the resolution and frame rate. Finally, the video stream is provided to the Video DMA.
Signals from all three Video DMA circuits are provided to a custom video combiner block operating at a frequency of 150 MHz. It is a dedicated solution performing almost all the analyses that the Image Analyzer is required to provide. One of its functions is to calculate linear combinations of the components of pixels from the input streams. Moreover, it implements video stream masking for inclusion of the On-Screen Display. The structure and operation of this block are presented in Sect. 4.4.
The video from the combiner is provided to the HDMI output path and follows its timing, which in turn is provided by the second VTC module. The HDMI stream is then enriched with embedded synchronization information and provided to a dedicated transmitter outside the FPGA. The video from the combiner block is also provided to the VDMA circuit isolating the combiner clock and timing domain from the SDI output. The last (third) VTC generates the timing for the SDI output.

Channel Processor
The Channel Processor handles the initial processing of the video stream captured through the SDI interface. It is composed of the main video path and a set of auxiliary monitoring modules. Its structure is illustrated in Fig. 7.
The SDI input block, obtained from the Real Time Video Engine reference design [10], always returns the data in 4:4:4 mode irrespective of the video standard provided to its input. Most of the components of the Image Analyzer were, however, developed for operation on the more efficient 4:2:2 stream. Adaptation of the SDI receiver component is not possible, as its sources are encrypted. The video is hence first provided to a YUV color space converter. This custom block performs chroma sub-sampling from 4:4:4 mode (three 10-bit components per pixel) to 4:2:2 mode (two 8-bit components per pixel). This decreases the amount of data required in the further processing steps by roughly a factor of two.
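The conversion above can be sketched for a single pixel pair. This is our simplified model, assuming truncation from 10 to 8 bits and chroma taken from the first pixel of each pair; the actual core may use rounding or chroma filtering.

```c
#include <stdint.h>

/* Sketch of 4:4:4 (10-bit) to 4:2:2 (8-bit) chroma sub-sampling for one
 * pixel pair. Output follows the usual 4:2:2 word order: {Y0,Cb}, {Y1,Cr}. */
typedef struct { uint16_t y, cb, cr; } pixel444_t;  /* 10-bit components */
typedef struct { uint8_t y, c; } word422_t;         /* 8-bit luma + chroma */

void subsample_pair(const pixel444_t *p0, const pixel444_t *p1,
                    word422_t *w0, word422_t *w1)
{
    w0->y = (uint8_t)(p0->y >> 2);   /* drop the two least significant bits */
    w0->c = (uint8_t)(p0->cb >> 2);  /* Cb taken from the first pixel */
    w1->y = (uint8_t)(p1->y >> 2);
    w1->c = (uint8_t)(p0->cr >> 2);  /* Cr also from the first pixel */
}
```

Each output word thus carries 16 bits instead of the 30 bits of a 4:4:4 pixel, which is the two-fold reduction mentioned above.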
After the sub-sampling, the data enter the diagnostic block. This custom IP core calculates the image width and height based on the synchronization signals. It is also capable of checking whether all the image lines have the same width, which may not be the case when the SDI signal is corrupted (e.g. due to a wrong signal distribution topology or improper termination). The data are also observed by the Color Data Collector, which calculates the average color for nine areas of the image, defined by dividing the frame into three rows and three columns. The row and column boundaries are adjustable at run-time. The data from the sub-sampler are also provided to a set of two decimation blocks. These allow decimating the data by removing every second (odd or even) row, or column, or both. This functionality is used in the side-by-side, over-and-under and line-by-line modes, where only half of the original number of pixels is needed.
The vertical decimation requires skipping every odd or every even line of the image. The line counting has to be restarted on the start-of-frame marker. This block does not buffer any data; it only masks some of the interface lines. There is hence no latency associated with its operation.
The horizontal video decimation is a somewhat more complicated operation. Every data word contains the luminance of the corresponding pixel and one chrominance component common for two consecutive pixels. Discarding every second word would lead to always dropping the same chrominance component, causing color space distortion of the image. The solution is to capture information on two consecutive pixels over two clock cycles. From each pair of pixels, only the luminance of the first is used, together with the appropriate chrominance component (alternating between Cb and Cr with every returned pixel).
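A behavioral sketch of this scheme, in software form for clarity (the actual block operates on a streaming interface, one word per clock): from each input word pair, the first luma is kept, and the chroma is chosen so that Cb and Cr still alternate on the output.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint8_t y, c; } word422_t;  /* luma + alternating Cb/Cr */

/* Sketch of chroma-aware horizontal decimation by two on one 4:2:2 line.
 * in has n words (n even); out receives n/2 words. Input word 2k carries
 * Cb, word 2k+1 carries Cr; the output alternates Cb/Cr across words. */
void decimate_horizontal(const word422_t *in, size_t n, word422_t *out)
{
    for (size_t k = 0; k < n / 2; k++) {
        out[k].y = in[2 * k].y;                  /* keep the first luma */
        out[k].c = (k % 2 == 0) ? in[2 * k].c    /* even output: Cb */
                                : in[2 * k + 1].c; /* odd output: Cr */
    }
}
```

Naively taking every second word instead would output only Cb (or only Cr) components, which is exactly the distortion the text describes.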
After the decimation, the video stream is passed to a block performing optional horizontal flipping of the image. To perform this operation, a whole video line has to be buffered and then returned in reversed order. To improve the latency, the component is equipped with two line buffers: while one is written, the other can be read. Each buffer is equipped with an index register and a validity flag. The flag is set when the buffer is filled with data and cleared when all its data are read. When the effect is not in use, the buffer is read in the regular order; otherwise, it is read in reverse. During the reversal, the Cr and Cb chrominance components would become swapped, causing color distortion of the image. To counteract this effect, the module cooperates with a simple block which swaps these components when the effect is in operation. During streaming, the solution introduces a latency of one video line, increased by several clock cycles for the additional pipeline buffers and the chroma swapping block.
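The flip-and-swap combination can be illustrated as follows (a behavioral sketch of ours, not the streaming hardware): reversing the word order of a 4:2:2 line puts Cr before Cb in each pair, so the chroma components of each output pair are swapped back.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint8_t y, c; } word422_t;  /* luma + alternating Cb/Cr */

/* Sketch of the horizontal flip: the buffered line is read back in
 * reverse word order, and the chroma components of each output pair are
 * swapped so that Cb again precedes Cr (otherwise colors would be
 * distorted, as described in the text). n must be even. */
void flip_line(const word422_t *in, size_t n, word422_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[n - 1 - i];          /* reverse the word order */
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint8_t t = out[i].c;            /* swap Cr/Cb back into order */
        out[i].c = out[i + 1].c;
        out[i + 1].c = t;
    }
}
```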

Video combiner
The Video Combiner module is a conglomerate of several IP cores developed for the Image Analyzer. Its structure is depicted in Fig. 8. The Video Combiner's task is to compose together the three video streams: two from the SDI inputs and one from the HDMI input. To make this possible, all three streams must have identical resolutions and have to present pixels from the same image coordinates at all times. This prerequisite can be easily achieved through the use of similarly configured and synchronously read VDMA circuits. These are set to provide an SDRAM-based buffer of three video frames. While one of the buffer locations is being written, another can be read; the third location provides a safety margin. The VDMA circuits are capable of repeating the last frame or dropping frames to adapt the source frame rate to the destination frame rate.
Each of the three video input channels of the stream processor has a VDMA circuit followed by a small buffer. Data from these buffers are provided to a synchronizer block, which is responsible for the alignment of all three video streams. It waits for the Start of Frame (SOF) marker to appear on any of the paths. After detecting such an event, the synchronizer reads the other streams, discarding their data, until all three of them indicate SOF. This might require pausing the first stream for a time comparable with the time of transferring a single video frame. Thanks to the VDMA circuits, no buffer overflow can occur during this operation.
When all the enabled streams indicate the SOF condition, they are considered synchronized and their data are passed to the further blocks. In case of detecting the End of Line (EOL) signal on one of the streams, the other streams' data are discarded until the EOL marker is presented by all the streams. When another SOF condition is detected, the block returns to the synchronization mode, discarding any excessive data if needed. The discarding of data is, however, not expected to happen, as the stream resolution depends only on the settings of the VDMA circuits, which are configured by the embedded microcontroller. Each channel can be selectively masked, so the block can also operate with one or two input signals missing.
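The core decision of the synchronizer, "pass data only when every enabled channel has reached SOF", can be captured in a small predicate. The names and structure below are ours, not taken from the firmware:

```c
#include <stdbool.h>

enum { NCHAN = 3 };  /* two SDI channels plus the HDMI overlay channel */

typedef struct {
    bool enabled[NCHAN];  /* channels not masked out */
    bool at_sof[NCHAN];   /* channel currently presents the SOF marker */
} sync_state_t;

/* Returns true when every enabled channel has reached SOF, i.e. the
 * streams are aligned and may be passed on. Otherwise the caller keeps
 * reading (and discarding) data from the channels not yet at SOF. */
bool streams_aligned(const sync_state_t *s)
{
    for (int i = 0; i < NCHAN; i++)
        if (s->enabled[i] && !s->at_sof[i])
            return false;
    return true;
}
```

The same predicate, applied to the EOL marker instead of SOF, describes the per-line re-alignment step.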
To have full information on a pixel's color, two consecutive 4:2:2 words are needed; hence, the next block doubles the width of the bus, at the same time providing data at a reduced rate. After this operation, each input word carries complete information on two pixels (two 8-bit luma values and two 8-bit chroma values). Such a data stream is passed to the Linear Arithmetic Unit.
The Linear Arithmetic Unit (LAU) transforms the color information of two pixels into information on a single pixel, where the resulting color components are linearly dependent on the input pixels' components. It can be used for composing halves of images coming from the input data paths. It can also perform subtraction and return the absolute value of the result. All the operations are done in saturation arithmetic. The unit is used, e.g., for calculating an anaglyph. From there, the data are passed to the OSD value-keyed composition block. This is a simple comparator and stream multiplexer: it compares the value (luma) component of the overlay image pixel with a constant value stored in one of its registers. If the values match, the module outputs the pixel from the LAU block, ignoring the overlay. If the values do not match, the overlay pixel is applied instead of the calculated one. This allows for generating an OSD with a wide range of colors. Finally, the data are converted back to 16-bit-wide words.
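Both operations can be sketched on 8-bit components. The Q8 fixed-point weight format and the function names are our assumptions; the firmware's coefficient representation may differ.

```c
#include <stdint.h>

/* Sketch of an LAU-style operation: a weighted sum of two 8-bit
 * components with Q8 fixed-point weights (256 == 1.0), saturated to
 * the [0, 255] range as required by saturation arithmetic. */
uint8_t sat_weighted_sum(uint8_t a, uint8_t b, uint16_t wa_q8, uint16_t wb_q8)
{
    uint32_t v = ((uint32_t)a * wa_q8 + (uint32_t)b * wb_q8) >> 8;
    return (v > 255u) ? 255u : (uint8_t)v;   /* saturate instead of wrap */
}

/* Sketch of the value-keyed OSD multiplexer: if the overlay luma equals
 * the programmed key value, the overlay is "transparent" and the computed
 * video word passes through; otherwise the overlay word replaces it. */
uint16_t osd_compose(uint16_t video_word, uint16_t overlay_word,
                     uint8_t overlay_luma, uint8_t key)
{
    return (overlay_luma == key) ? video_word : overlay_word;
}
```

With weights of 0.5 each, `sat_weighted_sum` reproduces the "50/50" mode; with weights summing above 1.0, the saturation prevents wrap-around artifacts.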

Software design
The Microblaze processor runs a controller application written in the C programming language. No operating system is in use. The core operates at a frequency of 100 MHz. The timing is governed by a fixed-interval timer, which raises an interrupt every 50 ms. In the main loop, the application monitors the presence of all the signal sources and the HDMI signal sink, and adapts the output video resolution and frame rate. If no overlay signal is present, the module is able to provide a simple OSD using a software font renderer and a DMA circuit.
The particular analyses are implemented as follows: (a) "average", "anaglyph", "difference": these effects use the default Video DMA configuration, where the full frame is sent to the Video Combiner block.
(b) "over and under", "side by side", The decimation blocks are set to reduce number of lines (columns) by a factor of two. VDMA circuits are informed that the input frame height (width) is reduced by a factor of two; however, the output frame parameters and buffer configuration remain unchanged. The unused parts of the image are set to zeros, and the Video Combiner block is set for adding the images together. The principle of operation is shown in Fig. 9.
(c) "line by line", This mode is probably the most complicated. The decimation blocks are set to reduce the number of lines by the factor of two, where one of them returns odd and the other the even lines. The VDMA blocks are informed that the output image width is doubled, whereas the height is reduced to half of the original value. In this configuration, the VDMA returns line composed of a pixel strip from the original image and pixel strip of black pixels. The principle of operation is shown in Fig. 10.
(d) "color compare", In this mode the input image is divided into nine areas by two horizontal and two vertical split lines. For each area a mean RGB color value is calculated. This is done using YU′V′ color space on the hardware level, which is then converted to RGB, as such approach requires less operations per pixel. In this mode the Microblaze is providing a graphical overlay illustrating balance of RGB components between corresponding areas from both cameras. The overlay is drawn in the memory and then read using the overlay VDMA circuit.

Device evaluation
The Image Analyzer was positively tested on a professional movie set, with cinema-grade cameras. Such an arrangement allows testing the device in real-life conditions; however, it is not well suited for presentation of the device's operation. Therefore, for the purpose of this article, the analyzer was also supplied with a set of two completely different static images. Figure 11 presents the output images from the analyzer supplied with a video stream from a calibrated stereoscopic camera set-up (on the left) and with the different static images (on the right).
The currently implemented set of features is focused mainly on the alignment of the optical tract of the rig. On the set, the analyzer is routinely used for the calibration of the rig. For this process, the stereo base is reduced to zero and the cameras are set to look along the same optical line. The operators prefer to perform the calibration using the anaglyph mode and use the line-by-line mode as a cross-check; these modes were hence used most often during the calibration. The images presented using the sum and difference modes were found not suitable for calibration.
After calibration, the cameras are moved apart to obtain the target stereo base. To verify whether the desired 3D perception is achieved, the resulting stereoscopic image is observed. This could be done with the use of the anaglyph mode, although the colors would then be at least somewhat distorted. It is much more convenient to use a monitor dedicated to presenting stereoscopic content. For such devices, the analyzer provides the side-by-side and over-and-under modes, which can be interpreted as stereoscopic video by any 3D-capable display. This allows the operator to check whether the depth perception of the displayed scene matches expectations.
The presented analyzer was shown to capture and return streams of 1920 × 1080 video at 60 fps. This matches the capabilities of the most popular 3G SDI links. The next-generation 6G SDI links offer twice the throughput of their predecessors, and the serialization/deserialization components are already available. The current firmware implementation runs with clocks of around 150 MHz, whereas the utilized FPGA resources should also operate correctly with clocks in the range of 200-250 MHz. Therefore, upscaling the clock frequency could increase the analyzer performance by approximately two-thirds (allowing it to reach 100 fps at 1920 × 1080 resolution). However, to fully support the 6G SDI links, the firmware would also have to be adapted by broadening the video data buses.
Despite rich functionality, the utilization of the XC7K355T FPGA circuit is still quite low: only 30% of logic slices and 11% of embedded RAM blocks are in use. This opens up a possibility to implement the whole system, including the streaming computer, in a single Zynq (ARM + FPGA) device-massively reducing the cost and size of the final solution.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.