Depth error correction for projector-camera based consumer depth cameras

This paper proposes a depth measurement error model for consumer depth cameras such as the Microsoft Kinect, and a corresponding calibration method. These devices were originally designed as video game interfaces, and their output depth maps usually lack sufficient accuracy for 3D measurement. Models have been proposed to reduce these depth errors, but they only consider camera-related causes. Since the depth sensors are based on projector-camera systems, we should also consider projector-related causes. Also, previous models require disparity observations, which are usually not output by such sensors, so cannot be employed in practice. We give an alternative error model for projector-camera based consumer depth cameras, based on their depth measurement algorithm, and intrinsic parameters of the camera and the projector; it does not need disparity values. We also give a corresponding new parameter estimation method which simply needs observation of a planar board. Our calibrated error model allows use of a consumer depth sensor as a 3D measuring device. Experimental results show the validity and effectiveness of the error model and calibration procedure.


Introduction
Recently, various consumer depth cameras such as the Microsoft Kinect V1/V2 and Asus Xtion have been released. Since such consumer depth sensors are inexpensive and easy to use, they are widely deployed in many fields for a variety of applications [1,2].
These consumer depth cameras can be divided into two categories: (i) projector-camera based systems in which a projector casts a structured pattern onto the surface of a target object, and (ii) time-of-flight (ToF) sensors that measure the time taken for light to travel from a source to an object and back to a sensor. ToF sensors generally give more accurate depths than projector-camera based ones; the latter, however, remain useful because of their simplicity and low cost.
Since projector-camera based devices include cameras and a projector, output errors may be caused by errors in determining their intrinsic parameters. As long as such devices are used as human interfaces for video games, such errors are unimportant, but they matter when the devices are used for 3D measurement. For example, even when a Kinect V1 captures a planar object, the resultant depth maps have errors (see Fig. 1, left), as also reported elsewhere [3,4]. Thus, in this paper, we focus on projector-camera based consumer depth cameras and propose a depth error correction method based on their depth measurement algorithm.
Fig. 1 Left: observation errors in Kinect output. Right: compensated values using our method.
Various intrinsic calibration methods have already been proposed for Kinect and other projector-camera based depth cameras [3-11]. Smisek et al. [3] and Herrera et al. [4] proposed calibration and depth correction methods for Kinect that reduce the depth observation errors. Raposo et al. [6] extended Herrera et al.'s method to improve stability and speed. However, these methods only considered distortion due to the infrared (IR) camera. Since projector-camera based depth sensors include cameras and a projector, we should also consider projector-related sources of error.
We previously proposed a depth error model for Kinect including projector-related distortion [5]. Darwish et al. [8] also proposed a calibration algorithm that considers both camera- and projector-related parameters for Kinect. However, these methods, like other previous methods, require disparity observations, which are not generally provided by such sensors. Thus, methods that require disparity observations cannot be employed in practice to compensate errors in data from existing commercial sensors.
Some researchers employ non-parametric models for depth correction, but either a calibration board needs to be shown perpendicular to the sensor [9,11], or ground truth data obtained by simultaneous localization and mapping (SLAM) are required [12,13]. Jin et al. [10] proposed a calibration method using cuboids, but their method is also based on disparity observations. Other researchers proposed error distribution models for Kinect [14,15], but this research did not focus on error compensation.
To provide straightforward procedures for calibration and error compensation for depth data, including previously captured data, our method introduces a parametric error model that considers (i) both camera and projector distortion, and (ii) errors in the parameters used to convert disparity observations to actual disparity. To estimate the parameters in the error model, we propose a simple method that resembles the common color camera calibration method [16]. Having placed a planar calibration board in front of the depth camera and captured a set of images, our method efficiently optimizes the parameters, allowing us to reduce the depth measurement errors (see Fig. 1, right). Our compensation model only requires depth data, without the need for disparity observations. Thus we can apply our error compensation to any depth data captured by projector-camera based depth cameras.
We note that the calibration method introduced in this paper is designed for Kinect because it is the most common projector-camera based depth sensor. However, it is potentially more generally useful because it is based on a principle common to other projector-camera based depth sensors.
Section 2 describes the measurement algorithm used by Kinect, and Section 3 describes our parametric error model and parameter estimation. Section 4 shows experimental results demonstrating the effectiveness of our proposed method, while Section 5 summarizes our paper.

Depth measurement by Kinect
Since our method is based on the measurement algorithm used by the Kinect, we first outline this algorithm and the depth sensor, which consists of an IR camera and an IR projector. The IR projector projects fixed speckle patterns onto the target, which is observed by the IR camera. By comparing the observed patterns with reference patterns captured in advance, Kinect estimates depth information for the target. The reference patterns are observations made by the IR camera when the IR projector casts the speckle pattern on the reference plane $\Pi_0$ [17] (see Fig. 2).
Here, we assume that pattern $P(\mathbf{x}_{pi})$ is projected in the direction of point $\mathbf{x}_{pi} = [x_{pi}, y_{pi}]^T$ onto the reference plane $\Pi_0$, and that pattern $P(\mathbf{x}_{pi})$ on $\Pi_0$ is projected onto the 2D position $\mathbf{x}_{ci}^{(\Pi_0)} = [x_{ci}^{(\Pi_0)}, y_{ci}^{(\Pi_0)}]^T$ in the IR camera. We obtain the following relationship:
$$x_{ci}^{(\Pi_0)} = x_{pi} + \frac{fw}{Z_0} \tag{1}$$
where $w$ is the baseline distance between the camera and the projector, $f$ is the focal length of the IR camera (and the IR projector), and $Z_0$ is the distance between the reference plane $\Pi_0$ and the Kinect. Next, we consider the target observation measured at point $Q_i$ and assume that pattern $P(\mathbf{x}_{pi})$ is observed at $\mathbf{x}_{ci}$ by the IR camera. By considering the reference patterns, the pattern's observed position when it is projected onto reference plane $\Pi_0$, $\mathbf{x}_{ci}^{(\Pi_0)}$, can be obtained. We calculate the disparity $\delta_i$ from the reference plane observation at $\mathbf{x}_{ci}$ as follows:
$$\delta_i = x_{ci} - x_{ci}^{(\Pi_0)} \tag{2}$$
Then $\mathbf{X}_i = [X_i, Y_i, Z_i]^T$, which is the 3D position of point $Q_i$, can be calculated as
$$X_i = \frac{Z_i}{f}(x_{ci} - x_{cc}), \quad Y_i = \frac{Z_i}{f}(y_{ci} - y_{cc}), \quad Z_i = \frac{Z_0}{1 + \frac{Z_0}{fw}\delta_i} \tag{3}$$
where $[x_{cc}, y_{cc}]^T$ is the IR camera's principal point and $Z_i$ is the depth of point $Q_i$.
Kinect does not output the disparity values $\delta_i$, but only normalized observations $\delta'_i$ from 0 to 2047 (in Kinect disparity units: kdu) [17], where $\delta_i = m\delta'_i + n$. The driver software for Kinect (Kinect for Windows SDK and OpenNI) uses these to calculate and output depth values $Z_i$ based on the following equation:
$$Z_i = \frac{Z_0}{1 + \frac{Z_0}{fw}(m\delta'_i + n)} \tag{4}$$
The disparity between the camera and projector, $d_i$, can be expressed as follows:
$$d_i = x_{ci} - x_{pi} = m\delta'_i + n + \frac{fw}{Z_0} = \frac{fw}{Z_i} \tag{5}$$
Note that recent versions of the driver software do not support output of the normalized disparities $\delta'_i$, so these are generally unobtainable. We therefore propose a method to calibrate and compensate the depth data obtained by Kinect that requires neither the disparity nor the normalized disparity observations.
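Because the driver exposes only depth, the camera-projector disparity has to be recovered from the depth through the triangulation relation d = f w / Z. The following is a minimal sketch of that conversion and its inverse. The function names and the focal length of 580 pixels are illustrative assumptions, not values from this paper; the 75 mm baseline is the value adopted later in the text.

```python
def depth_to_disparity(z_mm, f_px, w_mm):
    """Camera-projector disparity in pixels from depth: d = f * w / Z."""
    if z_mm <= 0:
        raise ValueError("depth must be positive")
    return f_px * w_mm / z_mm


def disparity_to_depth(d_px, f_px, w_mm):
    """Depth in mm from camera-projector disparity: Z = f * w / d."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * w_mm / d_px


F_PX = 580.0  # assumed focal length in pixels (illustrative only)
W_MM = 75.0   # baseline length in mm

if __name__ == "__main__":
    d = depth_to_disparity(1000.0, F_PX, W_MM)
    # Converting back should return the original depth.
    print(d, disparity_to_depth(d, F_PX, W_MM))
```

Since Z = f w / d, a fixed disparity error translates into a depth error that grows roughly with the square of the depth, which is why small parameter errors matter more for distant targets.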

Depth error model
The depth measurement model described above holds only in an ideal case. In practice, when Kinect observes a planar target, the output depth maps have errors, as previously noted [3,4] (see Fig. 1). To be able to compensate for them, we consider not only camera distortion but also projector distortion in our model.

Distortion parameters
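For the IR camera, we consider radial lens distortion with two coefficients, as estimated by Zhang's method [16]:
$$\tilde{\mathbf{x}}_{ci} = \mathbf{x}_{ci} + (\mathbf{x}_{ci} - \mathbf{x}_{cc})\left(k_{c1}\|\mathbf{u}_{ci}\|^2 + k_{c2}\|\mathbf{u}_{ci}\|^4\right) \tag{6}$$
where $\mathbf{x}_{ci} = [x_{ci}, y_{ci}]^T$ and $\tilde{\mathbf{x}}_{ci}$ are the ideal and the observed (distorted) 2D positions, $\mathbf{u}_{ci}$ gives the normalized coordinates of $\mathbf{x}_{ci}$, and $\mathbf{x}_{cc} = [x_{cc}, y_{cc}]^T$ is the principal point of the IR camera. $k_{c1}$ and $k_{c2}$ are the distortion parameters of the IR camera.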
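We assume the same distortion model can be used for the projector:
$$\tilde{\mathbf{x}}_{pi} = \mathbf{x}_{pi} + (\mathbf{x}_{pi} - \mathbf{x}_{pc})\left(k_{p1}\|\mathbf{u}_{pi}\|^2 + k_{p2}\|\mathbf{u}_{pi}\|^4\right) \tag{7}$$
where $\mathbf{x}_{pi}$ and $\tilde{\mathbf{x}}_{pi}$ are the ideal and distorted 2D positions, and $\mathbf{u}_{pi}$ gives the normalized coordinates of $\mathbf{x}_{pi}$. $\mathbf{x}_{pc} = [x_{pc}, y_{pc}]^T$ is the principal point of the projector, and $k_{p1}$ and $k_{p2}$ are the distortion parameters of the IR projector.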
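To make the role of the two coefficients concrete, the sketch below applies a two-coefficient radial distortion to an ideal pixel position and returns the resulting horizontal displacement. It assumes the radial form commonly used with Zhang's calibration, in which the offset from the principal point is scaled by k1 r^2 + k2 r^4 with r measured in normalized coordinates. The function names and numeric values are hypothetical. The same function can stand in for either the camera (k_c1, k_c2) or the projector (k_p1, k_p2).

```python
def distort(x, y, cx, cy, f, k1, k2):
    """Apply two-coefficient radial distortion to an ideal pixel (x, y).

    (cx, cy) is the principal point and f the focal length in pixels.
    Assumed model: offset from the principal point scaled by k1*r^2 + k2*r^4.
    """
    u = (x - cx) / f          # normalized coordinates
    v = (y - cy) / f
    r2 = u * u + v * v
    factor = k1 * r2 + k2 * r2 * r2
    return x + (x - cx) * factor, y + (y - cy) * factor


def horizontal_displacement(x, y, cx, cy, f, k1, k2):
    """Horizontal shift (distorted x minus ideal x) caused by the distortion."""
    xd, _ = distort(x, y, cx, cy, f, k1, k2)
    return xd - x


if __name__ == "__main__":
    # Hypothetical intrinsics and coefficients, for illustration only.
    print(horizontal_displacement(500.0, 240.0, 320.0, 240.0, 580.0, 0.1, 0.01))
```

Because disparity is measured along the baseline direction, only the horizontal component of the displacement enters the disparity error.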
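We now consider pattern $P(\mathbf{x}_{pi})$ projected in the direction of point $\mathbf{x}_{pi}$ (see Fig. 3). Because of projector distortion, however, pattern $P(\mathbf{x}_{pi})$ is actually projected in the direction of point $\tilde{\mathbf{x}}_{pi}$, and falls on point $Q_i$. In the camera, pattern $P(\mathbf{x}_{pi})$ is actually observed at position $\tilde{\mathbf{x}}_{ci}$ because of the camera distortion. Let $\tilde{d}_i$ be the observed (distorted) disparity at $\tilde{\mathbf{x}}_{ci}$:
$$\tilde{d}_i = \tilde{x}_{ci} - x_{pi} \tag{8}$$
On the other hand, considering $Q_i$ in Fig. 3, the ideal disparity $d_i$ is
$$d_i = x_{ci} - \tilde{x}_{pi} = \tilde{d}_i - \epsilon_c - \epsilon_p \tag{9}$$
where $\epsilon_c = \tilde{x}_{ci} - x_{ci}$ and $\epsilon_p = \tilde{x}_{pi} - x_{pi}$.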

Proposed error model
Equation (9) expresses the relation between the ideal disparity $d_i$ and the observed disparity $\tilde{d}_i$.
In practice, since the parameters in Eq. (5), i.e., $f$, $w$, $Z_0$, $m$, and $n$, include errors, we need to compensate for them. Here, let $d_i$ be the ideal value, and let $\tilde{d}_i$ be the value calculated by Eq. (5) from the erroneous parameters and the observed disparity.
Considering the errors in the parameters in Eq. (5) and collecting the coefficients, the following relation can be obtained:
$$d_i = \alpha\tilde{d}_i + \beta \tag{10}$$
where $\alpha$ and $\beta$ are parameters for compensating errors in $f$, $w$, $Z_0$, $m$, and $n$. A detailed derivation of Eq. (10) is shown in Appendix A. Thus, the ideal disparity $d_i$ can be expressed in terms of the observed disparity $\tilde{d}_i$, the distortion terms $\epsilon_c$ and $\epsilon_p$, and the parameters $\alpha$ and $\beta$ (Eq. (11)). By introducing $\alpha$ and $\beta$, we can compensate for errors in the parameters in Eq. (5) without observing the normalized disparity itself. Therefore, we calibrate not only the distortion parameters of the camera and the projector but also $\alpha$ and $\beta$, allowing us to compensate for errors in these values. In the next section, we describe parameter estimation for this error model.
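Appendix A is not reproduced in this section. The sketch below indicates why two parameters suffice, under the assumption that Eq. (5) has the form $d_i = m\delta'_i + n + fw/Z_0$, where $\delta'_i$ is the normalized disparity. Writing the erroneous parameter values with tildes (notation introduced here for illustration), the unobserved $\delta'_i$ can be eliminated:

```latex
\begin{align}
\tilde{d}_i &= \tilde{m}\,\delta'_i + \tilde{n} + \frac{\tilde{f}\tilde{w}}{\tilde{Z}_0},
\qquad
d_i = m\,\delta'_i + n + \frac{f w}{Z_0} \\
\delta'_i &= \frac{1}{\tilde{m}}\left(\tilde{d}_i - \tilde{n} - \frac{\tilde{f}\tilde{w}}{\tilde{Z}_0}\right) \\
d_i &= \underbrace{\frac{m}{\tilde{m}}}_{\alpha}\,\tilde{d}_i
     + \underbrace{n + \frac{f w}{Z_0}
       - \frac{m}{\tilde{m}}\left(\tilde{n} + \frac{\tilde{f}\tilde{w}}{\tilde{Z}_0}\right)}_{\beta}
\end{align}
```

Under this assumption, all five parameter errors collapse into one scale and one offset, so the individual values of $f$, $w$, $Z_0$, $m$, and $n$ never need to be estimated separately.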

Overview
In consumer depth sensors, since the projection patterns cannot be controlled, we cannot directly estimate the projector's distortion parameters. Instead, we estimate the error model parameters using the process flow shown in Fig. 4.
First, we obtain $N$ IR images and corresponding depth data for a calibration board (of known size and pattern) in arbitrary poses and positions. This lets us perform intrinsic calibration of the IR camera by Zhang's method [16]. As described in the previous section, we model the depth errors based on Eq. (11), for which the ideal disparity $d_i$ and the observed disparity $\tilde{d}_i$ are required. Here, we assume that the poses and positions of the board estimated by intrinsic camera calibration give ideal depth values, and calculate the ideal disparity $d_i$ from these poses and positions. The observed disparity values can be calculated from the observed depth values. Next, we estimate the error model parameters by minimizing the discrepancy in Eq. (11) based on $d_i$ and $\tilde{d}_i$. Table 1 summarizes the notation used in the following.

Camera calibration
First, intrinsic calibration of the IR camera is performed with Zhang's method [16], using the $N$ images captured by the IR camera. For camera calibration, $\mathbf{X}_k$, $\mathbf{x}_{bk}^{(j)}$, the size of the chessboard, and the number of checker patterns on the chessboard should be given. Zhang's method can estimate the focal length ($f$), the principal point ($u_{cc}, v_{cc}$), and the camera distortion parameters ($k_c = \{k_{c1}, k_{c2}\}$). Taking into account the disparity differences caused by the camera lens distortion, Eq. (11) can then be rewritten as the relations in Eq. (13). In addition, we can obtain the board's pose and position in each image $j$: $(R^{(j)}, t^{(j)})$. This information is used in the following processes.

Projector calibration and disparity compensation parameter estimation
Next, we estimate the distortion parameters for the projector and the disparity conversion parameters.
To do so, we use the relations in Eq. (13) to give Eq. (14). The ideal disparities $d_i^{(j)}$ can be calculated from the estimated poses and positions $R^{(j)}, t^{(j)}$ (Eq. (15)), using the 3D position on the board corresponding to direction $\mathbf{x}_{ci}^{(j)}$ (see Fig. 5). The observed disparity values $\tilde{d}_i^{(j)}$ can be estimated using:
$$\tilde{d}_i^{(j)} = \frac{f_o w}{\tilde{Z}_{ci}^{(j)}} \tag{16}$$
where $\tilde{Z}_{ci}^{(j)}$ is the observed depth value at position $\mathbf{x}_{ci}^{(j)}$, and $f_o$ is the focal length of the depth sensor used for calculating the observed disparity values $\tilde{d}_i^{(j)}$; $f_o$ can be estimated from $\tilde{\mathbf{X}}_{ci}^{(j)}$.
The camera distortion term $\epsilon_{ci}^{(j)}$ is expressed using the approximate undistortion model [18] (Eq. (18)).
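The ideal and observed disparities described above can be sketched as follows. The ideal value comes from intersecting the viewing ray of a pixel with the calibrated board plane, and the observed value comes from the sensor depth. This is a minimal illustration, assuming the convention X_cam = R X_board + t with the board lying in its own Z = 0 plane, and the triangulation relation d = f w / Z. All function names and numeric values are hypothetical.

```python
def board_point_on_ray(x, y, f, cx, cy, R, t):
    """Intersect the viewing ray through pixel (x, y) with the board plane.

    R is a 3x3 rotation (nested lists) and t a translation in mm, mapping
    board coordinates to camera coordinates. The board plane passes through
    t, and its normal is the third column of R.
    """
    ray = ((x - cx) / f, (y - cy) / f, 1.0)
    normal = (R[0][2], R[1][2], R[2][2])
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-12:
        raise ValueError("ray is parallel to the board plane")
    scale = sum(n * ti for n, ti in zip(normal, t)) / denom
    return tuple(scale * r for r in ray)


def ideal_disparity(x, y, f, cx, cy, R, t, w):
    """Disparity implied by the calibrated board pose at pixel (x, y)."""
    z = board_point_on_ray(x, y, f, cx, cy, R, t)[2]
    return f * w / z


def observed_disparity(z_obs, f_o, w):
    """Disparity recovered from the depth value output by the sensor."""
    return f_o * w / z_obs


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Fronto-parallel board 1 m away; sensor reports 1010 mm at this pixel.
    print(ideal_disparity(400.0, 300.0, 580.0, 320.0, 240.0,
                          identity, [0.0, 0.0, 1000.0], 75.0))
    print(observed_disparity(1010.0, 580.0, 75.0))
```

Pairs of such values, collected over all corners and board poses, are the data on which the error model parameters are fitted.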
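Based on the above equations, we can estimate $\Delta k_c$, $k_p$, $\alpha$, and $\beta$ by minimization (Eq. (19)): the estimates $\Delta\hat{k}_c$, $\hat{k}_p$, $\hat{\alpha}$, and $\hat{\beta}$ are the arg min of the discrepancy between the ideal disparities $d_i^{(j)}$ and those predicted by the error model from the observed disparities $\tilde{d}_i^{(j)}$. In our experiments, we employed the multi-start algorithm in the MATLAB Global Optimization Toolbox (version 2017b) for optimization. Note that $w$ is not included in the parameters to be calibrated, because $w$ is proportional to the disparities $d_i$ and $\tilde{d}_i$; instead, we employ $w = 75$ mm based on Ref. [19]. Since the optimization requires initial values, they were determined as follows.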
We first consider the initial values of the conversion parameters $\alpha$ and $\beta$. From Eqs. (15) and (16), substituting 0 for $\epsilon_{ci}^{(j)}$ and $\epsilon_{pi}^{(j)}$ in Eq. (11) gives the linear relation
$$d_i^{(j)} = \alpha\tilde{d}_i^{(j)} + \beta \tag{20}$$
allowing $\alpha$ and $\beta$ to be determined by least-squares fitting.
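The least-squares initialization of the two conversion parameters amounts to fitting a straight line through pairs of observed and ideal disparities. The following is a minimal sketch in plain Python, assuming a linear relation between the two once the distortion terms are set to zero. The function name and the synthetic values are hypothetical.

```python
def fit_alpha_beta(d_obs, d_ideal):
    """Least-squares fit of d_ideal ~= alpha * d_obs + beta."""
    n = len(d_obs)
    if n < 2 or n != len(d_ideal):
        raise ValueError("need at least two matching disparity pairs")
    mean_x = sum(d_obs) / n
    mean_y = sum(d_ideal) / n
    sxx = sum((x - mean_x) ** 2 for x in d_obs)
    if sxx == 0.0:
        raise ValueError("observed disparities must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(d_obs, d_ideal))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


if __name__ == "__main__":
    # Synthetic pairs generated with alpha = 1.02 and beta = -0.5.
    observed = [30.0, 40.0, 50.0, 60.0]
    ideal = [1.02 * d - 0.5 for d in observed]
    print(fit_alpha_beta(observed, ideal))
```

The fit is only well conditioned when the board is observed over a range of distances, since a single distance yields nearly identical disparities and cannot separate scale from offset.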
Next, we consider the initial values for the distortion errors $\hat{\epsilon}_{pi}^{(j)}$. Considering the relations between the camera and the projector and Eq. (8), we can estimate $x_{pi}^{(j)}$ and $\tilde{x}_{pi}^{(j)}$ (Eqs. (21) and (22)), where $A_p$ denotes the intrinsic parameters of the projector. Using Eq. (7), we then obtain $\hat{\epsilon}_{pi}^{(j)}$ (Eq. (23)). Finally, we estimate the optimal values of $\epsilon_{pi}$ (and $k_{p1}$, $k_{p2}$), $\alpha$, and $\beta$ by minimizing Eq. (19) from these initial values.

Depth compensation using calibration results
Finally, we describe the compensation process for the depth data obtained from the depth sensors.
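Here, we consider the 3D data $\tilde{\mathbf{X}}_{ci}^{(j)}$ at pixel $\tilde{\mathbf{x}}_{ci}^{(j)}$. First, we obtain $\tilde{d}_i$ and $\epsilon_c$ from Eqs. (16) and (17). $\tilde{x}_{pi}$ can be calculated from Eq. (22). Next, we calculate $\epsilon_{pi}$ from Eq. (23). Then, the compensated disparity $d_i$ can be calculated from Eq. (11), so the compensated depth $Z_i$ can be obtained as
$$Z_i = \frac{fw}{d_i}$$
and the compensated 3D data $\mathbf{X}_i$ is then obtained from $Z_i$ in the same way as in Eq. (3).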
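As an illustration of the last step, the sketch below compensates a single depth value using only the linear part of the model, that is, with both distortion terms set to zero. This is a simplification made here and not the full procedure. It then back-projects the pixel with the pinhole model. Function names and numeric values are assumptions.

```python
def compensate_depth_linear(z_obs, f_o, f, w, alpha, beta):
    """Compensate a depth value with the linear disparity correction only.

    Steps: depth -> observed disparity -> corrected disparity -> depth.
    Distortion terms are deliberately ignored in this simplified sketch.
    """
    d_obs = f_o * w / z_obs
    d = alpha * d_obs + beta
    if d <= 0:
        raise ValueError("corrected disparity must be positive")
    return f * w / d


def backproject(x, y, z, f, cx, cy):
    """3D point for pixel (x, y) at depth z under the pinhole model."""
    return (z * (x - cx) / f, z * (y - cy) / f, z)


if __name__ == "__main__":
    z = compensate_depth_linear(1000.0, 580.0, 580.0, 75.0, 1.0, 0.5)
    print(z, backproject(400.0, 300.0, z, 580.0, 320.0, 240.0))
```

With alpha = 1, beta = 0, and equal focal lengths, the function returns the input depth unchanged, which is a convenient sanity check before applying calibrated values.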

Experiments
We performed the following experiments to confirm the validity of our proposed error model and error compensation. In the experiments, we used a Kinect for Xbox (Device 1, abbreviated Dev. 1, etc.), a Kinect for Windows (Device 2), and an ASUS Xtion Pro (Device 3); all of these devices are based on the same PrimeSense measurement algorithm [20].
We compared the compensated results from three models with the observed raw data. We captured 12 observations of the chessboard in different arbitrary poses and positions in the experiments. The distances between the board and the device were about 500-1300 mm. A leave-one-out method was used for evaluating the validity of the proposed error model: one observation was used for evaluation and the remaining observations were used for estimating the error model parameters. From the observations, we manually obtained the 2D positions of the chessboard corners (54 points per image).
Table 2 shows the residual errors after the calibration phase, and Table 3 shows the errors in evaluations.Here, the errors were calculated as the averaged distances between the compensated (or observed) positions and ground truth positions of the chessboard corners.We used the 3D positions obtained from the color camera observations as the ground truth positions.
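The leave-one-out protocol and the averaged corner distance used as the error measure can be sketched as follows. This is a minimal illustration with hypothetical function names; the calibration and compensation steps themselves are left out.

```python
import math


def leave_one_out(observations):
    """Yield (training, held_out) pairs, holding out each observation once.

    `observations` must be a list so that it can be sliced.
    """
    for i in range(len(observations)):
        yield observations[:i] + observations[i + 1:], observations[i]


def mean_corner_error(estimated, ground_truth):
    """Average Euclidean distance between corresponding 3D corner positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(math.dist(p, q) for p, q in zip(estimated, ground_truth))
    return total / len(estimated)


if __name__ == "__main__":
    # 12 observation identifiers, as in the experiments.
    for training, held_out in leave_one_out(list(range(12))):
        print(held_out, len(training))
```

In each round, the model parameters would be estimated on the 11 training observations and the error measure evaluated on the corners of the held-out one.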
These comparative results show that all three models can reduce errors compared to the uncompensated results, in both the calibration and evaluation phases. The errors compensated by (a) our proposed model were the lowest, followed by (b) the model that considered camera distortion and linear relations, and then (c) the model that considered only camera distortion. The number of parameters used in these models has the same ordering: (a) has the most, followed by (b) and then (c). These results suggest that using all the parameters considered in our proposed error model is helpful in improving the quality of the 3D depth data.
After calibration, we evaluated the flatness of the compensated observations of the chessboard by measuring plane fitting errors within the chessboard regions. Table 4 compares these plane fitting errors, including those of the compensated observations from our proposed model.
Next, we evaluated the method's robustness to errors in the given baseline length $w$. Our method assumes that the target device's baseline length is known from a source such as Ref. [19]. If it is not, we need to measure it ourselves, and the measured length may include errors. Thus, we evaluated the robustness to errors in the baseline length $w$ of up to ±2 mm.
Table 5 shows the errors when the baseline includes errors. As can be seen, our proposed model can reduce errors between the compensated positions and the ground truth positions even when the given baseline length includes errors. This is because our model treats an error in the baseline length $w$ as one of the parameter errors in Eq. (5), which are compensated by $\alpha$ and $\beta$.
These experimental results confirm that our proposed model can improve the quality of 3D depth data obtained by consumer depth cameras such as Kinect and Xtion.
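The flatness evaluation reported in Table 4 relies on plane fitting errors. The paper does not state how the plane was fitted, so the sketch below is only one plausible realization: it assumes an ordinary least-squares fit of z = a x + b y + c, which is adequate for boards roughly facing the sensor, and reports the mean point-to-plane distance. Function names are hypothetical.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c via the normal equations."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(n)]]
    rhs = [sxz, syz, sz]
    det = _det3(A)
    if abs(det) < 1e-12:
        raise ValueError("points are degenerate (e.g. collinear)")
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = rhs[r]
        coeffs.append(_det3(M) / det)
    return tuple(coeffs)


def plane_fitting_error(points):
    """Mean distance of the points from their fitted plane."""
    a, b, c = fit_plane(points)
    norm = math.sqrt(a * a + b * b + 1.0)
    return sum(abs(a * x + b * y + c - z) for x, y, z in points) / (norm * len(points))
```

A perfectly flat compensated board gives an error close to zero, so lower values indicate that the compensation has removed more of the surface warping.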

Summary
In this paper, we have proposed and evaluated a depth error model for projector-camera based consumer depth cameras such as the Kinect, and an error compensation method based on calibration of the parameters involved. Since our method only requires depth data without disparity observations, we can apply it to any depth data captured by projector-camera based depth cameras such as the Kinect and Xtion. Our error model considers (i) both camera and projector distortion, and (ii) errors in the parameters used to convert from normalized disparity to depth data. The optimal model parameters can be estimated by showing a chessboard to the depth sensor at multiple arbitrary distances and poses. Experimental results show that the proposed error model can reduce depth measurement errors for both Kinect and Xtion by about 70%. Our proposed model has significant advantages when using a consumer depth camera as a 3D measuring device.
Future work includes further investigation of the error model, improvement of the optimization approach for parameter estimation, and implementation of a calibration tool based on the proposed error model for various projector-camera based depth cameras, such as the Intel RealSense and Occipital Structure Sensor, as well as the Microsoft Kinect.

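Table 1
Notation
$j$: index of observations, $j = 1, \ldots, N$
$\tilde{\mathbf{X}}_{ci}^{(j)}$: 3D positions at $\mathbf{x}_{ci}^{(j)}$ obtained by sensor
$\mathbf{X}_k$: 3D positions of chessboard corners ($Z_{bk} = 0$)
$\mathbf{x}_{bk}^{(j)}$: 2D positions of corners in image $j$
$\tilde{\mathbf{X}}_{bk}^{(j)}$: 3D positions at $\mathbf{x}_{bk}^{(j)}$ obtained by sensor

Lens distortion may be considered. Let $k_c$ be the camera distortion parameter and $\epsilon_c$ the disparity error caused by $k_c$; the two are related through the camera distortion model.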
Compared models: (a) our proposed method; (b) model considering camera distortion and conversion parameters (without $\epsilon_p$); (c) model considering only camera distortion errors (with $\Delta\epsilon_c$); (d) no compensation, i.e., observed raw data.

Table 2
Comparison of averaged errors during calibration

Table 3
Comparison of averaged errors in evaluation

Table 4
Comparison of plane fitting errors in evaluation

Table 5
Residual errors with varying baseline length errors