Introduction

With the rapid development of machine vision, multi-sensor fusion data are increasingly applied to robots for environment perception tasks such as autonomous driving. The fusion of different sensor data relies on an accurate external calibration. 3D LiDAR and 2D camera are the most commonly used sensors in the perception stage [1,2,3]. In most cases, the camera captures rich environment information including texture and color, while the LiDAR acquires accurate range measurements with a wide angular field of view. Combining LiDAR and camera can overcome the limitations of a single sensor [4, 5] and improve the perception ability of intelligent robots.

In the past few years, many calibration techniques have been proposed, especially for the LiDAR-camera calibration problem [6,7,8,9,10,11,12,13,14,15]. In general, these techniques can be divided into two groups, i.e., off-line and online. Off-line methods [6, 8, 9, 11] require significant manual effort and calibrate with specific targets, e.g., chessboards. Their limitation is that parameters obtained by off-line calibration cannot keep the LiDAR and camera accurately calibrated after environmental changes or vibrations, such as bumps and jolts, in real-time applications. Traditional online methods [10, 12, 13] have been proposed to overcome these deficiencies. These methods gradually converge to accurate parameters by handling continuously input images and point clouds. Generally, they consist of three steps: feature extraction, feature matching, and global regression. The features employed in most state-of-the-art approaches are handcrafted ones, such as road edges [12]. Under complex and changeable environments, matching with handcrafted features is challenging and prone to failure.

Fig. 1

Architecture of Iter-CalibNet. a The RGB images of consecutive frames; b depth maps of consecutive frames generated from the mis-calibrated point clouds. The pairs of camera images and depth maps are input to the calibration network frame by frame. Meanwhile, the calibration result of the previous frame is used to help the preliminary calibration of the next depth map. Finally, using the calibration matrix obtained by the system, we can calculate the geometric and photometric loss through a 3D transformation operation

Fig. 2

Depth map generated from 3D point cloud projection (best viewed by zooming in). Only the point cloud information projected into the RGB image plane is retained as network input. When the retained point cloud information is too limited, as shown in a, it is not enough to obtain accurate calibration results, even though its translation and rotation deviations are only (\(-0.25\) m, 0.1 m, 0.15 m) and (\(-0.174^\circ \), \(0.087^\circ \), \(-0.174^\circ \)), respectively. b shows the case in which the scaling factor is added to the point cloud projection

In recent years, several methods have emerged that use deep learning (DL) for online sensor calibration, such as [7, 14,15,16,17]. As far as we know, all these DL-based methods use initial roto-translation (rotation and translation) parameters to convert the 3D point cloud into a mis-calibrated depth map, and use the depth map and the corresponding 2D image as input to networks that estimate the calibration vectors. They mainly employ convolutional neural networks (CNNs) to extract different kinds of features from the single-frame depth map and the RGB image, respectively. The calibration process is end-to-end, from the input pair of sparse depth map and RGB image to the calibration parameters that correct the deviation of the depth map.

Among all these online DL-based calibration methods, generating sparse depth maps with different deviations from 3D point clouds is an indispensable pre-processing step. On the one hand, the introduction of deviations increases the diversity of training data and improves the generalization ability of the model. The training of the calibration network requires pairs of sensor data with different deviations to enhance this diversity. However, it is difficult to obtain sufficient training data generated by real LiDAR-camera pairs with different extrinsic calibration parameters. Therefore, it is necessary to diversify training data with random deviations from original LiDAR-camera data pairs. On the other hand, using sparse depth maps instead of point clouds avoids the inconsistency between the two modalities and improves calibration accuracy. The point cloud has an irregular and orderless data structure, and the positions of all points (i.e., the x-, y-, and z-axis coordinates) are continuous float values, while the image is regular, ordered, and discrete. Converting 3D point cloud data into a depth map as the network input facilitates feature extraction and matching across different sensor data. Besides, it significantly reduces the amount of point cloud information retained as network input, and consequently reduces the computational burden of the network on large-scale point cloud data.

However, there is no free lunch: the use of sparse depth maps introduces a confounder, which seriously affects the final calibration accuracy. Essentially, the depth map is generated by projecting a subset of the original point cloud to the 2D image plane through the initial deviation parameter. This initial deviation parameter also determines which points of the original point cloud are retained in this subset. In many cases, converting the subset of the 3D point cloud into a sparse depth map according to the initial deviation parameter as the network input introduces an excessive loss of information (since many points in the point cloud are discarded), which makes it difficult for the calibration model to obtain accurate calibration parameters. Even if the deviation is within the calibratable range, only a small part of the point cloud can be used, as shown in Fig. 2a. Therefore, the initial parameter with deviation is obviously a confounder in the calibration model.

Based on the above considerations, we construct a causal inference framework that analyzes the internal correlations among the components existing in the calibration process of 3D LiDAR and 2D camera. We determine the hidden confounder in the calibration model, and then find and cut off the backdoor path.

In addition, we carefully consider the network structure and geometric constraints that are also important to calibration accuracy, and propose Iter-CalibNet to achieve better LiDAR-camera calibration performance. The architecture of the proposed model is shown in Fig. 1. We use CNNs combined with a non-local module to extract geometric and photometric features from the different sensor data. The 6-DoF rigid body transformation between LiDAR and camera is decoupled from the fused deep features. Meanwhile, we apply an iterative calibration method to continuously optimize the roto-translation parameters, by inputting consecutive frames of synchronized images and point clouds that share consistent to-be-calibrated parameters between LiDAR and camera. In the calibration process, we use the predicted parameters of the previous frame to pre-calibrate the input point cloud of the next frame. By considering the constraint of relative pose between LiDAR and camera in continuous frames, the training of the network is based on three loss functions: the calibration error of the projected depth map of the point cloud, the photometric and geometric error obtained by the inter-frame pose transformation, and the regression error of the predicted parameters. The trained model is capable of accurately estimating extrinsic parameters over a range of deviations about and along any axis, and does not require any specific features or landmarks in the sensor environment.

In our previous work [16], we proposed the CalibRCNN model, which uses a recurrent CNN (RCNN) module to involve geometric constraints among adjacent frames and obtains preferable calibration. However, its generalization is limited since CalibRCNN lacks causality analysis and cannot eliminate the influence of the confounder. Different from CalibRCNN, in this study we construct a structural causal model (SCM) [18] to find the confounder, elaborately handle the input data and design the network to eliminate the influence of the confounder, and propose a novel calibration model which outperforms our previous CalibRCNN.

To evaluate the performance of the proposed method, we train the model on several sequences of the KITTI dataset [19] and test it on different sequences. The results show that our method can improve the calibration accuracy to a large extent, with average translation and rotation errors both much smaller than the best results obtained by other state-of-the-art methods. In addition, we conduct experiments to verify the effectiveness of each detailed strategy in the overall calibration framework.

The main contributions of this paper include the following:

  1. We build a causal inference model for the 3D-2D calibration system, and explore causal solutions to eliminate the influence of the confounding factor.

  2. We propose an end-to-end approach to tackle the LiDAR-camera calibration problem, combining a CNN and a non-local neural network for feature extraction and feature matching. Moreover, we introduce an iterative calibration method to optimize the network.

  3. We propose to leverage a synthetic view constraint to quantify the photometric and geometric errors between successive frames to optimize the calibration model.

Related work

For the multi-sensor fusion problem, various algorithms have been developed in recent years. LiDAR is usually combined with IMU (Inertial Measurement Unit) sensors [20, 21], since the IMU can provide a motion prior that helps account for high-frequency motion. The camera is also used for fusion with LiDAR. Zhang et al. [22] propose an odometry combining camera and LiDAR, where a LiDAR odometry based on scan matching is used to optimize the visual motion estimation. In addition, Wang et al. [4] and Zhang et al. [5] integrate three sensors (i.e., LiDAR, camera, and IMU) into a system to estimate the motion of a moving agent and build a 3D map. Most of these algorithms rely on accurate extrinsic calibrations.

The calibration between two sensors has been extensively studied. For the calibration between camera and IMU, Kelly et al. [23] propose an approach which requires a planar calibration target, and uses an unscented Kalman filter to estimate the transformation between the camera and the IMU. Furgale et al. [24] present a framework for jointly estimating the temporal offset between measurements of different sensors and their extrinsic parameters. Methods for automatically estimating extrinsic parameters between camera and IMU without targets or prior knowledge of the environment are also proposed in [25, 26].

As for the calibration between camera and LiDAR, some methods require specific targets to achieve automatic calibration, e.g., a chessboard-like marker [27] or a polygonal planar board [24]. Several methods avoid this limitation and achieve automatic extrinsic calibration without artificial targets by developing elaborate strategies. Levinson et al. [28] propose a method relying on the edges in the scene. Pandey et al. [29] present a method based on the maximization of mutual information between the sensor-measured surface intensities. Scott et al. [30] estimate the extrinsic parameters by minimizing the normalized information distance between intensity measurements obtained from both sensor modalities. Taylor et al. [31] extend standard techniques for motion-based calibration by incorporating estimations of the accuracy of each sensor's readings.

It is worth noting that deep learning has made outstanding achievements in 3D target detection [32], depth estimation [33], pose estimation [34], etc., which provide useful references for multi-sensor extrinsic calibration. Schneider et al. [14] propose RegNet, the first deep CNN for LiDAR-camera calibration. RegNet uses the Network-in-Network module for feature extraction and calibration parameter regression, and obtains excellent calibration results in the final test by using multiple models and performing iterative refinement. However, RegNet requires training multiple models with samples of different mis-calibration ranges, and the training optimization of the network does not consider the underlying geometric constraints of the calibration problem. CalibNet [15] adds geometric constraints by reducing the dense photometric error and the dense point cloud distance error between the mis-calibrated depth map and the target depth map, and consequently increases the generalization ability of the model. Unfortunately, although its loss function optimizes the prediction of rotation parameters well, the prediction of translation parameters is relatively unsatisfactory. Later, RGGNet [17] further improves the accuracy of online neural-network-based calibration. It considers the Riemannian geometry and utilizes a deep generative model to learn an implicit tolerance model. However, the accuracy of online calibration between LiDAR and camera still has much room for improvement.

We find that most existing end-to-end calibration methods do not take into account the pose transformation among successive frames. As demonstrated by many traditional online multi-sensor calibration methods [35], extrinsic parameters converge as successive frames are input. To take the relative pose transformation among sequential frames into account, we can borrow ideas from end-to-end DL-based odometry frameworks, because the calibration between LiDAR and camera is equivalent to estimating the relative pose between the LiDAR and the camera. DeepVO [36] is implemented as an end-to-end visual odometry by considering the importance of sequential dependence and complex motion dynamics of an image sequence. [34] and [37] use synthetic view constraints between successive frames of RGB images for model optimization. Inspired by these methods, in our previous work we integrated the geometric constraints among successive pairs of LiDAR and camera frames through an RCNN to achieve better calibration [16]. However, the RCNN cannot utilize the calibration of previous frames for the calibration of the current frame to accelerate convergence. Therefore, in this work we improve this by means of an iterative calibrating process.

Moreover, our previous work [16] also lacks further in-depth thinking about the causality in the calibration system, and therefore it cannot be generalized well to calibrate other pairs of LiDAR and camera. It is worth noting that a growing number of computer vision tasks have achieved substantial improvements through causality analysis [38,39,40]. Causal inference [18] can endow the system with the ability to pursue causal effects: we can eliminate confusion [41], improve system performance [42], and modularize reusable features with good generalization [43]. In this work, we adopt Pearl's structural causal model [18] as the basic analysis framework, which can introduce explicit causality into the calibration model: every node in the graph can be located and implemented in the calibration process.

Method

Given a multi-sensor device including LiDAR and camera, the collected 2D and 3D data can be fused for many tasks only when the external roto-translation transformation \(T_{\phi }\) between the two sensors is obtained. In our method, we first assume an initial estimate of the roto-translation transformation, \(T_\textrm{init}\), which is not accurate enough and causes a mis-calibrated depth map to some extent. Then, our model is able to predict the roto-translation transformation, \(T_\textrm{calib}\), which can calibrate the deviation existing in \(T_\textrm{init}\).

Fig. 3

The proposed SCM for the external calibration system between 3D LiDAR and 2D camera. a The original SCM without backdoor path adjustment. b The SCM with the backdoor path cut off. The paths with arrows indicate the causalities between two nodes: cause \(\rightarrow \) effect

Causality analysis

In order to further analyze the causality in the calibration process, we formulate the causalities with a structural causal model [18], as shown in Fig. 3, where node \(\varvec{X}\) denotes the 3D point cloud and 2D image selected as input to the calibration model, node \(\varvec{C}\) denotes the initial transformation parameters \(T_\textrm{init}\), node \(\varvec{D}\) is the sparse depth map with deviation obtained by projecting the input 3D point cloud onto the 2D image plane based on the initial transformation parameters \(\varvec{C}\), and node \(\varvec{Y}\) denotes the calibration parameters \(T_\textrm{calib}\) output by the system.

We can sort out all causalities shown in Fig. 3 as follows.

  1. \(\varvec{C}\rightarrow \varvec{X}\): \(\varvec{C}\) determines an approximate subset of points in the original point cloud that should be retained, as they correspond to the RGB image taken by the camera. This subset of the point cloud and the RGB image together form the calibration model input \(\varvec{X}\).

  2. \(\varvec{C}\rightarrow \varvec{D}\leftarrow \varvec{X}\): \(\varvec{X}\) determines the scene composition of \(\varvec{D}\), and \(\varvec{C}\) determines the deviation of \(\varvec{D}\). Therefore, \(\varvec{C}\) is obviously a confounder that simultaneously affects \(\varvec{X}\) and \(\varvec{D}\), which leads to a lower precision of the calibration parameters.

  3. \(\varvec{X}\rightarrow \varvec{Y}\leftarrow \varvec{D}\): Our immediate goal is to estimate from \(\varvec{X}\) the external calibration parameters \(\varvec{Y}\) that will properly fuse the data from the 3D LiDAR and the 2D camera. In our method, the deviation in \(\varvec{X}\) is further implied in \(\varvec{D}\), by initially converting the 3D points to the image coordinate system.

Through the analysis of the causal relationships among these components, we find that the initial transformation parameters \(\varvec{C}\) are the confounding factor in the constructed calibration system. \(\varvec{C}\) not only affects the expected output \(\varvec{Y}\), but also interferes with the input data \(\varvec{X}\) through the backdoor path \(\varvec{X}\leftarrow \varvec{C}\rightarrow \varvec{D}\rightarrow \varvec{Y}\). The main problem is the excessive lack of point cloud information caused by the initial parameters. Due to the limitation of the sensors, the camera's field of view is limited in the horizontal direction, while the LiDAR's field of view is very narrow in the vertical direction and is limited by low resolution (generally ranging from 16 to 64 rings). Therefore, the deviation of the initial parameters tends to reduce the overlap between the input image and the point cloud, or even eliminate it in some extreme cases, as shown in Fig. 2. However, the calibration algorithm needs to extract matching features from the point cloud and the image so that the calibration parameters can be calculated. This means that the scenes expressed by the data of the two sensors must have sufficient overlapping parts, which is a prerequisite for external calibration.

In order to avoid the confounding influence of \(\varvec{C}\) on the calibration model, we mainly adopt two approaches to eliminate the interference of \(\varvec{C}\rightarrow \varvec{X}\), in the data preprocessing and calibration process, respectively.

In the data preprocessing, we downscale the coordinates of points in the point cloud during the projection of the 3D point cloud to the depth map (detailed in Sect. “Training data generation”), which increases the point cloud information input to the network and expands the scope of the scene contained in the depth map.

In the calibration process, the iterative optimization method (detailed in Sect. “Network architecture”) is used in the training and testing stages. After each iteration, the deviation of the initial calibration parameters decreases, and more points can be used for calibration. Thus we can avoid the influence from the initial transformation parameters, and cut off the backdoor path in \(\varvec{C}\rightarrow \varvec{X}\).

Training data generation

The training input to the network consists of several consecutive RGB images taken by the camera, the corresponding point clouds collected by the LiDAR, and the basic parameters required for the calculation of the loss functions, such as the camera intrinsic parameters K, the camera pose between two frames \(T_\textrm{cam}\), and the ground truth roto-translation transformation \(T_{\phi }\) between LiDAR and camera. The ground truth transformation is generally consistent for all frames collected by the same device if there are no displacements between LiDAR and camera caused by, for example, vibrations.

Given a roto-translation transformation \(T_\textrm{init}\), we can project each 3D point \([x, y, z]\) to a depth map D in the image coordinate system, i.e., \(D(u, v)=z_{c}\), where (u, v) is the 2D coordinate of each projected point in the depth map and \(z_c\) is the pixel value. Note that here we downscale the coordinates of the depth map to retain more information from the point cloud, i.e., \(D(\frac{1}{s}*u,\frac{1}{s}*v)=z_{c}\), where s is the scaling factor of the depth map projected from the 3D point cloud, which can be set to \(\{1.0, 1.5, 2.0, 2.5, 3.0\}\). The sparsity of the point cloud and the deviation of the initial parameters cause a lack of geometric features in the depth map. Downscaling the points' coordinates through the scaling factor s increases the number of points projected onto the image. Thus it overcomes the above shortcomings and reduces the difficulty for the network to extract depth map features and perform feature matching. Finally, the influence of the deviation of the initial parameters can be eliminated, and the backdoor path between \(\varvec{C}\) and \(\varvec{X}\) can be cut off. As shown in Fig. 7, the downscaling operation can significantly reduce the calibration error, and the calibration accuracy under different initial deviations is close, which verifies that the influence of the initial deviation vanishes.
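For illustration, a minimal sketch of this downscaled projection is given below (Python/NumPy, with assumed inputs: an N \(\times \) 3 point array, a 4 \(\times \) 4 extrinsic matrix, a 3 \(\times \) 3 intrinsic matrix K, and an H \(\times \) W depth map); it is not the exact implementation.

```python
import numpy as np

def project_to_depth_map(points_xyz, T_init, K, H, W, s=2.0):
    """Project LiDAR points into a sparse depth map, downscaling the pixel
    coordinates by the factor s so that more points remain inside the image."""
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # (N, 4)
    pts_cam = (T_init @ pts_h.T).T[:, :3]        # LiDAR frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]         # keep points in front of the camera
    uvw = (K @ pts_cam.T).T                      # pinhole projection
    u = uvw[:, 0] / uvw[:, 2] / s                # D(u/s, v/s) = z_c
    v = uvw[:, 1] / uvw[:, 2] / s
    z = pts_cam[:, 2]
    depth = np.zeros((H, W), dtype=np.float32)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth[v[valid].astype(int), u[valid].astype(int)] = z[valid]
    return depth
```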

Fig. 4

Pose transformation relationship diagram of continuous frames of LiDAR and camera data. LiDAR represents the LiDAR coordinate system while Camera represents the camera coordinate system. The transformations between consecutive frames are denoted by \(T_\textrm{LiDAR}\) and \(T_\textrm{Cam}\), respectively

In order to ensure the diversity of the training data, we apply a roto-translation transformation \(T_\textrm{decalib}\) for mis-calibration and generate a depth map \(D_\textrm{miscalib}\). Randomly generated transformations \(T_\textrm{decalib}\) can produce a large amount of training data. Each input consists of several adjacent pairs of depth map and image, where the depth maps are projected by the same \(T_\textrm{init}\) from the point clouds collected by the LiDAR to the image plane, and \(T_\textrm{init}\) is derived from the following equation:

$$\begin{aligned} {T}_\textrm{init}={T}_\textrm{decalib}T_{\phi } \end{aligned}$$
(1)

Thus, we establish the correlation between the LiDAR and the camera, and obtain the sparse depth map generated by the mis-calibration \(T_\textrm{decalib}\). The calibration output that we expect from our Iter-CalibNet can be computed as \( T_\textrm{calib}=T_\textrm{decalib}^{-1}\).

If we have the external roto-translation transformation \(T_{\phi }\) between the LiDAR and the camera, we can obtain the coordinate transformation parameters of one sensor between two frames from the inter-frame pose transformation of the other sensor. For example, as shown in Fig. 4, given \(T_\textrm{cam}\), the coordinate transformation parameters \(T_\textrm{velo}\) between the two frames of point clouds can be obtained as follows:

$$\begin{aligned} T_\textrm{velo}=T_{\phi }T_\textrm{cam}T_{\phi }^{-1} \end{aligned}$$
(2)

This is important for our Iter-CalibNet when we utilize the synthetic view constraint, as mentioned in Sect. “Loss function”.
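As a concrete illustration of Eqs. (1) and (2), the composition of these transformations can be written as the following NumPy sketch, assuming \(T_\textrm{decalib}\), \(T_{\phi }\), and \(T_\textrm{cam}\) are given as 4 \(\times \) 4 arrays:

```python
import numpy as np

# Eq. (1): deviated extrinsics used for projecting the training depth maps
T_init = T_decalib @ T_phi

# The calibration the network is expected to recover
T_calib = np.linalg.inv(T_decalib)

# Eq. (2): inter-frame LiDAR motion derived from the inter-frame camera motion
T_velo = T_phi @ T_cam @ np.linalg.inv(T_phi)
```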

Fig. 5

Architecture of the proposed Iterative Calibration Convolutional Neural Network

Network architecture

The network that calibrates a pair of RGB image and mis-calibrated depth map is designed as follows. It is mainly composed of two branches, like [14, 15], which respectively extract features from the 2D image and the depth map (see Fig. 5). For the RGB branch, we use the convolutional layers of the pre-trained ResNet-18 network [44]. For the depth map branch, we use a network similar to the ResNet-18 structure, but we halve the number of filters and find this is enough to extract the features contained in the mis-calibrated depth map. We also have to train this branch from scratch, because the pre-trained ResNet model is trained on ImageNet, whose images are significantly different from the depth maps used in our model.
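A minimal PyTorch sketch of the two branches is shown below. It assumes torchvision's ResNet-18 and, for brevity, keeps the full filter width in the depth branch (the actual model halves it), so it only approximates the architecture described above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoBranchFeatures(nn.Module):
    """Sketch: pre-trained RGB branch and a depth branch trained from scratch."""
    def __init__(self):
        super().__init__()
        rgb = models.resnet18(weights="IMAGENET1K_V1")
        self.rgb_branch = nn.Sequential(*list(rgb.children())[:-2])   # conv layers only
        depth = models.resnet18(weights=None)                         # no pre-training
        depth.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                padding=3, bias=False)                # 1-channel depth input
        self.depth_branch = nn.Sequential(*list(depth.children())[:-2])

    def forward(self, rgb, depth):
        f_rgb = self.rgb_branch(rgb)        # (B, 512, H/32, W/32)
        f_depth = self.depth_branch(depth)  # (B, 512, H/32, W/32)
        return torch.cat([f_rgb, f_depth], dim=1)   # fused feature map
```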

Then, we fuse the features output from the two network branches and use convolutional layers combined with a non-local module to further extract geometric and photometric features, which are useful for the subsequent calibration prediction. Specifically, the non-local module is able to capture long-range dependencies between different regions of the LiDAR and camera frames, allowing for more accurate and robust calibration results. Finally, we use fully connected layers to parse the translation and rotation information, respectively, from the deep features, and predict the translation vector \(\tau \) and the rotation vector \(\gamma \), which form the roto-translation calibration vector \(\beta = (\tau , \gamma )\). Here the rotation vector \(\gamma \) can be converted to a rotation matrix \(R\in \textrm{SO}(3)\) by the well-known Rodrigues formula. Combining the rotation matrix with the translation vector \(\tau \in {\mathbb {R}}^{3}\) gives us a 3D transformation matrix \(T_\textrm{calib} \in \textrm{SE}(3)\), which is expected to be the inverse of \(T_\textrm{decalib}\), and is defined as

$$\begin{aligned} T_\textrm{calib}= \begin{bmatrix} R &{}\quad \tau \\ 0 &{}\quad 1 \\ \end{bmatrix} \end{aligned}$$
(3)
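For reference, the Rodrigues conversion and the assembly of Eq. (3) can be sketched in NumPy as follows (the function name is ours):

```python
import numpy as np

def vec_to_se3(tau, gamma):
    """Assemble T_calib in SE(3) from translation tau (3,) and rotation
    vector gamma (3,) via the Rodrigues formula."""
    theta = np.linalg.norm(gamma)
    if theta < 1e-8:
        R = np.eye(3)                       # near-zero rotation
    else:
        k = gamma / theta                   # unit rotation axis
        Kx = np.array([[0, -k[2], k[1]],
                       [k[2], 0, -k[0]],
                       [-k[1], k[0], 0]])   # skew-symmetric matrix of the axis
        R = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * (Kx @ Kx)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = tau
    return T
```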

In order to improve the accuracy of the prediction, we take several consecutive frames of corresponding depth map and RGB image pairs as the input to the network, and adopt an iterative calibration method to train and test the network. As the transformation parameters of the continuously input pairs of point clouds and images are consistent, we can better optimize the network by iterative calibration, which means that we use the predicted parameters of the previous frame to pre-calibrate the input depth map of the next frame. Finally, the calibration matrix predicted by the network can be defined as \(T_\textrm{pre}^t=T_\textrm{cur}*T_\textrm{pre}^{t-1}\), where \(T_\textrm{pre}^{t-1}\) is the predicted parameter of the previous frame, and \(T_\textrm{cur}\) is the calibration matrix obtained at the current frame.
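The iterative update can be summarized by the following sketch (a simplified illustration; `model.predict` is a hypothetical helper, and `project_to_depth_map` refers to the projection sketch above):

```python
import numpy as np

def iterative_calibration(frames, T_init, K, H, W, model, num_iters=3):
    """Accumulate per-frame corrections: T_pre^t = T_cur * T_pre^(t-1)."""
    T_pre = np.eye(4)
    for rgb, point_cloud in frames[:num_iters]:
        # pre-calibrate the next frame's depth map with the previous estimate
        depth = project_to_depth_map(point_cloud, T_pre @ T_init, K, H, W, s=2.0)
        T_cur = model.predict(rgb, depth)   # correction predicted for this frame
        T_pre = T_cur @ T_pre
    return T_pre
```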

By projecting the 3D point cloud onto the corresponding 2D RGB image, we can understand the process of iterative calibration more intuitively, as shown in Fig. 6. The calibration result improves significantly as the number of iterations increases. From the perspective of causal reasoning, the iterative calibration method enables the calibration model to gradually obtain more and more accurate calibration results regardless of the deviation of the initial parameters, since it is equivalent to manually adjusting the deviation of the initial parameters from coarse to fine. Thus the confounding bias can be avoided when our method calibrates input data with different initial deviations.

Fig. 6

Depth projection map. a Is a 3D-2D projection image transformed by the mis-calibration parameters, while b–d are the iterative calibration results of a. The color of a projected point in the 3D-2D projection images represents its depth value. Since the LiDAR and camera are mounted on a moving car, the depth values of the projection points on the same static object gradually decrease during the iterative calibration process over consecutive frames

Loss function

Depth map calibration error. The input depth map \(D_\textrm{miscalib}\) can be temporarily calibrated using the transformation matrix \(T_\textrm{pre}\) output by the network, generating the predicted depth map \(D_\textrm{pre}\), as shown in Eq. (4).

$$\begin{aligned} \left[ \begin{matrix} u_p*z_{cp} \\ v_p*z_{cp} \\ z_{cp} \\ 1 \\ \end{matrix}\right] = K T_\textrm{pre} K^{-1} \left[ \begin{matrix} u*z_c \\ v*z_c \\ z_c \\ 1 \\ \end{matrix}\right] \end{aligned}$$
(4)

Note that \(T_\textrm{pre}\) will be continuously refined along with the model training, and will finally approach the ground truth transformation matrix \(T_\textrm{calib}^{gt}\). At the same time, using \(T_\textrm{calib}^{gt}\) to calibrate the input point cloud, we can obtain the ground truth depth map \(D_{gt}\) by projecting the point cloud according to \(T_\textrm{calib}^{gt}\). In order to optimize the calibration capability of the network, it is necessary to quantify the difference between the two depth maps as a loss function. For each projection point \(p(u, v, z_{c})\) in \(D_\textrm{miscalib}\), where (u, v) is the coordinate and \(z_{c}\) is the gray value representing the depth, there is a corresponding point \(p_\textrm{pre}(u_{p}, v_{p}, z_{cp})\) in \(D_\textrm{pre}\) and \(p_{gt}(u_{g}, v_{g}, z_{cg})\) in \(D_{gt}\), respectively. We define the depth map difference loss function as

$$\begin{aligned} {L}_{D}=\frac{\sum \nolimits _{p\in D}^{N} \left\| p_\textrm{pre}-p_{gt}\right\| _2}{N} \end{aligned}$$
(5)

where N is the number of points used in this loss.
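In PyTorch, Eq. (5) amounts to a mean L2 distance over matched projection points; a sketch with assumed (N, 3) tensors:

```python
import torch

def depth_map_loss(p_pre, p_gt):
    """L_D: mean Euclidean distance between corresponding (u, v, z_c) points
    of the predicted and ground-truth depth maps."""
    return torch.norm(p_pre - p_gt, dim=1).mean()
```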

Synthetic view constraint. Given the initial transformation parameter \(T_\textrm{init}\), we can substitute the predicted calibration parameters \(T_\textrm{pre}\) into the right-hand side of Eq. (1) to compute the predicted roto-translation \(T_{\phi }=T_\textrm{pre}^{-1}T_\textrm{init}\), which can be further substituted into Eq. (2) to compute the transformation between two adjacent poses of the camera, i.e., \(T_\textrm{cam}=T_{\phi }^{-1}T_\textrm{velo}T_{\phi }\). Meanwhile, we can use \(T_\textrm{pre}\) to obtain a calibrated depth map. From the depth map and the camera transformation, a synthesized view can be obtained for the RGB images in consecutive pairs of input data. With reference to the loss functions of depth estimation and camera pose estimation in [34], we establish geometric constraints between successive 2D frames. It should be noted that the synthesized view of continuous 2D frames of a monocular camera has many restrictions and is not fully applicable to the current 2D image. Therefore, we propose to leverage SIFT features [45] as reference points for the loss calculation. SIFT features are based on points of interest of local appearance on the object, are invariant to the size and rotation of the image, and are tolerant to changes in light, noise, and viewpoint. Based on these characteristics, we use the matched SIFT feature points in the two frames of 2D images as the target points of the synthesized view. According to the view synthesis method in [46], and assuming \(p_1\) denotes one of the selected SIFT points in the target image \(I_1\), its projection \(p_2\) on the source image \(I_2\) is represented by

$$\begin{aligned} {p}_{2} \sim {K}{T}_\textrm{cam}D_{1}(p_1)K^{-1}p_1 \end{aligned}$$
(6)

where \(D_1\) is the sparse depth map corresponding to \(I_1\). Then we can interpolate the pixel value of \(p_2\) according to the neighboring pixels around \(p_2\) in \(I_2\). After computing the pixel values for all selected SIFT feature points in this way, a synthetic view \(I'_2\) can be generated, where each point corresponds to one selected SIFT feature point in \(I_1\). Thus, the photometric loss of the synthetic view constraint can be formulated as

$$\begin{aligned} {L}_{S}=\sum \limits _{p\in I'_{2}}^{N} \left\| I_{1}(p) - I'_2(p)\right\| _2 \end{aligned}$$
(7)

where \(I_1(p)\) and \(I'_2(p)\) denote the pixel values of the corresponding points in the target image and the synthetic view, respectively.
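A hedged PyTorch sketch of Eqs. (6)-(7) is given below; it assumes image tensors of shape (1, 3, H, W), a dense-enough depth at the selected SIFT locations, and uses bilinear interpolation (`grid_sample`) for the neighbor-based interpolation described above.

```python
import torch
import torch.nn.functional as F

def synthetic_view_loss(I1, I2, D1, sift_pts1, K, T_cam):
    """L_S: photometric error at matched SIFT locations.
    Assumed shapes: I1, I2 (1, 3, H, W); D1 (1, 1, H, W); sift_pts1 (N, 2)
    pixel coordinates in I1 with valid depth; K (3, 3); T_cam (4, 4)."""
    u, v = sift_pts1[:, 0], sift_pts1[:, 1]
    z = D1[0, 0, v.long(), u.long()]                     # depth D_1(p_1) at SIFT points
    pix = torch.stack([u * z, v * z, z], dim=0)          # (3, N) homogeneous pixels * depth
    cam1 = torch.inverse(K) @ pix                        # back-project into camera 1
    cam1_h = torch.cat([cam1, torch.ones(1, cam1.shape[1])], dim=0)
    proj = K @ (T_cam @ cam1_h)[:3]                      # Eq. (6): project into camera 2
    u2, v2 = proj[0] / proj[2], proj[1] / proj[2]
    H, W = I2.shape[-2:]
    grid = torch.stack([2 * u2 / (W - 1) - 1,            # normalize for grid_sample
                        2 * v2 / (H - 1) - 1], dim=-1).view(1, 1, -1, 2)
    I2_synth = F.grid_sample(I2, grid, align_corners=True)[0, :, 0, :]   # (3, N)
    I1_vals = I1[0, :, v.long(), u.long()]                               # (3, N)
    return torch.norm(I1_vals - I2_synth, dim=0).sum()                   # Eq. (7)
```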

Global regression error of calibration parameters. In order to achieve better accuracy of the calibration parameters from our network regression, we also compute the Euclidean distance between the ground truth parameters and the predicted parameters as follows:

$$\begin{aligned} \begin{array}{ll} {L}_{P} = {L}_\textrm{translation} + {L}_\textrm{rotation} \\ \quad = \left\| \tau _\textrm{pre} - \tau _{gt}\right\| _2 + \left\| \gamma _\textrm{pre} - \gamma _{gt}\right\| _2 \end{array} \end{aligned}$$
(8)

Our final loss function is a weighted sum of the above losses:

$$\begin{aligned} {L}_{final} = {\lambda }_{1}{L}_{D} + {\lambda }_{2}{L}_{S} + {\lambda }_{3}{L}_{P} \end{aligned}$$
(9)

where \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) are manually set weights.
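Putting Eqs. (5), (7), and (8) together, the total loss of Eq. (9) can be sketched as follows (reusing the loss helpers sketched above; `tau_pre`, `gamma_pre`, etc. denote the network outputs and their ground truth, and the default weights are the values reported in Sect. “Training details”):

```python
import torch

def total_loss(p_pre, p_gt, I1, I2, D1, sift_pts1, K, T_cam,
               tau_pre, tau_gt, gamma_pre, gamma_gt,
               lambda_1=0.01, lambda_2=0.2, lambda_3=1.0):
    L_D = depth_map_loss(p_pre, p_gt)                                 # Eq. (5)
    L_S = synthetic_view_loss(I1, I2, D1, sift_pts1, K, T_cam)        # Eq. (7)
    L_P = torch.norm(tau_pre - tau_gt) + torch.norm(gamma_pre - gamma_gt)  # Eq. (8)
    return lambda_1 * L_D + lambda_2 * L_S + lambda_3 * L_P           # Eq. (9)
```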

Experiment

In order to verify the proposed method, the KITTI-Odometry dataset [19] is employed to train and test the model. The sequences in the KITTI-Raw dataset [19] are also utilized for the generalization evaluation of our method. In this section, the specific settings of the experiments, the training process of the model, and the qualitative and quantitative analysis of the experimental results are explained in detail.

Dataset preparation

First, we use three consecutive pairs of camera images and LiDAR point clouds as an input sample, and apply random transformation parameters \(T_\textrm{decalib}\) to mis-calibrate each set of point cloud data. The deviation range of the mis-calibration is set to \(\pm 10^{\circ }\) rotation and \(\pm 0.25\) m translation on any axis. We believe this deviation range covers most cases of initial roto-translation errors for LiDAR-camera pairs. In order to better extract the matching features of the depth map and the RGB image between consecutive frames, the same mis-calibration parameter is applied to the three point clouds within one input sample. By projecting the point cloud into the image plane with the mis-calibration parameter and the camera intrinsic parameters, the mis-calibrated sparse depth map can be obtained.
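One possible way to draw such random mis-calibrations is sketched below (reusing the `vec_to_se3` helper sketched earlier; parameterizing the rotation as a rotation vector is our choice for illustration):

```python
import numpy as np

def random_decalib(max_rot_deg=10.0, max_trans_m=0.25):
    """Sample T_decalib with rotations in [-10 deg, 10 deg] and translations
    in [-0.25 m, 0.25 m] on each axis, matching the range used above."""
    gamma = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg, size=3))
    tau = np.random.uniform(-max_trans_m, max_trans_m, size=3)
    return vec_to_se3(tau, gamma)
```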

Table 1 Comparison of different iterations

As analyzed in Sects. “Causality analysis” and “Training data generation”, we cut off the backdoor path by applying a scaling factor to the projected points. According to our experiments in Sect. “Performance”, the best scaling factor in most experiments is 2.0. Therefore, we reduce the 2D projection point coordinates to half of the original when projecting 3D point clouds into the 2D depth map. Besides, the coordinate transformation parameters \(T_\textrm{velo}\) between two consecutive point clouds, which are calculated by Eq. (2), are also needed when computing the losses.

Considering the geometric and photometric constraints, it is necessary to extract matched SIFT feature points from two adjacent image frames. We use the Brute Force (BF) matching approach to match SIFT features. To improve the ratio of correct matching pairs, we further compute the distance of each SIFT feature to its epipolar line, and filter out those matching pairs whose distance is larger than a threshold. We extract 100–500 pairs of matched feature points for two adjacent 2D frames to calculate the loss function in the model training phase.
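This matching and filtering step can be sketched with OpenCV as follows (an illustration only; the threshold value and function name are assumptions):

```python
import cv2
import numpy as np

def matched_sift_points(img1, img2, max_pairs=500, epi_thresh=1.0):
    """Brute-force SIFT matching followed by an epipolar-distance filter."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    matches = sorted(matches, key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    # distance of each point in image 2 to the epipolar line of its match
    lines = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
    d = np.abs(np.sum(lines * np.hstack([pts2, np.ones((len(pts2), 1))]), axis=1))
    d /= np.linalg.norm(lines[:, :2], axis=1)
    keep = (mask.ravel() == 1) & (d < epi_thresh)
    return pts1[keep][:max_pairs], pts2[keep][:max_pairs]
```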

After preprocessing the 00–06 sequences of the KITTI-odometry dataset, we take 90% of the frames from each sequence as the training set, and the remaining 10% as the validation set. Finally, we obtain 20,000 samples for training and 2000 samples for validation. In addition, we use the whole 07, 20, and 21 sequences as test data, in which no frames are used for training.

Fig. 7

Experiment results on small datasets with different downscaling factors. a, b Indicate translation and rotation, respectively. The horizontal axis represents the average value of the initial deviation of rotation or translation. The vertical axis represents the average error on the three axes of translation or rotation of the calibration model

Training details

The training of the network is performed with the Adam optimizer [47], using an initial learning rate of 1e-4. We decrease the learning rate by a factor of 0.5 every 30,000 steps. Except for the RGB branch, the remaining weights of our network are initialized using Xavier initialization [48]. Meanwhile, we add a Rectified Linear Unit (ReLU) activation after each convolutional layer. In order to prevent overfitting, we apply a regularization loss and set the regularization parameter to 0.001. The weights of the loss functions are set to \(\lambda _1 = 0.01\), \(\lambda _2 = 0.2\), and \(\lambda _3 = 1\), respectively.
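A sketch of this optimization setup in PyTorch is shown below (hypothetical names for `model`, `train_loader`, and `compute_total_loss`; mapping the 0.001 regularization to Adam's weight decay is our assumption):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30000, gamma=0.5)

for step, batch in enumerate(train_loader):
    optimizer.zero_grad()
    loss = compute_total_loss(model, batch)   # weighted sum of L_D, L_S and L_P
    loss.backward()
    optimizer.step()
    scheduler.step()                          # halves the learning rate every 30,000 steps
```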

The number of SIFT feature points also influences the calibration accuracy. We have tested different numbers of SIFT feature points (i.e., \(\{100, 200, 300, 400, 500\}\)), and found that the more matched SIFT features are selected, the higher the accuracy that can be obtained. We show the results in the third and fourth rows of Table 1. However, the number of selected SIFT features cannot be too large, because it is difficult to obtain sufficient precisely matched SIFT features in all images, and more selected SIFT features require more computational resources. Therefore, we set the number of SIFT feature points to 500 in the experiments, except for the iteration experiments, where the number is set to 100 for simplification.

Evaluation on backdoor path adjustment

In the implementation of the calibration model, we cut off the backdoor path by introducing a downscaling factor for the point cloud and applying the iterative calibration manner. In this section, we evaluate the effectiveness of these two backdoor path adjustment strategies through experiments, and further analyze their impacts on the calibration model.

Table 2 Comparison of different methods

Downscaling factor. Since the depth map has the same resolution of 376 \(\times \) 1241 as the corresponding camera frames, only a part of the points in a point cloud can be projected into a depth map. Moreover, because the 3D–2D calibration matrix \(T_\textrm{init}\) has unknown deviations, not all points corresponding to the current camera frame can be projected correctly into the depth map. However, the more corresponding points are projected into the depth map, the higher the accuracy of the calibration parameters predicted by the model. To this end, we downscale the coordinates during the 3D–2D projection process, in order to retain more points in the field of view of the camera frames, as shown in Fig. 2b.

We first try to set a random downscaling factor (ranging from 1.0 to 3.0) for each training sample; however, we find that this approach hinders convergence when training our model. This is because it is difficult for the network to learn how to match the corresponding features extracted from images and depth maps when the downscaling factor differs among input samples.

Therefore, we set the scaling factor to a fixed scale, and evaluate the accuracy at different scales, i.e., \(\{1.0, 1.5, 2.0, 2.5, 3.0\}\). Experimental results at different scales are shown in Fig. 7. It can be found that when the downscaling factor reaches 2.0 or larger, different initial transformation parameters (i.e., the mis-calibration in Fig. 7) do not influence the accuracy of the final calibration parameters, because the calibration errors under different deviations are close. The accuracy is also much higher than that obtained with small downscaling factors (i.e., \(\{1.0, 1.5\}\)). Setting the downscaling factor to a value larger than 2.0 does not significantly improve the calibration accuracy. This is because there are already sufficient points projected into the depth map, from which our Iter-CalibNet can extract enough information to obtain an accurate calibration, once the coordinates of the points are downscaled to half of their original values. Larger downscaling factors only introduce more irrelevant 3D projection points and increase the training time. Finally, the downscaling factor s is set to 2.0 according to our evaluation.

Number of iterations. For the 3D–2D calibration problem, we employ an iterative strategy to further decrease the influence of the initial transformation parameters, which also cuts off the causal link \(\varvec{C}\rightarrow \varvec{X}\) in the calibration model. On the one hand, the iterative strategy increases the variability of the depth map bias during model training, which improves the generalization performance of the model for different initial transformation parameters. On the other hand, iterative calibration endows Iter-CalibNet with the ability to gradually improve the calibration accuracy from coarse to fine during testing. This is because continuously inputting frames in the iterative process enables the network to learn the temporal change characteristics of the sensor data.

Table 3 Ablations for different losses
Fig. 8

Calibration error distribution of the whole test dataset over a wide range of initial mis-calibration. a Shows the translation error distribution of test results. b Shows the rotation error distribution of test results

In order to further verify the impact of the number of iterations on the calibration results, we use different numbers of iterations to train the calibration models and compare their testing results, as shown in Table 1. It can be seen that the accuracy of the calibration results improves as the number of iterations increases. However, more iterations incur a higher time cost during training and testing. Considering the trade-off between effectiveness and efficiency, we set the number of iterations to Num = 3 in subsequent experiments.

Performance

We evaluate the proposed model by setting different initial mis-calibration ranges for the training and test data, and verify the accuracy of the calibration parameters predicted by the model. The comparison results under different kinds of evaluation criteria are listed in Table 2, and the final results are shown in the first line of Table 3 in terms of the mean error on each axis for translation and rotation. Figure 8 further illustrates the calibration error distribution of all experimental results over a wide range of initial mis-calibration by means of boxplots. In most cases, our model can obtain a precise calibration result with extremely small error. In the following parts we explain our experiments from several aspects in detail.

Fig. 9

Examples of depth map calibration. The left column is the mis-calibrated depth projection. The right column shows the depth projection after the model calibration

Compared with SOTA methods. Different calibration methods use different experimental datasets and evaluation processes, so it is difficult to achieve a completely quantitative comparison. Therefore, we comprehensively calculate four kinds of calibration errors of our model according to the different error quantification methods used in other papers. The evaluation criterion is the translational and angular error of the estimated relative transformation \(T_\textrm{pre}\) against the ground truth \(T_\textrm{calib}^{gt}\). We first calculate the mean absolute error (MAE) values of translation and rotation after converting the calibration deviation \(T_\textrm{dev}=T_\textrm{pre}*T_\textrm{calib}^{-1}\) into the translation distance and rotation angle on each axis. As shown in the first and fourth columns of Table 2, we list the calibration accuracy achieved by several SOTA camera-LiDAR calibration methods. In the ‘L2’ column, we calculate the translation error as the Euclidean distance between the estimated value \(\tau \) and the ground truth \(\tau _{gt}\) [49]. In the ‘Percent’ column, the translation error is calculated by \(\left\| \tau - \tau _{gt}\right\| _2 / \left\| \tau _{gt}\right\| _2\) [6]. In the last column, we show the ‘MSEE’ [17] as an error metric, calculated as Eq. (10), where N is the number of testing examples.

$$\begin{aligned} \textrm{MSEE}=\frac{1}{N}\sum \limits _{i}^{N} \left\| \beta _\textrm{pre}(i) - \beta _\textrm{exp}(i)\right\| \end{aligned}$$
(10)

It can be observed from these columns that our method achieves the best accuracy in comparison with other CNN-based methods [14,15,16,17, 49]. Our translation error is almost one-third of that of the second best method [15], and our rotation error is no more than half of that achieved by [14]. We believe these significant improvements in calibration accuracy are due to the contribution of cutting the backdoor path in the causal model to the optimization of the calibration model.

More importantly, compared with traditional methods [6, 10], our Iter-CalibNet also achieves at least comparable (if not better) translation accuracy. The MAE of translation is only one millimeter larger than that in [10], and two percentage points larger than that in [6]. Our rotation accuracy is better than that of all traditional methods. These results prove that our method can be employed for LiDAR-camera calibration in real scenes.

Generalization for calibration. We evaluate the proposed method on the KITTI-RAW driving dataset. In Fig. 9 we show some examples of depth map calibration. It can be seen from this figure that even when there are large deviations in the initial calibration (left column), the depth maps can be accurately calibrated by our model (right column). The experimental results show that the generalization of our system is satisfactory: it also obtains good calibration results on untrained and unfamiliar datasets. The calibrated translation and rotation deviations on the whole sequence are (X: 0.09 m, Y: 0.026 m, Z: 0.03 m) and (X: \(0.73^{\circ }\), Y: \(1.88^{\circ }\), Z: \(0.17^{\circ }\)), and the MAEs of translation and rotation are 0.048 m and \(0.926^{\circ }\), respectively. Although there is a certain gap between the calibration accuracy achieved on the KITTI-RAW dataset and that on the KITTI-odometry dataset, our method still achieves better translation accuracy than [14, 16], and better rotation accuracy than [6, 49].

Regarding the difference in the performance of the model between the KITTI-RAW and KITTI-odometry datasets, this is probably because of the difference in the processing of the point cloud data in the two datasets. The point cloud data in the KITTI-odometry dataset is linearly interpolated to eliminate the influence of the inherent dynamics of the LiDAR, which is not done in the KITTI-RAW dataset.

Ablations

In this section, we use the same training and testing data from the KITTI dataset to further evaluate the contributions of different components in our calibration model.

Non-local module. We propose to use a non-local module combined with CNN in the backbone network to better process the merged RGB image and depth map features. The purpose of adopting the non-local module is to select the features that are more critical to the current goal than the many other features. After leveraging the non-local module, we obtain a mean calibration error of 0.015 m for translation and \(0.121^{\circ }\) for rotation. If we remove the non-local module and leave the other settings unchanged, the newly trained calibration model yields a mean calibration error of 0.017 m for translation and \(0.137^{\circ }\) for rotation. From the comparison, it is obvious that the non-local module can find more critical features for calibration and indeed improves the accuracy of the calibration.

Loss functions. In Sect. “Network architecture”, we propose to use the combination of the calibration error of the projected depth map, the photometric and geometric error obtained by the inter-frame pose transformation, and the predicted parameter error as the loss function for network training. In order to demonstrate the necessity of using all losses, we conduct the following experiments: (a) FULL: all the above losses are used; (b) DEL-\(L_D\): removing the depth map calibration error \(L_D\); (c) DEL-\(L_S\): removing the synthetic view constraint \(L_S\); and (d) DEL-\(L_P\): removing the global regression error of calibration parameters \(L_P\). Results are provided in Table 3. It is easy to verify that all loss terms are crucial to this task, as missing any one of them leads to performance degradation. However, their importance differs with respect to different calibration parameters. For example, the depth map calibration error has the most influence on rotation, as deleting it yields the largest rotation errors among all configurations; the global regression error influences the translation errors the most, as deleting it leads to approximately 10 cm of error on the Z axis compared with the full configuration shown in the first row.

Conclusion

In this paper, we have presented a novel Iter-CalibNet method for extrinsic calibration between LiDAR and camera. By carefully considering the causality among the various components of the calibration process, our method eliminates the influence of the confounding factor. The proposed Iter-CalibNet is a deep neural network combining CNN and a non-local module. We design two strategies, i.e., the training data generation and the iterative calibrating manner, to cut off the backdoor path in the structural causal model of calibration. Our method does not require any human intervention and enables online real-time calibration, inferring the 6-DoF rigid body transformation. We only need to train one model over a wide range of initial calibration errors. Because the optimization of the model is based on the causal analysis and the underlying geometric constraints of 3D–2D calibration, it has excellent generalization ability, adapting to different pairs of LiDAR and camera with various intrinsic parameters. On the whole test dataset, our method yields a mean calibration error of 0.015 m for translation and \(0.121^{\circ }\) for rotation, which is the best among state-of-the-art methods. Our LiDAR-camera calibration process and the combination of SLAM/VO with LiDAR and cameras are intertwined and overlapping. In future research, we plan to further improve the accuracy of our online calibration system by utilizing the causal relationships in the VO process.