1 Introduction

Understanding 3D point clouds has become increasingly important with the rise of robotics technologies such as augmented/virtual/mixed reality and autonomous vehicles. Autonomous driving, for instance, allows vehicles to sense and respond to their environment without human intervention. However, system safety relies heavily on accurate perception and localization of the environment. Simultaneous Localization and Mapping (SLAM) [1] plays a critical role in the perception and planning pipeline of autonomous vehicles by constructing a map of the surrounding environment and localizing the vehicle within it. Visual/LiDAR odometry estimation [2,3,4,5,6] is an essential component of a SLAM system, aiming to estimate the robot's pose from consecutive point clouds. Moreover, large-scale data-driven applications, such as LiDAR semantic segmentation [7, 8] and odometry estimation [6], empower advanced robotics technologies. Similarly, the use of saliency information in 2D computer vision tasks such as image translation [9], object detection [10] and tracking [11, 12], key-point selection [13, 14], and person re-identification [15,16,17] has led to state-of-the-art results by capturing the predominant information in a scene. However, the unstructured, unordered, and density-varying nature of point clouds makes it difficult for conventional point-cloud-based methods to process informative visual features effectively and rapidly in large-scale scenes. To address this challenge and enhance the performance of real-time autonomous vehicles, several works [18,19,20] have explored saliency detection algorithms for point cloud tasks, showing that integrating efficient saliency knowledge can further enhance the performance of 3D point cloud understanding.

Fig. 1

Visualization examples of SIFT [21] key points and saliency maps for two consecutive frames. From left to right: RGB images, SIFT [21] key points and saliency maps from our FordSaliency dataset, and point clouds registered on the images with saliency values. Saliency values are scaled to the range [0, 1]; the colour bar indicates the saliency values of the heat map fused on the colour images

In LiDAR odometry estimation, keypoint selection is often used to facilitate the learning of matching features by the model [22,23,24,25]. SIFT-based approaches, such as LodoNet [22], extract matched keypoint pairs, which are then used to learn point-wise features in PointNet [26]. Alternatively, saliency-based point selection methods, such as that used by SalientDSO [23], demonstrate the potential benefits of incorporating saliency information into visual odometry. As shown in Fig. 1, while SIFT key points [21] and the salient regions of saliency maps can both detect significant and consistent landmarks in a scene (e.g., buildings and traffic signs), saliency maps offer a more continuous and soft indication of attentive probabilities, unlike the sparse and discrete nature of key points. Thus, integrating saliency information has the potential to improve odometry estimation. However, despite significant progress in LiDAR odometry estimation, challenges remain, particularly in crowded environments where moving objects can introduce noise and occlusions [27,28,29,30]. To address this issue, some odometry methods [27,28,29,30] use semantics to mitigate the adverse effects of moving object regions/points in the input data. Static objects provide a stable and consistent reference for geometry-based matching, which is critical for successful pose estimation. For example, early Iterative Closest Point (ICP) [31, 32] based odometry models [33, 34] estimate the transformation iteratively by minimizing matching errors between corresponding points of two scans.

Fig. 2

Overview of the proposed framework of image-to-LiDAR saliency knowledge transfer for 3D point cloud understanding. The 2D image saliency knowledge of RGB saliency models is transferred to 3D point clouds. The resulting 3D point cloud saliency knowledge is then used for attention-guided 3D point cloud understanding tasks, such as 3D semantic segmentation and LiDAR odometry estimation

In this paper, we focus on improving LiDAR odometry estimation and 3D semantic segmentation by learning robust and discriminative features under saliency information constraints. Specifically, we propose a saliency-guided 3D semantic segmentation method that exploits saliency cues to facilitate robust feature learning. We also propose a saliency-guided LiDAR odometry approach that leverages attention information and semantics to improve performance. Figure 2 illustrates an overview of the proposed framework of image-to-LiDAR saliency knowledge transfer for attention-guided LiDAR semantic segmentation and odometry estimation models.

Table 1 Comparison of existing saliency detection datasets on point clouds

Several attempts have been made to find effective solutions for saliency detection on point clouds [13, 18,19,20, 38]. Table 1 compares the existing saliency detection datasets on point clouds. Most challenges of saliency detection on point clouds are yet to be explored. First, previous saliency methods such as [18, 20] operate on mesh data of 3D objects, where scenes are less complicated and contain only a few background points. Second, due to the lack of human-annotated training datasets, a supervised learning scheme for saliency detection on point clouds is infeasible. Similar to the challenges faced in 3D object detection, where approaches have been proposed to reduce dependence on extensive supervision due to the laborious and costly nature of manual labeling [39, 40], it is highly desirable to develop a practical deep learning pipeline for saliency prediction on large-scale point clouds. This study is an extension of our previous work [38] on point cloud saliency prediction and 3D semantic segmentation. In [38], we designed a universal framework and a point cloud saliency dataset (FordSaliency) to transfer saliency distribution knowledge to point clouds. An attention-guided two-stream network was then proposed to improve the accuracy of the LiDAR semantic segmentation task. The first stream is a LiDAR-based saliency network trained on the FordSaliency dataset that guides the segmentation task. The second stream is a segmentation module that predicts the semantics of the input point cloud [38].

In this work, we propose a saliency-guided deep self-supervised odometry model that combines the saliency and semantic predictions of SalLiDAR [38] for LiDAR odometry estimation. In brief, the key contributions of this work can be summarized as follows:

  • We propose a saliency-guided LiDAR odometry estimation model trained in a self-supervised manner. The proposed odometry model consists of three modules: a saliency module, a semantic module, and an odometry module. In [38], we used point cloud saliency to improve semantic segmentation predictions on point clouds. In this work, we not only utilize point cloud saliency but also incorporate point cloud semantic segmentation predictions into the self-supervised odometry module. The motivation is to focus on attentional static regions of the scene during the training of the odometry module, while avoiding dynamic objects that could hinder relative pose estimation between two sequential point cloud scans.

  • To mitigate the adverse effects of dynamic points on the LiDAR odometry model, we binarize the semantic map into dynamic and static points using the semantic labels defined in the SemanticKITTI dataset [41]. The point cloud, along with the binarized semantic map and saliency map, is then fed into the odometry module for feature learning.

  • To prioritize salient static points for point cloud matching, we propose a saliency-guided odometry loss that utilizes the saliency and binarized semantic maps to regularize the odometry module. This helps the module focus more on attentive points and improves the accuracy of point cloud matching. Extensive experiments on benchmark datasets show that the proposed two-stream LiDAR odometry model with saliency and semantic knowledge improves odometry estimation and achieves better performance than existing methods.

2 Related works

2.1 LiDAR odometry estimation

Recently, deep learning-based odometry models [5, 6, 42, 43] have been proposed to predict pose by learning richer features with powerful convolutional modules. In PWCLO-Net [6], Wang et al. propose a deep LiDAR odometry approach based on hierarchical embedding mask optimization, where a warp-refinement module with an attentive cost volume structure refines the estimated pose in a coarse-to-fine manner; the attentive cost volume is used for the association between two point clouds. In DeLORA [42], Nubert et al. present a self-supervised LiDAR odometry model for pose regression without any ground-truth labels. Two consecutive range images converted from raw LiDAR point clouds are fed into the DeLORA [42] model to output a rigid-body transformation; the geometric transformation is then applied to the source LiDAR scan and its normal vectors to obtain the transformed scan and transformed normal vectors. Afterward, a point-wise geometric loss between the transformed scan and the target scan is calculated to guide the model to learn geometry-specific features, thus generating a transformation that matches the transformed and target scans as closely as possible [42]. In this paper, we focus on deep learning-based LiDAR odometry research, which has achieved great progress in recent years [3, 5, 43, 44].

2.2 Saliency detection on point cloud

Saliency detection aims to find the most eye-attracting locations in a visual scene, and can be traced back to the pioneering work of Itti et al. [45]. With rapidly emerging advances and applications of deep learning techniques, saliency detection on color images/videos [46] has made great progress in recent years. There are also several works [13, 14, 18,19,20, 35] on saliency computation for point clouds. For example, Ding et al. [18] propose a 3D mesh saliency calculation method that fuses local distinctness and global rarity features. Tinchev et al. [14] present a key point detector on point clouds based on saliency estimation: they compute the gradient response of a differentiable convolutional network to obtain the saliency map, and then use multiple fully connected layers to combine the saliency feature, the point cloud context feature, and PCA features of point descriptors [14]. Zheng et al. [20] present a saliency computation method using a loss gradient approach that approximates point dropping by differentiably shifting points towards the point cloud centre. However, saliency methods focusing on 3D meshes or indoor scenes are limited in their ability to process large-scale 3D point clouds such as 3D driving data. Moreover, saliency models that extract handcrafted descriptors may miss informative representations for point clouds with varying density and complex backgrounds in outdoor scenarios.

2.3 LiDAR semantic segmentation

LiDAR semantic segmentation [7, 8, 26, 47,48,49] is a crucial 3D computer vision task for autonomous driving, which aims to predict the semantic class of each point in a LiDAR scan. As a pioneering point set-based method, PointNet [26] uses multi-layer perceptrons (MLPs) to learn point-wise features for classification and segmentation. RandLA-Net [47] randomly samples the input point cloud and employs a local feature aggregation module to compensate for the information loss introduced by random sampling. Considering the range property of LiDAR point clouds, Cylinder3D [8] leverages a cylindrical partition for 3D semantic segmentation and introduces an asymmetrical encoder-decoder built on 3D sparse convolutional networks for voxel-based features. PVKD [7] achieves state-of-the-art 3D semantic segmentation performance by applying a point-to-voxel knowledge distillation strategy to the Cylinder3D [8] model. With RPVNet [49], the authors present a multi-modality fusion model that combines range-based, point-based, and voxel-based representations with a gated fusion module for LiDAR semantic segmentation.

3 Proposed framework

3.1 Problem formulation

Given an input point cloud \({\mathcal {P}}\)={\(p_{i} |\) i=1, ..., N, \(p_{i} \in {\mathbb {R}}^{d}\)} consisting of N unordered points, where each point \(p_{i}\) contains d-dimensional features such as point coordinates (xyz), colors (rgb), reflectivity, and normals, the objective of a saliency detection model on the point cloud is to predict the saliency score map \({\mathcal {S}}\)={\(s_{i} |\) i=1, ..., N, \(s_{i} \in [0, 1]\)}, where \(s_{i}\) denotes the saliency score of point \(p_{i}\). After normalizing the saliency prediction, the closer the saliency score \(s_{i}\) is to 1, the more attentive the point \(p_{i}\). The goal of the 3D semantic segmentation task is to predict the semantic class map \({\mathcal {C}}\)={\(c_{i} |\) i=1, ..., N, \(c_{i} \in {\mathbb {R}}\)}, where \(c_{i}\) indicates the semantic category of point \(p_{i}\).

The objective of this work is to establish a self-supervised LiDAR odometry estimation model that is guided by saliency and semantic constraints and can be trained without ground-truth poses. To achieve this goal, the model takes as input two consecutive LiDAR point clouds \({\mathcal {P}}_{t}\) and \({\mathcal {P}}_{t-1}\) at times t and \(t-1\), where each point p contains d-dimensional point-wise features, such as the point coordinates (xyz), the range feature r, the semantic feature c, and the saliency feature s. The odometry model estimates a rotation \({\textbf{q}}\) corresponding to a \(3 \times 3\) rotation matrix \({\textbf{R}} \in SO(3)\) and a \(3 \times 1\) translation vector \({\textbf{t}}\), where \({\textbf{R}}\) and \({\textbf{t}}\) compose the relative rigid transformation \({\hat{T}}_{t-1, t} \in SE(3)\) between point clouds \({\mathcal {P}}_{t}\) and \({\mathcal {P}}_{t-1}\). The point cloud \({\mathcal {P}}_{t}\) can be transformed into \(\hat{{\mathcal {P}}}_{t-1}\) in the coordinate system of \({\mathcal {P}}_{t-1}\) by the transformation \({\hat{T}}_{t-1, t}\):

$$\begin{aligned} \hat{{\mathcal {P}}}_{t-1} = {\hat{T}}_{t-1, t} \odot {\mathcal {P}}_{t} \end{aligned}$$
(1)

where \(\odot\) represents the point-wise matrix multiplication. Afterward, the point-wise matching loss between \(\hat{{\mathcal {P}}}_{t-1}\) and \({\mathcal {P}}_{t-1}\) can be calculated to train the odometry model, thereby forcing the model to predict an optimal transformation \({\hat{T}}_{t-1, t}\). Also, the normal vector \({\mathcal {N}}_{t}\) of \({\mathcal {P}}_{t}\) can be transformed into \(\hat{{\mathcal {N}}}_{t-1}\) in the coordinate system of \({\mathcal {P}}_{t-1}\) by the transformation \({\hat{T}}_{t-1, t}\):

$$\begin{aligned} \hat{{\mathcal {N}}}_{t-1} = rot({\hat{T}}_{t-1, t}) \odot {\mathcal {N}}_{t} \end{aligned}$$
(2)

Therefore, the odometry model can be trained in a self-supervised manner by calculating the point-wise matching loss, and it does not require the odometry ground truth \(T_{t-1, t}\).
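
As a concrete illustration of Eqs. (1) and (2), the following minimal sketch (assuming NumPy arrays and a \(4 \times 4\) homogeneous matrix for \({\hat{T}}_{t-1, t}\); the function name is illustrative, not part of our implementation) applies an estimated transformation to a source scan and its normal vectors:

```python
import numpy as np

def apply_transform(points, normals, T):
    """Apply an estimated SE(3) transform T (4x4) to a source scan,
    following Eqs. (1)-(2): points are rotated and translated, while
    normals are only rotated. `points` and `normals` are (N, 3) arrays."""
    R, t = T[:3, :3], T[:3, 3]        # rot(T) and the translation part
    points_hat = points @ R.T + t     # Eq. (1): P_hat_{t-1} = T ⊙ P_t
    normals_hat = normals @ R.T       # Eq. (2): N_hat_{t-1} = rot(T) ⊙ N_t
    return points_hat, normals_hat
```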

Fig. 3

Illustration of proposed framework of image-to-LiDAR saliency knowledge transfer for attention-guided 3D point cloud understanding

3.2 Framework overview

In Fig. 3, we show the overview of image-to-LiDAR saliency knowledge transfer for 3D point cloud understanding. There are three main sub-tasks: (1) image-to-LiDAR saliency knowledge transfer to generate a pseudo-saliency dataset of point clouds, (2) LiDAR-to-LiDAR pseudo-saliency learning using LiDAR-based deep models, and (3) saliency-guided 3D point cloud understanding by integrating the saliency information.

Fig. 4

Visualization examples of point cloud and corresponding saliency pseudo-ground-truth (GT) map from our FordSaliency dataset

First, we propose a large-scale pseudo-saliency dataset (FordSaliency) for point clouds by assigning the saliency values of RGB images to the corresponding points registered on the images. Figure 4 shows visualization examples of a point cloud and the corresponding saliency pseudo-ground-truth map from our FordSaliency dataset. Then, we train LiDAR-based models on the proposed pseudo-saliency dataset to learn point cloud saliency features. Next, we propose a saliency-guided two-stream network (SalLiDAR) for large-scale point cloud segmentation. The saliency prediction is not only used as an input feature for the semantic module but also adopted in a saliency-guided loss to help the semantic module (1) learn richer features for salient points and (2) reduce the influence of the imbalance of points across semantic classes via saliency constraints.
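
The image-to-LiDAR transfer step can be summarized by the sketch below, which projects LiDAR points into the camera image and samples the image saliency map as pseudo labels. The calibration inputs and the function name are illustrative assumptions; the actual registration follows the calibration provided with the FordCampus [37] data.

```python
import numpy as np

def lidar_pseudo_saliency(points, saliency_img, K, T_cam_lidar):
    """Assign image saliency values to LiDAR points (a sketch of the
    image-to-LiDAR transfer). points: (N, 3) LiDAR coordinates,
    saliency_img: (H, W) image saliency in [0, 1], K: (3, 3) camera
    intrinsics, T_cam_lidar: (4, 4) LiDAR-to-camera extrinsics."""
    H, W = saliency_img.shape
    pts_h = np.c_[points[:, :3], np.ones(len(points))]   # homogeneous LiDAR coords
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                # LiDAR frame -> camera frame
    pseudo = np.zeros(len(points))                        # points outside the image keep 0
    front = cam[:, 2] > 0                                 # only points in front of the camera
    uv = (K @ cam[front].T).T                             # perspective projection
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.flatnonzero(front)[inside]
    pseudo[idx] = saliency_img[v[inside], u[inside]]      # sample image saliency
    return pseudo                                         # (N,) pseudo-saliency labels
```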

Finally, a Saliency-guided LiDAR Odometry Network (SalLONet) is proposed, which applies the point cloud saliency and semantic information to improve the performance of the odometry model. The objective of LiDAR odometry estimation is to output the pose by matching two point clouds; in other words, it can be regarded as a registration problem between two LiDAR scans. Therefore, our saliency-guided LiDAR odometry model is motivated by two observations: (1) dynamic points should be suppressed as much as possible, since they may degrade odometry estimation during registration; and (2) static points should be given higher priority so that the model focuses more on salient static points for feature matching. To this end, we utilize our SalLiDAR model to predict the semantics and saliency map of the point cloud for odometry estimation. The semantic and saliency predictions are not only fed into the odometry model as input features but also integrated into a saliency-guided odometry loss to regularize the odometry model.

Fig. 5

Framework of proposed two-stream semantic segmentation model. The saliency prediction network is pre-trained on our FordSaliency dataset

3.3 Proposed method

  1. Learning point cloud saliency: To learn point cloud saliency representations, we adopt existing LiDAR-based semantic segmentation models [8, 26, 47] as backbones for the feature extractor. As shown in Fig. 3b, given a 3D LiDAR point cloud with coordinates and corresponding point-wise features, we first feed it into the feature extractor to obtain the representation of each point. These learned features are then passed through a saliency prediction layer to output the saliency score map of the input point cloud. We considered two types of models for learning the saliency distribution on point clouds: (i) classification-based saliency prediction and (ii) saliency regression. In this work, we use models trained with saliency regression, as it is a commonly used approach and the two methods perform similarly. More details can be found in our previous work [38].

  2. Two-stream segmentation model: As depicted in Fig. 5, we develop a two-stream semantic segmentation model on the point cloud by combining features from the saliency module and the semantic module. We feed an input point cloud into the saliency branch to predict the saliency distribution of the whole scene. Meanwhile, the point cloud is also fed into the semantic branch to extract point features and output the semantic class predictions. To validate the effectiveness of the learned point cloud saliency distribution knowledge, we initialize and freeze the parameters of the saliency branch with the weights pre-trained on the FordSaliency dataset. More details can be found in our previous work [38].

    Fig. 6

    Architecture of the Saliency-guided LiDAR Odometry Network (SalLONet). \({\mathcal {{{C}}}}_{t}\), \({\mathcal {{{C}}}}_{t-1}\), \({\mathcal {{{S}}}}_{t}\), \({\mathcal {{{S}}}}_{t-1}\), \({\mathcal {{{R}}}}_{t}\), and \({\mathcal {{{R}}}}_{t-1}\) represent the predicted LiDAR semantic maps, LiDAR saliency maps, and range images of the two consecutive point clouds \({\mathcal {{P}}}_{t}\) and \({\mathcal {{P}}}_{t-1}\). The transformation \(\hat{\varvec{T}}\) between the two point clouds consists of the estimated translation t and rotation q (i.e., the pose)

  3. LiDAR odometry estimation module: As shown in Fig. 6, the semantic map and saliency map are first predicted by the saliency and semantic modules of SalLiDAR. For odometry estimation, we convert two consecutive LiDAR point clouds, together with their predicted saliency and semantic maps, into range images and concatenate them as the input of the odometry module. The outputs of the odometry module are the feature vectors of the translation \({\textbf {t}}\) and rotation \({\textbf {q}}\) between the two LiDAR point clouds, from which we construct the rigid-body transformation \({\hat{T}}_{t-1, t}\). The source LiDAR scan \({\mathcal {P}}_{t}\) can be transformed into \(\hat{{\mathcal {P}}}_{t-1}\) by \({\hat{T}}_{t-1, t}\) to match the target LiDAR scan \({\mathcal {P}}_{t-1}\). Thus, the odometry module can be supervised by the point-wise matching errors between the transformed scan \(\hat{{\mathcal {P}}}_{t-1}\) and the target scan \({\mathcal {P}}_{t-1}\).

    Fig. 7

    Demonstrations of the semantics and saliency map for LiDAR odometry estimation. The top depicts the binarized semantics based on dynamic (e.g., car) and static points (e.g., building, traffic sign), and the bottom shows the point-wise matching with saliency maps between two consecutive scans

To regularize the odometry module, we exploit the saliency and semantic segmentation information in a saliency-guided odometry loss for model training. Previous odometry studies [5, 6] have shown that dynamic points may degrade the performance of odometry estimation; thus, we exploit the predicted semantic map to suppress the effect of dynamic points. In addition, we utilize the predicted saliency map to increase the priority of static salient points when matching two LiDAR scans. To obtain the saliency and semantic predictions for the LiDAR odometry module, we initialize and freeze the parameters of the saliency branch with the weights pre-trained on our FordSaliency [38] dataset. We also initialize and freeze the parameters of the semantic module with the weights learned from the SemanticKITTI [41] dataset. Since the predicted saliency distribution represents the attention level of each point, we can apply it to constrain the parameter optimization of the odometry module. Similar to the proposed SalLiDAR model, we propose three different ways to integrate the saliency and semantic information for LiDAR odometry estimation:

Table 2 Assignment of dynamic (D\(\rightarrow\)0) and static (S\(\rightarrow\)1) points based on semantic categories defined in SemanticKITTI [41] dataset

SalLONet-I: Saliency-guided odometry loss with saliency prediction and semantic mask for odometry estimation. If we remove the saliency and semantic concatenation module in Fig. 6, we obtain SalLONet-I. Considering that dynamic points (e.g., a moving car or pedestrian) may adversely affect odometry estimation, we first convert the predicted semantic map into a binarized mask indicating static points (e.g., building, road) and dynamic points (e.g., car, person). As shown in Fig. 7 (top) and Table 2, the predicted semantics are binarized into dynamic and static points based on the semantic categories defined in the SemanticKITTI [41] dataset. A point whose semantic class is moving or potentially moving is defined as a dynamic point; for example, a point labeled either car or moving car is treated as dynamic. Then, the binarized semantic mask is applied to suppress the adverse effects of dynamic points and increase the weights of static points in the point-wise matching odometry loss. Additionally, we apply the saliency prediction to the odometry loss during training, thus encouraging the odometry model to focus more on static salient points for matching, as shown in Fig. 7 (bottom).
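
A minimal sketch of this binarization step is given below. The numeric class IDs follow the commonly used 19-class SemanticKITTI training mapping and are stated here only as an illustrative assumption; they should be adapted to the label mapping actually in use.

```python
import numpy as np

# Dynamic (moving or potentially moving) classes per Table 2. The IDs below
# assume the 19-class SemanticKITTI training mapping (1=car, 2=bicycle,
# 3=motorcycle, 4=truck, 5=other-vehicle, 6=person, 7=bicyclist, 8=motorcyclist).
DYNAMIC_IDS = {1, 2, 3, 4, 5, 6, 7, 8}

def binarize_semantics(sem_labels):
    """Map per-point semantic predictions to a static/dynamic mask:
    dynamic points -> 0 (suppressed), static points -> 1 (kept)."""
    sem_labels = np.asarray(sem_labels)
    return np.where(np.isin(sem_labels, list(DYNAMIC_IDS)), 0.0, 1.0)
```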

Following the study [42], we use geometric losses to optimize the odometry estimation module by calculating the point-wise matching errors:

$$\begin{aligned} {\mathcal {L}}^{odom} = \lambda \cdot {\mathcal {L}}_{p2n} + {\mathcal {L}}_{n2n} \end{aligned}$$
(3)

where \(\lambda\) is a constant that balances the two losses, and \({\mathcal {L}}_{p2n}\) and \({\mathcal {L}}_{n2n}\) are the point-to-plane loss and plane-to-plane loss, respectively, which can be represented as follows:

$$\begin{aligned} {\mathcal {L}}_{p2n} = \frac{1}{N} \sum _{i=1}^{N} | (\hat{p}_{i} - {p}_{i}) \cdot {n}_{i} |_{2}^{2} \end{aligned}$$
(4)
$$\begin{aligned} {\mathcal {L}}_{n2n} = \frac{1}{N} \sum _{i=1}^{N} \Vert \hat{n}_{i} - {n}_{i} \Vert _{2}^{2} \end{aligned}$$
(5)

where \(\hat{p}_{i}\) and \({p}_{i}\) are the coordinates of the corresponding points in \(\hat{{\mathcal {P}}}_{t-1}\) and \({\mathcal {P}}_{t-1}\), respectively, and \(\hat{n}_{i}\) and \({n}_{i}\) are the corresponding normal vectors of \(\hat{{\mathcal {N}}}_{t-1}\) and \({\mathcal {N}}_{t-1}\), respectively.
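
The sketch below (in PyTorch; the function name is illustrative, and the point correspondences between the transformed and target scans are assumed to have been established beforehand, e.g., by a nearest-neighbour search) computes the losses of Eqs. (3)-(5):

```python
import torch

def geometric_odometry_loss(p_hat, p, n_hat, n, lam=1.0):
    """Geometric matching losses of Eqs. (3)-(5) for corresponding points.
    p_hat, p: (N, 3) transformed/target coordinates; n_hat, n: (N, 3)
    transformed/target normals; lam is the balancing constant lambda."""
    l_p2n = (((p_hat - p) * n).sum(dim=1) ** 2).mean()   # point-to-plane, Eq. (4)
    l_n2n = ((n_hat - n) ** 2).sum(dim=1).mean()         # plane-to-plane, Eq. (5)
    return lam * l_p2n + l_n2n                           # combined loss, Eq. (3)
```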

Fig. 8

Visualization of saliency and semantic maps for odometry estimation. The results include (a) the semantic prediction and (b) the saliency prediction of the proposed SalLiDAR model; (c) the binarized semantic map (i.e., dynamic and static points); and (d) the weighted map from saliency and binarized semantics used for the loss weighting of the proposed SalLONet by Eq. (7)

To guide the odometry model to focus more on salient static points for matching, we apply the saliency and binarized semantic maps to the geometric-based loss. Thus, the saliency-guided odometry loss can be represented as:

$$\begin{aligned} \hat{{\mathcal {L}}}^{odom} = \frac{1}{N} \sum _{i=1}^{N} {l}^{odom}_{i} * {\mathcal {W}}_{i} \end{aligned}$$
(6)
$$\begin{aligned} {\mathcal {W}}_{i} = \exp {(s_{i}^{t}*{s}_{i}^{t-1}*{c}_{i}^{t}*{c}_{i}^{t-1})} \end{aligned}$$
(7)

where i is the point index; \(\hat{{\mathcal {L}}}^{odom}\) is the weighted odometry loss; \({l}^{odom}_{i}\) denotes the point-wise odometry matching loss of point \(p_{i}\); and \({\mathcal {W}}_{i}\) is the corresponding point-wise weight. \(s_{i}^{t}\) and \({s}_{i}^{t-1}\) denote the saliency predictions of point \(p_{i}\) from scans t and \(t-1\); \(c_{i}^{t}\) and \({c}_{i}^{t-1}\) denote the binarized semantic predictions of point \(p_{i}\) from scans t and \(t-1\); and \(*\) represents element-wise multiplication. Figure 8 illustrates a visualization of the saliency and semantic maps used for attention-guided LiDAR odometry estimation. We can observe that the weights of dynamic points (e.g., car) are suppressed, while static points (e.g., building) are highlighted by the binarized semantics and the LiDAR saliency map.
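
A minimal PyTorch sketch of this weighting (the function name is illustrative) is shown below. Note that dynamic or non-salient points receive a weight of \(\exp (0)=1\), while salient static points are up-weighted towards e, so the static points dominate the matching loss:

```python
import torch

def saliency_guided_loss(l_odom, s_t, s_prev, c_t, c_prev):
    """Weight the per-point odometry loss by saliency and binarized
    semantics as in Eqs. (6)-(7). All inputs are (N,) tensors: l_odom is
    the point-wise matching loss, s_* are saliency scores in [0, 1], and
    c_* are static/dynamic masks (1 = static, 0 = dynamic)."""
    w = torch.exp(s_t * s_prev * c_t * c_prev)   # Eq. (7): per-point weight in [1, e]
    return (l_odom * w).mean()                   # Eq. (6): weighted odometry loss
```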

SalLONet-II: Saliency prediction and semantic mask as descriptors for odometry estimation. If we remove the point-wise saliency-guided loss module in Fig. 6, we obtain SalLONet-II. We append the normalized saliency and binarized semantic predictions to the point cloud coordinates as input features for the odometry model. We believe that prior saliency knowledge and high-level semantic information can aid feature learning and localization in odometry estimation.
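
As an illustration of how these descriptors enter the network, the sketch below stacks them as extra channels of the projected range image; the exact channel layout and the function name are assumptions for illustration.

```python
import numpy as np

def build_odometry_input(range_img, normals_img, sal_img, sem_bin_img):
    """Stack the per-pixel descriptors of one projected scan into a
    multi-channel input for the odometry module (a sketch of the
    SalLONet-II/III input). Shapes: range_img (H, W), normals_img (H, W, 3),
    sal_img (H, W) in [0, 1], sem_bin_img (H, W) with 1 = static, 0 = dynamic."""
    return np.concatenate(
        [range_img[..., None], normals_img, sal_img[..., None], sem_bin_img[..., None]],
        axis=-1,
    )  # (H, W, 6); the two consecutive scans are then concatenated as in Fig. 6
```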

SalLONet-III: Saliency distribution and semantic prediction as descriptors and attentive loss guidance for odometry estimation. Figure 6 shows the SalLONet-III model. In this model, the saliency and semantic maps are not only utilized as additional input features for the odometry module but also used in the saliency-guided odometry loss for optimization during training.

4 Experimental analysis

4.1 Implementation details

For point cloud saliency detection, we employ the PointNet [26], RandLA-Net [47], and Cylinder3D [8] models as feature extractors. For 3D semantic segmentation, we use RandLA-Net [47] and Cylinder3D [8] as baselines. For LiDAR odometry, we adopt the DeLORA [42] model as the baseline. Following the DeLORA [42] model, range images and normal features are converted from two consecutive raw LiDAR point clouds as input features for odometry estimation. We adopt our SalLiDAR model [38] to generate saliency and semantic predictions for the LiDAR odometry module. For a fair comparison, all odometry networks of the baseline and our proposed methods are randomly initialized and trained on the KITTI [50] odometry dataset from scratch with 5-fold validation. The initial learning rate is 1e-5, and all odometry models are trained with self-supervised learning for 100 epochs.
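
For reference, a minimal sketch of such a self-supervised training loop is shown below. Only the learning rate and epoch count come from the text above; the optimizer choice (Adam) and the model/dataloader interfaces are assumptions for illustration.

```python
import torch

def train_odometry(model, loader, epochs=100, lr=1e-5, device="cuda"):
    """Sketch of a self-supervised odometry training loop: each batch holds
    two consecutive scans plus their saliency/semantic maps, and the model
    is assumed to return the (saliency-guided) odometry matching loss."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(batch)          # no ground-truth poses are used
            optim.zero_grad()
            loss.backward()
            optim.step()
```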

4.2 Datasets and experimental setup

4.2.1 LiDAR FordSaliency dataset

Based on the data of the FordCampus [37] dataset, we build a point cloud saliency dataset (FordSaliency) for training LiDAR-based saliency models. We use dataset1 and dataset2 of FordSaliency as the validation set and training set, respectively. More details can be found in [38].

4.2.2 SemanticKITTI dataset

SemanticKITTI [41] dataset is a well-known large-scale dataset for point cloud semantic segmentation. This dataset consists of 22 Velodyne driving-scene sequences, which are split into a training set (sequences 00–07 and 09–10), a validation set (sequence 08), and a testing set (sequences 11–21).

4.2.3 KITTI odometry dataset

We conduct all the odometry experiments on KITTI [50] odometry dataset, which provides LiDAR point clouds captured from the Velodyne LiDAR sensor. By following the odometry works [5, 42], the dataset is divided into a training set (Sequences 00–08) and a validation set (Sequences 09–10).

4.3 Evaluation metrics

For point cloud saliency detection, we use popular saliency metrics, including the Correlation Coefficient (CC), Similarity (SIM), and Kullback–Leibler Divergence (KLD), to evaluate the performance of the point cloud saliency model. For the performance evaluation of LiDAR semantic segmentation, we adopt the mean Intersection-over-Union (mIoU) as the evaluation metric, following the previous studies [8, 47]. For LiDAR odometry estimation, the average translational (\([\%]\)) and average rotational (\([\frac{deg}{100\,m}]\)) RMSE (Root Mean Square Error) are adopted to evaluate the performance of LiDAR odometry models.
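
A sketch of the three saliency metrics, following their standard definitions on per-point saliency vectors (normalization details are assumptions of this sketch), is given below:

```python
import numpy as np

def saliency_metrics(pred, gt, eps=1e-8):
    """Compute CC, SIM, and KLD between a predicted and a (pseudo)
    ground-truth saliency vector, both of shape (N,)."""
    # Correlation Coefficient between prediction and ground truth
    cc = np.corrcoef(pred, gt)[0, 1]
    # Similarity: histogram intersection of the normalized distributions
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    sim = np.minimum(p, g).sum()
    # Kullback-Leibler Divergence of the prediction from the ground truth
    kld = np.sum(g * np.log(g / (p + eps) + eps))
    return cc, sim, kld
```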

4.4 Results and performance analysis

4.4.1 LiDAR saliency results on FordSaliency dataset

We compare the performance of LiDAR-based saliency models with different feature extractors on our FordSaliency dataset. Figure 9 shows the visualization results of SalLiDAR models with different backbones on the FordSaliency validation set, and Table 3 reports their quantitative performance. From Fig. 9 and Table 3, we observe that, although the saliency annotations are pseudo-labels, all these LiDAR-based models are able to learn discriminative point cloud saliency representations for saliency distribution prediction. Moreover, the model with the Cylinder3D backbone predicts a better saliency distribution than the models with the other backbones. The models with the RandLA-Net and PointNet backbones can learn correlation and similarity features from the point cloud saliency annotations, as evidenced by the CC, SIM, and KLD values in Table 3. However, the model with the Cylinder3D backbone achieves higher CC and SIM and lower KLD. This suggests that a model with a voxel-based partition (e.g., 3D cylinder) could learn more powerful saliency representations than point-based models (Fig. 10).

Table 3 Results of SalLiDAR models with different backbones on FordSaliency dataset
Fig. 9

Point cloud saliency prediction results of the SalLiDAR model with different backbones on the FordSaliency dataset. Saliency values are scaled to the range [0, 1]; the colour bar indicates the saliency values corresponding to the saliency heat map fused on the colour images

Fig. 10

Visualization comparison of the baseline and proposed LiDAR segmentation models on SemanticKITTI [41]. From the first to the last column: the semantic ground truth, the semantic predictions of the baseline models, the semantic results of the proposed models, and the saliency predictions of the proposed models, respectively

Table 4 Performance comparison of proposed models and existing LiDAR segmentation methods on SemanticKITTI [41] test set

4.4.2 Semantic segmentation results on SemanticKITTI dataset

We report the LiDAR semantic segmentation performance on the test set of SemanticKITTI in Table 4. Note that all the test performance results in Table 4 are taken from the literature and the benchmark leaderboard of the SemanticKITTI [41] dataset. Comparing Table 3 and Table 4, the mIoU results on the test sequences indicate improved generalization ability on the larger set of evaluation samples. Compared to the baselines, all the models with SalLiDAR obtain better mIoU results. The proposed method also improves the segmentation performance on specific classes, such as car, truck, and parking, since the integration of our predicted saliency distribution makes the model attentive to these categories. Furthermore, the Cylinder3D model with SalLiDAR achieves better segmentation results than RandLA-Net with SalLiDAR, which indicates that a semantic segmentation model with better saliency predictions can provide more attentive information or features to improve performance. In particular, these experimental results demonstrate that the performance of LiDAR semantic segmentation models can be improved by the proposed saliency distribution integration and point-wise attention-guided loss. These comparison results validate the effectiveness of the pre-trained point cloud saliency models, even though they are trained on the FordSaliency dataset with pseudo-annotations.

Fig. 11

Trajectory comparison of the proposed SalLONet odometry models against the baseline odometry model [42] on Sequence 09 (top) and Sequence 10 (bottom) of the KITTI [50] odometry dataset

Fig. 12

Detailed trajectory results of proposed SalLONet odometry models against the baseline odometry model [42] on Sequence 09 (top) and Sequence 10 (bottom) of KITTI [50] odometry dataset

Table 5 Comparison of translational (\([\%]\)) and rotational (\([\frac{deg}{100\,m}]\)) errors on validation set of KITTI [50] odometry dataset

4.4.3 Odometry results on KITTI dataset

Figures 11 and 12 show the trajectory results of the proposed SalLONet models on Sequences 09–10 of the KITTI [50] odometry dataset. We can observe that the SalLONet models with saliency and semantic information produce trajectories closer to the ground truth than the baseline model. In Table 5, we present the quantitative results of the proposed approaches and six existing odometry methods. DeepLO [59] and Velas et al. [60] are supervised LiDAR odometry models, i.e., the ground-truth poses of the training set are used to train them. DeLORA [42] is an unsupervised LiDAR odometry model and does not require pose labels for training. Following the study [42], three unsupervised visual odometry estimation methods [61,62,63] are also included for comparison. As shown in Table 5, the three proposed SalLONet models improve on the baseline model, as evidenced by lower translational and rotational errors on Sequences 09–10 of the KITTI [50] odometry dataset. Among the unsupervised methods, SalLONet-III achieves the best results with the lowest errors on both validation sequences. In particular, its translational error on Sequence 10 (\(t_{\text {rel}}=4.940\)) even outperforms the supervised DeepLO [59] method (\(t_{\text {rel}}=5.020\)). In summary, these experimental results show that saliency and semantic information are effective for improving odometry estimation, which implicitly indicates the effectiveness of image-to-LiDAR saliency knowledge transfer.

Table 6 Ablation study: comparison of translational (\([\%]\)) and rotational (\([\frac{deg}{100\,m}]\)) errors on the validation set of the KITTI [50] odometry dataset

4.5 Ablation studies

We investigate the effectiveness of saliency and semantics for LiDAR odometry estimation. From Table 5, we observe that the SalLONet-III model achieves the best results by leveraging both semantic and saliency cues; thus, we conduct the ablation study on SalLONet-III to verify the influence of the saliency and semantic maps on LiDAR odometry estimation. In the proposed SalLONet-III method, we leverage saliency and semantic predictions for LiDAR odometry estimation simultaneously. We first validate the model with saliency information only, and we also train the model with semantic information only. The results of the ablation study are shown in Table 6. The experimental evaluation shows that the SalLONet model with both saliency and semantic information achieves superior performance on the KITTI [50] odometry dataset. In addition, the SalLONet model with saliency only outperforms the baseline model, which demonstrates that saliency information is effective for improving LiDAR odometry estimation. The SalLONet model with semantics only obtains competitive results compared with the baseline model. However, the full SalLONet model benefits from both saliency and semantic information, achieving the lowest translational and rotational errors on Sequences 09–10 of the KITTI [50] odometry dataset.

4.6 Limitations

The human visual system takes advantage of visual attention mechanisms to explore and analyse scenes. Inspired by this concept, we conducted an exploratory study to investigate how attention (i.e., saliency maps) can be integrated into vision tasks for point clouds. Therefore, this work primarily focuses on the accuracy improvement in point cloud vision tasks achieved by incorporating additional cues, such as saliency knowledge for semantic segmentation and saliency and semantics for odometry, rather than on the computational efficiency of the approach.

However, the SalLONet-III version requires saliency and segmentation prediction networks to run prior to odometry, using their outputs along with the point cloud data as inputs. This increases computational complexity depending on the selected models for saliency and segmentation predictions, which is a limitation despite the improvement in odometry accuracy compared to the baseline odometry model. It is worth noting that SalLONet-I does not require saliency and segmentation prediction as inputs to the odometry network; instead, it employs these cues only during training as loss weighting parameters. Thus, during inference, SalLONet-I operates similarly to the baseline model without additional complexity or computational cost, while still outperforming the baseline model.

Another limitation is the difficulty of evaluating and validating the FordSaliency dataset and the models trained on it for saliency prediction, because the point cloud saliency ground-truth maps in the proposed FordSaliency dataset are pseudo-saliency values obtained from state-of-the-art image saliency models. Moreover, due to the sparse nature of point cloud data, only limited image regions are projected or represented, so some salient and non-salient information available in the image domain is missing. Therefore, aside from the saliency evaluation metrics comparing predictions with pseudo-ground-truth values, the most feasible way to verify the effectiveness of point cloud saliency predictions is to test them on saliency-guided vision tasks, as proposed in this work. We hope this work opens new research directions for exploring these limitations and inspires new ideas to solve or minimize them.

5 Conclusion

This article has presented our research on establishing LiDAR-based saliency detection models with image-to-LiDAR transfer learning to improve the performance of 3D point cloud understanding tasks. We propose a Saliency-guided LiDAR Odometry Network (SalLONet) that combines the saliency and semantic information of point clouds. First, the saliency and semantic maps generated by the proposed two-stream semantic model are fed into the odometry module as feature representations of the input consecutive point clouds. Second, the saliency and semantic predictions are applied to the odometry loss. To alleviate the effect of dynamic points on pose regression, we binarize the semantic prediction into dynamic and static points based on the semantic class. The binarized semantics are then used to suppress the dynamic points by point-wise multiplication in the loss weighting. To further encourage the odometry module to learn discriminative features, the saliency map is leveraged to increase the loss weights of salient static points for matching two point clouds. Extensive experimental results on the KITTI [50] odometry dataset have demonstrated the strong performance of the proposed odometry model with saliency and semantic information, which simultaneously considers the influence of dynamic and static salient points on pose estimation.