Attention-Guided Lidar Segmentation and Odometry Using Image-to-Point Cloud Saliency Transfer

LiDAR odometry estimation and 3D semantic segmentation are crucial for autonomous driving and have achieved remarkable advances recently. However, these tasks are challenging due to the imbalance of points across semantic categories for 3D semantic segmentation and the influence of dynamic objects for LiDAR odometry estimation, which increases the importance of using representative/salient landmarks as reference points for robust feature learning. To address these challenges, we propose a saliency-guided approach that leverages attention information to improve the performance of LiDAR odometry estimation and semantic segmentation models. Unlike in the image domain, only a few studies have addressed point cloud saliency information, due to the lack of annotated training data. To alleviate this, we first present a universal framework to transfer saliency distribution knowledge from color images to point clouds, and use it to construct a pseudo-saliency dataset (i.e., FordSaliency) for point clouds. Then, we adopt point cloud-based backbones to learn saliency distributions from pseudo-saliency labels, which is followed by our proposed SalLiDAR module. SalLiDAR is a saliency-guided 3D semantic segmentation model that integrates saliency information to improve segmentation performance. Finally, we introduce SalLONet, a self-supervised saliency-guided LiDAR odometry network that uses the semantic and saliency predictions of SalLiDAR to achieve better odometry estimation. Our extensive experiments on benchmark datasets demonstrate that the proposed SalLiDAR and SalLONet models achieve state-of-the-art performance against existing methods, highlighting the effectiveness of image-to-LiDAR saliency knowledge transfer. Source code will be available at https://github.com/nevrez/SalLONet.


Introduction
Understanding 3D point clouds has become increasingly important with the rise of robotics technologies such as augmented/virtual/mixed reality and autonomous vehicles. Autonomous driving, for instance, allows vehicles to sense and respond to their environment without human intervention. However, ensuring system safety relies heavily on accurate perception and localization of the environment. Simultaneous Localization and Mapping (SLAM) [1] plays a critical role in the perception and planning process of autonomous vehicles by constructing a map of the surrounding environment and localizing the vehicle within it. Visual/LiDAR odometry estimation [2,3,4,5,6] is an essential component of a SLAM system, aiming to estimate the robot's pose from consecutive point clouds. Moreover, large-scale data-driven applications, such as LiDAR semantic segmentation [7,8] and odometry estimation [6], empower advanced robotics technologies. Similarly, the use of saliency information in 2D computer vision tasks such as image translation [9], object tracking [10,11], key-point selection [12,13], and person re-identification [14,15,16] has led to state-of-the-art results by capturing the predominant information in a scene. However, the unstructured, unordered, and density-varying nature of point clouds makes it difficult for conventional point-cloud-based methods to process informative visual features effectively and rapidly in large-scale scenes. To address this challenge and enhance the performance of real-time autonomous vehicles, several works [17,18,19] have explored saliency detection algorithms for point cloud tasks, showing that integrating efficient saliency knowledge can further enhance 3D point cloud understanding.

In LiDAR odometry estimation, keypoint selection is often used to facilitate the learning of matching features [21,22,23,24]. SIFT-based approaches, such as LodoNet [21], extract matched keypoint pairs, which are then used to learn point-wise features in PointNet [25]. Alternatively, saliency-based point selection methods, such as that used by SalientDSO [22], demonstrate the potential benefits of incorporating saliency information into visual odometry. As shown in Figure 1, while SIFT key points [20] and the salient regions of saliency maps can both detect significant and consistent landmarks in the scene (e.g., buildings and traffic signs), saliency maps offer a more continuous and soft indication of attentive probabilities, unlike the sparse and discrete nature of key points. Thus, integrating saliency information has the potential to improve odometry estimation.* However, despite significant progress in LiDAR odometry estimation, challenges remain, particularly in crowded environments with moving objects that can introduce noise and occlusions [26,27,28,29]. To address this issue, some odometry methods [26,27,28,29] use semantics to mitigate the adverse effects of moving-object regions/points in the input data. Static objects provide stable and consistent reference points for geometry-based matching, which is critical for successful pose estimation. For example, early Iterative Closest Point (ICP) [30,31] based odometry models [32,33] estimate the transformation iteratively by minimizing matching errors between corresponding points of two scans.

* The saliency ground-truth maps are created from the raw data of the FordCampus dataset [36] through the average response of saliency model-annotators (see 4.2.1).
In this paper, we focus on improving LiDAR odometry estimation and 3D semantic segmentation by learning robust and discriminative features under saliency information constraints. Specifically, we propose a saliency-guided 3D semantic segmentation method that exploits saliency cues to facilitate robust feature learning. We also propose a saliency-guided LiDAR odometry approach that leverages attention information and semantics to improve performance. Figure 2 illustrates an overview of the proposed framework of image-to-LiDAR saliency knowledge transfer for attention-guided LiDAR semantic segmentation and odometry estimation.
Several attempts have been made to find effective solutions for saliency detection on point clouds [17,18,12,19,37]. In Table 1, we compare existing saliency detection datasets on point clouds; given their limitations, it is highly desirable to develop a practicable deep learning-based pipeline for saliency prediction on large-scale point clouds. This study is an extension of our previous work [37] on point cloud saliency prediction and 3D semantic segmentation. In our previous work [37], we designed a universal framework and a point cloud saliency dataset (FordSaliency) to transfer saliency distribution knowledge to point clouds. An attention-guided two-stream network was then proposed to improve the accuracy of the LiDAR semantic segmentation task. The first stream is a LiDAR-based saliency network trained on the FordSaliency dataset that guides the segmentation task. The second stream is a segmentation module that predicts the semantics of the input point cloud [37].

In this work, we propose a saliency-guided deep self-supervised odometry model that combines the saliency and semantic predictions of SalLiDAR [43] for LiDAR odometry estimation. In brief, the key contributions of this work can be summarized as follows:

• We propose a saliency-guided LiDAR odometry estimation model trained in a self-supervised manner. The proposed model consists of three modules: a saliency module, a semantic module, and an odometry module.

• To mitigate the adverse effects of dynamic points on the LiDAR odometry model, we binarize the semantic map into dynamic and static points using the semantic labels defined in the SemanticKITTI dataset [44]. The point cloud, along with the binarized semantic map and saliency map, is then fed into the odometry module for feature learning.

• To prioritize salient static points for point cloud matching, we propose a saliency-guided odometry loss that uses the saliency and binarized semantic maps to regularize the odometry module. This helps the module focus more on attentive points and improves the accuracy of point cloud matching.

Our extensive experiments on benchmark datasets suggest that the proposed two-stream LiDAR odometry model with saliency and semantic knowledge improves odometry estimation and achieves better performance than existing methods.
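The binarization step in the second contribution can be sketched as a simple label lookup. This is a minimal NumPy sketch; the dynamic label IDs below are an assumed subset of the SemanticKITTI mapping (moving and potentially moving classes), not the exact released configuration:

```python
import numpy as np

# Assumed subset of SemanticKITTI label IDs treated as dynamic
# (vehicles, people, and the moving-* classes); the authoritative
# mapping lives in the semantic-kitti.yaml configuration.
DYNAMIC_IDS = {10, 11, 13, 15, 18, 20, 30, 31, 32,
               252, 253, 254, 255, 256, 257, 258, 259}

def binarize_semantics(labels: np.ndarray) -> np.ndarray:
    """Map per-point semantic labels to a static(1)/dynamic(0) mask."""
    return (~np.isin(labels, list(DYNAMIC_IDS))).astype(np.float32)

labels = np.array([40, 10, 50, 252])  # road, car, building, moving-car
mask = binarize_semantics(labels)     # static points keep weight 1
```

The resulting mask can then be stacked with the saliency map as extra per-point channels before they are fed to the odometry module.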

LiDAR Odometry Estimation
Recently, deep learning-based odometry models [38,39,5,6] have achieved great progress. For example, DeLORA [38] trains a model to output a rigid-body transformation; the geometric transformation is then applied to the source LiDAR scan and its normal vectors to obtain a transformed scan and transformed normal vectors. Afterward, a point-wise geometric loss between the transformed scan and the target scan guides the model to learn geometry-specific features, generating a transformation that matches the transformed and target scans as closely as possible [38]. In this paper, we focus on deep learning-based LiDAR odometry research, which has advanced considerably in recent years [3,5,40,39].

Saliency Detection on Point Cloud
Saliency detection aims to find the most eye-attracting locations in a visual scene, which can be traced back to the pioneering work of Itti et al. [41].
With the rapidly emerging advances and applications of deep learning techniques, saliency detection on color images/videos [42] has made great progress in recent years. There are also several works [17,18,12,13,19,34] on saliency computation for point clouds. For example, Ding et al. [17] propose a 3D mesh saliency calculation method that fuses local distinctness and global rarity features. Tinchev et al. [13] present a key point detector for point clouds based on saliency estimation: they compute the gradient response of a differentiable convolutional network to obtain the saliency map, and then use multiple fully connected layers to combine the saliency feature, point cloud context feature, and PCA features of point descriptors [13]. Zheng et al. [19] present a saliency computation method using a loss-gradient approach that approximates point dropping in a differentiable manner by shifting points towards the point cloud center. However, saliency methods focusing on 3D meshes or indoor scenes are limited in their ability to process large-scale 3D point clouds such as 3D driving data. Moreover, saliency models extracting handcrafted descriptors may ignore informative representations for point clouds with varying density and complex backgrounds in outdoor scenarios.

LiDAR Semantic Segmentation
LiDAR semantic segmentation [7,43,44,8,25,45] predicts the semantic class of each point of a LiDAR scan (see Section 3.1). For odometry, the model estimates the relative transformation T̂_{t−1,t} between point clouds P_t and P_{t−1}. P_t can be transformed into P̂_{t−1} in the coordinate system of P_{t−1} by the transformation T̂_{t−1,t}:

P̂_{t−1} = T̂_{t−1,t} ⊙ P_t,

where ⊙ represents the point-wise matrix multiplication. Afterward, the point-wise matching loss between P̂_{t−1} and P_{t−1} can be calculated to train the odometry model, thereby forcing the model to predict an optimal transformation T̂_{t−1,t}. Similarly, the normal vectors N_t of P_t can be transformed into N̂_{t−1} in the coordinate system of P_{t−1}:

N̂_{t−1} = T̂_{t−1,t} ⊙ N_t.

Therefore, the odometry model can be trained in a self-supervised manner by calculating the point-wise matching loss, without requiring the odometry ground truth T_{t−1,t}.
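The transform-and-match idea above can be sketched with a toy example. This is a minimal NumPy sketch using a homogeneous 4x4 transform and assuming known point correspondences (the real pipeline matches via projected range images):

```python
import numpy as np

def transform_points(points, T):
    """Apply a 4x4 rigid transformation T to an (N, 3) point array."""
    homog = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4) homogeneous
    return (homog @ T.T)[:, :3]

def matching_loss(p_hat, p_ref):
    """Mean point-wise matching error, assuming known correspondences."""
    return float(np.mean(np.linalg.norm(p_hat - p_ref, axis=1)))

# Toy check: translating by (1, 0, 0) moves every point one unit along x,
# so the matching loss against the untransformed cloud is exactly 1.
P_t = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
T = np.eye(4); T[0, 3] = 1.0
P_hat = transform_points(P_t, T)
loss = matching_loss(P_hat, P_t)
```

Minimizing such a loss over the predicted transformation is what lets the model train without ground-truth poses.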

Attention-guided LiDAR Odometry Estimation

Framework Overview
In Figure 3, we show the overview of image-to-LiDAR saliency knowledge transfer for 3D point cloud understanding. There are three main sub-tasks, following our previous work [37]: 1) image-to-LiDAR saliency knowledge transfer, 2) LiDAR-to-LiDAR pseudo-saliency learning, and 3) saliency-guided 3D point cloud understanding. Figure 7 (top) depicts the binarized semantics of dynamic (e.g., car) and static points (e.g., building, traffic sign), and Figure 7 (bottom) shows the point-wise matching with saliency maps between two consecutive scans.

To regularize the odometry module, we exploit the saliency and semantic segmentation information in a saliency-guided odometry loss for model training. Previous odometry studies [5,6] have shown that dynamic points may degrade the performance of odometry estimation, so we exploit the predicted semantic map to suppress the effect of dynamic points. In addition, we utilize the predicted saliency map to increase the priority of static salient points when matching two LiDAR scans. To obtain the saliency and semantic predictions for the LiDAR odometry module, we initialize and freeze the parameters of the saliency branch with the weights pre-trained on our FordSaliency [37] dataset. We also use the weights learned from the SemanticKITTI [46] dataset to initialize and freeze the parameters of the semantic module. Since the predicted saliency distribution represents the attention level of each point, we can apply it to constrain the parameter optimization of the semantic segmentation module.

SalLONet-I: Saliency-guided odometry loss with saliency prediction and semantic mask for odometry estimation. If we remove the saliency and semantic concatenation module in Figure 6, the model becomes SalLONet-I. Considering that dynamic points (e.g., a moving car or pedestrian) may adversely affect odometry estimation, we first convert the predicted semantic map into a binarized mask that distinguishes static points (e.g., building, road) from dynamic points (e.g., car, person). As shown in Figure 7 (top) and Table 2, the predicted semantics are binarized into dynamic and static points based on the semantic categories defined in the SemanticKITTI [46] dataset. A point whose semantic class is moving or potentially moving is defined as a dynamic point.

Similar to the proposed SalLiDAR model, we propose three different integration methods with saliency and semantic information for LiDAR odometry estimation:
For example, regardless of whether the semantic category of a point is car or moving car, it is defined as a dynamic point. The binarized semantic mask is then applied to suppress the adverse effects of dynamic points and to increase the weights of static points in the point-wise matching odometry loss. Additionally, we apply the saliency prediction to the odometry loss during training, thus facilitating the odometry model to focus more on static salient points for matching, as shown in Figure 7 (bottom).
Following [38], we use geometry-based losses to optimize the odometry estimation module by calculating the point-wise matching errors:

L_geo = L_p2n + λ L_n2n,

where λ is a constant balancing the two losses, and L_p2n and L_n2n are the point-to-plane and normal-based matching losses, which can be represented as:

L_p2n = Σ_i |(p̂_i − p_i) · n_i|²,    L_n2n = Σ_i ||n̂_i − n_i||²,

where p̂_i and p_i are the point coordinates of P̂ and P, and n̂_i and n_i are the normal vectors of N̂ and N, respectively (the exact loss forms follow [38]). To guide the odometry model to focus more on salient static points for matching, we apply the saliency and binarized semantic maps to the geometry-based loss. The saliency-guided odometry loss can thus be represented as:

L_odom = Σ_i (s_i^t * c_i^t) * (s_i^{t−1} * c_i^{t−1}) * l_i^odom,    (7)

where i is the point index, L_odom is the weighted odometry loss, and l_i^odom denotes the point-wise odometry matching loss of point p_i. s_i^t and s_i^{t−1} denote the saliency predictions of point p_i from scan t and scan t−1; c_i^t and c_i^{t−1} denote the binarized semantic predictions of point p_i from scan t and scan t−1; and * represents element-wise multiplication. Figure 8 illustrates the saliency and semantic maps used for attention-guided LiDAR odometry estimation. We can observe that the weights of dynamic points (e.g., car) are suppressed, while static points (e.g., building) are highlighted by the binarized semantics and the LiDAR saliency map.
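The weighting scheme can be illustrated with a small NumPy sketch. The product-of-weights form below is our reading of the description above, not the authors' released implementation, and the normalization by the weight sum is an added assumption:

```python
import numpy as np

def saliency_guided_loss(l_odom, s_t, s_tm1, c_t, c_tm1):
    """Weight per-point matching losses by saliency (s) and the binarized
    static/dynamic mask (c) of both scans; dynamic points get zero weight."""
    w = (s_t * c_t) * (s_tm1 * c_tm1)          # per-point weights
    return float(np.sum(w * l_odom) / (np.sum(w) + 1e-8))

l = np.array([1.0, 5.0, 2.0])                  # point-wise matching losses
s_t = np.array([0.9, 0.8, 0.1]); s_tm1 = np.array([0.9, 0.7, 0.2])
c_t = np.array([1.0, 0.0, 1.0]); c_tm1 = np.array([1.0, 0.0, 1.0])
loss = saliency_guided_loss(l, s_t, s_tm1, c_t, c_tm1)
# The dynamic point (index 1) receives zero weight despite its large loss.
```

Because the dynamic point is masked out, the large matching error it contributes cannot pull the estimated transformation toward the moving object.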
SalLONet-II: Saliency prediction and semantic mask as descriptors for odometry estimation. If we remove the point-wise saliency-guided loss module in Figure 6, the model becomes SalLONet-II. We append the normalized saliency and binarized semantic predictions to the point cloud coordinates as input features for the odometry model. We believe that prior saliency knowledge and high-level semantic information can aid feature learning and localization in odometry estimation.
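The SalLONet-II input construction can be sketched as a per-point channel concatenation. This is a minimal NumPy sketch; the min-max normalization and the channel order are assumptions for illustration:

```python
import numpy as np

def build_odometry_input(xyz, saliency, static_mask):
    """Append min-max-normalized saliency and the binarized semantic mask
    to the point coordinates as extra per-point channels -> (N, 5)."""
    rng = saliency.max() - saliency.min()
    s = (saliency - saliency.min()) / (rng + 1e-8)   # normalize to [0, 1]
    return np.hstack([xyz, s[:, None], static_mask[:, None]])

xyz = np.zeros((4, 3))
sal = np.array([0.2, 0.4, 0.6, 0.8])
mask = np.ones(4)
feats = build_odometry_input(xyz, sal, mask)  # shape (4, 5)
```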
SalLONet-III: Saliency distribution and semantic prediction as descriptors and attentive loss guidance for odometry estimation. Figure 6 shows the SalLONet-III model. In this model, the saliency and semantic maps are not only utilized as additional input features for the odometry module but also applied to the saliency-guided odometry loss for optimization during training.

Implementation Details
For point cloud saliency detection, we employ PointNet [25], RandLA-Net [43], and Cylinder3D [8] as feature extractors. For 3D semantic segmentation, we use RandLA-Net [43] and Cylinder3D [8] as baselines. For LiDAR odometry, we adopt DeLORA [38] as the baseline. Following DeLORA [38], range images and normal features are computed from two consecutive raw LiDAR point clouds as input features for odometry estimation. We adopt our SalLiDAR model [37] to generate the saliency and semantic predictions for the LiDAR odometry module.
For a fair comparison, the odometry networks of the baseline and our proposed methods are randomly initialized and trained on the KITTI [47] odometry dataset from scratch with 5-fold validation. The initial learning rate is 1e-5, and all odometry models are trained with self-supervised learning for 100 epochs.

LiDAR FordSaliency Dataset
Based on the FordCampus [36] dataset, we build a point cloud saliency dataset (namely FordSaliency) for training LiDAR-based saliency models. We use dataset1 and dataset2 of FordSaliency as the validation set and training set, respectively. More details can be found in [37].

SemanticKITTI Dataset
The SemanticKITTI [46] dataset is a well-known large-scale dataset for point cloud semantic segmentation. It consists of 22 Velodyne driving-scene sequences, split into a training set (sequences 00-07 and 09-10), a validation set (sequence 08), and a testing set (sequences 11-21).
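The sequence-to-split assignment above is easy to get wrong in data loaders, so a small lookup helper is handy (a sketch; the helper name is ours):

```python
# Sequence-to-split mapping of SemanticKITTI as described above.
SPLITS = {
    "train": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10],
    "valid": [8],
    "test": list(range(11, 22)),
}

def split_of(seq: int) -> str:
    """Return the split that a SemanticKITTI sequence number belongs to."""
    for name, seqs in SPLITS.items():
        if seq in seqs:
            return name
    raise ValueError(f"unknown sequence {seq:02d}")
```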

KITTI Odometry Dataset
We conduct all the odometry experiments on the KITTI [47] odometry dataset, which provides LiDAR point clouds captured by a Velodyne LiDAR sensor.

Evaluation Metrics
For point cloud saliency detection, we use popular saliency metrics, including Correlation Coefficient (CC), Similarity (SIM), and Kullback-Leibler Divergence (KLD), to evaluate the point cloud saliency models. For LiDAR semantic segmentation, we adopt mean Intersection-over-Union (mIoU) as the evaluation metric, following previous studies [43,8]. For LiDAR odometry estimation, the average translational RMSE ([%]) and average rotational RMSE ([deg/100m]) are adopted to evaluate the LiDAR odometry models.
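The three saliency metrics have standard definitions, which can be sketched as follows (common formulations with small epsilons for numerical stability; the exact variants used in the benchmark toolkits may differ slightly):

```python
import numpy as np

def cc(pred, gt):
    """Pearson correlation coefficient between two saliency maps."""
    p, g = pred - pred.mean(), gt - gt.mean()
    return float((p * g).sum() / (np.sqrt((p ** 2).sum() * (g ** 2).sum()) + 1e-8))

def sim(pred, gt):
    """Similarity: sum of element-wise minima of the normalized distributions."""
    p, g = pred / (pred.sum() + 1e-8), gt / (gt.sum() + 1e-8)
    return float(np.minimum(p, g).sum())

def kld(pred, gt, eps=1e-8):
    """Kullback-Leibler divergence of the prediction from the ground truth."""
    p, g = pred / (pred.sum() + eps), gt / (gt.sum() + eps)
    return float((g * np.log(g / (p + eps) + eps)).sum())

x = np.array([0.1, 0.5, 0.4])
# Identical maps give CC ~ 1, SIM ~ 1, KLD ~ 0 (higher CC/SIM and lower KLD are better).
```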

LiDAR Saliency Results on FordSaliency Dataset
We compare the performance of LiDAR-based saliency models with different feature extractors on our FordSaliency dataset. In Figure 9, we show the visualization results of SalLiDAR models with different backbones on the FordSaliency validation set. In Table 3, we report the quantitative performance of these models on the FordSaliency validation set. From Figure 9 and Table 3, we observe that although the saliency annotations are pseudo-labels, all these LiDAR-based models are able to learn discriminative point cloud saliency representations for saliency distribution prediction. Moreover, the model with the Cylinder3D backbone predicts better saliency distributions than the models with other backbones. The models with RandLA-Net and PointNet backbones can learn correlation and similarity features from the point cloud saliency annotations, as evidenced by the CC, SIM, and KLD values in Table 3. However, the model with the Cylinder3D backbone achieves higher CC and SIM and lower KLD. This suggests that a model with a voxel-based partition (e.g., 3D cylinder partition) can learn more powerful saliency representations than point-based models.

Semantic Segmentation Results on SemanticKITTI Dataset
We report the LiDAR semantic segmentation performance on the SemanticKITTI test set in Table 4. Note that all testing results in Table 4 are taken from the literature and the benchmark leaderboard of the SemanticKITTI [46] dataset. Comparing Table 3 and Table 4, we find that the mIoU results on the test sequences show improved generalization ability on the larger set of evaluation samples. Compared to the baselines, all models with SalLiDAR obtain better mIoU results. The proposed method also improves segmentation performance on specific classes, such as car, truck, and parking, since incorporating our predicted saliency distribution makes the model attentive to these categories. Furthermore, the Cylinder3D model

Odometry Results on KITTI Dataset
In Figure 11 and Figure 12, we show the trajectory results of the proposed SalLONet models on Sequences 09-10 of the KITTI [47] odometry dataset. We can observe that the SalLONet models with saliency and semantic information predict better trajectories than the baseline model when compared with the ground truth. In Table 5, we present the quantitative results of the proposed approaches and six existing odometry methods. DeepLO [56] and Velas et al. [57] are supervised LiDAR odometry models; that is, the ground-truth poses of the training set are used to train them. DeLORA [38] is an unsupervised LiDAR odometry model, meaning that it does not require labels for training. Following [38], three unsupervised visual odometry estimation methods [58,59,60] are also included for comparison. As shown in Table 5, the three proposed SalLONet models improve on the baseline model, as evidenced by lower translational and rotational errors on Sequences 09-10 of the KITTI [47] odometry dataset. Among the unsupervised methods, SalLONet-III achieves the best results, with the lowest errors on both validation sequences. In particular, its translational error on Sequence 10 (t_rel = 4.940) even outperforms the supervised DeepLO [56] method (t_rel = 5.020). In short, these experimental results show that saliency and semantic information are effective for improving odometry estimation, which implicitly indicates the effectiveness of image-to-LiDAR saliency knowledge transfer.

* PyTorch implementation of RandLA-Net [43], available at: https://github.com/tsunghan-wu/RandLA-Net-pytorch. ‡ The results are obtained from the released version of the Cylinder3D model [8] from the work in [7]: https://github.com/cardwing/Codes-for-PVKD. Best performance results are shown in red (publications before July 2022). Improved results of the proposed model against the baseline are shown in bold.

Ablation Studies
We investigate the effectiveness of saliency and semantics for LiDAR odometry estimation. From Table 5, we observe that the SalLONet-III model achieves better results by leveraging both semantic and saliency cues. Thus, we conduct the ablation study on SalLONet-III to verify the influence of the saliency and semantic maps on LiDAR odometry estimation. The proposed SalLONet-III leverages saliency and semantic predictions simultaneously. We first validate the model with saliency-information-only integration. We also train the model with semantic information only. The results of the ablation study are shown in Table 6. The experimental evaluation shows that the SalLONet model with both saliency and semantic information achieves superior performance on the KITTI [47] odometry dataset.

Figure 1 :
Figure 1: Visualization examples of SIFT [20] key points and saliency maps for two consecutive frames. From left to right: RGB images; SIFT [20] key points and saliency maps from our FordSaliency dataset; and point clouds registered on the images with saliency values.

Figure 2 :
Figure 2: Overview of the proposed framework of image-to-LiDAR saliency knowledge transfer for 3D point cloud understanding. The 2D image saliency knowledge of RGB saliency models is transferred to 3D point clouds. The 3D point cloud saliency knowledge is then used in attention-guided 3D point cloud understanding tasks, such as 3D semantic segmentation and LiDAR odometry estimation.
LiDAR semantic segmentation is a crucial 3D computer vision task for autonomous driving, which aims to predict the semantic class of each point of a LiDAR scan. As a pioneering point set-based method, PointNet [25] uses multi-layer perceptrons (MLPs) to learn point-wise features for classification and segmentation. RandLA-Net [43] randomly samples the input point cloud and employs a local feature aggregation module to compensate for the information loss introduced by the random sampling. Considering the range property of LiDAR point clouds, Cylinder3D [8] leverages a cylindrical partition for 3D semantic segmentation and introduces an asymmetrical encoder-decoder model over voxel-based features using 3D sparse convolutional networks. PVKD [7] achieves state-of-the-art 3D semantic segmentation by applying a point-to-voxel knowledge distillation strategy to the Cylinder3D [8] model. With RPVNet [45], the authors present a multi-modality fusion model that combines range-based, point-based, and voxel-based representations with a gated fusion module for LiDAR semantic segmentation.

3. Proposed Framework

3.1. Problem Formulation

Given an input point cloud P = {p_i | i = 1, ..., N, p_i ∈ R^d} with a set of unordered points, N is the number of points in the LiDAR frame and each point p_i contains d-dimensional features, such as point coordinates (x, y, z), colors (r, g, b), reflectivity, and normal features. The objective of the saliency detection model on the point cloud is to predict the saliency score map S = {s_i | i = 1, ..., N, s_i ∈ [0, 1]}, where s_i denotes the saliency score of point p_i. After normalizing the saliency prediction, the closer the saliency score s_i is to 1, the more attentive the point p_i. In the 3D semantic segmentation task, the goal is to predict the semantic class map C = {c_i | i = 1, ..., N, c_i ∈ R}, where c_i indicates the semantic category of point p_i. The objective of this work is to establish a self-supervised LiDAR odometry estimation model that is guided by saliency and semantic constraints and can be trained without ground-truth poses. Given two consecutive LiDAR point clouds P_t and P_{t−1} at times t and t−1, where each point p contains d-dimensional point-wise features, such as the point coordinates (x, y, z), range feature r, semantic feature c, and saliency feature s, the odometry model estimates a rotation q (corresponding to a 3 × 3 rotation matrix R ∈ SO(3)) and a 3 × 1 translation vector t, where R and t compose the relative rigid transformation T̂_{t−1,t} ∈ SE(3).
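The composition of R and t into an SE(3) transformation can be written out explicitly (a minimal NumPy sketch; the helper name is ours):

```python
import numpy as np

def compose_se3(R, t):
    """Assemble a 4x4 rigid-body transformation T in SE(3) from a 3x3
    rotation matrix R in SO(3) and a 3x1 translation vector t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t).ravel()
    return T

# 90-degree rotation about z plus a translation along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T = compose_se3(Rz, [1.0, 0.0, 0.0])
p = T @ np.array([1.0, 0.0, 0.0, 1.0])  # homogeneous point (1, 0, 0)
```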


Figure 3 :
Figure 3: Illustration of proposed framework of image-to-LiDAR saliency knowledge transfer for attention-guided 3D point cloud understanding.

1) Image-to-LiDAR saliency knowledge transfer for generating a pseudo-saliency dataset of point clouds; 2) LiDAR-to-LiDAR pseudo-saliency learning using LiDAR-based deep models; and 3) saliency-guided 3D point cloud understanding by integrating the saliency information. Firstly, we propose a large-scale pseudo-saliency dataset (FordSaliency) for point clouds by assigning the saliency values of RGB images to the corresponding point clouds registered on the images. In Figure 4, we show visualization examples of a point cloud and the corresponding saliency pseudo-ground-truth map from our FordSaliency dataset. Then, we train LiDAR-based models on the proposed pseudo-saliency dataset to learn point cloud saliency features. Next, we propose a saliency-guided two-stream network (SalLiDAR) for large-scale point cloud segmentation. The saliency prediction is not only used as an input feature for the semantic module but also adopted in a saliency-guided loss to facilitate the semantic module 1) to learn richer features of salient points and 2) to re-
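The label-assignment step (giving each registered point the saliency value of the pixel it projects onto) can be sketched with a pinhole projection. This is a simplified sketch; the actual pipeline relies on the FordCampus camera/LiDAR calibration and registration, which are not reproduced here:

```python
import numpy as np

def transfer_saliency(points_cam, K, saliency_img):
    """Assign each 3D point (camera coordinates) the saliency value of the
    pixel it projects onto; points behind the camera or outside the image
    receive zero saliency."""
    z = points_cam[:, 2]
    uvw = points_cam @ K.T                         # pinhole projection
    uv = uvw[:, :2] / np.maximum(z[:, None], 1e-8)
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = saliency_img.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros(len(points_cam))
    out[valid] = saliency_img[v[valid], u[valid]]
    return out

K = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
sal_img = np.arange(16, dtype=float).reshape(4, 4) / 15.0
pts = np.array([[0.0, 0.0, 1.0],    # projects to pixel (2, 2)
                [0.0, 0.0, -1.0]])  # behind the camera -> 0
labels = transfer_saliency(pts, K, sal_img)
```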

Figure 4 :
Figure 4: Visualization examples of point cloud and corresponding saliency pseudo-groundtruth (GT) map from our FordSaliency dataset.

Figure 5 :
Figure 5: Framework of proposed two-stream semantic segmentation model.The saliency prediction network is pre-trained on our FordSaliency dataset.

3.3. Proposed Method

1) Learning Point Cloud Saliency: To learn point cloud saliency representations, we adopt existing LiDAR-based semantic segmentation models [43,25,8] as backbone feature extractors. As shown in Figure 3(b), given a 3D LiDAR point cloud with coordinates and corresponding point-wise features, we first feed it into the feature extractor to obtain per-point representations. These learned features are then passed through a saliency prediction layer to output the saliency score map of the input point cloud. We considered two types of models to learn saliency distributions on point clouds: i) classification-based saliency prediction and ii) commonly used saliency regression. More details can be found in our previous work [37].

2) Two-Stream Segmentation Model: As depicted in Figure 5, we develop a two-stream semantic segmentation model for point clouds by combining features from the saliency module and the semantic module. We feed an input point cloud into the saliency branch to predict the saliency distribution of the whole scene. Meanwhile, the point cloud is also fed into the semantic branch to extract point features and output the semantic class predictions. To validate the effectiveness of the learned point cloud saliency distribution knowledge, we initialize and freeze the parameters of the saliency branch with the weights pre-trained on the FordSaliency dataset. More details can be found in [37].

Figure 6: Architecture of the Saliency-guided LiDAR Odometry Network (SalLONet). C_t, C_{t−1}, S_t, S_{t−1}, R_t, R_{t−1} represent the predicted LiDAR semantic maps, LiDAR saliency maps, and range images of the corresponding consecutive point clouds P_t and P_{t−1}. The transformation T of the two point clouds consists of the estimated translation t and rotation q (i.e., the pose).

Figure 7: The top depicts the binarized semantics of dynamic (e.g., car) and static points (e.g., building, traffic sign); the bottom shows the point-wise matching with saliency maps between two consecutive scans.

Figure 8 :
Figure 8: Visualization of saliency and semantic maps for odometry estimation. The results include (a) the semantic prediction and (b) the saliency prediction of the proposed SalLiDAR model; (c) the binarized semantic map (i.e., dynamic and static points); and (d) the weighted map by saliency and binarized semantics for the loss weighting of the proposed SalLONet by Eq. 7.

Table 3: Quantitative performance of LiDAR-based saliency models with different backbones on the FordSaliency validation set.

Figure 9: Point cloud saliency prediction results of the SalLiDAR model with different backbones on the FordSaliency dataset.

Figure 10 :
Figure 10: Visualization comparison of the baseline and proposed LiDAR segmentation models on SemanticKITTI [46]. From the first column to the last: the semantic ground truth, the semantic predictions of the baseline models, the semantic results of the proposed models, and the saliency predictions of the proposed models.
Conclusion

This article has presented our work on establishing LiDAR-based saliency detection models with image-to-LiDAR transfer learning to improve the performance of 3D point cloud understanding tasks. We propose a Saliency-guided LiDAR Odometry Network (SalLONet) that combines the saliency and semantic information of point clouds. First, the saliency and semantic maps generated by the proposed two-stream semantic model are fed into the odometry module as feature representations of the input consecutive point clouds. Second, the saliency and semantic predictions are applied to the odometry loss. To alleviate the effect of dynamic points on pose regression, we binarize the semantic prediction into dynamic and static points based on the semantic class. The binarized semantics are then used to filter the dynamic points by point-wise multiplication for loss weighting. To further encourage the odometry module to learn discriminative features, the saliency map is leveraged to increase the loss weights of salient static points for matching two point clouds. Extensive experimental results on the KITTI [47] odometry dataset have demonstrated the outstanding performance of the proposed odometry model with saliency and semantic information, which considers the influences of dynamic and static salient points on pose estimation simultaneously.

Acknowledgment

This paper is in part based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), Japan. This work was supported by JST SPRING, Grant Number JPMJSP2124. The computational resource of the AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST) was used for training and testing the models in our experiments.

Table 1 :
Comparison of existing saliency detection datasets on point clouds.

We compare the existing saliency detection datasets on point clouds in Table 1. For saliency detection on point clouds, most challenges are yet to be explored. First, previous saliency methods such as [17,19] have operated on mesh data of 3D objects, where scenes are less complicated and contain only a few background points. Second, due to the lack of human-annotated training datasets, supervised learning schemes can hardly be employed for saliency detection on point clouds. Therefore, a practicable deep learning-based pipeline for saliency prediction on large-scale point clouds is highly desirable.


Table 4 :
Performance comparison of the proposed models and existing LiDAR segmentation methods on the SemanticKITTI [46] test set. Results are obtained from the leaderboard and literature.

Table 6 :
Comparison of translational ([%]) and rotational ([deg/100m]) errors on the validation set of the KITTI [47] odometry dataset. Compared with the baseline model, improved results are shown in bold.
*The results of the baseline model are obtained by re-training the model from scratch.